Exploring the Issues of Markov Decision Process Violations in


```
EXPLORING MARKOV DECISION PROCESS VIOLATIONS IN REINFORCEMENT LEARNING
Jordan Fryer – University of Portland
Working with Peter Heeman

OUTLINE

   Background: Reinforcement Learning (RL)
   RL and symbolic reasoning to learn system dialogue policy
   Background: Markov Decision Processes (MDP)
   The Problem
   Attempting to find absolute convergence
   Simplification process
   Evaluation tools
   Discussion

BACKGROUND: REINFORCEMENT LEARNING

   Inputs
      States
         How the agent represents the environment at a certain time
      Actions
         How the agent interacts with the environment
      Cost Function
         A probabilistic mapping of a state-action pair to a value
         Most of the costs may be assigned at the terminal state
      Simulated User
         So the system can try out different dialogue behaviors
   Outputs: Optimal Policy
      A mapping of a state to an action
   How it learns
      Iteratively: evaluate the current policy, explore alternatives, and then update the policy

BACKGROUND: REINFORCEMENT LEARNING

   Keep track of a Q score for each state-action pair
      The cost to get to the end from that state when following that action
   For each dialogue simulation, take the final cost and propagate it back over the state-action pairs in the run

   State:   S1 --a1--> S2 --a2--> S3 --a3--> S4 --a4--> S5
   Cost:    Utt: 1     Utt: 1     Utt: 1     Utt: 1     SQ: 10   (Total: 14)
   Q:       14         13         12         11

BACKGROUND: MARKOV DECISION PROCESSES

   RL is guaranteed to converge for Markov Decision Processes
      Only the current state is used to decide what action to do next
   System + User + Environment must satisfy:
      Pr{s_{t+1} = s' | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0} = Pr{s_{t+1} = s' | s_t, a_t}
      Pr{r_{t+1} = r | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0} = Pr{r_{t+1} = r | s_t, a_t}
   How detailed should states be?
      Too detailed: becomes brute force and explodes the state space
      Too vague: violates the MDP assumptions
   RL learns a solution very fast due to “merging” of states

WHY RL FOR DIALOGUE?

   There is a delayed cost to dialogue
      The correctness of a dialogue is not really known until the dialogue has ended and the task has been performed
   Modeling after humans isn’t always correct
      There are many things a computer can do that a human can’t, and many things a human can do that a computer cannot
   Hard to handcraft a policy

THE PROBLEM: FINDING ABSOLUTE CONVERGENCE

   RL is guaranteed to converge for an MDP in the limit
      How do we know if we have an MDP violation?
      How long do we have to wait for convergence?
      How do we measure convergence?
   We will use Q-learning with ε-greedy exploration (20%)

DOMAIN

   Ran on a toy Car Domain (from the CS550 course)
      Database of 2000 cars (differing in color, year, model, …)
      User has one of the 2000 cars in mind
      System asks questions and reports the list of cars that match the user’s car
   State:
      11 Questions: Boolean (asked or not)
      carBucket: Number (bucketized number of cars)
      Done: Boolean (reported cars or not)
   Cost Function:
      1 cost per utterance, 5 cost per extra reported car

FULL VERSION OF PROBLEM

   Every 1000 epochs (100 dialogue runs), test the current policy
   Could not get it to converge

SIMPLIFIED VERSION OF PROBLEM

   Let’s simplify the problem (a common CS approach)
      Removed some attributes, reduced buckets from 4 to 2, forced output and exit when only one car is left
      Reduced the number of state-action pairs
   Were able to use the exact user distribution for testing
   Tool to examine the degree of convergence
      Compare results from multiple policies (existing tool)
      Keep track of the minimum testing score seen while training a policy
      Calculate the percentage of test sessions that are at the minimum testing score

E    #    AVE       MIN       C      %     SA        E    #     AVE       MIN      C      %      SA
1   20   7.96066   7.95082   0.1   1.000   182   19000   20   7.95082   7.95082   0.5   0.989   216
2   20   7.95738   7.95082   0.3   0.975   187   20000   20   7.95082   7.95082   0.5   0.990   216
5   20   7.95656   7.95082   0.4   0.933   193   21000   20   7.95082   7.95082   0.5   0.990   216
10   20   7.95328   7.95082   0.5   0.925   200   22000   20   7.95082   7.95082   0.5   0.990   216
25   20   7.95082   7.95082   0.5   0.940   208   23000   20   7.95082   7.95082   0.5   0.991   216
50   20   7.95082   7.95082   0.5   0.950   212   24000   20   7.95082   7.95082   0.5   0.991   216
100   20   7.95164   7.95082   0.5   0.950   213   25000   20   7.95082   7.95082   0.5   0.991   216
200   20   7.95082   7.95082   0.5   0.956   214   26000   20   7.95082   7.95082   0.5   0.991   216
300   20   7.95082   7.95082   0.5   0.961   214   27000   20   7.95082   7.95082   0.5   0.992   216
400   20   7.95082   7.95082   0.5   0.965   214   28000   20   7.95082   7.95082   0.5   0.992   216
500   20   7.95082   7.95082   0.5   0.968   216   29000   20   7.95082   7.95082   0.5   0.992   216
700   20   7.95082   7.95082   0.5   0.971   216   30000   20   7.95082   7.95082   0.5   0.992   216
1000   20   7.95082   7.95082   0.5   0.973   216   31000   20   7.95082   7.95082   0.5   0.992   216
1500   20   7.95082   7.95082   0.5   0.975   216   32000   20   7.95082   7.95082   0.5   0.992   216
2000   20   7.95082   7.95082   0.5   0.977   216   33000   20   7.95082   7.95082   0.5   0.993   216
2500   20   7.95082   7.95082   0.5   0.978   216   34000   20   7.95082   7.95082   0.5   0.993   216
3000   20   7.95082   7.95082   0.5   0.979   216   35000   20   7.95082   7.95082   0.5   0.993   216
4000   20   7.95082   7.95082   0.5   0.980   216   36000   20   7.95082   7.95082   0.5   0.993   216
5000   20   7.95082   7.95082   0.5   0.981   216   37000   20   7.95082   7.95082   0.3   0.993   216
6000   20   7.95082   7.95082   0.5   0.982   216   38000   20   7.95082   7.95082   0.1   0.994   216
7000   20   7.95082   7.95082   0.5   0.983   216   39000   20   7.95082   7.95082   0.1   0.997   216
8000   20   7.95082   7.95082   0.5   0.984   216   40000   20   7.95082   7.95082   0.0   0.999   216
9000   20   7.95082   7.95082   0.5   0.985   216   41000   20   7.95082   7.95082   0.0   0.999   216
10000   20   7.95082   7.95082   0.5   0.985   216   42000   20   7.95082   7.95082   0.0   0.999   216
11000   20   7.95082   7.95082   0.5   0.986   216   43000   20   7.95082   7.95082   0.0   1.000   216
12000   20   7.95082   7.95082   0.5   0.987   216   44000   20   7.95082   7.95082   0.0   1.000   216
13000   20   7.95082   7.95082   0.5   0.987   216   45000   20   7.95082   7.95082   0.0   1.000   216
14000   20   7.95082   7.95082   0.5   0.987   216   46000   20   7.95082   7.95082   0.0   1.000   216
15000   20   7.95082   7.95082   0.5   0.988   216   47000   20   7.95082   7.95082   0.0   1.000   216
16000   20   7.95082   7.95082   0.5   0.988   216   48000   20   7.95082   7.95082   0.0   1.000   216
17000   20   7.95082   7.95082   0.5   0.989   216   49000   20   7.95082   7.95082   0.0   1.000   216
18000   20   7.95082   7.95082   0.5   0.989   216   50000   20   7.95082   7.95082   0.0   1.000   216

E       #    AVE       MIN       C     %       SA
1       20   7.96066   7.95082   0.1   1.000   182
2       20   7.95738   7.95082   0.3   0.975   187
500     20   7.95082   7.95082   0.5   0.968   216
43000   20   7.95082   7.95082   0.0   1.000   216

SIMPLIFIED VERSION OF PROBLEM

   Bugs found:
      Would converge and then go out of convergence
      Alpha rounding errors
      Would find a minimum score within the first 10 epochs that it could never find again
      States not yet seen in training, but seen in testing, must choose the same action during the test session
   Got convergence
      Convergence achieved when all SA pairs were explored

A BIT MORE COMPLEXITY

   Make domain more complex:
   Added back all attributes, 2 buckets, no exit constraint
   Absolute convergence before all SA pairs seen

E    #    AVE       MIN       C      %      SA          E    #   AVE        MIN       C      %      SA
125000   7   6.82600   6.82600   0.1   0.997   19380                            …
126000   7   6.82600   6.82600   0.1   0.997   19390   1000000   7   6.82600   6.82600   0.0   1.000   21445
127000   7   6.82600   6.82600   0.1   0.997   19398   1001000   7   6.82600   6.82600   0.0   1.000   21445
128000   7   6.82600   6.82600   0.1   0.997   19401   1002000   7   6.82600   6.82600   0.0   1.000   21446
129000   7   6.82600   6.82600   0.1   0.997   19404   1003000   7   6.82600   6.82600   0.0   1.000   21448
130000   7   6.82600   6.82600   0.1   0.997   19415   1004000   7   6.82600   6.82600   0.0   1.000   21449
131000   7   6.82600   6.82600   0.1   0.997   19428   1005000   7   6.82600   6.82600   0.0   1.000   21449
132000   7   6.82600   6.82600   0.1   0.997   19434   1006000   7   6.82600   6.82600   0.0   1.000   21450
133000   7   6.82600   6.82600   0.1   1.000   19443   1007000   7   6.82600   6.82600   0.0   1.000   21450
134000   7   6.82600   6.82600   0.1   1.000   19450   1008000   7   6.82600   6.82600   0.0   1.000   21451
135000   7   6.82600   6.82600   0.1   1.000   19457   1009000   7   6.82600   6.82600   0.0   1.000   21451
136000   7   6.82600   6.82600   0.1   1.000   19463   1010000   7   6.82600   6.82600   0.0   1.000   21454
137000   7   6.82600   6.82600   0.0   1.000   19468   1011000   7   6.82600   6.82600   0.0   1.000   21455
138000   7   6.82600   6.82600   0.0   1.000   19475   1012000   7   6.82600   6.82600   0.0   1.000   21455
139000   7   6.82600   6.82600   0.0   1.000   19481   1013000   7   6.82600   6.82600   0.0   1.000   21457
140000   7   6.82600   6.82600   0.0   1.000   19492   1014000   7   6.82600   6.82600   0.0   1.000   21457
141000   7   6.82600   6.82600   0.0   1.000   19499   1015000   7   6.82600   6.82600   0.0   1.000   21459
…
QTRAIN AND QTEST

   Qtrain: the Q values of an SA pair, used by RL in training by following the policy and exploring
      Should converge to Q* (values for the optimal policy), but we never know what Q* is
   Our group has also been using Qtest
      The Q values of an SA pair, determined through testing the optimal policy
   Qtest and Qtrain should both converge to Q*, and so should converge to the same value

QTRAIN AND QTEST

   Helps to show absolute convergence

NOW THE FULL VERSION

   Moved up to 4 buckets: no absolute convergence
   Noticed a difference between Qtest and Qtrain
   Can further analyze Qtest
      Different states “merge”
      If it is an MDP, the path should not matter

VISUALIZING MDP VIOLATION

   [Figure: graph of the states CDYM1, CYM5, CYM15, and CDYM15, labeled with the values 2.0, 2.0, 2.0, 4.255, 5.0, 6.333, and 6.333]

MDP VIOLATION

   Why is there an MDP violation?
      CarBuckets: caused states to be treated as equal when clearly they are not
   How to remove the MDP violation?
      Keep a more accurate history
      Not always possible: the state space explodes
      Just keeping track of the order in which questions are
   Barto & Sutton admit that most problems are not perfect MDPs but that RL can deal with it

DISCUSSION

   Does having an MDP violation hurt you?
      Despite non-convergence, the 4 buckets did better than the 2 buckets
      6.8091 vs 6.8260
   RL can deal with some MDP violation
      Car gas mileage analogy

DISCUSSION

   Does this mean we don’t care about MDP violations?
      Rueckert (REU last year) removed an MDP violation
         Did not increase the state space dramatically
         Improved the policy learned
   One should be aware of any MDP violations
      Our tool can find them (or some of them)
      Major MDP violations need to be fixed

ACKNOWLEDGEMENTS AND QUESTIONS

   Thanks to:
   My fellow interns
   Pat Dickerson & Kim Basney
   Peter Heeman
   Rebecca Lunsford
   Andrew Rueckert
   Ethan Selfridge
   Everyone at OGI
   Questions?


```
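
The cost propagation described on the second background slide (take the final cost and propagate it back over the state-action pairs in the run) can be sketched in a few lines of Python. This is a minimal sketch, not the presenters' implementation: the function name `cost_to_go` and the list-based representation of a run are assumptions made for illustration. It reproduces the slide's example, where four utterances cost 1 each and the final cost is 10.

```python
def cost_to_go(step_costs, final_cost):
    """Return, for each step of one run, the total cost from that step to the end.

    step_costs: cost incurred by each action in order (hypothetical representation).
    final_cost: cost assigned at the terminal state.
    """
    totals = []
    running = final_cost
    # Walk the run backwards so each step adds its own cost to what follows it.
    for cost in reversed(step_costs):
        running += cost
        totals.append(running)
    totals.reverse()
    return totals


# Four utterances at cost 1 each, final cost 10 (the numbers on the slide).
print(cost_to_go([1, 1, 1, 1], 10))  # [14, 13, 12, 11]
```

Each returned value is the observed cost for one state-action pair in the run; how those observations are folded into the stored Q scores (for example, the learning rate) is not specified on the slides and is left out here.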
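
The slides state that, for an MDP, the path taken to reach a state should not matter. One way such a check could look is sketched below in Python. It is only an illustration under stated assumptions, not the tool the presenters built: the function name `find_path_dependence`, the tuple layout of the observations, the `tolerance` parameter, and the state and action labels in the example are all hypothetical. The idea is to group the cost-to-go seen in testing by state-action pair and by predecessor state, and to flag pairs whose average differs depending on where the dialogue came from.

```python
from collections import defaultdict


def find_path_dependence(observations, tolerance=1e-6):
    """Flag state-action pairs whose average cost-to-go depends on the previous state.

    observations: iterable of (previous_state, state, action, cost_to_go) tuples
    (a hypothetical log format). Returns {(state, action): {previous_state: mean}}.
    """
    by_pair = defaultdict(lambda: defaultdict(list))
    for prev, state, action, cost in observations:
        by_pair[(state, action)][prev].append(cost)

    flagged = {}
    for pair, by_prev in by_pair.items():
        means = {prev: sum(costs) / len(costs) for prev, costs in by_prev.items()}
        # Under the Markov assumption these means should agree, up to sampling noise.
        if max(means.values()) - min(means.values()) > tolerance:
            flagged[pair] = means
    return flagged


# Hypothetical test log: state "M" is reached from "A" and from "B".
log = [
    ("A", "M", "ask", 2.0),
    ("A", "M", "ask", 2.0),
    ("B", "M", "ask", 5.0),
    ("A", "N", "ask", 3.0),
    ("B", "N", "ask", 3.0),
]
print(find_path_dependence(log))  # {('M', 'ask'): {'A': 2.0, 'B': 5.0}}
```

With a stochastic simulated user, averages from a finite number of test sessions will differ slightly even without a violation, so the tolerance would need to be chosen with the amount of test data in mind.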