EXPLORING MARKOV DECISION PROCESS VIOLATIONS IN REINFORCEMENT LEARNING

Jordan Fryer – University of Portland
Working with Peter Heeman
OUTLINE

Background: Reinforcement Learning (RL)
    RL and symbolic reasoning to learn a system dialogue policy
Background: Markov Decision Processes (MDP)
The Problem
    Attempting to find absolute convergence
    Simplification process
    Evaluation tools
Discussion
BACKGROUND: REINFORCEMENT LEARNING

Inputs
    States
        How the agent represents the environment at a certain time
    Actions
        How the agent interacts with the environment
    Cost Function
        A probabilistic mapping of a state-action pair to a value
        Most of the cost may be assigned at the terminal state
    Simulated User
        So the system can try out different dialogue behaviors
Outputs: Optimal Policy
    A mapping of a state to an action
How it learns
    Iteratively: evaluate the current policy and explore alternatives, then update the policy
BACKGROUND: REINFORCEMENT LEARNING

Keep track of a Q score for each state-action pair
    The cost to get to the end from that state after taking that action
For each dialogue simulation, take the final cost and propagate it back over the state-action pairs in the run (see the sketch below)

    S1 --a1--> S2 --a2--> S3 --a3--> S4 --a4--> S5
    Q: 14      Q: 13      Q: 12      Q: 11
    Utt: 1     Utt: 1     Utt: 1     Utt: 1     SQ: 10,  Total: 14
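Below is a minimal sketch (not the group's actual code) of the propagation step just described: the total cost of one simulated dialogue is spread back over the state-action pairs visited in that run, matching the S1–S5 example above. The dictionary names and the running-average update rule are illustrative assumptions.

    # Sketch: Monte-Carlo style propagation of a dialogue's final cost back over
    # the state-action pairs visited in the run (illustrative, not the actual system).
    def update_run(Q, counts, run, utterance_cost, final_cost):
        """Q, counts: dicts keyed by (state, action); run: list of visited (state, action) pairs."""
        cost_to_go = final_cost + utterance_cost * len(run)    # e.g. 10 + 4*1 = 14 in the diagram
        for state, action in run:                              # S1..S4 receive 14, 13, 12, 11
            counts[(state, action)] = counts.get((state, action), 0) + 1
            n = counts[(state, action)]
            old = Q.get((state, action), 0.0)
            Q[(state, action)] = old + (cost_to_go - old) / n  # running average of observed cost-to-go
            cost_to_go -= utterance_cost                       # each later state is one utterance closer to the end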
BACKGROUND: MARKOV DECISION PROCESSES

RL is guaranteed to converge for Markov Decision Processes
Only the current state is used to decide which action to take next
System + User + Environment must satisfy:
    Pr{s_{t+1} = s' | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0} = Pr{s_{t+1} = s' | s_t, a_t}
    Pr{r_{t+1} = r | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0} = Pr{r_{t+1} = r | s_t, a_t}
How detailed should states be?
    Too detailed: becomes brute force and explodes the state space
    Too vague: violates the MDP assumptions
RL learns a solution very quickly due to the "merging" of states
WHY RL FOR DIALOGUE?

There is a delayed cost to dialogue
    The correctness of a dialogue is not really known until the dialogue ends and the task has been performed
Modeling after humans isn't always correct
    There are many things a computer can do that a human can't, and many things a human can do that a computer cannot
Hard to handcraft a policy
THE PROBLEM: FINDING ABSOLUTE CONVERGENCE

RL is guaranteed to converge for an MDP in the limit

How do we know if we have an MDP violation?
How long do we have to wait for convergence?
How do we measure convergence?

We will use Q-learning with ε-greedy exploration (ε = 0.2)
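Since the slides specify Q-learning with 20% ε-greedy exploration, here is a minimal sketch of the action-selection step under that setting; because the Q values here are costs, the greedy choice is the lowest-Q action. The function and variable names are illustrative, not taken from the actual system.

    import random

    EPSILON = 0.2   # 20% exploration, as stated above

    def choose_action(Q, state, actions, epsilon=EPSILON):
        """ε-greedy over a cost-based Q table: explore with probability ε, otherwise take the cheapest action."""
        if random.random() < epsilon:
            return random.choice(actions)                          # explore
        return min(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit: lowest expected cost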
DOMAIN

Ran on the toy Car Domain (from the CS550 course)
    Database of 2000 cars (differing in color, year, model, …)
    The user has one of the 2000 cars in mind
    The system asks questions and reports the list of cars that match the user's car
    State:
        11 Questions: Boolean (asked or not)
        carBucket: Number (bucketized number of cars)
        Done: Boolean (reported cars or not)
    Cost Function:
        1 cost per utterance, 5 cost per extra reported car
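A sketch of how the state and cost function above might be encoded; the class and field names are placeholders, and the assumption that "extra" means reported cars beyond the user's single target car is mine, not stated in the slides.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass(frozen=True)
    class CarState:
        asked: Tuple[bool, ...]   # 11 booleans: has each question been asked yet?
        car_bucket: int           # bucketized count of cars still matching
        done: bool                # have the matching cars been reported?

    def dialogue_cost(num_utterances: int, num_reported: int) -> int:
        """Cost function from the slide: 1 per utterance, 5 per extra reported car."""
        extra_cars = max(0, num_reported - 1)   # assumption: the user's own car is not "extra"
        return num_utterances + 5 * extra_cars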
FULL VERSION OF PROBLEM

Every 1000 epochs (100 dialogue runs), test the current policy
Could not get it to converge
SIMPLIFIED VERSION OF PROBLEM

Let's simplify the problem (a common CS approach)
    Removed some attributes, reduced buckets from 4 to 2, and forced the system to report and exit when only one car is left
        Reduced the number of state-action pairs
Were able to use the exact user distribution for testing
Tool to examine the degree of convergence (see the sketch below)
    Compare results from multiple policies (existing tool)
    Keep track of the minimum testing score seen while training a policy
    Calculate the percentage of test sessions that are at the minimum testing score
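A rough sketch of the convergence bookkeeping listed above: track the minimum testing score seen so far and the fraction of test sessions that reach it. This is only an illustration; the actual tool's definitions (for example, how ties and newly found minima are counted) may differ.

    class ConvergenceTracker:
        """Track the minimum testing score seen so far and how often test sessions reach it."""

        def __init__(self):
            self.min_score = float("inf")
            self.sessions = 0
            self.at_min = 0

        def record(self, test_score, tolerance=1e-9):
            self.sessions += 1
            if test_score < self.min_score - tolerance:
                self.min_score = test_score   # new best score found
                self.at_min = 0               # earlier sessions no longer count as "at the minimum"
            if abs(test_score - self.min_score) <= tolerance:
                self.at_min += 1

        def pct_at_min(self):
            return self.at_min / self.sessions if self.sessions else 0.0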
    E    #    AVE       MIN       C      %     SA        E    #     AVE       MIN      C      %      SA
    1   20   7.96066   7.95082   0.1   1.000   182   19000   20   7.95082   7.95082   0.5   0.989   216
    2   20   7.95738   7.95082   0.3   0.975   187   20000   20   7.95082   7.95082   0.5   0.990   216
    5   20   7.95656   7.95082   0.4   0.933   193   21000   20   7.95082   7.95082   0.5   0.990   216
   10   20   7.95328   7.95082   0.5   0.925   200   22000   20   7.95082   7.95082   0.5   0.990   216
   25   20   7.95082   7.95082   0.5   0.940   208   23000   20   7.95082   7.95082   0.5   0.991   216
   50   20   7.95082   7.95082   0.5   0.950   212   24000   20   7.95082   7.95082   0.5   0.991   216
  100   20   7.95164   7.95082   0.5   0.950   213   25000   20   7.95082   7.95082   0.5   0.991   216
  200   20   7.95082   7.95082   0.5   0.956   214   26000   20   7.95082   7.95082   0.5   0.991   216
  300   20   7.95082   7.95082   0.5   0.961   214   27000   20   7.95082   7.95082   0.5   0.992   216
  400   20   7.95082   7.95082   0.5   0.965   214   28000   20   7.95082   7.95082   0.5   0.992   216
  500   20   7.95082   7.95082   0.5   0.968   216   29000   20   7.95082   7.95082   0.5   0.992   216
  700   20   7.95082   7.95082   0.5   0.971   216   30000   20   7.95082   7.95082   0.5   0.992   216
 1000   20   7.95082   7.95082   0.5   0.973   216   31000   20   7.95082   7.95082   0.5   0.992   216
 1500   20   7.95082   7.95082   0.5   0.975   216   32000   20   7.95082   7.95082   0.5   0.992   216
 2000   20   7.95082   7.95082   0.5   0.977   216   33000   20   7.95082   7.95082   0.5   0.993   216
 2500   20   7.95082   7.95082   0.5   0.978   216   34000   20   7.95082   7.95082   0.5   0.993   216
 3000   20   7.95082   7.95082   0.5   0.979   216   35000   20   7.95082   7.95082   0.5   0.993   216
 4000   20   7.95082   7.95082   0.5   0.980   216   36000   20   7.95082   7.95082   0.5   0.993   216
 5000   20   7.95082   7.95082   0.5   0.981   216   37000   20   7.95082   7.95082   0.3   0.993   216
 6000   20   7.95082   7.95082   0.5   0.982   216   38000   20   7.95082   7.95082   0.1   0.994   216
 7000   20   7.95082   7.95082   0.5   0.983   216   39000   20   7.95082   7.95082   0.1   0.997   216
 8000   20   7.95082   7.95082   0.5   0.984   216   40000   20   7.95082   7.95082   0.0   0.999   216
 9000   20   7.95082   7.95082   0.5   0.985   216   41000   20   7.95082   7.95082   0.0   0.999   216
10000   20   7.95082   7.95082   0.5   0.985   216   42000   20   7.95082   7.95082   0.0   0.999   216
11000   20   7.95082   7.95082   0.5   0.986   216   43000   20   7.95082   7.95082   0.0   1.000   216
12000   20   7.95082   7.95082   0.5   0.987   216   44000   20   7.95082   7.95082   0.0   1.000   216
13000   20   7.95082   7.95082   0.5   0.987   216   45000   20   7.95082   7.95082   0.0   1.000   216
14000   20   7.95082   7.95082   0.5   0.987   216   46000   20   7.95082   7.95082   0.0   1.000   216
15000   20   7.95082   7.95082   0.5   0.988   216   47000   20   7.95082   7.95082   0.0   1.000   216
16000   20   7.95082   7.95082   0.5   0.988   216   48000   20   7.95082   7.95082   0.0   1.000   216
17000   20   7.95082   7.95082   0.5   0.989   216   49000   20   7.95082   7.95082   0.0   1.000   216
18000   20   7.95082   7.95082   0.5   0.989   216   50000   20   7.95082   7.95082   0.0   1.000   216
    E    #    AVE       MIN       C      %      SA
    1   20   7.96066   7.95082   0.1   1.000   182
    2   20   7.95738   7.95082   0.3   0.975   187
  500   20   7.95082   7.95082   0.5   0.968   216
43000   20   7.95082   7.95082   0.0   1.000   216
SIMPLIFIED VERSION OF PROBLEM

Bugs found:
    Would converge and then fall out of convergence
        Alpha rounding errors
    Would find a minimum score within the first 10 epochs that it could never find again
        States not yet seen in training, but seen in testing, must choose the same action throughout the test session
Got convergence
    Convergence achieved once all SA pairs have been explored
A BIT MORE COMPLEXITY

Make the domain more complex:
    Added back all attributes, kept 2 buckets, removed the exit constraint
Absolute convergence before all SA pairs seen

    E    #    AVE       MIN       C      %      SA          E    #   AVE        MIN       C      %      SA
125000   7   6.82600   6.82600   0.1   0.997   19380                            …
126000   7   6.82600   6.82600   0.1   0.997   19390   1000000   7   6.82600   6.82600   0.0   1.000   21445
127000   7   6.82600   6.82600   0.1   0.997   19398   1001000   7   6.82600   6.82600   0.0   1.000   21445
128000   7   6.82600   6.82600   0.1   0.997   19401   1002000   7   6.82600   6.82600   0.0   1.000   21446
129000   7   6.82600   6.82600   0.1   0.997   19404   1003000   7   6.82600   6.82600   0.0   1.000   21448
130000   7   6.82600   6.82600   0.1   0.997   19415   1004000   7   6.82600   6.82600   0.0   1.000   21449
131000   7   6.82600   6.82600   0.1   0.997   19428   1005000   7   6.82600   6.82600   0.0   1.000   21449
132000   7   6.82600   6.82600   0.1   0.997   19434   1006000   7   6.82600   6.82600   0.0   1.000   21450
133000   7   6.82600   6.82600   0.1   1.000   19443   1007000   7   6.82600   6.82600   0.0   1.000   21450
134000   7   6.82600   6.82600   0.1   1.000   19450   1008000   7   6.82600   6.82600   0.0   1.000   21451
135000   7   6.82600   6.82600   0.1   1.000   19457   1009000   7   6.82600   6.82600   0.0   1.000   21451
136000   7   6.82600   6.82600   0.1   1.000   19463   1010000   7   6.82600   6.82600   0.0   1.000   21454
137000   7   6.82600   6.82600   0.0   1.000   19468   1011000   7   6.82600   6.82600   0.0   1.000   21455
138000   7   6.82600   6.82600   0.0   1.000   19475   1012000   7   6.82600   6.82600   0.0   1.000   21455
139000   7   6.82600   6.82600   0.0   1.000   19481   1013000   7   6.82600   6.82600   0.0   1.000   21457
140000   7   6.82600   6.82600   0.0   1.000   19492   1014000   7   6.82600   6.82600   0.0   1.000   21457
141000   7   6.82600   6.82600   0.0   1.000   19499   1015000   7   6.82600   6.82600   0.0   1.000   21459
                                                                                …
QTRAIN AND QTEST

Qtrain: the Q values of an SA pair, used by RL in training by following the policy and exploring
    Should converge to Q* (the values for the optimal policy), but we never know what Q* is
Our group has also been using Qtest
    The Q values of an SA pair, determined by testing the learned policy
Qtest and Qtrain should both converge to Q*, and so should converge to the same value
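A sketch of how Qtest could be estimated and compared with Qtrain: run the learned policy with exploration turned off, average the observed cost-to-go for each state-action pair, and look at the largest gap. The helper run_greedy_dialogue() is an assumed interface, not part of the original tool.

    from collections import defaultdict

    def estimate_qtest(run_greedy_dialogue, num_sessions, utterance_cost=1):
        """Average observed cost-to-go per (state, action) pair under the greedy policy.

        run_greedy_dialogue() is assumed to return (visited_pairs, total_cost) for one
        test dialogue run with no exploration.
        """
        totals, counts = defaultdict(float), defaultdict(int)
        for _ in range(num_sessions):
            visited, total_cost = run_greedy_dialogue()
            cost_to_go = total_cost
            for state, action in visited:
                totals[(state, action)] += cost_to_go
                counts[(state, action)] += 1
                cost_to_go -= utterance_cost
        return {sa: totals[sa] / counts[sa] for sa in totals}

    def max_divergence(qtrain, qtest):
        """Largest |Qtrain - Qtest| over SA pairs seen in testing; near zero suggests absolute convergence."""
        return max(abs(qtrain.get(sa, 0.0) - qtest.get(sa, 0.0)) for sa in qtest)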
QTRAIN AND QTEST

[Figure: Qtrain vs. Qtest over training]
Helps to show absolute convergence
NOW THE FULL VERSION

Moved up to 4 buckets: no absolute convergence
Noticed a difference between Qtest and Qtrain
Can further analyze Qtest (see the sketch below)
    Different states "merge"
    If this were an MDP, the path should not matter
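As a sketch of the Qtest analysis mentioned above: group the observed cost-to-go for each (state, action) pair by the path used to reach the state; if the per-path averages disagree, the path matters and the Markov assumption looks violated. This illustrates the idea only and is not the group's actual tool; the threshold is an arbitrary placeholder.

    from collections import defaultdict
    from statistics import mean

    def find_path_dependence(observations, threshold=0.5):
        """observations: iterable of (state, action, path, cost_to_go) tuples from test runs.

        Returns (state, action) pairs whose average cost-to-go depends on the path
        taken to reach the state -- evidence of an MDP violation.
        """
        by_path = defaultdict(lambda: defaultdict(list))
        for state, action, path, cost_to_go in observations:
            by_path[(state, action)][path].append(cost_to_go)

        suspects = {}
        for sa, paths in by_path.items():
            if len(paths) < 2:
                continue                                # only one way in: nothing to compare
            averages = {p: mean(costs) for p, costs in paths.items()}
            if max(averages.values()) - min(averages.values()) > threshold:
                suspects[sa] = averages                 # the path matters, so this state is not Markov
        return suspects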
VISUALIZING MDP VIOLATION

[Diagram: Qtest values for action AskDoors. Left-hand states CYM5 (4.255) and CYM15 (6.333) lead to right-hand states CDYM1 (2.0), CDYM5 (4.4425), and CDYM15 (6.333); edge labels 2.0, 2.0, 5.0, and 6.333.]
MDP VIOLATION

Why is there an MDP violation?
    CarBuckets caused states to be treated as equal when clearly they are not
How to remove the MDP violation?
    Keep a more accurate history
    Not always possible: the state space explodes
        Just keeping track of the order in which the 11 questions are asked leads to ~40 million states (11! ≈ 39.9 million)
    Barto & Sutton admit that most problems are not perfect MDPs, but that RL can deal with it
DISCUSSION

Does having an MDP violation hurt you?
    Despite non-convergence, the 4-bucket version did better than the 2-bucket version
        6.8091 vs. 6.8260
    RL can deal with some MDP violation
    Car gas mileage analogy
DISCUSSION

Does this mean we don't care about MDP violations?
    Rueckert (REU last year) removed an MDP violation
        It did not increase the state space dramatically
        It improved the policy learned
    One should be aware of any MDP violations
    Our tool can find them (or at least some of them)
    Major MDP violations need to be fixed
ACKNOWLEDGEMENTS AND QUESTIONS

Thanks to:
    My fellow interns
    Pat Dickerson & Kim Basney
    Peter Heeman
    Rebecca Lunsford
    Andrew Rueckert
    Ethan Selfridge
    Everyone at OGI
Questions?