                    Reinforcement Learning: Resources
• Class text: Chapter 21 (RL)
• Class text: Chapter 17 (Markov decision
  processes)
• Java applet example (simulated robot)
  – http://iridia.ulb.ac.be/~fvandenb/qlearning/qlearning.html

• Rich Maclin’s notes
  http://www.d.umn.edu/~rmaclin/cs8751/Notes/L12_Reinforcement_Learning.pdf




                 Outline
• Problem addressed
  – Example problems
• Reward function: R(s)
• Policies, discount
• Actions & Markov decision processes
• V*: an initial formulation
• Approximating V*: Q learning
        Problem Addressed
• “without some feedback about what is good
  and what is bad, [an] agent [has] no grounds
  for deciding which move to make” (p. 763,
  R&N)
• Characteristics of learning situation
  – Partially delayed reward
     • Incremental feedback about what is good or bad
  – Or fully delayed: Reward only at end of episode
     • E.g., Chess, backgammon
  – Opportunity for active exploration
  – Actions probabilistically lead to next state

      Example: TD-Gammon
• Training
  – played 1.5 million games against itself
• Reward scheme
  +100 if win
  -100 if lose
  0 for all other states
• After training, approximately as good as the best
  human player

  Tesauro, G. (1995). Temporal difference learning and
  TD-Gammon. Communications of the ACM, 38(3).
  http://www.research.ibm.com/massive/tdl.html
  Example: Learning to Walk
• Policy Gradient Reinforcement Learning for
  Fast Quadrupedal Locomotion
  – Nate Kohl and Peter Stone.
  – Proceedings of the IEEE International Conference
    on Robotics and Automation, pp. 2619--2624, May
    2004.
• Methods
  – Policy gradient reinforcement learning
  – No “state”
  – Forward speed is the objective function (“reward”)
  – Estimate the gradient (partial derivatives) of speed with
    respect to the gait configuration, and follow the gradient
    towards higher speeds
  Experimental Situation



   [Video clip not included in this text export]
http://www.cs.utexas.edu/users/AustinVilla/?p=research/learned_walk
   [Video clip not included in this text export]
• Initially, the Aibo's gait is clumsy and fairly slow
  (less than 150 mm/s). We deliberately started with a
  poor gait so that the learning process would not be
  systematically biased towards our best hand-tuned gait,
  which might have been locally optimal.
            Learning to Walk



   [Video clip not included in this text export]
• Midway through the training process, the Aibo is
  moving much faster than it was initially. However, it
  still exhibits some irregularities that slow it down.
        Done Learning to Walk


   [Video clips not included in this text export]
• After traversing the field a total of just over 1000 times over the
  course of 3 hours, we achieved our best learned gait, which
  allows the Aibo to move at approximately 291 mm/s. To our
  knowledge, this is the fastest reported walk on an Aibo as of
  November 2003. The hash marks on the field are 200 mm apart.
  The Aibo traverses 9 of them in 6.13 seconds demonstrating a
  speed of 1800mm/6.13s > 291 mm/s.
   Example: Piloting a Small
         Helicopter

• An Application of Reinforcement Learning to
  Aerobatic Helicopter Flight, Pieter Abbeel,
  Adam Coates, Morgan Quigley and Andrew
  Y. Ng. To appear in NIPS 19, 2007.
• Difficult real-time control problem
• Negative rewards for crashing, wobbling, or
  deviating from a set course


        http://ai.stanford.edu/~ang/
  Reward function for another
     problem: Hovering


   [Figure: reward function for hovering — not included in this text export]
• Ng et al. (2004). Autonomous inverted
  helicopter flight via reinforcement learning
   [Video clip not included in this text export]
      Reward function: R(s)
• R(s) is a reward function
• s is a state that the agent is in
• R(s) defines the reward that agent gets
  (once) for being in state s
• Can be positive (rewarding), zero (non-
  rewarding), or negative (punishing)
• State space & R(s) must be defined for
  the learning task
   [Figure from R. Maclin's notes — not included in this text export]
http://www.d.umn.edu/~rmaclin/cs8751/Notes/L12_Reinforcement_Learning.pdf
                   Strategy
• Reinforcement learning attempts to find
  a policy, π, that best satisfies this goal:

      r0 + γ r1 + γ² r2 + ...

• The goal above is called utility (also known
  as value)

• γ is called the discount factor
  (a small numeric sketch follows below)
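For concreteness, here is a minimal Python sketch of this discounted sum for a finite reward sequence. It is an addition to the slides, not part of them; the numbers reuse the 2 x 3 grid example that appears later.

    # A minimal sketch: the discounted goal r0 + gamma*r1 + gamma^2*r2 + ...
    # for a finite list of rewards.
    def discounted_return(rewards, gamma=0.9):
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Rewards seen along the path 1 -> 2 -> 3 of the later grid example:
    print(discounted_return([0, 0, 100], gamma=0.9))  # 81.0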
                     Policies
• Policy: π
   – What the agent should do for any state that the
     agent might reach (p. 615)
   π(s) = a
   s is a state
   a is the action the agent should take in that state
   (a minimal policy-as-table sketch follows after this slide)
• “quality of a policy is … measured by the
  expected utility of the possible environment
  histories generated by that policy” (p. 615)
• Optimal policy: π*
   – Yields the highest expected utility
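As a minimal sketch (an assumption, not from the slides): for a small, finite state space a policy can simply be a table mapping each state to an action. The action names below, and the "stay" placeholder for the win state, are illustrative only; the states are those of the grid example introduced later.

    # pi(s) = a, stored as a dictionary for the 6-state grid used later.
    policy = {
        1: "right", 2: "right", 3: "stay",   # "stay" is a made-up placeholder
        4: "up",    5: "up",    6: "up",     # for the terminal/win state
    }

    def act(state):
        """Look up the action the policy prescribes for this state."""
        return policy[state]

    print(act(5))  # "up"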
      Agent & Environment
• Agent performs a series of trials in the
  environment using its policy, π
• Agent's percepts
  – Current state, s
  – Reward, r, for state s




             Discount Factor, γ
  • γ “describes the preference of an agent for
    current rewards over future rewards. When γ
    is close to 0, rewards in the distant future are
    viewed as insignificant.” When γ is close to 1,
    rewards in the distant future are viewed as
    preferable. (p. 617 of text; italics added)
  • γ can also be seen as a characteristic of a
    task
  • E.g., in flying a helicopter, distant future
    rewards may not be useful if we crash the
    helicopter and permanently damage it on the
    current trial!

  γ near 0: immediate gratification is best
  (a small worked comparison follows below)
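A small worked comparison (the numbers are an added illustration, not from the slides): suppose a reward of 100 arrives three steps in the future and all earlier rewards are 0.

    With γ = 0.9:  γ³ · 100 = 0.729 · 100 = 72.9
    With γ = 0.1:  γ³ · 100 = 0.001 · 100 = 0.1

With γ near 1 the future reward still contributes heavily to the utility; with γ near 0 it is almost invisible.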
         Actions: Probabilistic
• Actions can have probabilistic effect
  – You don’t always achieve the effect you are trying for
• Probability of reaching s’ from s depends only
  on s and the action applied to s
  – Does not depend on the previous history of the agent
  – Markov Decision Process (MDP)
• MDP (Markov Decision Process) definition
  – Initial state: s0
  – Transition model: T(s, a, s’)
     • Probability that when you are in state s and apply action a
       you will end up in state s’
     • Depends only on state s and action a
  – Reward function: R(s)
  – Agent may not know T or R beforehand
    (a minimal data-structure sketch of an MDP follows below)
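A minimal sketch (assumed, not from the slides) of how a small, finite MDP could be represented: states, actions, a transition model T(s, a, s') as probabilities, and a reward function R(s). The state names, action name, and probabilities here are made up for illustration.

    S0 = "s0"                                    # initial state
    R = {"s0": 0.0, "s1": 0.0, "s2": 100.0}      # reward function R(s)

    # T[s][a] is a dict mapping next state s' -> probability of reaching it
    # when action a is applied in state s.  The numbers are made up.
    T = {
        "s0": {"go": {"s1": 0.8, "s0": 0.2}},
        "s1": {"go": {"s2": 0.9, "s1": 0.1}},
        "s2": {},                                # terminal: no actions
    }

    def transition_prob(s, a, s_next):
        """T(s, a, s'): probability of landing in s_next after doing a in s."""
        return T[s].get(a, {}).get(s_next, 0.0)

    print(transition_prob("s0", "go", "s1"))  # 0.8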
         Equation 21.1 of Text
 • Or how to make relatively simple things
   complex ;)

      U^π(s) = E[ Σ_{t=0..∞} γ^t R(s_t) | π, s_0 = s ]

 • Really just our previous goal with the E[f]
   notation (a small sampling sketch of this
   expectation follows below):

      r0 + γ r1 + γ² r2 + ...
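A minimal sketch (assumed, not from the slides) of what the expectation means in practice: estimate U^π(s) by averaging the discounted return over many sampled trials. It reuses the toy tables from the earlier MDP sketch; all names and numbers are placeholders.

    import random

    R = {"s0": 0.0, "s1": 0.0, "s2": 100.0}
    T = {"s0": {"go": {"s1": 0.8, "s0": 0.2}},
         "s1": {"go": {"s2": 0.9, "s1": 0.1}},
         "s2": {}}
    policy = {"s0": "go", "s1": "go"}

    def sample_next(s, a):
        states, probs = zip(*T[s][a].items())
        return random.choices(states, weights=probs)[0]

    def estimate_utility(s0, gamma=0.9, trials=10000, horizon=50):
        total = 0.0
        for _ in range(trials):
            s, ret, discount = s0, 0.0, 1.0
            for _ in range(horizon):
                ret += discount * R[s]
                if not T[s]:               # terminal state: no actions left
                    break
                s = sample_next(s, policy[s])
                discount *= gamma
            total += ret
        return total / trials

    print(estimate_utility("s0"))   # average discounted return from s0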
                      V*(s)
• We want to maximize the expected value of
  equation 21.1
• I.e., select a policy π (π(s) = a) that
  maximizes the value we can expect
• Call this maximized value V*(s):

      V*(s) = r0 + γ r1 + γ² r2 + ...
                   Example - 1
 • Assume a simple board game in which there
   is one location that is a “win”
 • Agent can move from any non-win state to
   neighbor states

   Board (two rows of three states):

       1   2   3
       4   5   6

   Actions: up, down, left, right
   R(3) = 100; R(n) = 0 for all other n
   Let γ = 0.9

   (A small code sketch of this grid follows below.)
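A minimal Python sketch (an addition, not from the slides) of the 2 x 3 board game: states 1..6, deterministic moves to neighbouring squares, R(3) = 100 and R(n) = 0 otherwise. Later sketches reuse these definitions.

    GAMMA = 0.9
    STATES = [1, 2, 3, 4, 5, 6]
    R = {s: (100 if s == 3 else 0) for s in STATES}

    # NEXT[s][a] = s' for the deterministic grid; state 3 is the "win"
    # state, so it has no outgoing moves here.
    NEXT = {
        1: {"right": 2, "down": 4},
        2: {"left": 1, "right": 3, "down": 5},
        3: {},
        4: {"up": 1, "right": 5},
        5: {"left": 4, "up": 2, "right": 6},
        6: {"left": 5, "up": 3},
    }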
                   Exercise
• Now, by hand, determine V*(s) for all
  states, s

      V*(s) = r0 + γ r1 + γ² r2 + ...
            = R(s) + γ R(s') + γ² R(s'') + ...

       1   2   3
       4   5   6

   R(3) = 100; R(n) = 0 for all other n
   γ = 0.9

   Hint: Start at V*(3)
                   Solution
      V*(s) = r0 + γ r1 + γ² r2 + ...
            = R(s) + γ R(s') + γ² R(s'') + ...

       1   2   3
       4   5   6

   R(3) = 100; R(n) = 0 for all other n; γ = 0.9

 •   V*(3) = R(3) = 100
 •   V*(2) = R(2) + 0.9*R(3) = 0 + 0.9*100 = 90
 •   V*(1) = R(1) + 0.9*R(2) + 0.9²*R(3) = 0 + 0.9*0 + 0.9²*100 = 81
 •   V*(6) = R(6) + 0.9*R(3) = 0 + 0.9*100 = 90
 •   V*(5) = R(5) + 0.9*R(2) + 0.9²*R(3) = 0 + 0.9*0 + 0.9²*100 = 81
 •   V*(4) = R(4) + 0.9*R(1) + 0.9²*R(2) + 0.9³*R(3) = 72.9

     (A short code check of these values follows below.)
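A minimal sketch (an addition, not from the slides) that checks the hand-computed values by repeatedly applying V(s) <- R(s) + γ max_s' V(s') over the deterministic grid defined in the earlier sketch.

    GAMMA = 0.9
    R = {s: (100 if s == 3 else 0) for s in range(1, 7)}
    NEXT = {1: {"right": 2, "down": 4}, 2: {"left": 1, "right": 3, "down": 5},
            3: {}, 4: {"up": 1, "right": 5},
            5: {"left": 4, "up": 2, "right": 6}, 6: {"left": 5, "up": 3}}

    V = {s: 0.0 for s in R}
    for _ in range(100):                 # plenty of sweeps to converge here
        V = {s: R[s] + GAMMA * max((V[s2] for s2 in NEXT[s].values()),
                                   default=0.0)
             for s in R}

    print(V)  # roughly {1: 81, 2: 90, 3: 100, 4: 72.9, 5: 81, 6: 90}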
                   Using V*

   V* values on the board:

       1: 81     2: 90     3: 100
       4: 72.9   5: 81     6: 90

• V* can be used to determine which move is the
  best move
• Algorithm for using V* to pick the best next move
  (a short sketch follows below)
  – Let s be our current state
  – Pick the action, a, such that V*(s') is maximized
• V* can thus be used to determine π*
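A minimal sketch (an addition, not from the slides) of picking the best next move from V* on the deterministic grid; NEXT and the V* values follow the earlier sketches.

    NEXT = {1: {"right": 2, "down": 4}, 2: {"left": 1, "right": 3, "down": 5},
            3: {}, 4: {"up": 1, "right": 5},
            5: {"left": 4, "up": 2, "right": 6}, 6: {"left": 5, "up": 3}}
    V_STAR = {1: 81, 2: 90, 3: 100, 4: 72.9, 5: 81, 6: 90}

    def best_action(s):
        """Pick the action a whose successor state s' has the largest V*."""
        return max(NEXT[s], key=lambda a: V_STAR[NEXT[s][a]])

    print(best_action(5))  # "up" (to 2); "right" (to 6) ties at 90, and
                           # max() keeps the first maximal action it sees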
 How can we compute V*(s)?

• It would be good to have a
  method that
  –incrementally takes actions,
   through a series of trials
  –observes the new state
  –gets reward for the new state
• And uses these values to
  compute V*(s)
                 For example
• At the start of learning to fly the helicopter we
  may not know
   – What the reward value, R(s), for a state will be;
     e.g., if we put the helicopter into a forward pitch
     attitude, what R(s) value will result?
• Also, at the start we don’t know with what
  probability one action taken in a state will
  lead to another state
   – I.e., we don’t know T(s, a, s’)
   – With a helicopter, if we are flying ahead with 5
     degrees forward pitch, and we push the stick
     forward another 10 degrees, what will the next
     state be, and with what probability?
                   Q Function
• Define a new function, Q(s, a), which is
  closely related to V*(s):

      Q(s, a) = R(s) + γ V*(s')

      where s' is the state resulting from applying action a in
      state s, and V*(s) = r0 + γ r1 + γ² r2 + ...

• Claims:
  A) If an agent learns the function Q, it can choose
     an optimal action
     I.e., it will have implicitly learned V* and hence π*
  B) We can write a formulation of function Q such that
     if the agent learns approximations of Q, these
     approximations will converge to Q itself

     Call these approximations Q̂

  (A small worked number from the earlier grid follows below.)
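To make the definition concrete with the grid values computed earlier (this worked line is an added illustration, not from the slides):

    Q(2, right) = R(2) + γ V*(3) = 0 + 0.9 · 100 = 90
    Q(2, down)  = R(2) + γ V*(5) = 0 + 0.9 · 81  = 72.9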
           Q and V* are closely related

      V*(s) = max_a Q(s, a)

     • To see this, recall that V*(s) is defined as the
       maximum value of

          r0 + γ r1 + γ² r2 + ...

     • So, if an action a that maximizes Q(s, a) is
       selected, then you are maximizing

          Q(s, a) = R(s) + γ V*(s')

     • r0 = R(s), and

          max_a Q(s, a) = r0 + γ r1 + γ² r2 + ...
                    Note Further
      • From our definition

          Q(s, a) = R(s) + γ V*(s')

      • And from our proof

          V*(s) = max_a Q(s, a)

      • We now also have:

          Q(s, a) = R(s) + γ max_a' Q(s', a')
                 Claim A)
• If an agent learns the function Q, it
  can choose an optimal action
    I.e., it will have implicitly learned π*
• Have shown that

      V*(s) = max_a Q(s, a)

• Remember that we were able to use V*(s)
  previously to instantiate an optimal policy
• So, if we have Q(s, a), we can select an action
  that maximizes Q(s, a), hence we have V*(s),
  hence we have implicitly obtained π*
      Learning Approximations of Q

      • Let Q̂ denote the learner's current
        approximation to Q

      • Consider a training rule

           Q̂(s, a) ← R(s) + γ max_a' Q̂(s', a')
       Q Learning for Deterministic Worlds
• For each (s, a) initialize a table entry
      Q̂(s, a) ← 0
• Initialize current state, s
• Do forever
      Select action a and take the action
      Observe new state, s'
      Get reward for new state, R(s')
      Update the table entry for (s, a) as follows:
         Q̂(s, a) ← R(s) + γ max_a' Q̂(s', a')
      s ← s'

  (A runnable sketch of this loop follows below.)
        Example of Computing Q̂(s, a)
 • Use the earlier board game

       1   2   3
       4   5   6

   Actions: up, down, left, right
   R(3) = 100; R(n) = 0 for all other n
   Let γ = 0.9
   Initially: Q̂(s, a) = 0

   Exercise: Compute some initial values of Q̂(s, a):
       Q̂(3, a)    Q̂(2, a)    Q̂(6, a)
                   Solution - 1

       1   2   3         R(3) = 100; R(n) = 0 for all other n
       4   5   6         Let γ = 0.9

   Q̂(s, a):
        up     down   left   right
   1    0      0      0      0
   2    0      0      0      0
   3    0      0      0      0
   4    0      0      0      0
   5    0      0      0      0
   6    0      0      0      0

 • First, we initialize the table
 • And let s = 3 (picking a start state)
 • Now, compute Q̂(s, a) for each action, a
         Computing Q̂(3, a)

      Q̂(3, a) ← R(3) + γ max_a' Q̂(s', a')

 • s' is a next state that we can reach from
   s = 3
   – There are no next states reachable from
     s = 3
 • So,

      Q̂(3, a) = R(3) = 100

      for all values of a
                   Solution - 2

       1   2   3         R(3) = 100; R(n) = 0 for all other n
       4   5   6         Let γ = 0.9

   Q̂(s, a):
        up     down   left   right
   1    0      0      0      0
   2    0      0      0      0
   3    100    100    100    100
   4    0      0      0      0
   5    0      0      0      0
   6    0      0      0      0

 • Now, we can compute Q̂(s, a) for states that
   are neighboring s = 3:
      Q̂(2, a) = ?
      Q̂(6, a) = ?
         Computing Q̂(2, a)                        1   2   3
                                                  4   5   6
      Q̂(2, a) ← R(2) + γ max_a' Q̂(s', a')

 • s' is a next state that we can reach from s = 2 with a
   particular action
    i.e., s' = 3 (a = right), s' = 5 (a = down), or s' = 1 (a = left)
 • So, for each particular a we need to compute max_a' Q̂(s', a')
 • Right now, all of those Q̂(s', a') values except for
   s' = 3 (a = right) are 0
 • So,

      Q̂(2, right) = R(2) + γ max_a Q̂(3, a)
                   = R(2) + 0.9 (100) = 90
                   Solution - 3

       1   2   3         R(3) = 100; R(n) = 0 for all other n
       4   5   6         Let γ = 0.9

   Q̂(s, a):
        up     down   left   right
   1    0      0      0      0
   2    0      0      0      90
   3    100    100    100    100
   4    0      0      0      0
   5    0      0      0      0
   6    0      0      0      0

      Q̂(6, a) = ?
         Computing Q̂(6, a)                        1   2   3
                                                  4   5   6
      Q̂(6, a) ← R(6) + γ max_a' Q̂(s', a')

 • s' is a next state that we can reach from s = 6 with a
   particular action
    i.e., s' = 5 (a = left), s' = 3 (a = up)
 • So, for each particular a we need to compute max_a' Q̂(s', a')
 • Right now, the s' = 5 Q̂(s', a') values are 0
 • So,

      Q̂(6, up) = R(6) + γ max_a Q̂(3, a)
                = R(6) + 0.9 (100) = 90
                   Solution - 4

       1   2   3         R(3) = 100; R(n) = 0 for all other n
       4   5   6         Let γ = 0.9

   Q̂(s, a):
        up     down   left   right
   1    0      0      0      0
   2    0      0      0      90
   3    100    100    100    100
   4    0      0      0      0
   5    0      0      0      0
   6    90     0      0      0
                Spreadsheet - 1
• Systematically compute Q estimator
  values
• Represent table of Q(s, a) estimates as
  a row in a spreadsheet
• In each trial (line on spreadsheet), just
  update one Q(s, a) estimate
  – I.e., an action is taken from a particular
    state


 http://www.cprince.com/courses/cs5541/lectures/RL/QLearning.xls
 http://www.cprince.com/courses/cs5541/lectures/RL/QLearning.pdf
                   Spreadsheet - 2
• Trials
  – Start in state 4
  – Used three of various possible trial
    sequences, ending at state 3
  – Indicate different trials with red, green, blue
• At the start of the third iteration through the trials,
  the entries in the Q(s, a) estimator table have
  converged to the V* values
  (a short convergence check follows below)

  V* values, for comparison:
      1: 81     2: 90     3: 100
      4: 72.9   5: 81     6: 90
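A short check of this convergence claim, as a sketch that builds on the Q-learning code sketched after the "Q Learning for Deterministic Worlds" slide (the Q table and ACTIONS there are assumptions of that sketch, not anything provided with these slides).

    # Assumes Q and ACTIONS from the earlier Q-learning sketch have already
    # been trained.  max_a Q_hat(s, a) should match V*(s) for each state.
    V_STAR = {1: 81, 2: 90, 3: 100, 4: 72.9, 5: 81, 6: 90}

    for s in sorted(V_STAR):
        learned = max(Q[(s, a)] for a in ACTIONS)
        print(s, round(learned, 1), V_STAR[s])   # the two columns should agree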
             Action Selection
• What if you guide your action selections by
  your initial estimates of Q(s, a)?
• Will need to program into the algorithm the
  preference to fill in parts of the Q(s, a)
  estimator table that have not yet been filled
• I.e., a preference to explore
   – Otherwise, if we are using just 0 values from the
     Q(s, a) estimators, we could end up going in
     cycles
   – E.g., 1 -> 2 -> 5 -> 4
• May need to keep a frequency count of the
  number of times we have performed an
  action from a state, and prefer to use actions
  less frequently tried (a small sketch follows below)
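A minimal sketch (an assumption, not from the slides) of exploration driven by visit counts: among the actions available in a state, prefer those tried least often, breaking ties at random.

    import random
    from collections import defaultdict

    visits = defaultdict(int)          # visits[(s, a)] = times a was tried in s

    def select_action(s, available_actions):
        fewest = min(visits[(s, a)] for a in available_actions)
        candidates = [a for a in available_actions if visits[(s, a)] == fewest]
        a = random.choice(candidates)
        visits[(s, a)] += 1
        return a

    print(select_action(1, ["right", "down"]))   # either action on the first call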
                 Claim B)
• Convergence of Q̂ to Q
  – See slides 14 and 15 of R. Maclin's notes
  – http://www.d.umn.edu/~rmaclin/cs8751/Notes/L12_Reinforcement_Learning.pdf
  Non-Deterministic Case - 1
• So far we’ve been looking at estimating Q
  when the next state, s’, is fully determined by
  s and a
• But, we started off by talking about Markov
  Decision Processes
   – Where a given action, a, taken from a state, s,
     only probabilistically leads to different new states, s'
       Non-Deterministic Case - 2
    • Come back to equation 21.1 of the text:

         U^π(s) = E[ Σ_{t=0..∞} γ^t R(s_t) | π, s_0 = s ]

    • We've been using the notation V; the
      text uses the notation U (utility)
    • The expected value, E[f], allows us to
      consider random variables
    • I.e., the effect of actions on states has
      an element of randomness
        Non-Deterministic Case - 3
      • Also define an expected value for Q(s, a):

           Q(s, a) = E[ R(s) + γ V*(s') ]

      • Alter the rule for the Q(s, a) estimator update:

           Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a)
                        + α_n [ R(s) + γ max_a' Q̂_{n−1}(s', a') ]

        where
           α_n = 1 / (1 + visits_n(s, a))

        (a short code sketch of this update follows below)

      • Can still prove convergence of the Q̂_n(s, a)
        estimator to Q (Watkins & Dayan, 1992)
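A minimal sketch (an assumption, not from the slides) of this non-deterministic update with the decaying learning rate α_n = 1 / (1 + visits_n(s, a)). The states, action names, and reward in the example call are placeholders.

    from collections import defaultdict

    GAMMA = 0.9
    Q = defaultdict(float)            # Q_hat[(s, a)] estimates, default 0
    visits = defaultdict(int)         # visit counts per (s, a)

    def q_update(s, a, s_next, reward, actions_in):
        """One update:
        Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a)
                    + alpha_n [ R(s) + gamma * max_a' Q_{n-1}(s',a') ]."""
        visits[(s, a)] += 1
        alpha = 1.0 / (1 + visits[(s, a)])
        future = max((Q[(s_next, a2)] for a2 in actions_in(s_next)),
                     default=0.0)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (reward + GAMMA * future)

    # Example call (made-up states/actions): action "go" taken in state "s1"
    # led to "s2", with R("s1") = 0.
    q_update("s1", "go", "s2", reward=0.0, actions_in=lambda s: ["go", "stop"])
    print(Q[("s1", "go")])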

								