Reinforcement Learning Resources


Reinforcement Learning:
Resources
• Class text: Chapter 21 (RL)
• Class text: Chapter 17 (Markov decision
processes)
• Java applet example (simulated robot)
– http://iridia.ulb.ac.be/~fvandenb/qlearning/qlearning.html
• Rich Maclin’s notes
http://www.d.umn.edu/~rmaclin/cs8751/Notes/L12_Reinforcement_Learning.pdf
Outline
• Problem addressed
– Example problems
• Reward function: R(s)
• Policies, discount
• Actions & Markov Decision Problems
• V*: an initial formulation
• Approximating V*: Q Learning
Problem Addressed
• “without some feedback about what is good
and what is bad, [an] agent [has] no grounds
for deciding which move to make” (p. 763,
R&N)
• Characteristics of learning situation
– Partially delayed reward
• Incremental feedback about what is good or bad
– Or fully delayed: Reward only at end of episode
• E.g., Chess, backgammon
– Opportunity for active exploration
– Actions probabilistically lead to next state
Example: TD-Gammon
• Training
– played 1.5 million games against itself
• Reward scheme
+100 if win
-100 if lose
0 for all other states
• After training, approximately as good as best
human player
Tesauro, G. (1995). Temporal Difference
Learning and TD-Gammon. Communications
of the ACM, 38(3).
http://www.research.ibm.com/massive/tdl.html
Example: Learning to Walk
• Policy Gradient Reinforcement Learning for
Fast Quadrupedal Locomotion
– Nate Kohl and Peter Stone.
– Proceedings of the IEEE International Conference
on Robotics and Automation, pp. 2619-2624, May
2004.
• Methods
– Gradient reinforcement learning
– No “state”
– Forward speed is the objective function (“reward”)
– Estimate gradient (partial derivative) of gait
configurations, and follow gradient towards higher
speeds
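To make "estimate the gradient and follow it" concrete, here is a minimal sketch of finite-difference hill climbing. It is not Kohl and Stone's actual procedure: the function `toy_speed`, its two parameters, and the step sizes are hypothetical stand-ins for a real gait evaluation on the robot.

```python
def toy_speed(params):
    """Hypothetical stand-in for measured forward speed; it peaks at (0.6, 0.3)."""
    x, y = params
    return 100.0 - 400.0 * (x - 0.6) ** 2 - 400.0 * (y - 0.3) ** 2


def estimate_gradient(f, params, eps=1e-3):
    """Estimate each partial derivative of f with a central difference."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def hill_climb(f, params, step=1e-3, iterations=200):
    """Repeatedly move the parameters a small step along the estimated gradient."""
    params = list(params)
    for _ in range(iterations):
        grad = estimate_gradient(f, params)
        params = [p + step * g for p, g in zip(params, grad)]
    return params


start = [0.0, 0.0]
best = hill_climb(toy_speed, start)
```

Note that there is no state here: the only feedback is the objective value for each parameter setting, which matches the "No state" bullet above.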
Experimental Situation
http://www.cs.utexas.edu/users/AustinVilla/?p=research/learned_walk
• Initially, the Aibo's gait is clumsy and fairly slow
(less than 150 mm/s). We deliberately started with
a poor gait so that the learning process would not
be systematically biased towards our best hand-
tuned gait, which might have been locally optimal.
Learning to Walk
• Midway through the training process, the Aibo is
moving much faster than it was initially. However, it
still exhibits some irregularities that slow it down.
Done Learning to Walk
• After traversing the field a total of just over 1000 times over the
course of 3 hours, we achieved our best learned gait, which
allows the Aibo to move at approximately 291 mm/s. To our
knowledge, this is the fastest reported walk on an Aibo as of
November 2003. The hash marks on the field are 200 mm apart.
The Aibo traverses 9 of them in 6.13 seconds demonstrating a
speed of 1800mm/6.13s > 291 mm/s.
Example: Piloting a Small
Helicopter
• An Application of Reinforcement Learning to
Aerobatic Helicopter Flight, Pieter Abbeel,
Adam Coates, Morgan Quigley and Andrew
Y. Ng. To appear in NIPS 19, 2007.
• Difficult real-time control problem
• Negative rewards for crashing, wobbling, or
deviating from a set course
http://ai.stanford.edu/~ang/
Reward function for another
problem: Hovering
• Ng et al. (2004). Autonomous inverted
helicopter flight via reinforcement learning
Reward function: R(s)
• R(s) is a reward function
• s is a state that the agent is in
• R(s) defines the reward that the agent gets
(once) for being in state s
• Can be positive (rewarding), 0 (non-rewarding),
or negative (punishing)
• State space & R(s) must be defined for
the learning task
http://www.d.umn.edu/~rmaclin/cs8751/Notes/L12_Reinforcement_Learning.pdf
Strategy
• Reinforcement learning attempts to find
a policy, π, that best satisfies this goal: maximize
$r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$
• The goal above is called utility (also known
as value)
• γ is called the discount factor
Policies
• Policy: π
– What the agent should do for any state that the
agent might reach (p. 615)
π(s) = a
s is a state
a is the action the agent should take in that state
• “quality of a policy is … measured by the
expected utility of the possible environment
histories generated by that policy” (p. 615)
• Optimal policy: π*
– Yields the highest expected utility
Agent & Environment
• Agent performs a series of trials in the
environment using its policy, π
• Agent percepts
– Current state, s
– Reward, r, for state s
Discount Factor, γ
• “describes the preference of an agent for
current rewards over future rewards. When γ
is close to 0, rewards in the distant future are
viewed as insignificant.” When γ is close to 1,
rewards in the distant future count almost as
much as immediate rewards. (P. 617 of text; italics added)
• γ can also be seen as a characteristic of a
task
• E.g., in flying a helicopter, distant future
rewards may not be useful if we crash the
helicopter and permanently damage it on the
current trial!
• γ near 0: immediate gratification is best
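A small sketch of how the discount factor changes the value of a delayed reward. The reward sequences below are made up for illustration; `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Sum r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


delayed = [0, 0, 100]    # reward arrives two steps in the future
immediate = [100, 0, 0]  # same reward, received right away

for gamma in (0.1, 0.9):
    print(gamma, discounted_return(delayed, gamma), discounted_return(immediate, gamma))
```

With γ = 0.1 the delayed reward is worth about 1, while with γ = 0.9 it is worth about 81; the immediate reward is worth 100 in both cases.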
Actions: Probabilistic
• Actions can have probabilistic effect
– You don’t always achieve the effect you are trying for
• Probability of reaching s’ from s depends only
on s and the action applied to s
– Does not depend on the previous history of the agent
– Markov Decision Process (MDP)
• MDP (Markov Decision Process) definition
– Initial state: s0
– Transition model: T(s, a, s’)
• Probability that when you are in state s and apply action a
you will end up in state s’
• Depends only on state s and action a
– Reward function: R(s)
– Agent may not know T or R beforehand
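One way to picture the transition model T(s, a, s’) is as a table of probabilities that the environment samples from. The sketch below is an illustration only: the state names, the action name, and the probabilities are invented, not taken from any of the example problems.

```python
import random

# Hypothetical transition model: T[(s, a)] maps each possible next state s'
# to the probability of ending up there after applying action a in state s.
T = {
    ("level", "push_stick"): {"pitched_forward": 0.8, "level": 0.2},
    ("pitched_forward", "push_stick"): {"pitched_forward": 1.0},
}


def sample_next_state(T, s, a, rng):
    """Draw s' according to T(s, a, s'); depends only on s and a (Markov property)."""
    next_states = list(T[(s, a)])
    weights = [T[(s, a)][n] for n in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]


rng = random.Random(0)
samples = [sample_next_state(T, "level", "push_stick", rng) for _ in range(10000)]
fraction_forward = samples.count("pitched_forward") / len(samples)
```

The learning agent does not get to read this table; it only sees the sampled next states, which is why T may be unknown beforehand.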
Equation 21.1 of Text
• Or how to make relatively simple things
complex ;)
$U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi, s_0 = s\right]$
• Really just our previous goal with the E[f]
notation
$r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$
V*(s)
• We want to maximize the expected value of
equation 21.1
• I.e., select a policy (π(s) = a) that
maximizes the value we can expect
• Call this maximized value V*(s)
$V^*(s) = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$
Example - 1
• Assume a simple board game in which there
is one location that is a “win”
• Agent can move from any non-win state to
neighbor states
• R(3) = 100; R(n) = 0 for all other n
• Actions: up, down, left, right
• Let γ = 0.9
• Board layout:
1 2 3
4 5 6
Exercise
• Now, by hand, determine V*(s) for all
states, s
• R(3) = 100; R(n) = 0 for all other n; γ = 0.9
$V^*(s) = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots = R(s) + \gamma R(s') + \gamma^2 R(s'') + \dots$
1 2 3
4 5 6
Hint: Start at V*(3)
Solution
• R(3) = 100; R(n) = 0 for all other n; γ = 0.9
$V^*(s) = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots = R(s) + \gamma R(s') + \gamma^2 R(s'') + \dots$
1 2 3
4 5 6
• V*(3) = R(3) = 100
• V*(2) = R(2) + 0.9*R(3) = 0 + 0.9*100 = 90
• V*(1) = R(1) + 0.9*R(2) + 0.9^2*R(3) = 0 + 0.9*0 + 0.9^2*100 = 81
• V*(6) = R(6) + 0.9*R(3) = 0 + 0.9*100 = 90
• V*(5) = R(5) + 0.9*R(2) + 0.9^2*R(3) = 0 + 0.9*0 + 0.9^2*100 = 81
• V*(4) = R(4) + 0.9*R(1) + 0.9^2*R(2) + 0.9^3*R(3) = 0 + 0 + 0 + 0.9^3*100 = 72.9
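The hand calculation can be checked with a short sketch that works backwards from the win state, repeatedly applying V*(s) = R(s) + γ · (best neighbor value). The `NEIGHBORS` table is an assumption read off the 2 x 3 board layout, and the fixed number of sweeps is simply chosen to be more than enough for a board this small.

```python
GAMMA = 0.9
WIN = 3

# Assumed adjacency for the board
#   1 2 3
#   4 5 6
# The win state 3 has no outgoing moves.
NEIGHBORS = {1: [2, 4], 2: [1, 3, 5], 4: [1, 5], 5: [2, 4, 6], 6: [3, 5]}


def reward(s):
    """R(3) = 100, R(n) = 0 for all other n."""
    return 100.0 if s == WIN else 0.0


def optimal_values(sweeps=10):
    """Propagate values outward from the win state until they stop changing."""
    v = {s: 0.0 for s in range(1, 7)}
    v[WIN] = reward(WIN)
    for _ in range(sweeps):
        for s, nbrs in NEIGHBORS.items():
            v[s] = reward(s) + GAMMA * max(v[n] for n in nbrs)
    return v


v_star = optimal_values()
for s in sorted(v_star):
    print(s, round(v_star[s], 1))
```

This relies on knowing the board's moves in advance, which is exactly the knowledge a learning agent may not have.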
Using V*
V* for each state (state number, then value):
1: 81     2: 90    3: 100
4: 72.9   5: 81    6: 90
• V* can be used to determine which move is the
best move
• Algorithm for using V* to pick the best next move
– Let s be our current state
– Pick the action, a, such that V*(s’) is maximized, where s’ is the state that a leads to
• V* can thus be used to determine π*
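A minimal sketch of the move-picking rule above, using the V* values from the solution. The `MOVES` table (which action leads to which state) is an assumption read off the board layout; ties are broken by whichever action is listed first.

```python
V_STAR = {1: 81.0, 2: 90.0, 3: 100.0, 4: 72.9, 5: 81.0, 6: 90.0}

# Assumed deterministic moves: MOVES[s][a] is the state reached by action a from s.
MOVES = {
    1: {"right": 2, "down": 4},
    2: {"left": 1, "right": 3, "down": 5},
    4: {"up": 1, "right": 5},
    5: {"up": 2, "left": 4, "right": 6},
    6: {"up": 3, "left": 5},
}


def best_move(s):
    """Pick the action whose resulting state s' has the largest V*(s')."""
    return max(MOVES[s], key=lambda a: V_STAR[MOVES[s][a]])


def follow_policy(s):
    """Follow the greedy policy from s until the win state (3) is reached."""
    path = [s]
    while s != 3:
        s = MOVES[s][best_move(s)]
        path.append(s)
    return path


print(best_move(2))      # right
print(follow_policy(4))  # [4, 1, 2, 3]
```

From state 4 both "up" and "right" lead to a state worth 81, so either is optimal; this sketch happens to pick "up".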
How can we compute V*(s)?
• It would be good to have a
method that
–incrementally takes actions,
through a series of trials
–observes the new state
–gets reward for the new state
• And uses these values to
compute V*(s)
For example
• At the start of learning to fly the helicopter we
may not know
– What the reward value, R(s), for a state will be;
e.g., if we put the helicopter into a forward pitch
attitude, what R(s) value will result?
• Also don’t know at the start with what
probabilities one action taken in a state will
lead to another state
– I.e., we don’t know T(s, a, s’)
– With helicopter, if we are flying ahead with 5
degrees forward pitch, and we push the stick
forward another 10 degrees, what will the next
state be, and with what probability?
Q Function
• Define a new function, Q(s, a), which is
closely related to V*(s)
$Q(s, a) = R(s) + \gamma V^*(s')$
Where s’ is the state resulting from applying action a in
state s, and
$V^*(s) = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$
• Claims:
A) If an agent learns the function, Q, it can choose
an optimal action
I.e., it will have implicitly learned V* and hence π*
B) We can write a formulation of function Q such that
if the agent learns approximations of Q, these
approximations will converge to Q itself
Call these approximations $\hat{Q}$
Q and V* are closely related
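$V^*(s) = \max_a Q(s, a)$
• To see this, recall that V*(s) is defined as the
maximum value for
$r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$
• So, if an action a that maximizes
Q(s, a) is selected, then you are maximizing:
$Q(s, a) = R(s) + \gamma V^*(s')$
• r0 = R(s), and γV*(s’) is the maximum value of the remaining terms, γ r1 + γ^2 r2 + ..., so
$\max_a Q(s, a) = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$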
Note Further
• From our definition
$Q(s, a) = R(s) + \gamma V^*(s')$
• And from our proof
$V^*(s) = \max_a Q(s, a)$
• We now also have:
$Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$
Claim A)
• If an agent learns the function, Q, it
can choose an optimal action
I.e., it will have implicitly learned π*
• Have shown that
$V^*(s) = \max_a Q(s, a)$
• Remember that we were able to
use V*(s) previously to instantiate
an optimal policy
• So, if we have Q(s,a), we can select an action
that maximizes Q(s, a), hence we have V*(s),
hence we have implicitly obtained π*
Learning Approximations of Q
• Let $\hat{Q}$ denote the learner’s current
approximation to Q
• Consider a training rule
$\hat{Q}(s, a) \leftarrow R(s) + \gamma \max_{a'} \hat{Q}(s', a')$
Q Learning for
Deterministic Worlds
• For each (s, a) initialize a table entry
$\hat{Q}(s, a) \leftarrow 0$
• Initialize current state, s
• Do forever
– Select action a and take the action
– Observe new state, s’
– Get reward for new state, R(s’)
– Update the table entry for (s, a) as follows
$\hat{Q}(s, a) \leftarrow R(s) + \gamma \max_{a'} \hat{Q}(s', a')$
– s ← s’
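The loop above can be sketched for the earlier 2 x 3 board game. Several choices here are assumptions rather than part of the algorithm as stated: actions are chosen uniformly at random, moves off the board are simply not offered, "forever" is replaced by a fixed number of trials, and the entries for the win state are set to R(3) up front because no next state is reachable from it.

```python
import random

GAMMA = 0.9
ROWS, COLS = 2, 3
WIN = 3
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def reward(s):
    """R(3) = 100, R(n) = 0 for all other n."""
    return 100.0 if s == WIN else 0.0


def legal_moves(s):
    """Map each on-board action available in state s to the state it leads to."""
    row, col = divmod(s - 1, COLS)
    moves = {}
    for name, (d_row, d_col) in ACTIONS.items():
        r, c = row + d_row, col + d_col
        if 0 <= r < ROWS and 0 <= c < COLS:
            moves[name] = r * COLS + c + 1
    return moves


def q_learning(trials=500, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(1, 7) for a in ACTIONS}  # initialize table to 0
    for a in ACTIONS:
        q[(WIN, a)] = reward(WIN)  # no next state is reachable from the win state
    for _ in range(trials):
        s = rng.choice([1, 2, 4, 5, 6])  # start each trial in a non-win state
        while s != WIN:
            moves = legal_moves(s)
            a = rng.choice(sorted(moves))  # select an action and take it
            s_next = moves[a]              # observe the new state
            # Deterministic-world update: Q(s, a) <- R(s) + gamma * max_a' Q(s', a')
            q[(s, a)] = reward(s) + GAMMA * max(q[(s_next, a2)] for a2 in ACTIONS)
            s = s_next
    return q


q = q_learning()
v = {s: max(q[(s, a)] for a in ACTIONS) for s in range(1, 7)}
```

Taking the maximum over actions in each row of the learned table recovers the V* values computed by hand earlier, without the learner ever being told the board's transition model.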
Example of Computing $\hat{Q}(s, a)$
• Use earlier board game
• R(3) = 100; R(n) = 0 for all other n
• Let γ = 0.9
• Actions: up, down, left, right
1 2 3
4 5 6
• Initially: $\hat{Q}(s, a) = 0$
Exercise:
Compute some initial values of $\hat{Q}(s, a)$:
$\hat{Q}(3, a)$, $\hat{Q}(2, a)$, $\hat{Q}(6, a)$
Solution - 1
• R(3) = 100; R(n) = 0 for all other n; γ = 0.9
1 2 3
4 5 6
• First, we initialize the table
• And let s = 3 (picking a start state)
• Now, compute $\hat{Q}(s, a)$ for each action, a
$\hat{Q}(s, a)$ table:
s    up    down  left  right
1    0     0     0     0
2    0     0     0     0
3    0     0     0     0
4    0     0     0     0
5    0     0     0     0
6    0     0     0     0
Computing $\hat{Q}(3, a)$
$\hat{Q}(3, a) \leftarrow R(3) + \gamma \max_{a'} \hat{Q}(s', a')$
• s’ is a next state that we can reach from
s=3
– There are no next states reachable from
s=3
• So,
$\hat{Q}(3, a) = R(3) = 100$
for all values of a
Solution - 2
• R(3) = 100; R(n) = 0 for all other n; γ = 0.9
1 2 3
4 5 6
• Now, we can compute $\hat{Q}(s, a)$
• For states that are neighboring s=3:
$\hat{Q}(2, a) = ?$
$\hat{Q}(6, a) = ?$
$\hat{Q}(s, a)$ table:
s    up    down  left  right
1    0     0     0     0
2    0     0     0     0
3    100   100   100   100
4    0     0     0     0
5    0     0     0     0
6    0     0     0     0
Computing $\hat{Q}(2, a)$
1 2 3
4 5 6
$\hat{Q}(2, a) \leftarrow R(2) + \gamma \max_{a'} \hat{Q}(s', a')$
• s’ is a next state that we can reach from s=2 with a
particular action a,
i.e., s’=3 (a = right), s’=5 (a = down), or s’=1 (a = left)
• So, for each particular a we need to compute $\max_{a'} \hat{Q}(s', a')$
• Right now, all of those $\hat{Q}(s', a')$
values except for s’=3 (a = right) are 0
• So,
$\hat{Q}(2, \text{right}) \leftarrow R(2) + \gamma \max_{a'} \hat{Q}(3, a') = R(2) + 0.9(100) = 90$
Solution - 3
• R(3) = 100; R(n) = 0 for all other n; γ = 0.9
1 2 3
4 5 6
• Next: $\hat{Q}(6, a) = ?$
$\hat{Q}(s, a)$ table:
s    up    down  left  right
1    0     0     0     0
2    0     0     0     90
3    100   100   100   100
4    0     0     0     0
5    0     0     0     0
6    0     0     0     0
Computing $\hat{Q}(6, a)$
1 2 3
4 5 6
$\hat{Q}(6, a) \leftarrow R(6) + \gamma \max_{a'} \hat{Q}(s', a')$
• s’ is a next state that we can reach from s=6 with a
particular action a,
i.e., s’=5 (a = left), s’=3 (a = up)
• So, for each particular a we need to compute $\max_{a'} \hat{Q}(s', a')$
• Right now, the s’=5 $\hat{Q}(s', a')$
values are 0
• So,
$\hat{Q}(6, \text{up}) \leftarrow R(6) + \gamma \max_{a'} \hat{Q}(3, a') = R(6) + 0.9(100) = 90$
Solution - 4
• R(3) = 100; R(n) = 0 for all other n; γ = 0.9
1 2 3
4 5 6
$\hat{Q}(s, a)$ table:
s    up    down  left  right
1    0     0     0     0
2    0     0     0     90
3    100   100   100   100
4    0     0     0     0
5    0     0     0     0
6    90    0     0     0
Spreadsheet - 1
• Systematically compute Q estimator
values
• Represent table of Q(s, a) estimates as
a row in a spreadsheet
• In each trial (line on spreadsheet), just
update one Q(s, a) estimate
– I.e., an action is taken from a particular
state
http://www.cprince.com/courses/cs5541/lectures/RL/QLearning.xls
http://www.cprince.com/courses/cs5541/lectures/RL/QLearning.pdf
Spreadsheet - 2
• Trials
– Start in state 4
– Used three of various possible trial
sequences, ending at state 3
– Indicate different trials with red, green, blue
• At the start of the third iteration through the trials,
the entries in the Q(s, a) estimator table
have converged to V* values
V* for each state (state number, then value):
1: 81     2: 90    3: 100
4: 72.9   5: 81    6: 90
Action Selection
• What if you guide your action selections by
your initial estimates of Q(s, a)?
• Will need to program into the algorithm the
preference to fill in parts of the Q(s, a)
estimator table that have not yet been filled
• I.e., a preference to explore
– Otherwise, if we are using just 0 values from the
Q(s, a) estimators, we could end up going in
cycles
– E.g., 1 -> 2 -> 5 -> 4
• May need to keep a frequency count of the
number of times we have performed an
action from a state, and prefer to use actions
less frequently tried
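One possible way to build in this preference to explore is sketched below: pick the action tried least often from the current state, and only fall back on the Q(s, a) estimates to break ties. The function name `select_action` and the tie-breaking rule are assumptions for illustration, not a prescribed method.

```python
from collections import defaultdict


def select_action(state, actions, q, counts):
    """Prefer the least-tried action; among equally tried actions, prefer the higher estimate."""
    return min(actions, key=lambda a: (counts[(state, a)], -q[(state, a)]))


q = defaultdict(float)     # Q(s, a) estimates, all 0 to start
counts = defaultdict(int)  # how many times each (state, action) has been tried

actions_from_4 = ["up", "right"]
counts[(4, "up")] = 2      # "up" has already been tried twice from state 4
first_choice = select_action(4, actions_from_4, q, counts)

counts[(4, "right")] = 2   # now both actions have been tried equally often
q[(4, "up")] = 72.9        # and "up" has the better estimate
second_choice = select_action(4, actions_from_4, q, counts)
```

Here `first_choice` is "right" (the less frequently tried action) and `second_choice` is "up" (equal counts, higher estimate), so an agent that starts with all-zero estimates does not keep repeating the same cycle.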
Claim B)
• Convergence of $\hat{Q}$ to Q
– See slides 14 and 15 of R. Maclin’s notes
– http://www.d.umn.edu/~rmaclin/cs8751/Notes/L12_Reinforcement_Learning.pdf
Non-Deterministic Case - 1
• So far we’ve been looking at estimating Q
when the next state, s’, is fully determined by
s and a
• But, we started off by talking about Markov
Decision Processes
– Where a given action, a, taken from a state, s,
only probabilistically leads to different new states, s’
Non-Deterministic Case - 2
• Come back to equation 21.1 of text
$U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi, s_0 = s\right]$
• We’ve been using the notation V, the
text uses the notation U (utility)
• The expected value, E[f], allows us to
consider random variables
• I.e., the effect of actions on states has
an element of randomness
Non-Deterministic Case - 3
• Also define expected value for Q(s, a)
$Q(s, a) = E[R(s) + \gamma V^*(s')]$
• Alter rule for Q(s, a) estimator update
$\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n)\hat{Q}_{n-1}(s, a) + \alpha_n \left[R(s) + \gamma \max_{a'} \hat{Q}_{n-1}(s', a')\right]$
where
$\alpha_n = \frac{1}{1 + \mathit{visits}_n(s, a)}$
• Can still prove convergence of the $\hat{Q}_n(s, a)$
estimator to Q (Watkins & Dayan, 1992)
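The altered update rule can be sketched as a single function. This assumes that visits_n(s, a) counts the current visit as well, so the first update of a pair uses a weight of 1/2; the function and variable names are hypothetical.

```python
from collections import defaultdict

GAMMA = 0.9


def q_update(q, visits, s, a, r, s_next, actions):
    """Blend the old estimate with the new sample, weighting the sample by alpha_n."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    sample = r + GAMMA * max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] = (1.0 - alpha) * q[(s, a)] + alpha * sample
    return q[(s, a)]


actions = ["up", "down", "left", "right"]
q = defaultdict(float)
visits = defaultdict(int)
for a in actions:
    q[(3, a)] = 100.0  # win-state entries from the earlier board example

first = q_update(q, visits, 2, "right", 0.0, 3, actions)   # alpha = 1/2
second = q_update(q, visits, 2, "right", 0.0, 3, actions)  # alpha = 1/3
```

Because the weight on each new sample shrinks as the pair is visited more often, a single unlucky transition moves the estimate less and less: here the estimate goes from 0 to 45 and then to 60 on its way toward 90.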