Reinforcement Learning: Resources
• Class text: Chapter 21 (Reinforcement Learning)
• Class text: Chapter 17 (Markov decision processes)
• Java applet example (simulated robot)
  – http://iridia.ulb.ac.be/~fvandenb/qlearning/qlearning.html
• Rich Maclin's notes
  – http://www.d.umn.edu/~rmaclin/cs8751/Notes/L12_Reinforcement_Learning.pdf

Outline
• Problem addressed
  – Example problems
• Reward function: R(s)
• Policies, discount
• Actions & Markov decision problems
• V*: an initial formulation
• Approximating V*: Q learning

Problem Addressed
• "without some feedback about what is good and what is bad, [an] agent [has] no grounds for deciding which move to make" (p. 763, R&N)
• Characteristics of the learning situation
  – Partially delayed reward
    • Incremental feedback about what is good or bad
  – Or fully delayed reward: reward only at the end of an episode
    • E.g., chess, backgammon
  – Opportunity for active exploration
  – Actions probabilistically lead to the next state

Example: TD-Gammon
• Training: played 1.5 million games against itself
• Reward scheme
  – +100 if win
  – -100 if lose
  – 0 for all other states
• After training, approximately as good as the best human players
• Tesauro, G. (1995). Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38(3).
• http://www.research.ibm.com/massive/tdl.html

Example: Learning to Walk
• Kohl, N., & Stone, P. (2004). Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion. Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2619-2624.
• Methods
  – Policy gradient reinforcement learning
  – No "state"
  – Forward speed is the objective function (the "reward")
  – Estimate the gradient (partial derivatives) over gait configurations, and follow the gradient toward higher speeds

Experimental Situation
• [video] http://www.cs.utexas.edu/users/AustinVilla/?p=research/learned_walk
• Initially, the Aibo's gait is clumsy and fairly slow (less than 150 mm/s). We deliberately started with a poor gait so that the learning process would not be systematically biased toward our best hand-tuned gait, which might have been locally optimal.

Learning to Walk
• [video] Midway through the training process, the Aibo is moving much faster than it was initially. However, it still exhibits some irregularities that slow it down.

Done Learning to Walk
• [video] After traversing the field a total of just over 1000 times over the course of 3 hours, we achieved our best learned gait, which allows the Aibo to move at approximately 291 mm/s. To our knowledge, this was the fastest reported walk on an Aibo as of November 2003. The hash marks on the field are 200 mm apart; the Aibo traverses 9 of them in 6.13 seconds, demonstrating a speed of 1800 mm / 6.13 s > 291 mm/s.

Example: Piloting a Small Helicopter
• Abbeel, P., Coates, A., Quigley, M., & Ng, A. Y. (2007). An Application of Reinforcement Learning to Aerobatic Helicopter Flight. NIPS 19.
• Difficult real-time control problem
• Negative rewards for crashing, wobbling, or deviating from a set course
• http://ai.stanford.edu/~ang/

Reward function for another problem: Hovering
• [figure and video] Ng et al. (2004). Autonomous inverted helicopter flight via reinforcement learning.
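To make "negative rewards for wobbling or deviating" concrete, here is a minimal Python sketch of what a hovering-style reward function might look like. This is an illustration only, not the reward used by Ng et al.; the function name, state variables, and coefficients are all hypothetical.

```python
import numpy as np

def hover_reward(position, target, angular_velocity, crashed):
    """Hypothetical hovering reward: 0 at a perfect hover, increasingly
    negative for drifting off the target point, wobbling, or crashing."""
    if crashed:
        return -1000.0  # large penalty for crashing
    drift = np.linalg.norm(np.asarray(position) - np.asarray(target))
    wobble = np.linalg.norm(angular_velocity)  # roll/pitch/yaw rates
    return -(drift ** 2) - 0.1 * (wobble ** 2)

# 20 cm off target with a slight roll rate: a small negative reward
print(hover_reward([0.2, 0.0, 1.0], [0.0, 0.0, 1.0], [0.05, 0.0, 0.0], False))
```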
Reward function: R(s)
• R(s) is a reward function
• s is a state that the agent is in
• R(s) defines the reward the agent gets (once) for being in state s
• The reward can be positive (rewarding), 0 (non-rewarding), or negative (punishing)
• The state space and R(s) must be defined for the learning task
• [figure] http://www.d.umn.edu/~rmaclin/cs8751/Notes/L12_Reinforcement_Learning.pdf

Strategy
• Reinforcement learning attempts to find a policy, π, that best satisfies this goal:
  r_0 + γ r_1 + γ² r_2 + ⋯
• The goal above is called the utility (also known as the value)
• γ is called the discount factor

Policies
• Policy, π: what the agent should do for any state that the agent might reach (p. 615)
  – π(s) = a, where s is a state and a is the action the agent should take in that state
• The "quality of a policy is … measured by the expected utility of the possible environment histories generated by that policy" (p. 615)
• Optimal policy, π*: the policy that yields the highest expected utility

Agent & Environment
• The agent performs a series of trials in the environment using its policy, π
• Agent percepts:
  – the current state, s
  – the reward, r, for state s

Discount Factor, γ
• γ "describes the preference of an agent for current rewards over future rewards. When γ is close to 0, rewards in the distant future are viewed as insignificant." When γ is close to 1, rewards in the distant future are viewed as preferable. (p. 617 of text; italics added)
• γ can also be seen as a characteristic of a task
• E.g., in flying a helicopter, distant future rewards may not be useful if we crash the helicopter and permanently damage it on the current trial! γ near 0: immediate gratification is best

Actions: Probabilistic
• Actions can have probabilistic effects
  – You don't always achieve the effect you are trying for
• The probability of reaching s′ from s depends only on s and the action applied in s
  – It does not depend on the previous history of the agent
  – This is a Markov decision process (MDP)
• MDP (Markov decision process) definition
  – Initial state: s_0
  – Transition model: T(s, a, s′)
    • The probability that, when you are in state s and apply action a, you will end up in state s′
    • Depends only on state s and action a
  – Reward function: R(s)
  – The agent may not know T or R beforehand

Equation 21.1 of Text
• Or how to make relatively simple things complex ;)
  U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) | π, s_0 = s ]
• Really just our previous goal, r_0 + γ r_1 + γ² r_2 + ⋯, with the E[·] notation

V*(s)
• We want to maximize the expected value in Equation 21.1
• I.e., select a policy (π(s) = a) that maximizes the value we can expect
• Call this maximized value V*(s):
  V*(s) = r_0 + γ r_1 + γ² r_2 + ⋯

Example - 1
• Assume a simple board game in which there is one location that is a "win"
• The agent can move from any non-win state to neighboring states
• Board:
    1 2 3
    4 5 6
• R(3) = 100; R(n) = 0 for all other n
• Actions: up, down, left, right
• Let γ = 0.9

Exercise
• Now, by hand, determine V*(s) for all states s:
  V*(s) = r_0 + γ r_1 + γ² r_2 + ⋯ = R(s) + γ R(s′) + γ² R(s″) + ⋯
• Hint: start at V*(3)
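To check a hand calculation, here is a small Python sketch (my own illustration, not from the text) that evaluates the discounted sum R(s) + γ R(s′) + γ² R(s″) + ⋯ along a chosen path of states, using the board's reward function and γ = 0.9.

```python
GAMMA = 0.9
R = {1: 0, 2: 0, 3: 100, 4: 0, 5: 0, 6: 0}  # R(3) = 100, 0 elsewhere

def discounted_return(path):
    """R(s0) + γ·R(s1) + γ²·R(s2) + ... along a sequence of states."""
    return sum((GAMMA ** t) * R[s] for t, s in enumerate(path))

# The best path from state 1 is 1 -> 2 -> 3:
print(discounted_return([1, 2, 3]))  # 0 + 0.9·0 + 0.81·100 = 81.0
```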
Solution
• V*(s) = r_0 + γ r_1 + γ² r_2 + ⋯ = R(s) + γ R(s′) + γ² R(s″) + ⋯
• With R(3) = 100, R(n) = 0 for all other n, and γ = 0.9:
  – V*(3) = R(3) = 100
  – V*(2) = R(2) + 0.9·R(3) = 0 + 0.9·100 = 90
  – V*(1) = R(1) + 0.9·R(2) + 0.9²·R(3) = 0 + 0.9·0 + 0.9²·100 = 81
  – V*(6) = R(6) + 0.9·R(3) = 0 + 0.9·100 = 90
  – V*(5) = R(5) + 0.9·R(2) + 0.9²·R(3) = 0 + 0.9·0 + 0.9²·100 = 81
  – V*(4) = R(4) + 0.9·R(1) + 0.9²·R(2) + 0.9³·R(3) = 72.9

Using V*
• The V* values on the board:
    81    90   100
    72.9  81    90
• V* can be used to determine which move is the best move
• Algorithm for using V* to pick the best next move:
  – Let s be our current state
  – Pick the action, a, such that V*(s′) is maximized
• V* can thus be used to determine π*

How can we compute V*(s)?
• It would be good to have a method that
  – incrementally takes actions, through a series of trials,
  – observes the new state, and
  – gets the reward for the new state,
• and uses these values to compute V*(s)

For example
• At the start of learning to fly the helicopter, we may not know what the reward value, R(s), for a state will be
  – E.g., if we put the helicopter into a forward pitch attitude, what R(s) value will result?
• We also don't know, at the start, with what probabilities one action taken in a state will lead to another state
  – I.e., we don't know T(s, a, s′)
  – With the helicopter: if we are flying ahead with 5 degrees of forward pitch, and we push the stick forward another 10 degrees, what will the next state be, and with what probability?

Q Function
• Define a new function, Q(s, a), which is closely related to V*(s):
  Q(s, a) = R(s) + γ V*(s′)
  where s′ is the state resulting from applying action a in state s, and V*(s) = r_0 + γ r_1 + γ² r_2 + ⋯
• Claims:
  A) If an agent learns the function Q, it can choose an optimal action
     – I.e., it will have implicitly learned V* and hence π*
  B) We can write a formulation of the function Q such that, if the agent learns approximations of Q, these approximations will converge to Q itself
     – Call these approximations Q̂

Q and V* are closely related
  V*(s) = max_a Q(s, a)
• To see this, recall that V*(s) is defined as the maximum value of r_0 + γ r_1 + γ² r_2 + ⋯
• So, if an action a that maximizes Q(s, a) is selected, then you are maximizing Q(s, a) = R(s) + γ V*(s′)
• r_0 = R(s), so max_a Q(s, a) = r_0 + γ r_1 + γ² r_2 + ⋯
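A tiny Python sketch (my own illustration) of this relationship: given a table of Q values, V*(s) is the max over actions and the optimal action is the argmax. The example uses the converged Q values for state 2 of the board game: Q(2, right) = 0.9·V*(3) = 90, while Q(2, left) and Q(2, down) are 0.9·81 = 72.9.

```python
def greedy_action(Q, s, actions):
    """Return (argmax_a Q(s, a), max_a Q(s, a)); the max value is V*(s)."""
    best = max(actions, key=lambda a: Q[(s, a)])
    return best, Q[(s, best)]

# Converged Q values for state 2 (only left/down/right exist on the top row):
Q = {(2, "left"): 72.9, (2, "down"): 72.9, (2, "right"): 90.0}
print(greedy_action(Q, 2, ["left", "down", "right"]))  # ('right', 90.0), so V*(2) = 90
```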
Note Further
• From our definition: Q(s, a) = R(s) + γ V*(s′)
• And from our proof: V*(s) = max_a Q(s, a)
• We now also have:
  Q(s, a) = R(s) + γ max_{a′} Q(s′, a′)

Claim A)
• If an agent learns the function Q, it can choose an optimal action; i.e., it will have implicitly learned π*
• We have shown that V*(s) = max_a Q(s, a)
• Remember that we were able to use V*(s) previously to instantiate an optimal policy
• So, if we have Q(s, a), we can select the action that maximizes Q(s, a); hence we have V*(s), and hence we have implicitly obtained π*

Learning Approximations of Q
• Let Q̂ denote the learner's current approximation to Q
• Consider the training rule:
  Q̂(s, a) ← R(s) + γ max_{a′} Q̂(s′, a′)

Q Learning for Deterministic Worlds
• For each (s, a), initialize a table entry Q̂(s, a) ← 0
• Initialize the current state, s
• Do forever (see the code sketch below):
  – Select an action, a, and take the action
  – Observe the new state, s′
  – Get the reward for the new state, R(s′)
  – Update the table entry for (s, a):
    Q̂(s, a) ← R(s) + γ max_{a′} Q̂(s′, a′)
  – s ← s′
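Here is a minimal, runnable Python sketch of this loop for the 6-cell board game (γ = 0.9, R(3) = 100). Random action selection stands in for an exploration strategy; the transition table and names are my own, and the win state is handled as in the worked example below (Q̂(3, a) = R(3) = 100).

```python
import random

GAMMA = 0.9
R = {1: 0, 2: 0, 3: 100, 4: 0, 5: 0, 6: 0}

# Deterministic transitions on the 2x3 board (1 2 3 / 4 5 6);
# state 3 is the absorbing "win" state, with no moves out of it.
MOVES = {
    1: {"right": 2, "down": 4},
    2: {"left": 1, "right": 3, "down": 5},
    4: {"up": 1, "right": 5},
    5: {"left": 4, "up": 2, "right": 6},
    6: {"left": 5, "up": 3},
}

# For each (s, a), initialize a table entry Q̂(s, a) = 0.
Q = {(s, a): 0.0 for s in MOVES for a in MOVES[s]}

def max_q(s):
    """max_a Q̂(s, a); for the win state, Q̂(3, a) = R(3) = 100."""
    return R[3] if s == 3 else max(Q[(s, a)] for a in MOVES[s])

for episode in range(500):
    s = random.choice([1, 2, 4, 5, 6])       # initialize the current state
    while s != 3:                            # an episode ends at the win state
        a = random.choice(list(MOVES[s]))    # explore: select a random action
        s_next = MOVES[s][a]                 # take the action; observe s'
        Q[(s, a)] = R[s] + GAMMA * max_q(s_next)   # the update rule above
        s = s_next                           # s <- s'

print(Q[(2, "right")], Q[(1, "right")], Q[(4, "up")])  # ~90.0, ~81.0, ~72.9
```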
Example of Computing Q̂(s, a)
• Use the earlier board game:
    1 2 3
    4 5 6
• R(3) = 100; R(n) = 0 for all other n; let γ = 0.9
• Actions: up, down, left, right
• Initially: Q̂(s, a) = 0
• Exercise: compute some initial values of Q̂(s, a): Q̂(3, a), Q̂(2, a), Q̂(6, a)

Solution - 1
• First, we initialize the table, and let s = 3 (picking a start state)
• Now, compute Q̂(s, a) for each action, a:

  Q̂(s, a)   up    down   left   right
     1        0     0      0      0
     2        0     0      0      0
     3        0     0      0      0
     4        0     0      0      0
     5        0     0      0      0
     6        0     0      0      0

Computing Q̂(3, a)
  Q̂(3, a) ← R(3) + γ max_{a′} Q̂(s′, a′)
• s′ is a next state that we can reach from s = 3
  – There are no next states reachable from s = 3
• So, Q̂(3, a) = R(3) = 100 for all values of a

Solution - 2
• Now we can compute Q̂(s, a) for the states neighboring s = 3: Q̂(2, a) = ?  Q̂(6, a) = ?

  Q̂(s, a)   up    down   left   right
     1        0     0      0      0
     2        0     0      0      0
     3      100   100    100    100
     4        0     0      0      0
     5        0     0      0      0
     6        0     0      0      0

Computing Q̂(2, a)
  Q̂(2, a) ← R(2) + γ max_{a′} Q̂(s′, a′)
• s′ is a next state that we can reach from s = 2 with a particular action a; i.e., s′ = 3 (a = right), s′ = 5 (a = down), or s′ = 1 (a = left)
• So, for each particular a, we need to compute max_{a′} Q̂(s′, a′)
• Right now, all of those Q̂(s′, a′) values except for s′ = 3 (a = right) are 0
• So:
  Q̂(2, right) ← R(2) + γ max_{a′} Q̂(3, a′) = R(2) + 0.9(100) = 90

Solution - 3
• Q̂(6, a) = ?

  Q̂(s, a)   up    down   left   right
     1        0     0      0      0
     2        0     0      0     90
     3      100   100    100    100
     4        0     0      0      0
     5        0     0      0      0
     6        0     0      0      0

Computing Q̂(6, a)
  Q̂(6, a) ← R(6) + γ max_{a′} Q̂(s′, a′)
• s′ is a next state that we can reach from s = 6 with a particular action a; i.e., s′ = 5 (a = left) or s′ = 3 (a = up)
• So, for each particular a, we need to compute max_{a′} Q̂(s′, a′)
• Right now, the s′ = 5 values of Q̂(s′, a′) are 0
• So:
  Q̂(6, up) ← R(6) + γ max_{a′} Q̂(3, a′) = R(6) + 0.9(100) = 90

Solution - 4
  Q̂(s, a)   up    down   left   right
     1        0     0      0      0
     2        0     0      0     90
     3      100   100    100    100
     4        0     0      0      0
     5        0     0      0      0
     6       90     0      0      0

Spreadsheet - 1
• Systematically compute Q estimator values
• Represent the table of Q(s, a) estimates as a row in a spreadsheet
• In each trial (line of the spreadsheet), just update one Q(s, a) estimate
  – I.e., an action is taken from a particular state
• http://www.cprince.com/courses/cs5541/lectures/RL/QLearning.xls
• http://www.cprince.com/courses/cs5541/lectures/RL/QLearning.pdf

Spreadsheet - 2
• Trials
  – Start in state 4
  – Used three of the various possible trial sequences, each ending at state 3
  – Different trials are indicated with red, green, and blue
• At the start of the third iteration through the trials, the entries in the Q(s, a) estimator table have converged to the V* values:
    81    90   100
    72.9  81    90

Action Selection
• What if you guide your action selections by your initial estimates of Q(s, a)?
• We will need to program into the algorithm a preference to fill in parts of the Q(s, a) estimator table that have not yet been filled in
• I.e., a preference to explore
  – Otherwise, if we act on just the initial 0 values of the Q(s, a) estimators, we could end up going in cycles
  – E.g., 1 -> 2 -> 5 -> 4
• We may need to keep a frequency count of the number of times we have performed each action from a state, and prefer actions tried less frequently

Claim B)
• Convergence of Q̂ to Q
  – See slides 14 and 15 of R. Maclin's notes:
  – http://www.d.umn.edu/~rmaclin/cs8751/Notes/L12_Reinforcement_Learning.pdf

Non-Deterministic Case - 1
• So far we've been looking at estimating Q when the next state, s′, is fully determined by s and a
• But we started off by talking about Markov decision processes
  – where a given action, a, taken from a state, s, only probabilistically leads to different new states, s′

Non-Deterministic Case - 2
• Come back to Equation 21.1 of the text:
  U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) | π, s_0 = s ]
• We've been using the notation V; the text uses the notation U (utility)
• The expected value, E[·], allows us to consider random variables
• I.e., the effect of actions on states has an element of randomness

Non-Deterministic Case - 3
• Also define Q(s, a) as an expected value:
  Q(s, a) = E[ R(s) + γ V*(s′) ]
• Alter the rule for the Q(s, a) estimator update (see the sketch below):
  Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ R(s) + γ max_{a′} Q̂_{n−1}(s′, a′) ]
  where α_n = 1 / (1 + visits_n(s, a))
• Convergence of the Q̂_n(s, a) estimator to Q can still be proven (Watkins & Dayan, 1992)
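A minimal Python sketch (my own illustration) of this non-deterministic update: the decaying learning rate α_n = 1/(1 + visits_n(s, a)) averages over the randomness in s′, rather than overwriting the estimate as in the deterministic rule. The table layout and function name are hypothetical.

```python
from collections import defaultdict

GAMMA = 0.9
Q = defaultdict(float)     # Q̂ estimates, keyed by (state, action); default 0
visits = defaultdict(int)  # visit counts for each (state, action)

def nondeterministic_update(s, a, reward, s_next, next_actions):
    """Q̂_n(s,a) <- (1 - α_n)·Q̂_{n-1}(s,a)
                   + α_n·[R(s) + γ·max_{a'} Q̂_{n-1}(s',a')],
    with learning rate α_n = 1 / (1 + visits_n(s, a))."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (reward + GAMMA * best_next)

# Example: from state 2, action 'right' happened to land in state 3 this time.
Q[(3, "up")] = 100.0  # suppose the estimate at the win state is already 100
nondeterministic_update(2, "right", 0.0, 3, ["up"])
print(Q[(2, "right")])  # α_1 = 1/2, so 0.5 * (0 + 0.9 * 100) = 45.0
```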