
Reinforcement Learning
Yishay Mansour
Tel-Aviv University

Outline
• Goal of Reinforcement Learning
• Mathematical Model (MDP)
• Planning

Goal of Reinforcement Learning
Goal-oriented learning through interaction.
Control of large-scale stochastic environments with partial knowledge.
Supervised / Unsupervised Learning: learn from labeled / unlabeled examples.

Reinforcement Learning - origins
• Artificial Intelligence
• Control Theory
• Operations Research
• Cognitive Science & Psychology
Solid foundations; well-established research.

Typical Applications
• Robotics
  – Elevator control [CB].
  – Robo-soccer [SV].
• Board games
  – Backgammon [T].
  – Checkers [S].
  – Chess [B].
• Scheduling
  – Dynamic channel allocation [SB].
  – Inventory problems.

Contrast with Supervised Learning
The system has a “state”.
The algorithm influences the state distribution.
Inherent tradeoff: exploration versus exploitation.

Mathematical Model - Motivation
Model of uncertainty: environment, actions, our knowledge.
Focus on decision making.
Maximize long-term reward.
Markov Decision Process (MDP)

Mathematical Model - MDP
Markov decision processes:
$S$ - set of states
$A$ - set of actions
$\delta$ - transition probability
$R$ - reward function
Similar to DFA!

MDP model - states and actions
Environment = states.
Actions = transitions $\delta(s, a, s')$.
[Figure: action a taken in a state, with transition probabilities 0.7 and 0.3.]

MDP model - rewards
$R(s,a)$ = reward at state s for doing action a (a random variable).
Example: $R(s,a)$ =
  -1 with probability 0.5
  +10 with probability 0.35
  +20 with probability 0.15

MDP model - trajectories
Trajectory: $s_0\ a_0\ r_0\ s_1\ a_1\ r_1\ s_2\ a_2\ r_2$

Simple example: N-armed bandit
Single state s with actions $a_1, a_2, a_3$.
Goal: maximize the sum of immediate rewards.
Given the model: greedy action.
Difficulty: unknown model.

MDP - Return function
Combining all the immediate rewards into a single value.
Modeling issues:
• Are early rewards more valuable than later rewards?
• Is the system “terminating” or continuous?
Usually the return is linear in the immediate rewards.
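
To make "linear in the immediate rewards" concrete, here is a minimal Python sketch in which a return is a weighted sum of the rewards along a trajectory. The function names, the sample reward sequence, and the truncation of the discounted sum to a finite recorded prefix are assumptions made for illustration; they are not part of the slides.

```python
def weighted_return(rewards, weights):
    """A return that is linear in the rewards: sum_i weights[i] * rewards[i]."""
    return sum(w * r for w, r in zip(weights, rewards))


def finite_horizon_return(rewards, horizon):
    # Weight 1 on the first `horizon` rewards, weight 0 on the rest.
    weights = [1 if i < horizon else 0 for i in range(len(rewards))]
    return weighted_return(rewards, weights)


def discounted_return(rewards, gamma):
    # Weight gamma**i on the i-th reward, so early rewards count more.
    # Assumption: we only sum over the rewards actually recorded.
    weights = [gamma ** i for i in range(len(rewards))]
    return weighted_return(rewards, weights)


# Hypothetical reward sequence r0, r1, r2, r3 from one trajectory.
rewards = [-1, 10, 20, 10]
print(finite_horizon_return(rewards, 2))
print(discounted_return(rewards, 0.5))
```

The two return types differ only in the weights, which is one way to read the modeling question of whether early rewards should be worth more than later ones.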
MDP model - return functions
Finite Horizon - parameter H:
  $\text{return} = \sum_{1 \le i \le H} R(s_i, a_i)$
Infinite Horizon:
  discounted - parameter $\gamma < 1$:
  $\text{return} = \sum_{i=0}^{\infty} \gamma^i R(s_i, a_i)$
  undiscounted:
  $\frac{1}{N} \sum_{i=0}^{N-1} R(s_i, a_i) \xrightarrow{N \to \infty} \text{return}$
Terminating MDP

MDP model - action selection
AIM: Maximize the expected return.
This talk: discounted return.
Fully Observable - can “see” the “exact” state.
Policy - mapping from states to actions.
Optimal policy: optimal from any start state.
THEOREM: There exists a deterministic optimal policy.

MDP model - summary
$s \in S$ - set of states, $|S| = n$.
$a \in A$ - set of k actions, $|A| = k$.
$\delta(s_1, a, s_2)$ - transition function.
$R(s,a)$ - immediate reward function.
$\pi : S \to A$ - policy.
$\sum_{i=0}^{\infty} \gamma^i r_i$ - discounted cumulative return.

Relations to Board Games
• state = current board
• action = what we can play.
• opponent action = part of the environment
• Hidden assumption: Game is Markovian

Contrast with Supervised Learning
Supervised Learning: fixed distribution on examples.
Reinforcement Learning: the state distribution is policy dependent!
A small local change in the policy can make a huge global change in the return.

Planning - Basic Problems
Given a complete MDP model.
Policy evaluation - given a policy $\pi$, estimate its return.
Optimal control - find an optimal policy $\pi^*$ (maximizes the return from any start state).

Planning - Value Functions
$V^{\pi}(s)$: the expected return starting at state s and following $\pi$.
$Q^{\pi}(s,a)$: the expected return starting at state s with action a and then following $\pi$.
$V^*(s)$ and $Q^*(s,a)$ are defined using an optimal policy $\pi^*$.
$V^*(s) = \max_{\pi} V^{\pi}(s)$

Planning - Policy Evaluation
Discounted infinite horizon (Bellman Eq.):
  $V^{\pi}(s) = E_{s' \sim \pi(s)} [ R(s, \pi(s)) + \gamma V^{\pi}(s') ]$
Rewrite the expectation:
  $V^{\pi}(s) = E[R(s, \pi(s))] + \gamma \sum_{s'} \delta(s, \pi(s), s') V^{\pi}(s')$
Linear system of equations.
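
The remark that policy evaluation is a linear system can be sketched in code. The Python below builds $(I - \gamma P^{\pi}) V^{\pi} = R^{\pi}$ and solves it exactly with fractions. It is a sketch under stated assumptions, not the lecturer's implementation: the helper names (`solve_linear`, `evaluate_policy`) are hypothetical, and the test MDP is our reading of the four-state example on the example slides (states s0..s3 on a cycle, action a moving from $s_i$ to $s_{i+a}$ with indices modulo 4, reward i in state $s_i$, uniformly random policy, $\gamma = 1/2$).

```python
from fractions import Fraction


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        # Find a row with a non-zero entry in this column and swap it up.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [x / pv for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def evaluate_policy(n, actions, policy, delta, reward, gamma):
    """Return V^pi for states 0..n-1.

    policy(s, a) -> probability of action a in state s
    delta(s, a)  -> dict {next_state: probability}
    reward(s, a) -> expected immediate reward
    """
    A = [[Fraction(int(s == t)) for t in range(n)] for s in range(n)]
    b = [Fraction(0)] * n
    for s in range(n):
        for a in actions:
            p_a = policy(s, a)
            b[s] += p_a * reward(s, a)
            for t, p_t in delta(s, a).items():
                A[s][t] -= gamma * p_a * p_t
    return solve_linear(A, b)


# Assumed reading of the four-state cycle example.
N = 4
ACTIONS = (+1, -1)
half = Fraction(1, 2)
V = evaluate_policy(
    N,
    ACTIONS,
    policy=lambda s, a: half,                       # random policy
    delta=lambda s, a: {(s + a) % N: Fraction(1)},  # deterministic move
    reward=lambda s, a: Fraction(s),                # R(s_i, a) = i
    gamma=half,
)
q0_plus = Fraction(0) + half * V[1]  # Q(s0, +1) = R(s0, +1) + gamma * V(s1)
print(V, q0_plus)
```

Exact fractions are used here only so the result can be compared directly with values written as fractions; for larger state spaces a numerical linear solver or iterative evaluation would be the more usual choice.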
Algorithms - Policy Evaluation Example
[Figure: four states s0, s1, s2, s3 arranged in a cycle.]
$A = \{+1, -1\}$, $\gamma = 1/2$
$\delta(s_i, a) = s_{i+a}$ (indices taken modulo 4)
$\pi$ random
$\forall a: R(s_i, a) = i$
$V^{\pi}(s_0) = 0 + \gamma [ \pi(s_0, +1) V^{\pi}(s_1) + \pi(s_0, -1) V^{\pi}(s_3) ]$

Algorithms - Policy Evaluation Example
Same MDP and random policy:
$V^{\pi}(s_0) = 5/3$
$V^{\pi}(s_1) = 7/3$
$V^{\pi}(s_2) = 11/3$
$V^{\pi}(s_3) = 13/3$
$V^{\pi}(s_0) = 0 + ( V^{\pi}(s_1) + V^{\pi}(s_3) ) / 4$

Algorithms - optimal control
State-action value function:
  $Q^{\pi}(s,a) = E[R(s,a)] + \gamma E_{s' \sim \delta(s,a,\cdot)} [ V^{\pi}(s') ]$
Note: $V^{\pi}(s) = Q^{\pi}(s, \pi(s))$ for a deterministic policy $\pi$.

Algorithms - optimal control Example
Same MDP and random policy:
$Q^{\pi}(s_0, +1) = 7/6$
$Q^{\pi}(s_0, -1) = 13/6$
$Q^{\pi}(s_0, +1) = 0 + \gamma V^{\pi}(s_1)$

Algorithms - optimal control
CLAIM: A policy $\pi$ is optimal if and only if at each state s:
  $V^{\pi}(s) = \max_a \{ Q^{\pi}(s,a) \}$ (Bellman Eq.)
PROOF: Assume there is a state s and an action a such that $V^{\pi}(s) < Q^{\pi}(s,a)$. Then the strategy of performing a at state s (the first time) is better than $\pi$. This is true each time we visit s, so the policy that performs action a at state s is better than $\pi$. ∎

Algorithms - optimal control Example
Same MDP: changing the policy using the state-action value function.

Algorithms - optimal control
The greedy policy with respect to $Q^{\pi}(s,a)$ is
  $\pi(s) = \arg\max_a \{ Q^{\pi}(s,a) \}$
The $\epsilon$-greedy policy with respect to $Q^{\pi}(s,a)$ is
  $\pi(s) = \arg\max_a \{ Q^{\pi}(s,a) \}$ with probability $1 - \epsilon$, and
  $\pi(s)$ = random action with probability $\epsilon$.

MDP - computing optimal policy
1. Linear Programming
2. Value Iteration method:
  $V^{i+1}(s) = \max_a \{ R(s,a) + \gamma \sum_{s'} \delta(s, a, s') V^{i}(s') \}$
3. Policy Iteration method:
  $\pi_i(s) = \arg\max_a \{ Q^{\pi_{i-1}}(s,a) \}$

Convergence
• Value Iteration
  – Each iteration reduces the distance from optimal
  – The distance drops by a $1 - \gamma$ fraction (it is multiplied by at most $\gamma$)
• Policy Iteration
  – Policy only improves

Relations to Board Games
• state = current board
• action = what we can play.
• opponent action = part of the environment
• value function = likelihood of winning
• Q-function = modified policy.
• Hidden assumption: Game is Markovian

Planning versus Learning
Tightly coupled in Reinforcement Learning.
Goal: maximize return while learning.

Example - Elevator Control
Learning (alone): learn the arrival model well.
Planning (alone): given the arrival model, build a schedule.
Real objective: construct a schedule while updating the model.

Partially Observable MDP
Rather than observing the state, we observe some function of the state.
Ob - observation function: a random variable for each state.
Example: (1) Ob(s) = s + noise. (2) Ob(s) = first bit of s.
Problem: different states may “look” similar.
The optimal strategy is history dependent!

POMDP - Belief State Algorithm
Given a history of actions and observed values, we compute a posterior distribution over the state we are in (the belief state).
The belief-state MDP:
• States: distributions over S (the states of the POMDP).
• Actions: as in the POMDP.
• Transition: the posterior distribution (given the observation).
We can perform the planning and learning on the belief-state MDP.

POMDP - Hard computational problems
• Deterministic POMDP: computing an undiscounted optimal strategy is PSPACE-hard for an infinite horizon and NP-complete for a polynomial horizon [PT,L].
• Stochastic POMDP: computing an undiscounted optimal strategy is EXPTIME-hard for an infinite horizon and PSPACE-complete for a polynomial horizon [PT,L].
• MDP: computing an infinite (polynomial) horizon undiscounted optimal policy is P-complete [PT].

Resources
• Reinforcement Learning (an introduction) [Sutton & Barto]
• Markov Decision Processes [Puterman]
• Dynamic Programming and Optimal Control [Bertsekas]
• Neuro-Dynamic Programming [Bertsekas & Tsitsiklis]
• Ph.D. thesis - Michael Littman


posted: | 3/28/2012 |

pages: | 37 |
