					Reinforcement Learning

     Yishay Mansour
    Tel-Aviv University
              Outline
• Goal of Reinforcement Learning
• Mathematical Model (MDP)
• Planning




                                   2
Goal of Reinforcement Learning
Goal oriented learning through interaction

Control of large scale stochastic environments with
    partial knowledge.



 Supervised / Unsupervised Learning
       Learn from labeled / unlabeled examples

                                                      3
Reinforcement Learning - origins
 Artificial Intelligence

 Control Theory

 Operations Research

 Cognitive Science & Psychology


 Solid foundations; well established research.
                                                 4
             Typical Applications
• Robotics
  – Elevator control [CB].
  – Robo-soccer [SV].
• Board games
  – Backgammon [T].
  – Checkers [S].
  – Chess [B].
• Scheduling
  – Dynamic channel allocation [SB].
  – Inventory problems.
                                       5
 Contrast with Supervised Learning

The system has a “state”.


The algorithm influences the state distribution.

Inherent Tradeoff: Exploration versus Exploitation.



                                                      6
Mathematical Model - Motivation

 Model of uncertainty:
    Environment, actions, our knowledge.

 Focus on decision making.

Maximize long term reward.

Markov Decision Process (MDP)
                                           7
Mathematical Model - MDP

 Markov decision processes

 S - set of states

 A - set of actions

 δ - transition probability

 R - reward function          Similar to DFA!
                                           8
MDP model - states and actions
   Environment = states
   Actions = transitions  δ(s, a, s')

   [Diagram: from a state, action a leads to one successor state with
    probability 0.7 and to another with probability 0.3.]
                                               9
MDP model - rewards

                              R(s,a) = reward at state s
                                       for doing action a
                                       (a random variable).




Example:
R(s,a) =  -1  with probability 0.5
         +10  with probability 0.35
         +20  with probability 0.15
                                                            10
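As a quick worked check, the expected immediate reward in this example is
E[R(s,a)] = 0.5·(-1) + 0.35·10 + 0.15·20 = 6.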
MDP model - trajectories




    trajectory:
    s0 a0   r0   s1 a1   r1   s2 a2   r2
                                           11
      MDP - Return function.

 Combining all the immediate rewards to a single value.

 Modeling Issues:

 Are early rewards more valuable than later rewards?

 Is the system “terminating” or continuous?

Usually the return is linear in the immediate rewards.
                                                   12
MDP model - return functions
Finite Horizon - parameter H:

     return = Σ_{i=1}^{H} R(s_i, a_i)


Infinite Horizon:

   discounted - parameter γ < 1:     return = Σ_{i=0}^{∞} γ^i R(s_i, a_i)

   undiscounted:                     return = lim_{N→∞} (1/N) Σ_{i=0}^{N-1} R(s_i, a_i)


 Terminating MDP
                                                                        13
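As a small illustration (not part of the original slides), here is a minimal Python sketch that sums γ^i·r_i over a finite prefix of a trajectory of rewards:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * r_i over a (finite prefix of a) trajectory."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# e.g. rewards 0, 1, 2 with gamma = 1/2 give 0 + 0.5 + 0.5 = 1.0
print(discounted_return([0, 1, 2], 0.5))
```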
MDP model - action selection
 AIM: Maximize the expected return.



Fully Observable - can “see” the “entire” state.

            Policy - mapping from states to actions


Optimal policy: optimal from any start state.

 THEOREM: There exists a deterministic optimal policy
                                                        14
Contrast with Supervised Learning

Supervised Learning:
Fixed distribution on examples.


Reinforcement Learning:
The state distribution is policy dependent!!!

A small local change in the policy can make a huge
global change in the return.

                                                     15
MDP model - summary

   s ∈ S               - set of states, |S| = n.
   a ∈ A               - set of k actions, |A| = k.
   δ(s1, a, s2)        - transition function.
   R(s,a)              - immediate reward function.
   π : S → A           - policy.
   Σ_{i=0}^{∞} γ^i r_i - discounted cumulative return.
                                                     16
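To make the later algorithm sketches concrete, one convenient (purely illustrative) way to store such a tabular MDP in Python/NumPy is a pair of arrays; the names P, R, gamma below are our own convention, not from the slides:

```python
import numpy as np

# P[s, a, s2] = delta(s, a, s2)  -- shape (n, k, n), each P[s, a] sums to 1
# R[s, a]     = expected immediate reward E[R(s, a)]  -- shape (n, k)
n, k, gamma = 4, 2, 0.5
P = np.zeros((n, k, n))
R = np.zeros((n, k))
```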
Simple example: N-armed bandit

Single state s with actions a1, a2, a3, ...

Goal: maximize the sum of immediate rewards.

Given the model:   greedy action.
Difficulty:        the model is unknown.
                                              17
   N-Armed Bandit: Highlights
• Algorithms (near greedy):
  – Exponential weights
     • G_i = sum of rewards of action a_i
     • w_i = e^{G_i}
  – Follow the leader
• Results:
  – For any sequence of T rewards:
    E[online] > max_i {G_i} - sqrt{T log N}
                                             18
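A rough sketch of the exponential-weights idea, assuming for simplicity (unlike the true bandit setting) that the rewards of all actions are observed each round; the learning-rate parameter eta is our own addition:

```python
import math
import random

def exp_weights(reward_matrix, eta=0.1):
    """Play T rounds; reward_matrix[t][i] is the reward of action i at round t."""
    N = len(reward_matrix[0])
    G = [0.0] * N                                   # cumulative reward per action
    total = 0.0
    for rewards in reward_matrix:
        w = [math.exp(eta * g) for g in G]          # exponential weights
        i = random.choices(range(N), weights=w)[0]  # sample an action
        total += rewards[i]
        G = [g + r for g, r in zip(G, rewards)]     # full-information update
    return total, G
```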
       Planning - Basic Problems.

        Given a complete MDP model.


Policy evaluation - Given a policy π, estimate its return.

Optimal control -    Find an optimal policy π* (maximizes
                     the return from any start state).


                                                        19
  Planning - Value Functions

V^π(s):   the expected return starting at state s and following π.


Q^π(s,a): the expected return starting at state s, taking
          action a, and then following π.


 V*(s) and Q*(s,a) are defined using an optimal policy π*:

                  V*(s) = max_π V^π(s)
                                                            20
 Planning - Policy Evaluation
  Discounted infinite horizon (Bellman Eq.):

  V^π(s) = E_{s' ~ δ(s, π(s))} [ R(s, π(s)) + γ V^π(s') ]

Rewriting the expectation:

  V^π(s) = E[ R(s, π(s)) ] + γ Σ_{s'} δ(s, π(s), s') V^π(s')


      A linear system of equations in the unknowns V^π(s).
                                                                  21
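Since policy evaluation is a linear system, it can be solved directly. A minimal sketch using the P, R array convention introduced above, assuming a deterministic policy given as an array of action indices:

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma):
    """Solve (I - gamma * P_pi) V = R_pi for V^pi."""
    n = P.shape[0]
    P_pi = P[np.arange(n), policy]      # (n, n) transition matrix under pi
    R_pi = R[np.arange(n), policy]      # (n,) expected rewards under pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```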
      Algorithms - Policy Evaluation
                                Example

 Four states s0, s1, s2, s3 arranged in a ring.
 A = {+1, -1}            δ(s_i, a) = s_{i+a}  (indices mod 4)
 γ = 1/2                 π = random (each action with probability 1/2)
 ∀a: R(s_i, a) = i

                  V^π(s0) = 0 + γ [ π(s0,+1) V^π(s1) + π(s0,-1) V^π(s3) ]
                                                                       22
        Algorithms - Policy Evaluation
                            Example

 Same ring MDP:                                  V^π(s0) = 5/3
 A = {+1, -1},  γ = 1/2,                         V^π(s1) = 7/3
 δ(s_i, a) = s_{i+a},  π = random,               V^π(s2) = 11/3
 ∀a: R(s_i, a) = i                               V^π(s3) = 13/3

                  V^π(s0) = 0 + ( V^π(s1) + V^π(s3) ) / 4
                                                                 23
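A quick numerical check of the ring example (the array setup is ours; the random policy is folded directly into the transition matrix):

```python
import numpy as np

# 4 states in a ring, random policy, gamma = 1/2, R(s_i, a) = i.
n, gamma = 4, 0.5
P_pi = np.zeros((n, n))
for i in range(n):
    P_pi[i, (i + 1) % n] = 0.5    # action +1 with probability 1/2
    P_pi[i, (i - 1) % n] = 0.5    # action -1 with probability 1/2
R_pi = np.arange(n, dtype=float)  # reward i in state s_i
V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
print(V)                          # approximately [5/3, 7/3, 11/3, 13/3]
```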
Algorithms - optimal control
State-action value function:


         Q^π(s,a) = E[ R(s,a) ] + γ E_{s' ~ δ(s,a)} [ V^π(s') ]

Note that for a deterministic policy π:

         V^π(s) = Q^π(s, π(s))
                                                         24
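Given V^π, the state-action values follow in one line under the same (illustrative) array convention:

```python
import numpy as np

def q_from_v(P, R, V, gamma):
    """Q^pi(s, a) = R(s, a) + gamma * sum_s' delta(s, a, s') * V^pi(s')."""
    return R + gamma * P @ V      # shape (n, k)
```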
        Algorithms - Optimal control
                          Example

 Same ring MDP:                               Q^π(s0,+1) = 7/6
 A = {+1, -1},  γ = 1/2,                      Q^π(s0,-1) = 13/6
 δ(s_i, a) = s_{i+a},  π = random,
 R(s_i, a) = i

                Q^π(s0,+1) = 0 + γ V^π(s1) = (1/2)(7/3) = 7/6
                                                                 25
  Algorithms - optimal control
CLAIM: A policy π is optimal if and only if at each state s:

         V^π(s) = max_a {Q^π(s,a)}               (Bellman Eq.)

PROOF: Assume there is a state s and an action a s.t.

        V^π(s) < Q^π(s,a).
Then the strategy that performs a at state s (the first time)
and then follows π is better than π.
This is true each time we visit s, so the policy that always
performs action a at state s is better than π.                   26
         Algorithms - optimal control
                          Example

 Same ring MDP:  A = {+1, -1},  γ = 1/2,
 δ(s_i, a) = s_{i+a},  π = random,  R(s_i, a) = i

Changing the policy using the state-action value function:
since Q^π(s0,-1) = 13/6 > Q^π(s0,+1) = 7/6, the improved
policy plays action -1 at s0.
                                                      27
   Algorithms - optimal control

The greedy policy with respect to Q^π(s,a) is

   π(s) = argmax_a {Q^π(s,a)}

The ε-greedy policy with respect to Q^π(s,a) is

    π(s) = argmax_a {Q^π(s,a)}  with probability 1-ε, and

    π(s) = a random action      with probability ε
                                                   28
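A minimal sketch of the ε-greedy choice for a single state, given its Q-values as a list (the function and argument names are our own):

```python
import random

def epsilon_greedy(Q_s, epsilon):
    """With probability 1-epsilon take the greedy action, else a uniform random one."""
    if random.random() < epsilon:
        return random.randrange(len(Q_s))
    return max(range(len(Q_s)), key=lambda a: Q_s[a])
```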
MDP - computing optimal policy

1. Linear Programming

2. Value Iteration method:

   V^{i+1}(s) = max_a { R(s,a) + γ Σ_{s'} δ(s,a,s') V^i(s') }

3. Policy Iteration method:

   π_i(s) = argmax_a { Q^{π_{i-1}}(s,a) }
                                                                29
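A minimal sketch of Value Iteration under the same array convention as above; the stopping tolerance is our own addition:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V <- max_a [ R(s,a) + gamma * sum_s' delta(s,a,s') V(s') ]."""
    n = P.shape[0]
    V = np.zeros(n)
    while True:
        Q = R + gamma * P @ V                  # shape (n, k)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # values and a greedy policy
        V = V_new
```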
                Convergence
• Value Iteration
  – The distance from the optimum, max_s {V*(s) – V^t(s)}, shrinks each iteration.
• Policy Iteration
  – The policy can only improve:
     ∀s: V^{t+1}(s) ≥ V^t(s)
  – Fewer iterations than Value Iteration, but
    each iteration is more expensive.

                                                30
       Relations to Board Games
•   state = current board
•   action = what we can play.
•   opponent action = part of the environment
•   value function = probability of winning
•   Q- function = modified policy.
•   Hidden assumption: Game is Markovian

                                                31
  Planning versus Learning


Tightly coupled in Reinforcement Learning


Goal: maximize return while learning.



                                        32
Example - Elevator Control
           Learning (alone):
              Learn a good model of the arrival process.

           Planning (alone):
              Given an arrival model, build a schedule.


            Real objective: construct a
            schedule while updating the model.



                                            33
  Partially Observable MDP
Rather than observing the state we observe some
function of the state.
Ob - observation function:
    a random variable for each state.

Example: (1) Ob(s) = s+noise. (2) Ob(s) = first bit of s.

Problem: different states may “look” similar.

The optimal strategy is history dependent !

                                                            34
 POMDP - Belief State Algorithm
Given a history of actions and observation values,
we compute a posterior distribution over the state
we are in (the belief state).


The belief-state MDP:
States: distributions over S (the states of the POMDP).
Actions: as in the POMDP.
Transitions: the posterior distribution (given the action and observation).


We can perform the planning and learning on the belief-state MDP.
                                                                 35
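A sketch of one belief-state update for a tabular POMDP (the array names and the observation model O are our own notation, not from the slides):

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """Bayes update of the belief b after taking action a and observing o.

    b[s] - current belief; P[s, a, s2] - transitions; O[s2, o] - observation probs."""
    predicted = b @ P[:, a, :]           # predictive distribution over next states
    unnormalized = predicted * O[:, o]   # weight by observation likelihood
    return unnormalized / unnormalized.sum()
```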
POMDP - hard computational problems

Computing an infinite (polynomial) horizon undiscounted
optimal strategy for a deterministic POMDP is PSPACE-
hard (NP-complete) [PT,L].

Computing an infinite (polynomial) horizon undiscounted
optimal strategy for a stochastic POMDP is EXPTIME-
hard (PSPACE-complete) [PT,L].

 Computing an infinite (polynomial) horizon
 undiscounted optimal policy for an MDP is
 P-complete [PT] .
                                                          36
              Resources
• Reinforcement Learning (an introduction)
  [Sutton & Barto]
• Markov Decision Processes [Puterman]
• Dynamic Programming and Optimal
  Control [Bertsekas]
• Neuro-Dynamic Programming [Bertsekas &
  Tsitsiklis]
• Ph. D. thesis - Michael Littman
                                         37

				