Reinforcement Learning (PowerPoint)

					Reinforcement Learning

     Yishay Mansour
    Tel-Aviv University
• Goal of Reinforcement Learning
• Mathematical Model (MDP)
• Planning
Goal of Reinforcement Learning
Goal-oriented learning through interaction.

Control of large-scale stochastic environments with
    partial knowledge.

 Supervised / Unsupervised Learning
       Learn from labeled / unlabeled examples
Reinforcement Learning - origins
 Artificial Intelligence

 Control Theory

 Operations Research

 Cognitive Science & Psychology

 Solid foundations; well established research.
             Typical Applications
• Robotics
  – Elevator control [CB].
  – Robo-soccer [SV].
• Board games
  – Backgammon [T].
  – Checkers [S].
  – Chess [B].
• Scheduling
  – Dynamic channel allocation [SB].
  – Inventory problems.
 Contrast with Supervised Learning

The system has a “state”.

The algorithm influences the state distribution.

Inherent Tradeoff: Exploration versus Exploitation.
Mathematical Model - Motivation

 Model of uncertainty:
    Environment, actions, our knowledge.

 Focus on decision making.

Maximize long term reward.

Markov Decision Process (MDP)
Mathematical Model - MDP

 Markov decision processes

 S - set of states

 A - set of actions

 δ - transition probability

 R - reward function

 Similar to a DFA!
MDP model - states and actions
   Environment = states

   Actions = transitions, δ(s, a, s')

   [Figure: state-transition diagram; one edge is labeled "action a" with probability 0.3]
MDP model - rewards

R(s,a) = reward at state s for doing action a (a random variable).

Example:
R(s,a) = -1  with probability 0.5
         +10 with probability 0.35
         +20 with probability 0.15
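
As a small illustration of treating R(s,a) as a random variable, the sketch below computes the expected value of the example distribution and draws one sample from it. The names `REWARD_DIST`, `expected_reward` and `sample_reward` are made up for this example and are not part of the slides.

```python
from fractions import Fraction
import random

# Reward distribution from the example: (value, probability) pairs.
REWARD_DIST = [(-1, Fraction(1, 2)), (10, Fraction(7, 20)), (20, Fraction(3, 20))]

def expected_reward(dist):
    """Mean of a discrete reward distribution."""
    return sum(value * prob for value, prob in dist)

def sample_reward(dist, rng):
    """Draw one reward according to the distribution."""
    values = [value for value, _ in dist]
    weights = [float(prob) for _, prob in dist]
    return rng.choices(values, weights=weights, k=1)[0]

mean = expected_reward(REWARD_DIST)  # -0.5 + 3.5 + 3.0 = 6
one_draw = sample_reward(REWARD_DIST, random.Random(0))
```

A planner that maximizes expected return only needs the mean (here 6); a learner sees individual draws such as `one_draw`.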
MDP model - trajectories

    s0 a0   r0   s1 a1   r1   s2 a2   r2
Simple example: N-armed bandit

Single state s.

Goal: Maximize the sum of immediate rewards.

Given the model: take the greedy action.

Difficulty: the model is unknown.

[Figure: a single state s with one arc per action; the labels a2 and a3 are visible]
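
A minimal sketch of the two cases, assuming a hypothetical 3-armed bandit with Bernoulli rewards (the numbers in `TRUE_MEANS` are invented for illustration). With the model given, the greedy action is just the arm with the largest mean. With the model unknown, one naive option is to estimate the means by sampling each arm first, which is where the exploration cost comes from.

```python
import random

def greedy_arm(means):
    """Index of the arm with the highest (known or estimated) mean reward."""
    return max(range(len(means)), key=lambda a: means[a])

def estimate_means(pull, n_arms, pulls_per_arm):
    """Estimate each arm's mean by pulling it a fixed number of times."""
    return [sum(pull(a) for _ in range(pulls_per_arm)) / pulls_per_arm
            for a in range(n_arms)]

# Hypothetical bandit: arm a pays 1 with probability TRUE_MEANS[a], else 0.
TRUE_MEANS = [0.2, 0.5, 0.8]
rng = random.Random(0)

def pull(arm):
    return 1.0 if rng.random() < TRUE_MEANS[arm] else 0.0

best_known = greedy_arm(TRUE_MEANS)                         # model given
best_estimated = greedy_arm(estimate_means(pull, 3, 2000))  # model unknown
```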
      MDP - Return function.

 Combining all the immediate rewards into a single value.

 Modeling Issues:

 Are early rewards more valuable than later rewards?

 Is the system “terminating” or continuous?

Usually the return is linear in the immediate rewards.
MDP model - return functions
Finite Horizon - parameter H
     return = $\sum_{1 \le i \le H} R(s_i, a_i)$

Infinite Horizon

   discounted - parameter γ < 1
     return = $\sum_{i=0}^{\infty} \gamma^i R(s_i, a_i)$

   undiscounted
     $\frac{1}{N} \sum_{i=0}^{N-1} R(s_i, a_i) \to$ return  (as $N \to \infty$)

 Terminating MDP
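
The return functions can be sketched on a finite list of observed rewards. This is only an illustration: the function names are made up here, and for the infinite-horizon cases a finite reward list gives a truncated sum or a finite-N average rather than the limit.

```python
def finite_horizon_return(rewards, horizon):
    """Sum of the first `horizon` immediate rewards."""
    return sum(rewards[:horizon])

def discounted_return(rewards, gamma):
    """Sum of gamma**i * r_i over the observed rewards (truncated infinite sum)."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def average_return(rewards):
    """Average reward over the observed steps (finite-N version of the limit)."""
    return sum(rewards) / len(rewards)

rewards = [1, 0, 2, 4]
h3 = finite_horizon_return(rewards, 3)   # 1 + 0 + 2
disc = discounted_return(rewards, 0.5)   # 1 + 0 + 0.5 + 0.5
avg = average_return(rewards)            # 7 / 4
```

All three are linear in the immediate rewards; they differ only in the weight given to each step.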
MDP model - action selection
 AIM: Maximize the expected return.
 This talk: discounted return

Fully Observable - can “see” the “exact” state.

             Policy - mapping from states to actions

Optimal policy: optimal from any start state.

 THEOREM: There exists a deterministic optimal policy
MDP model - summary

 s ∈ S          - set of states, |S| = n.
 a ∈ A          - set of actions, |A| = k.
 δ(s1, a, s2)   - transition function.
 R(s,a)         - immediate reward function.
 π: S → A       - policy.
 $\sum_{i=0}^{\infty} \gamma^i r_i$ - discounted cumulative return.
       Relations to Board Games
•   state = current board
•   action = what we can play.
•   opponent action = part of the environment
•   Hidden assumption: Game is Markovian
Contrast with Supervised Learning

Supervised Learning:
Fixed distribution on examples.

Reinforcement Learning:
The state distribution is policy dependent!!!

A small local change in the policy can make a huge
global change in the return.
       Planning - Basic Problems.

        Given a complete MDP model.

Policy evaluation - Given a policy π, estimate its return.

Optimal control -    Find an optimal policy π* (maximizes
                     the return from any start state).
  Planning - Value Functions

V^π(s): the expected return starting at state s and following π.

Q^π(s,a): the expected return starting at state s with
        action a and then following π.

 V*(s) and Q*(s,a) are defined using an optimal policy π*.

                  V*(s) = max_π V^π(s)
 Planning - Policy Evaluation
  Discounted infinite horizon (Bellman Eq.)
$V^\pi(s) = E_{s' \sim \pi(s)} [ R(s, \pi(s)) + \gamma V^\pi(s') ]$

Rewrite the expectation

$V^\pi(s) = E[R(s, \pi(s))] + \gamma \sum_{s'} \delta(s, \pi(s), s') V^\pi(s')$

      Linear system of equations.
      Algorithms - Policy Evaluation

 A = {+1, -1}
 γ = 1/2
 δ(s_i, a) = s_{i+a}  (indices taken mod 4)
 π random
 ∀a: R(s_i, a) = i

 [Figure: four states s0, s1, s2, s3 arranged in a cycle, with rewards 0, 1, 2, 3]

 V^π(s0) = 0 + γ [π(s0,+1) V^π(s1) + π(s0,-1) V^π(s3)]
       Algorithms -Policy Evaluation

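 A = {+1, -1}
 γ = 1/2
 δ(s_i, a) = s_{i+a}  (indices taken mod 4)
 π random
 ∀a: R(s_i, a) = i

 [Figure: four states s0, s1, s2, s3 arranged in a cycle, with rewards 0, 1, 2, 3]

 V^π(s0) = 0 + (V^π(s1) + V^π(s3)) / 4

 V^π(s0) = 5/3
 V^π(s1) = 7/3
 V^π(s2) = 11/3
 V^π(s3) = 13/3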
Algorithms - optimal control
State-Action Value function:

$Q^\pi(s,a) = E[R(s,a)] + \gamma E_{s' \sim \delta(s,a,\cdot)} [ V^\pi(s') ]$

Note: $V^\pi(s) = Q^\pi(s, \pi(s))$ for a deterministic policy π.
        Algorithms -Optimal control

 A = {+1, -1}
 γ = 1/2
 δ(s_i, a) = s_{i+a}
 π random
 R(s_i, a) = i

 [Figure: four states s0, s1, s2, s3 arranged in a cycle, with rewards 0, 1, 2, 3]

 Q^π(s0,+1) = 0 + γ V^π(s1)

 Q^π(s0,+1) = 7/6
 Q^π(s0,-1) = 13/6
  Algorithms - optimal control
CLAIM: A policy π is optimal if and only if at each state s:

         V^π(s) = MAX_a {Q^π(s,a)}               (Bellman Eq.)

PROOF: Assume there is a state s and action a s.t.,

        V^π(s) < Q^π(s,a).
Then the strategy of performing a at state s (the first time)
is better than π.
This is true each time we visit s, so the policy that
performs action a at state s is better than π.          ∎
         Algorithms -optimal control

 A = {+1, -1}
 γ = 1/2
 δ(s_i, a) = s_{i+a}
 π random
 R(s_i, a) = i

 [Figure: four states s0, s1, s2, s3 arranged in a cycle, with rewards 0, 1, 2, 3]

Changing the policy using the state-action value function.
   Algorithms - optimal control

The greedy policy with respect to Q^π(s,a) is

   π(s) = argmax_a {Q^π(s,a)}

The ε-greedy policy with respect to Q^π(s,a) is

    π(s) = argmax_a {Q^π(s,a)} with probability 1-ε, and

    π(s) = random action with probability ε
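
Both selection rules can be sketched as functions of a Q table. The dictionary layout `q[(state, action)]` and the function names are assumptions of this example; the two Q values are the ones from the four-state example at s0.

```python
import random

def greedy_action(q, state, actions):
    """Action with the largest Q value at `state`."""
    return max(actions, key=lambda a: q[(state, a)])

def epsilon_greedy_action(q, state, actions, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return greedy_action(q, state, actions)

ACTIONS = (+1, -1)
q = {(0, +1): 7 / 6, (0, -1): 13 / 6}

rng = random.Random(0)
a_greedy = greedy_action(q, 0, ACTIONS)                    # -1, since 13/6 > 7/6
a_explore = epsilon_greedy_action(q, 0, ACTIONS, 0.1, rng)
```

Note that the random branch may also land on the greedy action, so with a uniform random choice the greedy action is taken with probability 1 - ε + ε/|A|.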
MDP - computing optimal policy

1. Linear Programming
2. Value Iteration method.

   $V^{i+1}(s) = \max_a \{ R(s,a) + \gamma \sum_{s'} \delta(s,a,s') V^i(s') \}$

3. Policy Iteration method.

   $\pi_i(s) = \arg\max_a \{ Q^{\pi_{i-1}}(s,a) \}$
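
A minimal sketch of value iteration on the four-state cycle example used earlier (A = {+1, -1}, γ = 1/2, deterministic moves, R(s_i, a) = i). The stopping tolerance and the all-zero starting point are arbitrary choices made for this sketch.

```python
GAMMA = 0.5
N = 4
ACTIONS = (+1, -1)

def backup(v):
    """One sweep: V_{i+1}(s) = max_a [ R(s,a) + gamma * V_i(next state) ]."""
    return [max(i + GAMMA * v[(i + a) % N] for a in ACTIONS) for i in range(N)]

def value_iteration(tol=1e-12):
    """Repeat the backup until successive value functions differ by less than tol."""
    v = [0.0] * N
    while True:
        new_v = backup(v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v

V_STAR = value_iteration()   # approximately [8/3, 10/3, 14/3, 16/3]
```

The limit corresponds to the policy that moves toward s2 and s3 and then alternates between them, collecting rewards 3 and 2.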
• Value Iteration
  – Drop in distance from optimal
     • Each iteration shrinks it to at most γ times its previous value
       (a drop of at least a 1-γ fraction)
• Policy Iteration
  – Policy only improves
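
A sketch of policy iteration on the same four-state cycle example (A = {+1, -1}, γ = 1/2, deterministic moves, R(s_i, a) = i). The starting policy (always +1), the fixed number of evaluation sweeps, and the `history` list are choices made for this illustration only.

```python
GAMMA = 0.5
N = 4
ACTIONS = (+1, -1)

def q_value(v, i, a):
    """Q(s_i, a) = R(s_i, a) + gamma * V(s_{i+a}), with indices taken mod 4."""
    return i + GAMMA * v[(i + a) % N]

def evaluate(policy, sweeps=200):
    """Iterative policy evaluation for a deterministic policy."""
    v = [0.0] * N
    for _ in range(sweeps):
        v = [q_value(v, i, policy[i]) for i in range(N)]
    return v

def policy_iteration(policy):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    history = []
    while True:
        v = evaluate(policy)
        history.append(v)
        improved = [max(ACTIONS, key=lambda a: q_value(v, i, a)) for i in range(N)]
        if improved == policy:
            return policy, history
        policy = improved

policy, history = policy_iteration([+1, +1, +1, +1])
```

Each entry of `history` is the value function of the policy at that round, so comparing consecutive entries shows the "policy only improves" property state by state.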
       Relations to Board Games
•   state = current board
•   action = what we can play.
•   opponent action = part of the environment
•   value function = likelihood of winning
•   Q-function = modified policy.
•   Hidden assumption: Game is Markovian
  Planning versus Learning

Tightly coupled in Reinforcement Learning

Goal: maximize return while learning.
Example - Elevator Control
           Learning (alone):
              Learn the arrival model well.

           Planning (alone):
              Given an arrival model, build a schedule.

            Real objective: Construct a
            schedule while updating the model.
  Partially Observable MDP
Rather than observing the state we observe some
function of the state.
Ob - Observation function:
    a random variable for each state.

Example: (1) Ob(s) = s+noise. (2) Ob(s) = first bit of s.

Problem: different states may “look” similar.

The optimal strategy is history-dependent!
 POMDP - Belief State Algorithm
Given a history of actions and observed values,
we compute a posterior distribution over the state
we are in (the belief state).

The belief-state MDP:
States: distributions over S (states of the POMDP).
Actions: as in the POMDP.
Transition: the posterior distribution (given the observation).

We can perform the planning and learning on the belief-state MDP.
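
The posterior computation can be sketched as a Bayes update: predict the next-state distribution with the transition function, weight it by the probability of the observation, and renormalize. The function signature, the "stay" action and the two-bit states below are hypothetical and chosen only to mirror the "first bit of s" observation example.

```python
from fractions import Fraction

def belief_update(belief, action, observation, transition, obs_prob):
    """Posterior over states: b'(s2) is proportional to
    obs_prob(s2, observation) * sum_s belief[s] * transition(s, action, s2)."""
    states = list(belief)
    predicted = {s2: sum(belief[s] * transition(s, action, s2) for s in states)
                 for s2 in states}
    unnormalized = {s2: obs_prob(s2, observation) * p for s2, p in predicted.items()}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the current belief")
    return {s2: p / total for s2, p in unnormalized.items()}

# Hypothetical POMDP: states are two-bit strings, the only action keeps the state.
STATES = ["00", "01", "10", "11"]

def transition(s, action, s2):
    return Fraction(int(s == s2))        # "stay" action: identity transition

def obs_prob(s, observation):
    return Fraction(int(s[0] == observation))   # Ob(s) = first bit of s, no noise

uniform = {s: Fraction(1, 4) for s in STATES}
posterior = belief_update(uniform, "stay", "1", transition, obs_prob)
```

After observing first bit "1", the belief puts probability 1/2 on each of "10" and "11": the two states still look the same to the agent.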
Hard computational problems.
Computing an infinite (polynomial) horizon undiscounted
optimal strategy for a deterministic POMDP is
PSPACE-hard (NP-complete) [PT,L].

Computing an infinite (polynomial) horizon undiscounted
optimal strategy for a stochastic POMDP is
EXPTIME-hard (PSPACE-complete) [PT,L].

 Computing an infinite (polynomial) horizon
 undiscounted optimal policy for an MDP is
 P-complete [PT].
• Reinforcement Learning: An Introduction
  [Sutton & Barto]
• Markov Decision Processes [Puterman]
• Dynamic Programming and Optimal
  Control [Bertsekas]
• Neuro-Dynamic Programming [Bertsekas &
• Ph.D. thesis - Michael Littman
