Hierarchical Reinforcement Learning
[A Survey and Comparison of HRL Techniques]

Mausam
The Outline of the Talk
MDPs and Bellman's curse of dimensionality.
RL: simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
Decision Making
[Diagram: an agent interacts with the Environment, receiving percepts and choosing "What action next?" to act.]
(Slide courtesy of Dan Weld)
Personal Printerbot
States (S) : {loc, has-robot-printout, user-loc, has-user-printout}, map
Actions (A) : {moveN, moveS, moveE, moveW, extend-arm, grab-page, release-pages}
Reward (R) : +20 if has-user-printout, else -1
Goal (G) : all states with has-user-printout true.
Start state : a state with has-user-printout false.
Episodic Markov Decision Process
Episodic MDP ≡ MDP with absorbing goal states: ⟨S, A, P, R, G, s0⟩
  S : set of environment states.
  A : set of available actions.
  P : probability transition model, P(s'|s,a).*
  R : reward model, R(s).*
  G : absorbing goal states.
  s0 : start state.
  γ : discount factor.**
* Markovian assumption.
** Bounds R for an infinite horizon.
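To make the tuple concrete, here is a minimal Python sketch of an episodic MDP as a plain data container, instantiated for a stripped-down Printerbot. The field names and the toy two-state instance are illustrative assumptions, not part of the original slides.

```python
# A minimal sketch of the episodic-MDP tuple <S, A, P, R, G, s0> as a plain
# Python container. The toy two-state instance is an illustrative assumption.
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

State = str
Action = str

@dataclass
class EpisodicMDP:
    states: List[State]                                # S
    actions: List[Action]                              # A
    P: Dict[Tuple[State, Action], Dict[State, float]]  # P(s' | s, a)
    R: Dict[State, float]                              # R(s)
    goals: Set[State]                                  # G : absorbing goal states
    s0: State                                          # start state
    gamma: float = 0.95                                # discount factor

# Toy instance: one step from "no-printout" to the absorbing "has-printout" goal.
toy = EpisodicMDP(
    states=["no-printout", "has-printout"],
    actions=["grab-page", "wait"],
    P={("no-printout", "grab-page"): {"has-printout": 0.9, "no-printout": 0.1},
       ("no-printout", "wait"): {"no-printout": 1.0},
       ("has-printout", "grab-page"): {"has-printout": 1.0},
       ("has-printout", "wait"): {"has-printout": 1.0}},
    R={"no-printout": -1.0, "has-printout": 20.0},
    goals={"has-printout"},
    s0="no-printout",
)
```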
Goal of an Episodic MDP

Find a policy π : S → A which maximises the expected discounted reward for a fully observable* episodic MDP, when the agent is allowed to execute for an indefinite horizon.

* Non-noisy, complete-information perceptors.
Solution of an Episodic MDP
Define V*(s) : the optimal expected reward starting in state s.

Bellman equation (for a reward model R(s)):
  V*(s) = R(s) + γ max_a Σ_s' P(s'|s,a) V*(s')

Value Iteration : start with an estimate of V*(s) and successively re-estimate it until it converges to a fixed point.
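A minimal value-iteration sketch for an episodic MDP with state-based rewards R(s), following the Bellman backup above. The dict-based model format and the convention that an absorbing goal collects one final reward are assumptions of this sketch.

```python
# A minimal Value Iteration sketch over a dict-based episodic MDP model.
from typing import Dict, List, Set, Tuple

def value_iteration(states: List[str],
                    actions: List[str],
                    P: Dict[Tuple[str, str], Dict[str, float]],
                    R: Dict[str, float],
                    goals: Set[str],
                    gamma: float = 0.95,
                    eps: float = 1e-6) -> Dict[str, float]:
    """Iterate the Bellman backup until the value estimates stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in goals:
                # Absorbing goal: collect the final reward, then the episode ends.
                new_v = R[s]
            else:
                new_v = R[s] + gamma * max(
                    sum(p * V[s2] for s2, p in P[(s, a)].items())
                    for a in actions)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:          # converged to (near) the fixed point V*
            return V
```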
Complexity of Value Iteration
Each iteration : polynomial in |S|.
Number of iterations : polynomial in |S|.
Overall : polynomial in |S|.

Polynomial in |S|, but |S| is exponential in the number of features in the domain*.
* Bellman's curse of dimensionality.
Learning
[Diagram: data flows from the Environment to the agent, which gains knowledge, understanding and skills, and modifies its behavioural tendencies.]
Decision Making while Learning*
[Diagram: the agent both learns from percepts/data (gaining knowledge, understanding and skills; modifying behavioural tendencies) and decides "What action next?" to act on the Environment.]
* Known as Reinforcement Learning.
Reinforcement Learning
Unknown transition model P and unknown reward model R.
Learning component : estimate P and R from data observed in the environment.
Planning component : decide which actions to take to maximise reward.
Exploration vs. exploitation :
  GLIE (Greedy in the Limit with Infinite Exploration).
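One common GLIE scheme is epsilon-greedy selection with epsilon decaying as 1/k: every action keeps being tried, yet the policy becomes greedy in the limit. A minimal sketch follows; the Q-table layout is an illustrative assumption.

```python
# A minimal GLIE exploration sketch: epsilon-greedy with a 1/k decay schedule.
import random
from typing import Dict, List, Tuple

def glie_epsilon_greedy(Q: Dict[Tuple[str, str], float],
                        state: str,
                        actions: List[str],
                        episode: int) -> str:
    """Epsilon-greedy with epsilon = 1/k: explores forever, greedy in the limit."""
    epsilon = 1.0 / (episode + 1)
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```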
Learning
Model-based learning :
  Learn the model, then plan with it.
  Requires less data, more computation.
Model-free learning :
  Plan without learning an explicit model.
  Requires more data, less computation.
Q-Learning
Instead of learning P and R, learn Q* directly.
Q*(s,a) : the optimal expected reward starting in s, if the first action is a and the optimal policy is followed thereafter.
Q* directly defines the optimal policy:
  π*(s) = argmax_a Q*(s,a)
i.e. the optimal action is the one with the maximum Q* value.
Q-Learning
Given an experience tuple ⟨s, a, s', r⟩, move the old estimate of the Q value towards the new exploration estimate:

  Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a' Q(s',a') ]

Under suitable assumptions, and a GLIE exploration policy, Q-Learning converges to the optimal Q*.
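A minimal sketch of the tabular Q-Learning update just described; the table layout and the fixed learning rate alpha are illustrative assumptions.

```python
# A minimal tabular Q-Learning backup: blend the old estimate with the sample
# r + gamma * max_a' Q(s', a').
from typing import Dict, List, Tuple

def q_update(Q: Dict[Tuple[str, str], float],
             s: str, a: str, s2: str, r: float,
             actions: List[str],
             alpha: float = 0.1, gamma: float = 0.95) -> None:
    """One tabular Q-Learning backup for the experience tuple <s, a, s', r>."""
    old = Q.get((s, a), 0.0)                                  # old estimate
    sample = r + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * old + alpha * sample            # blended new estimate
```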
Semi-MDP : when actions take time.
The Semi-MDP Bellman equation replaces the one-step discount γ with γ^N, where N is the (random) duration of action a.

The Semi-MDP Q-Learning update:

  Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ^N max_a' Q(s',a') ]

where the experience tuple is ⟨s, a, s', r, N⟩ and r is the accumulated discounted reward while action a was executing.
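The Semi-MDP version changes only the bootstrap term, which is discounted by gamma**N. A minimal sketch, with the same illustrative assumptions as the one-step update above.

```python
# A minimal Semi-MDP Q-Learning backup for the tuple <s, a, s', r, N>.
from typing import Dict, List, Tuple

def smdp_q_update(Q: Dict[Tuple[str, str], float],
                  s: str, a: str, s2: str, r: float, N: int,
                  actions: List[str],
                  alpha: float = 0.1, gamma: float = 0.95) -> None:
    """Semi-MDP backup: r is accumulated discounted reward, N the action's duration."""
    old = Q.get((s, a), 0.0)
    sample = r + (gamma ** N) * max(Q.get((s2, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * old + alpha * sample
```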
Printerbot
The Paul G. Allen Center has ~85,000 sq ft of space.
Each floor : ~85,000 / 7 ≈ 12,000 sq ft.
Discretise the location on a floor into 12,000 parts.
State space (without the map) : 2 × 2 × 12,000 × 12,000 states, which is very large!
How do humans do this kind of decision making?
1. The Mathematical Perspective : A Structure Paradigm
  S : Relational MDP
  A : Concurrent MDP
  P : Dynamic Bayes Nets
  R : Continuous-state MDP
  G : Conjunction of state variables
  V : Algebraic Decision Diagrams
  π : Decision List (RMDP)
2. Modular Decision Making
[Diagram: the Printerbot's navigation decomposed into sub-behaviours: go out of the room, walk in the hallway, go into the room.]
Humans plan modularly, at different granularities of understanding.
Going out of one room is similar to going out of another room.
Navigation steps do not depend on whether we have the printout or not.
3. Background Knowledge
Classical planners using additional control knowledge can scale up to larger problems (e.g. HTN planning, TLPlan).
What forms of control knowledge can we provide to our Printerbot?
  First pick up the printouts, then deliver them.
  Navigation : consider rooms and the hallway separately, etc.
A mechanism that exploits all three avenues : Hierarchies
1. A way to add a special (hierarchical) structure on different parameters of an MDP.
2. Draws on the intuition and reasoning in human decision making.
3. A way to provide additional control knowledge to the system.
Hierarchy
A hierarchy of : behaviours, skills, modules, subtasks, macro-actions, etc.
  picking up the pages
  collision avoidance
  the fetch-pages phase
  walking in the hallway

HRL ≡ RL with temporally extended actions.
Hierarchical Algorithms ≡ Gating Mechanism*
Hierarchical learning :
  • learning the gating function
  • learning the individual behaviours
  • learning both
[Diagram: a gate g selects among behaviours b_i.]
* Can be a multi-level hierarchy.
Option : moveE until the end of the hallway
  Start : any state in the hallway.
  Execute : the policy shown (moveE in every hallway state).
  Terminate : when s is the end of the hallway.
Options
[Sutton, Precup & Singh '99]

An option is a well-defined behaviour.
o = ⟨ I_o, π_o, β_o ⟩
  I_o : set of states (I_o ⊆ S) in which o can be initiated.
  π_o(s) : policy (S → A*) followed while o is executing.
  β_o(s) : probability that o terminates in s.
* Can be a policy over lower-level options.
Learning
An option is a temporally extended action with a well-defined policy.
The set of options (O) replaces the set of actions (A).
Learning occurs outside the options.
Learning over options ≡ Semi-MDP Q-Learning.
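A minimal sketch of the option tuple ⟨I_o, π_o, β_o⟩ and of executing an option as a temporally extended action whose outcome ⟨s', r, N⟩ can be fed to the Semi-MDP Q-update sketched earlier. The environment interface step(s, a) is an assumed stand-in, not a real API.

```python
# A minimal sketch of an option <I_o, pi_o, beta_o> executed as an SMDP action.
import random
from dataclasses import dataclass
from typing import Callable, Set, Tuple

@dataclass
class Option:
    initiation: Set[str]                 # I_o : states where o may be started
    policy: Callable[[str], str]         # pi_o(s) : primitive action to take
    termination: Callable[[str], float]  # beta_o(s) : probability of terminating in s

def run_option(option: Option,
               s: str,
               step: Callable[[str, str], Tuple[str, float]],
               gamma: float = 0.95) -> Tuple[str, float, int]:
    """Execute the option to termination; return (s', accumulated reward r, duration N)."""
    assert s in option.initiation
    total, discount, N = 0.0, 1.0, 0
    while True:
        a = option.policy(s)
        s, r = step(s, a)                # one primitive step in the environment
        total += discount * r
        discount *= gamma
        N += 1
        if random.random() < option.termination(s):
            return s, total, N           # ready for the Semi-MDP Q-update
```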
Machine : moveE + Collision Avoidance
[FSA diagram: execute moveE until the end of the hallway, then Return; on encountering an obstacle, Choose between calling machine M1 or machine M2.
  M1 : moveW, moveS, moveS, Return.
  M2 : moveW, moveN, moveN, Return.]
Hierarchies of Abstract Machines
[Parr & Russell '97]

A machine is a partial policy represented by a Finite State Automaton.
Node types :
  Execute a ground action.
  Call a machine as a subroutine.
  Choose the next node.
  Return to the calling machine.
Learning
Learning occurs within machines, as machines are only partially defined.
Flatten all machines out and consider joint states [s,m], where s is a world state and m a machine node ≡ an MDP.
reduce(S∘M) : consider only the states whose machine node is a choice node ≡ a Semi-MDP.
Learning ≈ Semi-MDP Q-Learning.
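A minimal sketch of HAM-style learning at choice points of the flattened [s, m] space: Q-values live on joint (world state, choice node) pairs and are updated SMDP-style with the reward accumulated since the previous choice point. The names are illustrative assumptions, not the original HAMQ implementation.

```python
# A minimal sketch of an SMDP-style backup between successive HAM choice points.
from typing import Dict, List, Tuple

ChoicePoint = Tuple[str, str]            # (world state s, machine node m)

def hamq_update(Q: Dict[Tuple[ChoicePoint, str], float],
                prev: ChoicePoint, choice: str,
                curr: ChoicePoint, choices_at_curr: List[str],
                r_accum: float, N: int,
                alpha: float = 0.1, gamma: float = 0.95) -> None:
    """Backup for the choice taken at `prev`, given the next choice point `curr`."""
    old = Q.get((prev, choice), 0.0)
    best_next = max((Q.get((curr, c), 0.0) for c in choices_at_curr), default=0.0)
    sample = r_accum + (gamma ** N) * best_next
    Q[(prev, choice)] = (1 - alpha) * old + alpha * sample
```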
Task Hierarchy : MAXQ Decomposition
[Dietterich '00]

Task graph (children of a task are unordered):
  Root
    Fetch
      Take : extend-arm, grab-page
      Navigate(loc) : moveN, moveS, moveE, moveW
    Deliver
      Navigate(loc)
      Give : release-pages, extend-arm
MAXQ Decomposition
Augment the state s with the current subtask i : [s,i].
Define C([s,i],j) as the reward received within i after subtask j finishes (the completion function).
Express Q in terms of V and C:
  Q([s,Fetch], Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch], Navigate(prr))*
    (reward received while navigating)   (reward received after navigation)
Learn C instead of learning Q.
* Observe the context-free nature of the V([s,Navigate(prr)]) term: it does not depend on the calling context.
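A minimal sketch of the decomposition Q([s,i],j) = V([s,j]) + C([s,i],j) over a hypothetical Printerbot hierarchy; this shows only the recursive value computation, not Dietterich's full MAXQ-Q learning algorithm, and the table layouts and base case are assumptions.

```python
# A minimal sketch of the MAXQ value decomposition: Q = V (context-free) + C (context).
from typing import Dict, List, Tuple

V_prim: Dict[Tuple[str, str], float] = {}   # learned values of primitive actions
C: Dict[Tuple[str, str, str], float] = {}   # C[(i, s, j)] : completion of j inside i
CHILDREN: Dict[str, List[str]] = {          # hypothetical (truncated) Printerbot hierarchy
    "Root": ["Fetch", "Deliver"],
    "Fetch": ["Take", "Navigate"],
    "Deliver": ["Navigate", "Give"],
}

def V(task: str, s: str) -> float:
    """Recursive MAXQ value; tasks without children are treated as primitive here."""
    if task not in CHILDREN:
        return V_prim.get((task, s), 0.0)
    return max(Q(task, s, child) for child in CHILDREN[task])

def Q(i: str, s: str, j: str) -> float:
    # Q([s,i], j) = V([s,j]) + C([s,i], j): only C depends on the parent context i.
    return V(j, s) + C.get((i, s, j), 0.0)
```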
1. State Abstraction
Abstract state : a state with fewer state variables; different world states map to the same abstract state.
If we can drop some state variables, we can reduce the learning time considerably!
We may use different abstract states for different macro-actions.
State Abstraction in MAXQ
Relevance : only some variables are relevant for a task.
  Fetch : user-loc is irrelevant.
  Navigate(printer-room) : h-r-po, h-u-po and user-loc are irrelevant.
  Fewer parameters for the V functions of the lower levels.
Funnelling : a subtask maps many states into a smaller set of states.
  Fetch : all states map to h-r-po = true, loc = printer-room.
  Fewer parameters for the C functions of the higher levels.
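A minimal sketch of relevance-based abstraction: project the world state onto the variables a subtask cares about, so per-subtask value tables shrink. The variable lists echo the slide, but the exact lists and function names are illustrative assumptions.

```python
# A minimal sketch of per-subtask relevance abstraction for the Printerbot.
from typing import Dict, Tuple

RELEVANT: Dict[str, Tuple[str, ...]] = {
    "Fetch":    ("loc", "h-r-po", "h-u-po"),               # user-loc is irrelevant
    "Navigate": ("loc",),                                   # printout/user variables dropped
    "Deliver":  ("loc", "h-r-po", "h-u-po", "user-loc"),
}

def abstract(state: Dict[str, object], subtask: str) -> Tuple:
    """Project a full world state onto the variables relevant to the given subtask."""
    return tuple(state[v] for v in RELEVANT[subtask])

# Both world states below collapse to the same abstract Navigate state.
s1 = {"loc": "hallway", "h-r-po": True,  "h-u-po": False, "user-loc": "office-12"}
s2 = {"loc": "hallway", "h-r-po": False, "h-u-po": False, "user-loc": "office-31"}
assert abstract(s1, "Navigate") == abstract(s2, "Navigate")
```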
State Abstraction in Options, HAM
Options : learning is required only in states that are terminal states for some option.
HAM : the original work has no abstraction.
  Extension : a three-way value decomposition*:
    Q([s,m],n) = V([s,n]) + C([s,m],n) + Cex([s,m])
  Similar abstractions are then employed.

* [Andre & Russell '02]
2. Optimality

Hierarchical Optimality vs. Recursive Optimality
Optimality
Options : hierarchically optimal.
  Using (A ∪ O) : globally optimal**.
  Interrupt options.
HAM : hierarchically optimal*.
MAXQ : recursively optimal*.
  Interrupt subtasks.
  Use pseudo-rewards.
  Iterate!
* Equations can be defined for both optimalities.
** The advantage of using macro-actions may be lost.
  3. Language Expressiveness
Option
  Can only input a complete policy
HAM
  Can input a complete policy.
  Can input a task hierarchy.
  Can represent “amount of effort”.
  Later extended to partial programs.
MAXQ
  Cannot input a policy (full/partial)
  4. Knowledge Requirements
Options
  Requires complete specification of policy.
  One could learn option policies – given subtasks.
HAM
  Medium requirements
MAXQ
  Minimal requirements
5. Models Advanced
Options : concurrency.
HAM : richer representations, concurrency.
MAXQ : continuous time, states and actions; multi-agent settings; average reward.
In general, more researchers have followed MAXQ:
  Less input knowledge.
  Value decomposition.
6. Structure Paradigm
  S : Options, MAXQ
  A : All
  P : None
  R : MAXQ
  G : All
  V : MAXQ
  π : All
Directions for Future Research
Bidirectional State Abstractions
Hierarchies over other RL research
  Model based methods
  Function Approximators
Probabilistic Planning
  Hierarchical P and Hierarchical R
Imitation Learning
Directions for Future Research
Theory
  Bounds (goodness of hierarchy)
  Non-asymptotic analysis
Automated Discovery
  Discovery of Hierarchies
  Discovery of State Abstraction
Apply…
           Applications
Toy Robot
Flight Simulator
AGV Scheduling
Keepaway soccer
[Diagram: an AGV scheduling domain with a parts warehouse, assembly stations, pick-up points P1–P4 and drop-off points D1–D4. Images courtesy of various sources.]
              Thinking Big…
"... consider maze domains. Reinforcement learning
   researchers, including this author, have spent
   countless years of research solving a solved
   problem! Navigating in grid worlds, even with
   stochastic dynamics, has been far from rocket
   science since the advent of search techniques
   such as A*.”                         -- David Andre
Use planners, theorem provers, etc. as components in a big hierarchical solver.
How to Choose an Appropriate Hierarchy
Look at the available domain knowledge:
  If some behaviours are completely specified : Options.
  If some behaviours are partially specified : HAM.
  If little domain knowledge is available : MAXQ.
We can use all three to specify different behaviours in tandem.
Main Ideas in the HRL Community
Hierarchies speed up learning.
Value function decomposition.
State abstractions.
Greedy non-hierarchical execution.
Context-free learning and pseudo-rewards.
Policy improvement by re-estimation and re-learning.
				