# Hierarchical Reinforcement Learning by chenmeixiu

```
Hierarchical Reinforcement Learning
[A Survey and Comparison of HRL techniques]

Mausam
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise
Decision Making
Environment

What action next?

Percept                     Action
Slide courtesy Dan Weld
Personal Printerbot
States (S) : {loc, has-robot-printout, user-loc, has-user-printout}, map
Actions (A) : {moven, moves, movee, movew, extend-arm, grab-page, release-pages}
Reward (R) : if h-u-po +20 else -1
Goal (G) : All states with h-u-po true.
Start state : A state with h-u-po false.
Episodic Markov Decision Process
Episodic MDP ≡ MDP with absorbing goals
   ⟨S, A, P, R, G, s0⟩
   S : Set of environment states.
   A : Set of available actions.
   P : Probability transition model. P(s'|s,a)*
   R : Reward model. R(s)*
   G : Absorbing goal states.
   s0 : Start state.
   γ : Discount factor**.
* Markovian assumption.
** Bounds R for infinite horizon.
Goal of an Episodic MDP

Find a policy (S → A) which
maximises expected discounted reward for
a fully observable* Episodic MDP,
if the agent is allowed to execute for an indefinite
horizon.

* Non-noisy, complete-information perceptors
Solution of an Episodic MDP
Define V*(s) : Optimal reward
starting in state s.

Value Iteration : Start with an estimate of V*(s) and
successively re-estimate it to
converge to a fixed point.
Complexity of Value Iteration
Each iteration – polynomial in |S|
Number of iterations – polynomial in |S|
Overall – polynomial in |S|

Polynomial in |S|, but
|S| : exponential in number of features in the domain*.
* Bellman's curse of dimensionality
Learning
Environment

•Gain knowledge
•Gain understanding
•Gain skills
•Modification of
behavioural tendency

Data
Decision Making while Learning*
Environment

•Gain knowledge
•Gain understanding
•Gain skills
•Modification of
behavioural tendency
What action next?
Percepts / Datum                  Action
* Known as Reinforcement Learning
Reinforcement Learning
Unknown P and reward R.
Learning Component : Estimate the P and R
values via data observed from the
environment.
Planning Component : Decide which actions
to take that will maximise reward.
Exploration vs. Exploitation
GLIE (Greedy in Limit with
Infinite Exploration)
Learning
Model-based learning
Learn the model, and do planning
Requires less data, more computation
Model-free learning
Plan without learning an explicit model
Requires a lot of data, less computation
Q-Learning
Instead of learning P and R, learn Q* directly.
Q*(s,a) : Optimal reward starting in s,
if the first action is a, and
after that the optimal policy is followed.
Q* directly defines the optimal policy:

π*(s) = argmax_a Q*(s,a)

Optimal policy is the action with maximum Q* value.
Q-Learning

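Given an experience tuple ⟨s,a,s',r⟩

Q(s,a) ← (1−α) Q(s,a) + α [r + γ max_a' Q(s',a')]
Old estimate of Q value : Q(s,a)
New estimate of Q value : r + γ max_a' Q(s',a')
(α : learning rate, γ : discount factor)

Under suitable assumptions, and GLIE exploration,
Q-Learning converges to optimal.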
Semi-MDP: When actions take time.
The Semi-MDP equation:

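Semi-MDP Q-Learning equation:

Q(s,a) ← (1−α) Q(s,a) + α [r + γ^N max_a' Q(s',a')]

where experience tuple is ⟨s,a,s',r,N⟩
r = accumulated discounted reward while action a was executing.
N = number of time steps a took (α : learning rate, γ : discount factor).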
Printerbot
Paul G. Allen Center has 85000 sq ft space
Each floor ~ 85000/7 ~ 12000 sq ft
Discretise location on a floor: 12000 parts.
State Space (without map) :
2*2*12000*12000 --- very large!!!!!
How do humans do the
decision making?
1. The Mathematical Perspective
S : Relational MDP
A : Concurrent MDP
P : Dynamic Bayes Nets
R : Continuous-state MDP
G : Conjunction of state variables
V : Algebraic Decision Diagrams
π : Decision List (RMDP)
2. Modular Decision Making

•Go out of room
•Walk in hallway
•Go in the room
2. Modular Decision Making
Humans plan modularly at different
granularities of understanding.
Going out of one room is similar to going
out of another room.
Navigation steps do not depend on whether
we have the print out or not.
3. Background Knowledge
knowledge can scale up to larger problems.
(E.g. : HTN planning, TLPlan)
What forms of control knowledge can we
provide to our Printerbot?
First pick printouts, then deliver them.
separately, etc.
A mechanism that exploits all three
avenues : Hierarchies
1. Way to add a special (hierarchical)
structure on different parameters of an
MDP.
2. Draws from the intuition and reasoning in
human decision making.
3. Way to provide additional control
knowledge to the system.
Hierarchy
Hierarchy of : Behaviour, Skill, Module,
picking the pages
collision avoidance
fetch pages phase
walk in hallway

HRL ≡ RL with temporally extended actions
Hierarchical Algos ≡ Gating Mechanism
Hierarchical Learning
•Learning the gating function
•Learning the individual behaviours
•Learning both
*
g is a gate

b_i is a behaviour
*Can be a multi-level hierarchy.
Option : Movee until end of hallway

Start : Any state in the hallway.
Execute : policy as shown.
Terminate : when s is end of hallway.
Options
[Sutton, Precup, Singh’99]

An option is a well defined behaviour.
o = ⟨I_o, π_o, β_o⟩
I_o : Set of states (I_o ⊆ S) in which o can be initiated.
π_o(s) : Policy (S → A*) when o is executing.
β_o(s) : Probability that o terminates in s.
* Can be a policy over lower level options.
Learning
An option is a temporally extended action
with a well defined policy.
Set of options (O) replaces the set of
actions (A).
Learning occurs outside options.
Learning over options ≡ Semi-MDP Q-Learning.
Machine: Movee + Collision Avoidance
[Figure: finite state machine with nodes Movee, Choose, Call M1, Call M2 and Return; transitions labelled Obstacle and End of hallway]
M1 : Movew, Moves, Moves, Return
M2 : Movew, Moven, Moven, Return
Hierarchies of Abstract Machines
[Parr, Russell’97]

A machine is a partial policy represented by
a Finite State Automaton.
Node :
Execute a ground action.
Call a machine as a subroutine.
Choose the next node.
Learning
Learning occurs within machines, as
machines are only partially defined.
Flatten all machines out and consider
states [s,m] where s is a world state, and m
a machine node ≡ MDP.
reduce(S∘M) : Consider only states where
machine node is a choice node ≡ Semi-MDP.
Learning ≈ Semi-MDP Q-Learning
[Dietterich’00]

[Figure: task graph, one level per row. Children of a task are unordered.]
Root
Fetch                   Deliver
Take             Navigate(loc)               Give
Extend-arm      Grab                        Release      Extend-arm
Moven      Moves      Movee      Movew
MAXQ Decomposition
Augment the state s by adding the
Define C([s,i],j) as the reward received in i
after j finishes.
Q([s,Fetch],Navigate(prr)) =
V([s,Navigate(prr)]) + C([s,Fetch],Navigate(prr))*
V : reward while navigating
C : reward after navigation
Learn C, instead of learning Q.
* Observe the context-free nature of Q-value.
1. State Abstraction
Abstract state : A state having fewer
state variables; different world states
map to the same abstract state.
If we can drop some state
variables, then we can reduce the
learning time considerably!
We may use different abstract states
for different macro-actions.
State Abstraction in MAXQ
Relevance : Only some variables are relevant for a subtask.
Fetch : user-loc irrelevant
Navigate(printer-room) : h-r-po,h-u-po,user-loc
Fewer params for V of lower levels.
Funnelling : Subtask maps many states to
smaller set of states.
Fetch : All states map to h-r-po=true,
loc=pr.room.
Fewer params for C of higher levels.
State Abstraction in Options, HAM
Options : Learning required only in states
that are terminal states for some option.
HAM : Original work has no abstraction.
Extension: Three-way value decomposition*:
Q([s,m],n) = V([s,n]) + C([s,m],n) + Cex([s,m])
Similar abstractions are employed.

*[Andre,Russell’02]
2. Optimality

Hierarchical Optimality
vs.
Recursive Optimality
Optimality
Options : Hierarchical
  Use (A ∪ O) : Global**
  Interrupt options
HAM : Hierarchical*
MAXQ : Recursive*
  Use Pseudo-rewards
  Iterate!
* Can define eqns for both optimalities
macro-actions may be lost.
3. Language Expressiveness
Option
Can only input a complete policy
HAM
Can input a complete policy.
Can represent “amount of effort”.
Later extended to partial programs.
MAXQ
Cannot input a policy (full/partial)
4. Knowledge Requirements
Options
Requires complete specification of policy.
One could learn option policies – given subtasks.
HAM
Medium requirements
MAXQ
Minimal requirements
Options : Concurrency
HAM : Richer representation, Concurrency
MAXQ : Continuous time, state, actions;
Multi-agents, Average-reward.
In general, more researchers have followed
MAXQ
Less input knowledge
Value decomposition
S : Options, MAXQ
A : All
P : None
R : MAXQ
G : All
V : MAXQ
π : All
Directions for Future Research
Bidirectional State Abstractions
Hierarchies over other RL research
Model based methods
Function Approximators
Probabilistic Planning
Hierarchical P and Hierarchical R
Imitation Learning
Directions for Future Research
Theory
Bounds (goodness of hierarchy)
Non-asymptotic analysis
Automated Discovery
Discovery of Hierarchies
Discovery of State Abstraction
Apply…
Applications
Toy Robot
Flight Simulator
AGV Scheduling
Keepaway soccer
[Figure: AGV scheduling layout with Warehouse, Parts, Assemblies and stations P1-P4 and D1-D4. Images courtesy various sources.]
Thinking Big…
"... consider maze domains. Reinforcement learning
researchers, including this author, have spent
countless years of research solving a solved
problem! Navigating in grid worlds, even with
stochastic dynamics, has been far from rocket
science since the advent of search techniques
such as A*."
-- David Andre
Use planners, theorem provers, etc. as
components in big hierarchical solver.
How to choose appropriate hierarchy
Look at available domain knowledge
If some behaviours are completely specified –
options
If some behaviours are partially specified –
HAM
If less domain knowledge available – MAXQ
We can use all three to specify different
behaviours in tandem.
Main ideas in HRL community
Hierarchies speed up learning
Value function decomposition
State Abstractions
Greedy non-hierarchical execution
Context-free learning and pseudo-rewards
Policy improvement by re-estimation
and re-learning.

```
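
The Semi-MDP Q-Learning update over options described in the slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the talk: the function names (smdp_q_update, run_option, hallway_step), the five-cell hallway, the reward of -1 per step, and the values chosen for the learning rate and discount factor are all hypothetical. The option executed is the "Movee until end of hallway" example from the slides.

```python
from collections import defaultdict

GAMMA = 0.9   # discount factor (assumed value)
ALPHA = 0.5   # learning rate (assumed value)
HALL_END = 4  # index of the last hallway cell (toy domain, assumed)


def smdp_q_update(Q, s, a, s_next, r, n, actions, alpha=ALPHA, gamma=GAMMA):
    """One Semi-MDP Q-Learning step for the experience tuple <s, a, s', r, N>.

    r is the discounted reward accumulated while a was executing and n is
    the number of primitive steps it took. With n == 1 this reduces to the
    ordinary one-step Q-Learning update.
    """
    best_next = max(Q[(s_next, b)] for b in actions)
    target = r + (gamma ** n) * best_next
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]


def run_option(state, step_fn, policy, terminates, gamma=GAMMA):
    """Execute an option's policy until it terminates; return (s', r, N)."""
    total, n = 0.0, 0
    while True:
        state, reward = step_fn(state, policy(state))
        total += (gamma ** n) * reward  # discount relative to option start
        n += 1
        if terminates(state):
            return state, total, n


def hallway_step(state, action):
    """Toy deterministic hallway: 'movee' moves one cell east, reward -1."""
    nxt = min(state + 1, HALL_END) if action == "movee" else state
    return nxt, -1.0


# Learning happens outside the option: the option is treated as one action.
Q = defaultdict(float)
options = ["movee-until-end"]
s_next, r, n = run_option(
    1, hallway_step,
    policy=lambda s: "movee",
    terminates=lambda s: s == HALL_END,
)
smdp_q_update(Q, 1, "movee-until-end", s_next, r, n, options)
```

Starting in cell 1, the option runs for three primitive steps, and the single update uses the reward accumulated over all three, discounting the value of the resulting state by the discount factor raised to the number of steps.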
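
The MAXQ decomposition from the slides, Q([s,Fetch],Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch],Navigate(prr)), can be read as a recursion over the task graph. The sketch below is an illustration only: the function names (maxq_v, maxq_q), the single state "s0", the reduced task graph and every number in the tables are assumptions made for the example, and the tables are filled in by hand rather than learned.

```python
def maxq_v(task, state, children, C, prim_v):
    """V([s, task]): stored value for a primitive action, otherwise the
    best Q-value over the task's children."""
    if task not in children:  # primitive action
        return prim_v[(state, task)]
    return max(
        maxq_q(task, state, child, children, C, prim_v)
        for child in children[task]
    )


def maxq_q(parent, state, child, children, C, prim_v):
    """Q([s, parent], child) = V([s, child]) + C([s, parent], child)."""
    return maxq_v(child, state, children, C, prim_v) + C[(state, parent, child)]


# Reduced, hypothetical version of the Printerbot task graph.
children = {
    "Fetch": ["Navigate", "Take"],
    "Navigate": ["movee", "movew"],
    "Take": ["extend-arm", "grab"],
}

# Expected immediate reward of each primitive action in state s0 (assumed).
prim_v = {("s0", a): -1.0 for a in ["movee", "movew", "extend-arm", "grab"]}

# Completion values C([s, parent], child): reward received in the parent
# after the child finishes (assumed numbers).
C = {
    ("s0", "Navigate", "movee"): -2.0,
    ("s0", "Navigate", "movew"): -4.0,
    ("s0", "Take", "extend-arm"): -1.0,
    ("s0", "Take", "grab"): -3.0,
    ("s0", "Fetch", "Navigate"): -2.0,
    ("s0", "Fetch", "Take"): -6.0,
}

q_fetch_nav = maxq_q("Fetch", "s0", "Navigate", children, C, prim_v)
v_fetch = maxq_v("Fetch", "s0", children, C, prim_v)
```

Only the C tables and the primitive values are stored; V for Navigate does not depend on which parent invoked it, which is the context-free property the slide points out.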