# Using MDP Characteristics to Guide Exploration in Reinforcement Learning


Paper: Bohdana Ratitch & Doina Precup
Presenter: Michael Simon

Some pictures/formulas gratefully borrowed from slides by Ratitch

## MDP Terminology

• Transition probabilities: \(P^a_{ss'}\)
• Expected reward: \(R^a_{ss'}\)
• Return: \(R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\), the discounted
  sum of future rewards (computed in the sketch below)
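
A minimal sketch of the return computation (the function name and the example rewards are mine, not from the slides):

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_k gamma^k * r_{t+k+1}: discount each reward by
    how far in the future it arrives, then sum."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# 1.0 + 0.9 * 0.0 + 0.9**2 * 2.0 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```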

## Reinforcement Learning
• Learning from environmental rewards only
  – Achieve the best payoff possible
• Must balance exploitation with exploration
  – Exploration can take large amounts of time
• In theory, the structure of the problem/model
  can assist exploration
  – But which properties of an MDP actually help?

## Goals/Approach
• Find MDP Characteristics...
– ... that affect performance...
– ... and test on them.
• Use MDP Characteristics...
– ... to tune parameters.
– ... to select algorithms.
– ... to create a strategy.

## Back to RL
• Undirected
  – Sufficient exploration
  – Simple, but can take exponential time
• Directed
  – Extra computation/storage, but possibly
    polynomial time
  – Often uses aspects of the model to its advantage

## RL Methods - Undirected
• ε-greedy exploration
  – With probability 1−ε, exploit your best greedy
    guess at the moment
  – With probability ε, explore: select an action
    uniformly at random
• Boltzmann distribution
  – Softmax over action values (both strategies are
    sketched below)
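
A minimal sketch of both undirected strategies (the helper names and the temperature parameter `tau` are my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon pick a uniformly random action;
    # otherwise exploit the current greedy estimate.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, tau=1.0):
    # Softmax over Q-values: small tau approaches greedy action
    # selection, large tau approaches uniform exploration.
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()                  # numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))
```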

## RL Methods - Directed
• Pick actions that maximize the value estimate
  plus an exploration bonus
  – Different options for the bonus term (sketched
    below):
    • Counter-based (favor least frequently tried)
    • Recency-based (favor least recently tried)
    • Error-based (most variable value estimates)
    • Interval Estimation (highest variance in samples)
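
A hedged sketch of how such bonuses are typically combined with value estimates; these are generic textbook forms, not necessarily the exact ones used in the paper:

```python
import numpy as np

def counter_bonus(visit_counts):
    # Counter-based: favor the least frequently tried actions.
    return 1.0 / (1.0 + np.asarray(visit_counts, dtype=float))

def recency_bonus(current_step, last_tried):
    # Recency-based: favor actions not tried for a long time.
    return current_step - np.asarray(last_tried, dtype=float)

def pick_with_bonus(q_values, bonus, k=1.0):
    # Greedy with respect to value estimate plus weighted bonus.
    return int(np.argmax(np.asarray(q_values) + k * np.asarray(bonus)))
```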

## Properties of MDPs
• State Transition Entropy
• Controllability
• Variance of Immediate Rewards
• Risk Factor
• Transition Distance
• Transition Variability

## State Transition Entropy

\[ STE(s,a) = -\sum_{s'} P^a_{ss'} \log P^a_{ss'} \]

(computed in the sketch below)

• Stochasticity of state transitions
  – High STE = good for exploration
• Potential variance of samples needed
  – High STE = more samples needed
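
STE follows directly from the transition probabilities. A minimal sketch:

```python
import numpy as np

def state_transition_entropy(p_next):
    # p_next is the probability vector P(. | s, a) over successor
    # states; zero entries contribute nothing (0 log 0 := 0).
    p = np.asarray(p_next, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

print(state_transition_entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 (deterministic)
print(state_transition_entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.386 (= log 4)
```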

## Controllability - Calculation

• How much the environment's response differs
  depending on the chosen action
  – Can also be thought of as normalized
    information gain (one reading is sketched below)
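
The slide's formula did not survive extraction; the sketch below is one plausible reading of "normalized information gain" (assuming a uniform action prior and normalizing by log |S|), not necessarily the paper's exact definition:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def controllability(P_s):
    # P_s has shape (num_actions, num_states); row a is P(. | s, a).
    # Information gain about the next state from knowing the action:
    # IG = H(S') - H(S' | A), normalized to [0, 1] by log |S|.
    P_s = np.asarray(P_s, dtype=float)
    n_actions, n_states = P_s.shape
    marginal = P_s.mean(axis=0)            # P(s') under uniform actions
    conditional = np.mean([entropy(P_s[a]) for a in range(n_actions)])
    return (entropy(marginal) - conditional) / np.log(n_states)

# Identical actions -> 0; deterministic, disjoint outcomes -> higher.
print(controllability([[0.5, 0.5, 0.0], [0.5, 0.5, 0.0]]))  # 0.0
print(controllability([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))  # ~0.63
```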

## Controllability - Usage

• High controllability
  – Control over the outcomes of actions
  – Different actions lead to different parts of the
    state space
  – More variance = more sampling needed
• Take actions leading to controllable states
  – i.e., actions with high Forward Controllability (FC)

## Proposed Method

• Undirected
  – Explore with a state-dependent probability that
    weights STE and FC (a hypothetical form is
    sketched below)
  – For experiments:
    • K1, K2 ∈ {0, 1}, ε ∈ {0.1, 0.4, 0.9}, with the
      remaining constant fixed at 1
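
The slide's exploration-probability formula was lost in extraction. A hypothetical sketch of the idea (scale the base rate ε by a weighted mix of STE and FC; the exact functional form in the paper may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def explore_probability(ste, fc, epsilon=0.4, k1=1, k2=1):
    # Hypothetical form: states with high STE and high forward
    # controllability get explored more often. ste and fc are
    # assumed normalized to [0, 1].
    weight = (k1 * ste + k2 * fc + 1.0) / (k1 + k2 + 1.0)
    return epsilon * weight

def act_undirected(q_values, ste, fc, epsilon=0.4, k1=1, k2=1):
    if rng.random() < explore_probability(ste, fc, epsilon, k1, k2):
        return int(rng.integers(len(q_values)))   # explore uniformly
    return int(np.argmax(q_values))               # exploit
```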

## Proposed Method

• Directed
  – Pick the action maximizing the value estimate plus
    a scaled exploration bonus built from STE and FC
    (a hypothetical form is sketched below)
  – For experiments:
    • K0 ∈ {1, 10, 50}, K1, K2 ∈ {0, 1}, K3 = 1
    • The base exploration bonus is recency-based
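
Likewise, the directed rule's formula is missing; this is a hypothetical reconstruction consistent with the parameters listed above (K0 scales the overall bonus, K1/K2 switch STE and FC on or off, K3 weights the recency term):

```python
import numpy as np

def act_directed(q_values, ste, fc, recency, k0=10, k1=1, k2=1, k3=1):
    # All arguments after q_values are arrays indexed by action.
    bonus = k0 * (k1 * np.asarray(ste)
                  + k2 * np.asarray(fc)
                  + k3 * np.asarray(recency))
    return int(np.argmax(np.asarray(q_values) + bonus))
```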

## Experiments

• Random MDPs (a generator matching this spec is
  sketched below)
  – 225 states
    • 3 actions
    • Branching factor of 1-20
    • Transition probabilities/rewards uniform on [0, 1]
    • 0.01 chance of termination per step
  – Divided into 4 groups
    • Low STE vs. high STE
    • High variation (test) vs. low variation (control)
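
A generator roughly matching that spec (my own code, assuming per-(s, a) successor sets; the 0.01 termination chance is left to the simulation loop):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mdp(n_states=225, n_actions=3, max_branch=20):
    P = np.zeros((n_states, n_actions, n_states))  # transition probs
    R = np.zeros((n_states, n_actions, n_states))  # rewards
    for s in range(n_states):
        for a in range(n_actions):
            branch = rng.integers(1, max_branch + 1)
            succ = rng.choice(n_states, size=branch, replace=False)
            probs = rng.random(branch)             # uniform [0, 1]...
            P[s, a, succ] = probs / probs.sum()    # ...then normalized
            R[s, a, succ] = rng.random(branch)     # uniform [0, 1]
    return P, R

P, R = random_mdp()
assert np.allclose(P.sum(axis=2), 1.0)             # rows are distributions
```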

## Experiments Continued

• Performance measures
  – Return estimates
    • Run the greedy policy from 50 different states,
      30 trials per state; average the returns and
      normalize
  – Penalty measure (reconstructed from the
    definitions below; computed in the sketch after
    this slide):

    \[ \text{Penalty} = \frac{1}{T} \sum_{t=1}^{T} \left( R_{max} - R_t \right) \]

    • \(R_{max}\) = upper limit on the return of the optimal policy
    • \(R_t\) = normalized greedy return after trial t
    • \(T\) = number of trials
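
A direct transcription of that reconstructed penalty (hypothetical helper name):

```python
import numpy as np

def penalty(greedy_returns, r_max=1.0):
    # Average shortfall of the normalized greedy return R_t from
    # the optimal upper limit R_max, over T trials.
    r = np.asarray(greedy_returns, dtype=float)
    return float(np.mean(r_max - r))

print(penalty([0.2, 0.5, 0.9]))  # (0.8 + 0.5 + 0.1) / 3 ≈ 0.467
```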

## Graphs, Glorious Graphs

*(results plots)*

## More Graphs, Glorious Graphs

*(more results plots)*

## Discussion
• Significant results obtained when using STE
  and FC
  – Results correspond with the presence of STE
• Values can be calculated prior to learning
  – But this requires knowledge of the model
• Issues swept under the rug, and more judgement
  calls remain
  – e.g., extension to SARSA

## It's over!
