Using MDP Characteristics to Guide Exploration in Reinforcement Learning

Paper: Bohdana Ratitch & Doina Precup
       Presenter: Michael Simon

Some pictures/formulas gratefully borrowed from slides by Ratitch
           MDP Terminology



• Transition probabilities - P^a_{s,s'}
• Expected reward - R^a_{s,s'}
• Return
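For reference, a sketch of the standard definitions behind this notation (Sutton & Barto style); the discount factor γ is assumed, since it is not listed on the slide:

    P^{a}_{ss'} = \Pr(s_{t+1} = s' \mid s_t = s,\, a_t = a)
    R^{a}_{ss'} = \mathbb{E}[\, r_{t+1} \mid s_t = s,\, a_t = a,\, s_{t+1} = s' \,]
    R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma \le 1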
      Reinforcement Learning
• Learning only from environmental rewards
  – Achieve the best payoff possible
• Must balance exploitation with exploration
  – Exploration can take large amounts of time
• The structure of the problem/model can
  assist exploration, in theory
  – But which characteristics matter in our MDP case?
             Goals/Approach
• Find MDP Characteristics...
  – ... that affect performance...
  – ... and test their impact.
• Use MDP Characteristics...
  – ... to tune parameters.
  – ... to select algorithms.
  – ... to create exploration strategies.
                Back to RL
• Undirected
  – Guarantees sufficient exploration
  – Simple, but can require exponential time
• Directed
  – Extra computation/storage, but possibly
    polynomial time
  – Often uses aspects of the model to its advantage
     RL Methods - Undirected
• ε-greedy exploration
  – Probability 1-ε of exploiting the current
    best greedy guess
  – Explore with probability ε, selecting an action
    uniformly at random
• Boltzmann distribution (softmax over action values)
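A minimal sketch of both undirected rules, assuming tabular action values Q[s][a] (variable and function names here are illustrative, not from the slides):

    import numpy as np

    def epsilon_greedy(Q, s, epsilon, rng=np.random.default_rng()):
        """With prob. 1-epsilon exploit the current greedy action, else act uniformly at random."""
        if rng.random() < epsilon:
            return int(rng.integers(len(Q[s])))     # explore
        return int(np.argmax(Q[s]))                 # exploit

    def boltzmann(Q, s, tau, rng=np.random.default_rng()):
        """Sample an action with probability proportional to exp(Q(s,a)/tau)."""
        prefs = np.array(Q[s]) / tau
        prefs -= prefs.max()                        # for numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return int(rng.choice(len(probs), p=probs))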
         RL Methods - Directed
• Maximize the value estimate plus an exploration bonus
  – Different options for the bonus
     •   Counter-based (least frequently visited)
     •   Recency-based (least recently tried)
     •   Error-based (largest changes in the value estimate)
     •   Interval Estimation (highest variance in samples)
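A sketch of bonus-maximizing action selection with a counter-based bonus; the weight kappa and the bonus form are illustrative assumptions, not a formula from the paper:

    import numpy as np

    def directed_action(Q, counts, s, kappa=1.0):
        """Pick the action maximizing value estimate + exploration bonus.
        Counter-based bonus: favor the least frequently tried actions.
        counts[s]: per-action visit counts for state s (assumed bookkeeping)."""
        bonus = 1.0 / (1.0 + counts[s])
        return int(np.argmax(Q[s] + kappa * bonus))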
          Properties of MDPs
•   State Transition Entropy
•   Controllability
•   Variance of Immediate Rewards
•   Risk Factor
•   Transition Distance
•   Transition Variability
      State Transition Entropy


• Stochasticity of state transitions
  – High STE = good for exploration
• Indicates the potential variance of samples
  – High STE = more samples needed for reliable estimates
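State transition entropy for a state-action pair is the Shannon entropy of its next-state distribution; a sketch (normalizing by the log of the number of successors is an assumption):

    import numpy as np

    def state_transition_entropy(P_sa, normalize=True):
        """P_sa: vector of next-state probabilities for one (s, a) pair."""
        p = np.asarray(P_sa, dtype=float)
        p = p[p > 0]                                 # treat 0 * log 0 as 0
        h = -np.sum(p * np.log(p))
        if normalize and len(p) > 1:
            h /= np.log(len(p))                      # scale to [0, 1] (assumed normalization)
        return h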
   Controllability - Calculation
• How much the environment’s response
  differs across actions
  – Can also be thought of as normalized
    information gain
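One plausible formalization, not necessarily the paper's exact definition: measure how much the next-state distributions of the actions at s diverge from their average (an information gain), normalized, and define forward controllability of (s, a) as the expected controllability of the successor states:

    import numpy as np

    def controllability(P_s):
        """P_s: |A| x |S| matrix of next-state distributions at one state s.
        Information gain: entropy of the mixture minus mean per-action entropy."""
        P_s = np.asarray(P_s, dtype=float)

        def H(p):
            p = p[p > 0]
            return -np.sum(p * np.log(p))

        gain = H(P_s.mean(axis=0)) - np.mean([H(row) for row in P_s])
        n_actions = P_s.shape[0]
        return gain / np.log(n_actions) if n_actions > 1 else 0.0  # assumed normalization

    def forward_controllability(P_sa, C):
        """Expected controllability of successors: sum_s' P(s'|s,a) * C(s')."""
        return float(np.dot(P_sa, C))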
       Controllability - Usage
• High controllability
     • The agent has real control via its actions
     • Different actions lead to different parts of the space
     • More variance = more sampling needed
• Take actions leading to controllable states
• Prefer actions with high Forward Controllability (FC)
            Proposed Method
• Undirected
  – Explore w/ probability ε
  – For experiments
     • K1, K2 ∈ {0, 1}, τ = 1, ε ∈ {0.1, 0.4, 0.9}
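The formula on this slide was an image and did not survive extraction. Purely as an illustration of the general idea, and consistent with the listed parameters, here is one way the exploratory step of ε-greedy could be biased by K1/K2-weighted STE and FC scores with temperature τ; this should not be read as the authors' exact rule:

    import numpy as np

    def undirected_guided(Q, STE, FC, s, epsilon, K1=1, K2=1, tau=1.0,
                          rng=np.random.default_rng()):
        """epsilon-greedy, but exploratory actions favor high-STE / high-FC choices.
        STE[s], FC[s]: per-action scores for state s (illustrative assumption)."""
        if rng.random() >= epsilon:
            return int(np.argmax(Q[s]))              # exploit
        prefs = (K1 * STE[s] + K2 * FC[s]) / tau     # exploration preferences
        prefs = prefs - prefs.max()
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return int(rng.choice(len(probs), p=probs))  # explore, guided by MDP characteristics

Note that with K1 = K2 = 0 this falls back to plain ε-greedy, which matches the {0, 1} settings used as controls.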
             Proposed Method
• Directed
  – Pick the action maximizing a weighted combination
    of value estimate and exploration bonus
  – For experiments
     • K0 ∈ {1, 10, 50}, K1, K2 ∈ {0, 1}, K3 = 1
     • The exploration bonus is recency-based
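Again, the exact expression was lost with the slide image. As an illustration only, one way to combine the listed ingredients (value estimate, recency-based bonus, STE, FC, weights K0-K3) into a single score to maximize; the weighting below is an assumption, not the paper's formula:

    import numpy as np

    def directed_guided(Q, recency, STE, FC, s, K0=1.0, K1=1, K2=1, K3=1.0):
        """recency[s]: per-action recency-based bonus (e.g., time since last tried),
        STE[s], FC[s]: per-action MDP-characteristic scores (illustrative assumptions)."""
        score = K3 * Q[s] + K0 * recency[s] * (1.0 + K1 * STE[s] + K2 * FC[s])
        return int(np.argmax(score))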
                   Experiments
• Random MDPs
  – 225 states
     •   3 actions
     •   1-20 branching factor
     •   transition probs/rewards uniform [0,1]
     •   0.01 chance of termination
  – Divided into 4 groups
     • Low STE, High STE
     • High variation (test) vs. low variation (control)
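A sketch of a generator matching the listed setup (225 states, 3 actions, branching factor 1-20, uniform transition weights and rewards, 0.01 termination probability); details such as normalizing the uniform weights and applying termination per step are assumptions:

    import numpy as np

    def random_mdp(n_states=225, n_actions=3, max_branch=20, p_term=0.01,
                   rng=np.random.default_rng()):
        P = np.zeros((n_states, n_actions, n_states))   # transition probabilities
        R = np.zeros((n_states, n_actions, n_states))   # rewards
        for s in range(n_states):
            for a in range(n_actions):
                b = rng.integers(1, max_branch + 1)       # branching factor in [1, 20]
                succ = rng.choice(n_states, size=b, replace=False)
                w = rng.uniform(0, 1, size=b)
                P[s, a, succ] = w / w.sum()               # normalize uniform weights
                R[s, a, succ] = rng.uniform(0, 1, size=b)
        return P, R, p_term                               # p_term: chance of termination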
       Experiments Continued
• Performance Measures
  – Return Estimates
     • Run greedy policy from 50 different states, 30 trials
       per state, average returns, normalize
  – Penalty Measure
     • Rmax = upper bound on the return of the optimal policy
     • Rt = normalized greedy return after trial t
     • T = number of trials
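The expression for the penalty was an image on the slide; a natural reading of the listed quantities, stated here as an assumption, is the average shortfall from the optimal return:

    \text{Penalty} = \frac{1}{T} \sum_{t=1}^{T} \left( R_{\max} - R_t \right)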
Graphs, Glorious Graphs
More Graphs, Glorious Graphs
                Discussion
• Significant results obtained when using STE
  and FC
  – Gains correspond with the presence of STE
• Values can be calculated prior to learning
  – Requires model knowledge
• Some rug sweeping and additional judgment calls
  – e.g., the use of SARSA
It’s over!

				