Hybrid Least-Squares Algorithms for Approximate Policy Evaluation

Jeff Johns, Marek Petrik†, and Sridhar Mahadevan ({johns, petrik, mahadeva}@cs.umass.edu)
Autonomous Learning Laboratory & †Resource-Bounded Reasoning Laboratory, University of Massachusetts Amherst



Approximate Policy Evaluation

The problem is to represent the value function associated with a specific policy π. For large problems, we can only approximate the value function. Assuming we use linear function approximation with a set of basis functions φ(s), the goal is to produce a set of coefficients w such that

    Ṽ(s) = ∑ᵢ φᵢ(s) wᵢ

is a "good" approximation of Vπ(s). Our research explores the space of least-squares algorithms for computing the coefficients w.

Incremental vs. Least-Squares

Incremental algorithms take a sample (s, a, r, s'), update the coefficient vector w, and then discard the sample. Least-squares algorithms instead take a sample (s, a, r, s') and update sample-based statistics; when needed, the coefficient vector w is computed from these statistics (see the sketch below).

Benefits of least-squares methods:
- More efficient use of data.
- No step-size parameter to tune.

Drawbacks:
- Increased computational complexity.
- Increased storage requirements.
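As a concrete illustration, here is a minimal Python sketch (our own code and naming, not from the poster) of how an LSTD-style method, the sample-based form of the fixed-point (FP) method defined below, accumulates statistics A and b from samples and computes w only when needed:

    import numpy as np

    def accumulate_statistics(samples, phi, gamma, k):
        # Sample-based statistics for the fixed-point (LSTD) system A w = b.
        # samples: iterable of (s, r, s_next) transitions under policy pi
        # phi:     feature map, phi(s) -> length-k numpy array
        A = np.zeros((k, k))
        b = np.zeros(k)
        for s, r, s_next in samples:
            f, f_next = phi(s), phi(s_next)
            A += np.outer(f, f - gamma * f_next)
            b += r * f
        return A, b

    def solve_weights(A, b):
        # Unlike an incremental method, w is computed from the statistics
        # only when needed; the samples themselves can be discarded.
        return np.linalg.solve(A, b)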
Two Popular Least-Squares Methods

Bellman Residual (BR) Minimization
Minimize the norm of the Bellman residual:

    L_BR(w) = ‖Tπ(Φw) − Φw‖

where Ṽ = Φw, Tπ is the Bellman operator, and Tπ(Φw) = Rπ + γ Pπ Φw.

Fixed Point (FP) Method
Minimize the norm of the projected Bellman residual:

    L_FP(w) = ‖Π Tπ(Φw) − Φw‖

where Π = Φ (ΦᵀΦ)⁻¹ Φᵀ is the orthogonal projection onto the span of the basis functions.
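For concreteness, a minimal model-based sketch (our own code; it assumes the matrices Φ, Pπ, Rπ are available explicitly) of the closed-form BR and FP solutions:

    import numpy as np

    def br_weights(Phi, P, R, gamma):
        # BR: least-squares fit of (Phi - gamma P Phi) w to R,
        # i.e. the normal equations C^T C w = C^T R.
        C = Phi - gamma * P @ Phi
        return np.linalg.solve(C.T @ C, C.T @ R)

    def fp_weights(Phi, P, R, gamma):
        # FP: solve Phi^T (Phi - gamma P Phi) w = Phi^T R,
        # the model-based form of the LSTD system sketched earlier.
        C = Phi - gamma * P @ Phi
        return np.linalg.solve(Phi.T @ C, Phi.T @ R)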
Comparison
- Minimizing the Bellman error norm L_BR(w) results in tighter performance bounds.
- When using (single) samples, BR is biased and FP is unbiased.
- Empirical policy iteration behavior: BR tends to be more stable between policy iteration rounds, while FP results in better policies when it converges.

Geometry of the Bellman Equation

[Figure: the approximation Φw, its image Tπ(Φw) under the Bellman operator, and the projection Π Tπ(Φw) back onto the span of Φ. BR minimizes the distance from Φw to Tπ(Φw); FP minimizes the distance from Φw to Π Tπ(Φw); the hybrid approaches lie in between BR and FP.]

Hybrid Least-Squares Methods

We evaluated two hybrid approaches (H1 and H2) for combining the BR and FP loss functions:

    L_H1(w) = ξ L_BR(w) + (1−ξ) L_FP(w)
    L_H2(w) = ξ L_BR(w) + (1−ξ) L_FP(u, w)

The difference between H1 and H2 is when the fixed-point constraint is enforced.

Projections of Vπ
Each method reduces to a least-squares problem A w = b, where the matrix A and vector b are sample-based statistics, and each method is a different weighted projection of the target function Vπ:

    Ṽ = Φ (Φᵀ D Φ)⁻¹ Φᵀ D Vπ

    D_BR = (I − γ Pπ)ᵀ (I − γ Pπ)
    D_FP = (I − γ Pπ)
    D_H2 = (I − ξ γ Pπ)ᵀ (I − γ Pπ)
    D_H1 = (I − γ Pπ)ᵀ (ξ I + (1−ξ) Π) (I − γ Pπ)
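Continuing the model-based sketch above (our own code), the hybrid solutions in closed form; note that ξ = 0 recovers FP and ξ = 1 recovers BR:

    import numpy as np

    def h2_weights(Phi, P, R, gamma, xi):
        # H2 mixes the BR and FP normal equations linearly:
        # (Phi - xi gamma P Phi)^T (Phi - gamma P Phi) w
        #     = (Phi - xi gamma P Phi)^T R
        C = Phi - gamma * P @ Phi
        B = Phi - xi * gamma * P @ Phi
        return np.linalg.solve(B.T @ C, B.T @ R)

    def h1_weights(Phi, P, R, gamma, xi):
        # H1 weights the Bellman residual by xi I + (1 - xi) Proj,
        # where Proj is the orthogonal projection onto span(Phi).
        C = Phi - gamma * P @ Phi
        Proj = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
        W = xi * np.eye(Phi.shape[0]) + (1 - xi) * Proj
        return np.linalg.solve(C.T @ W @ C, C.T @ W @ R)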
"Backwards Bootstrapping"

The BR objective causes the value of a state to look like the values of both its preceding states and its successor states. Example MDP (γ = 1): a1 transitions to b and a2 transitions to c; b then terminates with reward 1, and c terminates with reward 0. The features alias a1 and a2:

    Φ = [ 1  0  0 ]   a1
        [ 1  0  0 ]   a2
        [ 0  1  0 ]   b
        [ 0  0  1 ]   c

             Vπ    Ṽ_FP    Ṽ_H2 (ξ=.33)   Ṽ_H2 (ξ=.67)   Ṽ_BR
    a1        1     0.5        0.5            0.5          0.5
    a2        0     0.5        0.5            0.5          0.5
    b         1     1          0.875          0.8          0.75
    c         0     0          0.125          0.2          0.25

FP recovers the values of b and c exactly, while BR pulls them toward the values of their aliased predecessors; the hybrid methods interpolate between the two extremes.
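A minimal sketch (our own construction of this example, reusing the solvers sketched above) that reproduces the table:

    import numpy as np

    # a1 -> b -> terminal (reward 1); a2 -> c -> terminal (reward 0).
    Phi = np.array([[1., 0., 0.],
                    [1., 0., 0.],
                    [0., 1., 0.],
                    [0., 0., 1.]])
    P = np.zeros((4, 4))
    P[0, 2] = 1.0                    # a1 -> b
    P[1, 3] = 1.0                    # a2 -> c
    R = np.array([0., 0., 1., 0.])   # reward 1 received on leaving b
    gamma = 1.0

    for name, w in [("FP", fp_weights(Phi, P, R, gamma)),
                    ("H2 xi=1/3", h2_weights(Phi, P, R, gamma, 1/3)),
                    ("H2 xi=2/3", h2_weights(Phi, P, R, gamma, 2/3)),
                    ("BR", br_weights(Phi, P, R, gamma))]:
        print(name, Phi @ w)         # approximate values for (a1, a2, b, c)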
Policy Iteration Results on Grid MDP

Experiment: 500 policy iteration trials. For each trial:
- Initialize w randomly
- π ← greedy policy w.r.t. Φw
- w ← leastSquares(Rπ, Pπ, Φ, ξ)   // Ṽ = Φw
- Iterate until w converges or 500 iterations are reached

[Figure: ‖V* − Ṽ‖ as a function of ξ for each method (BR, FP, H1, H2), plotted separately for converged and non-converged trials.]
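A minimal sketch of one such trial (our own driver; greedy_policy and compute_model are hypothetical helpers standing in for the grid-MDP specifics, and solve can be any of the solvers above, e.g. lambda Phi, P, R, g: h2_weights(Phi, P, R, g, xi=0.5)):

    import numpy as np

    def policy_iteration_trial(Phi, gamma, solve, greedy_policy,
                               compute_model, max_iters=500, tol=1e-8):
        # One trial: start from a random w, then alternate greedy policy
        # improvement with least-squares policy evaluation.
        w = np.random.randn(Phi.shape[1])     # initialize w randomly
        for _ in range(max_iters):
            pi = greedy_policy(Phi, w)        # greedy w.r.t. V~ = Phi w
            R_pi, P_pi = compute_model(pi)    # reward/transitions under pi
            w_new = solve(Phi, P_pi, R_pi, gamma)
            if np.linalg.norm(w_new - w) < tol:
                return w_new, True            # w converged
            w = w_new
        return w, False                       # hit the iteration cap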
Conclusions
- The hybrid least-squares algorithm H2 generalizes BR and FP by linearly combining their loss functions: ξ = 0 is FP, ξ = 1 is BR.
- Promising results on two policy iteration tasks (grid MDP and Tetris) for intermediate values of ξ.
- Open question: can ξ be determined automatically based on the geometry of the Bellman equation?

Behavior of Least-Squares Policy Iteration in Two-Room Grid MDP

[Figure: value functions produced by BR, FP, and hybrid H2 (ξ = 0.5), shown from a random initial policy through iterations 1, 5, 10, 15, 20, and 25, ending with the greedy policy.]

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:0
posted:5/3/2013
language:Unknown
pages:1