Le 8, 2011-03-02                                             Neural Networks and Learning Systems, TBMI26


Recurrent networks
Reinforcement learning

TBMI 26

Magnus Borga

[Figure: a learning system]




Markov Decision Process

The next state depends on the current state (input) and the output of the system:

    x_{t+1} = f(x_t, a_t, e_t)

Which state x_{t+1} the system ends up in depends on where it is (x_t) and what it does (a_t).

This is actually an ordinary Markov process, since a_t = µ(x_t).

Markov Decision Process

[Figure: the learning system receives the state x from the environment and outputs an action a;
 the environment produces the next state x_{t+1} = f(x_t, a_t, e_t), and the system acts according
 to a_t = µ(x_t)]
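As a concrete illustration of this loop, here is a minimal Python sketch. The toy transition function, the state range, the noise model and the policy are assumptions made for illustration only; they are not part of the lecture.

    import random

    # Hypothetical toy environment: states 0..4. The transition function f(x, a, e)
    # moves one step in the chosen direction, and the noise e occasionally flips it.
    def f(x, a, e):
        step = a if e == 0 else -a
        return max(0, min(4, x + step))

    # A fixed (stationary) policy: a_t = mu(x_t).
    def mu(x):
        return 1 if x < 3 else -1

    x = 2                                   # initial state x_0
    for t in range(10):
        a = mu(x)                           # a_t = mu(x_t)
        e = random.choice([0, 0, 0, 1])     # disturbance e_t (flips the move 1 time in 4)
        x = f(x, a, e)                      # x_{t+1} = f(x_t, a_t, e_t)
        print(t, a, x)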




Cost

g(x, a) – the cost for making action a in state x.

[Figure: a small state graph with example transition costs g = 1, g = 0.1 and g = 0.5]

The value function

•  The value of a state x given a certain policy µ is the accumulated cost:

       J^µ(x_t) = Σ_{i=0}^∞ γ^i g(x_{t+i}, a_{t+i})

•  γ is a "discount factor" that makes costs decrease with time, 0 < γ ≤ 1.
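A minimal sketch of this sum in Python, truncated to a finite sequence of observed costs (the example numbers are invented for illustration):

    def accumulated_cost(costs, gamma=0.9):
        """Discounted accumulated cost J = sum_i gamma^i * g_i for a finite
        sequence of costs g_0, g_1, ... collected while following a policy."""
        return sum((gamma ** i) * g for i, g in enumerate(costs))

    # Example: costs 1, 0.5, 1 along one trajectory with gamma = 0.9:
    print(accumulated_cost([1.0, 0.5, 1.0]))    # 1 + 0.9*0.5 + 0.81*1 = 2.26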








Accumulated cost

[Figure: the state graph again, now annotated with the accumulated cost J(x) of each state for
 γ = 0.9; some values (1, 0, 1.4) are filled in and one state is marked "?"]

Dynamic programming

•  Given a Markov decision process, find a (stationary) policy µ that minimizes the accumulated
   cost J for all initial states x_0.




Optimal policy

The optimal policy µ* gives the smallest cost:

    J^{µ*}(x) = J*(x) ≤ J^µ(x)   ∀ x, µ

    J^µ(x_t) = Σ_{i=0}^∞ γ^i g_{t+i} = g_t + Σ_{i=1}^∞ γ^i g_{t+i} = g_t + γ J^µ(x_{t+1})

    J*(x_t) = g(x_t, µ*(x_t)) + γ J*(x_{t+1})
            = min_a { g(x_t, a_t) + γ J*(x_{t+1}) }

The last line, where x_{t+1} is the next state, is Bellman's optimality equation (a small
value-iteration sketch based on it is given below).

Richard Bellman

•  Richard Bellman (1920–1984) was an applied mathematician, celebrated for his invention of
   dynamic programming in 1953 and for important contributions to other fields of mathematics.
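When the model f and the costs g are known, Bellman's optimality equation can be solved by iterating the backup J(x) ← min_a { g(x, a) + γ J(f(x, a)) } (value iteration). The sketch below does this for a hypothetical deterministic 4-state chain; the chain, costs and names are assumptions for illustration, not the example from the slides.

    gamma = 0.9
    states = [0, 1, 2, 3]                 # state 3 is a cost-free, absorbing goal
    actions = [-1, +1]

    def f(x, a):                          # known deterministic model x_{t+1} = f(x_t, a_t)
        return 3 if x == 3 else max(0, min(3, x + a))

    def g(x, a):                          # cost of taking action a in state x
        return 0.0 if x == 3 else 1.0

    J = {x: 0.0 for x in states}
    for _ in range(100):                  # repeat the Bellman backup until it converges
        J = {x: min(g(x, a) + gamma * J[f(x, a)] for a in actions) for x in states}

    # The optimal (stationary) policy picks the minimizing action in each state.
    mu_star = {x: min(actions, key=lambda a: g(x, a) + gamma * J[f(x, a)]) for x in states}
    print(J, mu_star)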




Dynamic programming

[Figure: the state graph with the accumulated costs J^µ(x) filled in (1.36, 1.4, 1, 0) for γ = 0.9]

Reinforcement learning

•  On-line version of dynamic programming.
•  The states, possible actions and corresponding costs turn up during learning.
•  "Neurodynamic programming"








Reinforcement learning

[Figure: the learning system receives the state x and the cost g(x, a) from the environment and
 outputs an action a; x_{t+1} = f(x_t, a_t, e_t), a_t = µ(x_t)]

Applications

•  Active systems interacting with the environment, e.g. a robot.
•  Optimization of unknown cost functions, e.g. routing.
•  Can become better than the teacher!




Reinforcement learning

•  The task is defined by a scalar function, g(x, a), i.e. the cost of taking action a in state x.
•  The system's goal is to minimize g over time:

       J = Σ_{t=0}^∞ γ^t g(t)

Reinforcement learning

•  It is often difficult to tell how a task should be solved, but easy to tell if or how well it
   has been solved.
•  More general than supervised learning.
•  The system must find a solution by itself!
•  "Learning by doing" or "trial and error".

   "We learn as we do and we do as well as we have learned."




Reinforcement learning

[Figure: the state graph with the learned accumulated costs J^µ(x) (1.36, 1.9, 1.4, 1, 0) for γ = 0.9]

Problem

   "The system learned to properly land the aircraft with a rate of success of 90% to 96% after
   some 60,000 attempts."








MENACE

Match-box Educable Noughts And Crosses Engine

•  One box for each position.
•  Each box is filled with beans of different colours.
•  Each colour represents a certain move.
•  Play by drawing beans and moving according to the colours.
•  If the system wins, add new beans with the same colours as the ones drawn.
•  If the system loses, remove the drawn beans.

Q-learning

    Q^µ(x_t, a_t) = g(x_t, a_t) + γ J^µ(x_{t+1})

Q^µ(x, a) is the cost of first doing a and then following the policy µ.

    J*(x_t) = min_a { g(x_t, a_t) + γ J*(x_{t+1}) }
            = min_a { Q*(x_t, a_t) }

Now we don't need a model of the environment (x_{t+1} = f(x_t, a_t)) to find the optimal response!
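This is the practical point of the Q-function: picking the best action in a state only requires comparing table entries, not simulating f. A small sketch with a hypothetical Q-table (the states, actions and values are invented for illustration):

    # Hypothetical Q-table: lower values are better, since Q estimates accumulated cost.
    Q = {('s0', 'left'): 1.9, ('s0', 'right'): 1.36,
         ('s1', 'left'): 0.5, ('s1', 'right'): 1.0}

    def best_action(x, actions=('left', 'right')):
        """Return argmin_a Q(x, a): the cheapest action in state x, no model needed."""
        return min(actions, key=lambda a: Q[(x, a)])

    print(best_action('s0'))   # 'right', since Q(s0, right) = 1.36 < 1.9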




Q-learning

    Q^µ(x_t, a_t) = g(x_t, a_t) + γ J^µ(x_{t+1})

    Q^µ(x_t, a_t) = g(x_t, a_t) + γ Q^µ(x_{t+1}, a_{t+1})

Update the estimate of Q with

    ΔQ_t = η ( g(x_t, a_t) + γ Q^µ(x_{t+1}, a_{t+1}) − Q^µ(x_t, a_t) )

where the first two terms are what Q should be and the subtracted term is what Q is now
(a tabular sketch of this update follows after the next slide).

Q-learning

•  Think of the Q-function as a table with a value for each possible action in each possible state.
•  The Q-function can, however, be implemented as a neural network, learning a continuous
   representation from the sampled states and actions.
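A minimal tabular sketch of the ΔQ_t update above, using a Python dictionary as the Q-table. The learning rate η and discount γ follow the slides; the defaultdict storage and the example transition are implementation assumptions for illustration.

    from collections import defaultdict

    Q = defaultdict(float)          # Q-table, every entry initialised to zero

    def q_update(x, a, cost, x_next, a_next, eta=0.1, gamma=0.9):
        """Q(x_t, a_t) += eta * (g(x_t, a_t) + gamma * Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t))"""
        target = cost + gamma * Q[(x_next, a_next)]     # what Q should be
        Q[(x, a)] += eta * (target - Q[(x, a)])         # move the estimate towards the target

    # One hypothetical transition: in state 's0', action 'right' costs 1 and leads to 's1',
    # where the current policy chooses 'left'.
    q_update('s0', 'right', 1.0, 's1', 'left')
    print(Q[('s0', 'right')])       # 0.1 after the first update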




Q-learning

[Figure: the state graph with γ = 0.9; in one state the Q-table lists Q = 1.9 for action 1 and
 Q = 1.36 for action 2, next to the accumulated costs J^µ(x)]

The exploration-exploitation dilemma

•  We want to use the safest strategy to minimize the accumulated cost.
•  We want to try new strategies in order to find a better one.
•  Conflict between exploring the state space and exploiting the learnt policy.
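A common compromise (not prescribed by the slides) is ε-greedy action selection: exploit the learnt Q-values most of the time, but explore with a small probability ε. The function and table names below are illustrative assumptions.

    import random

    def epsilon_greedy(Q, x, actions, epsilon=0.1):
        """With probability epsilon explore (random action); otherwise exploit by
        picking the action with the lowest estimated accumulated cost."""
        if random.random() < epsilon:
            return random.choice(actions)                       # explore
        return min(actions, key=lambda a: Q.get((x, a), 0.0))   # exploit

    Q = {('s0', 'left'): 1.9, ('s0', 'right'): 1.36}            # hypothetical Q-table
    print(epsilon_greedy(Q, 's0', ['left', 'right']))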








The credit assignment problem

•  Structural
   - what part is responsible?
•  Temporal
   - when was the crucial action taken?

The temporal credit assignment problem

•  Q-learning only learns one step at a time.
•  Can the learning speed be increased?
•  If the system remembers a sequence of states, the cost for the whole sequence can be updated!
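One way to update a whole remembered sequence (a sketch of the idea, not a specific algorithm from the lecture) is to replay the stored transitions backwards, so that a single pass propagates cost information from the end of the sequence to its beginning. The episode format and names are assumptions for illustration.

    def update_sequence(Q, episode, eta=0.1, gamma=0.9):
        """episode: list of (x_t, a_t, g_t, x_{t+1}, a_{t+1}) transitions in time order."""
        for x, a, cost, x_next, a_next in reversed(episode):
            target = cost + gamma * Q.get((x_next, a_next), 0.0)          # what Q should be
            Q[(x, a)] = Q.get((x, a), 0.0) + eta * (target - Q.get((x, a), 0.0))
        return Q

    # Hypothetical two-step episode: s0 -> s1 -> s2; after one backward pass the
    # earlier state-action pair already reflects the cost incurred later on.
    episode = [('s0', 'right', 1.0, 's1', 'right'),
               ('s1', 'right', 1.0, 's2', 'stop')]
    print(update_sequence({}, episode))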



