Application of Reinforcement Learning in Network Routing

           By
       Chaopin Zhu
Machine Learning

 Supervised Learning
 Unsupervised Learning
 Reinforcement Learning
Supervised Learning
 Feature: Learning with a teacher
 Phases
        • Training phase
        • Testing phase
 Application
        • Pattern recognition
        • Function approximation
Unsupervised Learning
 Feature
        • Learning without a teacher
 Application
        • Feature extraction
        • Other preprocessing
Reinforcement Learning



 Feature: Learning with a critic
 Application
        • Optimization
        • Function approximation
Elements of
Reinforcement Learning

 Agent
 Environment
 Policy
 Reward function
 Value function
 Model of environment (optional)
Reinforcement Learning Problem

(Diagram: agent-environment interaction loop. At step n the agent observes state x_n and reward r_n and selects action a_n; the environment returns the next state x_{n+1} and reward r_{n+1}.)
Markov Decision Process (MDP)


Definition:
 A reinforcement learning task that satisfies
  the Markov property
Transition probabilities:

    P^{a_n}_{x_n y} = \Pr\{ x_{n+1} = y \mid x_n, a_n \}
An Example of MDP

(Diagram: a two-state MDP with battery states High and Low, the recycling-robot example from [1]; in state High the agent may search or wait, and in state Low it may search, wait, or recharge.)
Markov Decision Process (cont.)
   Parameters
               x_n    state
               a_n    action
               r_n    reward
               γ      discount factor
               π      policy
Value functions

    V^\pi(x) = E_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{n+k+1} \mid x_n = x \right]

    Q^\pi(x, a) = E_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{n+k+1} \mid x_n = x, a_n = a \right]
Elementary Methods for
Reinforcement Learning Problem


 Dynamic programming
 Monte Carlo Methods
 Temporal-Difference Learning
Bellman’s Equations

    V^*(x) = \max_a E\left[ r_{n+1} + \gamma V^*(x_{n+1}) \mid x_n = x, a_n = a \right]

    Q^*(x, a) = E\left[ r_{n+1} + \gamma \max_{a'} Q^*(x_{n+1}, a') \mid x_n = x, a_n = a \right]
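The Bellman optimality equation for V* can be turned directly into an update rule (value iteration). Below is a minimal sketch assuming the MDP is given as explicit tables; the P and R dictionaries, and their numeric values (loosely modeled on the two-state example earlier), are placeholders for illustration, not from the source.

# Minimal value-iteration sketch: apply the Bellman optimality update until
# the value function stops changing. P[x][a] is a list of
# (probability, next_state) pairs; R[x][a] is the expected immediate reward.
GAMMA = 0.9
THETA = 1e-6

P = {
    "High": {"search": [(0.7, "High"), (0.3, "Low")], "wait": [(1.0, "High")]},
    "Low":  {"search": [(0.6, "Low"), (0.4, "High")], "wait": [(1.0, "Low")],
             "recharge": [(1.0, "High")]},
}
R = {
    "High": {"search": 2.0, "wait": 1.0},
    "Low":  {"search": 1.0, "wait": 1.0, "recharge": 0.0},
}

def value_iteration(P, R, gamma=GAMMA, theta=THETA):
    V = {x: 0.0 for x in P}
    while True:
        delta = 0.0
        for x in P:
            # V*(x) = max_a E[ r_{n+1} + gamma * V*(x_{n+1}) ]
            v_new = max(R[x][a] + gamma * sum(p * V[y] for p, y in P[x][a])
                        for a in P[x])
            delta = max(delta, abs(v_new - V[x]))
            V[x] = v_new
        if delta < theta:
            break
    # Read off a greedy policy from V via the Bellman equation for Q*.
    policy = {x: max(P[x], key=lambda a: R[x][a] +
                     gamma * sum(p * V[y] for p, y in P[x][a]))
              for x in P}
    return V, policy

V, policy = value_iteration(P, R)
print(V, policy)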
Dynamic Programming Methods
   Policy evaluation

     V_{k+1}(x) = E_\pi\left[ r_{n+1} + \gamma V_k(x_{n+1}) \mid x_n = x \right]

   Policy improvement

     \pi'(x) = \arg\max_a Q^\pi(x, a)

     Q^\pi(x, a) = E_\pi\left[ r_{n+1} + \gamma V^\pi(x_{n+1}) \mid x_n = x, a_n = a \right]
Dynamic Programming (cont.)

    \pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} V^*

E ---- policy evaluation
I ---- policy improvement
 Policy Iteration
 Value Iteration
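The evaluation/improvement cycle above can be written out as a short policy-iteration routine. This is a sketch assuming the same table-style MDP representation as the value-iteration example earlier (P[x][a] as (probability, next_state) pairs, R[x][a] as expected immediate reward); it is illustrative, not the author's code.

def policy_iteration(P, R, gamma=0.9, theta=1e-6):
    """Policy iteration on a finite MDP given as tables (same format as above)."""
    states = list(P)
    policy = {x: next(iter(P[x])) for x in states}   # arbitrary initial policy
    V = {x: 0.0 for x in states}

    def q(x, a):
        # One-step lookahead: Q(x, a) = E[ r_{n+1} + gamma * V(x_{n+1}) ]
        return R[x][a] + gamma * sum(p * V[y] for p, y in P[x][a])

    while True:
        # Policy evaluation: sweep V_{k+1}(x) = E_pi[ r_{n+1} + gamma * V_k(x_{n+1}) ]
        while True:
            delta = 0.0
            for x in states:
                v_new = q(x, policy[x])
                delta = max(delta, abs(v_new - V[x]))
                V[x] = v_new
            if delta < theta:
                break
        # Policy improvement: pi'(x) = argmax_a Q_pi(x, a)
        stable = True
        for x in states:
            best = max(P[x], key=lambda a: q(x, a))
            if best != policy[x]:
                policy[x] = best
                stable = False
        if stable:
            return policy, V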
Monte Carlo Methods
   Feature
           • Learning from experience
           • Do not need complete transition
             probabilities
   Idea
           • Partition experience into episodes
           • Average sample return
            • Update on an episode-by-episode basis
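As a concrete illustration of averaging sample returns per episode, here is a minimal first-visit Monte Carlo prediction sketch. It assumes episodes have already been collected as lists of (state, reward) pairs; the episode data at the bottom is a hypothetical placeholder.

from collections import defaultdict

GAMMA = 0.9

def mc_first_visit_prediction(episodes, gamma=GAMMA):
    """Estimate V(x) by averaging first-visit sample returns per state.

    Each episode is a list of (state, reward) pairs, where reward is the
    reward received after leaving that state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        first_visit = {}
        # Record the first time step at which each state appears.
        for t, (x, _) in enumerate(episode):
            first_visit.setdefault(x, t)
        # Walk backwards, accumulating the discounted return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            x, r = episode[t]
            G = r + gamma * G
            if first_visit[x] == t:          # first visit only
                returns_sum[x] += G
                returns_count[x] += 1
    return {x: returns_sum[x] / returns_count[x] for x in returns_sum}

# Hypothetical episode data for illustration.
episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)],
            [("B", 1.0), ("A", 0.0)]]
print(mc_first_visit_prediction(episodes))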
Temporal-Difference Learning
 Features
  (Combination of Monte Carlo and DP ideas)
        • Learn from experience (Monte
          Carlo)
        • Update estimates based in part on
          other learned estimates (DP)
 The TD(λ) algorithm seamlessly integrates TD
  and Monte Carlo methods
TD(0) Learning
Initialize V(x) arbitrarily
π ← the policy to be evaluated
Repeat (for each episode):
   Initialize x
   Repeat (for each step of episode):
         a ← action given by π for x
         Take action a; observe reward r and next state x'
         V(x) ← V(x) + α [ r + γ V(x') - V(x) ]
         x ← x'
   until x is terminal
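A minimal Python sketch of the TD(0) pseudocode above. The env.reset()/env.step(action) interface and the policy callable are assumptions made for illustration, not part of the source.

from collections import defaultdict

ALPHA = 0.1    # step size
GAMMA = 0.9    # discount factor

def td0_evaluate(env, policy, num_episodes=1000, alpha=ALPHA, gamma=GAMMA):
    """Tabular TD(0) policy evaluation.

    Assumes env.reset() returns an initial state and env.step(a) returns
    (next_state, reward, done); policy(x) returns the action for state x.
    """
    V = defaultdict(float)                     # V(x) initialized to 0
    for _ in range(num_episodes):
        x = env.reset()
        done = False
        while not done:
            a = policy(x)
            x_next, r, done = env.step(a)
            # TD(0) update: V(x) <- V(x) + alpha * [r + gamma * V(x') - V(x)]
            target = r + (0.0 if done else gamma * V[x_next])
            V[x] += alpha * (target - V[x])
            x = x_next
    return V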
Q-Learning
Initialize Q(x, a) arbitrarily
Repeat (for each episode):
  Initialize x
  Repeat (for each step of episode):
        Choose a from x using a policy derived from Q (e.g., ε-greedy)
        Take action a; observe r, x'
        Q(x, a) ← Q(x, a) + α [ r + γ max_{a'} Q(x', a') - Q(x, a) ]
        x ← x'
  until x is terminal
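The Q-learning pseudocode above as a tabular Python sketch with an ε-greedy behavior policy; the environment interface and the explicit action list are again assumptions for illustration.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def q_learning(env, actions, num_episodes=1000,
               alpha=ALPHA, gamma=GAMMA, epsilon=EPSILON):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    Assumes env.reset() returns an initial state and env.step(a) returns
    (next_state, reward, done); actions is the list of available actions.
    """
    Q = defaultdict(float)                      # Q[(x, a)], initialized to 0

    def greedy(x):
        return max(actions, key=lambda a: Q[(x, a)])

    for _ in range(num_episodes):
        x = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection derived from Q
            a = random.choice(actions) if random.random() < epsilon else greedy(x)
            x_next, r, done = env.step(a)
            # Q-learning update using the max over next-state actions
            best_next = 0.0 if done else max(Q[(x_next, b)] for b in actions)
            Q[(x, a)] += alpha * (r + gamma * best_next - Q[(x, a)])
            x = x_next
    return Q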
Q-Routing

Q_x(y, d) ---- estimated time that a packet would take to
  reach the destination node d from the current node x via
  x's neighbor node y
T_y(d) ------- y's estimate of the time remaining in the trip:

    T_y(d) = \min_{z \in N(y)} Q_y(z, d)

q_y ---------- queueing time in node y
T_xy --------- transmission time between x and y
Algorithm of Q-Routing
1.   Set initial Q-values for each node
2.   Get the first packet from the packet queue of node x
3.   Choose the best neighbor node ŷ = \arg\min_{y \in N(x)} Q_x(y, d)
     and forward the packet to node ŷ
4.   Get the estimated value T_ŷ(d) + q_ŷ back from node ŷ
5.   Update
         Q_x(ŷ, d) ← Q_x(ŷ, d) + η [ T_xŷ + q_ŷ + T_ŷ(d) - Q_x(ŷ, d) ]
     where η is the learning rate
6.   Go to 2.
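A minimal sketch of the Q-routing table and the per-hop update in Python. The table layout (Q[x][(y, d)]), the learning-rate value, and the helper names are assumptions for illustration; the queueing and transmission times would be supplied by the network simulator.

ETA = 0.5   # learning rate for the Q-routing update (assumed value)

class QRouter:
    """Tabular Q-routing: Q[x][(y, d)] estimates the delivery time from
    node x to destination d when forwarding via neighbor y."""

    def __init__(self, neighbors, eta=ETA):
        # neighbors[x] is the list of neighbor node ids of node x
        self.neighbors = neighbors
        self.eta = eta
        self.Q = {x: {(y, d): 0.0 for y in ys for d in neighbors}
                  for x, ys in neighbors.items()}

    def best_neighbor(self, x, d):
        # Step 3: y_hat = argmin over neighbors y of Q_x(y, d)
        return min(self.neighbors[x], key=lambda y: self.Q[x][(y, d)])

    def remaining_estimate(self, y, d):
        # T_y(d) = min over z in N(y) of Q_y(z, d)
        if y == d:
            return 0.0
        return min(self.Q[y][(z, d)] for z in self.neighbors[y])

    def update(self, x, y_hat, d, q_y, t_xy):
        # Step 5: Q_x(y_hat, d) += eta * (T_xy + q_y + T_y_hat(d) - Q_x(y_hat, d))
        target = t_xy + q_y + self.remaining_estimate(y_hat, d)
        self.Q[x][(y_hat, d)] += self.eta * (target - self.Q[x][(y_hat, d)])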
Dual Reinforcement Q-Routing

(Diagram: a packet traveling from source s to destination d crosses the link from node x to node y. Forward exploration updates Q_x(y, d) at the sender x, while backward exploration lets the receiver y update Q_y(x, s) toward the packet's source s.)
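A sketch of how the backward (dual-reinforcement) update could be added on top of the QRouter above. The packet fields and the timing arguments are assumed to be carried with the packet; this is an illustrative reading of the diagram, not the author's code.

def on_packet_received(router, x, y, packet, q_x, t_xy):
    """Dual reinforcement: when node y receives a packet from x, it also
    updates its estimate toward the packet's source s (backward exploration).

    packet.src, the queueing time q_x at the sender, and the transmission
    time t_xy are assumed to be provided by the simulator.
    """
    s = packet.src
    # Backward exploration: y refines Q_y(x, s) using x's estimate toward s.
    target = t_xy + q_x + router.remaining_estimate(x, s)
    router.Q[y][(x, s)] += router.eta * (target - router.Q[y][(x, s)])
    # Forward exploration (standard Q-routing) is still performed at the
    # sender x via router.update(...), as in the previous sketch.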
Network Model

(Figure: the simulated network consists of two interconnected subnets, Subnet 1 and Subnet 2.)
Network Model (cont.)

(Figure: a 6 × 6 grid of 36 nodes; the left half, nodes 1-18, forms Subnet 1 and the right half, nodes 19-36, forms Subnet 2.)
Node Model

(Diagram: each node consists of a packet generator, a packet destroyer, a routing controller, and four input queues paired with four output queues, all connected through the routing controller.)
Routing Controller

(State diagram: the routing controller process starts in an Init state, moves to an Idle state via a default transition, and switches to Arrival and Departure states when a packet arrives at or departs from the node.)
Initialization/Termination Procedures

 Initialization
      • Initialize and/or register global variables
      • Initialize the routing table
 Termination
      • Destroy the routing table
      • Release memory
Arrival Procedure
 Data packet arrival
    • Update the routing table
    • Route the packet (with control information),
      or destroy it if it has reached its destination
 Control information packet arrival
    • Update the routing table
    • Destroy the packet
Departure Procedure



 Set all fields of the packet
 Get the shortest route
 Send the packet along that route
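A schematic sketch tying the arrival and departure procedures to the QRouter from the earlier sketches. The packet fields, the control/data packet distinction, and the output-queue interface are illustrative assumptions rather than the author's actual simulation model.

class RoutingController:
    """Schematic arrival/departure handlers for one node's routing controller.

    router is the QRouter sketched earlier (one shared table for the whole
    simulated network); packet fields (kind, dst, sender, q_time, tx_time)
    and the output-queue objects are illustrative assumptions.
    """

    def __init__(self, node_id, router, out_queues):
        self.node_id = node_id
        self.router = router
        self.out_queues = out_queues          # neighbor id -> output queue

    def on_arrival(self, packet):
        if packet.kind == "control":
            # Feedback from neighbor y: update Q_x(y, d), then destroy the packet.
            self.router.update(self.node_id, packet.sender, packet.dst,
                               packet.q_time, packet.tx_time)
            return
        # (With dual reinforcement, a data-packet arrival would also trigger
        #  the backward update sketched earlier.)
        if packet.dst == self.node_id:
            return                            # data packet reached its destination
        self.on_departure(packet)             # otherwise forward the data packet

    def on_departure(self, packet):
        # Choose the neighbor with the smallest estimated delivery time and
        # place the packet on the corresponding output queue.
        y = self.router.best_neighbor(self.node_id, packet.dst)
        self.out_queues[y].append(packet)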
References
[1] Richard S. Sutton and Andrew G. Barto,
  Reinforcement Learning: An Introduction
[2] Chengan Guo, Applications of
  Reinforcement Learning in Sequence
  Detection and Network Routing
[3] Simon Haykin, Neural Networks: A
  Comprehensive Foundation

				