
Learning in networks (and other asides)
A preliminary investigation & some comments

Yu-Han Chang
Joint work with Tracey Ho and Leslie Kaelbling
AI Lab, MIT

NIPS Multi-agent Learning Workshop, Whistler, BC, 2002
Networks: a multi-agent system
• Graphical games [Kearns, Ortiz, Guestrin, …]
• Real networks, e.g. a LAN [Boyan, Littman, …]
• “Mobile ad-hoc networks” [Johnson, Maltz, …]
• Mobilized ad-hoc networks
  – Mobile sensors, tracking agents, …
  – Generally a distributed system that wants to optimize some global reward function
Learning
• Nash equilibrium is the phrase of the day, but is it a good solution?
• Other equilibria, i.e. refinements of NE

1. Can we do better than Nash equilibrium? (Game-playing approach)
2. Perhaps we want to just learn some good policy in a distributed manner. Then what? (Distributed problem solving)
What are we studying?

                Single agent                 Multiple agents
Learning        RL, NDP                      Stochastic games, learning in games, …
Known world     Decision theory, planning    Game theory
Part I: Learning

[Diagram: the single-agent RL loop. The learning algorithm receives rewards and observations/sensations from the world (state) and outputs a policy, which selects actions that act back on the world.]
Learning to act in the world

[Diagram: the same loop, but the “world” now consists of the environment together with other agents (possibly learning themselves); how our actions interact with theirs is unknown.]
A simple example
• The problem: Prisoner’s Dilemma
• Possible solutions: space of policies
• The solution metric: Nash equilibrium

                        Player 2: Cooperate    Player 2: Defect
Player 1: Cooperate           1, 1                  -2, 2
Player 1: Defect              2, -2                 -1, -1

(Each cell lists Player 1’s reward, then Player 2’s reward.)
That Folk Theorem
• For discount factors close to 1, any individually rational payoffs are feasible (and are Nash) in the infinitely repeated game

[Figure: the feasible payoff region of the Prisoner’s Dilemma in the (R1, R2) plane, the convex hull of (1, 1), (-2, 2), (2, -2), and (-1, -1); the safety value is marked at (-1, -1).]
Better policies: Tit-for-Tat
• Expand our notion of policies to include maps from past history to actions
• Our choice of action now depends on previous choices (i.e. non-stationary)

Tit-for-Tat policy (history = last period’s play):
  ( · , Defect )    → Defect
  ( · , Cooperate ) → Cooperate
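As a concrete illustration (not from the slides), a reactionary policy such as Tit-for-Tat is just a map from the last period’s play to our next action. The Python sketch below assumes the action names 'C' and 'D' for illustration.

```python
# Illustrative sketch (not from the slides): Tit-for-Tat as a reactionary
# policy, i.e. a map from last period's play to our next action.
# Action names 'C' (Cooperate) and 'D' (Defect) are assumed for illustration.

def tit_for_tat(history):
    """history is a list of (our_action, their_action) pairs."""
    if not history:              # first round: cooperate
        return 'C'
    _, their_last = history[-1]  # ( . , Defect ) -> Defect, ( . , Cooperate ) -> Cooperate
    return their_last

# Example: the opponent defected last round, so we defect now.
print(tit_for_tat([('C', 'C'), ('C', 'D')]))  # -> 'D'
```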
Types of policies & consequences
• Stationary: 1 → A_t
  – At best, leads to the same outcome as the single-shot Nash equilibrium against rational opponents
• Reactionary: { (h_{t-1}) } → A_t
  – Tit-for-Tat achieves the “best” outcome in the Prisoner’s Dilemma
• Finite memory: { (h_{t-n}, …, h_{t-2}, h_{t-1}) } → A_t
  – May be useful against more complex opponents or in more complex games
• “Algorithmic”: { (h_1, h_2, …, h_{t-2}, h_{t-1}) } → A_t
  – Makes use of the entire history of actions as it learns over time
Classifying our policy space

We can classify our learning algorithm’s potential power by observing the amount of history its policies can use:

• Stationary: H0
  1 → A_t
• Reactionary: H1
  { (h_{t-1}) } → A_t
• Behavioral / finite memory: Hn
  { (h_{t-n}, …, h_{t-2}, h_{t-1}) } → A_t
• Algorithmic / infinite memory: H∞
  { (h_1, h_2, …, h_{t-2}, h_{t-1}) } → A_t
Classifying our belief space

It’s also important to quantify our belief space, i.e. our assumptions about what types of policies the opponent is capable of playing:

• Stationary: B0
• Reactionary: B1
• Behavioral / finite memory: Bn
• Infinite memory / arbitrary: B∞
A Simple Classification

        B0                  B1             Bn              B∞
H0      Minimax-Q,                                         Bully
        Nash-Q, Corr-Q
H1                                                         Godfather
Hn
H∞      (WoLF) PHC,         Q1-learning    Qt-learning?    ???
        Fictitious Play,
        Q-learning (JAL)
    H x B0 :     Stationary opponent


   Since the opponent is stationary, this case reduces
    the world to an MDP. Hence we can apply any
    traditional reinforcement learning methods

   Policy hill climber (PHC)         [Bowling & Veloso, 02]
       Estimates the gradient in the action space and follows it
       towards the local optimum
   Fictitious play   [Robinson, 51] [Fudenburg & Levine, 95]
       Plays a stationary best response to the statistical frequency
       of the opponent’s play
   Q-learning (JAL)      [Watkins, 89] [Claus & Boutilier, 98]
       Learns Q-values of states and possibly joint actions
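A minimal sketch of one of these methods, fictitious play, against a presumed-stationary opponent in a 2x2 game. The payoff matrix, the uniform prior on opponent actions, and the use of numpy are assumptions of the sketch.

```python
import numpy as np

# Illustrative sketch: fictitious play against a (presumed) stationary opponent.
# R[i, j] is our reward when we play row i and the opponent plays column j.
R = np.array([[1.0, -2.0],
              [2.0, -1.0]])      # Prisoner's Dilemma payoffs for the row player

counts = np.ones(2)              # opponent action counts (uniform prior)

def fictitious_play_action():
    freq = counts / counts.sum()         # empirical opponent mixed strategy
    expected = R @ freq                  # expected reward of each of our rows
    return int(np.argmax(expected))      # stationary best response

def observe_opponent(action_j):
    counts[action_j] += 1

# One round of play: best-respond, then record what the opponent actually did.
a = fictitious_play_action()
observe_opponent(1)              # suppose the opponent defected (column 1)
```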
H0 x B∞: My enemy’s pretty smart

• “Bully” [Littman & Stone, 01]
  – Tries to force the opponent to conform to the preferred outcome by choosing to play only some part of the game matrix

The “Chicken” game (Hawk-Dove):

                              Them: Cooperate (“Swerve”)   Them: Defect (“Drive”)
Us: Cooperate (“Swerve”)             1, 1                       -2, 2   ← undesirable Nash eq.
Us: Defect (“Drive”)                 2, -2                      -5, -5
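One way to read the Bully idea, sketched below under the assumption that the opponent simply best-responds to whatever row we commit to: enumerate our pure commitments and keep the one whose induced best response pays us the most. The matrices and helper name are illustrative, not from the talk.

```python
import numpy as np

# Illustrative sketch of the Bully idea: commit to a single row, assume the
# opponent will best-respond to it, and pick the commitment we like best.
# R = our payoffs, C = opponent payoffs (Chicken / Hawk-Dove from the slide).
R = np.array([[ 1.0, -2.0],
              [ 2.0, -5.0]])
C = np.array([[ 1.0,  2.0],
              [-2.0, -5.0]])

def bully_row(R, C):
    best_row, best_value = None, -np.inf
    for i in range(R.shape[0]):
        j = int(np.argmax(C[i]))      # opponent's best response to row i
        if R[i, j] > best_value:
            best_row, best_value = i, R[i, j]
    return best_row

print(bully_row(R, C))   # in Chicken this commits to Defect ("Drive")
```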
Achieving “perfection”
• Can we design a learning algorithm that will perform well in all circumstances?
  – Prediction
  – Optimization
• But this is not possible!*  [Nachbar, 95] [Binmore, 89]
  * Universal consistency (Exp3 [Auer et al, 02], smoothed fictitious play [Fudenberg & Levine, 95]) does provide a way out, but it merely guarantees that we’ll do almost as well as any stationary policy we could have used.
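For reference, a compact sketch of Exp3, one such universally consistent (no-external-regret) procedure for the adversarial bandit setting. The exploration rate gamma and the assumption that rewards lie in [0, 1] are choices of this sketch, not of the talk.

```python
import math, random

# Illustrative sketch of Exp3 (adversarial bandit) with rewards assumed in [0, 1].
K = 2                 # number of actions
gamma = 0.1           # exploration rate (assumed, not tuned)
weights = [1.0] * K

def exp3_choose():
    total = sum(weights)
    probs = [(1 - gamma) * w / total + gamma / K for w in weights]
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i, probs
    return K - 1, probs

def exp3_update(action, reward, probs):
    # Importance-weighted reward estimate keeps the update unbiased.
    xhat = reward / probs[action]
    weights[action] *= math.exp(gamma * xhat / K)

# One round: choose an arm, receive a reward from the (unknown) environment.
a, probs = exp3_choose()
exp3_update(a, reward=1.0, probs=probs)
```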
A reasonable goal?
• Can we design an algorithm in H∞ x Bn, or in a subclass of H∞ x B∞, that will do well?
  – It should always try to play a best response to any given opponent strategy
  – Against a fully rational opponent, it should thus learn to play a Nash equilibrium strategy
  – It should try to guarantee that we’ll never do too badly
• One possible approach: given knowledge about the opponent, model its behavior and exploit its weaknesses (play a best response)
• Let’s start by constructing a player that plays well against PHC players in 2x2 games
2x2 Repeated Matrix Games
• We choose row i to play
• Opponent chooses column j to play
• We receive reward r_ij, they receive c_ij

            Left          Right
Up          r11, c11      r12, c12
Down        r21, c21      r22, c22
Iterated gradient ascent
• System dynamics for 2x2 matrix games take one of two forms:

[Figure: two phase portraits of the joint policy dynamics, each plotted against Player 1’s probability for Action 1.]   [Singh Kearns Mansour, 00]
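A sketch of the gradient-ascent dynamics analyzed by Singh, Kearns, and Mansour for 2x2 games: each player nudges its probability of playing its first action along the gradient of its own expected payoff, with probabilities clipped to [0, 1]. The step size eta and the matching-pennies payoffs are assumptions of the sketch.

```python
import numpy as np

# Illustrative sketch of gradient-ascent dynamics in a 2x2 game.
# p = P(player 1 plays action 1), q = P(player 2 plays action 1).
R = np.array([[-1.0,  1.0],    # matching pennies: player 1's rewards
              [ 1.0, -1.0]])
C = -R                         # zero-sum: player 2's rewards

def iga_step(p, q, eta=0.01):
    # Gradients of each player's expected payoff w.r.t. its own probability.
    dV1_dp = q * (R[0, 0] - R[1, 0]) + (1 - q) * (R[0, 1] - R[1, 1])
    dV2_dq = p * (C[0, 0] - C[0, 1]) + (1 - p) * (C[1, 0] - C[1, 1])
    p = float(np.clip(p + eta * dV1_dp, 0.0, 1.0))
    q = float(np.clip(q + eta * dV2_dq, 0.0, 1.0))
    return p, q

p, q = 0.9, 0.2
for _ in range(1000):
    p, q = iga_step(p, q)   # in matching pennies the joint policy circles
                            # (approximately) around the mixed Nash (0.5, 0.5)
```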
Can we do better and actually win?
• Singh et al. show that we can achieve Nash payoffs
• But is this a best response? We can do better…
  – Exploit while winning
  – Deceive and bait while losing

Matching pennies:
              Them: Heads    Them: Tails
Us: Heads        -1, 1          1, -1
Us: Tails         1, -1        -1, 1
A winning strategy against PHC
• If winning: play probability 1 on the current preferred action, in order to maximize rewards while winning
• If losing: play a deceiving policy until we are ready to take advantage of the opponent again

[Figure: the resulting cycle in joint policy space, with the probability we play Heads on the horizontal axis and the probability the opponent plays Heads on the vertical axis, both crossing at 0.5.]
Formally, PHC does:
• Keeps and updates Q-values:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\bigl(R + \gamma \max_{a'} Q(s',a')\bigr)$$

• Updates its policy:

$$\pi(s,a) \leftarrow \pi(s,a) + \begin{cases} \delta & \text{if } a = \arg\max_{a'} Q(s,a') \\ -\dfrac{\delta}{|A_i|-1} & \text{otherwise} \end{cases}$$
PHC-Exploiter
• Updates its policy differently depending on whether it is winning or losing.

If $\sum_{a'} \pi_1(s,a')\,Q(s,a') > R_1\bigl(\pi_1^*(s), \pi_2(s)\bigr)$, we are winning:

$$\pi_1(s,a) \leftarrow \begin{cases} 1 & \text{if } a = \arg\max_{a'} Q(s,a') \\ 0 & \text{otherwise} \end{cases}$$

Otherwise, we are losing:

$$\pi_1(s,a) \leftarrow \pi_1(s,a) + \begin{cases} \delta_2 & \text{if } a = \arg\max_{a'} Q(s,a') \\ -\dfrac{\delta_2}{|A_1|-1} & \text{otherwise} \end{cases}$$
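A single-state Python sketch of this update. Q is assumed to be maintained by the usual PHC Q-update, pi1_star stands for our Nash equilibrium strategy, and pi2_hat, delta2_hat are the estimates of the opponent's policy and learning rate discussed on the next slide; all of these names and the matching-pennies payoffs are illustrative.

```python
import numpy as np

# Illustrative single-state sketch of the PHC-Exploiter policy update.
# R: our 2x2 payoff matrix; pi1: our mixed policy; Q: our action values
# (assumed to be updated elsewhere, as in ordinary PHC).
# pi1_star: our Nash strategy; pi2_hat, delta2_hat: estimates of the opponent's
# current policy and learning rate (see the windowed estimation sketch below).
R = np.array([[-1.0, 1.0], [1.0, -1.0]])   # matching pennies
pi1 = np.array([0.5, 0.5])
pi1_star = np.array([0.5, 0.5])
Q = np.zeros(2)

def phc_exploiter_update(pi2_hat, delta2_hat):
    global pi1
    best = int(np.argmax(Q))
    # Winning test: current expected payoff vs. the Nash strategy's payoff
    # against the opponent's (estimated) current policy.
    winning = float(pi1 @ Q) > float(pi1_star @ R @ pi2_hat)
    if winning:
        # Exploit: put all probability on the current best action.
        pi1 = np.eye(2)[best]
    else:
        # Losing: hill-climb toward the greedy action, but only at the
        # opponent's (estimated) learning rate delta2_hat, as in the update above.
        step = np.where(np.arange(2) == best, delta2_hat, -delta2_hat / (2 - 1))
        pi1 = np.clip(pi1 + step, 0.0, 1.0)
        pi1 = pi1 / pi1.sum()
```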
But we don’t have complete information
• Estimate the opponent’s policy π_2 at each time period
• Estimate the opponent’s learning rate δ_2

[Figure: timeline with two adjacent windows of length w, covering (t-2w, t-w] and (t-w, t], over which the estimates are computed.]
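A sketch of the windowed estimates implied here: the opponent's policy is taken to be its empirical action frequency over the last w plays, and its learning rate is estimated from the change between the two most recent windows. The window length w and the particular estimator are assumptions of the sketch.

```python
from collections import deque

# Illustrative sketch: estimate the opponent's mixed policy and learning rate
# from its observed actions, using two adjacent windows of length w.
w = 100
recent = deque(maxlen=2 * w)     # opponent actions over the last 2w periods

def observe(opponent_action):
    recent.append(opponent_action)

def estimate(n_actions=2):
    old, new = list(recent)[:w], list(recent)[-w:]
    freq = lambda xs: [xs.count(a) / max(len(xs), 1) for a in range(n_actions)]
    pi2_old, pi2_new = freq(old), freq(new)
    # delta2_hat: rough per-period change in the opponent's action probabilities.
    delta2_hat = max(abs(n - o) for n, o in zip(pi2_new, pi2_old)) / w
    return pi2_new, delta2_hat
```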
Ideally we’d like to see this:
[Figure: the idealized winning/losing cycle in joint policy space.]

With our approximations:
[Figure: the same cycle traced using the estimated opponent policy and learning rate.]

And indeed we’re doing well:
[Figure: our performance over time, alternating between winning and losing phases.]
    Knowledge (beliefs) are useful

   Using our knowledge about the opponent,
    we’ve demonstrated one case in which we
    can achieve better than Nash rewards
   In general, we’d like algorithms that can
    guarantee Nash payoffs against fully rational
    players but can exploit bounded players (such
    as a PHC)
    So what do we want from learning?

   Best Response / Adaptive : exploit the
    opponent’s weaknesses, essentially always try
    to play a best response

   Regret-minimization : we’d like to be able to
    look back and not regret our actions; we
    wouldn’t say to ourselves: “Gosh, why didn’t
    I choose to do that instead…”
    A next step
   Expand the comparison class in universally
    consistent (regret-minimization) algorithms to
    include richer spaces of possible strategies
   For example, the comparison class could
    include a best-response player to a PHC
   Could also include all t-period strategies
    Part II

   What if we’re cooperating?
What if we’re cooperating?
• Nash equilibrium is not the most useful concept in cooperative scenarios
• We simply want to find a globally (perhaps approximately) optimal solution in a distributed manner
  – This happens to be a Nash equilibrium, but it’s not really the point of NE to address this scenario
• Distributed problem solving rather than game playing
• May also deal with modeling emergent behaviors
    Mobilized ad-hoc networks
   Ad-hoc networks are limited in connectivity
   Mobilized nodes can significantly improve
    connectivity
Network simulator
Connectivity bounds
• Static ad-hoc networks have loose bounds of the following form:

  Given n nodes uniformly distributed i.i.d. in a disk of area A, each with range
  $$r'_n = \sqrt{\frac{A \log n + \gamma_n}{\pi n}},$$
  the graph is connected almost surely as n → ∞ iff γ_n → ∞.
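As a quick numeric illustration of the form of this bound (the choice γ_n = log log n and the unit area A = 1 below are arbitrary assumptions, not values from the talk), the critical range shrinks roughly like √(log n / n):

```python
import math

# Illustrative computation of the critical range r'_n = sqrt((A log n + g_n) / (pi n))
# using A = 1 and the arbitrary choice g_n = log log n (any g_n -> infinity works).
A = 1.0
for n in (25, 100, 400, 1600):
    g_n = math.log(math.log(n))
    r = math.sqrt((A * math.log(n) + g_n) / (math.pi * n))
    print(n, round(r, 3))
```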
Connectivity bounds
• Allowing mobility can improve our loose bounds to:

  Fraction mobile    Required range     # nodes
  1/2                r_n / √2           n/2
  2/3                r_n / √3           n/3
  k/(k+1)            r_n / √(k+1)       n/(k+1)

  where $r_n = \sqrt{\frac{\log n}{n}}$.

• Can we achieve this, or even do significantly better than this?
    Many challenges
   Routing
        Dynamic environment: neighbor nodes moving in
         and out of range, source and receivers may also
         be moving
        Limited bandwidth: channel allocation, limited
         buffer sizes

   Moving
        What is the globally optimal configuration?
        What is the globally optimal trajectory of
         configurations?
        Can we learn a good policy using only local
         knowledge?
Routing
• Q-routing [Boyan & Littman, 93] (a sketch follows after this list)
  – Applied simple Q-learning to the static network-routing problem under congestion
  – Actions: forward the packet to a particular neighbor node
  – States: the current packet’s intended receiver
  – Reward: estimated time to arrival at the receiver
  – Performed well by learning to route packets around congested areas
• Direct application of Q-routing to the mobile ad-hoc network case
• Adaptations to the highly dynamic nature of mobilized ad-hoc networks
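A sketch of the Q-routing update described above: node x keeps Q[x][(d, y)], an estimate of the time to deliver a packet bound for destination d via neighbor y, and updates it from that neighbor's own best remaining-time estimate plus the local queueing and transmission delay. The table layout, learning rate, and neighbor bookkeeping are assumptions of the sketch.

```python
from collections import defaultdict

# Illustrative sketch of the Q-routing update.
# Q[x][(d, y)] estimates the delivery time from node x to destination d when
# the packet is forwarded to neighbor y.  alpha is an assumed learning rate.
alpha = 0.5
Q = defaultdict(lambda: defaultdict(float))
neighbors = {}   # neighbors[x] = nodes currently in range of x (kept up to date
                 # elsewhere as nodes move)

def route(x, d):
    # Forward to the neighbor with the smallest estimated time-to-destination.
    return min(neighbors[x], key=lambda y: Q[x][(d, y)])

def q_routing_update(x, y, d, queue_delay, tx_delay):
    # Neighbor y reports its own best remaining-time estimate for destination d.
    remaining = min(Q[y][(d, z)] for z in neighbors[y]) if neighbors[y] else 0.0
    target = queue_delay + tx_delay + remaining
    Q[x][(d, y)] += alpha * (target - Q[x][(d, y)])
```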
    Movement: An RL approach
   What should our actions be?
        North, South, East, West, Stay Put
        Explore, Maintain connection, Terminate
         connection, etc.

   What should our states be?
        Local information about nodes, locations, and
         paths
        Summarized local information
        Globally shared statistics

   Policy search? Mixture of experts?
    Macros, options, complex actions
   Allow the nodes (agents) to utilize complex
    actions rather than simple N, S, E, W type
    movements
   Actions might take varying amounts of time
   Agents can re-evaluate at each time step whether or not to continue the action
        If the state hasn’t really changed, then naturally the same action will be chosen again
     Example action: “plug”
1.    Sniff packets in neighborhood
2.    Identify path (source, receiver pair) with
      longest average hops
3.    Move to that path
4.    Move along this path until a long hop is
      encountered
5.    Insert yourself into the path at this point,
      thereby decreasing the average hop
      distance
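A self-contained Python sketch of the “plug” steps above, working on toy data: a “path” is just a list of node positions for one source-receiver flow, and the radio_range value and data layout are assumptions of the sketch, not of the simulator.

```python
import math

# Illustrative, self-contained sketch of the "plug" complex action on toy data.
# A "path" is a list of (x, y) node positions for one source-receiver flow.

def hop_lengths(path):
    return [math.dist(a, b) for a, b in zip(path, path[1:])]

def plug(my_position, paths, radio_range=1.0):
    # 1-2. pick the flow with the longest average hop distance
    target = max(paths, key=lambda p: sum(hop_lengths(p)) / (len(p) - 1))
    # 3-4. walk along that path until a hop longer than our radio range
    for a, b in zip(target, target[1:]):
        if math.dist(a, b) > radio_range:
            # 5. insert ourselves at the midpoint of the long hop,
            #    shrinking the path's average hop distance
            return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    return my_position      # no long hop found: stay put

paths = [[(0, 0), (0.5, 0), (1.0, 0)],
         [(0, 1), (2.0, 1), (2.5, 1)]]
print(plug((0, 0), paths))   # -> midpoint of the (0,1)-(2,1) hop: (1.0, 1.0)
```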
    Some notion of state
   State space could be huge, so we choose
    certain features to parameterize the state
    space
        Connectivity, average hop distance, …
   Actions should change the world state
        Exploring will hopefully lead to connectivity,
         plugging will lead to smaller average hops, …
Experimental results

Number of nodes   Range      Theoretical fraction mobile   Empirical fraction mobile required
25                2 r_n
25                r_n        1/2                           0.21
50                1.7 r_n
50                0.85 r_n   1/2                           0.25
100               1.7 r_n
100               0.85 r_n   1/2                           0.19
200               1.6 r_n
200               0.8 r_n    1/2                           0.17
400               1.6 r_n
400               0.8 r_n    1/2                           0.14
Seems to work well
[Figure-only slide.]

Pretty pictures
[Four figure-only slides.]
    Many things to play with
   Lossy transmissions
   Transmission interference
   Existence of opponents, jamming signals
   Self-interested nodes
   More realistic simulations – ns2
   Learning different agent roles or optimizing
    the individual complex actions
   Interaction between route learning and
    movement learning
Three yardsticks
1. Non-cooperative case: we want to play our best response to the observed play of the world; we want to learn about the opponent
   – Minimize regret
   – Play our best response
2. Cooperative case: approximate a global optimum in a distributed manner, using only local information or less computation
3. Skiing case: 17 cm of fresh powder last night and it’s still snowing. More snow is better. Who can argue with that?
The End