Training Products of Experts by Minimizing Contrastive Divergence

					 Training Products of Experts by
Minimizing Contrastive Divergence

        Geoffrey E. Hinton

          presented by Frank Wood




           Frank Wood - fwood@cs.brown.edu
Goal

• Learn parameters for probability distribution models of
  high dimensional data
     – (Images, Population Firing Rates, Securities Data, NLP data, etc)


Mixture Model:

$$ p(d \mid \theta_1, \ldots, \theta_n) = \sum_m \alpha_m f_m(d \mid \theta_m) $$

Use EM to learn parameters.

Product of Experts:

$$ p(d \mid \theta_1, \ldots, \theta_n) = \frac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)} $$

Use Contrastive Divergence to learn parameters.
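
To make the difference concrete, here is a small NumPy sketch (the two experts, weights, and the four-point discrete domain are invented for illustration) that evaluates both formulas where the PoE normalizer is a simple sum:

```python
import numpy as np

# Toy comparison of the two formulas above on a small discrete domain,
# so the PoE normalizer sum_c prod_m f_m(c) is easy to compute exactly.

f = np.array([[0.1, 0.2, 0.3, 0.4],      # f_1(c) for c = 0..3 (illustrative)
              [0.4, 0.3, 0.2, 0.1]])     # f_2(c)
alpha = np.array([0.5, 0.5])             # mixture weights

mixture = alpha @ f                                 # p(c) = sum_m alpha_m f_m(c)
poe = f.prod(axis=0) / f.prod(axis=0).sum()         # p(c) = prod_m f_m(c) / sum_c prod_m f_m(c)

print(mixture)   # [0.25 0.25 0.25 0.25]
print(poe)       # mass concentrates where both experts agree: [0.2 0.3 0.3 0.2]
```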



Take Home

• Contrastive divergence is a general MCMC
  gradient ascent learning algorithm particularly
  well suited to learning Product of Experts (PoE)
  and energy-based (Gibbs distributions, etc.)
  model parameters.
• The general algorithm (a minimal sketch follows below):
  – Repeat Until “Convergence”
     • Draw samples from the current model starting from the training
       data.
     • Compute the expected gradient of the log probability w.r.t. all
       model parameters over both samples and the training data.
     • Update the model parameters according to the gradient.
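
Here is a minimal, hypothetical NumPy sketch of that loop; `grad_log_f` and `sample_from_model` are placeholders for a concrete expert model and its MCMC sampler, not anything defined in the slides:

```python
import numpy as np

# A minimal sketch of the general CD learning loop described above.
def contrastive_divergence_step(theta, data, grad_log_f, sample_from_model, lr=0.01):
    """One update: positive phase on the data, negative phase on model samples."""
    # Draw samples from the current model, starting each chain at a training point.
    samples = np.array([sample_from_model(theta, d) for d in data])

    # Expected gradient of log f w.r.t. the parameters, under data and under samples.
    pos = np.mean([grad_log_f(theta, d) for d in data], axis=0)
    neg = np.mean([grad_log_f(theta, c) for c in samples], axis=0)

    # Gradient ascent step on the (approximate) log likelihood.
    return theta + lr * (pos - neg)
```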

Sampling – Critical to Understanding
• Uniform
   – rand()  (Linear Congruential Generator)
       • x(n) = ( a * x(n-1) + b ) mod M
       • e.g. 0.2311  0.6068  0.4860  0.8913  0.7621  0.4565  0.0185

• Normal
   – randn()  (Box-Muller transform)
       • x1, x2 ~ U(0,1)  ->  y1, y2 ~ N(0,1)
            – y1 = sqrt( -2 ln(x1) ) cos( 2 pi x2 )
            – y2 = sqrt( -2 ln(x1) ) sin( 2 pi x2 )
• Binomial(p)
   – if(rand()<p)
• More Complicated Distributions
   – Mixture Model
       • Sample a component from the multinomial mixing weights (CDF + uniform)
       • Sample from that component’s Gaussian
   – Product of Experts
       • Metropolis and/or Gibbs sampling (a sketch of the simpler samplers follows below)
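
Below is a minimal NumPy sketch of the basic samplers listed above; the constants, seeds, and function names are illustrative assumptions, not from the slides:

```python
import numpy as np

def lcg(n, x0=1, a=1103515245, b=12345, M=2**31):
    """Uniform-ish samples in [0,1) from a linear congruential generator."""
    xs, x = [], x0
    for _ in range(n):
        x = (a * x + b) % M
        xs.append(x / M)
    return np.array(xs)

def box_muller(n, rng=np.random.default_rng()):
    """Standard normal samples from pairs of uniforms."""
    x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
    return np.sqrt(-2 * np.log(x1)) * np.cos(2 * np.pi * x2)

def bernoulli(p, rng=np.random.default_rng()):
    """Binomial(1, p) by thresholding a uniform."""
    return rng.uniform() < p

def sample_mixture(weights, means, stds, rng=np.random.default_rng()):
    """Mixture of 1-D Gaussians: pick a component via the CDF, then sample it."""
    k = np.searchsorted(np.cumsum(weights), rng.uniform())
    return rng.normal(means[k], stds[k])
```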



The Flavor of Metropolis Sampling

• Given some distribution $p(d \mid \theta)$, a random starting point $d^{t-1}$, and a
  symmetric proposal distribution $J(d^t \mid d^{t-1})$.
• Calculate the ratio of densities
  $$ r = \frac{p(d^t \mid \theta)}{p(d^{t-1} \mid \theta)} $$
  where $d^t$ is sampled from the proposal distribution.
• With probability min(r, 1), accept $d^t$.
• Given sufficiently many iterations,
  $$ d^n, d^{n+1}, d^{n+2}, \ldots \sim p(d \mid \theta) $$
• Only need to know the distribution up to a proportionality constant!
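
A minimal random-walk Metropolis sketch in NumPy (the Gaussian proposal and step size are illustrative assumptions); note that it only ever uses the unnormalized log density:

```python
import numpy as np

def metropolis(log_p_tilde, d0, n_steps, step_size=0.1, rng=np.random.default_rng()):
    """Random-walk Metropolis for an unnormalized density exp(log_p_tilde)."""
    d = np.asarray(d0, dtype=float)
    samples = []
    for _ in range(n_steps):
        proposal = d + step_size * rng.standard_normal(d.shape)  # symmetric proposal J
        log_r = log_p_tilde(proposal) - log_p_tilde(d)           # log of the density ratio r
        if np.log(rng.uniform()) < min(log_r, 0.0):              # accept with prob min(r, 1)
            d = proposal
        samples.append(d.copy())
    return np.array(samples)
```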
Contrastive Divergence (Final Result!)
$$ \Delta\theta_m \propto
   \left\langle \frac{\partial \log f_m}{\partial \theta_m} \right\rangle_{P^0}
   - \left\langle \frac{\partial \log f_m}{\partial \theta_m} \right\rangle_{P^1} $$

Here $\theta_m$ are the model parameters, $P^0$ is the training data (the empirical
distribution), and $P^1$ denotes samples from the model.

By the Law of Large Numbers, compute the expectations using samples:

$$ \Delta\theta_m \approx
   \frac{1}{N} \sum_{d \in D} \frac{\partial \log f_m(d)}{\partial \theta_m}
   - \frac{1}{N} \sum_{c \sim P^1} \frac{\partial \log f_m(c)}{\partial \theta_m} $$

Now you know how to do it, let’s see why this works!

But First: The last vestige of concreteness.

• Looking towards the future:
   – Take f to be a Student-t expert:
     $$ f_m(d \mid \theta_m) = f(d \mid \alpha_m, j_m)
        = \frac{1}{\left(1 + \tfrac{1}{2}\,(j_m^T d)^2\right)^{\alpha_m}} $$
     [Figure: 1-D Student-t expert, $1/(1 + 0.5\,(j^T x)^2)^{\alpha}$ with $\alpha = 0.6$, $j = 5$,
      plotted against the dot-product projection (1-D marginal).]
   – Then (for instance)
     $$ \log f(d \mid \alpha_m, j_m) = -\alpha_m \log\!\left(1 + \tfrac{1}{2}\,(j_m^T d)^2\right),
        \qquad
        \frac{\partial \log f(d \mid \alpha_m, j_m)}{\partial \alpha_m}
        = -\log\!\left(1 + \tfrac{1}{2}\,(j_m^T d)^2\right) $$
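
A minimal NumPy sketch of this expert and the derivative above; the sanity check at the end uses an invented random point and projection:

```python
import numpy as np

# Student-t expert from the slide: f_m(d | alpha_m, j_m) = (1 + 0.5 (j_m^T d)^2)^(-alpha_m).
def log_f(d, alpha, j):
    return -alpha * np.log1p(0.5 * (j @ d) ** 2)

def dlog_f_dalpha(d, alpha, j):
    return -np.log1p(0.5 * (j @ d) ** 2)

# Quick finite-difference check of the derivative at a random point.
rng = np.random.default_rng(1)
d, j, alpha, eps = rng.standard_normal(4), rng.standard_normal(4), 0.6, 1e-6
numeric = (log_f(d, alpha + eps, j) - log_f(d, alpha - eps, j)) / (2 * eps)
print(np.allclose(numeric, dlog_f_dalpha(d, alpha, j)))  # True
```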

Maximizing the training data log likelihood
• We want maximizing parameters (standard PoE form):
  $$ \arg\max_{\theta_1,\ldots,\theta_n} \log p(D \mid \theta_1, \ldots, \theta_n)
     = \arg\max_{\theta_1,\ldots,\theta_n} \log \prod_{d \in D}
       \frac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)} $$
  (the product runs over all training data, assuming the d’s are drawn independently from p(·))

• Differentiate w.r.t. all parameters and
  perform gradient ascent to find the optimal
  parameters.
• The derivation is somewhat nasty.

Defining the gradient of the log likelihood
                                           
$$ \frac{\partial \log p(D \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m}
   = \frac{\partial}{\partial \theta_m} \sum_{d \in D} \log p(d \mid \theta_1,\ldots,\theta_n)
   = \sum_{d \in D} \frac{\partial \log p(d \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m}
   = N \left\langle \frac{\partial \log p(d \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m} \right\rangle_{P^0} $$

Remember this equivalence! (It is the left-hand side of eqn. 3.2.)




Deriving the gradient of the log likelihood

$$ \frac{1}{N} \frac{\partial \log p(D \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m}
   = \frac{1}{N} \frac{\partial}{\partial \theta_m} \sum_{d \in D}
     \log \frac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)} $$

$$ = \frac{1}{N} \sum_{d \in D} \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m}
   - \frac{1}{N} \sum_{d \in D}
     \frac{\partial \log \sum_c \prod_m f_m(c \mid \theta_m)}{\partial \theta_m} $$

$$ = \frac{1}{N} \sum_{d \in D} \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m}
   - \frac{\partial \log \sum_c \prod_m f_m(c \mid \theta_m)}{\partial \theta_m} $$

(The second term does not depend on d, so averaging it over the N training points leaves it unchanged.)

Deriving the gradient of the log likelihood

$$ \frac{1}{N} \sum_{d \in D} \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m}
   - \frac{\partial \log \sum_c \prod_m f_m(c \mid \theta_m)}{\partial \theta_m} $$

$$ = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \frac{\partial \log \sum_c \prod_m f_m(c \mid \theta_m)}{\partial \theta_m} $$

Using $(\log x)' = x'/x$:

$$ = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \frac{1}{\sum_c \prod_m f_m(c \mid \theta_m)}
     \frac{\partial}{\partial \theta_m} \sum_c \prod_m f_m(c \mid \theta_m) $$


Deriving the gradient of the log likelihood

$$ \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \frac{1}{\sum_c \prod_m f_m(c \mid \theta_m)}
     \frac{\partial}{\partial \theta_m} \sum_c \prod_m f_m(c \mid \theta_m) $$

$$ = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \frac{1}{\sum_c \prod_m f_m(c \mid \theta_m)}
     \sum_c \left( \prod_{j \neq m} f_j(c \mid \theta_j) \right)
     \frac{\partial f_m(c \mid \theta_m)}{\partial \theta_m} $$

Using $(\log x)' = x'/x$ again, i.e. $\partial f_m / \partial \theta_m = f_m \,\partial \log f_m / \partial \theta_m$:

$$ = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \frac{1}{\sum_c \prod_m f_m(c \mid \theta_m)}
     \sum_c \left( \prod_m f_m(c \mid \theta_m) \right)
     \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} $$

Deriving the gradient of the log likelihood

$$ \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \frac{1}{\sum_c \prod_m f_m(c \mid \theta_m)}
     \sum_c \left( \prod_m f_m(c \mid \theta_m) \right)
     \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} $$

$$ = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \sum_c \frac{\prod_m f_m(c \mid \theta_m)}{\sum_{c'} \prod_m f_m(c' \mid \theta_m)}
     \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} $$

$$ = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \sum_c p(c \mid \theta_1,\ldots,\theta_n)
     \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} $$




Deriving the gradient of the log likelihood

$$ \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \sum_c p(c \mid \theta_1,\ldots,\theta_n)
     \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} $$

$$ = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty} $$

Phew! We’re done! So:

$$ \frac{\partial \log p(D \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m}
   = N \left\langle \frac{\partial \log p(d \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m} \right\rangle_{P^0}
   \propto
   \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty} $$
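
As a quick sanity check on this result, here is a NumPy sketch on a small discrete domain where the sum over c is tractable; the simple exponential-form experts, features, and toy data are all invented for illustration. It compares the derived expression against a finite-difference gradient of the exact log likelihood:

```python
import numpy as np

# Toy experts f_m(c | theta_m) = exp(theta_m * phi_m(c)) on c in {0,...,K-1}.
K, n_experts = 5, 3
rng = np.random.default_rng(0)
phi = rng.standard_normal((n_experts, K))      # phi[m, c], d log f_m / d theta_m = phi[m, c]
theta = rng.standard_normal(n_experts)
D = rng.integers(0, K, size=20)                # toy "training data"

def log_p(theta):
    log_unnorm = (theta[:, None] * phi).sum(axis=0)   # sum_m log f_m(c)
    log_Z = np.log(np.exp(log_unnorm).sum())          # log sum_c prod_m f_m(c)
    return log_unnorm[D].sum() - len(D) * log_Z

# Derived formula: (1/N) d log p(D)/d theta_m = <phi_m>_{P0} - <phi_m>_{P_infinity}.
log_unnorm = (theta[:, None] * phi).sum(axis=0)
p_model = np.exp(log_unnorm - log_unnorm.max())
p_model /= p_model.sum()
analytic = phi[:, D].mean(axis=1) - phi @ p_model

# Finite-difference check of the same quantity.
eps = 1e-5
numeric = np.array([
    (log_p(theta + eps * np.eye(n_experts)[m]) - log_p(theta - eps * np.eye(n_experts)[m]))
    / (2 * eps * len(D))
    for m in range(n_experts)
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```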


Equilibrium Is Hard to Achieve

• With:
  $$ \frac{\partial \log p(D \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m}
     \propto
     \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
     - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty} $$
  we can now train our PoE model.
• But… there’s a problem:
  – $P^\infty$ is computationally infeasible to obtain (especially in an
    inner gradient ascent loop).
  – The sampling Markov chain must converge to the target
    distribution, and this often takes a very long time!

Solution: Contrastive Divergence!

$$ \frac{\partial \log p(D \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m}
   \approx
   \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1} $$

• Now we don’t have to run the sampling Markov
  chain to convergence; instead we can stop after
  1 iteration (or, more typically, perhaps a few iterations).
• Why does this work?
  – It attempts to minimize the ways that the model
    distorts the data. (A CD-1 sketch for the Student-t experts follows below.)
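
For concreteness, here is a minimal CD-1 sketch for a product of the Student-t experts introduced earlier; the one-step Metropolis negative phase, the step size, and the learning rate are illustrative assumptions rather than anything specified in the slides:

```python
import numpy as np

def log_p_tilde(d, alphas, J):
    """Unnormalized log density of the PoE: sum_m log f_m(d | alpha_m, j_m)."""
    return np.sum(-alphas * np.log1p(0.5 * (J @ d) ** 2))

def dlogf_dalpha(d, J):
    """d log f_m / d alpha_m for every expert m at point d."""
    return -np.log1p(0.5 * (J @ d) ** 2)

def cd1_update(alphas, J, data, step=0.5, lr=0.01, rng=np.random.default_rng()):
    """One CD-1 update of the alphas: one Metropolis step starting from each data point."""
    samples = []
    for d in data:
        prop = d + step * rng.standard_normal(d.shape)
        log_r = log_p_tilde(prop, alphas, J) - log_p_tilde(d, alphas, J)
        samples.append(prop if np.log(rng.uniform()) < min(log_r, 0.0) else d)
    samples = np.array(samples)

    pos = np.mean([dlogf_dalpha(d, J) for d in data], axis=0)     # <.>_{P0}
    neg = np.mean([dlogf_dalpha(c, J) for c in samples], axis=0)  # <.>_{P1}
    return alphas + lr * (pos - neg)
```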

Equivalence of argmax log P() and argmin KL()

$$ P^0 \,\|\, P^\infty \;=\; \sum_d P^0(d) \log \frac{P^0(d)}{P^\infty(d)} $$

$$ = \sum_d P^0(d) \log P^0(d) \;-\; \sum_d P^0(d) \log P^\infty(d) $$

$$ = -H(P^0) \;-\; \left\langle \log P^\infty(d) \right\rangle_{P^0} $$

Since the entropy $H(P^0)$ of the data does not depend on the model parameters,

$$ \frac{\partial \left( P^0 \,\|\, P^\infty \right)}{\partial \theta_m}
   = - \left\langle \frac{\partial \log P^\infty(d)}{\partial \theta_m} \right\rangle_{P^0} $$

This is what we got out of the nasty derivation!




 Contrastive Divergence

• We want to “update the parameters to reduce
  the tendency of the chain to wander away from
  the initial distribution on the first step”.

$$ -\frac{\partial}{\partial \theta_m}\!\left( P^0 \,\|\, P^\infty \;-\; P^1 \,\|\, P^\infty \right)
   \;\approx\;
   \left\langle \frac{\partial \log P^\infty(d)}{\partial \theta_m} \right\rangle_{P^0}
   - \left\langle \frac{\partial \log P^\infty(d)}{\partial \theta_m} \right\rangle_{P^1} $$

(The approximation ignores a small term that arises because $P^1$ itself depends on $\theta_m$.)

Expanding each expectation using the earlier derivation, the $P^\infty$ terms cancel:

$$ \left[ \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty} \right]
   -
   \left[ \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1}
   - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty} \right] $$

$$ = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}
   - \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1} $$


Contrastive Divergence (Final Result!)
$$ \Delta\theta_m \propto
   \left\langle \frac{\partial \log f_m}{\partial \theta_m} \right\rangle_{P^0}
   - \left\langle \frac{\partial \log f_m}{\partial \theta_m} \right\rangle_{P^1} $$

This is the gradient used for the model parameters; $P^0$ is the training data (the
empirical distribution), and for $P^1$ we don’t need samples that have reached equilibrium.

By the Law of Large Numbers, compute the expectations using samples:

$$ \Delta\theta_m \approx
   \frac{1}{N} \sum_{d \in D} \frac{\partial \log f_m(d)}{\partial \theta_m}
   - \frac{1}{N} \sum_{c \sim P^1} \frac{\partial \log f_m(c)}{\partial \theta_m} $$

Now you know how to do it and why it works!
