Expectation Maximization Algorithm by khn19658

									Expectation Maximization Algorithm

 Rong Jin
A Mixture Model Problem
[Figure: histogram of the dataset; x axis from 0 to 25, bin counts from 0 to 20]

• Apparently, the dataset consists of two modes
• How can we automatically identify the two modes?
Gaussian Mixture Model (GMM)
• Assume that the dataset is generated by a mixture of two Gaussian distributions
    - Gaussian model 1: $\theta_1 = \{\mu_1, \sigma_1; p_1\}$
    - Gaussian model 2: $\theta_2 = \{\mu_2, \sigma_2; p_2\}$
• If we know the membership of each bin, estimating the two Gaussian models is easy.
• How do we estimate the two Gaussian models without knowing the memberships of the bins?
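To illustrate why known memberships make the problem easy, here is a minimal Python sketch. The function name `fit_with_known_memberships`, the tuple layout `(p, mu, var)` and the data are assumptions made for illustration; `var` stands for $\sigma^2$.

```python
def fit_with_known_memberships(xs, ms):
    """Estimate (p, mu, var) per model when each point's membership is observed."""
    params = {}
    for k in sorted(set(ms)):
        members = [x for x, m in zip(xs, ms) if m == k]
        mu = sum(members) / len(members)
        # Maximum-likelihood variance: divide by the number of members.
        var = sum((x - mu) ** 2 for x in members) / len(members)
        params[k] = (len(members) / len(xs), mu, var)
    return params

# Made-up data: three points labeled as model 1, four labeled as model 2.
xs = [4.0, 5.0, 6.0, 14.0, 15.0, 16.0, 17.0]
ms = [1, 1, 1, 2, 2, 2, 2]
params = fit_with_known_memberships(xs, ms)
```

Each model is fitted from its own points only, so no iteration is needed. The difficulty addressed next is that `ms` is not observed.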
EM Algorithm for GMM
• Let the memberships be hidden variables
    $\{x_1, x_2, \ldots, x_n\} \rightarrow \{(x_1, m_1), (x_2, m_2), \ldots, (x_n, m_n)\}$
• EM algorithm for the Gaussian mixture model
    - Unknown memberships: $(x_1, m_1), (x_2, m_2), \ldots, (x_n, m_n)$
    - Unknown Gaussian models: $\theta_1 = \{\mu_1, \sigma_1; p_1\}$ and $\theta_2 = \{\mu_2, \sigma_2; p_2\}$
    - Learn these two sets of parameters iteratively
Start with A Random Guess
• Randomly assign the memberships to each bin
• Estimate the mean and variance of each Gaussian model

[Figure: two panels over x from 0 to 25: the histogram (bin counts from 0 to 20) and a lower panel with vertical axis from 0 to 1]
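A random start of this kind might be sketched in Python as follows. The helper name `random_start`, the fixed seed and the sample data are assumptions made for illustration; `var` stands for $\sigma^2$.

```python
import random

def random_start(xs, seed=0):
    """Assign each point to model 1 or 2 at random, then fit both models."""
    rng = random.Random(seed)
    memberships = [rng.choice((1, 2)) for _ in xs]
    while len(set(memberships)) < 2:  # redraw if one model received no points
        memberships = [rng.choice((1, 2)) for _ in xs]
    models = []
    for k in (1, 2):
        members = [x for x, m in zip(xs, memberships) if m == k]
        mu = sum(members) / len(members)
        var = sum((x - mu) ** 2 for x in members) / len(members)
        models.append((mu, var, len(members) / len(xs)))
    return memberships, models

xs = [3.0, 4.5, 5.0, 6.5, 13.0, 14.5, 15.0, 16.5]  # made-up data
memberships, models = random_start(xs)
```

A model that happens to receive a single point gets a variance of zero, so a practical implementation would probably add a small floor to `var` before the first E-step.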
E-step
• Fix the two Gaussian models
• Estimate the posterior for each data point

$$p(m = 1 \mid x) = \frac{p(x, m = 1)}{p(x)} = \frac{p(x, \theta_1)}{p(x, \theta_1) + p(x, \theta_2)} = \frac{p(x \mid \mu_1, \sigma_1)\, p_1}{p(x \mid \mu_1, \sigma_1)\, p_1 + p(x \mid \mu_2, \sigma_2)\, p_2}$$

$$p(m = 2 \mid x) = \frac{p(x, m = 2)}{p(x)} = \frac{p(x, \theta_2)}{p(x, \theta_1) + p(x, \theta_2)} = \frac{p(x \mid \mu_2, \sigma_2)\, p_2}{p(x \mid \mu_1, \sigma_1)\, p_1 + p(x \mid \mu_2, \sigma_2)\, p_2}$$

$$p(x \mid \mu_1, \sigma_1) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{(x - \mu_1)^2}{2\sigma_1^2}\right), \quad p(x \mid \mu_2, \sigma_2) = \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{(x - \mu_2)^2}{2\sigma_2^2}\right)$$
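The posterior computation of the E-step can be sketched in Python as below. The names `gaussian_pdf` and `e_step`, the tuple layout `(mu, var, p)` and the sample values are assumptions made for illustration; `var` stands for $\sigma^2$.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def e_step(xs, theta1, theta2):
    """Return p(m = 1 | x) for each x; p(m = 2 | x) is one minus that value."""
    mu1, var1, p1 = theta1
    mu2, var2, p2 = theta2
    posteriors = []
    for x in xs:
        joint1 = gaussian_pdf(x, mu1, var1) * p1  # p(x, theta_1)
        joint2 = gaussian_pdf(x, mu2, var2) * p2  # p(x, theta_2)
        posteriors.append(joint1 / (joint1 + joint2))
    return posteriors

# Two models with equal priors and variances, centered at 5 and 15.
post = e_step([5.0, 10.0, 15.0], (5.0, 4.0, 0.5), (15.0, 4.0, 0.5))
```

The point halfway between the two means gets a posterior of one half, while points at either mean are assigned almost entirely to the nearer model. If both joint densities underflow to zero for a far-away point, the division fails; working with log densities would avoid that.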
EM Algorithm for GMM
• Re-estimate the memberships for each bin

[Figure: two panels over x from 0 to 25: the histogram (bin counts from 0 to 20) and a lower panel with vertical axis from 0 to 1]
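M-Step
• Fix the memberships
• Re-estimate the two Gaussian models, with each data point weighted by its posterior

$$l = \sum_{i=1}^{n} \hat{p}(m_i = 1 \mid x_i) \log p(x_i, \theta_1) + \hat{p}(m_i = 2 \mid x_i) \log p(x_i, \theta_2)$$

$$\phantom{l} = \sum_{i=1}^{n} \hat{p}(m_i = 1 \mid x_i) \left[\log p_1 + \log p(x_i \mid \mu_1, \sigma_1)\right] + \hat{p}(m_i = 2 \mid x_i) \left[\log p_2 + \log p(x_i \mid \mu_2, \sigma_2)\right]$$

• Maximizing $l$ gives estimates weighted by the posteriors

$$p_1 = \frac{\sum_{i=1}^{n} \hat{p}(m_i = 1 \mid x_i)}{n}, \quad \mu_1 = \frac{\sum_{i=1}^{n} \hat{p}(m_i = 1 \mid x_i)\, x_i}{\sum_{i=1}^{n} \hat{p}(m_i = 1 \mid x_i)}, \quad \sigma_1^2 = \frac{\sum_{i=1}^{n} \hat{p}(m_i = 1 \mid x_i)\, x_i^2}{\sum_{i=1}^{n} \hat{p}(m_i = 1 \mid x_i)} - \mu_1^2$$

$$p_2 = \frac{\sum_{i=1}^{n} \hat{p}(m_i = 2 \mid x_i)}{n}, \quad \mu_2 = \frac{\sum_{i=1}^{n} \hat{p}(m_i = 2 \mid x_i)\, x_i}{\sum_{i=1}^{n} \hat{p}(m_i = 2 \mid x_i)}, \quad \sigma_2^2 = \frac{\sum_{i=1}^{n} \hat{p}(m_i = 2 \mid x_i)\, x_i^2}{\sum_{i=1}^{n} \hat{p}(m_i = 2 \mid x_i)} - \mu_2^2$$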
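The weighted updates of the M-step can be sketched in Python as follows. The function name `m_step`, the tuple layout `(mu, var, p)` and the sample values are assumptions made for illustration; `var` stands for $\sigma^2$.

```python
def m_step(xs, post1):
    """Re-estimate both models from the posteriors p(m = 1 | x_i) in post1."""
    n = len(xs)
    post2 = [1.0 - r for r in post1]  # p(m = 2 | x_i)
    thetas = []
    for post in (post1, post2):
        total = sum(post)
        mu = sum(r * x for r, x in zip(post, xs)) / total
        # Weighted second moment minus the squared weighted mean.
        var = sum(r * x * x for r, x in zip(post, xs)) / total - mu * mu
        thetas.append((mu, var, total / n))
    return thetas

# Hard (0/1) posteriors reduce the updates to ordinary per-group estimates.
xs = [4.0, 5.0, 6.0, 14.0, 16.0]
theta1, theta2 = m_step(xs, [1.0, 1.0, 1.0, 0.0, 0.0])
```

The second-moment form of the variance mirrors the formula above. Summing `r * (x - mu) ** 2` instead is algebraically equivalent and tends to lose less precision when the mean is large compared with the spread.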
EM Algorithm for GMM
• Re-estimate the memberships for each bin
• Re-estimate the models

[Figure: two panels over x from 0 to 25: the histogram (bin counts from 0 to 20) and a lower panel with vertical axis from 0 to 1]
At the 5-th Iteration
• The red Gaussian component slowly shifts toward the left end of the x axis

[Figure: two panels over x from 0 to 25: the histogram (bin counts from 0 to 20) and a lower panel with vertical axis from 0 to 0.9]
At the 10-th Iteration
• The red Gaussian component still shifts slowly toward the left end of the x axis

[Figure: two panels over x from 0 to 25: the histogram (bin counts from 0 to 20) and a lower panel with vertical axis from 0 to 0.9]
At the 20-th Iteration
• The red Gaussian component makes a more noticeable shift toward the left end of the x axis

[Figure: two panels over x from 0 to 25: the histogram (bin counts from 0 to 20) and a lower panel with vertical axis from 0 to 1]
At the 50-th Iteration
• The red Gaussian component is close to the desired location

[Figure: two panels over x from 0 to 25: the histogram (bin counts from 0 to 20) and a lower panel with vertical axis from 0 to 1]
At the 100-th Iteration
• The results are almost identical to the ones for the 50-th iteration

[Figure: two panels over x from 0 to 25: the histogram (bin counts from 0 to 20) and a lower panel with vertical axis from 0 to 1]
EM as A Bound Optimization
• The EM algorithm in fact maximizes the log-likelihood function of the training data
• Likelihood for a data point x

$$p(x) = p(x, \theta_1) + p(x, \theta_2) = p(x \mid \mu_1, \sigma_1)\, p_1 + p(x \mid \mu_2, \sigma_2)\, p_2$$

$$p(x \mid \mu_1, \sigma_1) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{(x - \mu_1)^2}{2\sigma_1^2}\right), \quad p(x \mid \mu_2, \sigma_2) = \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{(x - \mu_2)^2}{2\sigma_2^2}\right)$$

• Log-likelihood of the training data

$$l(\theta_1, \theta_2) = \sum_{i=1}^{n} \log p(x_i) = \sum_{i=1}^{n} \log \left[ p(x_i \mid \mu_1, \sigma_1)\, p_1 + p(x_i \mid \mu_2, \sigma_2)\, p_2 \right]$$
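The log-likelihood above can be evaluated with a short Python sketch. The function names and the tuple layout `(mu, var, p)` are assumptions made for illustration, with `var` standing for $\sigma^2$. Summing the two components in log space is an implementation choice added here for numerical stability and is not part of the slides.

```python
import math

def log_gaussian(x, mu, var):
    """Log density of a Gaussian with mean mu and variance var at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)

def log_likelihood(xs, theta1, theta2):
    """Sum over the data of log[p(x|mu1,sigma1) p1 + p(x|mu2,sigma2) p2]."""
    mu1, var1, p1 = theta1
    mu2, var2, p2 = theta2
    total = 0.0
    for x in xs:
        a = math.log(p1) + log_gaussian(x, mu1, var1)
        b = math.log(p2) + log_gaussian(x, mu2, var2)
        hi = max(a, b)
        # log(exp(a) + exp(b)), computed without underflow of the larger term.
        total += hi + math.log(math.exp(a - hi) + math.exp(b - hi))
    return total

# Two identical standard-normal components reduce to a single Gaussian.
value = log_likelihood([0.0], (0.0, 1.0, 0.5), (0.0, 1.0, 0.5))
```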
Logarithm Bound Algorithm
• Start with an initial guess $\theta_1^0, \theta_2^0$
• Come up with a lower bound
    $l(\theta_1, \theta_2) \ge l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2)$
    - $Q(\theta_1, \theta_2)$ is a concave function
    - Touch point: $Q(\theta_1 = \theta_1^0, \theta_2 = \theta_2^0) = 0$
• Search for the optimal solution $\theta_1^1, \theta_2^1$ that maximizes $Q(\theta_1, \theta_2)$
• Repeat the procedure: build a new bound at the new point, $l(\theta_1, \theta_2) \ge l(\theta_1^1, \theta_2^1) + Q(\theta_1, \theta_2)$, and maximize it to obtain $\theta_1^2, \theta_2^2$, and so on
• Converge to a local optimum

[Figure: the curve $l(\theta_1, \theta_2)$ with successive lower-bound curves touching it at $\theta_1^0, \theta_2^0$, then $\theta_1^1, \theta_2^1$, then $\theta_1^2, \theta_2^2$, ...; labels: "Touch Point", "Optimal Point"]
EM as A Bound Optimization
• Parameters from the previous iteration: $\theta_1', \theta_2'$
• Parameters for the current iteration: $\theta_1, \theta_2$
• Compute $Q(\theta_1, \theta_2)$

$$l(\theta_1, \theta_2) - l(\theta_1', \theta_2') = \sum_{i=1}^{n} \log \frac{p(x_i \mid \mu_1, \sigma_1)\, p_1 + p(x_i \mid \mu_2, \sigma_2)\, p_2}{p(x_i \mid \mu_1', \sigma_1')\, p_1' + p(x_i \mid \mu_2', \sigma_2')\, p_2'}$$

$$= \sum_{i=1}^{n} \log \left[ \frac{p(x_i \mid \mu_1', \sigma_1')\, p_1'}{p(x_i \mid \mu_1', \sigma_1')\, p_1' + p(x_i \mid \mu_2', \sigma_2')\, p_2'} \cdot \frac{p(x_i \mid \mu_1, \sigma_1)\, p_1}{p(x_i \mid \mu_1', \sigma_1')\, p_1'} + \frac{p(x_i \mid \mu_2', \sigma_2')\, p_2'}{p(x_i \mid \mu_1', \sigma_1')\, p_1' + p(x_i \mid \mu_2', \sigma_2')\, p_2'} \cdot \frac{p(x_i \mid \mu_2, \sigma_2)\, p_2}{p(x_i \mid \mu_2', \sigma_2')\, p_2'} \right]$$

• Concave property of the logarithm function: $\log(p\alpha + (1 - p)\beta) \ge p \log \alpha + (1 - p) \log \beta$ for all $0 \le p \le 1$ and $\alpha, \beta > 0$

$$\ge \sum_{i=1}^{n} \left[ \frac{p(x_i \mid \mu_1', \sigma_1')\, p_1'}{p(x_i \mid \mu_1', \sigma_1')\, p_1' + p(x_i \mid \mu_2', \sigma_2')\, p_2'} \log \frac{p(x_i \mid \mu_1, \sigma_1)\, p_1}{p(x_i \mid \mu_1', \sigma_1')\, p_1'} + \frac{p(x_i \mid \mu_2', \sigma_2')\, p_2'}{p(x_i \mid \mu_1', \sigma_1')\, p_1' + p(x_i \mid \mu_2', \sigma_2')\, p_2'} \log \frac{p(x_i \mid \mu_2, \sigma_2)\, p_2}{p(x_i \mid \mu_2', \sigma_2')\, p_2'} \right]$$

• Definition of the posterior: $p(m_i = 1 \mid x_i; \theta_1', \theta_2') = \dfrac{p(x_i \mid \mu_1', \sigma_1')\, p_1'}{p(x_i \mid \mu_1', \sigma_1')\, p_1' + p(x_i \mid \mu_2', \sigma_2')\, p_2'}$, and likewise for $m_i = 2$

$$= \sum_{i=1}^{n} \left[ p(m_i = 1 \mid x_i; \theta_1', \theta_2') \log \frac{p(x_i \mid \mu_1, \sigma_1)\, p_1}{p(x_i \mid \mu_1', \sigma_1')\, p_1'} + p(m_i = 2 \mid x_i; \theta_1', \theta_2') \log \frac{p(x_i \mid \mu_2, \sigma_2)\, p_2}{p(x_i \mid \mu_2', \sigma_2')\, p_2'} \right] \equiv Q(\theta_1, \theta_2)$$
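The bound derived above can be checked numerically with a small Python sketch. The names `joint`, `log_likelihood` and `q_bound`, the tuple layout `(mu, var, p)` and all numeric values are assumptions made for illustration; `var` stands for $\sigma^2$.

```python
import math

def joint(x, theta):
    """p(x | mu, sigma) * p for one model; theta = (mu, var, p)."""
    mu, var, p = theta
    return p * math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, thetas):
    return sum(math.log(sum(joint(x, t) for t in thetas)) for x in xs)

def q_bound(xs, new, old):
    """Posterior-weighted log ratios, with posteriors taken under the old models."""
    total = 0.0
    for x in xs:
        old_joint = [joint(x, t) for t in old]
        norm = sum(old_joint)
        for t_new, j_old in zip(new, old_joint):
            total += (j_old / norm) * math.log(joint(x, t_new) / j_old)
    return total

xs = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]            # made-up data
old = [(3.0, 4.0, 0.5), (7.0, 4.0, 0.5)]        # previous iteration
new = [(2.0, 1.0, 0.4), (9.0, 1.0, 0.6)]        # candidate for the current iteration
gap = log_likelihood(xs, new) - log_likelihood(xs, old)
q_new = q_bound(xs, new, old)
q_old = q_bound(xs, old, old)
```

Here `gap` should be at least `q_new`, and `q_old` should be zero, which corresponds to the touch point of the bound.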
Log-Likelihood of EM Alg.
[Figure: log-likelihood (vertical axis, from -410 to -375) versus iteration (horizontal axis, from 0 to 100); annotation: "Saddle points"]
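A curve of this kind can be produced by recording the log-likelihood after every EM iteration. The following Python sketch does so on a small made-up dataset. The function names, the tuple layout `(mu, var, p)`, the data and the starting guess are all assumptions made for illustration and do not reproduce the dataset behind the plot.

```python
import math

def joint(x, theta):
    """p(x | mu, sigma) * p for one model; theta = (mu, var, p)."""
    mu, var, p = theta
    return p * math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, thetas):
    return sum(math.log(sum(joint(x, t) for t in thetas)) for x in xs)

def em_iteration(xs, thetas):
    # E-step: posterior of each model for each point under the current models.
    posts = []
    for x in xs:
        js = [joint(x, t) for t in thetas]
        total = sum(js)
        posts.append([j / total for j in js])
    # M-step: posterior-weighted prior, mean and variance of each model.
    new_thetas = []
    for k in range(len(thetas)):
        weight = sum(post[k] for post in posts)
        mu = sum(post[k] * x for post, x in zip(posts, xs)) / weight
        var = sum(post[k] * (x - mu) ** 2 for post, x in zip(posts, xs)) / weight
        new_thetas.append((mu, var, weight / len(xs)))
    return new_thetas

xs = [1.0, 1.5, 2.0, 2.5, 3.0, 8.0, 8.5, 9.0, 9.5, 10.0]  # two made-up modes
thetas = [(2.0, 4.0, 0.5), (4.0, 4.0, 0.5)]               # arbitrary starting guess
history = [log_likelihood(xs, thetas)]
for _ in range(50):
    thetas = em_iteration(xs, thetas)
    history.append(log_likelihood(xs, thetas))
```

The values in `history` should never decrease from one iteration to the next, which is the property the bound argument guarantees. On this well-separated toy data the two means settle near 2 and 9.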
Maximize GMM Model
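   $$l = \sum_{i=1}^{n} \log p(x_i) = \sum_{i=1}^{n} \log \left[ p(x_i \mid \mu_1, \sigma_1)\, p_1 + p(x_i \mid \mu_2, \sigma_2)\, p_2 \right]$$

   $$p(x \mid \mu_1, \sigma_1) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left( -\frac{(x - \mu_1)^2}{2\sigma_1^2} \right), \quad p(x \mid \mu_2, \sigma_2) = \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left( -\frac{(x - \mu_2)^2}{2\sigma_2^2} \right)$$
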
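   What is the globally optimal solution to GMM?
   $$\mu_1 = x_1, \quad \sigma_1 = 0, \quad \mu_2 = \frac{\sum_{i=1}^{n} x_i}{n}, \quad \sigma_2 = 1, \quad p_1 = p_2 = 0.5$$
   With $\mu_1 = x_1$, the density $p(x_1 \mid \mu_1, \sigma_1)$ grows without bound as $\sigma_1 \to 0$, so the log-likelihood is unbounded.
   Maximizing the objective function of GMM is an ill-posed problem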
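The degenerate solution can be checked numerically. The sketch below is only an illustration under assumed values: the five data points and the list of shrinking `sigma1` values are made up for the demonstration, and the helper names are not from the slides. It pins the first Gaussian on $x_1$ and shows the log-likelihood growing as $\sigma_1$ shrinks.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

def gmm_log_likelihood(x, mu1, sigma1, mu2, sigma2, p1=0.5, p2=0.5):
    """Log-likelihood l of a two-component Gaussian mixture."""
    mixture = p1 * gaussian_pdf(x, mu1, sigma1) + p2 * gaussian_pdf(x, mu2, sigma2)
    return float(np.sum(np.log(mixture)))

# Hypothetical data set, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Pin component 1 on the first data point and let its width shrink.
sigmas = [1e-1, 1e-2, 1e-4, 1e-8]
values = [gmm_log_likelihood(x, mu1=x[0], sigma1=s, mu2=x.mean(), sigma2=1.0)
          for s in sigmas]

for s, v in zip(sigmas, values):
    print(f"sigma1 = {s:g}  log-likelihood = {v:.3f}")
```

Each reduction of `sigma1` raises the log-likelihood, which is why an unconstrained maximum does not exist.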
Identify Hidden Variables
   For certain learning problems, identifying hidden variables is
    not an easy task
   Consider a simple translation model
          For a pair of English and Chinese sentences:

        $$e : (e_1, e_2, \ldots, e_s) \qquad c : (c_1, c_2, \ldots, c_t)$$

          A simple translation model is

        $$\Pr(e \mid c) = \prod_{j=1}^{s} \Pr(e_j \mid c) = \prod_{j=1}^{s} \sum_{k=1}^{t} \Pr(e_j \mid c_k)$$

          The log-likelihood of the training corpus $(e_1, c_1), \ldots, (e_n, c_n)$ is

        $$l = \sum_{i=1}^{n} \log \Pr(e_i \mid c_i) = \sum_{i=1}^{n} \sum_{j=1}^{|e_i|} \log \sum_{k=1}^{|c_i|} \Pr(e_{i,j} \mid c_{i,k})$$
Identify Hidden Variables
     Consider a simple case: $e : (e_1\ e_2)$, $c : (c_1\ c_2)$
   $$\Pr(e \mid c) = \prod_{j=1}^{2} \sum_{k=1}^{2} \Pr(e_j \mid c_k)$$
   $$= \Pr(e_1 \mid c_1) \Pr(e_2 \mid c_1) + \Pr(e_1 \mid c_2) \Pr(e_2 \mid c_2) + \Pr(e_1 \mid c_1) \Pr(e_2 \mid c_2) + \Pr(e_1 \mid c_2) \Pr(e_2 \mid c_1)$$
   Alignment variable a(i)
   $a : \{1, 2\} \to \{1, 2\}$ maps a position in the English sentence to a position in the Chinese sentence
     Rewrite $\Pr(e \mid c) = \sum_{a} \Pr(e_1 \mid c_{a(1)}) \Pr(e_2 \mid c_{a(2)})$
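The rewrite above says that the product of per-word sums equals a sum over all alignments. A small numeric check makes this concrete. The probability table `t` below is hypothetical (values picked only for illustration), and alignments are enumerated as tuples `a` with `a[j]` the Chinese position assigned to English position `j`.

```python
from itertools import product

# Hypothetical translation probabilities Pr(e_j | c_k):
# rows are English words e1, e2; columns are Chinese words c1, c2.
t = [[0.7, 0.1],
     [0.2, 0.6]]

# Product over English positions of the sum over Chinese positions.
product_of_sums = 1.0
for j in range(2):
    product_of_sums *= sum(t[j][k] for k in range(2))

# Sum over all four alignments a: {1,2} -> {1,2} (0-based here).
sum_over_alignments = 0.0
for a in product(range(2), repeat=2):
    sum_over_alignments += t[0][a[0]] * t[1][a[1]]

print(product_of_sums, sum_over_alignments)
```

Both quantities agree, so the alignment `a` can be treated as the hidden variable of the model.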
EM Algorithm for A Translation Model
     Introduce an alignment variable for each translation pair
           $(e_1, c_1, a_1), (e_2, c_2, a_2), \ldots, (e_n, c_n, a_n)$
     EM algorithm for the translation model
            E-step: compute the posterior for each alignment variable $\Pr(a_j \mid e_j, c_j)$
            M-step: estimate the translation probability Pr(e|c)

   $$\Pr(a \mid e_j, c_j) = \frac{\Pr(a, e_j, c_j)}{\sum_{a'} \Pr(a', e_j, c_j)} = \frac{\prod_{k=1}^{|e_j|} \Pr(e_{j,k} \mid c_{j,a(k)})}{\sum_{a'} \prod_{k=1}^{|e_j|} \Pr(e_{j,k} \mid c_{j,a'(k)})} = \frac{\prod_{k=1}^{|e_j|} \Pr(e_{j,k} \mid c_{j,a(k)})}{\prod_{k=1}^{|e_j|} \sum_{t=1}^{|c_j|} \Pr(e_{j,k} \mid c_{j,t})}$$
                                        We are lucky here: the sum over all alignments factorizes into a product of per-word sums.
                                        In general, this step can be extremely difficult and usually requires approximate approaches
Compute Pr(e|c)
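   First compute $\Pr(e \mid c; e_i, c_i)$

   $$\Pr(e \mid c; e_i, c_i) = \delta(e \in e_i)\, \delta(c \in c_i) \sum_{a} \Pr(a \mid e_i, c_i)\, \delta(a(e) = c)$$
   $$= \delta(e \in e_i)\, \delta(c \in c_i)\, \frac{\sum_{a} \Pr(a, e_i, c_i)\, \delta(a(e) = c)}{\Pr(e_i, c_i)}$$
   $$= \delta(e \in e_i)\, \delta(c \in c_i)\, \frac{\Pr(e \mid c) \prod_{k=1,\, e_{i,k} \neq e}^{|e_i|} \sum_{t=1}^{|c_i|} \Pr(e_{i,k} \mid c_{i,t})}{\prod_{k=1}^{|e_i|} \sum_{t=1}^{|c_i|} \Pr(e_{i,k} \mid c_{i,t})}$$
   $$= \delta(e \in e_i)\, \delta(c \in c_i)\, \frac{\Pr(e \mid c)}{\sum_{t=1}^{|c_i|} \Pr(e \mid c_{i,t})}$$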
   $$\Pr(e \mid c) \propto \sum_{i=1}^{n} \Pr(e \mid c; e_i, c_i)$$
   (normalized so that $\sum_{e} \Pr(e \mid c) = 1$ for every c)
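The E-step and M-step for the translation model can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function name `em_translation`, the uniform initialization, and the three-sentence toy corpus (with romanized placeholder tokens for the Chinese side) are all assumptions made for the example. Expected counts are accumulated per word position, so a word repeated within a sentence is counted once per occurrence.

```python
from collections import defaultdict

def em_translation(corpus, n_iter=10):
    """Estimate Pr(e | c) from (english_words, chinese_words) sentence pairs."""
    english_vocab = {e for e_sent, _ in corpus for e in e_sent}
    uniform = 1.0 / len(english_vocab)
    prob = defaultdict(lambda: uniform)          # initial guess for Pr(e | c)
    for _ in range(n_iter):
        count = defaultdict(float)               # expected number of (e, c) links
        total = defaultdict(float)               # expected number of links from c
        for e_sent, c_sent in corpus:
            for e in e_sent:
                # E-step: posterior that e is aligned to each c in the sentence,
                # Pr'(e | c) / sum_t Pr'(e | c_t).
                norm = sum(prob[(e, c)] for c in c_sent)
                for c in c_sent:
                    posterior = prob[(e, c)] / norm
                    count[(e, c)] += posterior
                    total[c] += posterior
        # M-step: renormalize the expected counts for every c.
        prob = defaultdict(float)
        for (e, c), value in count.items():
            prob[(e, c)] = value / total[c]
    return prob

# Hypothetical toy corpus; the second element of each pair stands in for Chinese words.
corpus = [(["red", "house"], ["hong", "fangzi"]),
          (["blue", "house"], ["lan", "fangzi"]),
          (["red", "book"], ["hong", "shu"])]

table = em_translation(corpus)
for e in ["house", "red", "blue"]:
    print(e, "| fangzi:", round(table[(e, "fangzi")], 3))
```

On this corpus the probability mass for `fangzi` concentrates on `house`, the only English word that co-occurs with it in both sentences.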
Bound Optimization for A Translation Model
 $\theta = \Pr(e \mid c)$ for the current iteration
 $\theta' = \Pr'(e \mid c)$ for the previous iteration

 $$l(\theta) = \sum_{i=1}^{n} \log \Pr(e_i \mid c_i; \theta) = \sum_{i=1}^{n} \sum_{j=1}^{|e_i|} \log \sum_{k=1}^{|c_i|} \Pr(e_{i,j} \mid c_{i,k})$$
 $$l(\theta') = \sum_{i=1}^{n} \log \Pr(e_i \mid c_i; \theta') = \sum_{i=1}^{n} \sum_{j=1}^{|e_i|} \log \sum_{k=1}^{|c_i|} \Pr'(e_{i,j} \mid c_{i,k})$$
 $$Q(\theta, \theta') = l(\theta) - l(\theta') = \sum_{i=1}^{n} \sum_{j=1}^{|e_i|} \log \frac{\sum_{k=1}^{|c_i|} \Pr(e_{i,j} \mid c_{i,k})}{\sum_{l=1}^{|c_i|} \Pr'(e_{i,j} \mid c_{i,l})}$$
Bound Optimization for A Translation Model
   $$Q(\theta, \theta') = l(\theta) - l(\theta') = \sum_{i=1}^{n} \sum_{j=1}^{|e_i|} \log \frac{\sum_{k=1}^{|c_i|} \Pr(e_{i,j} \mid c_{i,k})}{\sum_{l=1}^{|c_i|} \Pr'(e_{i,j} \mid c_{i,l})}$$
   $$= \sum_{i=1}^{n} \sum_{j=1}^{|e_i|} \log \sum_{k=1}^{|c_i|} \frac{\Pr'(e_{i,j} \mid c_{i,k})}{\sum_{l=1}^{|c_i|} \Pr'(e_{i,j} \mid c_{i,l})} \cdot \frac{\Pr(e_{i,j} \mid c_{i,k})}{\Pr'(e_{i,j} \mid c_{i,k})}$$
   $$\geq \sum_{i=1}^{n} \sum_{j=1}^{|e_i|} \sum_{k=1}^{|c_i|} \frac{\Pr'(e_{i,j} \mid c_{i,k})}{\sum_{l=1}^{|c_i|} \Pr'(e_{i,j} \mid c_{i,l})} \log \frac{\Pr(e_{i,j} \mid c_{i,k})}{\Pr'(e_{i,j} \mid c_{i,k})}$$

   $$\Pr(e \mid c) \propto \sum_{i=1}^{n} \delta(e \in e_i)\, \delta(c \in c_i)\, \frac{\Pr'(e \mid c)}{\sum_{t=1}^{|c_i|} \Pr'(e \mid c_{i,t})}$$
Iterative Scaling
       Maximum entropy model
   $$p(y \mid x; \theta) = \frac{\exp(x \cdot w_y)}{\sum_{y'} \exp(x \cdot w_{y'})}, \qquad l(D_{\text{train}}) = \sum_{i=1}^{N} \log \frac{\exp(x_i \cdot w_{y_i})}{\sum_{y} \exp(x_i \cdot w_y)}$$
        Iterative scaling
            All features $x_{i,j} \geq 0$
            The sum of features is constant: $\sum_{j=1}^{d} x_{i,j} = g$
Iterative Scaling
   Compute the empirical mean for each feature of every class,
    i.e., $e_{y,j} = \sum_{i=1}^{N} x_{i,j}\, \delta(y, y_i) / N$ for every j and every class y
   Start with $w_1, w_2, \ldots, w_c = 0$
   Repeat
       Compute p(y|x) for each training data point $(x_i, y_i)$ using w from the
        previous iteration
       Compute the mean of each feature of every class using the estimated
        probabilities, i.e., $m_{y,j} = \sum_{i=1}^{N} x_{i,j}\, p(y \mid x_i) / N$ for every j and every y
       Compute $\Delta w_{y,j} = \frac{1}{g} \left( \log e_{y,j} - \log m_{y,j} \right)$ for every j and every y
       Update w as $w_{y,j} \leftarrow w_{y,j} + \Delta w_{y,j}$
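The loop above translates almost line by line into NumPy. The following is a sketch under stated assumptions rather than a reference implementation: the function names, the synthetic data, and the fixed iteration count are all invented for the example, and the features are normalized so that every row sums to g = 1. It also assumes every $e_{y,j}$ is strictly positive so the logarithm is defined.

```python
import numpy as np

def class_probabilities(X, W):
    """p(y | x_i) for every row of X under the maximum entropy model."""
    scores = X @ W.T                              # N x c, entries x_i . w_y
    scores -= scores.max(axis=1, keepdims=True)   # stabilise exp()
    P = np.exp(scores)
    return P / P.sum(axis=1, keepdims=True)

def log_likelihood(X, y, W):
    P = class_probabilities(X, W)
    return float(np.log(P[np.arange(len(y)), y]).sum())

def iterative_scaling(X, y, n_classes, n_iter=30):
    N, d = X.shape
    g = float(X[0].sum())
    # The method requires non-negative features with a constant row sum g.
    assert np.all(X >= 0) and np.allclose(X.sum(axis=1), g)
    onehot = np.eye(n_classes)[y]                 # N x c, delta(y, y_i)
    e = onehot.T @ X / N                          # empirical means e_{y,j}
    W = np.zeros((n_classes, d))                  # start with w = 0
    history = [log_likelihood(X, y, W)]
    for _ in range(n_iter):
        P = class_probabilities(X, W)             # p(y | x_i) with current w
        m = P.T @ X / N                           # model means m_{y,j}
        W += (np.log(e) - np.log(m)) / g          # w <- w + delta w
        history.append(log_likelihood(X, y, W))
    return W, history

# Synthetic data, for illustration only: 3 classes, 3 positive features.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=200)
X = rng.random((200, 3)) + 0.1
X[np.arange(200), y] += 1.0                       # make feature y larger for class y
X /= X.sum(axis=1, keepdims=True)                 # rows now sum to g = 1

W, history = iterative_scaling(X, y, n_classes=3)
print(history[0], history[-1])
```

Recording the log-likelihood after every update makes it easy to confirm that each iteration does not decrease it.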
Iterative Scaling
   $\theta = \{w_1, w_2, \ldots, w_c\}$: parameters for the current iteration
   $\theta' = \{w_1', w_2', \ldots, w_c'\}$: parameters for the last iteration

   $$p(y \mid x; \theta) = \frac{\exp(x \cdot w_y)}{\sum_{y'} \exp(x \cdot w_{y'})}$$
   $$l(\theta) = \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta) = \sum_{i=1}^{N} \log \frac{\exp(x_i \cdot w_{y_i})}{\sum_{y} \exp(x_i \cdot w_y)}$$
   $$l(\theta') = \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta') = \sum_{i=1}^{N} \log \frac{\exp(x_i \cdot w'_{y_i})}{\sum_{y} \exp(x_i \cdot w'_y)}$$
   $$l(\theta) - l(\theta') = \sum_{i=1}^{N} \left[ \log \frac{\exp(x_i \cdot w_{y_i})}{\sum_{y} \exp(x_i \cdot w_y)} - \log \frac{\exp(x_i \cdot w'_{y_i})}{\sum_{y} \exp(x_i \cdot w'_y)} \right]$$
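Iterative Scaling
   $$l(\theta) - l(\theta') = \sum_{i=1}^{N} \left[ \log \frac{\exp(x_i \cdot w_{y_i})}{\sum_{y} \exp(x_i \cdot w_y)} - \log \frac{\exp(x_i \cdot w'_{y_i})}{\sum_{y} \exp(x_i \cdot w'_y)} \right]$$
   $$= \sum_{i=1}^{N} \left[ x_i \cdot (w_{y_i} - w'_{y_i}) + \log \sum_{y} \exp(x_i \cdot w'_y) - \log \sum_{y} \exp(x_i \cdot w_y) \right]$$

  Can we use the concavity of the logarithm function?

            No, we can't, because we need a lower bound: Jensen's inequality bounds $\log \sum_{y} \exp(x_i \cdot w_y)$ from below, and this term enters with a minus sign.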
Iterative Scaling
   $$\log x \leq x - 1 \ \Rightarrow\ -\log \sum_{y} \exp(x_i \cdot w_y) \geq -\sum_{y} \exp(x_i \cdot w_y) + 1$$

   $$l(\theta) - l(\theta') = \sum_{i=1}^{N} \left[ x_i \cdot (w_{y_i} - w'_{y_i}) + \log \sum_{y} \exp(x_i \cdot w'_y) - \log \sum_{y} \exp(x_i \cdot w_y) \right]$$
   $$\geq \sum_{i=1}^{N} \left[ x_i \cdot (w_{y_i} - w'_{y_i}) + \log \sum_{y} \exp(x_i \cdot w'_y) - \sum_{y} \exp(x_i \cdot w_y) + 1 \right]$$

        • Weights $w_y$ still couple with each other
        • Still need further decomposition
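Iterative Scaling
   $$\exp\left( \sum_{i} p_i q_i \right) \leq \sum_{i} p_i \exp(q_i) \quad \text{for } \forall i,\ p_i \geq 0,\ \sum_{i} p_i = 1$$

   $$\exp(x_i \cdot w_y) = \exp\left( \sum_{j=1}^{d} x_{i,j} w_{y,j} \right) = \exp\left( \sum_{j=1}^{d} \frac{x_{i,j}}{\sum_{k=1}^{d} x_{i,k}}\, w_{y,j} \sum_{k=1}^{d} x_{i,k} \right)$$
   $$\leq \sum_{j=1}^{d} \frac{x_{i,j}}{\sum_{k=1}^{d} x_{i,k}} \exp\left( w_{y,j} \sum_{k=1}^{d} x_{i,k} \right) = \sum_{j=1}^{d} \frac{x_{i,j}}{g} \exp(g\, w_{y,j})$$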
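The decoupling inequality above can be sanity-checked numerically. The snippet below is a small sketch with randomly drawn, hypothetical feature and weight vectors; it compares $\exp(x \cdot w)$ with $\sum_j (x_j / g) \exp(g\, w_j)$ for non-negative features whose sum is g.

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = []
for _ in range(1000):
    x = rng.random(5)                 # non-negative features (hypothetical)
    w = rng.normal(size=5)            # arbitrary weights (hypothetical)
    g = x.sum()                       # the constant feature sum for this x
    lhs = np.exp(x @ w)               # exp(x . w)
    rhs = np.sum(x / g * np.exp(g * w))   # sum_j (x_j / g) exp(g w_j)
    ratios.append(rhs / lhs)

print(min(ratios))
```

The ratio never drops below 1, as Jensen's inequality for the convex function exp requires; the right-hand side has the advantage that each weight $w_{y,j}$ appears in its own term.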
   $$l(\theta) - l(\theta') \geq \sum_{i=1}^{N} \left[ x_i \cdot (w_{y_i} - w'_{y_i}) + \log \sum_{y} \exp(x_i \cdot w'_y) - \sum_{y} \exp(x_i \cdot w_y) + 1 \right]$$
   $$\geq \sum_{i=1}^{N} \left[ \sum_{j} x_{i,j} (w_{y_i,j} - w'_{y_i,j}) + \log \sum_{y} \exp(x_i \cdot w'_y) - \sum_{y} \sum_{j} \frac{x_{i,j}}{g} \exp(g\, w_{y,j}) + 1 \right]$$
Iterative Scaling
   $$Q(\theta, \theta') = \sum_{i=1}^{N} \left[ \sum_{j} x_{i,j} (w_{y_i,j} - w'_{y_i,j}) + \log \sum_{y} \exp(x_i \cdot w'_y) - \sum_{y} \sum_{j} \frac{x_{i,j}}{g} \exp(g\, w_{y,j}) + 1 \right]$$
   $$= \sum_{i=1}^{N} \left[ \log \sum_{y} \exp(x_i \cdot w'_y) + 1 \right] + \sum_{i=1}^{N} \sum_{y} \sum_{j} \left[ x_{i,j} (w_{y,j} - w'_{y,j})\, \delta(y, y_i) - \frac{x_{i,j}}{g} \exp(g\, w_{y,j}) \right]$$

   $$\frac{\partial Q(\theta, \theta')}{\partial w_{y,j}} = \sum_{i=1}^{N} \left[ x_{i,j}\, \delta(y, y_i) - x_{i,j} \exp(g\, w_{y,j}) \right] = 0$$
   $$\Rightarrow\ w_{y,j} = \frac{1}{g} \log \frac{\sum_{i=1}^{N} x_{i,j}\, \delta(y, y_i)}{\sum_{i=1}^{N} x_{i,j}}$$

             Wait a minute, this cannot be right! What happened?
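Logarithm Bound Algorithm
   • Start with an initial guess $\theta_1^0, \theta_2^0$
   • Come up with a lower bound $l(\theta_1, \theta_2) \geq l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2)$
       $Q(\theta_1, \theta_2)$ is a concave function
       Touch point: $Q(\theta_1 = \theta_1^0, \theta_2 = \theta_2^0) = 0$
   • Search for the optimal solution that maximizes $Q(\theta_1, \theta_2)$
   [Figure: the curves $l(\theta_1, \theta_2)$ and $l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2)$, with axis marks at $\theta_1^0, \theta_2^0$ and $\theta_1^1, \theta_2^1$.]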
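The two properties of Q listed above (lower bound and touch point) are enough to show that one round of maximizing Q cannot decrease the log-likelihood. As a short worked derivation, with $\theta^0$ as shorthand (introduced here) for the current guess $(\theta_1^0, \theta_2^0)$ and $\theta^1$ for the maximizer of Q:

```latex
\begin{align*}
l(\theta^1) &\geq l(\theta^0) + Q(\theta^1) && \text{(lower bound)} \\
            &\geq l(\theta^0) + Q(\theta^0) && \text{($\theta^1$ maximizes $Q$)} \\
            &= l(\theta^0)                  && \text{(touch point, $Q(\theta^0) = 0$)}
\end{align*}
```

The last step is the one that fails if the bound does not touch the objective at the current parameters.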
Iterative Scaling
   $$Q(\theta, \theta') = \sum_{i=1}^{N} \left[ \log \sum_{y} \exp(x_i \cdot w'_y) + 1 \right] + \sum_{i=1}^{N} \sum_{y} \sum_{j} \left[ x_{i,j} (w_{y,j} - w'_{y,j})\, \delta(y, y_i) - \frac{x_{i,j}}{g} \exp(g\, w_{y,j}) \right]$$

   $$Q(\theta = \theta', \theta') = \sum_{i=1}^{N} \left[ \log \sum_{y} \exp(x_i \cdot w'_y) + 1 \right] + \sum_{i=1}^{N} \sum_{y} \sum_{j} \left[ x_{i,j} (w'_{y,j} - w'_{y,j})\, \delta(y, y_i) - \frac{x_{i,j}}{g} \exp(g\, w'_{y,j}) \right]$$
   $$= \sum_{i=1}^{N} \left[ \log \sum_{y} \exp(x_i \cdot w'_y) + 1 \right] - \sum_{i=1}^{N} \sum_{y} \sum_{j} \frac{x_{i,j}}{g} \exp(g\, w'_{y,j}) \neq 0$$

                               Where does it go wrong?
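Iterative Scaling
   $$\log x \leq x - 1 \ \Rightarrow\ -\log \sum_{y} \exp(x_i \cdot w_y) \geq -\sum_{y} \exp(x_i \cdot w_y) + 1$$

   $$l(\theta) - l(\theta') = \sum_{i=1}^{N} \left[ x_i \cdot (w_{y_i} - w'_{y_i}) + \log \sum_{y} \exp(x_i \cdot w'_y) - \log \sum_{y} \exp(x_i \cdot w_y) \right]$$
   $$\geq \sum_{i=1}^{N} \left[ x_i \cdot (w_{y_i} - w'_{y_i}) + \log \sum_{y} \exp(x_i \cdot w'_y) - \sum_{y} \exp(x_i \cdot w_y) + 1 \right]$$

                                                           Not zero when $\theta = \theta'$

   $$\log x \leq x - 1 \ \Rightarrow\ \log \frac{\sum_{y} \exp(x_i \cdot w_y)}{\sum_{y} \exp(x_i \cdot w'_y)} \leq \frac{\sum_{y} \exp(x_i \cdot w_y)}{\sum_{y} \exp(x_i \cdot w'_y)} - 1$$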
Iterative Scaling
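$$\log x \le x - 1 \;\Rightarrow\; \log \frac{\sum_y \exp(x_i \cdot w_y)}{\sum_y \exp(x_i \cdot w'_y)} \le \frac{\sum_y \exp(x_i \cdot w_y)}{\sum_y \exp(x_i \cdot w'_y)} - 1$$

$$l(\theta) - l(\theta') \ge \sum_{i=1}^{N} \left[ x_i \cdot (w_{y_i} - w'_{y_i}) - \frac{\sum_y \exp(x_i \cdot w_y)}{\sum_y \exp(x_i \cdot w'_y)} + 1 \right]$$

   With $\delta_y = w_y - w'_y$:

$$= \sum_{i=1}^{N} \left[ x_i \cdot \delta_{y_i} - \frac{\sum_y \exp(x_i \cdot w'_y)\,\exp(x_i \cdot \delta_y)}{\sum_y \exp(x_i \cdot w'_y)} + 1 \right]$$

   By the definition of the conditional exponential model:

$$= \sum_{i=1}^{N} \left[ x_i \cdot \delta_{y_i} - \sum_y p(y \mid x_i; \theta')\,\exp(x_i \cdot \delta_y) + 1 \right]$$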
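The bound above can be checked numerically. The following is a minimal sketch, assuming a small made-up dataset and hypothetical helper names (`cond_probs`, `log_likelihood`, `lower_bound`) that do not come from the slides. It evaluates the true change in log-likelihood and the lower bound, and confirms that the bound is zero when the new weights equal the old ones.

```python
import math

def cond_probs(x, W):
    """p(y | x; W) for a conditional exponential model, one weight vector per class."""
    scores = [sum(xj * wj for xj, wj in zip(x, w)) for w in W]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def log_likelihood(X, Y, W):
    return sum(math.log(cond_probs(x, W)[y]) for x, y in zip(X, Y))

def lower_bound(X, Y, W_new, W_old):
    """sum_i [ x_i . delta_{y_i} - sum_y p(y | x_i; W_old) exp(x_i . delta_y) + 1 ]"""
    total = 0.0
    for x, y_i in zip(X, Y):
        p_old = cond_probs(x, W_old)
        # x_i . delta_y for every class y, where delta_y = w_y - w'_y
        dots = [sum(xj * (a - b) for xj, a, b in zip(x, w_new, w_old))
                for w_new, w_old in zip(W_new, W_old)]
        total += dots[y_i] - sum(p * math.exp(d) for p, d in zip(p_old, dots)) + 1.0
    return total

# Illustrative data only (an assumption, not from the slides).
X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
Y = [0, 1, 1]
W_old = [[0.0, 0.0], [0.0, 0.0]]
W_new = [[0.3, -0.2], [-0.1, 0.4]]

gain = log_likelihood(X, Y, W_new) - log_likelihood(X, Y, W_old)
bound = lower_bound(X, Y, W_new, W_old)
at_old = lower_bound(X, Y, W_old, W_old)  # bound evaluated at theta = theta'
print(gain, bound, at_old)
```

Because the bound touches zero at the old weights, any update that makes the bound positive is guaranteed to increase the log-likelihood.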
Iterative Scaling
                                                                                          
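$$\begin{aligned}
\exp(x_i \cdot \delta_y) &= \exp\Big(\sum_{j=1}^{d} x_{i,j}\,\delta_{y,j}\Big) = \exp\Big(\sum_{j=1}^{d} \frac{x_{i,j}}{\sum_{k=1}^{d} x_{i,k}}\;\delta_{y,j} \sum_{k=1}^{d} x_{i,k}\Big) \\
&\le \sum_{j=1}^{d} \frac{x_{i,j}}{\sum_{k=1}^{d} x_{i,k}} \exp\Big(\delta_{y,j} \sum_{k=1}^{d} x_{i,k}\Big) = \sum_{j=1}^{d} \frac{x_{i,j}}{g} \exp(g\,\delta_{y,j})
\end{aligned}$$

   The inequality is Jensen's inequality for the convex function exp. It needs
    nonnegative features and $g = \sum_{k=1}^{d} x_{i,k}$ equal to the same constant for every example.
   $\delta(y, y_i)$ is 1 when $y = y_i$ and 0 otherwise.

$$\begin{aligned}
l(\theta) - l(\theta') &\ge \sum_{i=1}^{N} \Big[ x_i \cdot \delta_{y_i} - \sum_y p(y \mid x_i; \theta')\,\exp(x_i \cdot \delta_y) + 1 \Big] \\
&\ge \sum_{i=1}^{N} \Big[ \sum_j x_{i,j}\,\delta_{y_i,j} - \sum_y p(y \mid x_i; \theta') \sum_j \frac{x_{i,j}}{g} \exp(g\,\delta_{y,j}) + 1 \Big] \\
&= \sum_{i=1}^{N} \Big\{ \sum_j \sum_y \Big[ x_{i,j}\,\delta_{y,j}\,\delta(y, y_i) - p(y \mid x_i; \theta')\,\frac{x_{i,j}}{g} \exp(g\,\delta_{y,j}) \Big] + 1 \Big\}
\end{aligned}$$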
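The decoupling step is Jensen's inequality applied to exp, with the normalized features acting as convex-combination weights. A minimal sketch, assuming nonnegative features and illustrative values (the function names `coupled` and `decoupled` are hypothetical), compares the two sides for a single example and class:

```python
import math

def coupled(x, delta):
    """Left-hand side: exp(sum_j x_j * delta_j)."""
    return math.exp(sum(xj * dj for xj, dj in zip(x, delta)))

def decoupled(x, delta):
    """Right-hand side: sum_j (x_j / g) * exp(g * delta_j), with g = sum_k x_k."""
    g = sum(x)
    return sum((xj / g) * math.exp(g * dj) for xj, dj in zip(x, delta))

# Illustrative nonnegative features and weight changes (assumed values).
x = [0.2, 0.5, 0.3]
delta = [0.4, -1.0, 0.7]

lhs = coupled(x, delta)
rhs = decoupled(x, delta)

# When every delta_j is the same, the two sides coincide.
same = [0.3, 0.3, 0.3]
lhs_same = coupled(x, same)
rhs_same = decoupled(x, same)
print(lhs, rhs, lhs_same, rhs_same)
```

The right-hand side is a sum of terms that each depend on a single `delta_j`, which is what makes the later maximization separable.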
                                                                                             
Iterative Scaling
                            
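$$Q(\theta, \theta') = \sum_{i=1}^{N} \Big\{ \sum_j \sum_y \Big[ x_{i,j}\,\delta_{y,j}\,\delta(y, y_i) - p(y \mid x_i; \theta')\,\frac{x_{i,j}}{g} \exp(g\,\delta_{y,j}) \Big] + 1 \Big\}$$

$$\frac{\partial Q(\theta, \theta')}{\partial \delta_{y,j}} = \sum_{i=1}^{N} \Big[ x_{i,j}\,\delta(y, y_i) - p(y \mid x_i; \theta')\,x_{i,j} \exp(g\,\delta_{y,j}) \Big] = 0$$

$$\delta_{y,j} = w_{y,j} - w'_{y,j} = \frac{1}{g} \log \frac{\sum_{i=1}^{N} x_{i,j}\,\delta(y, y_i)}{\sum_{i=1}^{N} p(y \mid x_i; \theta')\,x_{i,j}}$$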
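The closed-form update can be turned into a short training loop. The sketch below is illustrative only: the dataset is made up, every feature row is assumed nonnegative with the same sum `g`, every (class, feature) pair is assumed to have a positive observed total so the logarithm is defined, and the helper names are hypothetical.

```python
import math

def cond_probs(x, W):
    """p(y | x; W) for a conditional exponential model."""
    scores = [sum(xj * wj for xj, wj in zip(x, w)) for w in W]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def log_likelihood(X, Y, W):
    return sum(math.log(cond_probs(x, W)[y]) for x, y in zip(X, Y))

def scaling_step(X, Y, W, g):
    """One update: delta_{y,j} = (1/g) * log(observed / expected)."""
    P = [cond_probs(x, W) for x in X]  # probabilities under the old weights
    W_next = [row[:] for row in W]
    for y in range(len(W)):
        for j in range(len(W[0])):
            observed = sum(x[j] for x, y_i in zip(X, Y) if y_i == y)
            expected = sum(p[y] * x[j] for x, p in zip(X, P))
            W_next[y][j] += math.log(observed / expected) / g
    return W_next

# Assumed toy data: nonnegative features, each row sums to g = 1.
X = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9], [0.6, 0.4], [0.3, 0.7]]
Y = [0, 0, 1, 1, 1, 0]
g = 1.0

W = [[0.0, 0.0], [0.0, 0.0]]
history = [log_likelihood(X, Y, W)]
for _ in range(25):
    W = scaling_step(X, Y, W, g)
    history.append(log_likelihood(X, Y, W))
print(history[0], history[-1])
```

Each step maximizes a lower bound that is zero at the current weights, so the recorded log-likelihood should never decrease.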
Iterative Scaling
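    How about $\sum_{j=1}^{d} x_{i,j} = g_i \ne$ constant?

$$\begin{aligned}
\exp(x_i \cdot \delta_y) &= \exp\Big(\sum_{j=1}^{d} x_{i,j}\,\delta_{y,j}\Big) = \exp\Big(\sum_{j=1}^{d} \frac{x_{i,j}}{\sum_{k=1}^{d} x_{i,k}}\;\delta_{y,j} \sum_{k=1}^{d} x_{i,k}\Big) \\
&\le \sum_{j=1}^{d} \frac{x_{i,j}}{\sum_{k=1}^{d} x_{i,k}} \exp\Big(\delta_{y,j} \sum_{k=1}^{d} x_{i,k}\Big) = \sum_{j=1}^{d} \frac{x_{i,j}}{g_i} \exp(g_i\,\delta_{y,j})
\end{aligned}$$

$$Q(\theta, \theta') = \sum_{i=1}^{N} \Big\{ \sum_j \sum_y \Big[ x_{i,j}\,\delta_{y,j}\,\delta(y, y_i) - p(y \mid x_i; \theta')\,\frac{x_{i,j}}{g_i} \exp(g_i\,\delta_{y,j}) \Big] + 1 \Big\}$$

$$\frac{\partial Q(\theta, \theta')}{\partial \delta_{y,j}} = \sum_{i=1}^{N} \Big[ x_{i,j}\,\delta(y, y_i) - p(y \mid x_i; \theta')\,x_{i,j} \exp(g_i\,\delta_{y,j}) \Big] = 0$$

                                         Is this solution unique?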
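When the feature sums $g_i$ differ across examples, the stationarity condition no longer has a closed form, but it is a one-dimensional equation in each $\delta_{y,j}$. With nonnegative features and a positive observed total, its left-hand side is strictly decreasing in $\delta_{y,j}$, so there is exactly one root. A minimal sketch using bisection follows; the function name `solve_delta` and all input values are assumptions for illustration, with `weights[i]` standing for $p(y \mid x_i; \theta')\,x_{i,j}$.

```python
import math

def solve_delta(observed, weights, sums, iters=200):
    """Root of f(d) = observed - sum_i weights[i] * exp(sums[i] * d).

    Assumes observed > 0, weights[i] >= 0 with at least one positive entry,
    and sums[i] > 0, so f is strictly decreasing and has exactly one root.
    """
    def f(d):
        return observed - sum(b * math.exp(g * d) for b, g in zip(weights, sums))

    lo, hi = -1.0, 1.0
    while f(lo) <= 0.0:   # widen the bracket until f(lo) > 0
        lo *= 2.0
    while f(hi) >= 0.0:   # widen the bracket until f(hi) < 0
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

observed = 1.5
weights = [0.4, 0.3, 0.2]

# Constant g: bisection should agree with the closed form (1/g) log(observed / sum(weights)).
root_const = solve_delta(observed, weights, [2.0, 2.0, 2.0])
closed_form = math.log(observed / sum(weights)) / 2.0

# Varying g_i: no closed form, so check the residual of the equation instead.
sums = [1.0, 2.0, 3.0]
root_var = solve_delta(observed, weights, sums)
residual = observed - sum(b * math.exp(g * root_var) for b, g in zip(weights, sums))
print(root_const, closed_form, root_var, residual)
```

Bisection is used here only because it is simple and cannot diverge on a monotone function; a one-dimensional Newton iteration would be a reasonable alternative.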
Iterative Scaling
    How about negative features?

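$$\begin{aligned}
\exp(x_i \cdot \delta_y) &= \exp\Big(\sum_{j=1}^{d} x_{i,j}\,\delta_{y,j}\Big) = \exp\Big(\sum_{j=1}^{d} \frac{1}{d}\; d\,\delta_{y,j}\,x_{i,j}\Big) \\
&\le \sum_{j=1}^{d} \frac{1}{d} \exp(d\,\delta_{y,j}\,x_{i,j})
\end{aligned}$$

$$Q(\theta, \theta') = \sum_{i=1}^{N} \Big\{ \sum_j \sum_y \Big[ x_{i,j}\,\delta_{y,j}\,\delta(y, y_i) - p(y \mid x_i; \theta')\,\frac{1}{d} \exp(x_{i,j}\,\delta_{y,j}\,d) \Big] + 1 \Big\}$$

$$\frac{\partial Q(\theta, \theta')}{\partial \delta_{y,j}} = \sum_{i=1}^{N} x_{i,j} \Big[ \delta(y, y_i) - p(y \mid x_i; \theta') \exp(d\,\delta_{y,j}\,x_{i,j}) \Big] = 0$$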
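A small numeric check shows why negative features need the uniform weights $1/d$. With a negative feature, the normalized weights $x_{i,j} / g_i$ are no longer a convex combination, so the earlier bound can fail, while the uniform-weight bound still holds. This is a sketch with assumed values and hypothetical function names:

```python
import math

def coupled(x, delta):
    """exp(sum_j x_j * delta_j)"""
    return math.exp(sum(xj * dj for xj, dj in zip(x, delta)))

def normalized_bound(x, delta):
    """sum_j (x_j / g) * exp(g * delta_j); only valid for nonnegative features."""
    g = sum(x)
    return sum((xj / g) * math.exp(g * dj) for xj, dj in zip(x, delta))

def uniform_bound(x, delta):
    """sum_j (1/d) * exp(d * delta_j * x_j); valid for features of any sign."""
    d = len(x)
    return sum(math.exp(d * dj * xj) for xj, dj in zip(x, delta)) / d

# One negative feature (assumed example values).
x = [2.0, -1.0]
delta = [0.0, 1.0]

exact = coupled(x, delta)
broken = normalized_bound(x, delta)   # falls below the exact value here
safe = uniform_bound(x, delta)        # remains an upper bound
print(exact, broken, safe)
```

In this example the normalized expression even goes negative, so it cannot bound a positive exponential from above.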
Faster Iterative Scaling
    The lower bound may not be tight, given that all the
     coupling between weights has been removed
                            
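$$\begin{aligned}
Q(\theta, \theta') &= \sum_{i=1}^{N} \Big\{ \sum_j \sum_y \Big[ x_{i,j}\,\delta_{y,j}\,\delta(y, y_i) - p(y \mid x_i; \theta')\,\frac{x_{i,j}}{g_i} \exp(g_i\,\delta_{y,j}) \Big] + 1 \Big\} \\
&= \sum_{i=1}^{N} \sum_j \sum_y q(\delta_{y,j})
\end{aligned}$$

                                                             Univariate functions!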

    A tighter bound can be derived by not fully
     decoupling the correlation between weights
                  
$$Q(\theta, \theta') = \sum_j \Big[ \sum_{i,y} \delta(y, y_i)\,x_{i,j}\,\delta_{y,j} - \sum_i \frac{x_{i,j}}{g_i} \log\Big( \sum_y p(y \mid x_i; \theta')\,e^{\delta_{y,j}\,g_i} \Big) \Big]$$
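The two bounds share the linear term $x_i \cdot \delta_{y_i}$ and differ only in how they treat the log-normalizer. The sketch below compares, for a single example, the exact term, the partially decoupled (tighter) term, and the fully decoupled (looser) term. All values and function names are assumptions for illustration; `p` stands for $p(y \mid x_i; \theta')$ and `delta[y][j]` for $\delta_{y,j}$.

```python
import math

def exact_term(x, p, delta):
    """-log sum_y p_y * exp(x . delta_y)"""
    return -math.log(sum(
        p_y * math.exp(sum(xj * dj for xj, dj in zip(x, d_y)))
        for p_y, d_y in zip(p, delta)))

def tighter_term(x, p, delta):
    """-sum_j (x_j / g) * log sum_y p_y * exp(g * delta_{y,j})"""
    g = sum(x)
    return -sum(
        (xj / g) * math.log(sum(p_y * math.exp(g * d_y[j]) for p_y, d_y in zip(p, delta)))
        for j, xj in enumerate(x))

def looser_term(x, p, delta):
    """1 - sum_y p_y * sum_j (x_j / g) * exp(g * delta_{y,j})"""
    g = sum(x)
    return 1.0 - sum(
        p_y * sum((xj / g) * math.exp(g * d_y[j]) for j, xj in enumerate(x))
        for p_y, d_y in zip(p, delta))

# Assumed example: 3 nonnegative features, 3 classes.
x = [1.0, 2.0, 0.5]
p = [0.2, 0.5, 0.3]
delta = [[0.3, -0.2, 0.1],
         [-0.4, 0.5, 0.0],
         [0.2, 0.1, -0.6]]

exact = exact_term(x, p, delta)
tight = tighter_term(x, p, delta)
loose = looser_term(x, p, delta)
print(loose, tight, exact)
```

The tighter term keeps the classes coupled inside the logarithm for each feature, so its maximization is no longer a set of fully independent univariate problems.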
                                                                                                     
Faster Iterative Scaling
 [Figure: log-likelihood plot]
Bad News
   You may feel great after the struggle of the derivation.
   However, is iterative scaling truly a great idea?
   Given how extensively optimization has been studied, we
    should try existing methods first.
Comparing Improved Iterative Scaling to Newton’s Method

Dataset     Instances    Features
Rule           29,602         246
Lex            42,509     135,182
Summary        24,044     198,467
Shallow     8,625,782     264,142

Dataset    Method                                Iterations    Time (s)
Rule       Improved iterative scaling                   823       42.48
           Limited-memory quasi-Newton method            81        1.13
Lex        Improved iterative scaling                   241      102.18
           Limited-memory quasi-Newton method           176       20.02
Summary    Improved iterative scaling                   626      208.22
           Limited-memory quasi-Newton method            69        8.52
Shallow    Improved iterative scaling                  3216    71053.12
           Limited-memory quasi-Newton method           421     2420.30

   Try out the standard numerical methods before you get excited
    about your algorithm

								