# Expectation Maximization Algorithm

Rong Jin
## A Mixture Model Problem

*(Figure: histogram of the data over x ∈ [0, 25].)*

• Apparently, the dataset consists of two modes
• How can we automatically identify the two modes?
## Gaussian Mixture Model (GMM)

• Assume that the dataset is generated by two mixed Gaussian distributions
  - Gaussian model 1: $\theta_1 = \{\mu_1, \sigma_1; p_1\}$
  - Gaussian model 2: $\theta_2 = \{\mu_2, \sigma_2; p_2\}$
• If we knew the membership of each bin, estimating the two Gaussian models would be easy.
• How can we estimate the two Gaussian models without knowing the memberships of the bins?
## EM Algorithm for GMM

• Let the memberships be hidden variables:

$$\{x_1, x_2, \ldots, x_n\} \;\rightarrow\; \{(x_1, m_1), (x_2, m_2), \ldots, (x_n, m_n)\}$$

• EM algorithm for the Gaussian mixture model
  - Unknown memberships: $(x_1, m_1), (x_2, m_2), \ldots, (x_n, m_n)$
  - Unknown Gaussian models: $\theta_1 = \{\mu_1, \sigma_1; p_1\}$, $\theta_2 = \{\mu_2, \sigma_2; p_2\}$
• Learn these two sets of parameters iteratively
• Randomly assign the memberships to each bin
• Estimate the means and variance of each Gaussian model

*(Figure: histogram with random initial memberships and the resulting initial Gaussian fits, x ∈ [0, 25].)*
## E-step

• Fix the two Gaussian models
• Estimate the posterior for each data point:

$$p(m=1 \mid x) = \frac{p(x, \theta_1)}{p(x)} = \frac{p(x, \theta_1)}{p(x, \theta_1) + p(x, \theta_2)} = \frac{p(x \mid \mu_1, \sigma_1)\, p_1}{p(x \mid \mu_1, \sigma_1)\, p_1 + p(x \mid \mu_2, \sigma_2)\, p_2}$$

$$p(m=2 \mid x) = \frac{p(x, \theta_2)}{p(x)} = \frac{p(x, \theta_2)}{p(x, \theta_1) + p(x, \theta_2)} = \frac{p(x \mid \mu_2, \sigma_2)\, p_2}{p(x \mid \mu_1, \sigma_1)\, p_1 + p(x \mid \mu_2, \sigma_2)\, p_2}$$

where

$$p(x \mid \mu_1, \sigma_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\!\left(-\frac{(x-\mu_1)^2}{2\sigma_1^2}\right), \qquad p(x \mid \mu_2, \sigma_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\!\left(-\frac{(x-\mu_2)^2}{2\sigma_2^2}\right)$$
## EM Algorithm for GMM

• Re-estimate the memberships for each bin

*(Figure: updated membership assignments over the histogram.)*
## M-Step

• Fix the memberships
• Re-estimate the two Gaussian models by maximizing the expected log-likelihood, weighted by the posteriors:

$$l = \sum_{i=1}^n \left[ \hat{p}(m_i{=}1 \mid x_i) \log p(x_i, \theta_1) + \hat{p}(m_i{=}2 \mid x_i) \log p(x_i, \theta_2) \right]$$

$$= \sum_{i=1}^n \left[ \hat{p}(m_i{=}1 \mid x_i) \big(\log p_1 + \log p(x_i \mid \mu_1, \sigma_1)\big) + \hat{p}(m_i{=}2 \mid x_i) \big(\log p_2 + \log p(x_i \mid \mu_2, \sigma_2)\big) \right]$$

This yields the updates, each weighted by the posteriors:

$$p_1 = \frac{\sum_{i=1}^n \hat{p}(m_i{=}1 \mid x_i)}{n}, \quad \mu_1 = \frac{\sum_{i=1}^n \hat{p}(m_i{=}1 \mid x_i)\, x_i}{\sum_{i=1}^n \hat{p}(m_i{=}1 \mid x_i)}, \quad \sigma_1^2 = \frac{\sum_{i=1}^n \hat{p}(m_i{=}1 \mid x_i)\, x_i^2}{\sum_{i=1}^n \hat{p}(m_i{=}1 \mid x_i)} - \mu_1^2$$

$$p_2 = \frac{\sum_{i=1}^n \hat{p}(m_i{=}2 \mid x_i)}{n}, \quad \mu_2 = \frac{\sum_{i=1}^n \hat{p}(m_i{=}2 \mid x_i)\, x_i}{\sum_{i=1}^n \hat{p}(m_i{=}2 \mid x_i)}, \quad \sigma_2^2 = \frac{\sum_{i=1}^n \hat{p}(m_i{=}2 \mid x_i)\, x_i^2}{\sum_{i=1}^n \hat{p}(m_i{=}2 \mid x_i)} - \mu_2^2$$
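The E-step and M-step above can be sketched end-to-end in a short script. A minimal illustration for a two-component 1-D mixture; the synthetic data, the sorted-split initialization, and the small variance floor are choices for this demo, not part of the slides:

```python
import math
import random

def em_gmm(xs, iters=100):
    """EM for a 1-D two-component mixture p1*N(mu1,s1) + p2*N(mu2,s2)."""
    xs = sorted(xs)
    half = len(xs) // 2
    # Crude initialization (demo choice): split the sorted data in half.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    sd = [1.0, 1.0]
    p = [0.5, 0.5]

    def pdf(x, m, s):
        return math.exp(-(x - m) ** 2 / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)

    for _ in range(iters):
        # E-step: posteriors p(m=k | x) with the two models fixed.
        post = []
        for x in xs:
            joint = [p[k] * pdf(x, mu[k], sd[k]) for k in (0, 1)]
            z = joint[0] + joint[1]
            post.append([joint[0] / z, joint[1] / z])
        # M-step: re-estimate p_k, mu_k, sigma_k, weighted by the posteriors.
        for k in (0, 1):
            w = sum(r[k] for r in post)
            p[k] = w / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(post, xs)) / w
            var = sum(r[k] * x * x for r, x in zip(post, xs)) / w - mu[k] ** 2
            sd[k] = math.sqrt(max(var, 1e-6))  # floor to avoid a degenerate sigma -> 0
    return mu, sd, p

random.seed(0)
data = [random.gauss(5, 1) for _ in range(200)] + [random.gauss(15, 2) for _ in range(200)]
mu, sd, p = em_gmm(data)
print(sorted(mu))  # the two means land near 5 and 15
```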
## EM Algorithm for GMM

• Re-estimate the memberships for each bin
• Re-estimate the models

*(Figure: refined memberships and Gaussian fits after another iteration.)*
## At the 5-th Iteration

• The red Gaussian component slowly shifts toward the left end of the x axis

*(Figure: mixture fit at iteration 5.)*
## At the 10-th Iteration

• The red Gaussian component still slowly shifts toward the left end of the x axis

*(Figure: mixture fit at iteration 10.)*
## At the 20-th Iteration

• The red Gaussian component makes a more noticeable shift toward the left end of the x axis

*(Figure: mixture fit at iteration 20.)*
## At the 50-th Iteration

• The red Gaussian component is close to the desirable location

*(Figure: mixture fit at iteration 50.)*
## At the 100-th Iteration

• The results are almost identical to the ones for the 50-th iteration

*(Figure: mixture fit at iteration 100.)*
## EM as A Bound Optimization

• The EM algorithm in fact maximizes the log-likelihood function of the training data
• Likelihood for a data point $x$:

$$p(x) = p(x, \theta_1) + p(x, \theta_2) = p(x \mid \mu_1, \sigma_1)\, p_1 + p(x \mid \mu_2, \sigma_2)\, p_2$$

$$p(x \mid \mu_1, \sigma_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\!\left(-\frac{(x-\mu_1)^2}{2\sigma_1^2}\right), \qquad p(x \mid \mu_2, \sigma_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\!\left(-\frac{(x-\mu_2)^2}{2\sigma_2^2}\right)$$

• Log-likelihood of the training data:

$$l(\theta_1, \theta_2) = \sum_{i=1}^n \log p(x_i) = \sum_{i=1}^n \log \left[ p(x_i \mid \mu_1, \sigma_1)\, p_1 + p(x_i \mid \mu_2, \sigma_2)\, p_2 \right]$$
## Logarithm Bound Algorithm

*(Figure: the log-likelihood surface $l(\theta_1, \theta_2)$ with a concave lower bound touching it at $(\theta_1^0, \theta_2^0)$.)*

• Come up with a lower bound

$$l(\theta_1, \theta_2) \ge l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2)$$

where $Q(\theta_1, \theta_2)$ is a concave function with touch point $Q(\theta_1 = \theta_1^0, \theta_2 = \theta_2^0) = 0$

• Search for the solution $(\theta_1^1, \theta_2^1)$ that maximizes $Q(\theta_1, \theta_2)$
• Rebuild the bound at $(\theta_1^1, \theta_2^1)$ and repeat the procedure: $(\theta_1^0, \theta_2^0) \rightarrow (\theta_1^1, \theta_2^1) \rightarrow (\theta_1^2, \theta_2^2) \rightarrow \ldots$
• Converge to a local optimum
## EM as A Bound Optimization

• Parameters from the previous iteration: $\theta_1', \theta_2'$
• Parameters for the current iteration: $\theta_1, \theta_2$
• Compute $Q(\theta_1, \theta_2)$. Start from the log-likelihood difference:

$$l(\theta_1, \theta_2) - l(\theta_1', \theta_2') = \sum_{i=1}^n \log \left[ \frac{p(x_i \mid \mu_1, \sigma_1)\, p_1 + p(x_i \mid \mu_2, \sigma_2)\, p_2}{p(x_i \mid \mu_1', \sigma_1')\, p_1' + p(x_i \mid \mu_2', \sigma_2')\, p_2'} \right]$$

Rewrite the ratio as a convex combination:

$$= \sum_{i=1}^n \log \left[ \frac{p(x_i \mid \mu_1', \sigma_1')\, p_1'}{p(x_i \mid \mu_1', \sigma_1')\, p_1' + p(x_i \mid \mu_2', \sigma_2')\, p_2'} \cdot \frac{p(x_i \mid \mu_1, \sigma_1)\, p_1}{p(x_i \mid \mu_1', \sigma_1')\, p_1'} \;+\; \frac{p(x_i \mid \mu_2', \sigma_2')\, p_2'}{p(x_i \mid \mu_1', \sigma_1')\, p_1' + p(x_i \mid \mu_2', \sigma_2')\, p_2'} \cdot \frac{p(x_i \mid \mu_2, \sigma_2)\, p_2}{p(x_i \mid \mu_2', \sigma_2')\, p_2'} \right]$$

By the concave property of the logarithm function,

$$\log\big(p\,\alpha + (1-p)\,\beta\big) \ge p \log \alpha + (1-p) \log \beta \qquad \text{for } 0 \le p \le 1,\; \alpha, \beta > 0,$$

and by the definition of the posterior, the mixing weights above are exactly $p(m{=}1 \mid x_i; \theta_1', \theta_2')$ and $p(m{=}2 \mid x_i; \theta_1', \theta_2')$, so

$$l(\theta_1, \theta_2) - l(\theta_1', \theta_2') \;\ge\; \sum_{i=1}^n \left[ p(m{=}1 \mid x_i; \theta_1', \theta_2') \log \frac{p(x_i \mid \mu_1, \sigma_1)\, p_1}{p(x_i \mid \mu_1', \sigma_1')\, p_1'} + p(m{=}2 \mid x_i; \theta_1', \theta_2') \log \frac{p(x_i \mid \mu_2, \sigma_2)\, p_2}{p(x_i \mid \mu_2', \sigma_2')\, p_2'} \right]$$

EM maximizes this lower bound, which serves as $Q(\theta_1, \theta_2)$.
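The bound can be checked numerically. A small sketch (the dataset and both parameter settings are arbitrary) verifying that $Q(\theta', \theta') = 0$ at the touch point and that the bound never exceeds the true log-likelihood gain:

```python
import math, random

def npdf(x, m, s):
    return math.exp(-(x - m) ** 2 / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)

def loglik(xs, th):
    mu1, s1, mu2, s2, p1 = th
    return sum(math.log(p1 * npdf(x, mu1, s1) + (1 - p1) * npdf(x, mu2, s2)) for x in xs)

def Q(xs, th, th0):
    """Jensen lower bound on loglik(th) - loglik(th0), built from posteriors under th0."""
    mu1, s1, mu2, s2, p1 = th
    a1, b1, a2, b2, q1 = th0
    val = 0.0
    for x in xs:
        f1o, f2o = q1 * npdf(x, a1, b1), (1 - q1) * npdf(x, a2, b2)
        g1 = f1o / (f1o + f2o)  # posterior p(m=1|x) under th0
        f1, f2 = p1 * npdf(x, mu1, s1), (1 - p1) * npdf(x, mu2, s2)
        val += g1 * math.log(f1 / f1o) + (1 - g1) * math.log(f2 / f2o)
    return val

random.seed(1)
xs = [random.gauss(4, 1) for _ in range(30)] + [random.gauss(12, 2) for _ in range(30)]
th0 = (3.0, 1.5, 10.0, 2.5, 0.4)                       # "previous" parameters (arbitrary)
assert abs(Q(xs, th0, th0)) < 1e-12                    # touch point: Q(th', th') = 0
for _ in range(100):                                   # bound holds for random th
    th = (random.uniform(0, 15), random.uniform(1.0, 4.0),
          random.uniform(0, 15), random.uniform(1.0, 4.0), random.uniform(0.1, 0.9))
    assert loglik(xs, th) - loglik(xs, th0) >= Q(xs, th, th0) - 1e-9
print("bound verified")
```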
## Log-Likelihood of EM Alg.

*(Figure: log-likelihood versus iteration; it increases monotonically, from about −410 to about −375 over 100 iterations.)*
## Maximize GMM Model

$$l = \sum_{i=1}^n \log p(x_i) = \sum_{i=1}^n \log \left[ p(x_i \mid \mu_1, \sigma_1)\, p_1 + p(x_i \mid \mu_2, \sigma_2)\, p_2 \right]$$

where $p(x \mid \mu_k, \sigma_k)$ is the Gaussian density as before.

• What is the global optimal solution to GMM? Consider

$$\mu_1 = x_1, \quad \sigma_1 \rightarrow 0, \quad \mu_2 = \frac{\sum_{i=1}^n x_i}{n}, \quad \sigma_2 = 1, \quad p_1 = p_2 = 0.5$$

The first component collapses onto a single data point, and the log-likelihood grows without bound.
• Maximizing the objective function of GMM is an ill-posed problem
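The degeneracy is easy to exhibit numerically. A minimal sketch (the grid data and the fixed second component are invented for this demo): with $\mu_1 = x_1$, shrinking $\sigma_1$ makes the log-likelihood grow without bound.

```python
import math

def npdf(x, m, s):
    return math.exp(-(x - m) ** 2 / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)

xs = [5 + 0.25 * i for i in range(40)]   # toy data on a grid (demo choice)

def loglik(sigma1):
    # Degenerate candidate: mu1 = x_1 with a shrinking sigma1; the second
    # component is fixed at the sample mean with sigma2 = 1 and p1 = p2 = 0.5.
    mu1, mu2 = xs[0], sum(xs) / len(xs)
    return sum(math.log(0.5 * npdf(x, mu1, sigma1) + 0.5 * npdf(x, mu2, 1.0))
               for x in xs)

for s in (0.01, 0.001, 0.0001, 0.00001):
    print(s, round(loglik(s), 2))   # grows without bound as sigma1 -> 0
```

Each factor-of-10 shrink of $\sigma_1$ adds roughly $\log 10$ to the term for $x_1$ while the other terms stay essentially fixed, so the objective diverges.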
## Identify Hidden Variables

• For certain learning problems, identifying the appropriate hidden variables is nontrivial
• Consider a simple translation model. For a pair of English and Chinese sentences

$$\mathbf{e} = (e_1, e_2, \ldots, e_s) \;\leftrightarrow\; \mathbf{c} = (c_1, c_2, \ldots, c_l),$$

a simple translation model is

$$\Pr(\mathbf{e} \mid \mathbf{c}) = \prod_{j=1}^{s} \Pr(e_j \mid \mathbf{c}) = \prod_{j=1}^{s} \sum_{k=1}^{l} \Pr(e_j \mid c_k)$$

• The log-likelihood of the training corpus $(\mathbf{e}_1, \mathbf{c}_1), \ldots, (\mathbf{e}_n, \mathbf{c}_n)$:

$$l = \sum_{i=1}^{n} \log \Pr(\mathbf{e}_i \mid \mathbf{c}_i) = \sum_{i=1}^{n} \sum_{j=1}^{|\mathbf{e}_i|} \log \sum_{k=1}^{|\mathbf{c}_i|} \Pr(e_{i,j} \mid c_{i,k})$$
## Identify Hidden Variables

• Consider a simple case: $\mathbf{e} = (e_1, e_2)$, $\mathbf{c} = (c_1, c_2)$

$$\Pr(\mathbf{e} \mid \mathbf{c}) = \prod_{j=1}^{2} \sum_{k=1}^{2} \Pr(e_j \mid c_k) = \Pr(e_1 \mid c_1)\Pr(e_2 \mid c_1) + \Pr(e_1 \mid c_2)\Pr(e_2 \mid c_2) + \Pr(e_1 \mid c_1)\Pr(e_2 \mid c_2) + \Pr(e_1 \mid c_2)\Pr(e_2 \mid c_1)$$

• Alignment variable $a(i)$: maps a position in the English sentence to a position in the Chinese sentence
• Rewrite:

$$\Pr(\mathbf{e} \mid \mathbf{c}) = \sum_{a} \Pr(e_1 \mid c_{a(1)})\, \Pr(e_2 \mid c_{a(2)})$$
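The rewrite can be verified by brute force. A tiny sketch with invented translation probabilities, checking that the product of sums equals the sum over all alignments of products:

```python
from itertools import product

# Hypothetical translation probabilities t[(e, c)] = Pr(e | c) for a 2x2 toy case.
t = {('e1', 'c1'): 0.7, ('e1', 'c2'): 0.1,
     ('e2', 'c1'): 0.2, ('e2', 'c2'): 0.6}
es, cs = ['e1', 'e2'], ['c1', 'c2']

# Left side: product over English positions of sums over Chinese positions.
prod_of_sums = 1.0
for e in es:
    prod_of_sums *= sum(t[(e, c)] for c in cs)

# Right side: sum over all 4 alignments a of the product Pr(e_j | c_a(j)).
sum_over_alignments = sum(
    t[(es[0], a[0])] * t[(es[1], a[1])]
    for a in product(cs, repeat=2))

print(prod_of_sums, sum_over_alignments)  # both are ~0.64
```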
## EM Algorithm for A Translation Model

• Introduce an alignment variable for each translation pair:

$$(\mathbf{e}_1, \mathbf{c}_1, a_1), (\mathbf{e}_2, \mathbf{c}_2, a_2), \ldots, (\mathbf{e}_n, \mathbf{c}_n, a_n)$$

• EM algorithm for the translation model
  - E-step: compute the posterior for each alignment variable, $\Pr(a_j \mid \mathbf{e}_j, \mathbf{c}_j)$
  - M-step: estimate the translation probabilities $\Pr(e \mid c)$

$$\Pr(a \mid \mathbf{e}_j, \mathbf{c}_j) = \frac{\Pr(a, \mathbf{e}_j, \mathbf{c}_j)}{\sum_{a'} \Pr(a', \mathbf{e}_j, \mathbf{c}_j)} = \frac{\prod_{k=1}^{|\mathbf{e}_j|} \Pr(e_{j,k} \mid c_{j,a(k)})}{\sum_{a'} \prod_{k=1}^{|\mathbf{e}_j|} \Pr(e_{j,k} \mid c_{j,a'(k)})} = \frac{\prod_{k=1}^{|\mathbf{e}_j|} \Pr(e_{j,k} \mid c_{j,a(k)})}{\prod_{k=1}^{|\mathbf{e}_j|} \sum_{t=1}^{|\mathbf{c}_j|} \Pr(e_{j,k} \mid c_{j,t})}$$

We are lucky here: the normalization over all alignments factorizes. In general, this step can be extremely difficult and usually requires approximate approaches.
## Compute Pr(e|c)

• First compute the expected count contributed by the pair $(\mathbf{e}_i, \mathbf{c}_i)$:

$$\Pr(e \mid c; \mathbf{e}_i, \mathbf{c}_i) = \delta(e \in \mathbf{e}_i)\, \delta(c \in \mathbf{c}_i) \sum_{a} \Pr(a \mid \mathbf{e}_i, \mathbf{c}_i)\, \delta\big(a \text{ aligns } e \text{ to } c\big)$$

$$= \delta(e \in \mathbf{e}_i)\, \delta(c \in \mathbf{c}_i)\, \frac{\sum_{a} \Pr(a, \mathbf{e}_i, \mathbf{c}_i)\, \delta\big(a \text{ aligns } e \text{ to } c\big)}{\Pr(\mathbf{e}_i, \mathbf{c}_i)} = \delta(e \in \mathbf{e}_i)\, \delta(c \in \mathbf{c}_i)\, \frac{\Pr(e \mid c)}{\sum_{t=1}^{|\mathbf{c}_i|} \Pr(e \mid c_{i,t})}$$

• Then accumulate over all pairs:

$$\Pr(e \mid c) \;\propto\; \sum_{i=1}^{n} \Pr(e \mid c; \mathbf{e}_i, \mathbf{c}_i)$$
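The E/M loop above can be sketched for a toy corpus, in the spirit of IBM Model 1. The corpus, vocabulary, and iteration count are invented for this demo; the per-position posterior uses the factorization derived above:

```python
from collections import defaultdict

# Toy parallel corpus (invented): each pair is (english_words, chinese_words).
corpus = [(['the', 'dog'], ['C_the', 'C_dog']),
          (['the', 'cat'], ['C_the', 'C_cat'])]

# Initialize Pr(e|c) uniformly over the English vocabulary.
evocab = {e for es, _ in corpus for e in es}
t = defaultdict(lambda: 1.0 / len(evocab))

for _ in range(50):
    count = defaultdict(float)
    total = defaultdict(float)
    for es, cs in corpus:
        for e in es:
            # E-step: posterior that e aligns to each c in this pair.
            z = sum(t[(e, c)] for c in cs)
            for c in cs:
                p = t[(e, c)] / z
                count[(e, c)] += p
                total[c] += p
    # M-step: re-normalize the expected counts into Pr(e|c).
    for (e, c), n in count.items():
        t[(e, c)] = n / total[c]

print(t[('the', 'C_the')], t[('dog', 'C_dog')])  # both converge toward 1.0
```

Because `'the'` co-occurs with `C_the` in both pairs, EM gradually reassigns its mass there, which in turn disambiguates `dog` and `cat`.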
## Bound Optimization for A Translation Model

Let $\theta = \Pr(e \mid c)$ denote the parameters for the current iteration and $\theta' = \Pr'(e \mid c)$ those for the previous iteration:

$$l(\theta) = \sum_{i=1}^{n} \log \Pr(\mathbf{e}_i \mid \mathbf{c}_i; \theta) = \sum_{i=1}^{n} \sum_{j=1}^{|\mathbf{e}_i|} \log \sum_{k=1}^{|\mathbf{c}_i|} \Pr(e_{i,j} \mid c_{i,k})$$

$$Q(\theta, \theta') = l(\theta) - l(\theta') = \sum_{i=1}^{n} \sum_{j=1}^{|\mathbf{e}_i|} \log \left[ \frac{\sum_{k=1}^{|\mathbf{c}_i|} \Pr(e_{i,j} \mid c_{i,k})}{\sum_{l=1}^{|\mathbf{c}_i|} \Pr'(e_{i,j} \mid c_{i,l})} \right]$$

Applying the same concavity bound as before, with the previous-iteration posteriors as weights:

$$Q(\theta, \theta') \ge \sum_{i=1}^{n} \sum_{j=1}^{|\mathbf{e}_i|} \sum_{k=1}^{|\mathbf{c}_i|} \frac{\Pr'(e_{i,j} \mid c_{i,k})}{\sum_{l=1}^{|\mathbf{c}_i|} \Pr'(e_{i,j} \mid c_{i,l})} \log \left[ \frac{\Pr(e_{i,j} \mid c_{i,k})}{\Pr'(e_{i,j} \mid c_{i,k})} \right]$$

Maximizing this bound yields the update

$$\Pr(e \mid c) \;\propto\; \sum_{i=1}^{n} \delta(e \in \mathbf{e}_i)\, \delta(c \in \mathbf{c}_i)\, \frac{\Pr'(e \mid c)}{\sum_{t=1}^{|\mathbf{c}_i|} \Pr'(e \mid c_{i,t})}$$
## Iterative Scaling

• Maximum entropy model:

$$p(y \mid \mathbf{x}; \theta) = \frac{\exp(\mathbf{x} \cdot \mathbf{w}_y)}{\sum_{y'} \exp(\mathbf{x} \cdot \mathbf{w}_{y'})}, \qquad l(D_{train}) = \sum_{i=1}^{N} \log \frac{\exp(\mathbf{x}_i \cdot \mathbf{w}_{y_i})}{\sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}_y)}$$

• Iterative scaling assumes:
  - All features are non-negative: $x_{i,j} \ge 0$
  - The sum of the features is constant: $\sum_{j=1}^{d} x_{i,j} = g$ for every $i$
## Iterative Scaling

• Compute the empirical mean of each feature for every class: $e_{y,j} = \sum_{i=1}^{N} x_{i,j}\, \delta(y, y_i) / N$ for every $j$ and every class $y$
• Start with $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_c = 0$
• Repeat:
  - Compute $p(y \mid \mathbf{x})$ for each training data point $(\mathbf{x}_i, y_i)$ using $\mathbf{w}$ from the previous iteration
  - Compute the model mean of each feature for every class using the estimated probabilities: $m_{y,j} = \sum_{i=1}^{N} x_{i,j}\, p(y \mid \mathbf{x}_i) / N$ for every $j$ and every $y$
  - Compute $\Delta w_{y,j} = \frac{1}{g}\left(\log e_{y,j} - \log m_{y,j}\right)$ for every $j$ and every $y$
  - Update $w_{y,j} \leftarrow w_{y,j} + \Delta w_{y,j}$
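The procedure above can be sketched directly. A minimal illustration; the toy dataset, the slack feature used to make the feature sums constant, and the iteration count are choices for this demo:

```python
import math

# Toy dataset (invented): 2 features per point, plus a slack feature so that
# the feature sum is the same constant g for every point (an iterative-scaling
# requirement).
X = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
Y = [0, 0, 1, 1]
g = max(sum(x) for x in X) + 1.0          # choose g above every feature sum
X = [x + [g - sum(x)] for x in X]         # slack feature makes all sums equal g
classes = sorted(set(Y))
d = len(X[0])
w = {y: [0.0] * d for y in classes}       # start with all weights at zero

def p_y_given_x(x):
    scores = {y: math.exp(sum(xi * wi for xi, wi in zip(x, w[y]))) for y in classes}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

# Empirical feature means e[y][j]
e = {y: [sum(x[j] for x, yi in zip(X, Y) if yi == y) / len(X) for j in range(d)]
     for y in classes}

for _ in range(200):
    # Model feature means m[y][j] under the current weights
    m = {y: [0.0] * d for y in classes}
    for x in X:
        p = p_y_given_x(x)
        for y in classes:
            for j in range(d):
                m[y][j] += x[j] * p[y] / len(X)
    # Iterative-scaling update: w <- w + (1/g)(log e - log m), where defined
    for y in classes:
        for j in range(d):
            if e[y][j] > 0 and m[y][j] > 0:
                w[y][j] += (math.log(e[y][j]) - math.log(m[y][j])) / g

preds = []
for x in X:
    p = p_y_given_x(x)
    preds.append(max(p, key=p.get))
print(preds)  # recovers the training labels [0, 0, 1, 1]
```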
## Iterative Scaling

Let $\theta = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_c\}$ be the parameters for the current iteration and $\theta' = \{\mathbf{w}_1', \mathbf{w}_2', \ldots, \mathbf{w}_c'\}$ those for the last iteration.

$$l(\theta) = \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i; \theta) = \sum_{i=1}^{N} \log \frac{\exp(\mathbf{x}_i \cdot \mathbf{w}_{y_i})}{\sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}_y)}, \qquad l(\theta') = \sum_{i=1}^{N} \log \frac{\exp(\mathbf{x}_i \cdot \mathbf{w}'_{y_i})}{\sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}'_y)}$$

$$l(\theta) - l(\theta') = \sum_{i=1}^{N} \left[ \log \frac{\exp(\mathbf{x}_i \cdot \mathbf{w}_{y_i})}{\sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}_y)} - \log \frac{\exp(\mathbf{x}_i \cdot \mathbf{w}'_{y_i})}{\sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}'_y)} \right]$$
## Iterative Scaling

$$l(\theta) - l(\theta') = \sum_{i=1}^{N} \left[ \mathbf{x}_i \cdot (\mathbf{w}_{y_i} - \mathbf{w}'_{y_i}) + \log \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}'_y) - \log \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}_y) \right]$$

Can we use the concave property of the logarithm function?

No, we can't, because here we need a lower bound.
## Iterative Scaling

Use $\log x \le x - 1$, which gives $\log \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}_y) \le \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}_y) - 1$:

$$l(\theta) - l(\theta') = \sum_{i=1}^{N} \left[ \mathbf{x}_i \cdot (\mathbf{w}_{y_i} - \mathbf{w}'_{y_i}) + \log \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}'_y) - \log \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}_y) \right]$$

$$\ge \sum_{i=1}^{N} \left[ \mathbf{x}_i \cdot (\mathbf{w}_{y_i} - \mathbf{w}'_{y_i}) + \log \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}'_y) - \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}_y) + 1 \right]$$

• The weights $\mathbf{w}_y$ still couple with each other
• Still need further decomposition
## Iterative Scaling

By the convexity of the exponential, $\exp\left(\sum_i q_i p_i\right) \le \sum_i p_i \exp(q_i)$ for $p_i \ge 0$, $\sum_i p_i = 1$. Hence

$$\exp(\mathbf{x}_i \cdot \mathbf{w}_y) = \exp\left(\sum_{j=1}^{d} x_{i,j}\, w_{y,j}\right) = \exp\left(\sum_{j=1}^{d} \frac{x_{i,j}}{\sum_{k=1}^{d} x_{i,k}}\, w_{y,j} \sum_{k=1}^{d} x_{i,k}\right) \le \sum_{j=1}^{d} \frac{x_{i,j}}{\sum_{k=1}^{d} x_{i,k}} \exp\left(w_{y,j} \sum_{k=1}^{d} x_{i,k}\right) = \sum_{j=1}^{d} \frac{x_{i,j}}{g} \exp(g\, w_{y,j})$$

so

$$l(\theta) - l(\theta') \ge \sum_{i=1}^{N} \left[ \sum_{j} x_{i,j}\,(w_{y_i,j} - w'_{y_i,j}) + \log \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}'_y) - \sum_{y} \sum_{j} \frac{x_{i,j}}{g} \exp(g\, w_{y,j}) + 1 \right]$$
## Iterative Scaling

$$Q(\theta, \theta') = \sum_{i=1}^{N} \left[ \sum_{j} x_{i,j}\,(w_{y_i,j} - w'_{y_i,j}) + \log \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}'_y) - \sum_{y} \sum_{j} \frac{x_{i,j}}{g} \exp(g\, w_{y,j}) + 1 \right]$$

$$= \sum_{i=1}^{N} \log \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}'_y) + N + \sum_{i=1}^{N} \sum_{y} \sum_{j} \left[ x_{i,j}\,(w_{y,j} - w'_{y,j})\, \delta(y, y_i) - \frac{x_{i,j}}{g} \exp(g\, w_{y,j}) \right]$$

Setting the derivative to zero:

$$\frac{\partial Q(\theta, \theta')}{\partial w_{y,j}} = \sum_{i=1}^{N} \left[ x_{i,j}\, \delta(y, y_i) - x_{i,j} \exp(g\, w_{y,j}) \right] = 0 \quad \Rightarrow \quad w_{y,j} = \frac{1}{g} \log \frac{\sum_{i=1}^{N} x_{i,j}\, \delta(y, y_i)}{\sum_{i=1}^{N} x_{i,j}}$$

Wait a minute, this cannot be right! What happened?
## Logarithm Bound Algorithm

Recall the requirements on the lower bound:

$$l(\theta_1, \theta_2) \ge l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2)$$

where $Q(\theta_1, \theta_2)$ is a concave function with touch point $Q(\theta_1 = \theta_1^0, \theta_2 = \theta_2^0) = 0$, and we search for the solution that maximizes $Q(\theta_1, \theta_2)$.
## Iterative Scaling

Check the bound at the touch point $\theta = \theta'$:

$$Q(\theta, \theta') = \sum_{i=1}^{N} \log \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}'_y) + N + \sum_{i=1}^{N} \sum_{y} \sum_{j} \left[ x_{i,j}\,(w_{y,j} - w'_{y,j})\, \delta(y, y_i) - \frac{x_{i,j}}{g} \exp(g\, w_{y,j}) \right]$$

$$Q(\theta = \theta', \theta') = \sum_{i=1}^{N} \log \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}'_y) + N - \sum_{i=1}^{N} \sum_{y} \sum_{j} \frac{x_{i,j}}{g} \exp(g\, w'_{y,j}) \;\neq\; 0$$

Where does it go wrong?
## Iterative Scaling

The first attempt applied $\log x \le x - 1$ to $\sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}_y)$ alone:

$$l(\theta) - l(\theta') \ge \sum_{i=1}^{N} \left[ \mathbf{x}_i \cdot (\mathbf{w}_{y_i} - \mathbf{w}'_{y_i}) + \log \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}'_y) - \sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}_y) + 1 \right]$$

The resulting bound is not zero when $\theta = \theta'$. Instead, apply $\log x \le x - 1$ to the ratio:

$$\log \frac{\sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}_y)}{\sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}'_y)} \le \frac{\sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}_y)}{\sum_{y} \exp(\mathbf{x}_i \cdot \mathbf{w}'_y)} - 1$$
Iterative Scaling

Using the ratio bound
    log [ Σ_y exp(x_i·w_y) / Σ_y exp(x_i·w'_y) ] ≤ Σ_y exp(x_i·w_y) / Σ_y exp(x_i·w'_y) − 1

l(λ) − l(λ') ≥ Σ_{i=1..N} [ x_i·(w_{y_i} − w'_{y_i}) − Σ_y exp(x_i·w_y) / Σ_{y'} exp(x_i·w'_{y'}) + 1 ]

Define δ_y = w_y − w'_y. By the definition of the conditional exponential model,
    exp(x_i·w_y) / Σ_{y'} exp(x_i·w'_{y'}) = p(y | x_i; λ') exp(x_i·δ_y)

so
    l(λ) − l(λ') ≥ Σ_{i=1..N} [ x_i·δ_{y_i} − Σ_y p(y | x_i; λ') (exp(x_i·δ_y) − 1) ]
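As a quick sanity check, here is a small Python sketch (not from the slides; the data and all names are made up) that verifies this lower bound numerically: Q vanishes at the touch point λ = λ', and it never exceeds l(λ) − l(λ').

```python
import math, random

# Toy data: N examples, d nonnegative features, C classes (all made up).
random.seed(0)
N, d, C = 20, 4, 3
X = [[random.random() for _ in range(d)] for _ in range(N)]
y = [random.randrange(C) for _ in range(N)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loglik(W):
    # l(lambda) = sum_i [ x_i . w_{y_i} - log sum_y exp(x_i . w_y) ]
    ll = 0.0
    for xi, yi in zip(X, y):
        s = [dot(w, xi) for w in W]
        ll += s[yi] - math.log(sum(math.exp(v) for v in s))
    return ll

def Q(W, W1):
    # Q = sum_i [ x_i . delta_{y_i}
    #             - sum_y p(y | x_i; lambda') (exp(x_i . delta_y) - 1) ]
    D = [[a - b for a, b in zip(w, w1)] for w, w1 in zip(W, W1)]
    total = 0.0
    for xi, yi in zip(X, y):
        s1 = [dot(w1, xi) for w1 in W1]
        z1 = sum(math.exp(v) for v in s1)
        p1 = [math.exp(v) / z1 for v in s1]
        total += dot(D[yi], xi)
        total -= sum(p * (math.exp(dot(dy, xi)) - 1.0) for p, dy in zip(p1, D))
    return total

W0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(C)]  # lambda'
W  = [[random.gauss(0, 1) for _ in range(d)] for _ in range(C)]  # lambda
```

With D = 0 every term of Q is exactly zero, which is the touch-point condition the previous bound failed.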
Iterative Scaling

Assume Σ_{k=1..d} x_{i,k} = g (the same constant for every example). Since exp is convex and the weights x_{i,j}/g sum to one,

exp(x_i·δ_y) = exp( Σ_{j=1..d} x_{i,j} δ_{y,j} )
             = exp( Σ_{j=1..d} (x_{i,j} / Σ_k x_{i,k}) · (δ_{y,j} Σ_k x_{i,k}) )
             ≤ Σ_{j=1..d} (x_{i,j}/g) exp(g δ_{y,j})

Therefore

l(λ) − l(λ') ≥ Σ_{i=1..N} [ x_i·δ_{y_i} − Σ_y p(y | x_i; λ') (exp(x_i·δ_y) − 1) ]
             ≥ Σ_{i=1..N} Σ_j Σ_y [ x_{i,j} δ_{y,j} δ(y, y_i) − p(y | x_i; λ') (x_{i,j}/g) (exp(g δ_{y,j}) − 1) ]
Iterative Scaling

Q(λ, λ') = Σ_{i=1..N} Σ_j Σ_y [ x_{i,j} δ_{y,j} δ(y, y_i) − p(y | x_i; λ') (x_{i,j}/g) (exp(g δ_{y,j}) − 1) ]

∂Q(λ, λ')/∂δ_{y,j} = Σ_{i=1..N} [ x_{i,j} δ(y, y_i) − p(y | x_i; λ') x_{i,j} exp(g δ_{y,j}) ] = 0

⇒  δ_{y,j} = w_{y,j} − w'_{y,j} = (1/g) log [ Σ_{i=1..N} x_{i,j} δ(y, y_i) / Σ_{i=1..N} p(y | x_i; λ') x_{i,j} ]
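The closed-form update above fits in a few lines of Python. This is an illustrative toy sketch, not the original code: the data, the "filler" feature used to make every row sum to the same constant g, and all variable names are made up.

```python
import math, random

# Toy multiclass data with positive features (all made up).
random.seed(1)
N, d, C = 60, 3, 2
X = [[random.random() for _ in range(d)] for _ in range(N)]
g = math.ceil(max(sum(row) for row in X))   # common feature sum g
for row in X:
    row.append(g - sum(row))                # filler feature: now sum_j x_{i,j} == g
y = [i % C for i in range(N)]               # made-up labels

W = [[0.0] * (d + 1) for _ in range(C)]     # weights w_{y,j}

def scores(xi, W):
    return [sum(w_j * x_j for w_j, x_j in zip(wy, xi)) for wy in W]

def posteriors(W):
    P = []
    for xi in X:
        s = scores(xi, W)
        m = max(s)
        e = [math.exp(v - m) for v in s]
        z = sum(e)
        P.append([v / z for v in e])
    return P

def loglik(W):
    total = 0.0
    for xi, yi in zip(X, y):
        s = scores(xi, W)
        m = max(s)
        total += s[yi] - m - math.log(sum(math.exp(v - m) for v in s))
    return total

history = [loglik(W)]
for _ in range(40):
    P = posteriors(W)                        # p(y | x_i; lambda'), held fixed
    for c in range(C):
        for j in range(d + 1):
            num = sum(X[i][j] for i in range(N) if y[i] == c)
            den = sum(P[i][c] * X[i][j] for i in range(N))
            W[c][j] += math.log(num / den) / g   # delta_{y,j} = (1/g) log(num/den)
    history.append(loglik(W))
```

Each pass recomputes the posteriors from the current weights and then applies the closed-form δ_{y,j} update to every coordinate; by the bound argument above, the log-likelihood never decreases.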
Iterative Scaling

How about Σ_{j=1..d} x_{i,j} = g_i, i.e. the feature sum is not a constant?

exp(x_i·δ_y) = exp( Σ_{j=1..d} x_{i,j} δ_{y,j} )
             = exp( Σ_{j=1..d} (x_{i,j} / Σ_k x_{i,k}) · (δ_{y,j} Σ_k x_{i,k}) )
             ≤ Σ_{j=1..d} (x_{i,j}/g_i) exp(g_i δ_{y,j})

Q(λ, λ') = Σ_{i=1..N} Σ_j Σ_y [ x_{i,j} δ_{y,j} δ(y, y_i) − p(y | x_i; λ') (x_{i,j}/g_i) (exp(g_i δ_{y,j}) − 1) ]

∂Q(λ, λ')/∂δ_{y,j} = Σ_{i=1..N} [ x_{i,j} δ(y, y_i) − p(y | x_i; λ') x_{i,j} exp(g_i δ_{y,j}) ] = 0

Is this solution unique? (Yes: with nonnegative features the left-hand side is strictly decreasing in δ_{y,j}, so the root is unique, but it has no closed form and must be found numerically.)
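One way to solve the stationarity condition for a single coordinate is a one-dimensional bisection search. The sketch below is illustrative only: the feature values, the per-example sums g_i, the class memberships, and the posteriors p(y | x_i; λ') are all made-up stand-ins for what the algorithm would compute.

```python
import math, random

# Made-up quantities for one fixed class y and one feature j.
random.seed(2)
N = 50
x   = [random.random() for _ in range(N)]        # x_{i,j} >= 0
gi  = [1.0 + random.random() for _ in range(N)]  # g_i = sum_k x_{i,k}, not constant
mem = [i % 3 == 0 for i in range(N)]             # delta(y, y_i): does example i have label y?
p   = [random.random() for _ in range(N)]        # p(y | x_i; lambda'), fixed in this step

def dQ(delta):
    # dQ/d delta_{y,j} = sum_i x_{i,j} [ delta(y, y_i) - p(y|x_i; lambda') exp(g_i delta) ]
    return sum(xi * (mi - pi * math.exp(g * delta))
               for xi, g, mi, pi in zip(x, gi, mem, p))

# dQ is strictly decreasing in delta, so the root is unique; bisect for it.
lo, hi = -20.0, 20.0                             # dQ(lo) > 0 > dQ(hi) here
for _ in range(100):
    mid = (lo + hi) / 2.0
    if dQ(mid) > 0.0:
        lo = mid
    else:
        hi = mid
delta = (lo + hi) / 2.0
```

Because dQ is monotone, bisection (or one-dimensional Newton iterations) is guaranteed to converge to the unique root.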
Iterative Scaling

An alternative decomposition, with uniform weights 1/d:

exp(x_i·δ_y) = exp( Σ_{j=1..d} x_{i,j} δ_{y,j} )
             = exp( Σ_{j=1..d} (1/d) · (d x_{i,j} δ_{y,j}) )
             ≤ Σ_{j=1..d} (1/d) exp(d x_{i,j} δ_{y,j})

Q(λ, λ') = Σ_{i=1..N} Σ_j Σ_y [ x_{i,j} δ_{y,j} δ(y, y_i) − p(y | x_i; λ') (1/d) (exp(d x_{i,j} δ_{y,j}) − 1) ]

∂Q(λ, λ')/∂δ_{y,j} = Σ_{i=1..N} [ x_{i,j} δ(y, y_i) − p(y | x_i; λ') x_{i,j} exp(d x_{i,j} δ_{y,j}) ] = 0
Faster Iterative Scaling

• The lower bound may not be tight, since all of the coupling between the weights has been removed:

  Q(λ, λ') = Σ_{i=1..N} Σ_j Σ_y [ x_{i,j} δ_{y,j} δ(y, y_i) − p(y | x_i; λ') (x_{i,j}/g_i) (exp(g_i δ_{y,j}) − 1) ]
           = Σ_{i=1..N} Σ_j Σ_y q(δ_{y,j})          ← univariate functions!

• A tighter bound can be derived by not fully decoupling the correlation between the weights:

  Q(λ, λ') = Σ_j Σ_{i,y} [ δ(y, y_i) x_{i,j} δ_{y,j} ] − Σ_j Σ_i (x_{i,j}/g_i) log Σ_y p(y | x_i; λ') e^{δ_{y,j} g_i}
Faster Iterative Scaling

[Figure: log-likelihood vs. iteration, comparing the iterative scaling variants]

• You may feel great after the struggle of the derivation.
• However, is iterative scaling really a great idea?
• Given that there have been so many studies in optimization, we should try out existing methods.
Comparing Improved Iterative Scaling to Newton's Method

Dataset     Instances    Features
Rule        29,602       246
Lex         42,509       135,182
Summary     24,044       198,467
Shallow     8,625,782    264,142

Dataset   Method                        Iterations   Time (s)
Rule      Improved Iterative Scaling    823          42.48
Rule      Newton's Method               81           1.13
Lex       Improved Iterative Scaling    241          102.18
Lex       Newton's Method               176          20.02

Try out the standard numerical methods before you get excited!