# EXPECTATION MAXIMIZATION

Gómer González H.
Machine Learning, 2007
## 1. Maximum likelihood

Recall the definition of the maximum-likelihood estimation (MLE) problem:

- We have a data set $X = \{x_1, x_2, \ldots, x_N\}$
- Suppose the $x_i$ are i.i.d. according to a parameterized distribution $p(x \mid \theta)$
- The aim is finding the set of parameters $\theta$ that most likely produced $X$

The likelihood function is

$$L(\theta) = p(X \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$$

and we are interested in the optimization problem

$$\theta^\ast = \arg\max_{\theta} L(\theta)$$
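As a small illustration (not part of the slides), consider an assumed univariate Gaussian model, for which the MLE has a closed form: the sample mean and the sample variance maximize the log-likelihood. The function names below are hypothetical:

```python
import math

def gaussian_log_likelihood(data, mu, sigma2):
    """Log-likelihood of i.i.d. data under N(mu, sigma2)."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma2))

def gaussian_mle(data):
    """Closed-form arg max of the Gaussian log-likelihood."""
    n = len(data)
    mu = sum(data) / n
    sigma2 = sum((x - mu) ** 2 for x in data) / n
    return mu, sigma2

data = [1.2, 0.8, 1.0, 1.4, 0.6]
mu_hat, s2_hat = gaussian_mle(data)
# The MLE scores at least as high as any other parameter choice
assert gaussian_log_likelihood(data, mu_hat, s2_hat) >= \
       gaussian_log_likelihood(data, 1.5, 0.5)
```

For models without a closed-form maximizer (mixtures, HMMs), this direct optimization is exactly what becomes hard, which motivates EM below.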
For convenience, the log-likelihood is often used instead:

$$\mathcal{L}(\theta) = \log L(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta)$$

## 2. Expectation Maximization

- It is a technique for finding maximum-likelihood estimates when the data have missing values
- This time $X$ has not only observed values ($X_{obs}$) but also missing values ($X_{mis}$)

### 2.1. Working with the complete-data likelihood

Now we have $\mathcal{L}(\theta) = \log p(X_{obs}, X_{mis} \mid \theta)$, which we call the "complete-data log-likelihood"
This function is in fact a random variable, since the missing information $X_{mis}$ is unknown, random, and presumably governed by an underlying distribution.

The EM algorithm first finds the expected value of $\mathcal{L}(\theta)$ with respect to the unknown data, given the observed data and the current parameter estimates:

$$Q(\theta, \theta^i) = E[\log p(X_{obs}, X_{mis} \mid \theta) \mid X_{obs}, \theta^i]$$

Key things to understand:

- $X_{obs}$ and $\theta^i$ are "constants" in this expression
- $X_{mis}$ is a random variable, so taking the expectation makes sense
... we are trying to evaluate the likelihood, but since we have incomplete data we "fantasize" what they should be, based on the current parameter setting

... "we are filling in the missing values based on our current expectation"

Then the next step is to maximize, i.e. to find a new estimate for $\theta$:

$$\theta^{i+1} = \arg\max_{\theta} Q(\theta, \theta^i)$$
### 2.2. EM algorithm

Input: an initial estimate $\theta^0$ for the parameters

    i ← 0
    Repeat
        Compute Q(θ, θ^i)                              (this is called the E-step)
        Find a new estimate θ^{i+1} by maximizing Q    (this is called the M-step)
        i ← i + 1
    Until convergence is reached

Output: the final parameter estimate $\theta^i$

**Theorem 1.** An EM iteration does not decrease the observed-data likelihood function.
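The generic loop above can be made concrete with the classic mixture-density example mentioned later in these notes. The sketch below is a deliberately minimal instance (an assumption for illustration, not the general algorithm): a mixture of two unit-variance Gaussians with equal weights, where only the means are re-estimated. The E-step computes responsibilities (the expectations that define $Q$), and the M-step takes responsibility-weighted means:

```python
import math

def em_two_gaussians(data, mu=(-1.0, 1.0), n_iter=50):
    """EM for a mixture of two unit-variance, equal-weight Gaussians.

    Only the means are estimated; the responsibilities computed in the
    E-step are the expectations that define Q(theta, theta_i)."""
    mu1, mu2 = mu
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from component 1
        resp = []
        for x in data:
            p1 = math.exp(-0.5 * (x - mu1) ** 2)
            p2 = math.exp(-0.5 * (x - mu2) ** 2)
            resp.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted means maximize the expected
        # complete-data log-likelihood
        w = sum(resp)
        mu1 = sum(r * x for r, x in zip(resp, data)) / w
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / (len(data) - w)
    return mu1, mu2

data = [-2.1, -1.9, -2.0, 1.9, 2.1, 2.0]
m1, m2 = em_two_gaussians(data)
```

On this toy data the means converge near the two cluster centers, and by Theorem 1 each iteration can only improve (or keep) the observed-data likelihood.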
**Proof.** Let $Y$ range over the possible values of the missing data:

$$\begin{aligned}
\log p(X_{obs} \mid \theta) &= \log \sum_Y p(X_{obs}, Y \mid \theta) \\
&= \log \sum_Y P(Y \mid X_{obs}, \theta^n)\, \frac{p(X_{obs}, Y \mid \theta)}{P(Y \mid X_{obs}, \theta^n)} \\
&= \log E_Y\!\left[ \frac{p(X_{obs}, Y \mid \theta)}{P(Y \mid X_{obs}, \theta^n)} \,\Big|\, X_{obs}, \theta^n \right]
\end{aligned}$$

In the last step we used the definition of expectation. Now we apply Jensen's inequality: if $f$ is a convex function and $X$ a random variable, then $E[f(X)] \geq f(E[X])$. Since $\log$ is concave, the inequality reverses, giving $\log E[X] \geq E[\log X]$.
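The direction of the inequality for $\log$ is easy to sanity-check numerically; the toy distribution below is an assumption chosen only for illustration:

```python
import math

# A toy discrete random variable: values with their probabilities
values = [1.0, 2.0, 8.0]
probs = [0.5, 0.3, 0.2]

e_x = sum(p * x for p, x in zip(probs, values))            # E[X]
e_log_x = sum(p * math.log(x) for p, x in zip(probs, values))  # E[log X]

# log is concave, so log E[X] >= E[log X] -- the direction used in the proof
assert math.log(e_x) >= e_log_x
```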
$$\begin{aligned}
\log p(X_{obs} \mid \theta) &\geq E_Y\!\left[ \log \frac{p(X_{obs}, Y \mid \theta)}{P(Y \mid X_{obs}, \theta^n)} \,\Big|\, X_{obs}, \theta^n \right] \\
&= \sum_Y P(Y \mid X_{obs}, \theta^n) \log \frac{p(X_{obs}, Y \mid \theta)}{P(Y \mid X_{obs}, \theta^n)} \\
&= \sum_Y P(Y \mid X_{obs}, \theta^n) \log p(X_{obs}, Y \mid \theta) - \sum_Y P(Y \mid X_{obs}, \theta^n) \log P(Y \mid X_{obs}, \theta^n) \\
&= E_Y[\log p(X_{obs}, Y \mid \theta) \mid X_{obs}, \theta^n] - E_Y[\log P(Y \mid X_{obs}, \theta^n) \mid X_{obs}, \theta^n] \\
&= Q(\theta; \theta^n) - R(\theta^n; \theta^n)
\end{aligned}$$
Now,

$$\begin{aligned}
\log p(X_{obs} \mid \theta^n) &= \log p(X_{obs} \mid \theta^n) \sum_Y P(Y \mid X_{obs}, \theta^n) \\
&= \sum_Y P(Y \mid X_{obs}, \theta^n) \log p(X_{obs} \mid \theta^n) \\
&= \sum_Y P(Y \mid X_{obs}, \theta^n) \log \frac{p(X_{obs}, Y \mid \theta^n)}{P(Y \mid X_{obs}, \theta^n)} \\
&= \sum_Y P(Y \mid X_{obs}, \theta^n) \log p(X_{obs}, Y \mid \theta^n) - \sum_Y P(Y \mid X_{obs}, \theta^n) \log P(Y \mid X_{obs}, \theta^n) \\
&= E_Y[\log p(X_{obs}, Y \mid \theta^n) \mid X_{obs}, \theta^n] - E_Y[\log P(Y \mid X_{obs}, \theta^n) \mid X_{obs}, \theta^n] \\
&= Q(\theta^n; \theta^n) - R(\theta^n; \theta^n)
\end{aligned}$$

so the bound above holds with equality at $\theta = \theta^n$.
It is clear that if $\theta^{n+1} = \arg\max_{\theta} Q(\theta; \theta^n)$, then $Q(\theta^{n+1}; \theta^n) \geq Q(\theta^n; \theta^n)$, so

$$\begin{aligned}
\log p(X_{obs} \mid \theta^{n+1}) &\geq Q(\theta^{n+1}; \theta^n) - R(\theta^n; \theta^n) \\
&\geq Q(\theta^n; \theta^n) - R(\theta^n; \theta^n) \\
&= \log p(X_{obs} \mid \theta^n)
\end{aligned}$$

### 2.3. Generalized EM

- The M-step can be hard... just try to improve $Q$
- Find $\theta^{i+1}$ such that $Q(\theta^{i+1}; \theta^i) > Q(\theta^i; \theta^i)$
- Convergence of GEM is slower than that of traditional EM
## 3. Properties of EM

- In general, EM converges to a local maximum or saddle point of the observed-data log-likelihood
- To escape from local maxima, a restarting technique can be used; simulated annealing has also been applied
- The number of iterations required for convergence is undetermined
- EM is particularly useful when maximum-likelihood estimation of the complete-data model is easy
- Its robustness to noise has been proven
- It may produce inaccurate results with high-dimensional datasets
- Expectation maximization describes a class of related algorithms, not a specific algorithm
  - EM is a recipe or meta-algorithm used to devise particular algorithms
  - The Baum-Welch (BW) algorithm is an example of an EM algorithm applied to hidden Markov models
  - Another example is the EM algorithm for fitting a mixture density model
- Most methods for MLE require the evaluation of first and/or second derivatives of the likelihood function; for EM, derivatives are not mandatory unless we require closed-form formulae
## 4. EM Applied

### 4.1. Fields of application

- Computational vision (labelling regions in images: types of tissue)
- Computational biology (inference of phylogenetic trees)
- Data mining (data-warehouse analysis)
- Finance (risk management and portfolio selection)
### 4.2. The BW algorithm for Hidden Markov Models

Recall the elements of an HMM:

- $N$ states. The state at time $t$ is $q_t \in \{1, 2, \ldots, N\}$
- $\pi$ is a distribution for the initial state:

  $$\pi_i = p(q_1 = i)$$

- $A$ is the matrix of transition probabilities:

  $$a_{ij} = p(q_t = j \mid q_{t-1} = i)$$

- $V = \{v_1, \ldots, v_L\}$ is an alphabet of symbols, and a particular observation of $T$ symbols is $O = (o_1, \ldots, o_T)$, where $o_t \in V$
- $B$ gives the probability distribution of an observation:

  $$b_j(k) = p(o_t = v_k \mid q_t = j)$$
Recall from the forward-backward procedure:

- The probability of observing the partial sequence $o_1, o_2, \ldots, o_t$ and ending up in state $j$ at step $t$ is

  $$\alpha_j(t) = p(o_1, \ldots, o_t, q_t = j \mid \theta)$$

- The probability of observing the partial sequence $o_{t+1}, \ldots, o_T$ given that the chain is in state $j$ at step $t$ is

  $$\beta_j(t) = p(o_{t+1}, \ldots, o_T \mid q_t = j, \theta)$$

- $p(O, q_t = j \mid \theta) = \alpha_j(t)\,\beta_j(t)$
- $p(O, q_{t-1} = j, q_t = k \mid \theta) = \alpha_j(t-1)\,a_{jk}\,b_k(o_t)\,\beta_k(t)$
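The forward and backward recursions can be sketched in a few lines of pure Python; the 2-state, 2-symbol model below uses made-up numbers for illustration only. The final assertion checks the identity $p(O, q_t = j \mid \theta) = \alpha_j(t)\,\beta_j(t)$ at every step:

```python
def forward(pi, A, B, obs):
    """alpha[t][j] = p(o_1..o_t, q_t = j); states and time are 0-indexed."""
    N = len(pi)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha

def backward(A, B, obs, N):
    """beta[t][j] = p(o_{t+1}..o_T | q_t = j); beta at the last step is 1."""
    T = len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[j][k] * B[k][obs[t+1]] * beta[t+1][k] for k in range(N))
                   for j in range(N)]
    return beta

# A hypothetical 2-state, 2-symbol HMM (numbers are assumptions for illustration)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]

alpha = forward(pi, A, B, obs)
beta = backward(A, B, obs, len(pi))
p_obs = sum(alpha[-1])  # p(O | theta) = sum_j alpha_j(T)
# alpha_j(t) * beta_j(t) = p(O, q_t = j | theta), so summing over j gives p(O)
for t in range(len(obs)):
    assert abs(sum(a * b for a, b in zip(alpha[t], beta[t])) - p_obs) < 1e-12
```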
E-step:

$$\begin{aligned}
Q(\theta, \theta^i) &= E[\log p(X_{obs}, X_{mis} \mid \theta) \mid X_{obs}, \theta^i] \\
&= \sum_q \log P(O, q \mid \theta)\, P(q \mid O, \theta^i) \\
&= \sum_q \log P(O, q \mid \theta)\, \frac{P(q, O \mid \theta^i)}{P(O \mid \theta^i)} \\
&\propto \sum_q \log P(O, q \mid \theta)\, P(O, q \mid \theta^i)
\end{aligned}$$

since

$$\begin{aligned}
P(O, q \mid \theta) &= \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t) \\
&= \pi_{q_1} \prod_{t=1}^{T} b_{q_t}(o_t) \prod_{t=2}^{T} a_{q_{t-1} q_t}
\end{aligned}$$
$$\begin{aligned}
Q(\theta, \theta^i) ={}& \sum_q \log \pi_{q_1}\, P(O, q \mid \theta^i) \;+ \\
& \sum_q \left( \sum_{t=1}^{T} \log b_{q_t}(o_t) \right) P(O, q \mid \theta^i) \;+ \\
& \sum_q \left( \sum_{t=2}^{T} \log a_{q_{t-1} q_t} \right) P(O, q \mid \theta^i)
\end{aligned}$$

... the parameters we wish to optimize are split across the three terms, so we can optimize each term individually.
M-step:

For the first term,

$$\begin{aligned}
\sum_q \log \pi_{q_1}\, P(O, q \mid \theta^i)
&= \sum_{q_1} \cdots \sum_{q_T} \log \pi_{q_1}\, P(O, q_1, \ldots, q_T \mid \theta^i) \\
&= \sum_{q_1} \log \pi_{q_1} \sum_{q_2} \cdots \sum_{q_T} P(O, q_1, \ldots, q_T \mid \theta^i) \\
&= \sum_{q_1} \log \pi_{q_1}\, p(O, q_1 \mid \theta^i) \\
&= \sum_{j=1}^{N} \log \pi_j\, p(O, q_1 = j \mid \theta^i)
\end{aligned}$$

which by optimization (subject to $\sum_j \pi_j = 1$) yields
$$\pi_j' = \frac{p(O, q_1 = j \mid \theta^i)}{\sum_{j=1}^{N} p(O, q_1 = j \mid \theta^i)}$$

For the second term,

$$\begin{aligned}
\sum_q \left( \sum_{t=1}^{T} \log b_{q_t}(o_t) \right) P(O, q \mid \theta^i)
&= \sum_{q_1} \cdots \sum_{q_T} \left( \sum_{t=1}^{T} \log b_{q_t}(o_t) \right) P(O, q_1, \ldots, q_T \mid \theta^i) \\
&= \sum_{q_1} \cdots \sum_{q_T} \sum_{t=1}^{T} \log b_{q_t}(o_t)\, P(O, q_1, \ldots, q_T \mid \theta^i)
\end{aligned}$$
$$\begin{aligned}
={}& \sum_{q_1} \cdots \sum_{q_T} \log b_{q_1}(o_1)\, P(O, q_1, \ldots, q_T \mid \theta^i) \;+ \\
& \sum_{q_1} \cdots \sum_{q_T} \log b_{q_2}(o_2)\, P(O, q_1, \ldots, q_T \mid \theta^i) + \cdots \\
={}& \sum_{q_1} \log b_{q_1}(o_1) \sum_{q_2} \cdots \sum_{q_T} P(O, q_1, \ldots, q_T \mid \theta^i) \;+ \\
& \sum_{q_2} \log b_{q_2}(o_2) \sum_{q_1} \sum_{q_3} \cdots \sum_{q_T} P(O, q_1, \ldots, q_T \mid \theta^i) + \cdots \\
={}& \sum_{j=1}^{N} \log b_j(o_1)\, P(O, q_1 = j \mid \theta^i) + \sum_{j=1}^{N} \log b_j(o_2)\, P(O, q_2 = j \mid \theta^i) + \cdots \\
={}& \sum_{j=1}^{N} \sum_{t=1}^{T} \log b_j(o_t)\, P(O, q_t = j \mid \theta^i)
\end{aligned}$$
which by optimization yields

$$b_j'(k) = \frac{\sum_{t=1}^{T} p(O, q_t = j \mid \theta^i)\, \delta_{o_t, v_k}}{\sum_{t=1}^{T} p(O, q_t = j \mid \theta^i)}$$

where $\delta_{o_t, v_k} = 1$ if $o_t = v_k$ and $0$ otherwise.

For the third term a similar process applies:

$$\sum_q \left( \sum_{t=2}^{T} \log a_{q_{t-1} q_t} \right) P(O, q \mid \theta^i)
= \sum_{j=1}^{N} \sum_{k=1}^{N} \sum_{t=2}^{T} \log a_{jk}\, P(O, q_{t-1} = j, q_t = k \mid \theta^i)$$

which by optimization yields

$$a_{jk}' = \frac{\sum_{t=2}^{T} p(O, q_{t-1} = j, q_t = k \mid \theta^i)}{\sum_{t=2}^{T} p(O, q_{t-1} = j \mid \theta^i)}$$
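The three update formulas for $\pi'$, $b'$, and $a'$ can be sketched as a single Baum-Welch re-estimation pass, expressing the joint probabilities through $\alpha$ and $\beta$ as in the forward-backward identities. The toy model numbers below are assumptions for illustration, not from the slides:

```python
def bw_step(pi, A, B, obs):
    """One Baum-Welch re-estimation of (pi, A, B) for a single sequence."""
    N, T, L = len(pi), len(obs), len(B[0])
    # Forward recursion: alpha[t][j] = p(o_1..o_t, q_t = j)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    # Backward recursion: beta[t][j] = p(o_{t+1}..o_T | q_t = j)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[j][k] * B[k][obs[t+1]] * beta[t+1][k] for k in range(N))
                   for j in range(N)]
    p_obs = sum(alpha[-1])
    # gamma[t][j] = p(q_t = j | O, theta) = alpha_j(t) * beta_j(t) / p(O)
    gamma = [[alpha[t][j] * beta[t][j] / p_obs for j in range(N)] for t in range(T)]
    # pi'_j: posterior of the initial state
    new_pi = gamma[0][:]
    # a'_{jk}: expected j->k transitions over expected visits to j
    new_A = [[sum(alpha[t-1][j] * A[j][k] * B[k][obs[t]] * beta[t][k]
                  for t in range(1, T))
              / (p_obs * sum(gamma[t][j] for t in range(T - 1)))
              for k in range(N)]
             for j in range(N)]
    # b'_j(k): expected emissions of v_k from j over expected visits to j
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k)
              / sum(gamma[t][j] for t in range(T))
              for k in range(L)]
             for j in range(N)]
    return new_pi, new_A, new_B

# Hypothetical 2-state, 2-symbol model (illustration only)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
new_pi, new_A, new_B = bw_step(pi, A, B, obs)
```

By construction, the re-estimated $\pi'$, each row of $A'$, and each row of $B'$ are proper probability distributions, mirroring the normalizing denominators in the formulas above.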
### 4.3. Remarks

- Experience has shown that a uniform distribution is adequate for the initial parameter estimates, especially for $\pi$ and $A$
- Good initial estimates for $B$ are useful when $V$ is discrete and essential when $V$ is continuous
- In bioinformatics, a very common distribution for transition and emission probabilities is the Dirichlet
- One of the most efficient versions of the Baum-Welch algorithm, the checkpointing algorithm, runs in $O(N \log T)$ memory and $O(N^2 T \log T)$ time
- EM is not the only way to train HMMs; genetic algorithms have also been applied successfully. They permit escaping from local optima and finding the optimal number of states $N$
## 5. References

- Jeff A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Department of Electrical Engineering and Computer Science, U.C. Berkeley. TR-97-021, April 1998.
- A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1, 1977.
