VIEWS: 49 PAGES: 8 POSTED ON: 2/3/2011
Optimization with EM and Expectation-Conjugate-Gradient Ruslan Salakhutdinov rsalakhu@cs.toronto.edu Sam Roweis roweis@cs.toronto.edu Department of Computer Science, University of Toronto 6 King’s College Rd, M5S 3G4, Canada Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, University College London 17 Queen Square, London WC1N 3AR, UK Abstract In spite of tremendous success of the EM algorithm in practice due to its simplicity and fast initial progress, We show a close relationship between the some authors [8] have argued that the speed of EM Expectation - Maximization (EM) algorithm convergence can be extremely slow, and that more and direct optimization algorithms such as complicated second-order methods should generally be gradient-based methods for parameter learn- favored to EM. Many methods have been proposed ing. We identify analytic conditions under to enhance the convergence speed of the EM algo- which EM exhibits Quasi-Newton behavior, rithm, mostly based on conventional optimization the- and conditions under which it possesses poor, ory [6, 7]. Several authors [8, 1] have also proposed hy- ﬁrst-order convergence. Based on this analy- brid approaches for ML learning, advocating switching sis, we propose two novel algorithms for max- to a Newton or Quasi-Newton method after perform- imum likelihood estimation of latent variable ing several EM iterations. All of these approaches, al- models, and report empirical results show- though sometimes successful in terms of convergence, ing that, as predicted by the theory, the pro- are much more complex than EM, and diﬃcult to an- posed new algorithms can substantially out- alyze; thus they have not been popular in practice. perform standard EM in terms of speed of convergence in certain cases. Our goal in this paper is to contrast the EM algorithm with a direct gradient-based optimization approach. As a concrete alternative, we present an Expectation- Conjugate-Gradient (ECG) algorithm for maximum 1. Introduction likelihood estimation in latent variable models, and The problem of Maximum Likelihood (ML) parameter show that it can outperform EM in terms of conver- estimation for latent variable models is an important gence in certain cases. However, in other cases the problem in the area of machine learning and pattern performance of EM is superior. To understand these recognition. ML learning with unobserved quantities behaviours, we study the convergence properties of the arises in many probabilistic models such as density EM algorithm and identify analytic conditions under estimation, dimensionality reduction, or classiﬁcation, which EM algorithm exhibits Quasi-Newton conver- and generally reduces to a relatively hard optimization gence behavior, and conditions under which it pos- problem in terms of the model parameters after the sesses extremely poor, ﬁrst-order convergence. Based hidden quantities have been integrated out. on this analysis, we introduce a simple hybrid EM- A common technique for ML estimation of model ECG algorithm that switches between EM and ECG parameters in the presence of latent variables is based on estimated quantities suggested by our anal- Expectation-Maximization (EM) algorithm [3]. The ysis. We report empirical results on synthetic as well EM algorithm alternates between estimating the un- as real-world data sets, showing that, as predicted by observed variables given the current model and reﬁt- the theory, this simple algorithm almost never per- ting the model given the estimated, complete data. As forms worse than standard EM and can substantially such it takes discrete steps in parameter space similar outperform EM’s convergence. to to ﬁrst order method operating on the gradient of a locally reshaped likelihood function. 2. Linear and Newton Convergence of M-step has a single unique solution.2 Expectation Maximization The important consequence of the above analysis is We ﬁrst focus on the analysis of the convergence that (when the expected complete log-likelihood func- properties of the Expectation-Maximization (EM) tion has a unique optimum), EM has the appealing algorithm. Consider a probabilistic model of ob- quality of always taking a step Θ(t+1) −Θt having pos- served data x which uses latent variables y. The itive projection onto the true gradient of the objective log-likelihood (objective function) can be written as function L(Θt ). This makes EM similar to a ﬁrst order a diﬀerence between expected complete log-likelihood method operating on the gradient of a locally reshaped and negative entropy terms: likelihood function. L(Θ) = ln p(x|Θ) = p(y|x, Ψ) ln p(x|Θ)dy = For maximum likelihood learning of a mixture of Gaus- p(y|x, Ψ) ln p(x, y|Θ)dy − p(y|x, Ψ) ln p(y|x, Θ)dy sians model using the EM-algorithm, this positive def- = Q(Θ, Ψ) − H(Θ, Ψ) inite transformation matrix P (Θt ) was ﬁrst described by Xu and Jordan[15]. We extended their results by The EM algorithm implicitly deﬁnes a mapping: deriving the explicit form for the transformation ma- M : Θ → Θ from parameter space to itself, such that trix for several other latent variables models such as Θt+1 = M (Θt ). If iterates of Θt converge to Θ∗ (and Factor Analysis (FA), Probabilistic Principal Compo- M (Θ) is continuous), then Θ∗ = M (Θ∗ ), and in the nent Analysis (PPCA), mixture of PPCAs, mixture of neighbourhood of Θ∗ , by a Taylor series expansion: FAs, and Hidden Markov Models [10].3 Θt+1 − Θ∗ = M (Θ∗ )(Θt − Θ∗ ) (1) We can further study the structure of the transfor- mation matrix P (Θt ) and relate it to the convergence where M (Θ∗ ) = ∂M |Θ=Θ∗ . Therefore EM is essen- ∂Θ rate matrix M . Taking the negative derivatives of tially a linear iteration algorithm with a convergence both sides of (2) with respect to Θt , we have: rate matrix M (Θ∗ ) (which is typically nonzero). I − M (Θt ) = −P (Θt ) L (Θ t ) − P (Θt )S(Θt ) (4) For most objective functions, the EM step Θ(t+1) −Θ(t) ∂ 2 L(Θ) in parameter space and true gradient can be related by where S(Θt ) = ∂Θ2 |Θ=Θ is the Hessian t of the a transformation matrix P (Θt ), that changes at each t ∂Θt+1 objective function, Mij (Θ ) = ∂Θt is the i input- iteration: j output derivative matrix for the EM mapping and Θ(t+1) − Θ(t) = P (Θt ) L (Θ t ) (2) P (Θt ) = ∂P (Θ) |Θ=Θt is the tensor derivative of P (Θt ) ∂Θ with respect to Θt . In ”ﬂat” regions of L(Θ), where (We deﬁne L (Θt ) = ∂L(Θ) |Θ=Θt .) Under certain t ∂Θ L (Θ) approaches zero (and P (Θ ) does not become conditions, the transformation matrix P (Θt ) is inﬁnite), the ﬁrst term on the RHS of equation (4) guaranteed to be positive deﬁnite with respect to the becomes much smaller than the second term, and the gradient.1 In particular, if transformation matrix becomes a rescaled version of C1: Expected complete log-likelihood Q(Θ, Θt ) is well- the negative inverse Hessian: −1 deﬁned, and diﬀerentiable everywhere in Θ. and P (Θt ) ≈ I − M (Θt ) − S(Θt ) (5) C2: For any ﬁxed Θt = Θ(t+1) , Q(Θ, Θt ) has only a single critical point along any direction in Θ, located In particular, if the EM algorithm iterates converge to at the maximum Θ = Θt+1 ; then a local optima at Θ∗ , then near this point (i.e. for suﬃ- t L (Θ )P (Θt ) L (Θ t ) > 0 ∀Θt (3) ciently large t) EM may exhibit Quasi-Newton conver- gence behavior. This is also true in “plateau” regions The second condition C2 may seem very strong. How- where the gradient is very small even if they are not ever, for the EM algorithm C2 is satisﬁed whenever the near a local optimum. 1 Note that Q (Θ t )(Θ(t+1) − Θt ), where Q (Θ t ) = The nature of the Quasi-Newton behavior is controlled t ∂Q(Θ,Θ ) ∂Θ |Θ=Θt is the directional derivative of function by the eigenvalues of the matrix M (Θt ). If all eigen- Q(Θ, Θ ) in the direction of Θ(t+1) − Θt . t C1 and values tend to zero, then EM becomes a true Newton C2 together imply that this quantity is positive, oth- 2 In particular C2 holds for any exponential family erwise by the Mean Value Theorem (C1) Q(Θ, Θt ) model, due to the well-known convexity property of would have a critical point along some direction, lo- Q(Θ, Θt ) for these models. cated at a point other than Θt+1 (C2). By us- 3 t ∂Q(Θ,Θt ) We also derived the general form of the transformation ing the identity L (Θ ) = ∂Θ |Θ=Θt , we have matrix for the exponential family models in term of their t t t t (t+1) L (Θ )P (Θ ) L (Θ ) = Q (Θ )(Θ − Θt ) > 0. natural parameters. 10 6 GRADIENT 8.4 EM 5 NEWTON 5 4 8.0 µ1 µ2 3 0 2 8.0 −5 1 7.7 0 9.1 −10 −9 −6 −3 0 3 6 9 −10 −5 0 5 10 −6 −5 −4 −3 −2 −1 0 10 1.5 GRADIENT 9.0 NEWTON EM 1 5 0.5 8.7 0 0 µ µ 1 2 8.3 −0.5 −5 8.5 −1 9.1 8.0 −10 −1.5 −6 −3 0 3 6 −10 −5 0 5 10 −1.5 −1 −0.5 0 0.5 1 1.5 Figure 1. Contour plots of the likelihood function L(Θ) for MoG examples using well-separated (upper panels) and not- well-separated (lower panels) one-dimensional data sets. Axes correspond to the two means. The dashdot line shows the direction of the true gradient L (Θ), the solid line shows the direction of P (Θ) L (Θ) and the dashed line shows the direction of (−S)−1 L (Θ). Right panels are blowups of dashed regions on the left. The numbers indicate the log of the l2 norm of the gradient. Note that for the ”well-separated” case, in the vicinity of the maximum, vectors P (Θ) L (Θ) and (−S)−1 L (Θ) become identical. method, rescaling the gradient by exactly the nega- case of ﬁtting a mixture of Gaussians model to well- tive inverse Hessian. As the eigenvalues tend to unity, clustered data – for which EM exhibits Quasi-Newton EM takes smaller and smaller stepsizes, giving poor, convergence – and not-well-clustered data, for which ﬁrst-order, convergence. EM is slow. As we will see from the empirical results of the later sections, many other models also show this Dempster, Laird, and Rubin [3] showed that if EM same eﬀect. For example, when Hidden Markov Mod- iterates converge to Θ∗ , then −1 els or Aggregate Markov Models [11] are trained on ∂M (Θ) ∂ 2 H(Θ,Θ∗ ) ∂ 2 Q(Θ,Θ∗ ) very structured sequences, EM exhibits Quasi-Newton ∂Θ |Θ=Θ = |Θ=Θ∗ |Θ=Θ∗ ∗ ∂Θ2 ∂Θ2 behavior, in particular when the state transition ma- which can be interpreted as the ratio of missing trix is sparse and the output distributions are almost information to the complete information near the deterministic at each state. local optimum. Thus, in the neighbourhood of a The above analysis and experiments motivates the use solution (for suﬃciently large t): of alternative optimization techniques in the regime −1 −1 ∂2H ∂2Q where missing information is high and EM is likely P (Θt ) ≈ I − |Θ=Θt − S(Θt ) ∂Θ2 ∂Θ2 to perform poorly. In the following section, we an- (6) alyze exactly such an alternative, the Expectation- This formulation of the EM algorithm has a very in- Conjugate Gradient (ECG) algorithm, a simple direct teresting interpretation which is applicable to any la- optimization method for learning the parameters of tent variable model: When the missing information is latent variables models. small compared to the complete information, EM ex- hibits Quasi-Newton behavior and enjoys fast, typically 3. Expectation-Conjugate-Gradient superlinear convergence in the neighborhood of Θ∗ . If (ECG) Algorithm fraction of missing information approaches unity, the eigenvalues of the ﬁrst term (6) approach zero and EM The key idea of the ECG algorithm is to note will exhibit extremely slow convergence. that if we can easily compute the derivative ∂ ∂Θ ln p(x, z|Θ) of the complete log likelihood, Figure 1 illustrates the above results in the simple then knowing the posterior p(z|x, Θ) we can com- oped to escape from such conﬁgurations. It turns out pute the exact gradient L (Θ). In particular: that direct optimization methods such as ECG may ∂ L (Θ) = z p(z|x, Θ) ∂Θ log p(x, z|Θ)dz. This exact also avoid this problem because of the nonlocal nature gradient can then be utilized in any standard manner, of the line search. In many of our experiments, ECG for example to do gradient (as)descent or to control actually converges to a better local optimum than EM; a line search technique. (Note that if one can derive ﬁgure 2 illustrates exactly such case. exact EM for a latent variable model, then one can always derive ECG by computing the above integral 4. Hybrid EM-ECG Algorithm over hidden variables.) As an example, we describe a conjugate gradient algorithm: As we have seen, the relative performance of EM ver- sus direct optimization depends on the missing infor- Expectation-Conjugate-Gradient algorithm: mation ratio for the given model and data set. The key Apply a conjugate gradient optimizer to L(Θ), perform- to practical speedups is the ability to design a hybrid ing an “EG” step whenever the value or gradient of algorithm that can estimate the local missing informa- L(Θ) is requested (e.g. during a line search). tion ratio M (Θt ) to detect whether to use EM or a The gradient computation is given by direct approach such as ECG. Some authors have at- • E-Step: Compute posterior p( | , Θt ) and log- ¡ tacked this problem by ﬁnding the top eigenvector of likelihood L(Θ) as normal. ∂M (Θ) t ∗ ∂Θ |Θ=Θ as Θ approaches Θ using conventional t t • G-Step: L (Θ )= p( | , Θt ) ∂Θ log p( , |Θ)d ∂ ¡ ¡ numerical methods, such as ﬁnite-diﬀerence approxi- mations, or power methods [4]. These approaches are When certain parameters must obey positivity or lin- computationally intensive and diﬃcult to implement, ear constraints, we can either modify our optimizer and thus they have not been popular in practice. to respect the constraints, or we can reparameterize to allow unconstrained optimization. In our exper- Here, we propose using the entropy of the posterior iments, we use simple reparameterizations of model over hidden variables, which can be computed after parameters that allow our optimizers to work with ar- performing an E-step, as a crude estimate of the bitrary values. For example, in the MoG model we local missing information ratio. This entropy has use a “softmax” parameterization of the mixing coef- a natural interpretation as the uncertainty about ﬁcients πi = exp (γi ) , for covariance matrices to missing information, and thus can serve as a guide M j=1 exp (γj ) between switching regimes of EM and ECG. For many be symmetric positive deﬁnite, we use the Choleski de- models with discrete hidden variables this quantity is composition (or log variances for diagonal covariance quite easy to compute. In particular, we deﬁne the matrices). In HMMs, we reparameterize probabilities Normalized Entropy term: via softmax functions as well. ¯ N M Ht = N −1M n i p(z = i|xn , Θt ) ln p(z = i|xn , Θt ) ln 2 with z being discrete hidden variable taking on Θ 2.5 0.01 ECG 2 M values, and N observed data vectors xn . We Log−Likelihood + Const 0.005 * 1.5 ΘECG 1 0 EM now simply switch between EM and ECG based on 0.5 Line Search −0.005 thresholding this quantity: −0.01 0 −0.5 Θ* EM −0.015 Hybrid EM-ECG algorithm: Gene Sequence: −1 EM −0.02 ATGCCGGAGGCCC ... −1.5 −1.5 −1 −0.5 0 0.5 1 1.5 2 Θ 0 200 400 600 800 1000 ¯ • Perform EM iterations, evaluating Ht after each 1 E−Steps E-step ¯ • If Ht ≥ τ Thena Switch to ECG Figure 2. Left panel illustrates why ECG may converge to ¯ • Perform ECG, evaluating Ht at the end of each a better local optimum. Right panel displays the learn- ing curves for EM and ECG algorithms of training fully line search ¯ • If Ht < τ Then Switch back to EM connected 7-state HMM to model human DNA sequences. • Exit at either phase IF: Both algorithms started from the same initial parameter values, but converged to the diﬀerent local optimum. 1. (L(Θt ) − L(Θt−1 ))/abs(L(Θt )) < tol OR 2. t > Tmax Of course, the choice of initial conditions is very im- a We are near the optimum or in plateau region with portant for the EM algorithm or for ECG. Since EM high entropy is based on optimizing a convex lower bound on the likelihood, once EM is trapped in a poor basin of at- As we will see from the experimental results, this sim- traction, it can never ﬁnd a better local optimum. Al- ple hybrid EM-ECG algorithm performs no worse, and gorithms such as split and merge EM [13] were devel- often far better than either EM or ECG. −3 x 10 0.01 EM−ECG 0 EM−ECG 0 EM−ECG 0 EM EM ECG −0.005 −5 −0.01 ECG EM 6 ECG −0.02 −0.01 −10 4 −0.03 −0.015 2 −0.04 −15 −0.02 0 Unstructured −0.05 −5 −3 −1 1 Sequence: CEDBCEDDAC ... −20 −0.025 0 100 200 300 400 500 600 0 50 100 150 200 250 300 0 20 40 60 80 100 120 140 160 180 0.2 EM−ECG / EM EM−ECG / EM 0.05 0 0 EM−ECG / EM ECG ECG 0 −5 −0.2 −0.05 Log−Likelihood + Const −10 −0.4 −0.1 ... −0.15 −15 −0.6 ECG 2 −0.2 −0.8 −20 −0.25 0 −1 −0.3 −25 −2 −1.2 Structured −0.35 −30 Sequence: AEABCDEAEABCDE... −0.4 −2 0 2 −1.4 E−Steps −35 −0.45 0 10 20 30 40 50 0 5 10 15 20 25 30 35 0 10 20 30 40 50 Figure 3. Learning curves for ECG, EM-ECG, and EM algorithms, showing superior (upper panels) and inferior (lower panels) performance of ECG under diﬀerent conditions for three models: MoG (left), HMM (middle), and Aggregate Markov Models (right). The number of E-steps taken by either algorithm is shown on the horizontal axis, and log likelihood is shown on the vertical axis. For ECG and EM-ECG, diamonds indicate the maximum of each line search, and the zero-level likelihood corresponds to the converging point of the EM algorithm. The bottom panels use “well- separated”, or “structured” data for which EM possesses Quasi-Newton convergence behavior. All models in this case converge in 10-15 iterations with stopping criterion: [L(Θt+1 ) − L(Θt )]/abs(L(Θt+1 )) < 10−15 . The upper panels use “overlapping”, “aliased”, or “unstructured” data for which proposed algorithms performs much better. 5. Experimental Results the data is “well-separated” into distinct clusters and We now present empirical results comparing the per- another “not-well-separated” case in which the data formance of EM, ECG, and hybrid EM-ECG for learn- overlaps in one contiguous region. Figure 3 shows that ing the parameters of three latent variable models: ECG and Hybrid EM-ECG outperform standard EM Mixtures of Gaussians (MoG), Hidden Markov Mod- in the poorly separated cases. For the well-separated els (HMM), and Aggregate Markov Models. In many case, the hybrid EM-ECG algorithm never switches to latent variable models, performing inference (E-step) ECG due to the small normalized entropy term, and is signiﬁcantly more expensive compared to either the EM converges very quickly. This is predicted by our parameter updates (M-step) or the line search over- analysis: in the vicinity of the local optima Θ∗ the di- head in the CG step of ECG. To compare the perfor- rections of the vectors P (Θ) L (Θ) and (−S)−1 L (Θ) mance of the algorithms, we therefore simply compare become identical (ﬁg. 1), suggesting that EM will have the number of E-steps each algorithm executes until its Quasi-Newton convergence behavior. convergence. We ﬁrst show results on synthetic data We then applied our algorithms to the training of Hid- sets, whose properties we can control to verify certain den Markov Models (HMMs). Missing information in aspects of our theoretical analysis. We also report em- this model is high when the observed data do not well pirical results on several real world data sets, showing determine the underlying state sequence (given the pa- that our algorithms do work well in practice. Though rameters). We therefore generated two data sets from we show examples of single runs, we have conﬁrmed a 5-state HMM, with an alphabet size of 5 characters. that the convergence results presented in all our ex- The ﬁrst data set (“aliased” sequences) was generated periments do not vary signiﬁcantly for diﬀerent initial from a HMM where output parameters were set to uni- parameter conditions. For all of the reported experi- form values plus some small noise. The second data ments, we used tol = 10−8 and τ = 0.5. set (“very structured sequences”) was generated from a HMM with sparse transition and output matrices. 5.1. Synthetic Data Sets For the ambiguous or aliased data, ECG and hybrid First, consider a mixture of Gaussians (MoG) model. EM-ECG outperform EM substantially. For the very We considered two types of data sets, one in which structured data, EM performs well and exhibits second −4 x 10 0.04 5−Component MOG 1 50−Component MOG 65−Component MOG 0.01 0.02 EM−ECG EM−ECG 0 EM−ECG / EM 0 ECG EM 0.006 −1 −0.02 0.002 ECG ECG −2 −0.04 −0.002 −3 −0.06 EM −0.006 −4 −0.08 −0.1 −5 −0.01 0 50 100 150 200 250 0 200 400 600 800 1000 0 100 200 300 400 500 600 −3 x 10 0.02 4 5−State HMM 8−State HMM 10−State HMM 0.01 EM−ECG 2 EM−ECG 0.01 0 0 EM−ECG ECG ECG EM ECG EM 0 EM −2 −0.01 −0.01 −4 −0.02 −0.02 −6 DNA Sequence: DNA Sequence: DNA Sequence: ATGCCGGAGGCCC ... ATGCCGGAGGCCC ... ATGCCGGAGGCCC ... −8 −0.03 −0.03 0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000 0.01 3−Class AMM 0.02 6−Class AMM 0.05 9−Class AMM 0 0 0 EM−ECG EM−ECG EM EM−ECG EM Log−Likelihood + Const EM −0.02 −0.01 −0.05 −0.04 −0.06 −0.02 −0.1 −0.08 −0.03 −0.15 −0.1 −0.04 −0.12 −0.2 ECG −0.14 ECG ECG E−Steps −0.05 −0.25 0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000 Figure 4. Learning curves for ECG, EM-ECG, and EM algorithms, displaying convergence performance under diﬀerent conditions for three models: MoG (upper), HMM (middle), and Aggregate Markov Models (bottom). The number of E-steps taken by either algorithm is shown on the horizontal axis, and log likelihood is shown on the vertical axis. For ECG and EM-ECG, diamonds indicate the maximum of each line search, and the zero-level likelihood corresponds to the converging point of the EM algorithm. The number of learned clusters for MoG model were 5 (left), 50 (middle), and 65 (right). For HMM model, the number of states were 5 (left), 8 (middle), and 10 (right). The number of learned themes for the AMM model were 3 (left), 6 (middle), and 9 (right). order convergence in the vicinity of the local optimum. determine class labels, and on more structured data, in which the proportion of missing information is very Finally, we experimented with Aggregate Markov small. ECG and hybrid EM-ECG are superior to EM Models (AMMs) [11]. AMMs model deﬁne a discrete by at least a factor of two for ambiguous data; for conditional probability table pij = p(y = j|x = i) structured data EM shows the expected Quasi-Newton using a low rank approximation. In the context of convergence behavior. n-gram models for word sequences, AMMs are class- based bigram models in which the mapping from words to classes is probabilistic. In particular, the 5.2. Real World Data Sets class-based bigram model predicts that word w1 is In our ﬁrst experiment, we cluster a set of 50,000 followed by word w2 with probability: P (w2 |w1 ) = 8 × 8 greyscale pixel image patches4 using a mixture C of Gaussians model. The patches were extracted from c=1 P (w2 |c)P (c|w1 ) with C being the total number 768×512 natural images, described in [14] (see ﬁg 5 for of classes. Here, the concept of missing information an example of a natural image, and sample patches). corresponds to how well or poor a set of words de- To speed-up the experiments, the patch data was pro- termine the class labels C based on the observation jected with PCA down to a 10-dimensional linear sub- words that follow them. The right panels of ﬁgure 4 3 show training of a 2-class 50-state AMM model on The data set used was the imlog data set publicly ambiguous (aliased) data, in which words do not well available at ftp://hlab.phys.rug.nl/pub/samples/imlog space and the mixing proportions and covariances of ing a small number of “soft” classes (t): P (W |A) = T the model were held ﬁxed. The means were initial- t=1 P (W |t)P (t|A). Once again, we observe that for ized by performing K-means. We experimented with this simple model, this data set has a large fraction of mixtures having M=2 up to M=65 clusters. missing information. Figure 4 displays the convergence of EM, ECG, and EM-ECG algorithms for T=3,6,9. with hybrid EM-ECG and ECG having superior con- vergence over EM. 6. Discussion and Conclusions Although we have focused here on discrete latent vari- ables, the ECG and hybrid algorithms can also be de- rived for latent variable models with continuous hidden Figure 5. An example of a natural image and some samples variables. As an example ﬁgure 6 illustrates conver- of 8 × 8 grey pixel image patches, used in the clustering gence behaviour of the Probabilistic Principal Com- experiment. ponent Analysis (PPCA) latent variable model[9, 12], which has continuous rather than discrete hidden vari- Figure 4 displays the convergence of EM, ECG, ables. Here the concept of missing information is re- and Hybrid EM-EC algorithms for M=5, M=50 and lated to the ratios of the leading eigenvalues of the M=65. The experimental results show that with fewer sample covariance, which corresponds to the elliptic- mixture components EM outperforms ECG, since the ity of the distribution. For “low-rank” data with a components generally model the data with fairly dis- large ratio EM performs well; for nearly circular data tinct, non-contiguous clusters. As the number of ECG converges faster.6 mixtures components increases, clusters overlap in contiguous regions and the normalized entropy term In some degenerate cases, where the proportion of grows, suggesting a relatively high proportion of the missing information is very high, i.e. M (Θ∗ ) ap- missing information. In this case ECG outperform EM proaches identity, EM convergence can be exponen- by several orders of magnitude. Hybrid EM-ECG al- tially slow. Figure 6 illustrates such example for the gorithm is never inferior to either EM or ECG (using case of HMM training using almost random sequences. our untuned setting of switching threshold τ = 0.5). It takes about 7,000 iterations for ECG and EM-ECG to converge to the ML estimate, whereas even after Our second experiment consisted of training a fully 250,000 iterations EM is still only approaching the lo- connected HMM to model DNA sequences. For the cal optimum. training, we used publicly available ”GENIE gene ﬁnding data set”, provided by UCSC and LBNL [5], In this paper we have presented comparative analy- that contains 793 unrelated human genomic DNA se- sis of EM and direct optimization algorithms for la- quences. We applied our diﬀerent algorithms on 66 tent variable models, and developed a theoretical con- DNA sequences with length varying anywhere between nection between these two approaches. We have also 200 to 3000 genes per sequence. The number of states analyzed and determined conditions under which EM ranged from M=5 to M=10 and all the parameter val- algorithm can demonstrate local-gradient and Quasi- ues were randomly initialized. Figure 4 shows the Newton convergence behaviors. Our results extend convergence of EM, ECG, and Hybrid EM-ECG algo- those of Xu and Jordan[15] who analyzed the conver- rithms for M=5,8,10. This data set contains very com- gence properties of the EM algorithm in the special plex structure which is not easily modeled by HMMs, case of Gaussian mixtures, and apply to any exponen- resulting in a very high proportion of missing informa- tial family model. tion. As a result, hybrid EM-ECG and ECG substan- Motivated by these analyses, we have proposed an al- tially outperform EM in terms of convergence. ternative hybrid optimization method that can signif- In our last experiment, we applied Aggregate Markov icantly outperform EM in many cases and is almost Models to the data set consisting of 2,037 NIPS au- never inferior. We tested the proposed algorithms by thors and corresponding counts of the top 1,000 most 6 The slow convergence of EM in PPCA is also true for frequently used words of the NIPS conference proceed- factor analysis and especially for linear dynamic systems. ings, volumes 1 to 12.5 The goal was to model the In these models, there is large amount of missing informa- probability that an author A will use word W us- tion due to the fact that latent variables are continuous and they can be rotated without aﬀecting the likelihood as 5 NIPS corpus used in the experiments is publicly avail- long as the parameters are rotated accordingly. able at http://www.cs.toronto.edu/∼roweis/data.html −4 0.1 x 10 2 Probabilistic PCA Probabilistic PCA 5−State HMM 0 0 EM 0 EM EM−ECG / Log−Likelihood + Const ECG ECG −5 −0.1 ECG −2 EM 10 3 −10 −4 −0.2 2 5 1 −0.3 −15 −6 0 0 −1 −5 −8 −0.4 −20 −2 −10 −3 −10 −5 0 5 10 −2 0 2 E−Steps −10 0 50,000 100,000 150,000 200,000 250,000 −0.5 −25 0 50 100 150 200 0 50 100 150 200 Figure 6. Learning curves for ECG (dots) and EM (solid lines) algorithms, showing superior (left) and inferior (middle) performance of ECG. The left panel uses ”ill-conditioned” data for which ECG converges quickly; the middle panel uses “low-rank” data for which EM performs better. Right panel displays ”non-converging” case of the EM. Very unstructured data (30 sequences, each of length 50) was generated from a full 5-state HMM with alphabet size of 5. Parameter values were set to be uniform plus some small uniform noise. ECG and EM-ECG converge in about 7,000 iterations, whereas after even 250,000 iterations, EM is only approaching to the ML estimate. training several basic latent variable models on sev- [5] GENIE gene data set. LBNL and UC Santa Cruz, eral synthetic as well as real world data sets, reporting http://www.fruitfly.org/sequence. convergence behavior and explaining the results with [6] Mortaza Jamshidian and Robert I. Jennrich. Accel- reference to our analysis. eration of the EM algorithm by using quasi-newton methods. J. of the RS Society series B, 49, 1997. Our convergence analysis can also be extended to a broader class of bound optimization techniques, [7] Meng X. L. and van Dyk D. Fast EM-type imple- mentations for mixed eﬀects models. J. of the Royal such as iterative scaling (IS) algorithms for parameter Statistical Society series B, 60:559–578, 1998. estimation in maximum entropy models[2] and the recent CCCP algorithm for minimizing the Bethe free [8] Richard A. Redner and Homer F. Walker. Mixture energy in approximate inference problems[16]. These densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, April 1984. analyses allow us to gain a deeper understanding of the nature of these algorithms and the conditions [9] S. T. Roweis. EM algorthms for PCA and SPCA. under which certain optimization techniques can In Advances in neural information processing systems, volume 10, pages 626–632, Cambridge, MA, 1998. be expected to outperform others. Based on these extended analyses we are designing accelerated ﬁtting [10] Ruslan Salakhutdinov. Relationship between gra- algorithms for these models as well. dient and EM steps in latent variable models. http://www.cs.toronto.edu/∼rsalakhu/ecg. Acknowledgments [11] Lawrence Saul and Fernando Pereira. Aggregate and We would like to thank Yoshua Bengio and Drew Bag- mixed-order Markov models for statistical language nell for many useful comments and Carl Rasmussen for processing. In Proceedings of the Second Conference on Empirical Methods in Natural Language Process- providing an initial version of conjugate gradient code. ing, pages 81–89. 1997. References [12] M. E. Tipping and C. M. Bishop. Mixtures of prob- [1] S.E. Atkinson. The performance of standard and hy- abilistic principal component analysers. Neural Com- brid EM algorithms for ML estimates of the normal putation, 11(2):443–482, 1999. mixture model with censoring. J. of Stat. Computa- [13] Naonori Ueda, Ryohei Nakano, Zoubin Ghabramani, tion and Simulation, 44, 1992. and Geoﬀrey E. Hinton. SMEM algorithm for mixture models. Neural Computation, 12(9):2109–2128, 2000. [2] Stephen Della Pietra, Vincent J. Della Pietra, and John D. Laﬀerty. Inducing features of random ﬁelds. [14] J. H. van Hateren and A. van der Schaaf. Independent IEEE Transactions on Pattern Analysis and Machine component ﬁlters of natural images compared with Intelligence, 19(4):380–393, 1997. simple cells in primary visual cortex. In Proceedings of the Royal Society of London, pages 359–366, 1998. [3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Max- imum likelihood from incomplete data via the EM al- [15] L. Xu and M. I. Jordan. On convergence properties gorithm. J. of the RS Society series B, 39:1–38, 1977. of the EM algorithm for Gaussian mixtures. Neural Computation, 8(1):129–151, 1996. [4] Chris Fraley. On computing the largest fraction of [16] Alan Yuille and Anand Rangarajan. The convex- missing information for the EM algorithm and the concave computational procedure (CCCP). In Ad- worst linear function for data augmentation. Tech- vances in NIPS, volume 13, 2001. nical report, University of Washington.