
Dirichlet Enhanced Latent Semantic Analysis

Kai Yu
Siemens Corporate Technology
D-81730 Munich, Germany
Kai.Yu@siemens.com

Shipeng Yu
Institute for Computer Science, University of Munich
D-80538 Munich, Germany
spyu@dbs.informatik.uni-muenchen.de

Volker Tresp
Siemens Corporate Technology
D-81730 Munich, Germany
Volker.Tresp@siemens.com

Abstract

This paper describes nonparametric Bayesian treatments for analyzing records containing occurrences of items. The introduced model retains the strength of previous approaches that explore the latent factors of each record (e.g. topics of documents), and further uncovers the clustering structure of records, which reflects the statistical dependencies of the latent factors. The nonparametric model induced by a Dirichlet process (DP) flexibly adapts model complexity to reveal the clustering structure of the data. To avoid the problems of dealing with infinite dimensions, we further replace the DP prior by a simpler alternative, namely Dirichlet-multinomial allocation (DMA), which maintains the main modelling properties of the DP. Instead of relying on Markov chain Monte Carlo (MCMC) for inference, this paper applies efficient variational inference based on DMA. The proposed approach yields encouraging empirical results on both a toy problem and text data. The results show that the proposed algorithm uncovers not only the latent factors but also the clustering structure.

1 Introduction

We consider the problem of modelling a large corpus of high-dimensional discrete records. Our assumption is that a record can be modelled by latent factors which account for the co-occurrence of items in a record. To ground the discussion, in the following we will identify records with documents, latent factors with (latent) topics, and items with words. Probabilistic latent semantic indexing (PLSI) [7] was one of the first approaches that provided a probabilistic treatment of text documents as being composed of latent topics. Latent Dirichlet allocation (LDA) [3] generalizes PLSI by treating the topic mixture parameters (i.e. a multinomial over topics) as variables drawn from a Dirichlet distribution. Its Bayesian treatment avoids overfitting, and the model is generalizable to new data (the latter is problematic for PLSI). However, the parametric Dirichlet distribution can be a limitation in applications which exhibit a richer structure. As an illustration, consider Fig. 1(a), which shows the empirical distribution of three topics. We see that the probability that all three topics are present in a document (corresponding to the center of the plot) is near zero. In contrast, a Dirichlet distribution fitted to the data (Fig. 1(b)) would predict the highest probability density for exactly that case. The reason is the limited expressiveness of a simple Dirichlet distribution.

Figure 1: Consider a 2-dimensional simplex representing 3 topics (recall that the probabilities have to sum to one): (a) the probability distribution of topics in documents, which forms a ring-like distribution; dark color indicates low density. (b) The 3-dimensional Dirichlet distribution that maximizes the likelihood of the samples.

This paper employs a more general nonparametric Bayesian approach to explore not only latent topics and their probabilities, but also complex dependencies between latent topics, which might, for example, be expressed as a complex clustering structure. The key innovation is to replace the parametric Dirichlet prior distribution in LDA by a flexible nonparametric distribution G(·) that is a sample generated from a Dirichlet process (DP) or its finite approximation, Dirichlet-multinomial allocation (DMA). The Dirichlet distribution of LDA becomes the base distribution of the Dirichlet process. In this Dirichlet enhanced model, the posterior distribution of the topic mixture for a new document converges to a flexible mixture model in which both the mixture weights and the mixture parameters can be learned from the data. Thus the a posteriori distribution is able to represent the distribution of topics more truthfully. After convergence of the learning procedure, typically only a few components with non-negligible weights remain; thus the model is able to naturally output clusters of documents.

Nonparametric Bayesian modelling has attracted considerable attention from the learning community (e.g. [1, 13, 2, 15, 17, 16]). A potential problem with this class of models is that inference typically relies on MCMC approximations, which might be prohibitively slow for the large collections of documents in our setting. Instead, we tackle the problem by a less expensive variational mean-field inference based on the DMA model. The resultant updates turn out to be quite interpretable. Finally, we observed very good empirical performance of the proposed algorithm on both toy data and textual documents, especially in the latter case, where meaningful clusters are discovered.

This paper is organized as follows. The next section introduces Dirichlet enhanced latent semantic analysis. In Section 3 we present inference and learning algorithms based on a variational approximation. Section 4 presents experimental results using a toy data set and two document data sets. In Section 5 we present conclusions.

2 Dirichlet Enhanced Latent Semantic Analysis

Following the notation in [3], we consider a corpus D containing D documents. Each document d is a sequence of N_d words, denoted by w_d = {w_{d,1}, ..., w_{d,N_d}}, where w_{d,n} is a variable for the n-th word in w_d and denotes the index of the corresponding word in a vocabulary V. Note that the same word may occur several times in the sequence w_d.

2.1 The Proposed Model

We assume that each document is a mixture of k latent topics, and that the words in each document are generated by repeatedly sampling topics and words from the distributions

  w_{d,n} | z_{d,n}; β ∼ Mult(z_{d,n}, β),   (1)
  z_{d,n} | θ_d ∼ Mult(θ_d).   (2)

Here w_{d,n} is generated given its latent topic z_{d,n}, which takes values in {1, ..., k}. β is a k × |V| multinomial parameter matrix with Σ_j β_{i,j} = 1, where β_{z,w_{d,n}} specifies the probability of generating word w_{d,n} given topic z. θ_d denotes the parameters of the multinomial distribution of document d over topics for w_d, satisfying θ_{d,i} ≥ 0, Σ_{i=1}^k θ_{d,i} = 1.

In the LDA model, θ_d is generated from a k-dimensional Dirichlet distribution G_0(θ) = Dir(θ|λ) with parameter λ ∈ R^{k×1}. In our Dirichlet enhanced model, we assume that θ_d is generated from a distribution G(θ), which is itself a random sample generated from a Dirichlet process (DP) [5],

  G | G_0, α_0 ∼ DP(G_0, α_0),   (3)

where the nonnegative scalar α_0 is the precision parameter and G_0(θ) is the base distribution, which is identical to the Dirichlet distribution used in LDA. It turns out that a distribution G(θ) sampled from a DP can be written as

  G(·) = Σ_{l=1}^∞ π_l δ_{θ*_l}(·),   (4)

where π_l ≥ 0, Σ_{l=1}^∞ π_l = 1, δ_θ(·) is the point mass distribution concentrated at θ, and the θ*_l are countably infinite variables i.i.d. sampled from G_0 [14]. The probability weights π_l depend solely on α_0 via a stick-breaking process, which is defined in the next subsection. The generative model, summarized by Fig. 2(a), is conditioned on (k × |V| + k + 1) parameters, i.e. β, λ and α_0. Finally, the likelihood of the collection D is given by

  L_DP(D | α_0, λ, β) = ∫_G p(G; α_0, λ) Π_{d=1}^D ∫_{θ_d} p(θ_d | G) [ Π_{n=1}^{N_d} Σ_{z_{d,n}=1}^k p(w_{d,n} | z_{d,n}; β) p(z_{d,n} | θ_d) ] dθ_d dG.   (5)

In short, G is sampled once for the whole corpus D, θ_d is sampled once for each document d, and the topic z_{d,n} is sampled once for the n-th word w_{d,n} in d.

2.2 Stick Breaking and Dirichlet Enhancing

The representation (4) of a sample from the DP prior is generated by the stick-breaking process, in which an infinite number of pairs (π_l, θ*_l) are generated. θ*_l is sampled independently from G_0, and π_l is defined as

  π_1 = B_1,   π_l = B_l Π_{j=1}^{l-1} (1 − B_j),

where the B_l are i.i.d. samples from the Beta distribution Beta(1, α_0). Thus, with a small α_0, the first "sticks" π_l will be large, with little left for the remaining sticks; conversely, if α_0 is large, the first sticks and all subsequent sticks will be small and the π_l will be more evenly distributed. In conclusion, the base distribution determines the locations of the point masses, and α_0 determines the distribution of the probability weights. The distribution is nonzero at an infinite number of discrete points, but if α_0 is selected to be small, the amplitudes of only a small number of discrete points will be significant.

By applying the stick-breaking representation, our model obtains the equivalent representation in Fig. 2(b). An infinite number of θ*_l are generated from the base distribution, and the new indicator variable c_d indicates which θ*_l is assigned to document d. If more than one document is assigned to the same θ*_l, clustering occurs. π = {π_1, ..., π_∞} is the vector of probability weights generated from the stick-breaking process.

Figure 2: Plate models for latent semantic analysis. (a) Latent semantic analysis with DP prior; (b) an equivalent representation, where c_d is the indicator variable saying which cluster document d takes on out of the infinite clusters induced by the DP; (c) latent semantic analysis with a finite approximation of the DP (see Sec. 2.3).
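As a concrete illustration of the stick-breaking construction just described, the following is a minimal Python sketch (standard library only). The function names and the truncation level are ours, not from the paper; a truncated loop stands in for the countably infinite sequence of sticks.

```python
import random

def dirichlet_sample(lam, rng):
    # A draw theta* ~ Dir(lam), via normalized Gamma variates.
    g = [rng.gammavariate(a, 1.0) for a in lam]
    s = sum(g)
    return [x / s for x in g]

def truncated_dp_sample(alpha0, lam, n_sticks, rng):
    """Truncated stick-breaking draw from DP(G0, alpha0) with base G0 = Dir(lam):
    B_l ~ Beta(1, alpha0); pi_1 = B_1, pi_l = B_l * prod_{j<l} (1 - B_j)."""
    weights, atoms, remaining = [], [], 1.0
    for _ in range(n_sticks):
        b = rng.betavariate(1.0, alpha0)
        weights.append(b * remaining)   # length broken off the remaining stick
        remaining *= 1.0 - b
        atoms.append(dirichlet_sample(lam, rng))  # location theta*_l from G0
    return weights, atoms

rng = random.Random(7)
w_small, _ = truncated_dp_sample(0.5, [1.0, 1.0, 1.0], 50, rng)   # few dominant sticks
w_large, _ = truncated_dp_sample(50.0, [1.0, 1.0, 1.0], 50, rng)  # weights spread out
```

With a small α_0 the first few weights absorb almost all of the unit stick, which is exactly the clustering tendency discussed above; with a large α_0 the weights stay small and nearly even.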
Note that both the locations and the weights are not fixed but take on new values each time a new sample of G is generated. Since E(G) = G_0, the prior initially corresponds to the prior used in LDA. With many documents in the training data set, locations θ*_l which agree with the data will obtain a large weight. If a small α_0 is chosen, the parameters will form clusters, whereas a large α_0 results in many representative parameters. Thus Dirichlet enhancement serves two purposes: it increases the flexibility in representing the posterior distribution of the mixing weights, and it encourages a clustered solution, leading to insights into the document corpus.

The DP prior offers two advantages over usual document clustering methods. First, there is no need to specify the number of clusters: the finally resulting clustering structure is constrained by the DP prior, but also adapted to the empirical observations. Second, the number of clusters is not fixed. Although the parameter α_0 tunes the tendency to form clusters, the DP prior allows the creation of new clusters if the current model cannot explain upcoming data very well, which is particularly suitable for our setting, where the dictionary is fixed while the number of documents keeps growing.

2.3 Dirichlet-Multinomial Allocation (DMA)

Since an infinite number of pairs (π_l, θ*_l) are generated in the stick-breaking process, it is usually very difficult to deal with the unknown distribution G. For inference, there exist Markov chain Monte Carlo (MCMC) methods, like Gibbs samplers, which directly sample θ_d using the Pólya urn scheme and thereby avoid the difficulty of sampling the infinite-dimensional G [4]; in practice, however, the sampling procedure is very slow and thus impractical for high-dimensional data like text. In Bayesian statistics, the Dirichlet-multinomial allocation DP_N of [6] has often been applied as a finite approximation to the DP (see [6, 9]). It takes the form

  G_N = Σ_{l=1}^N π_l δ_{θ*_l},

where π = {π_1, ..., π_N} is an N-vector of probability weights sampled once from the Dirichlet prior Dir(α_0/N, ..., α_0/N), and θ*_l, l = 1, ..., N, are i.i.d. sampled from the base distribution G_0. It has been shown that the limiting case of DP_N is the DP [6, 9, 12]; more importantly, DP_N demonstrates similar stick-breaking properties and leads to a similar clustering effect [6]. If N is sufficiently large with respect to the sample size D, DP_N gives a good approximation to the DP.

Under the DP_N model, the plate representation of our model is illustrated in Fig. 2(c). The likelihood of the whole collection D is

  L_{DP_N}(D | α_0, λ, β) = ∫_π ∫_{θ*} Π_{d=1}^D Σ_{c_d=1}^N p(w_d | θ*, c_d; β) p(c_d | π) dP(θ*; G_0) dP(π; α_0),   (6)

where c_d is the indicator variable saying which unique value θ*_l document d takes on. The likelihood of document d is therefore written as

  p(w_d | θ*, c_d; β) = Π_{n=1}^{N_d} Σ_{z_{d,n}=1}^k p(w_{d,n} | z_{d,n}; β) p(z_{d,n} | θ*_{c_d}).

2.4 Connections to PLSA and LDA

From the application point of view, PLSA and LDA both aim to discover the latent dimensions of data, with the emphasis on indexing. The proposed Dirichlet enhanced semantic analysis retains the strengths of PLSA and LDA, and further explores the clustering structure of the data. The model is a generalization of LDA: if we let α_0 → ∞, it becomes identical to LDA, since the sampled G becomes identical to the finite Dirichlet base distribution G_0. This extreme case makes documents mutually independent given G_0, since the θ_d are then i.i.d. sampled from G_0. If G_0 itself is not sufficiently expressive, the model is not able to capture the dependency between documents. The Dirichlet enhancement elegantly solves this problem: with a moderate α_0, the model allows G to deviate from G_0, giving the flexibility to explore a richer structure in the data. Exchangeability may not hold within the whole collection, but it does hold between groups of documents with respective atoms θ*_l sampled from G_0. On the other hand, the increased flexibility does not lead to overfitting, because inference and learning are done in a Bayesian setting, averaging over the number of mixture components and the states of the latent variables.

3 Inference and Learning

In this section we consider model inference and learning based on the DP_N model. As seen from Fig. 2(c), inference needs to calculate the a posteriori joint distribution of the latent variables, p(π, θ*, c, z | D, α_0, λ, β), which requires computing Eq. (6). This integral is, however, analytically infeasible. A straightforward Gibbs sampling method can be derived, but it turns out to be very slow and inapplicable to high-dimensional data like text, since for each word we have to sample a latent variable z. Therefore, in this section we suggest efficient variational inference.

3.1 Variational Inference

The idea of variational mean-field inference is to propose a joint distribution Q(π, θ*, c, z) conditioned on some free parameters, and then enforce Q to approximate the a posteriori distribution of interest by minimizing the KL divergence D_KL(Q || p(π, θ*, c, z | D, α_0, λ, β)) with respect to those free parameters. We propose a variational distribution over the latent variables of the following form:

  Q(π, θ*, c, z | η, γ, ψ, φ) = Q(π | η) Π_{l=1}^N Q(θ*_l | γ_l) Π_{d=1}^D Q(c_d | ψ_d) Π_{d=1}^D Π_{n=1}^{N_d} Q(z_{d,n} | φ_{d,n}),   (7)

where η, γ, ψ, φ are variational parameters, each tailoring the variational a posteriori distribution to the corresponding latent variable. In particular, η specifies an N-dimensional Dirichlet distribution for π, γ_l specifies a k-dimensional Dirichlet distribution for the distinct θ*_l, ψ_d specifies an N-dimensional multinomial for the indicator c_d of document d, and φ_{d,n} specifies a k-dimensional multinomial over latent topics for word w_{d,n}. It turns out that minimizing the KL divergence is equivalent to maximizing a lower bound on ln p(D | α_0, λ, β) derived by applying Jensen's inequality [10]; see the Appendix for details of the derivation. The lower bound is given as

  L_Q(D) = Σ_{d=1}^D Σ_{n=1}^{N_d} E_Q[ln p(w_{d,n} | z_{d,n}, β) p(z_{d,n} | θ*, c_d)] + E_Q[ln p(π | α_0)] + Σ_{d=1}^D E_Q[ln p(c_d | π)] + Σ_{l=1}^N E_Q[ln p(θ*_l | G_0)] − E_Q[ln Q(π, θ*, c, z)].   (8)

The optimum is found by setting the partial derivatives with respect to each variational parameter to zero, which gives rise to the following updates:

  φ_{d,n,i} ∝ β_{i,w_{d,n}} exp( Σ_{l=1}^N ψ_{d,l} [Ψ(γ_{l,i}) − Ψ(Σ_{j=1}^k γ_{l,j})] ),   (9)

  ψ_{d,l} ∝ exp( Σ_{i=1}^k [Ψ(γ_{l,i}) − Ψ(Σ_{j=1}^k γ_{l,j})] Σ_{n=1}^{N_d} φ_{d,n,i} + Ψ(η_l) − Ψ(Σ_{j=1}^N η_j) ),   (10)

  γ_{l,i} = Σ_{d=1}^D ψ_{d,l} Σ_{n=1}^{N_d} φ_{d,n,i} + λ_i,   (11)

  η_l = Σ_{d=1}^D ψ_{d,l} + α_0/N,   (12)

where Ψ(·) is the digamma function, the first derivative of the log-Gamma function. Some details of the derivation of these formulas can be found in the Appendix.

We find the updates quite interpretable. For example, in Eq. (9), φ_{d,n,i} is the a posteriori probability of latent topic i given the word w_{d,n}. It is determined both by the corresponding entry in the β matrix, which can be seen as a likelihood term, and by the probability that document d selects topic i, i.e. the prior term. Here the prior is itself a weighted average of the different θ*_l to which d is assigned. In Eq. (12), η_l is the a posteriori weight of π_l, and turns out to be a tradeoff between the empirical responses at θ*_l and the prior specified by α_0. Finally, since the parameters are coupled, variational inference is performed by iterating Eq. (9) to Eq. (12) until convergence.

3.2 Parameter Estimation

Following the empirical Bayesian framework, we estimate the hyperparameters α_0, λ, and β by iteratively maximizing the lower bound L_Q both with respect to the variational parameters (as described by Eqs. (9)-(12)) and with respect to the model parameters, holding the remaining parameters fixed. This iterative procedure is also referred to as variational EM [10]. It is easy to derive the update for β:

  β_{i,j} ∝ Σ_{d=1}^D Σ_{n=1}^{N_d} φ_{d,n,i} δ_j(w_{d,n}),   (13)

where δ_j(w_{d,n}) = 1 if w_{d,n} = j, and 0 otherwise. For the remaining parameters, let us first write down the parts of L_Q in Eq. (8) involving α_0 and λ:

  L_[α_0] = ln Γ(α_0) − N ln Γ(α_0/N) + (α_0/N − 1) Σ_{l=1}^N [Ψ(η_l) − Ψ(Σ_{j=1}^N η_j)],

  L_[λ] = Σ_{l=1}^N ( ln Γ(Σ_{i=1}^k λ_i) − Σ_{i=1}^k ln Γ(λ_i) + Σ_{i=1}^k (λ_i − 1) [Ψ(γ_{l,i}) − Ψ(Σ_{j=1}^k γ_{l,j})] ).

Estimates for α_0 and λ are found by maximizing these objective functions using standard methods like Newton-Raphson, as suggested in [3].

4 Empirical Study

4.1 Toy Data

We first apply the model to a toy problem with k = 5 latent topics and a dictionary containing 200 words. The assumed probabilities of generating words from topics, i.e. the parameters β, are illustrated in Fig. 3(d), in which each colored line corresponds to one topic and assigns non-zero probabilities to a subset of the words. For each run we generate data with the following steps: (1) choose a cluster number M between 5 and 12; (2) generate M document clusters, each of which is defined by a combination of topics; (3) generate each document d, d = 1, ..., 100, by first randomly selecting a cluster and then generating 40 words according to the corresponding topic combination. For DP_N we select N = 100, and we aim to examine the performance in discovering the latent topics and the document clustering structure.

In Fig. 3(a)-(c) we illustrate the process of clustering documents over the EM iterations for a run containing 6 document clusters. Fig. 3(a) shows the initial random assignment ψ_{d,l} of each document d to a cluster l. After one EM step the documents begin to accumulate in a reduced number of clusters (Fig. 3(b)), and they converge to exactly 6 clusters after 5 steps (Fig. 3(c)). The learned word distributions of the topics, β, are shown in Fig. 3(e) and are very similar to the true distributions.

By varying M, the true number of document clusters, we examine whether our model can find the correct M. To determine the number of clusters, we run the variational inference and obtain for each document a weight vector ψ_{d,l} over clusters. Each document then takes the cluster with the largest weight as its assignment, and we calculate the cluster number as the number of non-empty clusters. For each setting of M from 5 to 12, we randomize the data for 20 trials and obtain the curve in Fig. 3(f), which shows the average performance and the variance. In 37% of the runs we get perfect results, and in another 43% of the runs the learned values deviate from the truth by only one. However, we also find that the model tends to return slightly fewer than M clusters when M is large. The reason might be that 100 documents are not sufficient for learning a large number M of clusters.

Figure 3: Experimental results for the toy problem. (a)-(c) show the document-cluster assignments ψ_{d,l} over the variational inference for a run with 6 document clusters: (a) initial random assignments; (b) assignments after one iteration; (c) assignments after five iterations (final). The multinomial parameter matrix β of true values and estimated values is given in (d) and (e), respectively; each line gives the probabilities of generating the 200 words, with peaks indicating high probabilities. (f) shows the learned number of clusters with respect to the true number, with mean and error bar.

4.2 Document Modelling

We compare the proposed model with PLSI and LDA on two text data sets. The first one is a subset of the Reuters-21578 data set, containing 3000 documents and 20334 words. The second one is taken from the 20-newsgroups data set and has 2000 documents with 8014 words. The comparison metric is perplexity, conventionally used in language modelling. For a test document set, it is formally defined as

  Perplexity(D_test) = exp( − ln p(D_test) / Σ_d |w_d| ).

We follow the formula in [3] to calculate the perplexity for PLSI. In our algorithm, N is set to the number of training documents. Fig. 4(a) and (b) show the comparison results for different numbers k of latent topics. Our model outperforms LDA and PLSI in all the runs, which indicates that the flexibility introduced by the DP enhancement does not produce overfitting and results in better generalization performance.

4.3 Clustering

In our last experiment we demonstrate that our approach is suitable for finding relevant document clusters. We select four categories, autos, motorcycles, baseball and hockey, from the 20-newsgroups data set, with 446 documents in each topic. Fig. 4(c) illustrates one clustering result, in which we set the topic number k = 5 and found 6 document clusters. In the figure the documents are indexed according to their true category labels, so we can clearly see that the result is quite meaningful. Documents from one category show similar membership to the learned clusters, and different categories can be distinguished very easily. The first two categories are not clearly separated because both talk about vehicles and share many terms, while the remaining categories, baseball and hockey, are ideally detected.

Figure 4: (a) and (b): perplexity results on Reuters-21578 and 20-newsgroups for DELSA, PLSI and LDA; (c): clustering result on the 20-newsgroups data set.

5 Conclusions and Future Work

This paper proposes a Dirichlet enhanced latent semantic analysis model for analyzing co-occurrence data like text, which retains the strength of previous approaches in finding latent topics, and further introduces additional modelling flexibility to uncover the clustering structure of the data. For inference and learning, we adopt a variational mean-field approximation based on a finite alternative to the DP. Experiments are performed on a toy data set and two text data sets. They show that our model can discover both the latent semantics and meaningful clustering structures.

In addition to our approach, alternative methods for approximate inference in DPs have been proposed using expectation propagation (EP) [11] or variational methods [16, 2]. Our approach is most similar to the work of Blei and Jordan [2], who applied a mean-field approximation for inference in DPs based on a truncated DP (TDP); their approach was formulated in the context of general exponential-family mixture models [2]. Conceptually, DP_N appears to be simpler than the TDP, in the sense that the a posteriori distribution of G is a symmetric Dirichlet, while the TDP ends up with a generalized Dirichlet (see [8]). In another sense, the TDP seems to be a tighter approximation to the DP. Future work will include a comparison of the various DP approximations.

Acknowledgements

The authors thank the anonymous reviewers for their valuable comments. Shipeng Yu gratefully acknowledges the support through a Siemens scholarship.

References

[1] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems (NIPS) 14, 2002.
[2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proceedings of the 21st International Conference on Machine Learning, 2004.
[3] D. M. Blei, A. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430), June 1995.
[5] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1:209-230, 1973.
[6] P. J. Green and S. Richardson. Modelling heterogeneity with and without the Dirichlet process. Unpublished paper, 2000.
[7] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual ACM SIGIR Conference, pages 50-57, Berkeley, California, August 1999.
[8] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161-173, 2001.
[9] H. Ishwaran and M. Zarepour. Exact and approximate sum-representations for the Dirichlet process. Canadian Journal of Statistics, 30:269-283, 2002.
[10] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.
[11] T. Minka and Z. Ghahramani. Expectation propagation for infinite mixtures. In NIPS'03 Workshop on Nonparametric Bayesian Methods and Infinite Models, 2003.
[12] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249-265, 2000.
[13] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14, 2002.
[14] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650, 1994.
[15] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Technical Report 653, Department of Statistics, University of California, Berkeley, 2004.
[16] V. Tresp and K. Yu. An introduction to nonparametric hierarchical Bayesian modelling with a focus on multi-agent learning. In Proceedings of the Hamilton Summer School on Switching and Learning in Feedback Systems, Lecture Notes in Computer Science, 2004.
[17] K. Yu, V. Tresp, and S. Yu. A nonparametric hierarchical Bayesian framework for information filtering. In Proceedings of the 27th Annual International ACM SIGIR Conference, 2004.

Appendix

To simplify the notation, we write Ξ for all the latent variables {π, θ*, c, z}. With the variational form in Eq. (7), we apply Jensen's inequality to the likelihood in Eq. (6) and obtain

  ln p(D | α_0, λ, β)
    = ln ∫_π ∫_{θ*} Σ_c Σ_z p(D, Ξ | α_0, λ, β) dθ* dπ
    = ln ∫_π ∫_{θ*} Σ_c Σ_z Q(Ξ) [ p(D, Ξ | α_0, λ, β) / Q(Ξ) ] dθ* dπ
    ≥ ∫_π ∫_{θ*} Σ_c Σ_z Q(Ξ) ln p(D, Ξ | α_0, λ, β) dθ* dπ − ∫_π ∫_{θ*} Σ_c Σ_z Q(Ξ) ln Q(Ξ) dθ* dπ
    = E_Q[ln p(D, Ξ | α_0, λ, β)] − E_Q[ln Q(Ξ)],

which results in Eq. (8). To write out each term in Eq. (8) explicitly, we have, for the first term,

  Σ_{d=1}^D Σ_{n=1}^{N_d} E_Q[ln p(w_{d,n} | z_{d,n}, β)] = Σ_{d=1}^D Σ_{n=1}^{N_d} Σ_{i=1}^k φ_{d,n,i} ln β_{i,ν},

where ν is the index of word w_{d,n}. The other terms can be derived as follows:

  Σ_{d=1}^D Σ_{n=1}^{N_d} E_Q[ln p(z_{d,n} | θ*, c_d)] = Σ_{d=1}^D Σ_{n=1}^{N_d} Σ_{i=1}^k Σ_{l=1}^N ψ_{d,l} φ_{d,n,i} [Ψ(γ_{l,i}) − Ψ(Σ_{j=1}^k γ_{l,j})],

  E_Q[ln p(π | α_0)] = ln Γ(α_0) − N ln Γ(α_0/N) + Σ_{l=1}^N (α_0/N − 1) [Ψ(η_l) − Ψ(Σ_{j=1}^N η_j)],

  Σ_{d=1}^D E_Q[ln p(c_d | π)] = Σ_{d=1}^D Σ_{l=1}^N ψ_{d,l} [Ψ(η_l) − Ψ(Σ_{j=1}^N η_j)],

  Σ_{l=1}^N E_Q[ln p(θ*_l | G_0)] = Σ_{l=1}^N ( ln Γ(Σ_{i=1}^k λ_i) − Σ_{i=1}^k ln Γ(λ_i) + Σ_{i=1}^k (λ_i − 1) [Ψ(γ_{l,i}) − Ψ(Σ_{j=1}^k γ_{l,j})] ),

  E_Q[ln Q(π, θ*, c, z)] = ln Γ(Σ_{l=1}^N η_l) − Σ_{l=1}^N ln Γ(η_l) + Σ_{l=1}^N (η_l − 1) [Ψ(η_l) − Ψ(Σ_{j=1}^N η_j)]
    + Σ_{l=1}^N ( ln Γ(Σ_{i=1}^k γ_{l,i}) − Σ_{i=1}^k ln Γ(γ_{l,i}) + Σ_{i=1}^k (γ_{l,i} − 1) [Ψ(γ_{l,i}) − Ψ(Σ_{j=1}^k γ_{l,j})] )
    + Σ_{d=1}^D Σ_{l=1}^N ψ_{d,l} ln ψ_{d,l} + Σ_{d=1}^D Σ_{n=1}^{N_d} Σ_{i=1}^k φ_{d,n,i} ln φ_{d,n,i}.

Differentiating the lower bound with respect to the different variational parameters gives the variational E-step in Eqs. (9)-(12). The M-step is obtained by maximizing the lower bound with respect to β, λ and α_0.
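A quick numerical check of the identity behind the E-step and the Appendix terms: for θ ∼ Dir(λ), E[ln θ_i] = Ψ(λ_i) − Ψ(Σ_j λ_j), which is why the digamma differences appear throughout Eqs. (9)-(12). The sketch below uses only the standard library; approximating Ψ by a finite difference of lgamma is our own shortcut, not something from the paper.

```python
import math
import random

def digamma(x):
    # Numerical digamma via a central difference of log-Gamma;
    # adequate for a sanity check, not for production use.
    h = 1e-5
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def sample_dirichlet(lam, rng):
    g = [rng.gammavariate(a, 1.0) for a in lam]
    s = sum(g)
    return [x / s for x in g]

rng = random.Random(0)
lam = [2.0, 3.0, 5.0]
n = 50000
est = [0.0] * len(lam)
for _ in range(n):
    theta = sample_dirichlet(lam, rng)
    for i in range(len(lam)):
        est[i] += math.log(theta[i]) / n  # Monte Carlo estimate of E[ln theta_i]

exact = [digamma(a) - digamma(sum(lam)) for a in lam]
```

The Monte Carlo averages `est` should agree with the closed-form values `exact` up to sampling noise.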
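To make the E-step (9)-(12) and the M-step (13) of Section 3 concrete, here is a self-contained toy re-implementation of the variational EM loop. It is a sketch under our own simplifications: λ and α_0 are held fixed (no Newton-Raphson step of Sec. 3.2), the digamma function is approximated numerically, and the miniature corpus and all names are ours.

```python
import math
import random

def digamma(x):
    h = 1e-5  # finite-difference digamma; adequate for a sketch
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def variational_em(docs, V, k, N, alpha0=1.0, iters=20, seed=0):
    """Sketch of variational EM for the DP_N model: docs is a list of
    word-id lists, V the vocabulary size, k the topic number, N the
    truncation level of the Dirichlet-multinomial allocation."""
    rng = random.Random(seed)
    D = len(docs)
    lam = [1.0] * k                                   # base Dirichlet, fixed here
    beta = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(k)]
    psi = [normalize([rng.random() + 0.1 for _ in range(N)]) for _ in range(D)]
    gamma = [[1.0] * k for _ in range(N)]
    eta = [1.0] * N
    for _ in range(iters):
        # E-step, Eqs. (9)-(12)
        Elog = [[digamma(gamma[l][i]) - digamma(sum(gamma[l])) for i in range(k)]
                for l in range(N)]
        phi = []
        for d, doc in enumerate(docs):
            phi_d = []
            for w in doc:  # Eq. (9): likelihood term times prior term
                prior = [sum(psi[d][l] * Elog[l][i] for l in range(N)) for i in range(k)]
                phi_d.append(normalize([beta[i][w] * math.exp(prior[i]) for i in range(k)]))
            phi.append(phi_d)
        dig_eta = digamma(sum(eta))
        for d, doc in enumerate(docs):  # Eq. (10), as a log-space softmax
            counts = [sum(p[i] for p in phi[d]) for i in range(k)]
            logpsi = [sum(Elog[l][i] * counts[i] for i in range(k))
                      + digamma(eta[l]) - dig_eta for l in range(N)]
            m = max(logpsi)
            psi[d] = normalize([math.exp(x - m) for x in logpsi])
        for l in range(N):  # Eqs. (11) and (12)
            for i in range(k):
                gamma[l][i] = lam[i] + sum(psi[d][l] * sum(p[i] for p in phi[d])
                                           for d in range(D))
            eta[l] = alpha0 / N + sum(psi[d][l] for d in range(D))
        # M-step, Eq. (13)
        beta = [[1e-10] * V for _ in range(k)]
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                for i in range(k):
                    beta[i][w] += phi[d][n][i]
        beta = [normalize(row) for row in beta]
    return beta, psi

# A hypothetical four-document corpus over six words, two obvious groups.
docs = [[0, 1, 2, 0, 1], [2, 0, 1, 1, 2], [3, 4, 5, 3, 4], [5, 3, 4, 4, 5]]
beta, psi = variational_em(docs, V=6, k=2, N=4)
```

The returned `beta` rows are the learned topic-word multinomials and each `psi[d]` is the document's a posteriori cluster membership; non-empty clusters in `psi` play the role of the document clusters of Section 4.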
