Dirichlet Enhanced Latent Semantic Analysis
Kai Yu
Siemens Corporate Technology, D-81730 Munich, Germany
Kai.Yu@siemens.com

Shipeng Yu
Institute for Computer Science, University of Munich, D-80538 Munich, Germany
spyu@dbs.informatik.uni-muenchen.de

Volker Tresp
Siemens Corporate Technology, D-81730 Munich, Germany
Volker.Tresp@siemens.com

Abstract

This paper describes nonparametric Bayesian treatments for analyzing records containing occurrences of items. The introduced model retains the strength of previous approaches that explore the latent factors of each record (e.g. topics of documents), and further uncovers the clustering structure of records, which reflects the statistical dependencies of the latent factors. The nonparametric model induced by a Dirichlet process (DP) flexibly adapts model complexity to reveal the clustering structure of the data. To avoid the problems of dealing with infinite dimensions, we further replace the DP prior by a simpler alternative, namely Dirichlet-multinomial allocation (DMA), which maintains the main modelling properties of the DP. Instead of relying on Markov chain Monte Carlo (MCMC) for inference, this paper applies efficient variational inference based on DMA. The proposed approach yields encouraging empirical results on both a toy problem and text data. The results show that the proposed algorithm uncovers not only the latent factors, but also the clustering structure.

1 Introduction

We consider the problem of modelling a large corpus of high-dimensional discrete records. Our assumption is that a record can be modelled by latent factors which account for the co-occurrence of items in a record. To ground the discussion, in the following we will identify records with documents, latent factors with (latent) topics and items with words. Probabilistic latent semantic indexing (PLSI) [7] was one of the first approaches that provided a probabilistic approach towards modelling text documents as being composed of latent topics. Latent Dirichlet allocation (LDA) [3] generalizes PLSI by treating the topic mixture parameters (i.e. a multinomial over topics) as variables drawn from a Dirichlet distribution. Its Bayesian treatment avoids overfitting and the model is generalizable to new data (the latter is problematic for PLSI). However, the parametric Dirichlet distribution can be a limitation in applications which exhibit a richer structure. As an illustration, consider Fig. 1(a), which shows the empirical distribution of three topics. We see that the probability that all three topics are present in a document (corresponding to the center of the plot) is near zero. In contrast, a Dirichlet distribution fitted to the data (Fig. 1(b)) would predict the highest probability density for exactly that case. The reason is the limited expressiveness of a simple Dirichlet distribution.

This paper employs a more general nonparametric Bayesian approach to explore not only latent topics and their probabilities, but also complex dependencies between latent topics which might, for example, be expressed as a complex clustering structure. The key innovation is to replace the parametric Dirichlet prior distribution in LDA by a flexible nonparametric distribution $G(\cdot)$ that is a sample generated from a Dirichlet process (DP) or its finite approximation, Dirichlet-multinomial allocation (DMA). The Dirichlet distribution of LDA becomes the base distribution for the Dirichlet process. In this Dirichlet enhanced model, the posterior distribution of the topic mixture for a new document converges to a flexible mixture model in which both mixture weights and mixture parameters can be learned from the data. Thus the a posteriori distribution is able to represent the distribution of topics more truthfully. After convergence of the learning procedure, typically only a few components with non-negligible weights remain; thus the model is able to naturally output clusters of documents.

Nonparametric Bayesian modelling has attracted considerable attention from the learning community
(e.g. [1, 13, 2, 15, 17, 16]). A potential problem with this class of models is that inference typically relies on MCMC approximations, which might be prohibitively slow in dealing with the large collection of documents in our setting. Instead, we tackle the problem by a less expensive variational mean-field inference based on the DMA model. The resultant updates turn out to be quite interpretable. Finally we observed very good empirical performance of the proposed algorithm on both toy data and text documents, especially in the latter case, where meaningful clusters are discovered.

This paper is organized as follows. The next section introduces Dirichlet enhanced latent semantic analysis. In Section 3 we present inference and learning algorithms based on a variational approximation. Section 4 presents experimental results using a toy data set and two document data sets. In Section 5 we present conclusions.

Figure 1: Consider a 2-dimensional simplex representing 3 topics (recall that the probabilities have to sum to one): (a) We see the probability distribution of topics in documents which forms a ring-like distribution. Dark color indicates low density; (b) The 3-dimensional Dirichlet distribution that maximizes the likelihood of samples.

2 Dirichlet Enhanced Latent Semantic Analysis

Following the notation in [3], we consider a corpus $\mathcal{D}$ containing $D$ documents. Each document $d$ is a sequence of $N_d$ words that is denoted by $\mathbf{w}_d = \{w_{d,1}, \ldots, w_{d,N_d}\}$, where $w_{d,n}$ is a variable for the $n$-th word in $\mathbf{w}_d$ and denotes the index of the corresponding word in a vocabulary $V$. Note that the same word may occur several times in the sequence $\mathbf{w}_d$.

2.1 The Proposed Model

We assume that each document is a mixture of $k$ latent topics and words in each document are generated by repeatedly sampling topics and words using the distributions

\[ w_{d,n} \mid z_{d,n}; \beta \sim \mathrm{Mult}(z_{d,n}, \beta) \tag{1} \]
\[ z_{d,n} \mid \theta_d \sim \mathrm{Mult}(\theta_d). \tag{2} \]

$w_{d,n}$ is generated given its latent topic $z_{d,n}$, which takes value in $\{1, \ldots, k\}$. $\beta$ is a $k \times |V|$ multinomial parameter matrix, $\sum_j \beta_{i,j} = 1$, where $\beta_{z,w_{d,n}}$ specifies the probability of generating word $w_{d,n}$ given topic $z$. $\theta_d$ denotes the parameters of a multinomial distribution of document $d$ over topics for $\mathbf{w}_d$, satisfying $\theta_{d,i} \geq 0$, $\sum_{i=1}^{k} \theta_{d,i} = 1$.

In the LDA model, $\theta_d$ is generated from a $k$-dimensional Dirichlet distribution $G_0(\theta) = \mathrm{Dir}(\theta \mid \lambda)$ with parameter $\lambda \in \mathbb{R}^{k \times 1}$. In our Dirichlet enhanced model, we assume that $\theta_d$ is generated from distribution $G(\theta)$, which itself is a random sample generated from a Dirichlet process (DP) [5]

\[ G \mid G_0, \alpha_0 \sim \mathrm{DP}(G_0, \alpha_0), \tag{3} \]

where the nonnegative scalar $\alpha_0$ is the precision parameter, and $G_0(\theta)$ is the base distribution, which is identical to the Dirichlet distribution. It turns out that the distribution $G(\theta)$ sampled from a DP can be written as

\[ G(\cdot) = \sum_{l=1}^{\infty} \pi_l \, \delta_{\theta_l^*}(\cdot) \tag{4} \]

where $\pi_l \geq 0$, $\sum_{l=1}^{\infty} \pi_l = 1$, $\delta_{\theta}(\cdot)$ are point mass distributions concentrated at $\theta$, and $\theta_l^*$ are countably infinite variables i.i.d. sampled from $G_0$ [14]. The probability weights $\pi_l$ depend solely on $\alpha_0$ via a stick-breaking process, which is defined in the next subsection. The generative model summarized by Fig. 2(a) is conditioned on $(k \times |V| + k + 1)$ parameters, i.e. $\beta$, $\lambda$ and $\alpha_0$.

Finally the likelihood of the collection $\mathcal{D}$ is given by

\[ L_{\mathrm{DP}}(\mathcal{D} \mid \alpha_0, \lambda, \beta) = \int_G p(G; \alpha_0, \lambda) \prod_{d=1}^{D} \int_{\theta_d} p(\theta_d \mid G) \prod_{n=1}^{N_d} \sum_{z_{d,n}=1}^{k} p(w_{d,n} \mid z_{d,n}; \beta)\, p(z_{d,n} \mid \theta_d)\, d\theta_d\, dG. \tag{5} \]

In short, $G$ is sampled once for the whole corpus $\mathcal{D}$, $\theta_d$ is sampled once for each document $d$, and topic $z_{d,n}$ is sampled once for the $n$-th word $w_{d,n}$ in $d$.

2.2 Stick Breaking and Dirichlet Enhancing

The representation of a sample from the DP prior in Eq. (4) is generated in the stick breaking process, in which an infinite number of pairs $(\pi_l, \theta_l^*)$ are generated.
(a) (b) (c)
Figure 2: Plate models for latent semantic analysis. (a) Latent semantic analysis with DP prior; (b) An equivalent
representation, where cd is the indicator variable saying which cluster document d takes on out of the infinite
clusters induced by DP; (c) Latent semantic analysis with a finite approximation of DP (see Sec. 2.3).
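
As a rough illustration of the stick-breaking weights of Section 2.2, where $\pi_1 = B_1$, $\pi_l = B_l \prod_{j<l} (1 - B_j)$ and $B_l \sim \mathrm{Beta}(1, \alpha_0)$, the following Python sketch draws the first few sticks. The truncation to a finite number of sticks, the function name `stick_breaking_weights` and the chosen values of $\alpha_0$ are assumptions made for illustration only; they are not part of the paper.

```python
import random


def stick_breaking_weights(alpha0, num_sticks, rng):
    """Draw the first `num_sticks` weights pi_l of a stick-breaking process.

    Returns (weights, remaining), where `remaining` is the stick length
    not yet assigned, i.e. prod_{j <= num_sticks} (1 - B_j).
    Truncating at `num_sticks` is an illustrative assumption.
    """
    weights = []
    remaining = 1.0  # length left before the current break
    for _ in range(num_sticks):
        b = rng.betavariate(1.0, alpha0)  # B_l ~ Beta(1, alpha0)
        weights.append(b * remaining)     # pi_l = B_l * prod_{j<l} (1 - B_j)
        remaining *= 1.0 - b
    return weights, remaining


# Hypothetical settings: a small and a large precision parameter.
weights_small, left_small = stick_breaking_weights(0.5, 20, random.Random(0))
weights_large, left_large = stick_breaking_weights(20.0, 20, random.Random(0))
```

Since $E[B_l] = 1/(1 + \alpha_0)$, a small $\alpha_0$ tends to put most of the mass on the first few sticks, while a large $\alpha_0$ tends to spread it over many sticks, in line with the discussion in Section 2.2.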
$\theta_l^*$ is sampled independently from $G_0$ and $\pi_l$ is defined as

\[ \pi_1 = B_1, \qquad \pi_l = B_l \prod_{j=1}^{l-1} (1 - B_j), \]

where $B_l$ are i.i.d. sampled from the Beta distribution $\mathrm{Beta}(1, \alpha_0)$. Thus, with a small $\alpha_0$, the first "sticks" $\pi_l$ will be large with little left for the remaining sticks. Conversely, if $\alpha_0$ is large, the first sticks $\pi_l$ and all subsequent sticks will be small and the $\pi_l$ will be more evenly distributed. In conclusion, the base distribution determines the locations of the point masses and $\alpha_0$ determines the distribution of probability weights. The distribution is nonzero at an infinite number of discrete points. If $\alpha_0$ is selected to be small, the amplitudes of only a small number of discrete points will be significant. Note that both locations and weights are not fixed but take on new values each time a new sample of $G$ is generated. Since $E(G) = G_0$, initially, the prior corresponds to the prior used in LDA. With many documents in the training data set, locations $\theta_l^*$ which agree with the data will obtain a large weight. If a small $\alpha_0$ is chosen, parameters will form clusters, whereas if a large $\alpha_0$ is chosen, many representative parameters will result. Thus Dirichlet enhancement serves two purposes: it increases the flexibility in representing the posterior distribution of mixing weights and encourages a clustered solution leading to insights into the document corpus.

The DP prior offers two advantages over usual document clustering methods. First, there is no need to specify the number of clusters. The finally resulting clustering structure is constrained by the DP prior, but also adapted to the empirical observations. Second, the number of clusters is not fixed. Although the parameter $\alpha_0$ is a control parameter to tune the tendency for forming clusters, the DP prior allows the creation of new clusters if the current model cannot explain upcoming data very well, which is particularly suitable for our setting where the dictionary is fixed while the number of documents can be growing.

By applying the stick breaking representation, our model obtains the equivalent representation in Fig. 2(b). An infinite number of $\theta_l^*$ are generated from the base distribution and the new indicator variable $c_d$ indicates which $\theta_l^*$ is assigned to document $d$. If more than one document is assigned to the same $\theta_l^*$, clustering occurs. $\pi = \{\pi_1, \ldots, \pi_\infty\}$ is a vector of probability weights generated from the stick breaking process.

2.3 Dirichlet-Multinomial Allocation (DMA)

Since an infinite number of pairs $(\pi_l, \theta_l^*)$ are generated in the stick breaking process, it is usually very difficult to deal with the unknown distribution $G$. For inference there exist Markov chain Monte Carlo (MCMC) methods like Gibbs samplers which directly sample $\theta_d$ using the Pólya urn scheme and avoid the difficulty of sampling the infinite-dimensional $G$ [4]; in practice, the sampling procedure is very slow and thus impractical for high dimensional data like text. In Bayesian statistics, the Dirichlet-multinomial allocation $\mathrm{DP}_N$ in [6] has often been applied as a finite approximation to DP (see [6, 9]), which takes on the form

\[ G_N = \sum_{l=1}^{N} \pi_l \, \delta_{\theta_l^*}, \]

where $\pi = \{\pi_1, \ldots, \pi_N\}$ is an $N$-vector of probability weights sampled once from a Dirichlet prior $\mathrm{Dir}(\alpha_0/N, \ldots, \alpha_0/N)$, and $\theta_l^*$, $l = 1, \ldots, N$, are i.i.d. sampled from the base distribution $G_0$. It has been shown that the limiting case of $\mathrm{DP}_N$ is DP [6, 9, 12], and more importantly $\mathrm{DP}_N$ demonstrates similar stick breaking properties and leads to a similar clustering effect [6]. If $N$ is sufficiently large with
respect to our sample size $D$, $\mathrm{DP}_N$ gives a good approximation to DP.

Under the $\mathrm{DP}_N$ model, the plate representation of our model is illustrated in Fig. 2(c). The likelihood of the whole collection $\mathcal{D}$ is

\[ L_{\mathrm{DP}_N}(\mathcal{D} \mid \alpha_0, \lambda, \beta) = \int_{\pi} \int_{\theta^*} \prod_{d=1}^{D} \Big[ \sum_{c_d=1}^{N} p(\mathbf{w}_d \mid \theta^*, c_d; \beta)\, p(c_d \mid \pi) \Big] \, dP(\theta^*; G_0)\, dP(\pi; \alpha_0) \tag{6} \]

where $c_d$ is the indicator variable saying which unique value $\theta_l^*$ document $d$ takes on. The likelihood of document $d$ is therefore written as

\[ p(\mathbf{w}_d \mid \theta^*, c_d; \beta) = \prod_{n=1}^{N_d} \sum_{z_{d,n}=1}^{k} p(w_{d,n} \mid z_{d,n}; \beta)\, p(z_{d,n} \mid \theta^*_{c_d}). \]

2.4 Connections to PLSA and LDA

From the application point of view, PLSA and LDA both aim to discover the latent dimensions of data with the emphasis on indexing. The proposed Dirichlet enhanced semantic analysis retains the strengths of PLSA and LDA, and further explores the clustering structure of data. The model is a generalization of LDA. If we let $\alpha_0 \to \infty$, the model becomes identical to LDA, since the sampled $G$ becomes identical to the finite Dirichlet base distribution $G_0$. This extreme case makes documents mutually independent given $G_0$, since $\theta_d$ are i.i.d. sampled from $G_0$. If $G_0$ itself is not sufficiently expressive, the model is not able to capture the dependency between documents. The Dirichlet enhancement elegantly solves this problem. With a moderate $\alpha_0$, the model allows $G$ to deviate away from $G_0$, giving modelling flexibility to explore the richer structure of data. The exchangeability may not exist within the whole collection, but between groups of documents with respective atoms $\theta_l^*$ sampled from $G_0$. On the other hand, the increased flexibility does not lead to overfitting, because inference and learning are done in a Bayesian setting, averaging over the number of mixture components and the states of the latent variables.

3 Inference and Learning

In this section we consider model inference and learning based on the $\mathrm{DP}_N$ model. As seen from Fig. 2(c), the inference needs to calculate the a posteriori joint distribution of latent variables $p(\pi, \theta^*, c, z \mid \mathcal{D}, \alpha_0, \lambda, \beta)$, which requires computing Eq. (6). This integral is however analytically infeasible. A straightforward Gibbs sampling method can be derived, but it turns out to be very slow and inapplicable to high dimensional data like text, since for each word we have to sample a latent variable $z$. Therefore in this section we suggest efficient variational inference.

3.1 Variational Inference

The idea of variational mean-field inference is to propose a joint distribution $Q(\pi, \theta^*, c, z)$ conditioned on some free parameters, and then enforce $Q$ to approximate the a posteriori distributions of interest by minimizing the KL-divergence $D_{\mathrm{KL}}(Q \,\|\, p(\pi, \theta^*, c, z \mid \mathcal{D}, \alpha_0, \lambda, \beta))$ with respect to those free parameters. We propose a variational distribution $Q$ over latent variables as the following

\[ Q(\pi, \theta^*, c, z \mid \eta, \gamma, \psi, \phi) = Q(\pi \mid \eta) \cdot \prod_{l=1}^{N} Q(\theta_l^* \mid \gamma_l) \prod_{d=1}^{D} Q(c_d \mid \psi_d) \prod_{d=1}^{D} \prod_{n=1}^{N_d} Q(z_{d,n} \mid \phi_{d,n}) \tag{7} \]

where $\eta, \gamma, \psi, \phi$ are variational parameters, each tailoring the variational a posteriori distribution to each latent variable. In particular, $\eta$ specifies an $N$-dimensional Dirichlet distribution for $\pi$, $\gamma_l$ specifies a $k$-dimensional Dirichlet distribution for distinct $\theta_l^*$, $\psi_d$ specifies an $N$-dimensional multinomial for the indicator $c_d$ of document $d$, and $\phi_{d,n}$ specifies a $k$-dimensional multinomial over latent topics for word $w_{d,n}$. It turns out that the minimization of the KL-divergence is equivalent to the maximization of a lower bound of $\ln p(\mathcal{D} \mid \alpha_0, \lambda, \beta)$ derived by applying Jensen's inequality [10]. Please see the Appendix for details of the derivation. The lower bound is then given as

\[ L_Q(\mathcal{D}) = \sum_{d=1}^{D} \sum_{n=1}^{N_d} E_Q[\ln p(w_{d,n} \mid z_{d,n}, \beta)\, p(z_{d,n} \mid \theta^*, c_d)] + E_Q[\ln p(\pi \mid \alpha_0)] + \sum_{d=1}^{D} E_Q[\ln p(c_d \mid \pi)] + \sum_{l=1}^{N} E_Q[\ln p(\theta_l^* \mid G_0)] - E_Q[\ln Q(\pi, \theta^*, c, z)]. \tag{8} \]

The optimum is found by setting the partial derivatives with respect to each variational parameter to be zero,
which gives rise to the following updates

\[ \phi_{d,n,i} \propto \beta_{i,w_{d,n}} \exp\Big\{ \sum_{l=1}^{N} \psi_{d,l} \Big[ \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \Big] \Big\} \tag{9} \]
\[ \psi_{d,l} \propto \exp\Big\{ \sum_{i=1}^{k} \Big[ \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \Big] \sum_{n=1}^{N_d} \phi_{d,n,i} + \Psi(\eta_l) - \Psi\Big(\sum_{j=1}^{N} \eta_j\Big) \Big\} \tag{10} \]
\[ \gamma_{l,i} = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \psi_{d,l}\, \phi_{d,n,i} + \lambda_i \tag{11} \]
\[ \eta_l = \sum_{d=1}^{D} \psi_{d,l} + \frac{\alpha_0}{N} \tag{12} \]

where $\Psi(\cdot)$ is the digamma function, the first derivative of the log Gamma function. Some details of the derivation of these formulas can be found in the Appendix. We find that the updates are quite interpretable. For example, in Eq. (9) $\phi_{d,n,i}$ is the a posteriori probability of latent topic $i$ given one word $w_{d,n}$. It is determined both by the corresponding entry in the $\beta$ matrix, which can be seen as a likelihood term, and by the possibility that document $d$ selects topic $i$, i.e., the prior term. Here the prior is itself a weighted average of the different $\theta_l^*$ to which $d$ is assigned. In Eq. (12) $\eta_l$ is the a posteriori weight of $\pi_l$, and turns out to be the tradeoff between empirical responses at $\theta_l^*$ and the prior specified by $\alpha_0$. Finally, since the parameters are coupled, the variational inference is done by iteratively performing Eq. (9) to Eq. (12) until convergence.

3.2 Parameter Estimation

Following the empirical Bayesian framework, we can estimate the hyperparameters $\alpha_0$, $\lambda$, and $\beta$ by iteratively maximizing the lower bound $L_Q$ both with respect to the variational parameters (as described by Eq. (9)-Eq. (12)) and the model parameters, holding the remaining parameters fixed. This iterative procedure is also referred to as variational EM [10]. It is easy to derive the update for $\beta$:

\[ \beta_{i,j} \propto \sum_{d=1}^{D} \sum_{n=1}^{N_d} \phi_{d,n,i}\, \delta_j(w_{d,n}) \tag{13} \]

where $\delta_j(w_{d,n}) = 1$ if $w_{d,n} = j$, and 0 otherwise. For the remaining parameters, let's first write down the parts of $L$ in Eq. (8) involving $\alpha_0$ and $\lambda$:

\[ L_{[\alpha_0]} = \ln \Gamma(\alpha_0) - N \ln \Gamma\Big(\frac{\alpha_0}{N}\Big) + \Big(\frac{\alpha_0}{N} - 1\Big) \sum_{l=1}^{N} \Big[ \Psi(\eta_l) - \Psi\Big(\sum_{j=1}^{N} \eta_j\Big) \Big], \]
\[ L_{[\lambda]} = \sum_{l=1}^{N} \Big\{ \ln \Gamma\Big(\sum_{i=1}^{k} \lambda_i\Big) - \sum_{i=1}^{k} \ln \Gamma(\lambda_i) + \sum_{i=1}^{k} (\lambda_i - 1) \Big[ \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \Big] \Big\}. \]

Estimates for $\alpha_0$ and $\lambda$ are found by maximization of these objective functions using standard methods like the Newton-Raphson method, as suggested in [3].

4 Empirical Study

4.1 Toy Data

We first apply the model on a toy problem with $k = 5$ latent topics and a dictionary containing 200 words. The assumed probabilities of generating words from topics, i.e. the parameters $\beta$, are illustrated in Fig. 3(d), in which each colored line corresponds to a topic and assigns non-zero probabilities to a subset of words. For each run we generate data with the following steps: (1) one cluster number $M$ is chosen between 5 and 12; (2) generate $M$ document clusters, each of which is defined by a combination of topics; (3) generate each document $d$, $d = 1, \ldots, 100$, by first randomly selecting a cluster and then generating 40 words according to the corresponding topic combinations. For $\mathrm{DP}_N$ we select $N = 100$ and we aim to examine the performance for discovering the latent topics and the document clustering structure.

In Fig. 3(a)-(c) we illustrate the process of clustering documents over EM iterations with a run containing 6 document clusters. In Fig. 3(a), we show the initial random assignment $\psi_{d,l}$ of each document $d$ to a cluster $l$. After one EM step documents begin to accumulate to a reduced number of clusters (Fig. 3(b)), and converge to exactly 6 clusters after 5 steps (Fig. 3(c)). The learned word distribution of topics $\beta$ is shown in Fig. 3(e) and is very similar to the true distribution.

Figure 3: Experimental results for the toy problem. (a)-(c) show the document-cluster assignments $\psi_{d,l}$ over the variational inference for a run with 6 document clusters: (a) Initial random assignments; (b) Assignments after one iteration; (c) Assignments after five iterations (final). The multinomial parameter matrix $\beta$ of true values and estimated values are given in (d) and (e), respectively. Each line gives the probabilities of generating the 200 words, with wave mountains for high probabilities. (f) shows the learned number of clusters with respect to the true number with mean and error bar.

By varying $M$, the true number of document clusters, we examine if our model can find the correct $M$. To determine the number of clusters, we run the variational inference and obtain for each document a weight vector $\psi_{d,l}$ of clusters. Then each document takes the cluster with largest weight as its assignment, and we calculate the cluster number as the number of non-empty clusters. For each setting of $M$ from 5 to 12, we randomize the data for 20 trials and obtain the curve in Fig. 3(f)
which shows the average performance and the variance. In 37% of the runs we get perfect results, and in another 43% of the runs the learned values deviate from the truth by only one. However, we also find that the model tends to get slightly fewer than $M$ clusters when $M$ is large. The reason might be that 100 documents are not sufficient for learning a large number $M$ of clusters.

4.2 Document Modelling

We compare the proposed model with PLSI and LDA on two text data sets. The first one is a subset of the Reuters-21578 data set which contains 3000 documents and 20334 words. The second one is taken from the 20-newsgroup data set and has 2000 documents with 8014 words. The comparison metric is perplexity, conventionally used in language modelling. For a test document set, it is formally defined as

\[ \mathrm{Perplexity}(\mathcal{D}_{\mathrm{test}}) = \exp\Big( - \ln p(\mathcal{D}_{\mathrm{test}}) \Big/ \sum_{d} |\mathbf{w}_d| \Big). \]

We follow the formula in [3] to calculate the perplexity for PLSI. In our algorithm $N$ is set to be the number of training documents. Fig. 4(a) and (b) show the comparison results with different numbers $k$ of latent topics. Our model outperforms LDA and PLSI in all the runs, which indicates that the flexibility introduced by DP enhancement does not produce overfitting and results in a better generalization performance.

4.3 Clustering

In our last experiment we demonstrate that our approach is suitable to find relevant document clusters. We select four categories, autos, motorcycles, baseball and hockey, from the 20-newsgroups data set with 446 documents in each topic. Fig. 4(c) illustrates one clustering result, in which we set topic number $k = 5$ and found 6 document clusters. In the figure the documents are indexed according to their true category labels, so we can clearly see that the result is quite meaningful. Documents from one category show similar membership to the learned clusters, and different categories can be distinguished very easily. The first two categories are not clearly separated because they are both talking about vehicles and share many terms, while the rest of the categories, baseball and hockey, are ideally detected.
(a) (b) (c)
Figure 4: (a) and (b): Perplexity results on Reuters-21578 and 20-newsgroups for DELSA, PLSI and LDA; (c):
Clustering result on 20-newsgroups dataset.
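
To make the perplexity metric of Section 4.2 concrete, a minimal Python sketch follows. It assumes that $\ln p(\mathcal{D}_{\mathrm{test}})$ is available as a sum of per-document log-likelihood values; how those values are approximated for a given model is not shown, and the function name `perplexity` and the toy numbers are hypothetical.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(- ln p(D_test) / sum_d |w_d|).

    Assumption: ln p(D_test) is passed in as per-document log-likelihoods
    (natural log), which are simply summed here.
    """
    total_log_likelihood = sum(doc_log_likelihoods)
    total_words = sum(doc_lengths)
    return math.exp(-total_log_likelihood / total_words)


# Sanity check with a hypothetical uniform model over a 200-word dictionary:
# every word has probability 1/200, so the perplexity equals the dictionary size.
vocab_size = 200
lengths = [3, 5]
log_likelihoods = [-n * math.log(vocab_size) for n in lengths]
uniform_perplexity = perplexity(log_likelihoods, lengths)
```

Lower values indicate that the model assigns higher probability per word to the held-out documents.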
5 Conclusions and Future Work

This paper proposes a Dirichlet enhanced latent semantic analysis model for analyzing co-occurrence data like text, which retains the strength of previous approaches to find latent topics, and further introduces additional modelling flexibilities to uncover the clustering structure of data. For inference and learning, we adopt a variational mean-field approximation based on a finite alternative of DP. Experiments are performed on a toy data set and two text data sets. The experiments show that our model can discover both the latent semantics and meaningful clustering structures.

In addition to our approach, alternative methods for approximate inference in DP have been proposed using expectation propagation (EP) [11] or variational methods [16, 2]. Our approach is most similar to the work of Blei and Jordan [2] who applied mean-field approximation for the inference in DP based on a truncated DP (TDP). Their approach was formulated in context of general exponential-family mixture models [2]. Conceptually, $\mathrm{DP}_N$ appears to be simpler than TDP in the sense that the a posteriori of $G$ is a symmetric Dirichlet while TDP ends up with a generalized Dirichlet (see [8]). In another sense, TDP seems to be a tighter approximation to DP. Future work will include a comparison of the various DP approximations.

Acknowledgements

The authors thank the anonymous reviewers for their valuable comments. Shipeng Yu gratefully acknowledges the support through a Siemens scholarship.

References

[1] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems (NIPS) 14, 2002.
[2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proceedings of the 21st International Conference on Machine Learning, 2004.
[3] D. M. Blei, A. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[4] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430), June 1995.
[5] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1:209–230, 1973.
[6] P. J. Green and S. Richardson. Modelling heterogeneity with and without the Dirichlet process. unpublished paper, 2000.
[7] T. Hofmann. Probabilistic Latent Semantic Indexing. In Proceedings of the 22nd Annual ACM SIGIR Conference, pages 50–57, Berkeley, California, August 1999.
[8] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173, 2001.
[9] H. Ishwaran and M. Zarepour. Exact and approximate sum-representations for the Dirichlet process. Can. J. Statist, 30:269–283, 2002.
[10] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[11] T. Minka and Z. Ghahramani. Expectation propagation for infinite mixtures. In NIPS'03 Workshop on Nonparametric Bayesian Methods and Infinite Models, 2003.
[12] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal
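of Computational and Graphical Statistics, 9:249–265, 2000.
[13] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of gaussian process experts. In Advances in Neural Information Processing Systems 14, 2002.
[14] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
[15] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Technical Report 653, Department of Statistics, University of California, Berkeley, 2004.
[16] V. Tresp and K. Yu. An introduction to nonparametric hierarchical bayesian modelling with a focus on multi-agent learning. In Proceedings of the Hamilton Summer School on Switching and Learning in Feedback Systems. Lecture Notes in Computing Science, 2004.
[17] K. Yu, V. Tresp, and S. Yu. A nonparametric hierarchical Bayesian framework for information filtering. In Proceedings of 27th Annual International ACM SIGIR Conference, 2004.

Appendix

To simplify the notation, we denote by $\Xi$ all the latent variables $\{\pi, \theta^*, c, z\}$. With the variational form Eq. (7), we apply Jensen's inequality to the likelihood Eq. (6) and obtain

\begin{align*}
\ln p(\mathcal{D} \mid \alpha_0, \lambda, \beta)
&= \ln \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} p(\mathcal{D}, \Xi \mid \alpha_0, \lambda, \beta)\, d\theta^*\, d\pi \\
&= \ln \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} Q(\Xi)\, \frac{p(\mathcal{D}, \Xi \mid \alpha_0, \lambda, \beta)}{Q(\Xi)}\, d\theta^*\, d\pi \\
&\geq \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} Q(\Xi) \ln p(\mathcal{D}, \Xi \mid \alpha_0, \lambda, \beta)\, d\theta^*\, d\pi - \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} Q(\Xi) \ln Q(\Xi)\, d\theta^*\, d\pi \\
&= E_Q[\ln p(\mathcal{D}, \Xi \mid \alpha_0, \lambda, \beta)] - E_Q[\ln Q(\Xi)],
\end{align*}

which results in Eq. (8).

To write out each term in Eq. (8) explicitly, we have, for the first term,

\[ \sum_{d=1}^{D} \sum_{n=1}^{N_d} E_Q[\ln p(w_{d,n} \mid z_{d,n}, \beta)] = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{i=1}^{k} \phi_{d,n,i} \ln \beta_{i,\nu}, \]

where $\nu$ is the index of word $w_{d,n}$.

The other terms can be derived as follows:

\begin{align*}
\sum_{d=1}^{D} \sum_{n=1}^{N_d} E_Q[\ln p(z_{d,n} \mid \theta^*, c_d)] &= \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{i=1}^{k} \sum_{l=1}^{N} \psi_{d,l}\, \phi_{d,n,i} \Big[ \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \Big], \\
E_Q[\ln p(\pi \mid \alpha_0)] &= \ln \Gamma(\alpha_0) - N \ln \Gamma\Big(\frac{\alpha_0}{N}\Big) + \Big(\frac{\alpha_0}{N} - 1\Big) \sum_{l=1}^{N} \Big[ \Psi(\eta_l) - \Psi\Big(\sum_{j=1}^{N} \eta_j\Big) \Big], \\
\sum_{d=1}^{D} E_Q[\ln p(c_d \mid \pi)] &= \sum_{d=1}^{D} \sum_{l=1}^{N} \psi_{d,l} \Big[ \Psi(\eta_l) - \Psi\Big(\sum_{j=1}^{N} \eta_j\Big) \Big], \\
\sum_{l=1}^{N} E_Q[\ln p(\theta_l^* \mid G_0)] &= \sum_{l=1}^{N} \Big\{ \ln \Gamma\Big(\sum_{i=1}^{k} \lambda_i\Big) - \sum_{i=1}^{k} \ln \Gamma(\lambda_i) + \sum_{i=1}^{k} (\lambda_i - 1) \Big[ \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \Big] \Big\}, \\
E_Q[\ln Q(\pi, \theta^*, c, z)] &= \ln \Gamma\Big(\sum_{l=1}^{N} \eta_l\Big) - \sum_{l=1}^{N} \ln \Gamma(\eta_l) + \sum_{l=1}^{N} (\eta_l - 1) \Big[ \Psi(\eta_l) - \Psi\Big(\sum_{j=1}^{N} \eta_j\Big) \Big] \\
&\quad + \sum_{l=1}^{N} \Big\{ \ln \Gamma\Big(\sum_{i=1}^{k} \gamma_{l,i}\Big) - \sum_{i=1}^{k} \ln \Gamma(\gamma_{l,i}) + \sum_{i=1}^{k} (\gamma_{l,i} - 1) \Big[ \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \Big] \Big\} \\
&\quad + \sum_{d=1}^{D} \sum_{l=1}^{N} \psi_{d,l} \ln \psi_{d,l} + \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{i=1}^{k} \phi_{d,n,i} \ln \phi_{d,n,i}.
\end{align*}

Differentiating the lower bound with respect to the different variational parameters gives the variational E-step in Eq. (9) to Eq. (12). The M-step can also be obtained by considering the lower bound with respect to $\beta$, $\lambda$ and $\alpha_0$.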
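
The variational E-step of Eq. (9) to Eq. (12) can be sketched in plain Python as one sweep over all documents. This is an illustrative sketch, not the authors' implementation: the function names, the list-based data layout, the order in which the four updates are applied within a sweep, the series approximation of the digamma function, and the toy inputs are all assumptions. It also assumes every entry of $\beta$ is strictly positive.

```python
import math


def digamma(x):
    """Approximate Psi(x) for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:  # shift the argument up using Psi(x) = Psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def expected_log(params):
    """E[ln x_i] under Dir(params): Psi(p_i) - Psi(sum_j p_j)."""
    total = digamma(sum(params))
    return [digamma(p) - total for p in params]


def softmax(log_weights):
    """Normalize exp(log_weights) to sum to one, shifting by the max for stability."""
    top = max(log_weights)
    weights = [math.exp(v - top) for v in log_weights]
    norm = sum(weights)
    return [w / norm for w in weights]


def e_step_sweep(docs, beta, lam, alpha0, psi, gamma, eta):
    """One pass of Eq. (9)-(12); docs[d] is a list of word indices."""
    n_clusters, k = len(gamma), len(lam)
    e_theta = [expected_log(g) for g in gamma]  # N x k, from current gamma
    e_pi = expected_log(eta)                    # length N, from current eta
    phi, new_psi, counts = [], [], []
    for d, words in enumerate(docs):
        # Eq. (9): topic responsibilities for every word of document d.
        prior = [sum(psi[d][l] * e_theta[l][i] for l in range(n_clusters))
                 for i in range(k)]
        phi_d = [softmax([math.log(beta[i][w]) + prior[i] for i in range(k)])
                 for w in words]
        # Eq. (10): cluster responsibilities for document d.
        c_d = [sum(p[i] for p in phi_d) for i in range(k)]
        scores = [sum(e_theta[l][i] * c_d[i] for i in range(k)) + e_pi[l]
                  for l in range(n_clusters)]
        phi.append(phi_d)
        new_psi.append(softmax(scores))
        counts.append(c_d)
    # Eq. (11) and Eq. (12): Dirichlet parameters for theta*_l and pi.
    new_gamma = [[lam[i] + sum(new_psi[d][l] * counts[d][i]
                               for d in range(len(docs)))
                  for i in range(k)] for l in range(n_clusters)]
    new_eta = [alpha0 / n_clusters + sum(row[l] for row in new_psi)
               for l in range(n_clusters)]
    return phi, new_psi, new_gamma, new_eta


# Hypothetical toy inputs: 2 documents, 4 words, k = 2 topics, N = 3 clusters.
docs = [[0, 1, 0], [2, 3, 3]]
beta = [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
lam, alpha0 = [1.0, 1.0], 1.0
psi = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
gamma = [[2.0, 1.0], [1.0, 1.0], [1.0, 2.0]]
eta = [1.0, 1.0, 1.0]
phi, psi, gamma, eta = e_step_sweep(docs, beta, lam, alpha0, psi, gamma, eta)
```

In practice the sweep would be repeated until the parameters stop changing, and alternated with the M-step updates for $\beta$, $\lambda$ and $\alpha_0$, which are not shown here.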