Dirichlet Enhanced Latent Semantic Analysis




Kai Yu, Siemens Corporate Technology, D-81730 Munich, Germany, Kai.Yu@siemens.com
Shipeng Yu, Institute for Computer Science, University of Munich, D-80538 Munich, Germany, spyu@dbs.informatik.uni-muenchen.de
Volker Tresp, Siemens Corporate Technology, D-81730 Munich, Germany, Volker.Tresp@siemens.com


Abstract

This paper describes nonparametric Bayesian treatments for analyzing records containing occurrences of items. The introduced model retains the strength of previous approaches that explore the latent factors of each record (e.g. topics of documents), and further uncovers the clustering structure of records, which reflects the statistical dependencies of the latent factors. The nonparametric model induced by a Dirichlet process (DP) flexibly adapts model complexity to reveal the clustering structure of the data. To avoid the problems of dealing with infinite dimensions, we further replace the DP prior by a simpler alternative, namely Dirichlet-multinomial allocation (DMA), which maintains the main modelling properties of the DP. Instead of relying on Markov chain Monte Carlo (MCMC) for inference, this paper applies efficient variational inference based on DMA. The proposed approach yields encouraging empirical results on both a toy problem and text data. The results show that the proposed algorithm uncovers not only the latent factors, but also the clustering structure.

1   Introduction

We consider the problem of modelling a large corpus of high-dimensional discrete records. Our assumption is that a record can be modelled by latent factors which account for the co-occurrence of items in the record. To ground the discussion, in the following we will identify records with documents, latent factors with (latent) topics, and items with words. Probabilistic latent semantic indexing (PLSI) [7] was one of the first approaches that provided a probabilistic treatment of text documents as being composed of latent topics. Latent Dirichlet allocation (LDA) [3] generalizes PLSI by treating the topic mixture parameters (i.e. a multinomial over topics) as variables drawn from a Dirichlet distribution. Its Bayesian treatment avoids overfitting and the model generalizes to new data (the latter is problematic for PLSI). However, the parametric Dirichlet distribution can be a limitation in applications which exhibit a richer structure. As an illustration, consider Fig. 1(a), which shows the empirical distribution of three topics. We see that the probability that all three topics are present in a document (corresponding to the center of the plot) is near zero. In contrast, a Dirichlet distribution fitted to the data (Fig. 1(b)) would predict the highest probability density for exactly that case. The reason is the limited expressiveness of a simple Dirichlet distribution.

This paper employs a more general nonparametric Bayesian approach to explore not only latent topics and their probabilities, but also complex dependencies between latent topics which might, for example, be expressed as a complex clustering structure. The key innovation is to replace the parametric Dirichlet prior distribution in LDA by a flexible nonparametric distribution G(·) that is a sample generated from a Dirichlet process (DP) or its finite approximation, Dirichlet-multinomial allocation (DMA). The Dirichlet distribution of LDA becomes the base distribution of the Dirichlet process. In this Dirichlet enhanced model, the posterior distribution of the topic mixture for a new document converges to a flexible mixture model in which both mixture weights and mixture parameters can be learned from the data. Thus the a posteriori distribution is able to represent the distribution of topics more truthfully. After convergence of the learning procedure, typically only a few components with non-negligible weights remain; thus the model is able to naturally output clusters of documents.

Nonparametric Bayesian modelling has attracted considerable attention from the learning community
(e.g. [1, 13, 2, 15, 17, 16]). A potential problem with this class of models is that inference typically relies on MCMC approximations, which might be prohibitively slow for the large collections of documents in our setting. Instead, we tackle the problem by a less expensive variational mean-field inference based on the DMA model. The resulting updates turn out to be quite interpretable. Finally, we observed very good empirical performance of the proposed algorithm on both toy data and text documents, especially in the latter case, where meaningful clusters are discovered.

This paper is organized as follows. The next section introduces Dirichlet enhanced latent semantic analysis. In Section 3 we present inference and learning algorithms based on a variational approximation. Section 4 presents experimental results using a toy data set and two document data sets. In Section 5 we present conclusions.

Figure 1: Consider a 2-dimensional simplex representing 3 topics (recall that the probabilities have to sum to one): (a) the probability distribution of topics in documents, which forms a ring-like distribution; dark color indicates low density. (b) The 3-dimensional Dirichlet distribution that maximizes the likelihood of the samples.

2   Dirichlet Enhanced Latent Semantic Analysis

Following the notation in [3], we consider a corpus D containing D documents. Each document d is a sequence of N_d words denoted by w_d = {w_{d,1}, . . . , w_{d,N_d}}, where w_{d,n} is a variable for the n-th word in w_d and denotes the index of the corresponding word in a vocabulary V. Note that the same word may occur several times in the sequence w_d.

2.1   The Proposed Model

We assume that each document is a mixture of k latent topics and that the words in each document are generated by repeatedly sampling topics and words using the distributions

    w_{d,n} \mid z_{d,n}; \beta \sim \mathrm{Mult}(z_{d,n}, \beta),        (1)
    z_{d,n} \mid \theta_d \sim \mathrm{Mult}(\theta_d).                    (2)

Each word w_{d,n} is generated given its latent topic z_{d,n}, which takes values in {1, . . . , k}. β is a k × |V| multinomial parameter matrix with \sum_j \beta_{i,j} = 1, where β_{z,w_{d,n}} specifies the probability of generating word w_{d,n} given topic z. θ_d denotes the parameters of a multinomial distribution of document d over topics for w_d, satisfying θ_{d,i} ≥ 0 and \sum_{i=1}^{k} \theta_{d,i} = 1.

In the LDA model, θ_d is generated from a k-dimensional Dirichlet distribution G_0(θ) = Dir(θ|λ) with parameter λ ∈ R^{k×1}. In our Dirichlet enhanced model, we assume that θ_d is generated from a distribution G(θ), which itself is a random sample generated from a Dirichlet process (DP) [5],

    G \mid G_0, \alpha_0 \sim \mathrm{DP}(G_0, \alpha_0),                  (3)

where the nonnegative scalar α_0 is the precision parameter and G_0(θ) is the base distribution, which here is identical to the Dirichlet distribution. It turns out that a distribution G(θ) sampled from a DP can be written as

    G(\cdot) = \sum_{l=1}^{\infty} \pi_l \, \delta_{\theta_l^*}(\cdot),    (4)

where π_l ≥ 0, \sum_{l=1}^{\infty} \pi_l = 1, δ_θ(·) is the point mass distribution concentrated at θ, and the θ_l* are countably infinite variables i.i.d. sampled from G_0 [14]. The probability weights π_l depend solely on α_0 via a stick-breaking process, which is defined in the next subsection. The generative model, summarized by Fig. 2(a), is conditioned on (k × |V| + k + 1) parameters, i.e. β, λ and α_0.

Finally, the likelihood of the collection D is given by

    L_{\mathrm{DP}}(D \mid \alpha_0, \lambda, \beta) = \int_G p(G; \alpha_0, \lambda) \prod_{d=1}^{D} \int_{\theta_d} p(\theta_d \mid G) \prod_{n=1}^{N_d} \sum_{z_{d,n}=1}^{k} p(w_{d,n} \mid z_{d,n}; \beta) \, p(z_{d,n} \mid \theta_d) \, d\theta_d \, dG.    (5)

In short, G is sampled once for the whole corpus D, θ_d is sampled once for each document d, and the topic z_{d,n} is sampled once for the n-th word w_{d,n} in d.
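To make this generative view concrete, the following minimal Python sketch (illustrative only; the variable names and the finite list of atoms are assumptions, anticipating the finite approximation of Sec. 2.3) samples one document according to Eqs. (1)-(2), with θ_d drawn from a discrete G of the form in Eq. (4).

import numpy as np

rng = np.random.default_rng(0)

def sample_document(pi, theta_atoms, beta, n_words):
    # Sample one document under Eqs. (1)-(2), with theta_d drawn from a
    # discrete G = sum_l pi_l * delta_{theta_l*} as in Eq. (4).
    l = rng.choice(len(pi), p=pi)                 # pick an atom of G with probability pi_l
    theta_d = theta_atoms[l]                      # the document's topic mixture theta_d
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta_d), p=theta_d)   # Eq. (2): topic z_{d,n} ~ Mult(theta_d)
        w = rng.choice(beta.shape[1], p=beta[z])  # Eq. (1): word w_{d,n} ~ Mult(beta_z)
        words.append(w)
    return l, words

# Hypothetical toy configuration: k topics, |V| words, N atoms for a finite G.
k, V, N, alpha0 = 3, 20, 5, 1.0
lam = np.full(k, 0.5)                             # base distribution G_0 = Dir(lambda)
beta = rng.dirichlet(np.ones(V), size=k)          # k x |V| topic-word probabilities
theta_atoms = rng.dirichlet(lam, size=N)          # theta_l* sampled i.i.d. from G_0
pi = rng.dirichlet(np.full(N, alpha0 / N))        # weights of G (cf. the DMA prior, Sec. 2.3)
cluster, words = sample_document(pi, theta_atoms, beta, n_words=40)

Documents drawn from the same atom share the same topic mixture, which is exactly the clustering effect exploited later in the paper.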
Figure 2: Plate models for latent semantic analysis. (a) Latent semantic analysis with DP prior; (b) an equivalent representation, where c_d is the indicator variable saying which cluster document d takes on out of the infinite clusters induced by the DP; (c) latent semantic analysis with a finite approximation of the DP (see Sec. 2.3).

2.2   Stick Breaking and Dirichlet Enhancing

The representation of a sample from the DP prior in Eq. (4) is generated by the stick-breaking process, in which an infinite number of pairs (π_l, θ_l*) are generated. The θ_l* are sampled independently from G_0, and π_l is defined as

    \pi_1 = B_1, \qquad \pi_l = B_l \prod_{j=1}^{l-1} (1 - B_j),

where the B_l are i.i.d. samples from the Beta distribution Beta(1, α_0). Thus, with a small α_0, the first "sticks" π_l will be large, with little left for the remaining sticks. Conversely, if α_0 is large, the first sticks and all subsequent sticks will be small and the π_l will be more evenly distributed. In conclusion, the base distribution determines the locations of the point masses and α_0 determines the distribution of the probability weights. The distribution is nonzero at an infinite number of discrete points. If α_0 is selected to be small, the amplitudes of only a small number of discrete points will be significant. Note that both locations and weights are not fixed but take on new values each time a new sample of G is generated. Since E(G) = G_0, the prior initially corresponds to the prior used in LDA. With many documents in the training data set, locations θ_l* which agree with the data will obtain a large weight. If a small α_0 is chosen, parameters will form clusters, whereas a large α_0 leads to many representative parameters. Thus Dirichlet enhancement serves two purposes: it increases the flexibility in representing the posterior distribution of mixing weights, and it encourages a clustered solution leading to insights into the document corpus.

The DP prior offers two advantages over usual document clustering methods. First, there is no need to specify the number of clusters. The finally resulting clustering structure is constrained by the DP prior, but also adapted to the empirical observations. Second, the number of clusters is not fixed. Although the parameter α_0 is a control parameter that tunes the tendency for forming clusters, the DP prior allows the creation of new clusters if the current model cannot explain upcoming data very well, which is particularly suitable for our setting, where the dictionary is fixed while the number of documents can grow.

By applying the stick-breaking representation, our model obtains the equivalent representation in Fig. 2(b). An infinite number of θ_l* are generated from the base distribution, and the new indicator variable c_d indicates which θ_l* is assigned to document d. If more than one document is assigned to the same θ_l*, clustering occurs. π = {π_1, . . . , π_∞} is a vector of probability weights generated from the stick-breaking process.

2.3   Dirichlet-Multinomial Allocation (DMA)

Since an infinite number of pairs (π_l, θ_l*) are generated in the stick-breaking process, it is usually very difficult to deal with the unknown distribution G. For inference there exist Markov chain Monte Carlo (MCMC) methods, such as Gibbs samplers which directly sample θ_d using the Pólya urn scheme and thus avoid the difficulty of sampling the infinite-dimensional G [4]; in practice, the sampling procedure is very slow and thus impractical for high-dimensional data like text. In Bayesian statistics, the Dirichlet-multinomial allocation DP_N in [6] has often been applied as a finite approximation to the DP (see [6, 9]), which takes the form

    G_N = \sum_{l=1}^{N} \pi_l \, \delta_{\theta_l^*},

where π = {π_1, . . . , π_N} is an N-vector of probability weights sampled once from a Dirichlet prior Dir(α_0/N, . . . , α_0/N), and the θ_l*, l = 1, . . . , N, are i.i.d. sampled from the base distribution G_0. It has been shown that the limiting case of DP_N is the DP [6, 9, 12]; more importantly, DP_N demonstrates similar stick-breaking properties and leads to a similar clustering effect [6]. If N is sufficiently large with respect to our sample size D, DP_N gives a good approximation to the DP.
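As an illustration of the two constructions (not taken from the paper; names and values are hypothetical), the sketch below draws the weights π once by a truncated stick-breaking process (Sec. 2.2) and once from the DMA prior Dir(α_0/N, . . . , α_0/N); with a small α_0 most of the mass concentrates on a few atoms in both cases.

import numpy as np

rng = np.random.default_rng(1)

def stick_breaking_weights(alpha0, truncation):
    # Truncated stick-breaking construction of the weights pi_l (Sec. 2.2):
    # B_l ~ Beta(1, alpha0), pi_l = B_l * prod_{j<l} (1 - B_j).
    B = rng.beta(1.0, alpha0, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - B)[:-1]))
    pi = B * remaining
    pi[-1] = 1.0 - pi[:-1].sum()      # fold the leftover stick mass into the last weight
    return pi

def dma_weights(alpha0, N):
    # DMA (DP_N) weights: pi ~ Dir(alpha0/N, ..., alpha0/N), sampled once (Sec. 2.3).
    return rng.dirichlet(np.full(N, alpha0 / N))

# Both constructions yield a discrete G = sum_l pi_l * delta_{theta_l*},
# with atoms theta_l* drawn i.i.d. from the base distribution G_0 = Dir(lambda).
k, N, alpha0 = 3, 50, 1.0
theta_atoms = rng.dirichlet(np.full(k, 0.5), size=N)
pi_sb = stick_breaking_weights(alpha0, truncation=N)
pi_dma = dma_weights(alpha0, N)
print(pi_sb[:5].round(3), pi_dma[:5].round(3))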
Under the DP_N model, the plate representation of our model is illustrated in Fig. 2(c). The likelihood of the whole collection D is

    L_{\mathrm{DP}_N}(D \mid \alpha_0, \lambda, \beta) = \int_{\pi} \int_{\theta^*} \prod_{d=1}^{D} \sum_{c_d=1}^{N} p(w_d \mid \theta^*, c_d; \beta) \, p(c_d \mid \pi) \, dP(\theta^*; G_0) \, dP(\pi; \alpha_0),    (6)

where c_d is the indicator variable saying which unique value θ_l* document d takes on. The likelihood of document d is therefore written as

    p(w_d \mid \theta^*, c_d; \beta) = \prod_{n=1}^{N_d} \sum_{z_{d,n}=1}^{k} p(w_{d,n} \mid z_{d,n}; \beta) \, p(z_{d,n} \mid \theta^*_{c_d}).

2.4   Connections to PLSA and LDA

From the application point of view, PLSA and LDA both aim to discover the latent dimensions of data with an emphasis on indexing. The proposed Dirichlet enhanced semantic analysis retains the strengths of PLSA and LDA, and further explores the clustering structure of the data. The model is a generalization of LDA. If we let α_0 → ∞, the model becomes identical to LDA, since the sampled G becomes identical to the finite Dirichlet base distribution G_0. This extreme case makes documents mutually independent given G_0, since the θ_d are i.i.d. sampled from G_0. If G_0 itself is not sufficiently expressive, the model is not able to capture the dependency between documents. The Dirichlet enhancement elegantly solves this problem. With a moderate α_0, the model allows G to deviate from G_0, giving the modelling flexibility to explore the richer structure of the data. Exchangeability may not exist within the whole collection, but between groups of documents with respective atoms θ_l* sampled from G_0. On the other hand, the increased flexibility does not lead to overfitting, because inference and learning are done in a Bayesian setting, averaging over the number of mixture components and the states of the latent variables.

3   Inference and Learning

In this section we consider model inference and learning based on the DP_N model. As seen from Fig. 2(c), inference needs to calculate the a posteriori joint distribution of latent variables p(π, θ*, c, z | D, α_0, λ, β), which requires computing Eq. (6). This integral is, however, analytically infeasible. A straightforward Gibbs sampling method can be derived, but it turns out to be very slow and inapplicable to high-dimensional data like text, since for each word we have to sample a latent variable z. Therefore, in this section we suggest efficient variational inference.

3.1   Variational Inference

The idea of variational mean-field inference is to propose a joint distribution Q(π, θ*, c, z) conditioned on some free parameters, and then to enforce Q to approximate the a posteriori distributions of interest by minimizing the KL-divergence D_{KL}(Q \| p(π, θ*, c, z | D, α_0, λ, β)) with respect to those free parameters. We propose a variational distribution Q over the latent variables of the following form:

    Q(\pi, \theta^*, c, z \mid \eta, \gamma, \psi, \phi) = Q(\pi \mid \eta) \prod_{l=1}^{N} Q(\theta_l^* \mid \gamma_l) \prod_{d=1}^{D} Q(c_d \mid \psi_d) \prod_{d=1}^{D} \prod_{n=1}^{N_d} Q(z_{d,n} \mid \phi_{d,n}),    (7)

where η, γ, ψ, φ are variational parameters, each tailoring the variational a posteriori distribution to the corresponding latent variable. In particular, η specifies an N-dimensional Dirichlet distribution for π, γ_l specifies a k-dimensional Dirichlet distribution for the distinct θ_l*, ψ_d specifies an N-dimensional multinomial for the indicator c_d of document d, and φ_{d,n} specifies a k-dimensional multinomial over latent topics for word w_{d,n}. It turns out that the minimization of the KL-divergence is equivalent to the maximization of a lower bound on ln p(D | α_0, λ, β), derived by applying Jensen's inequality [10]. Please see the Appendix for details of the derivation. The lower bound is given as

    L_Q(D) = \sum_{d=1}^{D} \sum_{n=1}^{N_d} E_Q[\ln p(w_{d,n} \mid z_{d,n}, \beta) \, p(z_{d,n} \mid \theta^*, c_d)] + E_Q[\ln p(\pi \mid \alpha_0)] + \sum_{d=1}^{D} E_Q[\ln p(c_d \mid \pi)] + \sum_{l=1}^{N} E_Q[\ln p(\theta_l^* \mid G_0)] - E_Q[\ln Q(\pi, \theta^*, c, z)].    (8)

The optimum is found by setting the partial derivatives with respect to each variational parameter to zero,
which gives rise to the following updates:

    \phi_{d,n,i} \propto \beta_{i,w_{d,n}} \exp\Big( \sum_{l=1}^{N} \psi_{d,l} \Big[ \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \Big] \Big),    (9)

    \psi_{d,l} \propto \exp\Big( \sum_{i=1}^{k} \Big[ \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \Big] \sum_{n=1}^{N_d} \phi_{d,n,i} + \Psi(\eta_l) - \Psi\Big(\sum_{j=1}^{N} \eta_j\Big) \Big),    (10)

    \gamma_{l,i} = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \psi_{d,l} \, \phi_{d,n,i} + \lambda_i,    (11)

    \eta_l = \sum_{d=1}^{D} \psi_{d,l} + \frac{\alpha_0}{N},    (12)

where Ψ(·) is the digamma function, the first derivative of the log Gamma function. Some details of the derivation of these formulas can be found in the Appendix. We find that the updates are quite interpretable. For example, in Eq. (9) φ_{d,n,i} is the a posteriori probability of latent topic i given one word w_{d,n}. It is determined both by the corresponding entry in the β matrix, which can be seen as a likelihood term, and by the probability that document d selects topic i, i.e., the prior term. Here the prior is itself a weighted average of the different θ_l* to which d is assigned. In Eq. (12) η_l is the a posteriori weight of π_l, and turns out to be a tradeoff between the empirical responses at θ_l* and the prior specified by α_0. Finally, since the parameters are coupled, variational inference is done by iteratively performing Eq. (9) to Eq. (12) until convergence.

3.2   Parameter Estimation

Following the empirical Bayesian framework, we can estimate the hyperparameters α_0, λ, and β by iteratively maximizing the lower bound L_Q both with respect to the variational parameters (as described by Eq. (9)-Eq. (12)) and with respect to the model parameters, holding the remaining parameters fixed. This iterative procedure is also referred to as variational EM [10]. It is easy to derive the update for β:

    \beta_{i,j} \propto \sum_{d=1}^{D} \sum_{n=1}^{N_d} \phi_{d,n,i} \, \delta_j(w_{d,n}),    (13)

where δ_j(w_{d,n}) = 1 if w_{d,n} = j, and 0 otherwise. For the remaining parameters, we first write down the parts of L_Q in Eq. (8) involving α_0 and λ:

    L_{[\alpha_0]} = \ln \Gamma(\alpha_0) - N \ln \Gamma\Big(\frac{\alpha_0}{N}\Big) + \Big(\frac{\alpha_0}{N} - 1\Big) \sum_{l=1}^{N} \Big[ \Psi(\eta_l) - \Psi\Big(\sum_{j=1}^{N} \eta_j\Big) \Big],

    L_{[\lambda]} = \sum_{l=1}^{N} \Big[ \ln \Gamma\Big(\sum_{i=1}^{k} \lambda_i\Big) - \sum_{i=1}^{k} \ln \Gamma(\lambda_i) + \sum_{i=1}^{k} (\lambda_i - 1) \Big( \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \Big) \Big].

Estimates for α_0 and λ are found by maximizing these objective functions using standard methods such as Newton-Raphson, as suggested in [3].

4   Empirical Study

4.1   Toy Data

We first apply the model to a toy problem with k = 5 latent topics and a dictionary containing 200 words. The assumed probabilities of generating words from topics, i.e. the parameters β, are illustrated in Fig. 3(d), in which each colored line corresponds to a topic and assigns non-zero probabilities to a subset of words. For each run we generate data with the following steps: (1) a cluster number M is chosen between 5 and 12; (2) M document clusters are generated, each of which is defined by a combination of topics; (3) each document d, d = 1, . . . , 100, is generated by first randomly selecting a cluster and then generating 40 words according to the corresponding topic combination. For DP_N we select N = 100, and we aim to examine the performance in discovering the latent topics and the document clustering structure.

In Fig. 3(a)-(c) we illustrate the process of clustering documents over the EM iterations for a run containing 6 document clusters. In Fig. 3(a), we show the initial random assignment ψ_{d,l} of each document d to a cluster l. After one EM step documents begin to accumulate in a reduced number of clusters (Fig. 3(b)), and converge to exactly 6 clusters after 5 steps (Fig. 3(c)). The learned word distribution of topics β is shown in Fig. 3(e) and is very similar to the true distribution.

By varying M, the true number of document clusters, we examine whether our model can find the correct M. To determine the number of clusters, we run the variational inference and obtain for each document a weight vector ψ_{d,l} over clusters. Each document then takes the cluster with the largest weight as its assignment, and we calculate the cluster number as the number of non-empty clusters. For each setting of M from 5 to 12, we randomize the data for 20 trials and obtain the curve in Fig. 3(f), which shows the average performance and the variance.
Figure 3: Experimental results for the toy problem. (a)-(c) show the document-cluster assignments ψ_{d,l} over the variational inference for a run with 6 document clusters: (a) initial random assignments; (b) assignments after one iteration; (c) assignments after five iterations (final). The multinomial parameter matrix β of true values and estimated values is given in (d) and (e), respectively; each line gives the probabilities of generating the 200 words, with peaks indicating high probabilities. (f) shows the learned number of clusters with respect to the true number, with mean and error bar.

In 37% of the runs we obtain perfect results, and in another 43% of the runs the learned values deviate from the truth by only one. However, we also find that the model tends to find slightly fewer than M clusters when M is large. The reason might be that 100 documents are not sufficient for learning a large number M of clusters.

4.2   Document Modelling

We compare the proposed model with PLSI and LDA on two text data sets. The first one is a subset of the Reuters-21578 data set which contains 3000 documents and 20334 words. The second one is taken from the 20-newsgroups data set and has 2000 documents with 8014 words. The comparison metric is perplexity, conventionally used in language modelling. For a test document set, it is formally defined as

    \mathrm{Perplexity}(D_{\mathrm{test}}) = \exp\Big( - \ln p(D_{\mathrm{test}}) \Big/ \sum_{d} |w_d| \Big).

We follow the formula in [3] to calculate the perplexity for PLSI. In our algorithm N is set to the number of training documents. Fig. 4(a) and (b) show the comparison results for different numbers k of latent topics. Our model outperforms LDA and PLSI in all the runs, which indicates that the flexibility introduced by the DP enhancement does not produce overfitting and results in better generalization performance.

4.3   Clustering

In our last experiment we demonstrate that our approach is suitable for finding relevant document clusters. We select four categories, autos, motorcycles, baseball and hockey, from the 20-newsgroups data set, with 446 documents in each topic. Fig. 4(c) illustrates one clustering result, in which we set the topic number k = 5 and found 6 document clusters. In the figure the documents are indexed according to their true category labels, so we can clearly see that the result is quite meaningful. Documents from one category show similar memberships in the learned clusters, and different categories can be distinguished very easily. The first two categories are not clearly separated because they both talk about vehicles and share many terms, while the remaining categories, baseball and hockey, are detected very well.
Figure 4: (a) and (b): perplexity results on Reuters-21578 and 20-newsgroups for DELSA, PLSI and LDA; (c): clustering result on the 20-newsgroups data set.
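For reference, the perplexity criterion of Sec. 4.2 can be computed from per-document test log-likelihoods as in the following sketch (illustrative only; the log-likelihood values shown are hypothetical and would in practice come from the fitted model).

import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    # Perplexity of a test set as defined in Sec. 4.2:
    # exp( - sum_d ln p(w_d) / sum_d |w_d| ).
    return float(np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths)))

# Hypothetical per-document test log-likelihoods and document lengths.
loglik = np.array([-310.2, -295.7, -402.9])
lengths = np.array([100, 95, 130])
print(perplexity(loglik, lengths))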


5   Conclusions and Future Work

This paper proposes a Dirichlet enhanced latent semantic analysis model for analyzing co-occurrence data like text, which retains the strength of previous approaches in finding latent topics, and further introduces additional modelling flexibility to uncover the clustering structure of the data. For inference and learning, we adopt a variational mean-field approximation based on a finite alternative to the DP. Experiments are performed on a toy data set and two text data sets. The experiments show that our model can discover both the latent semantics and meaningful clustering structures.

In addition to our approach, alternative methods for approximate inference in DP-based models have been proposed using expectation propagation (EP) [11] or variational methods [16, 2]. Our approach is most similar to the work of Blei and Jordan [2], who applied a mean-field approximation for inference in the DP based on a truncated DP (TDP). Their approach was formulated in the context of general exponential-family mixture models [2]. Conceptually, DP_N appears to be simpler than the TDP in the sense that the a posteriori distribution of G is a symmetric Dirichlet, while the TDP ends up with a generalized Dirichlet (see [8]). On the other hand, the TDP seems to be a tighter approximation to the DP. Future work will include a comparison of the various DP approximations.

Acknowledgements

The authors thank the anonymous reviewers for their valuable comments. Shipeng Yu gratefully acknowledges support through a Siemens scholarship.

References

 [1] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems (NIPS) 14, 2002.

 [2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proceedings of the 21st International Conference on Machine Learning, 2004.

 [3] D. M. Blei, A. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

 [4] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430), June 1995.

 [5] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1:209-230, 1973.

 [6] P. J. Green and S. Richardson. Modelling heterogeneity with and without the Dirichlet process. Unpublished paper, 2000.

 [7] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual ACM SIGIR Conference, pages 50-57, Berkeley, California, August 1999.

 [8] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161-173, 2001.

 [9] H. Ishwaran and M. Zarepour. Exact and approximate sum-representations for the Dirichlet process. Canadian Journal of Statistics, 30:269-283, 2002.

[10] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.

[11] T. Minka and Z. Ghahramani. Expectation propagation for infinite mixtures. In NIPS'03 Workshop on Nonparametric Bayesian Methods and Infinite Models, 2003.

[12] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249-265, 2000.
[13] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14, 2002.

[14] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650, 1994.

[15] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Technical Report 653, Department of Statistics, University of California, Berkeley, 2004.

[16] V. Tresp and K. Yu. An introduction to nonparametric hierarchical Bayesian modelling with a focus on multi-agent learning. In Proceedings of the Hamilton Summer School on Switching and Learning in Feedback Systems, Lecture Notes in Computing Science, 2004.

[17] K. Yu, V. Tresp, and S. Yu. A nonparametric hierarchical Bayesian framework for information filtering. In Proceedings of the 27th Annual International ACM SIGIR Conference, 2004.

Appendix

To simplify the notation, we write Ξ for all the latent variables {π, θ*, c, z}. With the variational form in Eq. (7), we apply Jensen's inequality to the likelihood in Eq. (6) and obtain

    \ln p(D \mid \alpha_0, \lambda, \beta)
        = \ln \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} p(D, \Xi \mid \alpha_0, \lambda, \beta) \, d\theta^* \, d\pi
        = \ln \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} Q(\Xi) \frac{p(D, \Xi \mid \alpha_0, \lambda, \beta)}{Q(\Xi)} \, d\theta^* \, d\pi
        \geq \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} Q(\Xi) \ln p(D, \Xi \mid \alpha_0, \lambda, \beta) \, d\theta^* \, d\pi - \int_{\pi} \int_{\theta^*} \sum_{c} \sum_{z} Q(\Xi) \ln Q(\Xi) \, d\theta^* \, d\pi
        = E_Q[\ln p(D, \Xi \mid \alpha_0, \lambda, \beta)] - E_Q[\ln Q(\Xi)],

which results in Eq. (8).

To write out each term in Eq. (8) explicitly, we have, for the first term,

    \sum_{d=1}^{D} \sum_{n=1}^{N_d} E_Q[\ln p(w_{d,n} \mid z_{d,n}, \beta)] = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{i=1}^{k} \phi_{d,n,i} \ln \beta_{i,\nu},

where ν is the index of word w_{d,n}. The other terms can be derived as follows:

    \sum_{d=1}^{D} \sum_{n=1}^{N_d} E_Q[\ln p(z_{d,n} \mid \theta^*, c_d)] = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{i=1}^{k} \sum_{l=1}^{N} \psi_{d,l} \, \phi_{d,n,i} \Big[ \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \Big],

    E_Q[\ln p(\pi \mid \alpha_0)] = \ln \Gamma(\alpha_0) - N \ln \Gamma\Big(\frac{\alpha_0}{N}\Big) + \Big(\frac{\alpha_0}{N} - 1\Big) \sum_{l=1}^{N} \Big[ \Psi(\eta_l) - \Psi\Big(\sum_{j=1}^{N} \eta_j\Big) \Big],

    \sum_{d=1}^{D} E_Q[\ln p(c_d \mid \pi)] = \sum_{d=1}^{D} \sum_{l=1}^{N} \psi_{d,l} \Big[ \Psi(\eta_l) - \Psi\Big(\sum_{j=1}^{N} \eta_j\Big) \Big],

    \sum_{l=1}^{N} E_Q[\ln p(\theta_l^* \mid G_0)] = \sum_{l=1}^{N} \Big[ \ln \Gamma\Big(\sum_{i=1}^{k} \lambda_i\Big) - \sum_{i=1}^{k} \ln \Gamma(\lambda_i) + \sum_{i=1}^{k} (\lambda_i - 1) \Big( \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \Big) \Big],

    E_Q[\ln Q(\pi, \theta^*, c, z)] = \ln \Gamma\Big(\sum_{l=1}^{N} \eta_l\Big) - \sum_{l=1}^{N} \ln \Gamma(\eta_l) + \sum_{l=1}^{N} (\eta_l - 1) \Big[ \Psi(\eta_l) - \Psi\Big(\sum_{j=1}^{N} \eta_j\Big) \Big]
        + \sum_{l=1}^{N} \Big[ \ln \Gamma\Big(\sum_{i=1}^{k} \gamma_{l,i}\Big) - \sum_{i=1}^{k} \ln \Gamma(\gamma_{l,i}) + \sum_{i=1}^{k} (\gamma_{l,i} - 1) \Big( \Psi(\gamma_{l,i}) - \Psi\Big(\sum_{j=1}^{k} \gamma_{l,j}\Big) \Big) \Big]
        + \sum_{d=1}^{D} \sum_{l=1}^{N} \psi_{d,l} \ln \psi_{d,l} + \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{i=1}^{k} \phi_{d,n,i} \ln \phi_{d,n,i}.

Differentiating the lower bound with respect to the different latent variables gives the variational E-step in Eq. (9) to Eq. (12). The M-step can likewise be obtained by considering the lower bound with respect to β, λ and α_0.
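To summarize the procedure, the sketch below (illustrative only, not the authors' implementation) runs the variational EM cycle on integer-coded documents: the E-step iterates Eqs. (9)-(12) and the M-step updates β via Eq. (13); the Newton-Raphson updates for α_0 and λ from Sec. 3.2 are omitted and both are held fixed.

import numpy as np
from scipy.special import digamma

def variational_em(docs, V, k, N, alpha0=1.0, iters=50, seed=0):
    # Sketch of the variational EM of Sec. 3 for the DP_N model.
    # docs: list of 1-D integer arrays of word indices; V: vocabulary size.
    # lambda and alpha0 are held fixed here (their updates are omitted).
    rng = np.random.default_rng(seed)
    D = len(docs)
    lam = np.ones(k)                              # base-distribution parameter lambda
    beta = rng.dirichlet(np.ones(V), size=k)      # topic-word parameters, k x |V|
    psi = rng.dirichlet(np.ones(N), size=D)       # Q(c_d | psi_d), D x N
    gamma = np.ones((N, k))                       # Q(theta_l* | gamma_l), N x k
    eta = np.ones(N)                              # Q(pi | eta), N

    for _ in range(iters):
        # Expectations of log theta_l* and log pi_l under the current Q.
        E_log_theta = digamma(gamma) - digamma(gamma.sum(1, keepdims=True))  # N x k
        E_log_pi = digamma(eta) - digamma(eta.sum())                         # N

        gamma = np.tile(lam, (N, 1))              # prior part of Eq. (11)
        eta = np.full(N, alpha0 / N)              # prior part of Eq. (12)
        beta_stats = np.zeros((k, V))             # sufficient statistics for Eq. (13)
        for d, w in enumerate(docs):
            # Eq. (9): word-level topic responsibilities phi_{d,n,i}, N_d x k.
            log_phi = np.log(beta[:, w].T) + psi[d] @ E_log_theta
            phi = np.exp(log_phi - log_phi.max(1, keepdims=True))
            phi /= phi.sum(1, keepdims=True)
            # Eq. (10): cluster responsibilities psi_{d,l}.
            log_psi = E_log_theta @ phi.sum(0) + E_log_pi
            psi[d] = np.exp(log_psi - log_psi.max())
            psi[d] /= psi[d].sum()
            # Accumulate Eq. (11), Eq. (12) and the statistics for Eq. (13).
            gamma += np.outer(psi[d], phi.sum(0))
            eta += psi[d]
            np.add.at(beta_stats.T, w, phi)       # beta_stats[i, j] += phi_{d,n,i} * delta_j(w_{d,n})
        beta = beta_stats + 1e-12                 # smooth to avoid log(0)
        beta /= beta.sum(1, keepdims=True)        # M-step, Eq. (13)
    return beta, psi, gamma, eta

After convergence, psi contains the soft cluster assignments used in Sec. 4.1 and beta the learned topic-word distributions.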