
An Introduction to Latent Dirichlet Allocation (LDA)




1
                            LDA
    • A generative probabilistic model for collections of
      discrete data such as text corpora.
    • It is a three-level hierarchical Bayesian model, in
      which each item of a collection is modeled as a finite
      mixture over an underlying set of topics.
    • Each topic is, in turn, modeled as an infinite mixture
      over an underlying set of topic probabilities.
    • It has natural advantages over the unigram model and
      the probabilistic LSI model.

2
                History (1) – text processing
    • IR – represent text as a vector of real numbers (Baeza-Yates and
      Ribeiro-Neto, 1999), e.g. tf-idf (Salton and McGill, 1983)
       – tf-idf shortcomings: (1) the representation is lengthy and (2) it reveals
         little inter- and intra-document statistical structure
    • LSI – dimension reduction (Deerwester et al., 1990)
       – Advantages: achieves significant compression of large collections
         and captures synonymy and polysemy.
    • Generative probabilistic models – proposed to study the abilities of LSI
      (Papadimitriou et al., 1998)
       – Why LSI? We could instead model the data directly using maximum
         likelihood or Bayesian methods.


3
          History (2) – text processing
• Probabilistic LSI (pLSI) – also known as the aspect model; a milestone
  (Hofmann, 1999).
    – p(w_i | θ_j), with d = {w_1, …, w_N} and θ = {θ_1, …, θ_k}. Each word is
      generated from a single topic model θ_j, and a document d is reduced to a
      list of mixing proportions over the mixture components θ (the mixing
      proportions for topics).
    – Disadvantage: no probabilistic model at the document level.
       • The number of parameters grows linearly with the size of the corpus.
       • It is not clear how to assign probability to a document outside the training
         collection (pLSI makes no assumption about how the mixture weights θ are
         generated, which makes it difficult to generalize the model to new documents).




4
                        Notation
    • A corpus D = {d_1, …, d_M}; a document d = {w_1, …, w_N};
      topics θ = {θ_1, …, θ_k}.
    • Equivalently, D = {w_1, …, w_M}, where a bold variable
      denotes a vector.
    • Suppose we have V distinct words in the whole data
      set.




5
                                   LDA
    • The basic idea: Documents are represented as
      random mixtures over latent topics, where each topic
      is characterized by a distribution over words.
    • For each document d, we generate as follows:

       1. Choose N ~ Poisson(ξ)
       2. Choose θ ~ Dir(α)                                   (there are k topics z)
       3. For each of the N words w_n:
           (a) Choose a topic z_n ~ Multinomial(θ)
           (b) Choose a word w_n from p(w_n | z_n, β), a multinomial
               probability conditioned on the topic z_n
       Here β is a k × V matrix with β_ij = p(w^j = 1 | z^i = 1).
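       A minimal Python sketch of this generative process (numpy only; the corpus
       dimensions, the Poisson mean, and the β used here are illustrative assumptions,
       not values from the slides):

           import numpy as np

           rng = np.random.default_rng(0)
           k, V, xi = 3, 10, 8                       # topics, vocabulary size, Poisson mean (assumed)
           alpha = np.full(k, 0.5)                   # symmetric Dirichlet prior (assumed)
           beta = rng.dirichlet(np.ones(V), size=k)  # k x V topic-word matrix, rows sum to 1

           def generate_document():
               N = rng.poisson(xi)                   # 1. document length
               theta = rng.dirichlet(alpha)          # 2. topic proportions for this document
               z = rng.choice(k, size=N, p=theta)    # 3a. a topic for each word
               w = np.array([rng.choice(V, p=beta[zn]) for zn in z])  # 3b. a word given its topic
               return theta, z, w

           theta, z, w = generate_document()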
6
            Dirichlet Random Variables θ
    • A k-dimensional Dirichlet random variable θ takes values in the
      (k-1)-simplex (a k-vector θ lies in the (k-1)-simplex if θ_i ≥ 0 and
      \sum_{i=1}^{k} \theta_i = 1), and its probability density is:

         p(\theta \mid \alpha) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}

      where α is a k-vector parameter with α_i > 0, and Γ(x) is the
      gamma function.
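      A quick numerical illustration of these constraints (the particular α is one of
      the example vectors shown on the Graphical Interpretation slide, otherwise arbitrary):

          import numpy as np

          rng = np.random.default_rng(0)
          alpha = np.array([6.0, 2.0, 2.0])       # example parameter vector
          samples = rng.dirichlet(alpha, size=5)  # five points in the 2-simplex
          print(samples.min())                     # all components are >= 0
          print(samples.sum(axis=1))               # each sample sums to 1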


7
                Multinomial Distribution
    •   Each trial can end in exactly one of k categories
    •   n independent trials
    •   Probability a trial results in category i is pi
    •   Yi is the number of trials resulting in category i
    •   p1+…+pk = 1
    •   Y1+…+Yk = n




8
    Multinomial Distribution
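     • With the notation above, the joint probability mass function takes the
       standard multinomial form:

          P(Y_1 = y_1, \ldots, Y_k = y_k) = \frac{n!}{y_1! \cdots y_k!} \, p_1^{y_1} \cdots p_k^{y_k},
          \qquad \sum_{i=1}^{k} y_i = n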




9
     Conjugation
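     • The standard conjugacy fact: the Dirichlet is the conjugate prior of the
       multinomial. If θ ~ Dir(α) and the counts (Y_1, …, Y_k) are multinomial
       with parameter θ, then the posterior is again a Dirichlet:

          p(\theta \mid y, \alpha) \propto \prod_{i=1}^{k} \theta_i^{\alpha_i + y_i - 1},
          \qquad \text{i.e.} \quad \theta \mid y \sim \mathrm{Dir}(\alpha_1 + y_1, \ldots, \alpha_k + y_k)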




10
     Entropy




11
                     Joint Distribution
 • Given the parameters α and β, the joint distribution
   of a topic mixture θ, a set of N topics z, and a set of
   N words w is given by:

      p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)

     where p(z_n | θ) is simply θ_i for the unique i such that
     z_n^i = 1. Integrating over θ and summing over z, we
     obtain the marginal distribution of a document:

      p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \Big( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \Big) d\theta        (a)
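     A small numerical sketch of the joint density above for a single document
     (toy shapes and the function name are illustrative assumptions):

         import numpy as np
         from scipy.stats import dirichlet

         def log_joint(theta, z, w, alpha, beta):
             # log p(theta, z, w | alpha, beta) for one document
             # theta: (k,) topic proportions; z: (N,) topic indices; w: (N,) word indices
             # alpha: (k,) Dirichlet parameter; beta: (k, V), rows are topic-word distributions
             lp = dirichlet.logpdf(theta, alpha)   # log p(theta | alpha)
             lp += np.log(theta[z]).sum()          # sum_n log p(z_n | theta)
             lp += np.log(beta[z, w]).sum()        # sum_n log p(w_n | z_n, beta)
             return lp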
12
                Joint Distribution (cont.)
 • Finally, taking the product of the marginal
   probabilities of single documents, we can obtain the
   probability of a corpus:
      p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \Big( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \Big) d\theta_d

      [Graphical model of LDA: α → θ → z → w, with β → w; the inner plate
      repeats over the N words of a document, the outer plate over the M
      documents of the corpus.]

      A similar model can be referred to as a hierarchical model (Gelman et al., 1995),
      or more precisely as a conditionally independent hierarchical model (Kass
      and Steffey, 1989).

13
       Relationships with Other Latent
                   Models
                                                          N
                  w                          p (w )   p( wn )
                       N                                 n 1
                           M




                                                                 N

              z        w                   p ( w )   p ( z ) p ( wn | z )
                           N                         z          n 1
                               M




          d        z           w           p(d , wn )  p(d ) p(wn | z) p( z | d )
                                   N                            z
                                       M


 Problems: (1) p(z|d) only model the documents in the training data set and cannot for the
 unseen document; and (2) the parameter kV+kM grows linearly in M and thus overfitting.
14
     Graphical Interpretation
       The probability density of the Dirichlet
       distribution when k = 3, for various parameter
       vectors α. Clockwise from top left: α = (6, 2, 2),
       (3, 7, 5), (6, 2, 6), (2, 3, 4).

          p(\theta \mid \alpha) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1},
          \qquad \theta_i \ge 0, \; \sum_{i=1}^{k} \theta_i = 1




15
       Graphical Interpretation (cont.)




 • The Dirichlet prior on the topic-word distributions can be interpreted
   as forces on the topic locations, with higher β moving the topic
   locations away from the corners of the simplex.
16
                    Matrix Interpretation




 • In the topic model, the word-document co-occurrence matrix is split into two parts:
   a topic matrix Φ and a document matrix Θ. Note that the diagonal matrix D in LSA
   can be absorbed in the matrix U or V, making the similarity between the two
   representations even clearer.
17
     Inference and Parameter Estimation
 • The key inferential problem is that of computing the
   posterior distribution of the hidden variables given a
   document:
      p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

 • However, this distribution is intractable to compute in
   general. To normalize the above distribution, we have to
   marginalize over the hidden variables and write Equation (a)
   in terms of the model parameters:

      p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \int \Big( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \Big) \Big( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \Big) d\theta

18
     Inference and Parameter Estimation
                   (cont.)
                        ( i αi )  k αi 1   N k V              
        p ( w | α, β)               θi      (θiβij )
                                                              wnj
                                                                     dθ
                        i (αi )  i 1   n 1 i 1 j 1
                                               
                                                                    
                                                                    

 • This function is intractable due to the coupling between θ
   and β in the summation over latent topics (Dickey, 1983).
 • Rather than performing intractable exact inference, we can use
   approximate inference algorithms, e.g., Laplace approximation,
   variational approximation, and Markov chain Monte Carlo
   (Jordan, 1999).


20
                  Variational Inference
 • Here, we introduce a simple convexity-based variational
   algorithm for inference in LDA.
 • The basic idea here is to make use of Jensen’s inequality to
   obtain an adjustable lower bound on the log likelihood
   (Jordan, 1999).
                             p ( , z , w |  ,  )q( , z )
 log p(w | α, β)  log                                     d
                         z              q ( , z )
                  q ( , z )log p ( , z , w |  ,  )d    q ( , z )log q ( , z )d
                    z                                             z


 • A simple way to obtain a tractable family of lower bounds is
   to consider simple modifications of the original graphical
   model in which some of the edges and nodes are removed.
21
         Variational Inference (cont.)
 • Hence, by dropping the edges between θ, z, and w, as well as the w
   nodes, and endowing the resulting simplified graphical model with
   free variational parameters, we obtain a family of distributions on
   the latent variables:

      q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

 • where the Dirichlet parameter γ and the multinomial
   parameters (Φ1, …, ΦN) are the free variational parameters.
      [Left: the LDA graphical model (α, β, θ, z, w with plates over the N words and
      M documents). Right: the simplified variational model, in which γ governs θ
      and Φ_n governs z_n.]
22
      How to Determine the Variational
                Parameters
 • We can set up an optimization problem to determine the
   values of the variational parameters γ and Φ.
 • We define the objective as minimizing the Kullback-Leibler (KL)
   divergence between the variational distribution and the true
   posterior p(θ, z | w, α, β):

      (\gamma^*, \phi^*) = \arg\min_{(\gamma, \phi)} D(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta))      (5)

 • This minimization can be achieved by an iterative fixed-
   point method.



23
                    Variational Inference
 • We now discuss how to set the parameters γ and Φ via an
   optimization procedure.
 • Following Jordan et al. (1999), we have a lower bound of
   the log likelihood of a document using Jensen’s inequality:

 \log p(\mathbf{w} \mid \alpha, \beta) = \log \int \sum_{\mathbf{z}} p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \, d\theta
              = \log \int \sum_{\mathbf{z}} \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \, q(\theta, \mathbf{z})}{q(\theta, \mathbf{z})} \, d\theta
              \ge \int \sum_{\mathbf{z}} q(\theta, \mathbf{z}) \log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \, d\theta - \int \sum_{\mathbf{z}} q(\theta, \mathbf{z}) \log q(\theta, \mathbf{z}) \, d\theta
              = E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z})]
24
        Variational Inference (cont.)
 • Then from the above formula, we see that Jensen's
   inequality provides us with a lower bound on the log
   likelihood for an arbitrary variational distribution
   q(θ, z | γ, Φ).
 • It can be easily verified that the difference between
   the left-hand side and the right-hand side of the
   above equation is the KL divergence between the
   variational posterior probability and the true
   posterior probability.


25
           Variational Inference (cont.)
 • That is, letting L(γ, Φ; α, β) denote the right-hand
   side of the above equation, we have:

      \log p(\mathbf{w} \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + D(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta))

 • This means that maximizing the lower bound L(γ, Φ; α, β)
   w.r.t. γ and Φ is equivalent to minimizing the KL
   divergence between the variational posterior probability
   and the true posterior probability, i.e., the optimization
   problem in Equation (5).

26
           Variational Inference (cont.)
 • We can expand the above equation:

      L(\gamma, \phi; \alpha, \beta) = E_q[\log p(\theta \mid \alpha)] + E_q[\log p(\mathbf{z} \mid \theta)] + E_q[\log p(\mathbf{w} \mid \mathbf{z}, \beta)]
                                       - E_q[\log q(\theta)] - E_q[\log q(\mathbf{z})]

 • Expanding each term, we have:

      L(\gamma, \phi; \alpha, \beta) = \log \Gamma\big(\sum_{j=1}^{k} \alpha_j\big) - \sum_{i=1}^{k} \log \Gamma(\alpha_i) + \sum_{i=1}^{k} (\alpha_i - 1)\big(\Psi(\gamma_i) - \Psi(\sum_{j=1}^{k} \gamma_j)\big)
         + \sum_{n=1}^{N} \sum_{i=1}^{k} \phi_{ni} \big(\Psi(\gamma_i) - \Psi(\sum_{j=1}^{k} \gamma_j)\big)
         + \sum_{n=1}^{N} \sum_{i=1}^{k} \sum_{j=1}^{V} \phi_{ni} \, w_n^j \log \beta_{ij}
         - \log \Gamma\big(\sum_{j=1}^{k} \gamma_j\big) + \sum_{i=1}^{k} \log \Gamma(\gamma_i) - \sum_{i=1}^{k} (\gamma_i - 1)\big(\Psi(\gamma_i) - \Psi(\sum_{j=1}^{k} \gamma_j)\big)
         - \sum_{n=1}^{N} \sum_{i=1}^{k} \phi_{ni} \log \phi_{ni}                        (15)
27
     Entropy




28
                    Variational Multinomial
  • We first maximize Eq. (15) w.r.t. Φni, the probability that
    the n-th word is generated by latent topic i.

  • We form the Lagrangian by isolating the terms that contain Φ_ni and
    adding the appropriate Lagrange multipliers. Let β_iv denote
    p(w_n^v = 1 | z^i = 1) for the appropriate v (recall that each w_n is a
    vector of size V with exactly one component equal to one; we can select
    the unique v such that w_n^v = 1):

      L_{[\phi_{ni}]} = \phi_{ni} \big(\Psi(\gamma_i) - \Psi(\sum_{j=1}^{k} \gamma_j)\big) + \phi_{ni} \log \beta_{iv} - \phi_{ni} \log \phi_{ni} + \lambda_n \big(\sum_{j=1}^{k} \phi_{nj} - 1\big)




29
       Variational Multinomial (cont.)
 • Taking derivatives w.r.t. Φ_ni, we obtain:

      \frac{\partial L}{\partial \phi_{ni}} = \Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{k} \gamma_j\big) + \log \beta_{iv} - \log \phi_{ni} - 1 + \lambda

 • Setting this to zero yields the maximizing value of
   the variational parameter Φ_ni:

      \phi_{ni} \propto \beta_{iv} \exp\big(\Psi(\gamma_i) - \Psi(\sum_{j=1}^{k} \gamma_j)\big)




30
                           Variational Dirichlet
 • Next we maximize Equation (15) w.r.t. γ_i, the i-th
   component of the posterior Dirichlet parameter. The
   terms containing γ are:

      L_{[\gamma]} = \sum_{i=1}^{k} (\alpha_i - 1)\big(\Psi(\gamma_i) - \Psi(\sum_{j=1}^{k} \gamma_j)\big) + \sum_{n=1}^{N} \phi_{ni} \big(\Psi(\gamma_i) - \Psi(\sum_{j=1}^{k} \gamma_j)\big)
             - \log \Gamma\big(\sum_{j=1}^{k} \gamma_j\big) + \log \Gamma(\gamma_i) - \sum_{i=1}^{k} (\gamma_i - 1)\big(\Psi(\gamma_i) - \Psi(\sum_{j=1}^{k} \gamma_j)\big)

 • Simplifying:

      L_{[\gamma]} = \sum_{i=1}^{k} \big(\alpha_i + \sum_{n=1}^{N} \phi_{ni} - \gamma_i\big)\big(\Psi(\gamma_i) - \Psi(\sum_{j=1}^{k} \gamma_j)\big) - \log \Gamma\big(\sum_{j=1}^{k} \gamma_j\big) + \sum_{i=1}^{k} \log \Gamma(\gamma_i)


31
              Variational Dirichlet (cont.)
 • Taking the derivative w.r.t. γi:

      \frac{\partial L}{\partial \gamma_i} = \Psi'(\gamma_i)\big(\alpha_i + \sum_{n=1}^{N} \phi_{ni} - \gamma_i\big) - \Psi'\big(\sum_{j=1}^{k} \gamma_j\big) \sum_{j=1}^{k} \big(\alpha_j + \sum_{n=1}^{N} \phi_{nj} - \gamma_j\big)

 • Setting this equation to zero yields a maximum at:

      \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}




32
       Solve the Optimization Problem
 • Differentiating the KL divergence and setting the derivatives
   equal to zero, we obtain the following update equations:

      \phi_{ni} \propto \beta_{i w_n} \exp\{ E_q[\log(\theta_i) \mid \gamma] \}                 (6)

      \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}                                            (7)

     where the expectation in the multinomial update can
     be computed as follows:

      E_q[\log(\theta_i) \mid \gamma] = \Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{k} \gamma_j\big)  (8)

     where Ψ is the first derivative of the log Γ function,
     which is computable via Taylor approximations
     (Abramowitz and Stegun, 1970).
33
                 Computing E[log(θi|α)]
 • Recall that a distribution is in the exponential family
   if it can be written in the form:
                          p(x \mid \eta) = h(x) \exp\{ \eta^{\mathsf{T}} T(x) - A(\eta) \}
   where η is the natural parameter, T(x) is the
   sufficient statistic, and A(η) is the log of the
   normalization factor.
 • We can write the Dirichlet in this form by
   exponentiating the log of its density p(θ | α):

      p(\theta \mid \alpha) = \exp\Big\{ \sum_{i=1}^{k} (\alpha_i - 1) \log \theta_i + \log \Gamma\big(\sum_{i=1}^{k} \alpha_i\big) - \sum_{i=1}^{k} \log \Gamma(\alpha_i) \Big\}




34
         Computing E[log(θi|α)] (cont.)
 • From this form we see that the natural parameter of
   the Dirichlet is ηi=αi-1 and the sufficient statistic is
   T(θi)=logθi. Moreover, based on the general fact that
   the derivative of the log normalization factor w.r.t.
   the natural parameter is equal to the expectation of
   the sufficient statistic, we obtain:
                  E[\log(\theta_i) \mid \alpha] = \Psi(\alpha_i) - \Psi\big(\sum_{j=1}^{k} \alpha_j\big)


     where ψ is the digamma function, the first derivative
     of the log Gamma function.

35
           Variational Inference Algorithm
      (1) initialize φ⁰_ni := 1/k for all i and n
      (2) initialize γ_i := α_i + N/k for all i
      (3) repeat
      (4)    for n = 1 to N
      (5)       for i = 1 to k
      (6)          φ^{t+1}_ni := β_{i w_n} exp(Ψ(γ^t_i))
      (7)       normalize φ^{t+1}_n to sum to 1
      (8)    γ^{t+1} := α + Σ_{n=1}^{N} φ^{t+1}_n
      (9) until convergence

      Each iteration requires O((N+1)k) operations. For a single document,
      the number of iterations is on the order of the number of words in it.
      Thus the total number of operations is roughly on the order of N²k.
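      A compact Python sketch of this loop for a single document (the function
      name and the vectorized form are my own; w, alpha, and beta are assumed
      inputs in the shapes noted in the comments):

          import numpy as np
          from scipy.special import psi   # the digamma function Psi

          def variational_inference(w, alpha, beta, tol=1e-6, max_iter=100):
              # w: (N,) word indices of one document; alpha: (k,); beta: (k, V) topic-word probabilities
              k = alpha.shape[0]
              N = w.shape[0]
              phi = np.full((N, k), 1.0 / k)                # step (1)
              gamma = alpha + N / k                         # step (2)
              for _ in range(max_iter):                     # step (3)
                  phi = beta[:, w].T * np.exp(psi(gamma))   # steps (4)-(6), vectorized over n and i
                  phi /= phi.sum(axis=1, keepdims=True)     # step (7): normalize each phi_n
                  new_gamma = alpha + phi.sum(axis=0)       # step (8)
                  if np.abs(new_gamma - gamma).sum() < tol: # step (9): convergence check
                      gamma = new_gamma
                      break
                  gamma = new_gamma
              return gamma, phi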

36
              Parameter Estimation
 • We can use an empirical Bayes method for parameter
   estimation. In particular, we wish to find parameters α and β
   that maximize the marginal log likelihood:

      \ell(\alpha, \beta) = \sum_{d=1}^{M} \log p(\mathbf{w}_d \mid \alpha, \beta)

 • The quantity p(w | α, β) cannot be computed tractably, but the
   variational inference described above provides a tractable lower
   bound on it. Thus we can find approximate empirical Bayes estimates
   for the LDA model via an alternating variational EM procedure that
   maximizes the lower bound w.r.t. the variational parameters γ and Φ,
   and then, for fixed values of the variational parameters, maximizes
   the lower bound w.r.t. the model parameters α and β.
37
                         Variational EM
  1. (E-step) For each document, find the optimizing values of the
     variational parameters {γ*_d, Φ*_d : d ∈ D}. This is done as described in
     the previous section.
  2. (M-step) Maximize the resulting lower bound on the log likelihood
     w.r.t. the model parameters α and β. This corresponds to finding
     maximum likelihood estimates with expected sufficient statistics for
     each document under the approximate posterior computed in the
     E-step. The update for the conditional multinomial parameter β can
     be written out analytically:

      \beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi^*_{dni} \, w_{dn}^j              (9)

      The update for α can be implemented using an efficient Newton-
      Raphson method. These two steps are repeated until the lower bound converges.
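      A sketch of the M-step update (9) in the same Python style as the E-step sketch
      above (docs and phis are assumed inputs: a list of per-document word-index arrays
      and the corresponding Φ* arrays from the E-step):

          import numpy as np

          def m_step_beta(docs, phis, k, V):
              # docs: list of (N_d,) word-index arrays; phis: list of (N_d, k) variational multinomials
              beta = np.zeros((k, V))
              for w_d, phi_d in zip(docs, phis):
                  for n, v in enumerate(w_d):
                      beta[:, v] += phi_d[n]               # accumulate expected counts phi*_dni * w_dn^j
              beta /= beta.sum(axis=1, keepdims=True)      # normalize each row into a distribution over words
              return beta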

38
            Parameter Estimation
 • We now consider how to obtain empirical Bayes
   estimates of the model parameters α and β.
 • We solve this problem by using the variational
   lower bound as a surrogate for the marginal log
   likelihood, with the variational parameters Φ and γ
   fixed to the values found by variational inference.
 • We then obtain empirical Bayes estimates by
   maximizing this lower bound w.r.t. the model
   parameters.

39
        Parameter Estimation (cont.)
 • Recall our approach for finding the empirical Bayes
   estimates is based on a variational EM procedure.
 • In the variational E-step, we maximize the bound
   L(γ, Φ; α, β) w.r.t. the variational parameters γ and Φ. In
   the M-step, which we describe in this section, we
   maximize the bound w.r.t. the model parameters α
   and β. The overall procedure can thus be viewed as
   coordinate ascent in L.



40
          Conditional Multinomials
 • To maximize w.r.t. β, we isolate the β terms and add
   Lagrange multipliers:

      L_{[\beta]} = \sum_{d=1}^{M} \sum_{n=1}^{N_d} \sum_{i=1}^{k} \sum_{j=1}^{V} \phi_{dni} \, w_{dn}^j \log \beta_{ij} + \sum_{i=1}^{k} \lambda_i \big(\sum_{j=1}^{V} \beta_{ij} - 1\big)

 • Taking the derivative w.r.t. β_ij and setting it to zero, we
   have:

      \beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni} \, w_{dn}^j




41
                                      Dirichlet
 • First, the terms of the lower bound that contain α are:

      L_{[\alpha]} = \sum_{d=1}^{M} \Big( \log \Gamma\big(\sum_{j=1}^{k} \alpha_j\big) - \sum_{i=1}^{k} \log \Gamma(\alpha_i) + \sum_{i=1}^{k} (\alpha_i - 1)\big(\Psi(\gamma_{di}) - \Psi(\sum_{j=1}^{k} \gamma_{dj})\big) \Big)

 • Taking the derivative w.r.t. α_i, we obtain:

      \frac{\partial L}{\partial \alpha_i} = M \big( \Psi(\sum_{j=1}^{k} \alpha_j) - \Psi(\alpha_i) \big) + \sum_{d=1}^{M} \big( \Psi(\gamma_{di}) - \Psi(\sum_{j=1}^{k} \gamma_{dj}) \big)

 • This derivative depends on α_j, where j ≠ i, and we therefore
   must use an iterative method to find the maximizing α. In
   particular, the Hessian has the form found in Equation (10):

      \frac{\partial^2 L}{\partial \alpha_i \, \partial \alpha_j} = \delta(i, j)\, M\, \Psi'(\alpha_i) - \Psi'\big(\sum_{j=1}^{k} \alpha_j\big)
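 A sketch of the linear-time Newton step behind the "efficient Newton-Raphson
 method" mentioned on the Variational EM slide, exploiting the fact that the
 Hessian is a diagonal matrix plus a rank-one term. The gradient and Hessian
 below use the signs obtained by differentiating L[α] directly; gammas is an
 assumed (M, k) array of fitted per-document variational Dirichlet parameters,
 and the loop omits safeguards (e.g. keeping α positive), so treat it as an
 illustrative sketch rather than the exact procedure from the slides:

     import numpy as np
     from scipy.special import psi, polygamma

     def update_alpha(alpha, gammas, n_iter=20):
         # alpha: (k,) current Dirichlet parameter; gammas: (M, k) variational Dirichlet parameters
         M = gammas.shape[0]
         # sufficient statistic: sum_d (Psi(gamma_di) - Psi(sum_j gamma_dj)), independent of alpha
         ss = (psi(gammas) - psi(gammas.sum(axis=1, keepdims=True))).sum(axis=0)
         for _ in range(n_iter):
             g = M * (psi(alpha.sum()) - psi(alpha)) + ss     # gradient of L[alpha]
             h = -M * polygamma(1, alpha)                     # diagonal part of the Hessian
             z = M * polygamma(1, alpha.sum())                # rank-one part: H = diag(h) + z*1*1^T
             c = (g / h).sum() / (1.0 / z + (1.0 / h).sum())  # Sherman-Morrison, O(k) per step
             alpha = alpha - (g - c) / h                      # Newton step: alpha <- alpha - H^{-1} g
         return alpha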
42
                 Dirichlet (cont.)
 • Finally, we can use the same algorithm to find an
   empirical Bayes point estimate of η, the scalar
   parameter of the exchangeable Dirichlet in the
   smoothed LDA model described on the next slide.




43
                       Smoothing
 • Simple Laplace smoothing is no longer justified as a
   maximum a posteriori method in the LDA setting.
 • We can then assume that each row of the k × V matrix β is
   independently drawn from an exchangeable Dirichlet distribution;
   that is, we treat the β_i as random variables endowed with a
   posterior distribution, conditioned on the data.

      [Graphical model of smoothed LDA: a scalar parameter η generates the k rows
      β_i via an exchangeable Dirichlet, alongside the usual α → θ → z → w structure
      with plates over the N words and M documents.]




44
                       Smoothing Model
 • Thus we obtain a variational approach to Bayesian
   inference:
      q(\beta_{1:k}, \theta_{1:M}, \mathbf{z}_{1:M} \mid \eta, \gamma, \phi) = \prod_{i=1}^{k} \mathrm{Dir}(\beta_i \mid \eta_i) \prod_{d=1}^{M} q_d(\theta_d, \mathbf{z}_d \mid \phi_d, \gamma_d)

     where q_d(θ_d, z_d | Φ_d, γ_d) is the variational distribution
     defined for LDA as above, and the update for the
     new variational parameter η is as follows:

      \eta_{ij} = \eta + \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi^*_{dni} \, w_{dn}^j



45
     Applications




46
             Document Modeling
 • Perplexity is used to indicate the generalization
   performance of a method.
 • Specifically, we estimate a document model on training data
   and use it to describe a new (held-out) data set:

      \mathrm{perplexity}(D_{\mathrm{test}}) = \exp\Big\{ - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \Big\}

 • LDA outperforms the other models, including pLSI, the
   smoothed unigram model, and the smoothed mixture of unigrams.
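 A minimal sketch of this computation, assuming log_p holds the per-document
 log likelihoods log p(w_d) (in practice approximated, e.g. by the variational
 lower bound) and lengths the corresponding document lengths N_d:

     import numpy as np

     def perplexity(log_p, lengths):
         # log_p: (M,) log p(w_d) for the test documents; lengths: (M,) word counts N_d
         return np.exp(-np.sum(log_p) / np.sum(lengths))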


47
          Document Classification
 • We can use the per-document topic proportions produced by LDA
   as features for classifiers. With, say, 50 topics, this reduces
   the feature space by 99.6%.
 • The experimental results show that this feature reduction
   decreases accuracy only slightly.
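 A sketch of this setup using scikit-learn; the estimator choices, the toy
 data, and the variable names are illustrative assumptions rather than the
 experiment reported in the slides:

     import numpy as np
     from sklearn.decomposition import LatentDirichletAllocation
     from sklearn.pipeline import make_pipeline
     from sklearn.svm import SVC

     rng = np.random.default_rng(0)
     X_counts = rng.poisson(1.0, size=(100, 500))     # toy document-term count matrix (assumed)
     y = rng.integers(0, 2, size=100)                 # toy binary class labels (assumed)

     clf = make_pipeline(
         LatentDirichletAllocation(n_components=50),  # 50 topic features replace the 500 word counts
         SVC(kernel="linear"),
     )
     clf.fit(X_counts, y)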




48
                  Collaborative Filtering
 • We can learn a model on a fully observed set of users. Then
   for each unobserved user, we are shown all but one of the
   movies preferred by that user and are asked to predict what
   the held-out movie is.
 • Precisely, define the predictive perplexity on M test users as:

      \mathrm{predictive\ perplexity}(D_{\mathrm{test}}) = \exp\Big\{ - \frac{\sum_{d=1}^{M} \log p(w_{d, N_d} \mid \mathbf{w}_{d, 1:N_d - 1})}{M} \Big\}

49
     Other Applications




50
                  The Naïve Bayes model


      [Graphical model of Naïve Bayes: the class c generates the N (visual) words w.]

      c^* = \arg\max_c p(c \mid \mathbf{w}) \propto p(c)\, p(\mathbf{w} \mid c) = p(c) \prod_{n=1}^{N} p(w_n \mid c)

      (object class decision  =  prior probability of the object classes  ×  image likelihood given the class)

 51                                                                 Csurka et al. 2004
52   Csurka et al. 2004
         Hierarchical Bayesian text models

Probabilistic Latent Semantic Analysis (pLSA)


      [Graphical model of pLSA: d → z → w, with a plate over the N words and a plate
      over the D documents. Example topic discovered from images: "face".]

 53                                              Sivic et al. ICCV 2005
         Hierarchical Bayesian text models



  [Example topic discovered from images: "beach".]

 Latent Dirichlet Allocation (LDA)

      [Graphical model of the LDA variant used for scene categories: c → z → w, with a
      plate over the N patches per image and a plate over the D images.]

 54                                  Fei-Fei et al. ICCV 2005
                     Summary
 • LDA is based on the exchangeability assumption.
 • It can be viewed as a dimensionality reduction
   technique.
 • Exact inference is intractable, so we use approximate
   inference instead.
 • It has applications to other kinds of collections, for
   example images and captions.




55
      End of the Talk!




56

								