Markov Random Topic Fields

Document Sample
Markov Random Topic Fields Powered By Docstoc

Hal Daumé III
School of Computing
University of Utah
Salt Lake City, UT 84112
me@hal3.name


Abstract

Most approaches to topic modeling assume an independence between documents that is frequently violated. We present a topic model that makes use of one or more user-specified graphs describing relationships between documents. These graphs are encoded in the form of a Markov random field over topics and serve to encourage related documents to have similar topic structures. Experiments show upwards of a 10% improvement in modeling performance.

1 Introduction

One often wishes to apply topic models to large document collections. In these large collections, we usually have meta-information about how one document relates to another. Perhaps two documents share an author; perhaps one document cites another; perhaps two documents are published in the same journal or conference. We often believe that documents related in such a way should have similar topical structures. We encode this in a probabilistic fashion by imposing an (undirected) Markov random field (MRF) on top of a standard topic model (see Section 3). The edge potentials in the MRF encode the fact that “connected” documents should share similar topic structures, measured by some parameterized distance function. Inference in the resulting model is complicated by the addition of edge potentials in the MRF. We demonstrate that a hybrid Gibbs/Metropolis-Hastings sampler is able to efficiently explore the posterior distribution (see Section 4).

In experiments (Section 5), we explore several variations on our basic model. The first is to explore the importance of being able to tune the strength of the potentials in the MRF as part of the inference procedure. This turns out to be of utmost importance. The second is to study the importance of the form of the distance metric used to specify the edge potentials. Again, this has a significant impact on performance. Finally, we consider the use of multiple graphs for a single model and find that the power of combined graphs also leads to significantly better models.

2 Background

Probabilistic topic models propose that text can be considered as a mixture of words drawn from one or more “topics” (Deerwester et al., 1990; Blei et al., 2003). The model we build on is latent Dirichlet allocation (Blei et al., 2003) (henceforth, LDA). LDA stipulates the following generative model for a document collection:

  1. For each document d = 1 . . . D:
     (a) Choose a topic mixture θ_d ∼ Dir(α)
     (b) For each word in d, n = 1 . . . N_d:
         i. Choose a topic z_dn ∼ Mult(θ_d)
         ii. Choose a word w_dn ∼ Mult(β_{z_dn})

Here, α is a hyperparameter vector of length K, where K is the desired number of topics. Each document has a topic distribution θ_d over these K topics and each word is associated with precisely one topic (indicated by z_dn). Each topic k = 1 . . . K is a unigram distribution over words (aka, a multinomial) parameterized by a vector β_k. The associated graphical model for LDA is shown in Figure 1. Here, we have added a few additional hyperparameters: we place a Gam(a, b) prior independently on each component of α and a Dir(η, . . . , η) prior on each of the βs.

The joint distribution over all random variables specified by LDA is:

  p(α, θ, z, β, w) = [∏_k Gam(α_k | a, b) Dir(β_k | η)] · [∏_d Dir(θ_d | α) ∏_n Mult(z_dn | θ_d) Mult(w_dn | β_{z_dn})]   (1)
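
To make the generative story above concrete, the following is a minimal sketch of it in Python/numpy. The corpus sizes and hyperparameter values are illustrative assumptions, not settings used in this paper, and the Gam(a, b) prior is taken in the shape/scale parameterization.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative sizes (assumptions, not the paper's data).
    D, K, V = 4, 3, 10            # documents, topics, vocabulary size
    N = [5, 7, 6, 4]              # words per document
    a, b, eta = 1.0, 1.0, 0.1     # Gam(a, b) prior on alpha; Dir(eta, ..., eta) prior on beta

    alpha = rng.gamma(a, b, size=K)              # one draw per component of alpha
    beta = rng.dirichlet(np.full(V, eta), K)     # K topic-word distributions beta_k

    docs = []
    for d in range(D):
        theta_d = rng.dirichlet(alpha)           # (a) topic mixture for document d
        words = []
        for n in range(N[d]):
            z_dn = rng.choice(K, p=theta_d)      # (b)i. topic indicator
            w_dn = rng.choice(V, p=beta[z_dn])   # (b)ii. word drawn from that topic
            words.append(w_dn)
        docs.append(words)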


Figure 1: Graphical model for LDA.

Figure 2: Example Markov Random Topic Field over six documents (variables α and β are excluded for clarity).

Many inference methods have been developed for this model; the approach upon which we build is the collapsed Gibbs sampler (Griffiths and Steyvers, 2006). Here, the random variables β and θ are analytically integrated out. The main sampling variables are the z_dn indicators (as well as the hyperparameters: η and a, b). The conditional distribution for z_dn conditioned on all other variables in the model gives the following Gibbs sampling distribution:

  p(z_dn = k) ∝ (#_{z=k}^{-dn} + α_k) / (Σ_k (#_{z=k}^{-dn} + α_k)) · (#_{z=k,w=w_dn}^{-dn} + η) / (Σ_k (#_{z=k,w=w_dn}^{-dn} + η))   (2)

Here, #_χ^{-dn} denotes the number of times event χ occurs in the entire corpus, excluding word n in document d. Intuitively, the first term is a (smoothed) relative frequency of topic k occurring; the second term is a (smoothed) relative frequency of topic k giving rise to word w_dn.

A Markov random field specifies a joint distribution over a collection of random variables x_1, . . . , x_N. An undirected graph structure stipulates how the joint distribution factorizes over these variables. Given a graph G = (V, E), where V = {x_1, . . . , x_N}, let C denote a subset of all the cliques of G. Then, the MRF specifies the joint distribution as p(x) = (1/Z) ∏_{c∈C} ψ_c(x_c), where Z = Σ_x ∏_{c∈C} ψ_c(x_c) is the partition function, x_c is the subset of x contained in clique c, and ψ_c is any non-negative function that measures how “good” a particular configuration of variables x_c is. The ψs are called potential functions.

3 Markov Random Topic Fields

Suppose that we have access to a collection of documents, but do not believe that these documents are all independent. In this case, the generative story of LDA no longer makes sense: related documents are more likely to have “similar” topic structures. For instance, in the scientific community, if paper A cites paper B, we would (a priori) expect the topic distributions for papers A and B to be related. Similarly, if two papers share an author, we might expect them to be topically related. Or if they are both published at EMNLP. Or if they are published in the same year, or come out of the same institution, or many other possibilities.

Regardless of the source of this notion of similarity, we suppose that we can represent the relationship between documents in the form of a graph G = (V, E). The vertices in this graph are the documents and the edges indicate relatedness. Note that the resulting model will not be fully generative, but is still probabilistically well defined.

3.1 Single Graph

There are multiple possibilities for augmenting LDA with such graph structure. We could “link” the topic distributions θ over related documents; we could “link” the topic indicators z over related documents. We consider the former because it leads to a more natural model. The idea is to “unroll” the D-plate in the graphical model for LDA (Figure 1) and connect (via undirected links) the θ variables associated with connected documents. Figure 2 shows an example MRTF over six documents, with thick edges connecting the θ variables of “related” documents. Note that each θ still has α as a parent and each w has β as a parent: these are left off for figure clarity.

The model is a straightforward “integration” of LDA and an MRF specified by the document relationships G. We begin with the joint distribution specified by LDA (see Eq (1)) and add in edge potentials for each edge in the document graph G that “encourage” the topic distributions of neighboring documents to be similar. The potentials all have the form:

  ψ_{d,d'}(θ_d, θ_{d'}) = exp(−ℓ_{d,d'} ρ(θ_d, θ_{d'}))   (3)

Here, ℓ_{d,d'} is a “measure of strength” of the importance of the connection between d and d' (and will be inferred as part of the model). ρ is a distance metric measuring the dissimilarity between θ_d and θ_{d'}. For now, this is Euclidean distance (i.e., ρ(θ_d, θ_{d'}) = ||θ_d − θ_{d'}||); later, we show that alternative distance metrics are preferable.


Adding the graph structure necessitates the addition of hyperparameters ℓ_e for every edge e ∈ E. We place an exponential prior on each 1/ℓ_e with parameter λ: p(ℓ_e | λ) = λ exp(−λ/ℓ_e). Finally, we place a vague Gam(λ_a, λ_b) prior on λ.

3.2 Multiple Graphs

In many applications, there may be multiple graphs that apply to the same data set, G_1, . . . , G_J. In this case, we construct a single MRF based on the union of these graph structures. Each edge now has J-many parameters ℓ_e^j (one for each graph j). Each graph also has its own exponential prior parameter λ_j. Together, this yields:

  ψ_{d,d'}(θ_d, θ_{d'}) = exp(−Σ_j ℓ_{d,d'}^j ρ(θ_d, θ_{d'}))   (4)

Here, the sum ranges only over those graphs that have (d, d') in their edge set.

4 Inference

Inference in MRTFs is somewhat more complicated than inference in LDA, due to the introduction of the additional potential functions. In particular, while it is possible to analytically integrate out θ in LDA (due to multinomial/Dirichlet conjugacy), this is no longer possible in MRTFs. This means that we must explicitly represent (and sample over) the topic distributions θ in the MRTF.

We must therefore sample over the following set of variables: α, θ, z, ℓ and λ. Sampling for α remains unchanged from the LDA case. Sampling for all variables except θ is easy:

  p(z_dn = k) ∝ θ_dk (#_{z=k,w=w_dn}^{-dn} + η) / (Σ_k (#_{z=k,w=w_dn}^{-dn} + η))   (5)
  1/ℓ_{d,d'} ∼ Exp(λ + ρ(θ_d, θ_{d'}))   (6)
  λ ∼ Gam(λ_a + |E|, λ_b + Σ_e 1/ℓ_e)   (7)

The latter two follow from simple conjugacy. When we use multiple graphs, we assign a separate λ for each graph.

For sampling θ, we resort to a Metropolis-Hastings step. Our proposal distribution is the Dirichlet posterior over θ, given all the current assignments. The acceptance probability then just depends on the graph distances. In particular, once θ_d is drawn from the posterior Dirichlet, the acceptance probability becomes ∏_{d'∈N(d)} ψ_{d,d'}, where N(d) denotes the neighbors of d. For each document, we run 10 Metropolis steps; the acceptance rates are roughly 25%.
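
As an illustration of the Metropolis-Hastings step just described, the sketch below resamples a single θ_d by proposing from its Dirichlet posterior and accepting with the standard MH ratio of neighbor edge potentials at the proposed versus current value (the Dirichlet proposal terms cancel). The data structures (neighbors, topic_counts, strength) are hypothetical stand-ins for the sampler's state, not the paper's implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def edge_potential(theta_d, theta_e, strength):
        # Eq (3) with Euclidean rho.
        return np.exp(-strength * np.linalg.norm(theta_d - theta_e))

    def resample_theta_d(d, theta, neighbors, strength, alpha, topic_counts, n_steps=10):
        # theta[d]         -- current topic mixture of document d
        # neighbors[d]     -- documents adjacent to d in the graph G
        # strength[(d, e)] -- edge weight l_{d,e} (assumed stored for both orders)
        # topic_counts[d]  -- current per-topic counts of the z_dn in document d
        for _ in range(n_steps):
            proposal = rng.dirichlet(alpha + topic_counts[d])   # Dirichlet posterior proposal
            ratio = 1.0
            for e in neighbors[d]:
                ratio *= edge_potential(proposal, theta[e], strength[(d, e)]) \
                         / edge_potential(theta[d], theta[e], strength[(d, e)])
            if rng.random() < min(1.0, ratio):                  # MH accept/reject
                theta[d] = proposal
        return theta[d]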


Figure 3: Held-out perplexity for different graphs.

5 Experiments

Our experiments are on a collection of 7441 document abstracts crawled from CiteSeer. The crawl was seeded with a collection of ten documents from each of: ACL, EMNLP, SIGIR, ICML, NIPS, UAI. This yields 650 thousand words of text after removing stop words. We use the following graphs (the number in parentheses is the number of edges):

auth: shared author (47k)
book: shared booktitle/journal (227k)
cite: one cites the other (18k)
http: source file from same domain (147k)
time: published within one year (4122k)
year: published in the same year (2101k)

Other graph structures are of course possible, but these were the most straightforward to cull.

The first thing we look at is convergence of the samplers for the different graphs. See Figure 3. Here, we can see that the author graph and the citation graph provide improved perplexity over the straightforward LDA model (called “*none*”), and that convergence occurs in a few hundred iterations. Due to their size, the final two graphs led to significantly slower inference than the first four, so results with those graphs are incomplete.

Tuning Graph Parameters. The next item we investigate is whether it is important to tune the graph connectivity weights (the ℓ and λ variables). It turns out this is incredibly important; see Figure 4. This is the same set of results as Figure 3, but without ℓ and λ tuning. We see that the graph-based methods do not improve over the baseline.

Figure 4: Held-out perplexity for different graph structures without graph parameter tuning.

Distance Metric. Next, we investigate the use of different distance metrics. We experiment with Bhattacharyya, Hellinger, Euclidean and logistic-Euclidean. See Figure 5 (this is just for the auth graph). Here, we see that Bhattacharyya and Hellinger (well motivated distances for probability distributions) outperform the Euclidean metrics.

Figure 5: Held-out perplexity for different distance metrics.

Using Multiple Graphs. Finally, we compare results using combinations of graphs. Here, we run every sampler for 500 iterations and compute standard deviations based on ten runs (year and time are excluded). The results are in Table 1. Here, we can see that adding graphs (almost) always helps and never hurts. By adding all the graphs together, we are able to achieve an absolute reduction in perplexity of 9 points (roughly 10%). As discussed, this hinges on the tuning of the graph parameters to allow different graphs to have different amounts of influence.

Graphs              Perplexity
*none*              92.1
http                92.2
book                90.2
cite                88.4
auth                87.9
book+http           89.9
cite+http           88.6
auth+http           88.0
book+cite           86.9
auth+book           85.1
auth+cite           84.3
book+cite+http      87.9
auth+cite+http      85.5
auth+book+http      85.3
auth+book+cite      83.7
all                 83.1

Table 1: Comparison of held-out perplexities for varying graph structures with two standard deviation error bars; grouped by number of graphs. Grey bars are indistinguishable from the best model in the previous group; blue bars are at least two stddevs better; red bars are at least four stddevs better.

6 Discussion

We have presented a graph-augmented model for topic models and shown that a simple combined Gibbs/MH sampler is efficient in these models. Using data from the scientific domain, we have shown that we can achieve significant reductions in perplexity on held-out data using these models. Our model resembles recent work on hypertext topic models (Gruber et al., 2008; Sun et al., 2008) and blog influence (Nallapati and Cohen, 2008), but is specifically tailored toward undirected models. Ours is an alternative to the recently proposed Markov Topic Models approach (Wang et al., 2009). While the goal of these two models is similar, the approaches differ fairly dramatically: we use the graph structure to inform the per-document topic distributions; they use the graph structure to inform the unigram models associated with each topic. It would be worthwhile to directly compare these two approaches.

References

David Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet allocation. JMLR, 3.

Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. JASIS, 41(6).

Tom Griffiths and Mark Steyvers. 2006. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning.

Amit Gruber, Michal Rosen-Zvi, and Yair Weiss. 2008. Latent topic models for hypertext. In UAI.

Ramesh Nallapati and William Cohen. 2008. Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs. In Conference for Weblogs and Social Media.

Congkai Sun, Bin Gao, Zhenfu Cao, and Hang Li. 2008. HTM: A topic model for hypertexts. In EMNLP.

Chong Wang, Bo Thiesson, Christopher Meek, and David Blei. 2009. Markov topic models. In AI-Stats.

