
Markov Random Topic Fields

Hal Daumé III
School of Computing, University of Utah
Salt Lake City, UT 84112
me@hal3.name

Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 293–296, Suntec, Singapore, 4 August 2009. © 2009 ACL and AFNLP

Abstract

Most approaches to topic modeling assume an independence between documents that is frequently violated. We present a topic model that makes use of one or more user-specified graphs describing relationships between documents. These graphs are encoded in the form of a Markov random field over topics and serve to encourage related documents to have similar topic structures. Experiments show upwards of a 10% improvement in modeling performance.

1 Introduction

One often wishes to apply topic models to large document collections. In these large collections, we usually have meta-information about how one document relates to another. Perhaps two documents share an author; perhaps one document cites another; perhaps two documents are published in the same journal or conference. We often believe that documents related in such a way should have similar topical structures. We encode this belief in a probabilistic fashion by imposing an (undirected) Markov random field (MRF) on top of a standard topic model (see Section 3). The edge potentials in the MRF encode the fact that "connected" documents should share similar topic structures, measured by some parameterized distance function.

Inference in the resulting model is complicated by the addition of edge potentials in the MRF. We demonstrate that a hybrid Gibbs/Metropolis-Hastings sampler is able to efficiently explore the posterior distribution (see Section 4).

In experiments (Section 5), we explore several variations on our basic model. The first is to explore the importance of being able to tune the strength of the potentials in the MRF as part of the inference procedure. This turns out to be of utmost importance. The second is to study the importance of the form of the distance metric used to specify the edge potentials. Again, this has a significant impact on performance. Finally, we consider the use of multiple graphs for a single model and find that the power of combined graphs also leads to significantly better models.

2 Background

Probabilistic topic models propose that text can be considered as a mixture of words drawn from one or more "topics" (Deerwester et al., 1990; Blei et al., 2003). The model we build on is latent Dirichlet allocation (Blei et al., 2003) (henceforth, LDA). LDA stipulates the following generative model for a document collection:

1. For each document d = 1 ... D:
   (a) Choose a topic mixture θ_d ∼ Dir(α)
   (b) For each word in d, n = 1 ... N_d:
       i. Choose a topic z_dn ∼ Mult(θ_d)
       ii. Choose a word w_dn ∼ Mult(β_{z_dn})

Here, α is a hyperparameter vector of length K, where K is the desired number of topics. Each document has a topic distribution θ_d over these K topics, and each word is associated with precisely one topic (indicated by z_dn). Each topic k = 1 ... K is a unigram distribution over words (aka, a multinomial) parameterized by a vector β_k. The associated graphical model for LDA is shown in Figure 1. Here, we have added a few additional hyperparameters: we place a Gam(a, b) prior independently on each component of α and a Dir(η, ..., η) prior on each of the βs.

Figure 1: Graphical model for LDA.

Figure 2: Example Markov Random Topic Field (variables α and β are excluded for clarity).

The joint distribution over all random variables specified by LDA is:

  p(α, θ, z, β, w) = [∏_k Gam(α_k | a, b) Dir(β_k | η)] · [∏_d Dir(θ_d | α) ∏_n Mult(z_dn | θ_d) Mult(w_dn | β_{z_dn})]   (1)

Many inference methods have been developed for this model; the approach upon which we build is the collapsed Gibbs sampler (Griffiths and Steyvers, 2006).
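As a toy illustration, the generative story above can be sketched in a few lines of NumPy. All sizes, hyperparameter values, and variable names here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D = 3, 8, 5          # topics, vocabulary size, documents (toy sizes)
alpha = np.full(K, 0.1)    # Dirichlet hyperparameter over topics
eta = 0.1                  # symmetric Dirichlet prior on each beta_k

# One unigram distribution beta_k per topic.
beta = rng.dirichlet(np.full(V, eta), size=K)

docs = []
for d in range(D):
    theta_d = rng.dirichlet(alpha)            # (a) topic mixture for document d
    N_d = 20                                  # fixed toy document length
    z = rng.choice(K, size=N_d, p=theta_d)    # (b.i) topic indicator per word
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # (b.ii) words
    docs.append(w)
```

Each `docs[d]` is a vector of word ids; a real corpus would replace the toy sizes with the collection's D, K, and V.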
Here, the random variables β and θ are analytically integrated out. The main sampling variables are the z_dn indicators (as well as the hyperparameters η and a, b). Conditioning z_dn on all other variables in the model gives the following Gibbs sampling distribution:

  p(z_dn = k) ∝ [(#^{−dn}_{z=k} + α_k) / Σ_{k′} (#^{−dn}_{z=k′} + α_{k′})] × [(#^{−dn}_{z=k,w=w_dn} + η) / Σ_{k′} (#^{−dn}_{z=k′,w=w_dn} + η)]   (2)

Here, #^{−dn}_χ denotes the number of times event χ occurs in the entire corpus, excluding word n in document d. Intuitively, the first term is a (smoothed) relative frequency of topic k occurring; the second term is a (smoothed) relative frequency of topic k giving rise to word w_dn.

A Markov random field specifies a joint distribution over a collection of random variables x_1, ..., x_N. An undirected graph structure stipulates how the joint distribution factorizes over these variables. Given a graph G = (V, E), where V = {x_1, ..., x_N}, let C denote a subset of all the cliques of G. Then, the MRF specifies the joint distribution as p(x) = (1/Z) ∏_{c∈C} ψ_c(x_c). Here, Z = Σ_x ∏_{c∈C} ψ_c(x_c) is the partition function, x_c is the subset of x contained in clique c, and ψ_c is any non-negative function that measures how "good" a particular configuration of variables x_c is. The ψs are called potential functions.

3 Markov Random Topic Fields

Suppose that we have access to a collection of documents, but do not believe that these documents are all independent. In this case, the generative story of LDA no longer makes sense: related documents are more likely to have "similar" topic structures. For instance, in the scientific community, if paper A cites paper B, we would (a priori) expect the topic distributions for papers A and B to be related. Similarly, if two papers share an author, we might expect them to be topically related. Or if they are both published at EMNLP. Or if they are published in the same year, or come out of the same institution, or many other possibilities.

Regardless of the source of this notion of similarity, we suppose that we can represent the relationships between documents in the form of a graph G = (V, E). The vertices in this graph are the documents and the edges indicate relatedness. Note that the resulting model will not be fully generative, but it is still probabilistically well defined.

3.1 Single Graph

There are multiple possibilities for augmenting LDA with such graph structure. We could "link" the topic distributions θ over related documents; we could "link" the topic indicators z over related documents. We consider the former because it leads to a more natural model. The idea is to "unroll" the D-plate in the graphical model for LDA (Figure 1) and connect (via undirected links) the θ variables associated with connected documents. Figure 2 shows an example MRTF over six documents, with thick edges connecting the θ variables of "related" documents. Note that each θ still has α as a parent and each w has β as a parent; these are left off for figure clarity.

The model is a straightforward "integration" of LDA and an MRF specified by the document relationships G. We begin with the joint distribution specified by LDA (see Eq (1)) and add in edge potentials for each edge in the document graph G that "encourage" the topic distributions of neighboring documents to be similar. The potentials all have the form:

  ψ_{d,d′}(θ_d, θ_{d′}) = exp(−ℓ_{d,d′} ρ(θ_d, θ_{d′}))   (3)

Here, ℓ_{d,d′} is a "measure of strength" of the importance of the connection between d and d′ (and will be inferred as part of the model). ρ is a distance metric measuring the dissimilarity between θ_d and θ_{d′}. For now, this is Euclidean distance (i.e., ρ(θ_d, θ_{d′}) = ||θ_d − θ_{d′}||); later, we show that alternative distance metrics are preferable.
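As a concrete sketch, the edge potential of Eq (3) with Euclidean ρ might look like the following. Function names are mine, not the paper's; the Hellinger variant is included only as one of the alternative metrics alluded to above:

```python
import numpy as np

def rho_euclidean(theta_a, theta_b):
    # rho(theta_d, theta_d') = ||theta_d - theta_d'||
    return np.linalg.norm(theta_a - theta_b)

def rho_hellinger(theta_a, theta_b):
    # One alternative distance, well motivated for probability vectors.
    return np.sqrt(0.5) * np.linalg.norm(np.sqrt(theta_a) - np.sqrt(theta_b))

def psi(theta_a, theta_b, ell, rho=rho_euclidean):
    """Edge potential of Eq (3): exp(-ell * rho(theta_d, theta_d')).

    ell plays the role of the per-edge strength l_{d,d'}, which the
    paper infers during sampling."""
    return np.exp(-ell * rho(theta_a, theta_b))
```

Identical mixtures give ψ = 1 (no penalty); a larger ℓ penalizes dissimilar neighboring documents more sharply.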
Adding the graph structure necessitates the addition of hyperparameters ℓ_e for every edge e ∈ E. We place an exponential prior on each 1/ℓ_e with parameter λ: p(ℓ_e | λ) = λ exp(−λ/ℓ_e). Finally, we place a vague Gam(λ_a, λ_b) prior on λ.

3.2 Multiple Graphs

In many applications, there may be multiple graphs that apply to the same data set, G_1, ..., G_J. In this case, we construct a single MRF based on the union of these graph structures. Each edge now has J-many parameters ℓ^j_e (one for each graph j). Each graph also has its own exponential prior parameter λ_j. Together, this yields:

  ψ_{d,d′}(θ_d, θ_{d′}) = exp(−Σ_j ℓ^j_{d,d′} ρ(θ_d, θ_{d′}))   (4)

Here, the sum ranges only over those graphs that have (d, d′) in their edge set.

4 Inference

Inference in MRTFs is somewhat more complicated than inference in LDA, due to the introduction of the additional potential functions. In particular, while it is possible to analytically integrate out θ in LDA (due to multinomial/Dirichlet conjugacy), this is no longer possible in MRTFs. This means that we must explicitly represent (and sample over) the topic distributions θ in the MRTF, so we must sample over the following set of variables: α, θ, z, ℓ, and λ. Sampling for α remains unchanged from the LDA case. Sampling for the variables other than θ is easy:

  p(z_dn = k) ∝ θ_dk (#^{−dn}_{z=k,w=w_dn} + η) / Σ_{k′} (#^{−dn}_{z=k′,w=w_dn} + η)   (5)

  1/ℓ_{d,d′} ∼ Exp(λ + ρ(θ_d, θ_{d′}))   (6)

  λ ∼ Gam(λ_a + |E|, λ_b + Σ_e 1/ℓ_e)   (7)

The latter two follow from simple conjugacy. When we use multiple graphs, we assign a separate λ to each graph.

For sampling θ, we resort to a Metropolis-Hastings step. Our proposal distribution is the Dirichlet posterior over θ_d, given all the current assignments. The acceptance probability then depends only on the graph distances. In particular, once θ_d is drawn from the posterior Dirichlet, the acceptance probability becomes ∏_{d′∈N(d)} ψ_{d,d′}, where N(d) denotes the neighbors of d. For each document, we run 10 Metropolis steps; the acceptance rates are roughly 25%.

5 Experiments

Our experiments are on a collection of 7441 document abstracts crawled from CiteSeer. The crawl was seeded with a collection of ten documents from each of: ACL, EMNLP, SIGIR, ICML, NIPS, UAI. This yields 650 thousand words of text after removing stop words. We use the following graphs (the number in parentheses is the number of edges):

auth: shared author (47k)
book: shared booktitle/journal (227k)
cite: one cites the other (18k)
http: source file from same domain (147k)
time: published within one year (4122k)
year: published in the same year (2101k)

Other graph structures are of course possible, but these were the most straightforward to cull.

The first thing we look at is convergence of the samplers for the different graphs; see Figure 3. Here, we can see that the author graph and the citation graph provide improved perplexity over the straightforward LDA model (called "*none*"), and that convergence occurs in a few hundred iterations. Due to their size, the final two graphs led to significantly slower inference than the first four, so results with those graphs are incomplete.

Figure 3: Held-out perplexity for different graphs.

Tuning Graph Parameters. The next item we investigate is whether it is important to tune the graph connectivity weights (the ℓ and λ variables). It turns out this is incredibly important; see Figure 4, which shows the same set of results as Figure 3 but without ℓ and λ tuning. We see that without tuning, the graph-based methods do not improve over the baseline.
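A minimal sketch of this Metropolis-Hastings step for a single document follows. It assumes, for brevity, a single shared edge strength ℓ and Euclidean ρ (the paper infers a per-edge ℓ_{d,d′}); since the proposal is the Dirichlet posterior itself, the Dirichlet terms cancel and only the edge potentials enter the acceptance ratio. Names are mine, not the paper's:

```python
import numpy as np

def psi(theta_a, theta_b, ell):
    # Edge potential of Eq (3) with Euclidean rho.
    return np.exp(-ell * np.linalg.norm(theta_a - theta_b))

def mh_step_theta(theta_d, dirichlet_post, neighbor_thetas, ell, rng):
    """One MH update of a document's topic mixture theta_d.

    dirichlet_post:  alpha_k + (topic counts in d), the posterior
                     Dirichlet parameters used as the proposal
    neighbor_thetas: theta vectors of documents adjacent to d in G
    ell:             edge strength (one shared value here, for brevity)
    """
    proposal = rng.dirichlet(dirichlet_post)
    # Independence proposal drawn from the Dirichlet posterior: its terms
    # cancel in the standard MH ratio, leaving only the edge potentials.
    ratio = 1.0
    for theta_n in neighbor_thetas:
        ratio *= psi(proposal, theta_n, ell) / psi(theta_d, theta_n, ell)
    return proposal if rng.random() < min(1.0, ratio) else theta_d

rng = np.random.default_rng(2)
theta = rng.dirichlet([1.0, 1.0, 1.0])
new_theta = mh_step_theta(theta, np.array([5.0, 2.0, 1.0]),
                          [np.array([0.6, 0.3, 0.1])], ell=1.0, rng=rng)
```

Running this step several times per document (the paper uses 10) either keeps the current mixture or replaces it with a posterior draw that the neighboring documents' mixtures have not vetoed.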
Figure 4: Held-out perplexity for different graph structures without graph parameter tuning.

Distance Metric. Next, we investigate the use of different distance metrics. We experiment with Bhattacharyya, Hellinger, Euclidean, and logistic-Euclidean distances; see Figure 5 (this is just for the auth graph). Here, we see that Bhattacharyya and Hellinger (well-motivated distances for probability distributions) outperform the Euclidean metrics.

Figure 5: Held-out perplexity for different distance metrics.

Using Multiple Graphs. Finally, we compare results using combinations of graphs. Here, we run every sampler for 500 iterations and compute standard deviations based on ten runs (year and time are excluded). The results are in Table 1. Here, we can see that adding graphs (almost) always helps and never hurts. By adding all the graphs together, we are able to achieve an absolute reduction in perplexity of 9 points (roughly 10%). As discussed, this hinges on the tuning of the graph parameters to allow different graphs to have different amounts of influence.

Table 1: Comparison of held-out perplexities for varying graph structures, grouped by number of graphs (the original reports two-standard-deviation error bars; grey entries were indistinguishable from the best model in the previous group, blue at least two stddevs better, red at least four stddevs better):

*none*            92.1
auth              87.9
cite              88.4
book              90.2
http              92.2
auth+cite         84.3
auth+book         85.1
auth+http         88.0
book+cite         86.9
cite+http         88.6
book+http         89.9
auth+book+cite    83.7
auth+book+http    85.3
auth+cite+http    85.5
book+cite+http    87.9
all               83.1

6 Discussion

We have presented a graph-augmented model for topic models and shown that a simple combined Gibbs/MH sampler is efficient in these models. Using data from the scientific domain, we have shown that we can achieve significant reductions in perplexity on held-out data using these models. Our model resembles recent work on hypertext topic models (Gruber et al., 2008; Sun et al., 2008) and blog influence (Nallapati and Cohen, 2008), but is specifically tailored toward undirected models. Ours is an alternative to the recently proposed Markov Topic Models approach (Wang et al., 2009). While the goals of these two models are similar, the approaches differ fairly dramatically: we use the graph structure to inform the per-document topic distributions; they use the graph structure to inform the unigram models associated with each topic. It would be worthwhile to directly compare these two approaches.

References

David Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet allocation. JMLR, 3.

Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. JASIS, 41(6).

Tom Griffiths and Mark Steyvers. 2006. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning.

Amit Gruber, Michal Rosen-Zvi, and Yair Weiss. 2008. Latent topic models for hypertext. In UAI.

Ramesh Nallapati and William Cohen. 2008. Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs. In Conference for Weblogs and Social Media.

Congkai Sun, Bin Gao, Zhenfu Cao, and Hang Li. 2008. HTM: A topic model for hypertexts. In EMNLP.

Chong Wang, Bo Thiesson, Christopher Meek, and David Blei. 2009. Markov topic models. In AI-Stats.
