Automatic Labeling of Multinomial Topic Models

Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai
University of Illinois at Urbana-Champaign
Outline

• Background: statistical topic models
• Labeling a topic model
  – Criteria and challenges
• Our approach: a probabilistic framework
• Experiments
• Summary
Statistical Topic Models for Text Mining

Probabilistic topic modeling turns text collections into topic models
(multinomial word distributions), e.g.:
• PLSA [Hofmann 99]
• LDA [Blei et al. 03]
• Author-Topic [Steyvers et al. 04]
• Pachinko allocation [Li & McCallum 06]
• CPLSA [Mei & Zhai 06]
• Topic over time [Wang et al. 06]
• …

Example topics (word, probability):
  term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …
  web 0.21, search 0.10, link 0.08, graph 0.05, …

The resulting topics support subtopic discovery, topical pattern analysis,
summarization, opinion comparison, …
Topic Models: Hard to Interpret

• Use top words
  – automatic, but hard to make sense of:
    term 0.16, relevance 0.08, weight 0.07, feedback 0.04,
    independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02,
    document 0.02, …
• Human-generated labels (e.g., "Retrieval Models" for the topic above)
  – make sense, but cannot scale up: what label fits a topic like
    insulin, foraging, foragers, collected, grains, loads, collection,
    nectar, …?

Question: Can we automatically generate understandable labels for topics?
What is a Good Label?

Example: a topic in SIGIR [Mei & Zhai 06]:
  term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372,
  independence 0.0311, model 0.0310, frequent 0.0233,
  probabilistic 0.0188, document 0.0173, …

A good label, such as "Retrieval models", should be:
• Semantically close to the topic (relevance) – unlike "iPod Nano"
• Understandable – phrases? – unlike "じょうほうけんさく"
  (Japanese for "information retrieval")
• High coverage inside the topic – unlike "Pseudo-feedback"
• Discriminative across topics – unlike the overly broad
  "Information Retrieval"
Our Method

Step 1 – Candidate label pool: run an NLP chunker / n-gram statistics
over the collection (e.g., SIGIR) to extract phrases such as
"information retrieval", "retrieval model", "index structure",
"relevance feedback", …

Step 2 – Relevance score, for the target topic
(term 0.16, relevance 0.07, weight 0.07, feedback 0.04,
independence 0.03, model 0.03, …):
  information retrieval   0.26
  retrieval models        0.19
  IR models               0.17
  pseudo feedback         0.06
  …

Step 3 – Discrimination against the other topics
(e.g., filtering 0.21, collaborative 0.15, …; trec 0.18, evaluation 0.10, …):
  information retrieval   0.26 → 0.01
  retrieval models        0.20
  IR models               0.18
  pseudo feedback         0.09
  …

Step 4 – Coverage:
  retrieval models        0.20
  IR models               0.18 → 0.02
  pseudo feedback         0.09
  information retrieval   0.01
  …
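Step 1 above can be sketched as a frequent-bigram extractor ranked by pointwise mutual information. This is a minimal stand-in for the slide's "NLP chunker / n-gram statistics"; the function name, `min_count` threshold, and PMI ranking are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def candidate_bigrams(docs, min_count=2):
    """Build a candidate label pool from frequent bigrams, ranked by PMI
    (a rough stand-in for an n-gram significance test or NLP chunking)."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values()) or 1
    pool = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # keep only bigrams frequent enough to be "significant"
        pmi = math.log((count / n_bi) /
                       ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))
        pool.append((f"{w1} {w2}", pmi))
    return sorted(pool, key=lambda x: -x[1])

# Toy collection: "information retrieval" recurs, so it should surface.
docs = [["information", "retrieval", "model"],
        ["information", "retrieval", "feedback"],
        ["probabilistic", "model"]]
pool = candidate_bigrams(docs)
```

On a real collection one would also filter by part-of-speech patterns or a statistical significance test, as the slide suggests.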
Relevance (Task 2): the Zero-Order Score

• Intuition: prefer phrases that cover the top words of the topic well

Latent topic θ, p(w|θ):
  clustering 0.4, dimensional 0.3, algorithm …, birch …, …,
  shape 0.01, body 0.001

Rank candidate labels by how much the topic raises their probability
over the background:

  p("clustering algorithm" | θ) / p("clustering algorithm")
    >
  p("body shape" | θ) / p("body shape")

Good label (l1): "clustering algorithm" √    Bad label (l2): "body shape"
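The zero-order score can be sketched in a few lines of Python. The probabilities below echo the slide's clustering example; the smoothing constant and the uniform background p(w) are illustrative assumptions, not values from the paper.

```python
import math

def zero_order_score(label_words, p_w_topic, p_w_background):
    """Zero-order relevance: log [ p(l | theta) / p(l) ], assuming the
    words of the label are generated independently."""
    score = 0.0
    for w in label_words:
        # Smooth unseen words with a tiny probability so log() is defined.
        pt = p_w_topic.get(w, 1e-12)
        pb = p_w_background.get(w, 1e-12)
        score += math.log(pt / pb)
    return score

# Toy topic resembling the slide's example (probabilities are made up).
topic = {"clustering": 0.4, "dimensional": 0.3, "algorithm": 0.1,
         "shape": 0.01, "body": 0.001}
background = {w: 0.02 for w in topic}  # assumed uniform background model

good = zero_order_score(["clustering", "algorithm"], topic, background)
bad = zero_order_score(["body", "shape"], topic, background)
```

Here "clustering algorithm" gets a positive lift over the background while "body shape" is penalized, matching the ranking on the slide.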
Relevance (Task 2): the First-Order Score

• Intuition: prefer phrases whose context distribution is similar to
  the topic's

Compare p(w|θ) with the word distribution around each candidate label
in the collection C, e.g. p(w | "clustering algorithm") vs.
p(w | "hash join"), over words such as clustering, dimension,
partition, algorithm, …, hash.

Score a label by the (negative) KL divergence between the topic and
the label's context distribution, which can be estimated as:

  Score(l, θ) = −D(θ || l) ≈ Σ_w p(w|θ) · PMI(w, l | C)

Snippets like "… key … hash join … code … hash table … search …" give
l2 = "hash join" a context far from the clustering topic, so the good
label l1 = "clustering algorithm" outranks it.
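The first-order score can be sketched as follows. Estimating PMI from document-level co-occurrence and treating each candidate label as a single token are simplifying assumptions for illustration; the paper's estimator may differ in detail.

```python
import math
from collections import Counter

def first_order_score(label, topic, docs):
    """First-order relevance: sum_w p(w|theta) * PMI(w, label | C),
    with PMI estimated from document-level co-occurrence counts."""
    n = len(docs)
    df = Counter()   # document frequency of each term
    co = Counter()   # co-occurrence of (term, label) at document level
    for doc in docs:
        terms = set(doc)
        for t in terms:
            df[t] += 1
        if label in terms:
            for t in terms:
                co[t] += 1
    score = 0.0
    for w, p_w in topic.items():
        p_joint = co[w] / n
        p_w_c = df[w] / n
        p_l_c = df[label] / n
        if p_joint > 0 and p_w_c > 0 and p_l_c > 0:
            # Weight each word's PMI with the label by its topic probability.
            score += p_w * math.log(p_joint / (p_w_c * p_l_c))
    return score

# Toy collection: labels appear as pre-chunked single tokens.
docs = [["clustering", "dimension", "clustering algorithm"],
        ["clustering", "algorithm", "clustering algorithm"],
        ["hash", "key", "hash join"],
        ["hash", "table", "hash join"]]
topic = {"clustering": 0.4, "dimension": 0.3, "algorithm": 0.2, "hash": 0.1}

good_label_score = first_order_score("clustering algorithm", topic, docs)
bad_label_score = first_order_score("hash join", topic, docs)
```

"clustering algorithm" co-occurs with the topic's heavy words, so it scores higher than "hash join", whose context is dominated by low-probability words.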
Discrimination and Coverage (Tasks 3 & 4)

• Discriminative across topics:
  – high relevance to the target topic, low relevance to other topics

    Score'(l, θ_i) = Score(l, θ_i) − μ · Score(l, {θ_1, …, θ_{i−1}, θ_{i+1}, …, θ_k})

• High coverage inside the topic:
  – use the MMR (Maximal Marginal Relevance) strategy

    l̂ = argmax_{l ∈ L − S} [ λ · Score(l, θ) − (1 − λ) · max_{l' ∈ S} Sim(l', l) ]
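Both criteria can be sketched with pluggable `scorer` and `sim` functions (any relevance score and any phrase-similarity measure). The helper names, the averaging over other topics, and the μ and λ values below are illustrative assumptions.

```python
def discriminative_score(label, target_idx, topics, scorer, mu=0.7):
    """Score'(l, theta_i): relevance to the target topic minus mu times
    the average relevance to the other topics."""
    others = [t for j, t in enumerate(topics) if j != target_idx]
    penalty = sum(scorer(label, t) for t in others) / max(len(others), 1)
    return scorer(label, topics[target_idx]) - mu * penalty

def mmr_select(candidates, topic, scorer, sim, k=3, lam=0.8):
    """Maximal Marginal Relevance: greedily pick labels relevant to the
    topic but dissimilar to labels already selected."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda l: lam * scorer(l, topic)
                   - (1 - lam) * max((sim(l, s) for s in selected),
                                     default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected

# Toy setup: relevance by lookup, similarity by token-level Jaccard.
topics = [{"retrieval model": 0.9, "retrieval models": 0.85, "text search": 0.5},
          {"retrieval model": 0.9}]  # topic 2 also likes "retrieval model"
scorer = lambda l, t: t.get(l, 0.0)

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

labels = mmr_select(["retrieval model", "retrieval models", "text search"],
                    topics[0], scorer, jaccard, k=2, lam=0.3)
```

With a low λ, the near-duplicate "retrieval models" is skipped in favor of "text search", which covers a different part of the topic; the discrimination term likewise demotes "retrieval model" because topic 2 claims it too.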
Variations and Applications

• Labeling document clusters
  – document cluster → unigram language model
  – applicable to any task that yields a unigram language model
• Context-sensitive labels
  – the label of a topic depends on the context
  – an alternative way to approach contextual text mining

  tree, prune, root, branch → "tree algorithms" in CS
                              ? in horticulture
                              ? in marketing
Experiments

• Datasets:
  – SIGMOD abstracts; SIGIR abstracts; AP news data
  – Candidate labels: significant bigrams; NLP chunks
• Topic models:
  – PLSA, LDA
• Evaluation:
  – Human annotators compare labels generated by anonymized systems
  – Order of systems randomly perturbed; scores averaged over all
    sample topics
Result Summary

• Automatic phrase labels >> top words
• First-order relevance >> zero-order relevance
• Bigrams > NLP chunks
  – Bigrams work better on literature; NLP chunks work better on news
• System labels << human labels
  – Scientific literature is the easier task
Results: Sample Topic Labels

Topic (SIGMOD): clustering 0.02, time 0.01, clusters 0.01,
databases 0.01, large 0.01, performance 0.01, quality 0.005, …
  System labels: clustering algorithm; clustering structure; …
  (vs. weaker candidates: large data, data quality, high data,
  data application, …)

Topic (AP news): north 0.02, case 0.01, trial 0.01, iran 0.01,
documents 0.01, walsh 0.009, reagan 0.009, charges 0.007, …
(remaining top words are stopwords: the, of, a, and, to, data, all > 0.02)
  System labels: iran contra; …

Topic (SIGMOD): tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04,
disk 0.02, array 0.01, cache 0.01, …
  System labels: r tree; b tree; … (vs. indexing methods)
Results: Context-Sensitive Labeling

Topic: sampling, estimation, approximation, histogram, selectivity,
histograms, …

  Context: Database (SIGMOD Proceedings) → selectivity estimation;
  random sampling; approximate answers; …
  Context: IR (SIGIR Proceedings) → distributed retrieval; parameter
  estimation; mixture models; …

• Explores the different meanings of a topic in different contexts
  (content switch)
• An alternative approach to contextual text mining
Summary

• Labeling: a post-processing step for any multinomial topic model
• A probabilistic approach to generating good labels
  – understandable, relevant, high-coverage, discriminative
• Broadly applicable to mining tasks involving multinomial word
  distributions; can be made context-sensitive
• Future work:
  – labeling hierarchical topic models
  – incorporating priors
Thanks!

- Please come to our poster tonight (#40)

				