Lecture 16: Text Databases & Information Retrieval: Part II
Oct. 20, 2006

                                  ChengXiang Zhai


 CS511 Advanced Database Management Systems                   1
                   The Notion of Relevance

• Relevance can be formalized in three ways
  – Similarity, (Rep(q), Rep(d)): different rep & similarity
      – Vector space model (Salton et al., 75)
      – Prob. distr. model (Wong & Yao, 89)
      – …
  – Probability of relevance, P(r=1|q,d), r ∈ {0,1}
      – Regression model (Fox 83)
      – Generative model, doc generation: classical prob. model (Robertson & Sparck Jones, 76)
      – Generative model, query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
  – Probabilistic inference, P(d→q) or P(q→d): different inference system
      – Prob. concept space model (Wong & Yao, 95)
      – Inference network model (Turtle & Croft, 91)
               What is a Statistical LM?
• A probability distribution over word sequences
   – p(“Today is Wednesday”) ≈ 0.001
   – p(“Today Wednesday is”) ≈ 0.0000000000001
   – p(“The eigenvalue is positive”) ≈ 0.00001
• Context-dependent!
• Can also be regarded as a probabilistic
  mechanism for “generating” text, thus also
  called a “generative” model



                     Why is a LM Useful?
• Provides a principled way to quantify the
  uncertainties associated with natural
  language
• Allows us to answer questions like:
    – Given that we see “John” and “feels”, how likely will we see
      “happy” as opposed to “habit” as the next word?
       (speech recognition)
    – Given that we observe “baseball” three times and “game”
      once in a news article, how likely is it about “sports”?
      (text categorization, information retrieval)
    – Given that a user is interested in sports news, how likely
      would the user use “baseball” in a query?
       (information retrieval)
                                        Basic Issues
• Define the probabilistic model
  – Event, Random Variables, Joint/Conditional Prob’s
  – P(w1 w2 ... wn)=f(θ1, θ2, …, θn)
• Estimate model parameters
  – Tune the model to best fit the data and our prior
    knowledge
  – θi=?
• Apply the model to a particular task
  – Many applications

     The Simplest Language Model
                                        (Unigram Model)
• Generate a piece of text by generating each word
  INDEPENDENTLY
• Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn)
• Parameters: {p(wi)}, with p(w1)+…+p(wN)=1 (N is voc. size)


• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn
  according to this word distribution
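
The independence assumption above can be sketched in a few lines of Python. This is only an illustration: the three-word model `theta` and its probabilities are made up, and the function names are hypothetical.

```python
import math
import random

# Toy unigram model: hypothetical probabilities that sum to 1.
theta = {"text": 0.5, "mining": 0.3, "food": 0.2}

def log_prob(words, model):
    """log p(w1 ... wn) = sum_i log p(wi), since words are generated independently."""
    return sum(math.log(model[w]) for w in words)

def sample_text(model, n, seed=0):
    """Draw n words independently from the word distribution."""
    rng = random.Random(seed)
    vocab = list(model)
    weights = [model[w] for w in vocab]
    return rng.choices(vocab, weights=weights, k=n)

# p("text mining text") = 0.5 * 0.3 * 0.5
p = math.exp(log_prob(["text", "mining", "text"], theta))
doc = sample_text(theta, 10)
```

Because the words are independent, word order does not change the probability, which is why a unigram model cannot distinguish “Today is Wednesday” from “Today Wednesday is”.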




Text Generation with Unigram LM
• A (unigram) language model p(w|θ) generates a document by sampling
  – Topic 1 (text mining): …, text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …
      – Sampled document: a text mining paper
  – Topic 2 (health): …, food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
      – Sampled document: a food nutrition paper
          Estimation of Unigram LM
• Estimation: given a document, what is the (unigram) language model p(w|θ)?
• Document: a “text mining paper” (total #words=100)
  – Word counts: text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1
• Estimates
  – p(text|θ) = 10/100
  – p(mining|θ) = 5/100
  – p(association|θ) = 3/100
  – p(database|θ) = 3/100
  – …
  – p(query|θ) = 1/100
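
A minimal sketch of this estimate (relative frequency) in Python. The token list is an assumption built to match the counts on the slide; the `<other>` placeholder stands in for the 75 words the slide elides and is not part of the original.

```python
from collections import Counter

def ml_estimate(tokens):
    """Relative-frequency estimate: p(w) = count of w / total number of words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical document with the slide's counts; "<other>" pads it to 100 words.
tokens = (["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["database"] * 3
          + ["algorithm"] * 2 + ["query"] + ["efficient"] + ["<other>"] * 75)
model = ml_estimate(tokens)
```

Any word that never occurs in the document is simply absent from `model`, i.e. it gets probability zero.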

   Empirical distribution of words
• There are stable language-independent patterns in
  how people use natural languages
• A few words occur very frequently; most occur rarely.
  E.g., in news articles,
  – Top 4 words: 10~15% word occurrences
  – Top 50 words: 35~40% word occurrences
• The most frequent word in one corpus may be rare in
  another



                                                Zipf’s Law

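• rank * frequency ≈ constant:
    $F(w) = \frac{C}{r(w)^{\alpha}}, \quad \alpha \approx 1,\ C \approx 0.1$
• [Figure: word frequency plotted against word rank (by freq)]
  – Highest-frequency words: biggest data structure (stop words)
  – Mid-frequency words: most useful words (Luhn 57)
  – Lowest-frequency words: is “too rare” a problem?
• Generalized Zipf’s law (applicable in many domains):
    $F(w) = \frac{C}{[r(w) + B]^{\alpha}}$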
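
One way to check the rank-frequency relation on a token list is sketched below. The toy token list is invented so that the product comes out constant for the top ranks; real text only follows the law approximately. The function names and default parameter values are assumptions for illustration.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, relative frequency, rank * relative frequency) per word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        freq = count / total
        rows.append((rank, word, freq, rank * freq))
    return rows

def generalized_zipf(rank, C=0.1, B=0.0, alpha=1.0):
    """F(w) = C / (r(w) + B)^alpha; B=0 and alpha=1 give the basic law."""
    return C / (rank + B) ** alpha

# Hypothetical corpus: counts 6, 3, 2, 1.
tokens = ["the"] * 6 + ["of"] * 3 + ["model"] * 2 + ["zipf"]
rows = rank_frequency(tokens)
```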

   Language Models for Retrieval
                                     (Ponte & Croft 98)
• Each document has its own language model
  – Text mining paper: …, text ?, mining ?, association ?, clustering ?, …, food ?, …
  – Food nutrition paper: …, food ?, nutrition ?, healthy ?, diet ?, …
• Query = “data mining algorithms”
• Which model would most likely have generated this query?
Ranking Docs by Query Likelihood
• Each doc di has a doc LM θdi; docs are ranked by the query likelihood p(q|θdi)
  – d1 → θd1 → p(q|θd1)
  – d2 → θd2 → p(q|θd2)
  – …
  – dN → θdN → p(q|θdN)

              Retrieval as
        Language Model Estimation
• Document ranking based on query likelihood
    $\log p(q \mid d) = \sum_{i} \log p(w_i \mid d)$
    where $q = w_1 w_2 \ldots w_n$ and $p(w_i \mid d)$ is the document language model

• Retrieval problem ≈ Estimation of p(wi|d)
• Smoothing is an important issue, and
     distinguishes different approaches
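
A small sketch of query-likelihood ranking with an unsmoothed (maximum likelihood) document model, assuming two toy documents that are not from the lecture. It also shows why smoothing matters: a single query word missing from a document drives that document's score to negative infinity.

```python
import math
from collections import Counter

def ml_model(tokens):
    """Unsmoothed document language model: relative frequencies."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def log_query_likelihood(query, model):
    """log p(q|d) = sum_i log p(w_i|d); -inf if any query word has zero probability."""
    total = 0.0
    for w in query:
        p = model.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

# Hypothetical documents and query.
docs = {
    "d1": "text mining algorithms for text data".split(),
    "d2": "food nutrition and healthy diet".split(),
}
query = "text mining".split()
scores = {name: log_query_likelihood(query, ml_model(toks)) for name, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```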

   Problem with the ML Estimator
• What if a word doesn’t appear in the text?
• In general, what probability should we give a word
  that has not been observed?
• If we want to assign non-zero probabilities to such
  words, we’ll have to discount the probabilities of
  observed words
• This is what “smoothing” is about …


      Language Model Smoothing
            (Illustration)

• [Figure: P(w) plotted over words w, with two curves]
  – Max. likelihood estimate: $p_{ML}(w) = \frac{\text{count of } w}{\text{count of all words}}$
  – Smoothed LM


   A General Smoothing Scheme
• All smoothing methods try to
     – discount the probability of words seen in a doc
     – re-allocate the extra probability so that unseen
       words will have a non-zero probability
• Most use a reference model (collection language
   model) to discriminate unseen words
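    $p(w \mid d) = \begin{cases} p_{seen}(w \mid d) & \text{if } w \text{ is seen in } d \\ \alpha_d \, p(w \mid C) & \text{otherwise} \end{cases}$
     – $p_{seen}(w \mid d)$: discounted ML estimate
     – $p(w \mid C)$: collection language model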
   Smoothing & TF-IDF Weighting
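• Plugging the general smoothing scheme into the query
  likelihood retrieval formula, we obtain

    $\log p(q \mid d) = \sum_{w_i \in d,\ w_i \in q} \left[ \log \frac{p_{seen}(w_i \mid d)}{\alpha_d \, p(w_i \mid C)} \right] + n \log \alpha_d + \sum_{i} \log p(w_i \mid C)$

  – First sum: TF weighting (via p_seen(wi|d)) and IDF weighting (via 1/p(wi|C))
  – n log α_d: doc length normalization (a long doc is expected to have a smaller α_d)
  – Last sum: ignore for ranking (it is the same for every doc)

• Smoothing with p(w|C) ≈ TF-IDF + length norm.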
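
The rewriting can be checked step by step. The sketch below assumes the general smoothing scheme ($p_{seen}$ for words seen in $d$, $\alpha_d \, p(w \mid C)$ otherwise) and a query $q = w_1 \ldots w_n$; it splits the query words into matched and unmatched ones, then adds and subtracts the matched words' collection term.

```latex
\begin{align*}
\log p(q \mid d)
  &= \sum_{i:\, w_i \in d} \log p_{seen}(w_i \mid d)
   + \sum_{i:\, w_i \notin d} \log \bigl[\alpha_d \, p(w_i \mid C)\bigr] \\
  &= \sum_{i:\, w_i \in d} \log p_{seen}(w_i \mid d)
   + \sum_{i=1}^{n} \log \bigl[\alpha_d \, p(w_i \mid C)\bigr]
   - \sum_{i:\, w_i \in d} \log \bigl[\alpha_d \, p(w_i \mid C)\bigr] \\
  &= \sum_{i:\, w_i \in d} \log \frac{p_{seen}(w_i \mid d)}{\alpha_d \, p(w_i \mid C)}
   + n \log \alpha_d
   + \sum_{i=1}^{n} \log p(w_i \mid C)
\end{align*}
```

Only the first sum depends on which query words the document matches, so scoring needs to touch only the matched terms.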

                             How to Smooth?
• All smoothing methods try to
       – discount the probability of words seen in a
         document
       – re-allocate the extra counts so that unseen
         words will have a non-zero count
• Method 1 (Additive smoothing): Add a
     constant δ to the counts of each word
       – With δ = 1 (“add one”, Laplace smoothing):
           $p(w \mid d) = \frac{c(w, d) + 1}{|d| + |V|}$
       – c(w,d): counts of w in d; |d|: length of d (total counts); |V|: vocabulary size
• Problems?
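
A minimal sketch of additive smoothing, generalized from “add one” to an arbitrary constant `delta` (with `delta=1` it reduces to the formula above). The vocabulary and document are toy assumptions, and the code assumes every document token is in the vocabulary.

```python
from collections import Counter

def additive_smoothing(tokens, vocabulary, delta=1.0):
    """p(w|d) = (c(w,d) + delta) / (|d| + delta * |V|) for every w in the vocabulary."""
    counts = Counter(tokens)
    denom = len(tokens) + delta * len(vocabulary)
    return {w: (counts[w] + delta) / denom for w in vocabulary}

# Hypothetical vocabulary and document.
vocabulary = ["text", "mining", "food", "diet"]
doc = ["text", "text", "mining"]
model = additive_smoothing(doc, vocabulary)
```

Note that every unseen word receives the same probability, regardless of how common it is in the collection, which is one answer to the “Problems?” question.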


           Other Smoothing Methods
• Method 2 (Absolute discounting): Subtract a
    constant δ from the counts of each word
      $p(w \mid d) = \frac{\max(c(w, d) - \delta,\ 0) + \delta \, |d|_u \, p(w \mid REF)}{|d|}$
    – |d|_u: # uniq words in d
• Method 3 (Linear interpolation, Jelinek-Mercer):
    “Shrink” uniformly toward p(w|REF)
      $p(w \mid d) = (1 - \lambda) \frac{c(w, d)}{|d|} + \lambda \, p(w \mid REF)$
    – c(w,d)/|d|: ML estimate; λ: parameter
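
Both formulas translate directly into code. The sketch below is illustrative only: the reference model `p_ref`, the toy document, and the default values of `delta` and `lam` are assumptions, not settings from the lecture.

```python
from collections import Counter

def absolute_discount(w, doc_counts, doc_len, p_ref, delta=0.7):
    """p(w|d) = (max(c(w,d) - delta, 0) + delta * |d|_u * p(w|REF)) / |d|."""
    unique_words = len(doc_counts)  # |d|_u: number of distinct words in d
    discounted = max(doc_counts.get(w, 0) - delta, 0.0)
    return (discounted + delta * unique_words * p_ref[w]) / doc_len

def jelinek_mercer(w, doc_counts, doc_len, p_ref, lam=0.5):
    """p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|REF)."""
    return (1 - lam) * doc_counts.get(w, 0) / doc_len + lam * p_ref[w]

# Hypothetical document and reference (collection) model.
doc = "text text mining data".split()
counts = Counter(doc)
p_ref = {"text": 0.4, "mining": 0.2, "data": 0.2, "food": 0.2}

p_food_ad = absolute_discount("food", counts, len(doc), p_ref)
p_food_jm = jelinek_mercer("food", counts, len(doc), p_ref)
```

In both methods the unseen word “food” gets a non-zero probability proportional to p(w|REF), and the probabilities over the vocabulary still sum to one.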


  Other Smoothing Methods (cont.)
• Method 4 (Dirichlet Prior/Bayesian): Assume
  pseudo counts μ p(w|REF)
      $p(w \mid d) = \frac{c(w, d) + \mu \, p(w \mid REF)}{|d| + \mu} = \frac{|d|}{|d| + \mu} \cdot \frac{c(w, d)}{|d|} + \frac{\mu}{|d| + \mu} \, p(w \mid REF)$
  – μ: parameter
• Method 5 (Good Turing): Assume total # unseen
  events to be n1 (# of singletons), and adjust
  the seen events in the same way
      $p(w \mid d) = \frac{c^*(w, d)}{|d|}; \quad c^*(w, d) = r^* = (r + 1) \frac{n_{r+1}}{n_r}, \text{ where } r = c(w, d)$
      $0^* = \frac{n_1}{n_0}, \quad 1^* = \frac{2 n_2}{n_1}, \ \ldots$
  – What if $n_r = 0$? What about $p(w \mid REF)$?
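
As a sketch of how Dirichlet prior smoothing plugs into query-likelihood ranking, the code below estimates p(w|REF) from the whole toy collection and scores each document. The documents, the query, and `mu=10.0` are assumptions chosen for a tiny example; skipping query words that never occur in the collection is also an assumption of this sketch.

```python
import math
from collections import Counter

def dirichlet_prob(w, doc_counts, doc_len, p_ref, mu=2000.0):
    """p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)."""
    return (doc_counts.get(w, 0) + mu * p_ref[w]) / (doc_len + mu)

def rank_by_query_likelihood(query, docs, mu=2000.0):
    """Rank docs by sum_i log p(w_i|d), using the collection as the reference model."""
    collection = Counter(w for toks in docs.values() for w in toks)
    total = sum(collection.values())
    p_ref = {w: c / total for w, c in collection.items()}
    scores = {}
    for name, toks in docs.items():
        counts = Counter(toks)
        # Query words unseen in the whole collection are skipped (assumption).
        scores[name] = sum(
            math.log(dirichlet_prob(w, counts, len(toks), p_ref, mu))
            for w in query if w in p_ref
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical collection and query.
docs = {
    "d1": "text mining algorithms for text data".split(),
    "d2": "food nutrition and healthy diet".split(),
}
ranking = rank_by_query_likelihood("data mining algorithms".split(), docs, mu=10.0)
```

Unlike the unsmoothed estimate, every document now receives a finite score, so documents that miss a query word are still ranked.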
 So, which method is the best?

                    It depends on the data and the task!
Many other sophisticated smoothing methods have been
                      proposed…
 Cross validation is generally used to choose the best
   method and/or set the smoothing parameters…
            For retrieval, Dirichlet prior performs well…




 Comparison of Three Methods

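           Query Type     JM        Dir       AD
           Title          0.228     0.256     0.237
           Long           0.278     0.276     0.260

           (JM = Jelinek-Mercer, Dir = Dirichlet prior, AD = absolute discounting)

           [Figure: bar chart “Relative performance of JM, Dir. and AD”; x-axis: Method (JM, DIR, AD); y-axis: precision (0 to 0.3); series: TitleQuery, LongQuery]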

              Applications of Basic IR Techniques




      Some “Basic” IR Techniques
• Stemming
• Stop words
• Weighting of terms (e.g., TF-IDF)
• Vector/Unigram representation of text
• Text similarity (e.g., cosine, KL-div)
• Relevance/pseudo feedback (e.g., Rocchio)

         They are not just for retrieval!
Generality of Basic Techniques
• [Figure: how the basic techniques feed different applications]
  – Raw text → Stemming & Stop words → Tokenized text → Term Weighting
  – Term weighting gives a doc-term weight matrix: rows d1 … dm, columns t1 … tn, entries wij
  – Term similarity and Doc similarity → CLUSTERING (of terms and of docs)
  – Sentence selection → SUMMARIZATION
  – Vector centroid → CATEGORIZATION
  – META-DATA/ANNOTATION
                       Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization




                     Information Filtering
• Stable & long term interest, dynamic info
    source
• System must make a delivery decision
    immediately as a document “arrives”
• [Figure: a stream of documents … → Filtering System → my interest]




   A Vector-Space Filtering Model

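• [Figure: filtering pipeline]
  – Doc vector → Scoring (against the profile vector) → Thresholding (against the threshold) → yes / no
  – “Yes” decisions → Utility Evaluation → Feedback Information
  – Feedback information → Vector Learning (updates the profile vector) and Threshold Learning (updates the threshold)
• Utility: F = 3R+ − 2N+
  – R+: yes & correct
  – N+: yes & incorrect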


    Issues in Information Filtering
• Threshold setting
    – Crucial for binary decision making
    – Must avoid under-delivery or over-delivery
• Initialization
    – What threshold should a system start with?
• Learning from limited and biased feedback
    – Only delivered documents get feedback info
    – How to learn a threshold?
    – Exploitation vs. exploration
• Other issues (redundancy, interest shift, etc.)
 Examples of Information Filtering
• News filtering
• Email filtering
• Recommender systems
• Literature alert
• And many others



                            Text Categorization
• Pre-given categories and labeled document
 examples (categories may form a hierarchy)
• Classify new documents
• A standard supervised learning problem
• [Figure: labeled examples (Sports, Business, Education, …) train a Categorization System, which assigns new documents to Sports, Business, Education, …, Science]
 “Retrieval-based” Categorization
• Treat each category as representing an
 “information need”
• Treat examples in each category as “relevant
 documents”
• Use feedback approaches to learn a good
 “query”
• Match all the learned queries to a new document
• A document is assigned the category (or categories)
 represented by the best-matching query (or queries)

        Prototype-based Classifier
• Key elements (“retrieval techniques”)
      – Prototype/document representation (e.g., term vector)
      – Document-prototype distance measure (e.g., dot product)
      – Prototype vector learning: Rocchio feedback
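
A minimal sketch of a prototype-based classifier built from these elements. It simplifies Rocchio feedback to the centroid of each category's positive examples (full Rocchio would also subtract a weighted centroid of negative examples); the training vectors and function names are hypothetical.

```python
from collections import defaultdict

def dot(u, v):
    """Dot product between two sparse term vectors (dicts)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def learn_prototypes(examples):
    """One prototype per category: the centroid of its training vectors."""
    grouped = defaultdict(list)
    for vec, label in examples:
        grouped[label].append(vec)
    prototypes = {}
    for label, vecs in grouped.items():
        proto = defaultdict(float)
        for vec in vecs:
            for term, weight in vec.items():
                proto[term] += weight / len(vecs)
        prototypes[label] = dict(proto)
    return prototypes

def classify(vec, prototypes):
    """Assign the category whose prototype best matches the document."""
    return max(prototypes, key=lambda label: dot(vec, prototypes[label]))

# Hypothetical labeled examples.
examples = [
    ({"baseball": 2.0, "game": 1.0}, "Sports"),
    ({"team": 1.0, "game": 2.0}, "Sports"),
    ({"stock": 2.0, "market": 1.0}, "Business"),
    ({"market": 2.0, "profit": 1.0}, "Business"),
]
prototypes = learn_prototypes(examples)
label = classify({"game": 1.0, "market": 0.5}, prototypes)
```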

• Example



         K-Nearest Neighbor Classifier
•    Keep all training examples
•    Find k examples that are most similar to the new
     document (“neighbor” documents)
•    Assign the category that is most common in
     these neighbor documents (neighbors vote for
     the category)
•    Can be improved by considering the distance of a
     neighbor (a closer neighbor has more influence)
•    Technical elements (“retrieval techniques”)
       – Document representation
       – Document distance measure
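
The voting procedure above might look like this in Python. Cosine similarity over sparse term vectors is an assumed choice of distance measure, and the training set is a toy example; with `weighted=True`, each neighbor's vote is weighted by its similarity.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(vec, training, k=3, weighted=False):
    """Let the k most similar training documents vote for the category."""
    neighbors = sorted(training, key=lambda ex: cosine(vec, ex[0]), reverse=True)[:k]
    votes = defaultdict(float)
    for ex_vec, label in neighbors:
        votes[label] += cosine(vec, ex_vec) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical training examples: all of them are kept, none are summarized.
training = [
    ({"baseball": 2.0, "game": 1.0}, "Sports"),
    ({"game": 1.0, "team": 1.0}, "Sports"),
    ({"stock": 1.0, "market": 1.0}, "Business"),
    ({"market": 1.0, "game": 1.0}, "Business"),
]
new_doc = {"baseball": 1.0, "game": 1.0}
label = knn_classify(new_doc, training, k=3)
```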


             Example of K-NN Classifier

• [Figure: K-NN classification of a new document with k=1 and with k=4]




  Examples of Text Categorization
• News article classification
• Meta-data annotation
• Automatic Email sorting
• Web page classification




                The Clustering Problem
• Discover “natural structure”
• Group similar objects together
• Objects can be documents, terms, or passages
• Example




          Similarity-based Clustering
                  (as opposed to “model-based”)
• Define a similarity function to measure
  similarity between two objects
• Gradually group similar objects together in a
  bottom-up fashion
• Stop when some stopping criterion is met
• Variations: different ways to compute group
  similarity based on individual object
  similarity



      Similarity-induced Structure




 How to Compute Group Similarity?


• Three popular methods; given two groups g1 and g2:
  – Single-link algorithm: s(g1,g2) = similarity of the closest pair
  – Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
  – Average-link algorithm: s(g1,g2) = average similarity of all pairs
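
The three group-similarity definitions, together with bottom-up merging, can be sketched as below. The objects are 1-D points and the similarity function `sim` is a made-up choice for illustration; the stopping criterion used here (stop when the best group similarity drops below `min_sim`) is just one possible option.

```python
from itertools import product

def group_similarity(g1, g2, sim, method="single"):
    """Similarity of two groups from the pairwise similarities of their members."""
    sims = [sim(a, b) for a, b in product(g1, g2)]
    if method == "single":      # closest pair = highest similarity
        return max(sims)
    if method == "complete":    # farthest pair = lowest similarity
        return min(sims)
    if method == "average":     # average over all pairs
        return sum(sims) / len(sims)
    raise ValueError(f"unknown method: {method}")

def agglomerative(objects, sim, method="single", min_sim=0.5):
    """Repeatedly merge the two most similar groups until none is similar enough."""
    clusters = [[o] for o in objects]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = group_similarity(clusters[i], clusters[j], sim, method)
                if best is None or s > best[0]:
                    best = (s, i, j)
        if best[0] < min_sim:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def sim(a, b):
    """Hypothetical similarity for 1-D points: 1 when identical, decaying with distance."""
    return 1.0 / (1.0 + abs(a - b))

points = [1.0, 1.2, 5.0, 5.1, 9.0]
clusters = agglomerative(points, sim, method="single", min_sim=0.5)
```

Single-link tends to produce long chained groups, while complete-link favors tight groups, so the choice of method changes the resulting structure.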


                Three Methods Illustrated

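• [Figure: two groups g1 and g2, showing which pairs each method compares]
  – Single-link algorithm: the closest pair across g1 and g2
  – Complete-link algorithm: the farthest pair across g1 and g2
  – Average-link algorithm: all pairs across g1 and g2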


 Examples of Doc/Term Clustering
• Clustering of retrieval results
• Clustering of documents in the whole collection
• Term clustering to define “concept” or “theme”
• Automatic construction of hyperlinks
• In general, very useful for text mining



       The Summarization Problem
• Essentially “semantic compression” of text
• Selection-based vs. generation-based summary
• In general, we need a purpose for summarization,
  but it’s hard to define it




 “Retrieval-based” Summarization
• Observation: term vector → summary?
• Basic approach
  – Rank “sentences”, and select top N as a summary
• Methods for ranking sentences
  – Based on term weights
  – Based on position of sentences
  – Based on the similarity of sentence and document
    vector
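
A sketch of the third ranking method (similarity between each sentence and the document vector). Raw term counts as weights, the regex tokenizer, and the example sentences are assumptions; a real system would likely add stemming, stop-word removal, and TF-IDF weights.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Term-count vector over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def summarize(sentences, n=2):
    """Select the n sentences most similar to the document vector, in original order."""
    doc_vec = vectorize(" ".join(sentences))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(vectorize(sentences[i]), doc_vec),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]

# Hypothetical three-sentence document.
sentences = [
    "Text mining finds patterns in text.",
    "The weather was nice.",
    "Mining text data needs text retrieval.",
]
summary = summarize(sentences, n=2)
```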



        Simple Discourse Analysis

• [Figure: a document is divided into consecutive segments, each represented as a vector (vector 1, vector 2, vector 3, …, vector n-1, vector n); a similarity is computed between adjacent vectors]


A Simple Summarization Method

• [Figure: the document is divided into segments; in each segment, the sentence most similar to the doc vector is selected (sentence 1, sentence 2, sentence 3, …), and the selected sentences form the summary]
        Examples of Summarization
• News summary
• Summarize retrieval results
  – Single doc summary
  – Multi-doc summary
• Summarize a cluster of documents (automatic label
  creation for clusters)




                 What You Should Know
• Language models are new retrieval models with
  many advantages
• The retrieval techniques can be used to do more
  than just search



