# Lecture 16: Text Databases & Information Retrieval, Part II


Lecture 16: Text Databases & Information Retrieval: Part II
Oct. 20, 2006

ChengXiang Zhai

CS511 Advanced Database Management Systems                   1
The Notion of Relevance

Relevance can be modeled in three broad ways, each relating a query q and a document d:

• Similarity between representations: sim(Rep(q), Rep(d))
– Vector space model (Salton et al., 75)
– Prob. distr. model (Wong & Yao, 89)
• Probability of relevance: P(r=1|q,d), r ∈ {0,1}
– Different rep & similarity: regression model (Fox 83)
– Generative models:
  · Document generation: classical prob. model (Robertson & Sparck Jones, 76)
  · Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
• Probabilistic inference: P(d→q) or P(q→d), with different inference systems
– Prob. concept space model (Wong & Yao, 95)
– Inference network model (Turtle & Croft, 91)
What is a Statistical LM?
• A probability distribution over word sequences
– p("Today is Wednesday") ≈ 0.001
– p("Today Wednesday is") ≈ 0.0000000000001
– p("The eigenvalue is positive") ≈ 0.00001
• Context-dependent!
• Can also be regarded as a probabilistic mechanism for "generating" text, thus also called a "generative" model
Why is a LM Useful?
• Provides a principled way to quantify the
uncertainties associated with natural
language
• Allows us to answer questions like:
– Given that we see “John” and “feels”, how likely will we see
“happy” as opposed to “habit” as the next word?
(speech recognition)
– Given that we observe “baseball” three times and “game”
once in a news article, how likely is it about “sports”?
(text categorization, information retrieval)
– Given that a user is interested in sports news, how likely
would the user use “baseball” in a query?
(information retrieval)
Basic Issues
• Define the probabilistic model
– Events, random variables, joint/conditional probabilities
– p(w1 w2 ... wn) = f(θ1, θ2, …, θn)
• Estimate model parameters
– Tune the model to best fit the data and our prior knowledge
– θi = ?
• Apply the model to a particular task
– Many applications
The Simplest Language Model (Unigram Model)
• Generate a piece of text by generating each word INDEPENDENTLY
• Thus, p(w1 w2 ... wn) = p(w1)p(w2)…p(wn)
• Parameters: {p(wi)}, with p(w1) + … + p(wN) = 1 (N is the vocabulary size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution
Text Generation with Unigram LM

(Figure: sampling from a unigram language model p(w|θ) generates a document.)

Topic 1 (Text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, … → generates a "text mining paper"

Topic 2 (Health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … → generates a "food nutrition paper"
Estimation of Unigram LM

(Figure: estimating p(w|θ) = ? from an observed document.)

Given a "text mining paper" with 100 words in total and word counts
text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1,
the maximum-likelihood estimates are
p(text) = 10/100, p(mining) = 5/100, p(association) = 3/100, p(database) = 3/100, …, p(query) = 1/100, …
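The estimation and scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the full method: the document below is a toy reconstruction of the slide's counts (the unlisted 75 words are collapsed into a hypothetical "filler" token), and `log_likelihood` shows why the unsmoothed ML estimate breaks on unseen query words.

```python
# Minimal sketch: unigram-LM estimation (MLE) and query log-likelihood.
import math
from collections import Counter

def estimate_unigram(doc_words):
    """Maximum-likelihood estimate: p(w|d) = c(w,d) / |d|."""
    counts = Counter(doc_words)
    total = len(doc_words)
    return {w: c / total for w, c in counts.items()}

def log_likelihood(query_words, lm):
    """log p(q|d) = sum_i log p(w_i|d); an unseen word drives it to -inf."""
    return sum(math.log(lm[w]) if lm.get(w, 0.0) > 0 else float("-inf")
               for w in query_words)

# Toy 100-word document with the counts shown on the slide.
doc = (["text"] * 10 + ["mining"] * 5 + ["association"] * 3 +
       ["database"] * 3 + ["algorithm"] * 2 + ["query"] +
       ["efficient"] + ["filler"] * 75)  # "filler" is a made-up stand-in
lm = estimate_unigram(doc)
print(lm["text"])                              # 0.1
print(log_likelihood(["data", "mining"], lm))  # -inf: "data" unseen, hence smoothing
```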
Empirical distribution of words
• There are stable language-independent patterns in
how people use natural languages
• A few words occur very frequently; most occur rarely.
E.g., in news articles,
– Top 4 words: 10~15% word occurrences
– Top 50 words: 35~40% word occurrences
• The most frequent word in one corpus may be rare in
another

Zipf's Law

• rank * frequency ≈ constant:  F(w) = C / r(w)^α,  with α ≈ 1, C ≈ 0.1

(Figure: word frequency vs. word rank. The biggest data-structure cost comes from the few highest-frequency words (stop words); the mid-frequency words are the most useful (Luhn 57). Is "too rare" a problem?)

• Generalized Zipf's law: F(w) = C / [r(w) + B]^α — applicable in many domains
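A quick numerical sketch of the law with the slide's constants (α = 1, C = 0.1). Note that these crude constants predict a somewhat larger share for the top words than the empirical 10–15% / 35–40% figures quoted earlier, which is one motivation for the generalized form with B > 0.

```python
# Predicted relative frequency of the rank-r word under (generalized) Zipf.
def zipf_freq(rank, C=0.1, alpha=1.0, B=0.0):
    """F(w) = C / (r(w) + B)^alpha; B=0 gives the plain form."""
    return C / (rank + B) ** alpha

top4 = sum(zipf_freq(r) for r in range(1, 5))    # share of top 4 words
top50 = sum(zipf_freq(r) for r in range(1, 51))  # share of top 50 words
print(round(top4, 3), round(top50, 3))
```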
Language Models for Retrieval (Ponte & Croft 98)

(Figure: each document has its own language model with unknown word probabilities.)

Doc 1, a "text mining paper": text ?, mining ?, association ?, clustering ?, …, food ?, …
Doc 2, a "food nutrition paper": food ?, nutrition ?, healthy ?, diet ?, …

Query = "data mining algorithms"

Which model would most likely have generated this query?
Ranking Docs by Query Likelihood

(Figure: estimate a language model for each document, then score each by the likelihood of the query.)

d1 → doc LM for d1 → p(q|d1)
d2 → doc LM for d2 → p(q|d2)
…
dN → doc LM for dN → p(q|dN)
Retrieval as Language Model Estimation
• Document ranking based on query likelihood

log p(q|d) = Σi log p(wi|d),  where q = w1 w2 … wn and p(·|d) is the document language model

• Retrieval problem ⇒ estimation of p(wi|d)
• Smoothing is an important issue, and distinguishes different approaches
Problem with the ML Estimator
• What if a word doesn’t appear in the text?
• In general, what probability should we give a word
that has not been observed?
• If we want to assign non-zero probabilities to such
words, we’ll have to discount the probabilities of
observed words
• This is what “smoothing” is about …

Language Model Smoothing (Illustration)

(Figure: P(w) plotted over words w. The maximum-likelihood estimate

pML(w) = count of w / count of all words

is spiky; the smoothed LM lowers the probabilities of seen words and gives unseen words non-zero probability.)
A General Smoothing Scheme
• All smoothing methods try to
– discount the probability of words seen in a doc
– re-allocate the extra probability so that unseen words will have a non-zero probability
• Most use a reference model (collection language model) to discriminate unseen words

p(w|d) = pseen(w|d)   if w is seen in d   (discounted ML estimate)
         αd p(w|C)    otherwise           (collection language model)
Smoothing & TF-IDF Weighting
• Plugging the general smoothing scheme into the query-likelihood retrieval formula, we obtain

log p(q|d) = Σ_{wi ∈ d, wi ∈ q} log [ pseen(wi|d) / (αd p(wi|C)) ] + n log αd + Σi log p(wi|C)

where the first sum acts like TF-IDF weighting (pseen gives the TF part, dividing by p(wi|C) gives the IDF part), n log αd provides doc length normalization (a long doc is expected to have a smaller αd), and the last sum is the same for all documents, so it can be ignored for ranking.

• Smoothing with p(w|C) ⇒ TF-IDF weighting + length normalization
How to Smooth?
• All smoothing methods try to
– discount the probability of words seen in a document
– re-allocate the extra counts so that unseen words will have a non-zero count
• Method 1 (Additive smoothing): Add a constant (here 1) to the counts of each word

p(w|d) = (c(w,d) + 1) / (|d| + |V|)

where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size.

• Problems?
Other Smoothing Methods
• Method 2 (Absolute discounting): Subtract a constant δ from the counts of each word

p(w|d) = [ max(c(w,d) − δ, 0) + δ |d|u p(w|REF) ] / |d|

where |d|u is the number of unique words in d.

• Method 3 (Linear interpolation, Jelinek-Mercer): "Shrink" uniformly toward p(w|REF)

p(w|d) = (1 − λ) c(w,d)/|d| + λ p(w|REF)

where c(w,d)/|d| is the ML estimate and λ is the smoothing parameter.
Other Smoothing Methods (cont.)
• Method 4 (Dirichlet prior/Bayesian): Assume μ pseudo counts distributed as p(w|REF)

p(w|d) = [ c(w,d) + μ p(w|REF) ] / (|d| + μ)
       = (|d| / (|d| + μ)) · c(w,d)/|d| + (μ / (|d| + μ)) · p(w|REF)

where μ is the smoothing parameter.

• Method 5 (Good-Turing): Assume the total count of unseen events to be n1 (the number of singletons), and adjust the counts of seen events in the same way

p(w|d) = c*(w,d) / |d|,  with c*(w,d) = (r + 1) n_{r+1} / n_r, where r = c(w,d)

e.g., 0* = n1/n0, 1* = 2·n2/n1, …  What if nr = 0? What about p(w|REF)?
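Methods 3 and 4 can be sketched directly from their formulas. This is an illustrative implementation with made-up counts and collection probabilities, not tuned parameters; it shows how both methods give the unseen word "data" a non-zero probability.

```python
# Sketch of Jelinek-Mercer and Dirichlet smoothing for a unigram doc LM.
from collections import Counter

def jelinek_mercer(w, doc_counts, doc_len, p_ref, lam=0.5):
    """p(w|d) = (1 - lambda) * c(w,d)/|d| + lambda * p(w|REF)."""
    return (1 - lam) * doc_counts.get(w, 0) / doc_len + lam * p_ref.get(w, 0.0)

def dirichlet(w, doc_counts, doc_len, p_ref, mu=2000):
    """p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)."""
    return (doc_counts.get(w, 0) + mu * p_ref.get(w, 0.0)) / (doc_len + mu)

doc = Counter({"text": 10, "mining": 5})          # hypothetical doc counts
p_ref = {"text": 0.01, "mining": 0.005, "data": 0.02}  # hypothetical p(w|C)
n = 100  # |d|

# Unseen word "data" gets non-zero probability under both methods:
print(jelinek_mercer("data", doc, n, p_ref))        # 0.5 * 0.02 = 0.01
print(dirichlet("data", doc, n, p_ref, mu=100))     # 100*0.02/200 = 0.01
```

Note how Dirichlet's effective interpolation weight μ/(|d|+μ) shrinks as the document grows, while Jelinek-Mercer's λ is fixed; this is one reason their relative performance depends on query and document length.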
So, which method is the best?

It depends on the data and the task!
Many other sophisticated smoothing methods have been
proposed…
Cross validation is generally used to choose the best
method and/or set the smoothing parameters…
For retrieval, Dirichlet prior performs well…

Comparison of Three Methods

Relative performance (precision) of Jelinek-Mercer (JM), Dirichlet (Dir.), and absolute discounting (AD):

Query type    JM      Dir.    AD
Title query   0.228   0.256   0.237
Long query    0.278   0.276   0.260
Applications of Basic IR
Techniques

Some “Basic” IR Techniques
• Stemming
• Stop words
• Weighting of terms (e.g., TF-IDF)
• Vector/Unigram representation of text
• Text similarity (e.g., cosine, KL-div)
• Relevance/pseudo feedback (e.g., Rocchio)

They are not just for retrieval!
Generality of Basic Techniques

(Figure: a common processing pipeline feeds many applications.)

Raw text → (stemming & stop words) → tokenized text → (term weighting) → term-document weight matrix over terms t1 t2 … tn:

d1: w11 w12 … w1n
d2: w21 w22 … w2n
…
dm: wm1 wm2 … wmn

From the weighted vectors:
• Term similarity → term CLUSTERING
• Doc similarity → doc CLUSTERING
• Centroid computation + sentence selection → SUMMARIZATION
• Matching docs against category vectors → CATEGORIZATION (meta-data/annotation)
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization

Information Filtering
• Stable & long-term interest, dynamic info source
• System must make a delivery decision immediately as a document arrives

(Figure: a stream of documents flows into the Filtering System, which matches each one against "my interest" and decides whether to deliver it.)
A Vector-Space Filtering Model

(Figure: doc vector → Scoring (against a profile vector) → Thresholding (against a threshold) → yes/no delivery decision → utility Evaluation. Feedback information drives both Vector Learning and Threshold Learning.)

Utility: F = 3·R+ − 2·N+
R+: yes & correct
N+: yes & incorrect
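The linear utility above is easy to compute. A minimal sketch, assuming boolean delivery decisions and relevance judgments (the function name and input format are illustrative, not from the slides):

```python
# Linear utility for filtering: F = 3*R_plus - 2*N_plus, where
# R_plus = delivered & relevant docs, N_plus = delivered & non-relevant docs.
def utility(decisions, relevance, reward=3, penalty=2):
    """decisions, relevance: parallel lists of booleans (delivered?, relevant?)."""
    r_plus = sum(1 for d, r in zip(decisions, relevance) if d and r)
    n_plus = sum(1 for d, r in zip(decisions, relevance) if d and not r)
    return reward * r_plus - penalty * n_plus

# Deliver 3 of 4 docs; 2 delivered are relevant, 1 is not: F = 3*2 - 2*1 = 4.
print(utility([True, True, True, False], [True, True, False, True]))  # 4
```

The asymmetric reward/penalty is what makes threshold setting crucial: over-delivery is punished at a different rate than under-delivery is (implicitly) forgone.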
Issues in Information Filtering
• Threshold setting
– Crucial for binary decision making
– Must avoid under-delivery or over-delivery
• Initialization
• Learning from limited and biased feedback
– Only delivered documents get feedback info
– How to learn a threshold?
– Exploitation vs. exploration
• Other issues (redundancy, interest shift, etc.)
Examples of Information Filtering
• News filtering
• Email filtering
• Recommender systems
• And many others
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization
Text Categorization
• Pre-given categories and labeled document examples (categories may form a hierarchy)
• Classify new documents
• A standard supervised learning problem

(Figure: labeled examples (Sports, Education, Science, …) train the Categorization System, which then labels new documents.)
"Retrieval-based" Categorization
• Treat each category as representing an "information need"
• Treat examples in each category as "relevant documents"
• Use feedback approaches to learn a good "query"
• Match all the learned queries to a new document
• A document gets the category (or categories) represented by the best matching query (or queries)
Prototype-based Classifier
• Key elements ("retrieval techniques")
– Prototype/document representation (e.g., term vector)
– Document-prototype distance measure (e.g., dot product)
– Prototype vector learning: Rocchio feedback
• Example (figure omitted)
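The three key elements can be sketched concretely. This is a simplified illustration, assuming term-vector representation, dot-product similarity, and a prototype learned as the centroid of positive examples only (a degenerate Rocchio with no negative-example term); all data below is made up.

```python
# Prototype-based classifier sketch: centroid prototypes + dot-product matching.
from collections import defaultdict

def centroid(vectors):
    """Prototype = mean of the training term vectors (simplified Rocchio)."""
    proto = defaultdict(float)
    for v in vectors:
        for term, w in v.items():
            proto[term] += w / len(vectors)
    return dict(proto)

def dot(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def classify(doc_vec, prototypes):
    """Assign the category whose prototype best matches the document."""
    return max(prototypes, key=lambda c: dot(doc_vec, prototypes[c]))

train = {
    "sports": [{"game": 1.0, "ball": 1.0}, {"game": 1.0, "team": 1.0}],
    "science": [{"theory": 1.0, "experiment": 1.0}],
}
prototypes = {c: centroid(vs) for c, vs in train.items()}
print(classify({"game": 2.0, "team": 1.0}, prototypes))  # sports
```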
K-Nearest Neighbor Classifier
• Keep all training examples
• Find the k examples that are most similar to the new document ("neighbor" documents)
• Assign the category that is most common in these neighbor documents (neighbors vote for the category)
• Can be improved by considering the distance of a neighbor (a closer neighbor has more influence)
• Technical elements ("retrieval techniques")
– Document representation
– Document distance measure
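The voting steps above can be sketched as follows, assuming term-vector documents and cosine similarity (all example data is invented; unweighted majority voting, without the distance-weighting refinement):

```python
# Minimal k-NN text classifier: rank training examples by similarity, vote.
from collections import Counter

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sum(w * w for w in u.values()) ** 0.5
    nv = sum(w * w for w in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(doc, examples, k=3):
    """examples: list of (term_vector, label) pairs; majority vote of top k."""
    neighbors = sorted(examples, key=lambda ex: cosine(doc, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

examples = [
    ({"game": 1.0, "ball": 1.0}, "sports"),
    ({"game": 1.0, "team": 1.0}, "sports"),
    ({"theory": 1.0}, "science"),
]
print(knn_classify({"game": 1.0}, examples, k=3))  # sports
```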
Example of K-NN Classifier

(Figure: with k=1 the single nearest neighbor determines the label; with k=4 the four nearest neighbors vote.)
Examples of Text Categorization
• News article classification
• Meta-data annotation
• Automatic email sorting
• Web page classification
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization
The Clustering Problem
• Discover "natural structure"
• Group similar objects together
• Objects can be documents, terms, or passages
• Example (figure omitted)
Similarity-based Clustering
(as opposed to “model-based”)
• Define a similarity function to measure
similarity between two objects
• Gradually group similar objects together in a
bottom-up fashion
• Stop when some stopping criterion is met
• Variations: different ways to compute group
similarity based on individual object
similarity

Similarity-induced Structure

(Figure: the hierarchical grouping structure induced over the objects by their pairwise similarities.)
How to Compute Group Similarity?

Three popular methods, given two groups g1 and g2:

• Single-link algorithm: s(g1,g2) = similarity of the closest pair
• Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
• Average-link algorithm: s(g1,g2) = average similarity over all pairs
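The three definitions translate directly into code. A minimal sketch, assuming an arbitrary pairwise similarity function `sim` (the toy data below uses 1-D points with similarity defined as negative distance, purely for illustration):

```python
# The three group-similarity definitions for agglomerative clustering.
def single_link(g1, g2, sim):
    return max(sim(a, b) for a in g1 for b in g2)   # closest pair

def complete_link(g1, g2, sim):
    return min(sim(a, b) for a in g1 for b in g2)   # farthest pair

def average_link(g1, g2, sim):
    return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))

# Toy 1-D objects; higher (less negative) similarity = closer.
sim = lambda a, b: -abs(a - b)
g1, g2 = [0.0, 1.0], [3.0, 5.0]
print(single_link(g1, g2, sim))    # -2.0 (pair 1.0, 3.0)
print(complete_link(g1, g2, sim))  # -5.0 (pair 0.0, 5.0)
print(average_link(g1, g2, sim))   # -3.5
```

Single-link tends to produce elongated "chained" clusters, complete-link compact ones; average-link sits in between, which is why the choice matters in practice.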
Three Methods Illustrated

(Figure: two groups of points g1 and g2; the three methods differ in which cross-group pair(s) define s(g1,g2).)
Examples of Doc/Term Clustering
• Clustering of retrieval results
• Clustering of documents in the whole collection
• Term clustering to define “concept” or “theme”
• In general, very useful for text mining

Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization
The Summarization Problem
• Essentially "semantic compression" of text
• Selection-based vs. generation-based summary
• In general, we need a purpose for summarization, but it is hard to define one
"Retrieval-based" Summarization
• Observation: term vector → summary?
• Basic approach
– Rank "sentences", and select the top N as a summary
• Methods for ranking sentences
– Based on term weights
– Based on position of sentences
– Based on the similarity between sentence and document vectors
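The third ranking method can be sketched in a few lines. This is an illustrative toy, assuming raw term-count vectors and cosine similarity (no term weighting, stemming, or stop-word removal, and the sentences are invented):

```python
# Selection-based summarization sketch: rank sentences by cosine similarity
# to the whole-document vector, keep the top N.
from collections import Counter

def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = sum(w * w for w in u.values()) ** 0.5
    nv = sum(w * w for w in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def summarize(sentences, n=1):
    vecs = [Counter(s.lower().split()) for s in sentences]
    doc_vec = sum(vecs, Counter())  # document vector = sum of sentence vectors
    ranked = sorted(sentences,
                    key=lambda s: cosine(Counter(s.lower().split()), doc_vec),
                    reverse=True)
    return ranked[:n]

sents = ["text mining finds patterns in text",
         "the weather was nice",
         "mining needs good models"]
print(summarize(sents, n=1))  # the sentence closest to the document's topic
```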
Simple Discourse Analysis

(Figure: the document is split into consecutive text blocks, each represented as a term vector (vector 1, vector 2, vector 3, …, vector n−1, vector n). Computing the similarity between each pair of adjacent vectors reveals topic shifts: low similarity between neighbors suggests a segment boundary.)
A Simple Summarization Method

(Figure: the document is divided into segments; from each segment, the sentence most similar to the overall doc vector is selected (sentence 1, sentence 2, sentence 3, …), and the selected sentences form the summary.)
Examples of Summarization
• News summary
• Summarize retrieval results
– Single doc summary
– Multi-doc summary
• Summarize a cluster of documents (automatic label
creation for clusters)

What You Should Know
• Language models are new retrieval models with