Lecture 16 Text Databases Information Retrieval Part II
Lecture 16:
Text Databases &
Information Retrieval: Part II
Oct. 20, 2006
ChengXiang Zhai
CS511 Advanced Database Management Systems 1
The Notion of Relevance
Relevance can be modeled in three ways:
• Similarity between $(Rep(q), Rep(d))$; models differ in representation & similarity
– Vector space model (Salton et al., 75)
– Prob. distr. model (Wong & Yao, 89)
– …
• Probability of relevance $P(r=1 \mid q,d)$, $r \in \{0,1\}$
– Regression model (Fox 83)
– Generative model, doc generation: classical prob. model (Robertson & Sparck Jones, 76)
– Generative model, query generation: LM approach (Ponte & Croft, 98), (Lafferty & Zhai, 01a)
• Probabilistic inference $P(d \rightarrow q)$ or $P(q \rightarrow d)$; models differ in inference system
– Prob. concept space model (Wong & Yao, 95)
– Inference network model (Turtle & Croft, 91)
What is a Statistical LM?
• A probability distribution over word sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001
• Context-dependent!
• Can also be regarded as a probabilistic
mechanism for “generating” text, thus also
called a “generative” model
Why is a LM Useful?
• Provides a principled way to quantify the
uncertainties associated with natural
language
• Allows us to answer questions like:
– Given that we see “John” and “feels”, how likely will we see
“happy” as opposed to “habit” as the next word?
(speech recognition)
– Given that we observe “baseball” three times and “game”
once in a news article, how likely is it about “sports”?
(text categorization, information retrieval)
– Given that a user is interested in sports news, how likely
would the user use “baseball” in a query?
(information retrieval)
Basic Issues
• Define the probabilistic model
– Event, Random Variables, Joint/Conditional Prob’s
– $P(w_1 w_2 \ldots w_n) = f(\theta_1, \theta_2, \ldots, \theta_n)$
• Estimate model parameters
– Tune the model to best fit the data and our prior
knowledge
– $\theta_i = ?$
• Apply the model to a particular task
– Many applications
The Simplest Language Model
(Unigram Model)
• Generate a piece of text by generating each word
INDEPENDENTLY
• Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn)
• Parameters: $\{p(w_i)\}$, with $p(w_1) + \cdots + p(w_N) = 1$ ($N$ is the vocabulary size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn
according to this word distribution
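A minimal Python sketch of the unigram model may help here. The four-word vocabulary and its probabilities are made up for illustration; `sequence_log_prob` and `sample_text` are hypothetical helper names, not part of the lecture.

```python
import math
import random

# Hypothetical toy model: the words and probabilities are made up and sum to 1.
unigram = {"text": 0.4, "mining": 0.3, "paper": 0.2, "food": 0.1}

def sequence_log_prob(words, model):
    """log p(w1 ... wn) = sum_i log p(wi), because words are generated independently."""
    return sum(math.log(model[w]) for w in words)

def sample_text(model, n, seed=0):
    """Draw n words independently from the word distribution (a multinomial sample)."""
    rng = random.Random(seed)
    vocab = list(model)
    return rng.choices(vocab, weights=[model[w] for w in vocab], k=n)

# p("text mining paper") = 0.4 * 0.3 * 0.2
prob = math.exp(sequence_log_prob(["text", "mining", "paper"], unigram))
sample = sample_text(unigram, 5)
```

Because each word is generated independently, reordering the words leaves the probability unchanged: the unigram model ignores word order.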
Text Generation with Unigram LM
[Figure: a (unigram) language model $p(w \mid \theta)$ generates a document by sampling.]
• Topic 1 (text mining): … text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, … → sampled document: a text mining paper
• Topic 2 (health): … food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … → sampled document: a food nutrition paper
Estimation of Unigram LM
[Figure: estimating a (unigram) language model $p(w \mid \theta) = ?$ from a document.]
• Document: a “text mining paper” (total #words = 100)
• Word counts: text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1
• Estimates: p(text) = 10/100, p(mining) = 5/100, p(association) = 3/100, p(database) = 3/100, …, p(query) = 1/100, …
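The estimate on this slide is just relative frequency. A short sketch, assuming the counts shown on the slide and padding the document with a hypothetical filler token `other` so that the total is 100 words:

```python
from collections import Counter

def ml_unigram(tokens):
    """Maximum likelihood estimate: p(w) = count of w / total number of words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Counts from the slide; "other" is a made-up filler so that the total is 100.
doc = (["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["database"] * 3
       + ["algorithm"] * 2 + ["query"] + ["efficient"] + ["other"] * 75)

p_ml = ml_unigram(doc)
```

Here `p_ml["text"]` is 10/100 and `p_ml["query"]` is 1/100, while a word such as "food" that never occurs in the document gets no entry at all, which is the problem smoothing addresses.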
Empirical distribution of words
• There are stable language-independent patterns in
how people use natural languages
• A few words occur very frequently; most occur rarely.
E.g., in news articles,
– Top 4 words: 10~15% word occurrences
– Top 50 words: 35~40% word occurrences
• The most frequent word in one corpus may be rare in
another
Zipf’s Law
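• rank × frequency ≈ constant:
$$F(w) = \frac{C}{r(w)^{\alpha}}, \quad \alpha \approx 1,\ C \approx 0.1$$
[Figure: word frequency plotted against word rank (by frequency). Annotations: the highest-frequency words (stop words) produce the biggest data structure; the most useful words (Luhn 57) lie in the middle range; for the low-frequency tail: is “too rare” a problem?]
• Generalized Zipf’s law (applicable in many domains):
$$F(w) = \frac{C}{[r(w) + B]^{\alpha}}$$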
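One way to check Zipf's law on a corpus is to tabulate rank × relative frequency and see whether it stays roughly constant. The sketch below uses a synthetic token list whose counts (12, 6, 4, 3) were chosen to follow the law exactly; real text only follows it approximately. `rank_frequency` is a hypothetical helper name.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return rows (rank, word, relative frequency, rank * frequency), most frequent first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = []
    for rank, (word, c) in enumerate(counts.most_common(), start=1):
        freq = c / total
        rows.append((rank, word, freq, rank * freq))
    return rows

# Synthetic data: counts 12, 6, 4, 3 are proportional to 1/rank.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
rows = rank_frequency(tokens)
```

For this synthetic input every row has rank × frequency = 0.48, so C = 0.48 and α = 1.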
Language Models for Retrieval
(Ponte & Croft 98)
[Figure: each document has its own language model, with word probabilities to be estimated.]
• Text mining paper → language model: … text ?, mining ?, association ?, clustering ?, …, food ?, …
• Food nutrition paper → language model: … food ?, nutrition ?, healthy ?, diet ?, …
• Query = “data mining algorithms”
• Which model would most likely have generated this query?
Ranking Docs by Query Likelihood
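Doc → Doc LM → Query likelihood (for a query q)
• $d_1 \rightarrow \theta_{d_1} \rightarrow p(q \mid \theta_{d_1})$
• $d_2 \rightarrow \theta_{d_2} \rightarrow p(q \mid \theta_{d_2})$
• …
• $d_N \rightarrow \theta_{d_N} \rightarrow p(q \mid \theta_{d_N})$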
Retrieval as
Language Model Estimation
• Document ranking based on query likelihood
$$\log p(q \mid d) = \sum_{i=1}^{n} \log p(w_i \mid d), \quad \text{where } q = w_1 w_2 \ldots w_n$$
Here $p(w_i \mid d)$ is the document language model.
• The retrieval problem reduces to estimation of $p(w_i \mid d)$
• Smoothing is an important issue, and
distinguishes different approaches
Problem with the ML Estimator
• What if a word doesn’t appear in the text?
• In general, what probability should we give a word
that has not been observed?
• If we want to assign non-zero probabilities to such
words, we’ll have to discount the probabilities of
observed words
• This is what “smoothing” is about …
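To see the problem concretely, here is a small sketch of query likelihood under the unsmoothed ML estimate. The document and queries are made-up examples, and `ml_query_log_likelihood` is a hypothetical helper name.

```python
import math
from collections import Counter

def ml_query_log_likelihood(query, doc_tokens):
    """log p(q|d) = sum_i log p_ML(wi|d); a single unseen query word gives -infinity."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    score = 0.0
    for w in query:
        p = counts[w] / total
        if p == 0.0:
            return float("-inf")  # log 0: the whole query gets zero probability
        score += math.log(p)
    return score

doc = "text mining paper on text clustering".split()
seen = ml_query_log_likelihood(["text", "mining"], doc)    # log(2/6) + log(1/6)
unseen = ml_query_log_likelihood(["data", "mining"], doc)  # "data" is not in doc
```

The second query matches the document on "mining", yet its likelihood is zero because "data" was never observed. With the ML estimator, any document missing one query word is ruled out entirely.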
Language Model Smoothing
(Illustration)
[Figure: P(w) plotted over words w, comparing the max. likelihood estimate with the smoothed LM.]
Max. likelihood estimate:
$$p_{ML}(w) = \frac{\text{count of } w}{\text{count of all words}}$$
A General Smoothing Scheme
• All smoothing methods try to
– discount the probability of words seen in a doc
– re-allocate the extra probability so that unseen
words will have a non-zero probability
• Most use a reference model (collection language
model) to discriminate unseen words
$$p(w \mid d) = \begin{cases} p_{seen}(w \mid d) & \text{if } w \text{ is seen in } d \\ \alpha_d \, p(w \mid C) & \text{otherwise} \end{cases}$$
where $p_{seen}(w \mid d)$ is the discounted ML estimate and $p(w \mid C)$ is the collection language model.
Smoothing & TF-IDF Weighting
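• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain
$$\log p(q \mid d) = \sum_{w_i \in d,\ w_i \in q} \log \frac{p_{seen}(w_i \mid d)}{\alpha_d \, p(w_i \mid C)} + n \log \alpha_d + \sum_{i=1}^{n} \log p(w_i \mid C)$$
– First term: TF weighting (the numerator) and IDF weighting (the $p(w_i \mid C)$ in the denominator)
– Second term: doc length normalization (a long doc is expected to have a smaller $\alpha_d$)
– Third term: the same for every document, so ignore it for ranking
• Smoothing with p(w|C) ≈ TF-IDF + length norm.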
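The step from the general smoothing scheme to this formula is short. A sketch of the derivation, using the same symbols and writing $\alpha_d$ for the document-dependent coefficient on the collection model, splits the sum over query words into words seen in d and words not seen in d, then adds and subtracts the seen-word part of the collection term:

```latex
\begin{align*}
\log p(q \mid d)
 &= \sum_{i:\, w_i \in d} \log p_{seen}(w_i \mid d)
  + \sum_{i:\, w_i \notin d} \log \alpha_d \, p(w_i \mid C) \\
 &= \sum_{i:\, w_i \in d} \log p_{seen}(w_i \mid d)
  + \sum_{i=1}^{n} \log \alpha_d \, p(w_i \mid C)
  - \sum_{i:\, w_i \in d} \log \alpha_d \, p(w_i \mid C) \\
 &= \sum_{i:\, w_i \in d} \log \frac{p_{seen}(w_i \mid d)}{\alpha_d \, p(w_i \mid C)}
  + n \log \alpha_d
  + \sum_{i=1}^{n} \log p(w_i \mid C)
\end{align*}
```

Only query words that occur in the document contribute to the first sum, which is why scoring can be done by touching matched terms only.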
How to Smooth?
• All smoothing methods try to
– discount the probability of words seen in a
document
– re-allocate the extra counts so that unseen
words will have a non-zero count
• Method 1 (Additive smoothing): Add a
constant to the counts of each word
“Add one”, Laplace smoothing:
$$p(w \mid d) = \frac{c(w,d) + 1}{|d| + |V|}$$
where $c(w,d)$ is the count of w in d, $|d|$ is the length of d (total counts), and $|V|$ is the vocabulary size.
• Problems?
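A minimal sketch of add-one smoothing, assuming a made-up four-word vocabulary that contains every word in the document; `add_one` is a hypothetical helper name.

```python
from collections import Counter

def add_one(word, doc_tokens, vocab):
    """Laplace smoothing: p(w|d) = (c(w,d) + 1) / (|d| + |V|)."""
    counts = Counter(doc_tokens)
    return (counts[word] + 1) / (len(doc_tokens) + len(vocab))

# Hypothetical toy data.
vocab = {"text", "mining", "food", "diet"}
doc = ["text", "mining", "text"]

p_text = add_one("text", doc, vocab)   # (2 + 1) / (3 + 4)
p_food = add_one("food", doc, vocab)   # (0 + 1) / (3 + 4)
p_diet = add_one("diet", doc, vocab)   # (0 + 1) / (3 + 4)
```

One issue this makes visible: every unseen word receives the same probability, however common or rare it is in general, and with a large vocabulary the unseen words can absorb most of the probability mass of a short document.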
Other Smoothing Methods
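• Method 2 (Absolute discounting): Subtract a constant $\delta$ from the counts of each word
$$p(w \mid d) = \frac{\max(c(w,d) - \delta,\ 0) + \delta \, |d|_u \, p(w \mid REF)}{|d|}$$
where $|d|_u$ is the # of unique words in d.
• Method 3 (Linear interpolation, Jelinek-Mercer): “Shrink” uniformly toward p(w|REF)
$$p(w \mid d) = (1 - \lambda) \frac{c(w,d)}{|d|} + \lambda \, p(w \mid REF)$$
where $c(w,d)/|d|$ is the ML estimate and $\lambda$ is the parameter.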
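Both methods can be sketched in a few lines. The reference model `p_ref`, the document, and the parameter values (delta = 0.7, lam = 0.5) are made up for illustration; the function names are hypothetical. The sketch assumes every document word has an entry in `p_ref`.

```python
from collections import Counter

def absolute_discount(word, doc_tokens, p_ref, delta=0.7):
    """p(w|d) = (max(c(w,d) - delta, 0) + delta * |d|_u * p(w|REF)) / |d|."""
    counts = Counter(doc_tokens)
    unique = len(counts)  # |d|_u: number of unique words in d
    return (max(counts[word] - delta, 0.0) + delta * unique * p_ref[word]) / len(doc_tokens)

def jelinek_mercer(word, doc_tokens, p_ref, lam=0.5):
    """p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|REF)."""
    counts = Counter(doc_tokens)
    return (1 - lam) * counts[word] / len(doc_tokens) + lam * p_ref[word]

# Hypothetical reference (collection) model; probabilities sum to 1.
p_ref = {"text": 0.4, "mining": 0.3, "food": 0.2, "diet": 0.1}
doc = ["text", "mining", "text"]

ad_total = sum(absolute_discount(w, doc, p_ref) for w in p_ref)
jm_total = sum(jelinek_mercer(w, doc, p_ref) for w in p_ref)
```

In both cases the smoothed probabilities still sum to 1, and unseen words such as "food" get a non-zero probability proportional to `p_ref`.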
Other Smoothing Methods (cont.)
• Method 4 (Dirichlet Prior/Bayesian): Assume pseudo counts $\mu \, p(w \mid REF)$
$$p(w \mid d) = \frac{c(w,d) + \mu \, p(w \mid REF)}{|d| + \mu} = \frac{|d|}{|d| + \mu} \cdot \frac{c(w,d)}{|d|} + \frac{\mu}{|d| + \mu} \, p(w \mid REF)$$
where $\mu$ is the parameter.
• Method 5 (Good Turing): Assume total # unseen events to be $n_1$ (# of singletons), and adjust the seen events in the same way
$$p(w \mid d) = \frac{c^*(w,d)}{|d|}; \quad c^*(w,d) = r^* = (r+1) \frac{n_{r+1}}{n_r}, \quad \text{where } r = c(w,d)$$
$$0^* = \frac{n_1}{n_0}, \quad 1^* = \frac{2 n_2}{n_1}, \ \ldots$$
What if $n_r = 0$? What about $p(w \mid REF)$?
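A short sketch of both methods. The toy data and the value mu = 10 are made up, and the function names are hypothetical. For Good-Turing, falling back to the raw count when no word occurs r + 1 times is an assumption of this sketch, one simple answer to the slide's open question, not a method given in the lecture.

```python
from collections import Counter

def dirichlet(word, doc_tokens, p_ref, mu=2000.0):
    """p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)."""
    counts = Counter(doc_tokens)
    return (counts[word] + mu * p_ref[word]) / (len(doc_tokens) + mu)

def good_turing_adjusted(doc_tokens):
    """Map each observed count r to r* = (r + 1) * n_{r+1} / n_r."""
    counts = Counter(doc_tokens)
    n = Counter(counts.values())  # n[r] = number of distinct words occurring r times
    adjusted = {}
    for r in n:
        if n[r + 1] > 0:
            adjusted[r] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[r] = float(r)  # assumption: keep the raw count when n_{r+1} = 0
    return adjusted

# Hypothetical reference model and documents.
p_ref = {"text": 0.4, "mining": 0.3, "food": 0.2, "diet": 0.1}
doc = ["text", "mining", "text"]
p_text = dirichlet("text", doc, p_ref, mu=10.0)   # (2 + 4) / (3 + 10)

# Counts a:3, b:2, c:1, d:1 give n1 = 2, n2 = 1, n3 = 1.
gt = good_turing_adjusted(list("aaabbcd"))
```

The Dirichlet estimate equals the interpolation form with a document-dependent weight |d|/(|d| + mu) on the ML estimate, so longer documents are smoothed less.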
So, which method is the best?
It depends on the data and the task!
Many other sophisticated smoothing methods have been
proposed…
Cross validation is generally used to choose the best
method and/or set the smoothing parameters…
For retrieval, Dirichlet prior performs well…
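Putting the pieces together, a minimal sketch of query-likelihood ranking with Dirichlet prior smoothing might look as follows. The three documents, the query, and mu = 10 are made up for illustration; the collection model is estimated from the documents themselves, and the sketch assumes every query word occurs somewhere in the collection.

```python
import math
from collections import Counter

def collection_model(docs):
    """p(w|C): relative frequency of w over all documents."""
    counts = Counter(w for tokens in docs.values() for w in tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score(query, doc_tokens, p_c, mu=10.0):
    """log p(q|d) with Dirichlet prior smoothing."""
    counts = Counter(doc_tokens)
    return sum(
        math.log((counts[w] + mu * p_c[w]) / (len(doc_tokens) + mu))
        for w in query
    )

# Hypothetical toy collection.
docs = {
    "d1": "text mining algorithms for text data".split(),
    "d2": "food nutrition and healthy diet".split(),
    "d3": "data mining and clustering algorithms".split(),
}
p_c = collection_model(docs)
query = "data mining algorithms".split()
ranking = sorted(docs, key=lambda d: score(query, docs[d], p_c), reverse=True)
```

Both d1 and d3 contain each query word once, but d3 is shorter and so ranks first, and d2 still receives a finite score even though it matches no query word.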
Comparison of Three Methods
Precision by query type and smoothing method:
Query Type    JM       Dir      AD
Title         0.228    0.256    0.237
Long          0.278    0.276    0.260
[Figure: bar chart of the relative performance (precision) of JM, Dir. and AD, for title queries and long queries.]
Applications of Basic IR Techniques
Some “Basic” IR Techniques
• Stemming
• Stop words
• Weighting of terms (e.g., TF-IDF)
• Vector/Unigram representation of text
• Text similarity (e.g., cosine, KL-div)
• Relevance/pseudo feedback (e.g., Rocchio)
They are not just for retrieval!
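Two of these building blocks, TF-IDF weighting and cosine similarity, can be sketched in a few lines. The weighting variant used here (raw term count times log(N/df)) is one common choice among many and is an assumption of this sketch; the documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by tf * log(N / df); docs is a list of token lists."""
    n = len(docs)
    df = Counter(w for tokens in docs for w in set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy documents.
docs = [
    "text mining text".split(),
    "mining data".split(),
    "healthy food diet".split(),
]
v0, v1, v2 = tfidf_vectors(docs)
```

Here `cosine(v0, v1)` is positive because the two documents share "mining", while `cosine(v0, v2)` is 0 because they share no terms.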
Generality of Basic Techniques
[Figure: raw text passes through stemming & stop words to give tokenized text; term weighting then gives a document-term weight matrix:]
        t1     t2     …    tn
d1      w11    w12    …    w1n
d2      w21    w22    …    w2n
…       …      …           …
dm      wm1    wm2    …    wmn
[The figure connects this matrix to the applications: term similarity and doc similarity lead to CLUSTERING; sentence selection leads to SUMMARIZATION; vector centroids lead to CATEGORIZATION and META-DATA/ANNOTATION.]
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization
Information Filtering
• Stable & long term interest, dynamic info source
• System must make a delivery decision immediately as a document “arrives”
[Figure: a stream of documents passes through a filtering system that holds “my interest”.]
A Vector-Space Filtering Model
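[Figure: each doc vector is scored against the profile vector, then thresholding against the threshold gives a yes/no delivery decision. Delivered documents go to utility evaluation, and the resulting feedback information drives vector learning (for the profile vector) and threshold learning (for the threshold).]
Utility: F = 3·R+ − 2·N+
• R+: yes & correct
• N+: yes & incorrect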
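The scoring, thresholding, and utility steps can be sketched as follows. The profile vector, threshold, document stream, and relevance judgments are all made up; the dot product is used as the scoring function, which is one possible choice. Profile and threshold learning from feedback are not shown.

```python
def deliver(doc_vec, profile_vec, threshold):
    """Score a document against the profile (dot product) and threshold the score."""
    score = sum(weight * doc_vec.get(term, 0.0) for term, weight in profile_vec.items())
    return score >= threshold

def linear_utility(decisions, relevant):
    """F = 3 * R+ - 2 * N+, counting only delivered ("yes") documents."""
    r_plus = sum(1 for d, r in zip(decisions, relevant) if d and r)
    n_plus = sum(1 for d, r in zip(decisions, relevant) if d and not r)
    return 3 * r_plus - 2 * n_plus

# Hypothetical profile, stream, and judgments.
profile = {"baseball": 1.0, "game": 0.5}
stream = [
    {"baseball": 2, "game": 1},   # score 2.5
    {"stock": 3, "game": 1},      # score 0.5
    {"game": 2},                  # score 1.0
]
relevant = [True, False, False]

decisions = [deliver(doc, profile, threshold=1.0) for doc in stream]
utility = linear_utility(decisions, relevant)
```

With these values the first and third documents are delivered; one is relevant and one is not, so the utility is 3 − 2 = 1. Note that the second document, which was not delivered, never contributes feedback.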
Issues in Information Filtering
• Threshold setting
– Crucial for binary decision making
– Must avoid under-delivery or over-delivery
• Initialization
– What threshold should a system start with?
• Learning from limited and biased feedback
– Only delivered documents get feedback info
– How to learn a threshold?
– Exploitation vs. exploration
• Other issues (redundancy, interest shift, etc.)
Examples of Information Filtering
• News filtering
• Email filtering
• Recommender systems
• Literature alert
• And many others
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization
Text Categorization
• Pre-given categories and labeled document
examples (Categories may form hierarchy)
• Classify new documents
• A standard supervised learning problem
[Figure: a categorization system, given labeled examples for categories such as Sports, Business and Education, assigns new documents to categories (Sports, Business, Education, Science, …).]
“Retrieval-based” Categorization
• Treat each category as representing an
“information need”
• Treat examples in each category as “relevant
documents”
• Use feedback approaches to learn a good
“query”
• Match all the learned queries to a new document
• A document gets the category (or categories) represented by the best matching query (or queries)
Prototype-based Classifier
• Key elements (“retrieval techniques”)
– Prototype/document representation (e.g., term vector)
– Document-prototype distance measure (e.g., dot product)
– Prototype vector learning: Rocchio feedback
• Example
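As an illustration, here is a minimal prototype-based classifier in Python. It is a simplification of Rocchio feedback: each prototype is just the centroid of the category's own examples, with no negative examples subtracted. The training data and function names are made up.

```python
from collections import Counter, defaultdict

def learn_prototypes(examples):
    """Prototype of a category = average term-count vector of its examples."""
    sums = defaultdict(Counter)
    sizes = Counter()
    for tokens, category in examples:
        sums[category].update(tokens)
        sizes[category] += 1
    return {
        cat: {w: c / sizes[cat] for w, c in vec.items()}
        for cat, vec in sums.items()
    }

def classify(tokens, prototypes):
    """Assign the category whose prototype has the largest dot product with the document."""
    doc = Counter(tokens)

    def dot(proto):
        return sum(doc[w] * x for w, x in proto.items())

    return max(prototypes, key=lambda cat: dot(prototypes[cat]))

# Hypothetical labeled examples.
examples = [
    ("baseball game score".split(), "sports"),
    ("football game win".split(), "sports"),
    ("stock market price".split(), "business"),
    ("market share profit".split(), "business"),
]
prototypes = learn_prototypes(examples)
label_a = classify("the game score".split(), prototypes)
label_b = classify("stock price drop".split(), prototypes)
```

Each prototype plays the role of a learned "query" for its category, and classification is retrieval in reverse: the document is matched against all queries.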
K-Nearest Neighbor Classifier
• Keep all training examples
• Find k examples that are most similar to the new
document (“neighbor” documents)
• Assign the category that is most common in
these neighbor documents (neighbors vote for
the category)
• Can be improved by considering the distance of a
neighbor (a closer neighbor has more influence)
• Technical elements (“retrieval techniques”)
– Document representation
– Document distance measure
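A minimal sketch of a k-NN text classifier, assuming raw term-count vectors and cosine similarity as the (inverse) distance measure, with each neighbor's vote weighted by its similarity. The training examples and function names are made up.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two Counter term vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(tokens, training, k=3):
    """Find the k most similar training examples and let them vote, weighted by similarity."""
    doc = Counter(tokens)
    scored = [(cosine(doc, Counter(ex_tokens)), cat) for ex_tokens, cat in training]
    neighbors = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter()
    for sim, cat in neighbors:
        votes[cat] += sim  # a closer neighbor has more influence
    return votes.most_common(1)[0][0]

# Hypothetical labeled examples.
training = [
    ("baseball game score".split(), "sports"),
    ("football game win".split(), "sports"),
    ("stock market price".split(), "business"),
    ("market share profit".split(), "business"),
]
label_a = knn_classify("game score tonight".split(), training, k=3)
label_b = knn_classify("market price news".split(), training, k=3)
```

Unlike the prototype approach, nothing is learned in advance: all training examples are kept, and the work happens at classification time.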
Example of K-NN Classifier
[Figure: classifying a new document from its nearest neighbors, shown for k=1 and for k=4.]
Examples of Text Categorization
• News article classification
• Meta-data annotation
• Automatic Email sorting
• Web page classification
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization
The Clustering Problem
• Discover “natural structure”
• Group similar objects together
• Objects can be documents, terms, or passages
• Example
Similarity-based Clustering
(as opposed to “model-based”)
• Define a similarity function to measure
similarity between two objects
• Gradually group similar objects together in a
bottom-up fashion
• Stop when some stopping criterion is met
• Variations: different ways to compute group
similarity based on individual object
similarity
Similarity-induced Structure
How to Compute Group Similarity?
Three popular methods. Given two groups g1 and g2:
• Single-link algorithm: s(g1,g2) = similarity of the closest pair
• Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
• Average-link algorithm: s(g1,g2) = average of similarity of all pairs
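The three group-similarity definitions, together with a bottom-up clustering loop, can be sketched as follows. The objects are points on a line and similarity is negative distance, both chosen only to keep the example small; the stopping criterion used here (a target number of groups) is one simple option.

```python
from itertools import combinations

def single_link(g1, g2, sim):
    """Similarity of the closest (most similar) pair."""
    return max(sim(a, b) for a in g1 for b in g2)

def complete_link(g1, g2, sim):
    """Similarity of the farthest (least similar) pair."""
    return min(sim(a, b) for a in g1 for b in g2)

def average_link(g1, g2, sim):
    """Average similarity over all cross-group pairs."""
    values = [sim(a, b) for a in g1 for b in g2]
    return sum(values) / len(values)

def agglomerative(objects, sim, group_sim, stop_at):
    """Repeatedly merge the two most similar groups until stop_at groups remain."""
    groups = [[o] for o in objects]
    while len(groups) > stop_at:
        i, j = max(
            combinations(range(len(groups)), 2),
            key=lambda p: group_sim(groups[p[0]], groups[p[1]], sim),
        )
        groups[i] = groups[i] + groups[j]
        del groups[j]  # safe because i < j
    return groups

# Hypothetical toy objects: points on a line, similarity = negative distance.
points = [1.0, 2.0, 9.0, 10.0]
sim = lambda a, b: -abs(a - b)
clusters = agglomerative(points, sim, single_link, stop_at=2)
```

On this well-separated data all three group-similarity functions give the same two clusters; they differ on elongated or noisy groups, where single-link tends to chain groups together and complete-link favors compact ones.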
Three Methods Illustrated
[Figure: two groups g1 and g2, with the pairs used by the single-link algorithm (closest pair), the complete-link algorithm (farthest pair), and the average-link algorithm (all pairs).]
Examples of Doc/Term Clustering
• Clustering of retrieval results
• Clustering of documents in the whole collection
• Term clustering to define “concept” or “theme”
• Automatic construction of hyperlinks
• In general, very useful for text mining
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization
The Summarization Problem
• Essentially “semantic compression” of text
• Selection-based vs. generation-based summary
• In general, we need a purpose for summarization,
but it’s hard to define it
“Retrieval-based” Summarization
• Observation: can a term vector serve as a summary?
• Basic approach
– Rank “sentences”, and select top N as a summary
• Methods for ranking sentences
– Based on term weights
– Based on position of sentences
– Based on the similarity of sentence and document
vector
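The last ranking method, similarity between each sentence and the document vector, can be sketched as below. Raw term counts and cosine similarity are assumptions of this sketch, and the three "sentences" are made up.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two Counter term vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0

def summarize(sentences, n=2):
    """Rank sentences by similarity to the whole-document vector; return the
    indices of the top n, in their original order."""
    doc_vec = Counter(w for s in sentences for w in s)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(Counter(sentences[i]), doc_vec),
        reverse=True,
    )
    return sorted(ranked[:n])

# Hypothetical toy document, one token list per sentence.
sentences = [
    "text mining finds patterns in text".split(),
    "lunch was good".split(),
    "mining text needs text representation".split(),
]
summary = summarize(sentences, n=2)
```

The off-topic second sentence shares little with the document vector and is dropped, so the selected indices are 0 and 2.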
Simple Discourse Analysis
[Figure: the document is split into consecutive blocks of text, each represented by a vector (vector 1, vector 2, vector 3, …, vector n-1, vector n); the similarity between adjacent vectors is computed along the document.]
A Simple Summarization Method
[Figure: the document is divided into segments; in each segment, the sentence most similar to the doc vector is selected (sentence 1, sentence 2, sentence 3, …), and the selected sentences form the summary.]
Examples of Summarization
• News summary
• Summarize retrieval results
– Single doc summary
– Multi-doc summary
• Summarize a cluster of documents (automatic label
creation for clusters)
What You Should Know
• Language models are new retrieval models with
many advantages
• The retrieval techniques can be used to do more
than just search