Text Similarity
Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside,CA 92521
eamonn@cs.ucr.edu
Word Length | Twain Sample 1 | Twain Sample 2 | Twain Sample 3 | Twain Sample 4 | Twain Sample 5 | Snodgrass
1 74 312 116 138 122 424
2 349 1146 496 532 466 2685
3 456 1394 673 741 653 2752
4 374 1177 565 591 517 2302
5 212 661 381 357 343 1431
6 127 442 249 258 207 992
7 107 367 185 215 152 896
8 84 231 125 150 103 638
9 45 181 94 83 92 465
10 27 109 51 55 45 276
11 13 50 23 30 18 152
12 8 24 8 10 12 101
13+ 9 12 8 9 9 61
[Figure: word-length distributions for Sample 1 and Sample 2, shown as raw counts (0 to 1600) and as proportions (0 to 0.3) over word lengths 1 to 13]
Information Retrieval
• Task Statement:
Build a system that retrieves documents that users
are likely to find relevant to their queries.
• This assumption underlies the field of
Information Retrieval.
[Diagram: the IR pipeline. Boxes: Information need, Collections, text input, Pre-process, Parse, Query, Index, Rank, Evaluate. Callouts: "How is the query constructed?" and "How is the text processed?"]
Terminology
Token: A natural language word, e.g. “Swim”, “Simpson”, “92513”.
Document: Usually a web page, but more generally any file.
Some IR History
– Roots in the scientific “Information Explosion” following
WWII
– Interest in computer-based IR from mid 1950’s
• H.P. Luhn at IBM (1958)
• Probabilistic models at Rand (Maron & Kuhns) (1960)
• Boolean system development at Lockheed (‘60s)
• Vector Space Model (Salton at Cornell 1965)
• Statistical Weighting methods and theoretical advances (‘70s)
• Refinements and Advances in application (‘80s)
• User Interfaces, Large-scale testing and application (‘90s)
Relevance
• In what ways can a document be relevant to a query?
– Answer precise question precisely.
– Who is Homer’s Boss? Montgomery Burns.
– Partially answer question.
– Where does Homer work? Power Plant.
– Suggest a source for more information.
– What is Bart’s middle name? Look in Issue 234 of Fanzine
– Give background information.
– Remind the user of other knowledge.
– Others ...
The section that follows is about
Content Analysis
(transforming raw text into a
computationally more manageable form)
Stemming and Morphological Analysis
• Goal: “normalize” similar words
• Morphology (“form” of words)
– Inflectional Morphology
• E.g., inflect verb endings and noun number
• Never change grammatical class
– dog, dogs
– Bike, Biking
– Swim, Swimmer, Swimming
What about… build, building;
Examples of Stemming (using Porter's algorithm)
Original Words    Stemmed Words
…                 …
consign           consign
consigned         consign
consigning        consign
consignment       consign
consist           consist
consisted         consist
consistency       consist
consistent        consist
consistently      consist
consisting        consist
consists          consist
…                 …
Porter's algorithm is available in Java, C, Lisp, Perl, Python, etc. from
http://www.tartarus.org/~martin/PorterStemmer/
Errors Generated by Porter
Stemmer (Krovetz 93)
Too Aggressive Too Timid
organization/organ european/europe
policy/police cylinder/cylindrical
execute/executive create/creation
arm/army search/searcher
Homework!! Play with the following URL
http://fusion.scs.carleton.ca/~dquesnel/java/stuff/PorterApplet.html
Statistical Properties of Text
• Token occurrences in text are not uniformly
distributed
• They are also not normally distributed
• They do exhibit a Zipf distribution
Government documents, 157734 tokens, 32259 unique
8164 the 969 on 1 ABC
4771 of 915 FT 1 ABFT
4005 to 883 Mr 1 ABOUT
2834 a 860 was 1 ACFT
2827 and 855 be 1 ACI
2802 in 849 Pounds 1 ACQUI
1592 The 798 TEXT 1 ACQUISITIONS
1370 for 798 PUB 1 ACSIS
1326 is 798 PROFILE 1 ADFT
1324 s 798 PAGE 1 ADVISERS
1194 that 798 HEADLINE 1 AE
973 by 798 DOCNO
Plotting Word Frequency by Rank
• Main idea: count
– How many times tokens occur in the text
• Over all texts in the collection
• Now order the tokens by how often they occur. A token's position
in this ordering is called its rank.
The Corresponding Zipf Curve
Rank Freq Term
1 37 system
2 32 knowledg
3 24 base
4 20 problem
5 18 abstract
6 15 model
7 15 languag
8 15 implem
9 13 reason
10 13 inform
11 11 expert
12 11 analysi
13 10 rule
14 10 program
15 10 oper
16 10 evalu
17 10 comput
18 10 case
19 9 gener
20 9 form
Zipf Distribution
• The Important Points:
– a few elements occur very frequently
– a medium number of elements have medium
frequency
– many elements occur very infrequently
Zipf Distribution
• The product of the frequency of words (f) and
their rank (r) is approximately constant
– Rank = order of words’ frequency of occurrence
$f = C \cdot \frac{1}{r}$
$C \approx N / 10$
• Another way to state this is with an approximately correct
rule of thumb:
– Say the most common term occurs C times
– The second most common occurs C/2 times
– The third most common occurs C/3 times
– …
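The rule of thumb above can be checked on any token stream by counting, ranking, and comparing each frequency with C / r. The following is a minimal sketch; the toy sentence and the helper names `rank_frequency` and `zipf_prediction` are illustrative assumptions, and a text this small will not show a real Zipf curve.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, token, frequency) triples, most frequent token first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, tok, freq) for rank, (tok, freq) in enumerate(ordered, start=1)]

def zipf_prediction(top_frequency, rank):
    """Rule-of-thumb frequency C / r for the token at the given rank."""
    return top_frequency / rank

# Hypothetical toy data, only to show the mechanics.
tokens = "the cat and the dog and the bird saw the fish".split()
table = rank_frequency(tokens)
C = table[0][2]  # frequency of the most common token
for rank, tok, freq in table[:3]:
    print(rank, tok, freq, round(zipf_prediction(C, rank), 2))
```

On a real collection, plotting frequency against rank on log-log axes should give a roughly straight line.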
Zipf Distribution
(linear and log scale)
Illustration by Jakob Nielsen
What Kinds of Data Exhibit a
Zipf Distribution?
• Words in a text collection
– Virtually any language usage
• Library book checkout patterns
• Incoming Web Page Requests
• Outgoing Web Page Requests
• Document Size on Web
• City Sizes
• …
Consequences of Zipf
• There are always a few very frequent tokens
that are not good discriminators.
– Called “stop words” in IR
• English examples: to, from, on, and, the, ...
• There are always a large number of tokens
that occur once and can mess up algorithms.
• Medium-frequency words are the most descriptive
Word Frequency vs. Resolving
Power (from van Rijsbergen 79)
The most frequent words are not the most descriptive.
Statistical Independence
Two events x and y are statistically
independent if the product of the
probabilities of each happening individually
equals the probability of their happening
together.
$P(x)\,P(y) = P(x, y)$
Lexical Associations
• Subjects write first word that comes to mind
– doctor/nurse; black/white (Palermo & Jenkins 64)
• Text Corpora yield similar associations
• One measure: Mutual Information (Church and Hanks 89)
$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$
• If word occurrences were independent, the numerator and
denominator would be equal (if measured across a large
collection)
Statistical Independence
• Compute for a window of words
$P(x)\,P(y) = P(x, y)$ if independent.
$P(x) = f(x) / N$
We'll approximate $P(x, y)$ as follows:
$P(x, y) = \frac{1}{N} \sum_{i=1}^{N - |w|} w_i(x, y)$
$|w|$ = length of window $w$ (say 5)
$w_i$ = words within window starting at position $i$
$w(x, y)$ = number of times $x$ and $y$ co-occur in $w$
$N$ = number of words in collection
[Figure: windows w1, w11, w21 sliding over the text "abcdefghij klmnop"]
Interesting Associations with “Doctor”
(AP Corpus, N=15 million, Church & Hanks 89)
I(x,y) f(x,y) f(x) x f(y) y
11.3 12 111 Honorary 621 Doctor
11.3 8 1105 Doctors 44 Dentists
10.7 30 1105 Doctors 241 Nurses
9.4 8 1105 Doctors 154 Treating
9.0 6 275 Examined 621 Doctor
8.9 11 1105 Doctors 317 Treat
8.7 25 621 Doctor 1407 Bills
Un-Interesting Associations with
“Doctor”
(AP Corpus, N=15 million, Church & Hanks 89)
I(x,y) f(x,y) f(x) x f(y) y
0.96 6 621 doctor 73785 with
0.95 41 284690 a 1105 doctors
0.93 12 84716 is 1105 doctors
These associations were likely to happen because
the non-doctor words shown here are very common
and therefore likely to co-occur with any noun.
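As a rough check on the association tables, the mutual information score can be recomputed from the published counts, and the windowed co-occurrence count can be sketched directly. This is only an illustration: the function names are hypothetical, and counting a window once when it contains both words is an assumption about how w(x, y) is tallied.

```python
import math

def window_cooccurrences(tokens, x, y, window=5):
    """Number of window start positions whose window contains both x and y."""
    total = 0
    # Assumption: every full window is visited, so overlapping windows
    # can count the same pair of occurrences more than once.
    for i in range(len(tokens) - window + 1):
        w = tokens[i:i + window]
        if x in w and y in w:
            total += 1
    return total

def mutual_information(f_xy, f_x, f_y, n):
    """I(x, y) = log2(P(x, y) / (P(x) P(y))), with each P estimated as count / n."""
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))

# Counts from the Honorary / Doctor row (AP corpus, N = 15 million).
score = mutual_information(12, 111, 621, 15_000_000)
print(round(score, 2))

# Tiny hypothetical token stream for the window count.
toy = "the doctor and the nurse met the doctor".split()
print(window_cooccurrences(toy, "doctor", "nurse"))
```

The recomputed score lands close to the 11.3 listed for that row, and the same function gives values below 1 for the rows pairing "doctor" with very common words.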
Associations Are Important Because…
• We may be able to discover phrases that
should be treated as a single word, e.g. “data mining”.
• We may be able to automatically discover
synonyms, e.g. “Bike” and “Bicycle”
Content Analysis Summary
• Content Analysis: transforming raw text into more
computationally useful forms
• Words in text collections exhibit interesting
statistical properties
– Word frequencies have a Zipf distribution
– Word co-occurrences exhibit dependencies
• Text documents are transformed to vectors
– Pre-processing includes tokenization, stemming,
collocations/phrases
The section that follows is about
Index Construction
Inverted Index
• This is the primary data structure for text indexes
• Main Idea:
– Invert documents into a big index
• Basic steps:
– Make a “dictionary” of all the tokens in the collection
– For each token, list all the docs it occurs in.
– Do a few things to reduce redundancy in the data structure
Inverted Indexes
We have seen “Vector files” conceptually. An
Inverted File is a vector file “inverted” so
that rows become columns and columns
become rows
Vector file (one row per document):
docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1

Inverted file (one row per term):
Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0
How Are Inverted Files Created
• Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2
How Inverted Files are Created
• After all documents have been parsed the inverted file is sorted alphabetically.

Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2
How Inverted Files are Created
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2
How Inverted Files are Created
• Then the file can be split into
– A Dictionary file
and
– A Postings file
How Inverted Files are Created
Each Dictionary entry points to that term's entries in the Postings file.

Dictionary                      Postings
Term      N docs  Tot Freq      (Doc #, Freq)
a         1       1             (2, 1)
aid       1       1             (1, 1)
all       1       1             (1, 1)
and       1       1             (2, 1)
come      1       1             (1, 1)
country   2       2             (1, 1) (2, 1)
dark      1       1             (2, 1)
for       1       1             (1, 1)
good      1       1             (1, 1)
in        1       1             (2, 1)
is        1       1             (1, 1)
it        1       1             (2, 1)
manor     1       1             (2, 1)
men       1       1             (1, 1)
midnight  1       1             (2, 1)
night     1       1             (2, 1)
now       1       1             (1, 1)
of        1       1             (1, 1)
past      1       1             (2, 1)
stormy    1       1             (2, 1)
the       2       4             (1, 2) (2, 2)
their     1       1             (1, 1)
time      2       2             (1, 1) (2, 1)
to        1       2             (1, 2)
was       1       2             (2, 2)
Inverted Indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
– document ID
– frequency of term in doc (optional)
– position of term in doc (optional)
• These lists can be used to solve Boolean queries:
• country -> d1, d2
• manor -> d2
• country AND manor -> d2
• Also used for statistical ranking algorithms
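The construction steps and the Boolean lookups can be sketched in a few lines. This is a minimal illustration, not a production index: the lower-casing, letters-only tokenizer and the names `build_inverted_index` and `boolean_and` are assumptions made for the example, and the dictionary and postings are kept together in one in-memory mapping.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to {doc_id: within-document frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumed tokenizer: lower-case the text and keep runs of letters.
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

def boolean_and(index, term_a, term_b):
    """Doc ids containing both terms (intersection of the two postings lists)."""
    return sorted(set(index.get(term_a, {})) & set(index.get(term_b, {})))

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}
index = build_inverted_index(docs)
print(dict(index["country"]))                 # postings for one term
print(boolean_and(index, "country", "manor"))
print(boolean_and(index, "time", "dark"))
```

Keeping the within-document frequency in each posting is what later allows the same structure to feed statistical ranking.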
How Inverted Files are Used
Query on “time” AND “dark”
• 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file
• 1 doc with “dark” in dictionary -> ID 2 from posting file
• Therefore, only doc 2 satisfied the query.
The section that follows is about
Querying (and ranking)
Simple query language: Boolean
– Terms + Connectors (or operators)
– terms
• words
• normalized (stemmed) words
• phrases
– connectors
• AND
• OR
• NOT
• NEAR (Pseudo Boolean)
Word Doc
• Cat x
• Dog
• Collar x
• Leash
Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
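To see how these queries behave, they can be evaluated against the single document from the Word/Doc table. The sketch below assumes that the document contains exactly the words marked "x" there (Cat and Collar); the helper `has` is a hypothetical name.

```python
# Assumption: the document holds exactly the words marked "x" in the Word/Doc table.
doc_words = {"cat", "collar"}

def has(word):
    """True if the (lower-cased) word occurs in the document."""
    return word.lower() in doc_words

results = {
    "Cat": has("Cat"),
    "Cat OR Dog": has("Cat") or has("Dog"),
    "Cat AND Dog": has("Cat") and has("Dog"),
    "(Cat AND Dog) OR Collar": (has("Cat") and has("Dog")) or has("Collar"),
    "(Cat AND Dog) OR (Collar AND Leash)":
        (has("Cat") and has("Dog")) or (has("Collar") and has("Leash")),
    "(Cat OR Dog) AND (Collar OR Leash)":
        (has("Cat") or has("Dog")) and (has("Collar") or has("Leash")),
}
for query, matched in results.items():
    print(f"{query}: {matched}")
```

Under that assumption the document matches every query except the two that require Cat AND Dog without an alternative it can satisfy.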
Boolean Searching
“Measurement of the width of cracks in prestressed concrete beams”
Formal Query:
cracks AND beams AND Width_measurement AND Prestressed_concrete
[Venn diagram: Cracks, Beams, Width measurement, Prestressed concrete]
Relaxed Query:
(C AND B AND P) OR
(C AND B AND W) OR
(C AND W AND P) OR
(B AND W AND P)
Ordering of Retrieved Documents
• Pure Boolean has no ordering
• In practice:
– order chronologically
– order by total number of “hits” on query terms
• What if one term has more hits than others?
• Is it better to have one of each term or many of one term?
Boolean Model
• Advantages
– simple queries are easy to understand
– relatively easy to implement
• Disadvantages
– difficult to specify what is wanted
– too much returned, or too little
– ordering not well determined
• Dominant language in commercial Information
Retrieval systems until the WWW
Since the Boolean model is limited, let's consider a generalization…
Vector Model
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
– A vector is like an array of floating-point numbers
– Has direction and magnitude
– Each vector holds a place for every term in the collection
– Therefore, most vectors are sparse
• Smithers secretly loves Monty Burns
• Monty Burns secretly loves Smithers
Both map to…
[ Burns, loves, Monty, secretly, Smithers]
Document Vectors
One location for each word
Document ids
nova galaxy heat h’wood film role diet fur
A 10 5 3
B 5 10
C 10 8 7
D 9 10 5
E 10 10
F 9 10
G 5 7 9
H 6 10 2 8
I 7 5 1 3
We Can Plot the Vectors
Star
Doc about movie stars
Doc about astronomy
Doc about mammal behavior
Diet
Documents in 3D Vector Space
[Figure: documents D1 to D11 plotted as vectors in the space spanned by terms t1, t2 and t3]
Illustration from Jurafsky & Martin
Vector Space Model
docs Homer Marge Bart
D1 * *
D2 *
D3 * *
D4 *
D5 * * *
D6 * *
D7 *
D8 *
D9 *
D10 * *
D11 * *
Q *
Note that the query is projected into the same vector space as the documents.
The query here is for “Marge”.
We can use a vector similarity model to determine the best match
to our query (details in a few slides).
But what weights should we use for the terms?
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• tf x idf
– Recall the Zipf distribution
– Want to weight terms highly if they are
• frequent in relevant documents … BUT
• infrequent in the collection as a whole
Binary Weights
• Only the presence (1) or absence (0) of a
term is included in the vector
docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1
We have already seen and discussed this model.
Raw Term Weights
• The frequency of occurrence for the term in
each document is included in the vector
docs  t1  t2  t3
D1    2   0   3
D2    1   0   0
D3    0   4   7
D4    3   0   0
D5    1   6   3
D6    3   5   0
D7    0   8   0
D8    0   10  0
D9    0   0   1
D10   0   3   5
D11   4   0   1
Counts can be normalized by document lengths.
This model is open to exploitation by websites that simply repeat a term many times:
sex sex sex sex sex sex sex sex sex sex …
tf * idf Weights
• tf * idf measure:
– term frequency (tf)
– inverse document frequency (idf) -- a way to
deal with the problems of the Zipf distribution
• Goal: assign a tf * idf weight to each term
in each document
tf * idf
$w_{ik} = tf_{ik} \cdot \log(N / n_k)$
$T_k$ = term $k$ in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in $C$
$N$ = total number of documents in the collection $C$
$n_k$ = the number of documents in $C$ that contain $T_k$
$idf_k = \log(N / n_k)$
Inverse Document Frequency
• IDF provides high values for rare words and
low values for common words
For a collection of 10000 documents, $idf_k = \log_{10}(N / n_k)$ gives:
$\log(10000 / 10000) = 0$
$\log(10000 / 5000) = 0.301$
$\log(10000 / 20) = 2.699$
$\log(10000 / 1) = 4$
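The IDF values for the 10000-document collection are easy to recompute. The sketch below assumes base-10 logarithms, which is what the worked values imply; the function names `idf` and `tf_idf` are illustrative.

```python
import math

def idf(n_docs, docs_with_term):
    """Inverse document frequency log10(N / n_k)."""
    return math.log10(n_docs / docs_with_term)

def tf_idf(tf, n_docs, docs_with_term):
    """Weight of a term that occurs tf times in a document."""
    return tf * idf(n_docs, docs_with_term)

# A term found in every document gets weight 0; a term found once gets the maximum.
for n_k in (10000, 5000, 20, 1):
    print(n_k, round(idf(10000, n_k), 3))

# Hypothetical example: a term occurring 3 times in a document and in 20 documents overall.
print(round(tf_idf(3, 10000, 20), 3))
```

Because the weight is a product, a term repeated many times in one document still scores low if it also appears in most of the collection.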
Similarity Measures
Simple matching (coordination level match): $|Q \cap D|$
Dice’s Coefficient: $\frac{2\,|Q \cap D|}{|Q| + |D|}$
Jaccard’s Coefficient: $\frac{|Q \cap D|}{|Q \cup D|}$
Cosine Coefficient: $\frac{|Q \cap D|}{|Q|^{1/2}\,|D|^{1/2}}$
Overlap Coefficient: $\frac{|Q \cap D|}{\min(|Q|, |D|)}$
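For binary (set-of-terms) representations, all five coefficients depend only on the sizes of Q, D, their intersection and their union. A minimal sketch, assuming both sets are non-empty; the example query and document are hypothetical.

```python
def similarity_coefficients(q, d):
    """Set-based similarity measures between a query and a document."""
    q, d = set(q), set(d)
    common = len(q & d)
    return {
        "simple": common,
        "dice": 2 * common / (len(q) + len(d)),
        "jaccard": common / len(q | d),
        "cosine": common / (len(q) ** 0.5 * len(d) ** 0.5),
        "overlap": common / min(len(q), len(d)),
    }

# Hypothetical query and document term sets.
query = {"cat", "dog"}
document = {"cat", "collar", "leash", "dog"}
scores = similarity_coefficients(query, document)
for name, value in scores.items():
    print(name, round(value, 3))
```

Simple matching is an unnormalized count, while the other four rescale it by different notions of the sets' size, which is why they rank long and short documents differently.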
Cosine
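$D_1 = (0.8, 0.3)$
$D_2 = (0.2, 0.7)$
$Q = (0.4, 0.8)$
$\cos \alpha_1 = 0.73$ (angle between $Q$ and $D_1$)
$\cos \alpha_2 = 0.98$ (angle between $Q$ and $D_2$)
[Figure: Q, D1 and D2 drawn as vectors in the unit square, with angles α1 and α2 measured from Q]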
Vector Space Similarity Measure
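$D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})$
$Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$, with $w = 0$ if a term is absent
If term weights are normalized:
$sim(Q, D_i) = \sum_{j=1}^{t} w_{qj}\, w_{d_{ij}}$
Otherwise, normalize in the similarity comparison:
$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj}\, w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{qj})^2 \; \sum_{j=1}^{t} (w_{d_{ij}})^2}}$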
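The normalized form of the similarity is the cosine of the angle between the two weight vectors. As a small sketch, here it is applied to the Q, D1 and D2 vectors from the Cosine slide; the function name and the zero-vector convention are assumptions for the example.

```python
import math

def cosine_similarity(q, d):
    """Dot product of q and d divided by the product of their lengths."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm = math.sqrt(sum(w * w for w in q) * sum(w * w for w in d))
    # Assumed convention: similarity with an all-zero vector is 0.
    return dot / norm if norm else 0.0

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
print(round(cosine_similarity(Q, D1), 3))
print(round(cosine_similarity(Q, D2), 3))
```

D2 comes out as the closer match to Q, since its direction is nearly the same even though it is a shorter vector.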
Problems with Vector Space
• There is no real theoretical basis for the
assumption of a term space
– it is more for visualization than having any real
basis
– most similarity measures work about the same
regardless of model
• Terms are not really orthogonal dimensions
– Terms are not independent of all other terms
Probabilistic Models
• Rigorous formal model attempts to predict
the probability that a given document will
be relevant to a given query
• Ranks retrieved documents according to this
probability of relevance (Probability
Ranking Principle)
• Rely on accurate estimates of probabilities
Relevance Feedback
• Main Idea:
– Modify existing query based on relevance judgements
• Query Expansion: Extract terms from relevant documents and
add them to the query
• Term Re-weighting: and/or re-weight the terms already in the
query
– Two main approaches:
• Automatic (pseudo-relevance feedback)
• Users select relevant documents
– Users/system select terms from an automatically-
generated list
Definition: Relevance Feedback is the reformulation of a search query in response
to feedback provided by the user for the results of previous versions of the query.
Suppose you are interested in bovine agriculture on
the banks of the river Jordan…
Term Vector [Jordan , Bank, Bull, River]
Term Weights [ 1 , 1 , 1 , 1 ]
Search
Display Results
Gather Feedback
Update Weights
Term Vector [Jordan , Bank, Bull, River]
Term Weights [ 1.1 , 0.1 , 1.3 , 1.2 ]
Rocchio Method
$Q_1 = Q_0 + \beta \sum_{i=1}^{n_1} \frac{R_i}{n_1} - \gamma \sum_{i=1}^{n_2} \frac{S_i}{n_2}$
where
$Q_0$ = the vector for the initial query
$R_i$ = the vector for the relevant document $i$
$S_i$ = the vector for the non-relevant document $i$
$n_1$ = the number of relevant documents chosen
$n_2$ = the number of non-relevant documents chosen
$\beta$ and $\gamma$ tune the importance of relevant and nonrelevant terms
(in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25)
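The update adds a weighted average of the relevant document vectors to the query and subtracts a weighted average of the non-relevant ones. The sketch below is one possible reading of that formula: the parameter names `beta` and `gamma` for the two tuning weights, the plain-list vectors, and the feedback documents are all assumptions for illustration.

```python
def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """Return the updated query vector Q1 as a list of term weights."""
    q1 = list(q0)
    for docs, sign, weight in ((relevant, 1, beta), (nonrelevant, -1, gamma)):
        if not docs:
            continue  # no feedback of this kind, so nothing to add or subtract
        for j in range(len(q1)):
            centroid_j = sum(doc[j] for doc in docs) / len(docs)
            q1[j] += sign * weight * centroid_j
    return q1

# Term order follows the example query: [Jordan, Bank, Bull, River].
q0 = [1, 1, 1, 1]
relevant = [[1, 0, 1, 1], [1, 0, 1, 0]]   # hypothetical documents judged relevant
nonrelevant = [[0, 1, 0, 0]]              # hypothetical document judged non-relevant
q1 = rocchio(q0, relevant, nonrelevant)
print(q1)
```

With this made-up feedback the weight for "Bank" drops while "Jordan", "Bull" and "River" rise, which is the qualitative behaviour shown in the river Jordan example.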
Rocchio Illustration
Although we usually work in vector space for text, it is
easier to visualize Euclidean space
Original Query Term Re-weighting Query Expansion
Note that both the location of
the center, and the shape of
the query have changed
Rocchio Method
• Rocchio automatically
– re-weights terms
– adds in new terms (from relevant docs)
• Most methods perform similarly
– results heavily dependent on test collection
• Machine learning methods are proving to
work better than standard IR approaches
like Rocchio
Using Relevance Feedback
• Known to improve results
• People don’t seem to like giving feedback!
The section that follows is about
Evaluation
Evaluation
• Why Evaluate?
• What to Evaluate?
• How to Evaluate?
Why Evaluate?
• Determine if the system is desirable
• Make comparative assessments
What to Evaluate?
• How much of the information need is
satisfied.
• How much was learned about a topic.
• Incidental learning:
– How much was learned about the collection.
– How much was learned about other topics.
• How inviting the system is.
What to Evaluate?
What can be measured that reflects users’ ability
to use system? (Cleverdon 66)
– Coverage of Information
– Form of Presentation
– Effort required/Ease of Use
– Time and Space Efficiency
– Effectiveness:
• Recall: proportion of relevant material actually retrieved
• Precision: proportion of retrieved material actually relevant
Relevant vs. Retrieved
All docs
Retrieved
Relevant
Precision vs. Recall
$\text{Precision} = \frac{|\text{Rel} \cap \text{Retrieved}|}{|\text{Retrieved}|}$
$\text{Recall} = \frac{|\text{Rel} \cap \text{Retrieved}|}{|\text{Rel in Collection}|}$
All docs
Retrieved
Relevant
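Both measures follow directly from the retrieved set and the relevant set. A minimal sketch with made-up document ids; returning 0.0 when a set is empty is an assumed convention.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant, and two of the five relevant documents were found.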
Why Precision and Recall?
Intuition:
Get as much good stuff while at the same time getting
as little junk as possible.
Retrieved vs. Relevant Documents
Very high precision, very low recall
Relevant
Retrieved vs. Relevant Documents
Very low precision, very low recall (0 in fact)
Relevant
Retrieved vs. Relevant Documents
High recall, but low precision
Relevant
Retrieved vs. Relevant Documents
High precision, high recall (at last!)
Relevant
Precision/Recall Curves
• There is a tradeoff between Precision and Recall
• So measure Precision at different levels of Recall
• Note: this is an AVERAGE over MANY queries
[Figure: precision (y-axis) plotted against recall (x-axis)]
Precision/Recall Curves
• Difficult to determine which of these two hypothetical
results is better:
[Figure: two hypothetical precision (y-axis) vs. recall (x-axis) curves]
Document Cutoff Levels
• Another way to evaluate:
– Fix the number of documents retrieved at several levels:
• top 5
• top 10
• top 20
• top 50
• top 100
• top 500
– Measure precision at each of these levels
– Take (weighted) average over results
• This is a way to focus on how well the system ranks the
first k documents.
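Precision at a cutoff only looks at the top k positions of the ranking. The sketch below uses a hypothetical ranking and relevance set, and an unweighted mean over the cutoffs, which is just one choice for the "(weighted) average".

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranking[:k]
    return sum(1 for doc in top if doc in relevant) / k

# Hypothetical ranked output and relevance judgments.
ranking = ["d3", "d1", "d7", "d2", "d9", "d4"]
relevant = {"d1", "d2", "d4"}
cutoffs = (1, 2, 5)
scores = {k: precision_at_k(ranking, relevant, k) for k in cutoffs}
mean_precision = sum(scores.values()) / len(scores)
print(scores, round(mean_precision, 3))
```

Dividing by k even when fewer than k documents are returned penalizes short result lists, which is a common convention but still an assumption here.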
Problems with Precision/Recall
• Can’t know true recall value
– except in small collections
• Precision/Recall are related
– A combined measure sometimes more appropriate
• Assumes batch mode
– Interactive IR is important and has different criteria for
successful searches
– Assumes a strict rank ordering matters.
Relation to Contingency Table
                        Doc is relevant    Doc is NOT relevant
Doc is retrieved        a                  b
Doc is NOT retrieved    c                  d

The same table written as counts:
                        Doc is relevant                Doc is NOT relevant
Doc is retrieved        $N_{ret \wedge rel}$           $N_{ret \wedge \neg rel}$
Doc is NOT retrieved    $N_{\neg ret \wedge rel}$      $N_{\neg ret \wedge \neg rel}$
• Accuracy: (a+d) / (a+b+c+d)
• Precision: a/(a+b)
• Recall: a/(a+c)
• Why don’t we use Accuracy for IR?
– (Assuming a large collection)
– Most docs aren’t relevant
– Most docs aren’t retrieved
– Inflates the accuracy value
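A small numeric case makes the inflation visible. The counts below are invented for illustration: a collection of 100000 documents with 50 relevant ones, of which a weak system retrieves 5 among 20 results.

```python
def contingency_measures(a, b, c, d):
    """Accuracy, precision and recall from the four contingency-table cells."""
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision": a / (a + b),
        "recall": a / (a + c),
    }

# Hypothetical counts: a = retrieved and relevant, b = retrieved but not relevant,
# c = relevant but missed, d = neither retrieved nor relevant.
m = contingency_measures(a=5, b=15, c=45, d=99935)
for name, value in m.items():
    print(name, round(value, 4))
```

Accuracy is dominated by the huge d cell, so it stays near 1 even though the system misses 90% of the relevant documents.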
The E-Measure
Combine Precision and Recall into one number (van
Rijsbergen 79)
$E = 1 - \frac{1 + b^2}{\frac{b^2}{R} + \frac{1}{P}}$
P = precision
R = recall
b = measure of relative importance of P or R
For example,
b = 0.5 means user is twice as interested in
precision as recall
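The E-measure is straightforward to compute once P and R are known. In this sketch the function name and the choice to return 1.0 (the worst value) when either input is zero are assumptions; the example values of P and R are made up.

```python
def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E = 1 - (1 + b^2) / (b^2 / R + 1 / P); lower is better."""
    if precision == 0 or recall == 0:
        return 1.0  # assumed convention to avoid division by zero
    return 1 - (1 + b ** 2) / (b ** 2 / recall + 1 / precision)

# Hypothetical system with precision 0.5 and recall 0.4.
print(round(e_measure(0.5, 0.4, b=1.0), 4))   # P and R weighted equally
print(round(e_measure(0.5, 0.4, b=0.5), 4))   # precision emphasized
```

With b = 0.5 the higher precision counts for more, so E drops compared with the equally weighted case; with b = 1, E is one minus the harmonic mean of P and R.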
How to Evaluate?
Test Collections
TREC
• Text REtrieval Conference/Competition
– Run by NIST (National Institute of Standards & Technology)
– 2004 (November) will be 13th year
• Collection: >6 Gigabytes (5 CD-ROMs), >1.5
Million Docs
– Newswire & full text news (AP, WSJ, Ziff, FT)
– Government documents (federal register, Congressional
Record)
– Radio Transcripts (FBIS)
– Web “subsets”
TREC (cont.)
• Queries + Relevance Judgments
– Queries devised and judged by “Information Specialists”
– Relevance judgments done only for those documents
retrieved -- not entire collection!
• Competition
– Various research and commercial groups compete (TREC
6 had 51, TREC 7 had 56, TREC 8 had 66)
– Results judged on precision and recall, going up to a
recall level of 1000 documents
TREC
• Benefits:
– made research systems scale to large collections (pre-
WWW)
– allows for somewhat controlled comparisons
• Drawbacks:
– emphasis on high recall, which may be unrealistic for
what most users want
– very long queries, also unrealistic
– comparisons still difficult to make, because systems are
quite different on many dimensions
– focus on batch ranking rather than interaction
– no focus on the WWW
TREC is changing
• Emphasis on specialized “tracks”
– Interactive track
– Natural Language Processing (NLP) track
– Multilingual tracks (Chinese, Spanish)
– Filtering track
– High-Precision
– High-Performance
• http://trec.nist.gov/
Homework…