Text Similarity







Dr Eamonn Keogh

Computer Science & Engineering Department

University of California - Riverside

Riverside, CA 92521

eamonn@cs.ucr.edu

Word     Twain     Twain     Twain     Twain     Twain
Length   Sample 1  Sample 2  Sample 3  Sample 4  Sample 5  Snodgrass
1        74        312       116       138       122       424
2        349       1146      496       532       466       2685
3        456       1394      673       741       653       2752
4        374       1177      565       591       517       2302
5        212       661       381       357       343       1431
6        127       442       249       258       207       992
7        107       367       185       215       152       896
8        84        231       125       150       103       638
9        45        181       94        83        92        465
10       27        109       51        55        45        276
11       13        50        23        30        18        152
12       8         24        8         10        12        101
13+      9         12        8         9         9         61









[Figure: two charts of the word-length distributions for Sample 1 and Sample 2 over word lengths 1 to 13, one with a y-axis from 0 to 1600 and one with a y-axis from 0 to 0.3.]

[Figure: a diagram whose items are labeled 1 to 6.]

Information Retrieval

• Task Statement:



Build a system that retrieves documents that users

are likely to find relevant to their queries.





• This assumption underlies the field of

Information Retrieval.

[Figure: the IR pipeline. Boxes: Information need, text input, Parse, Query; Collections, Pre-process, Index; Rank; Evaluate. Callouts: “How is the query constructed?” and “How is the text processed?”]

Terminology



Token: A natural language word “Swim”,

“Simpson”, “92513” etc



Document: Usually a web page, but more

generally any file.

Some IR History

– Roots in the scientific “Information Explosion” following

WWII

– Interest in computer-based IR from mid 1950’s

• H.P. Luhn at IBM (1958)

• Probabilistic models at Rand (Maron & Kuhns) (1960)

• Boolean system development at Lockheed (‘60s)

• Vector Space Model (Salton at Cornell 1965)

• Statistical Weighting methods and theoretical advances (‘70s)

• Refinements and Advances in application (‘80s)

• User Interfaces, Large-scale testing and application (‘90s)

Relevance

• In what ways can a document be relevant to a query?

– Answer precise question precisely.

– Who is Homer’s Boss? Montgomery Burns.

– Partially answer question.

– Where does Homer work? Power Plant.

– Suggest a source for more information.

– What is Bart’s middle name? Look in Issue 234 of Fanzine

– Give background information.

– Remind the user of other knowledge.

– Others ...

[Figure: the IR pipeline diagram, repeated.]

The section that follows is about Content Analysis (transforming raw text into a computationally more manageable form).

Stemming and Morphological Analysis



• Goal: “normalize” similar words

• Morphology (“form” of words)

– Inflectional Morphology

• E.g., inflect verb endings and noun number

• Never change grammatical class

– dog, dogs

– Bike, Biking

– Swim, Swimmer, Swimming





What about… build, building;

Examples of Stemming (using Porter's algorithm)

Original Words   Stemmed Words
…                …
consign          consign
consigned        consign
consigning       consign
consignment      consign
consist          consist
consisted        consist
consistency      consist
consistent       consist
consistently     consist
consisting       consist
consists         consist

Porter's algorithm is available in Java, C, Lisp, Perl, Python etc. from
http://www.tartarus.org/~martin/PorterStemmer/
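
To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is not Porter's algorithm, which applies several ordered rule phases with conditions on the remaining stem. The suffix list and the `min_stem` cutoff are assumptions chosen only so that the words in the table above collapse to a common stem.

```python
# Toy suffix stripper (NOT Porter's algorithm; suffix list is an assumption).
SUFFIXES = ("ingly", "ently", "ment", "ency", "ing", "ent", "ed", "s")


def toy_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["consigned", "consigning", "consignment", "consistency", "consists"]:
        print(w, "->", toy_stem(w))
```

A rule set this crude will conflate unrelated words and miss related ones far more often than a real stemmer does.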

Errors Generated by Porter

Stemmer (Krovetz 93)

Too Aggressive        Too Timid
organization/organ    european/europe
policy/police         cylinder/cylindrical
execute/executive     create/creation
arm/army              search/searcher





Homework!! Play with the following URL

http://fusion.scs.carleton.ca/~dquesnel/java/stuff/PorterApplet.html

Statistical Properties of Text

• Token occurrences in text are not uniformly

distributed

• They are also not normally distributed

• They do exhibit a Zipf distribution

Government documents, 157734 tokens, 32259 unique

Count  Token    Count  Token      Count  Token
8164   the      969    on         1      ABC
4771   of       915    FT         1      ABFT
4005   to       883    Mr         1      ABOUT
2834   a        860    was        1      ACFT
2827   and      855    be         1      ACI
2802   in       849    Pounds     1      ACQUI
1592   The      798    TEXT       1      ACQUISITIONS
1370   for      798    PUB        1      ACSIS
1326   is       798    PROFILE    1      ADFT
1324   s        798    PAGE       1      ADVISERS
1194   that     798    HEADLINE   1      AE
973    by       798    DOCNO

Plotting Word Frequency by Rank



• Main idea: count

– How many times tokens occur in the text

• Over all texts in the collection

• Now sort the tokens by how often they occur. A token's position in
this ordering is called its rank.
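
The counting and ranking steps can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and lowercasing; the function name `rank_tokens` is hypothetical.

```python
from collections import Counter


def rank_tokens(texts):
    """Count tokens over all texts and return (rank, token, frequency) triples."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())  # assumption: whitespace tokens
    # most_common() orders by descending count; rank 1 is the most frequent.
    return [
        (rank, token, freq)
        for rank, (token, freq) in enumerate(counts.most_common(), start=1)
    ]


if __name__ == "__main__":
    for row in rank_tokens(["the cat the dog", "the cat"]):
        print(row)
```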

The Corresponding Zipf Curve

Rank Freq Term

1 37 system

2 32 knowledg

3 24 base

4 20 problem

5 18 abstract

6 15 model

7 15 languag

8 15 implem

9 13 reason

10 13 inform

11 11 expert

12 11 analysi

13 10 rule

14 10 program

15 10 oper

16 10 evalu

17 10 comput

18 10 case

19 9 gener

20 9 form

Zipf Distribution



• The Important Points:

– a few elements occur very frequently

– a medium number of elements have medium

frequency

– many elements occur very infrequently

Zipf Distribution

• The product of the frequency of words (f) and

their rank (r) is approximately constant

– Rank = order of words’ frequency of occurrence

$f \approx C \cdot \frac{1}{r}$

$C \approx N / 10$

• Another way to state this is with an approximately correct

rule of thumb:

– Say the most common term occurs C times

– The second most common occurs C/2 times

– The third most common occurs C/3 times

– …
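
The rule of thumb above can be written out directly. This small sketch assumes only that the frequency at rank r is C / r; the function name is hypothetical.

```python
def zipf_predicted(c, max_rank):
    """Frequencies predicted by the rule of thumb f = C / r for r = 1..max_rank."""
    return [c / r for r in range(1, max_rank + 1)]


if __name__ == "__main__":
    # If the most common term occurs 60 times, the next two should occur
    # roughly 30 and 20 times under this approximation.
    print(zipf_predicted(60, 3))
```

Multiplying each predicted frequency by its rank gives back C, which is the "approximately constant product" stated at the top of this slide.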

Zipf Distribution

(linear and log scale)









Illustration by Jakob Nielsen

What Kinds of Data Exhibit a

Zipf Distribution?

• Words in a text collection

– Virtually any language usage

• Library book checkout patterns

• Incoming Web Page Requests

• Outgoing Web Page Requests

• Document Size on Web

• City Sizes

• …

Consequences of Zipf

• There are always a few very frequent tokens

that are not good discriminators.

– Called “stop words” in IR

• English examples: to, from, on, and, the, ...

• There are always a large number of tokens

that occur once and can mess up algorithms.

• Medium frequency words most descriptive

Word Frequency vs. Resolving

Power (from van Rijsbergen 79)

The most frequent words are not the most descriptive.

Statistical Independence

Two events x and y are statistically independent if the product of
the probabilities of each happening individually equals the
probability of their happening together.

$P(x)P(y) = P(x, y)$

Lexical Associations

• Subjects write first word that comes to mind

– doctor/nurse; black/white (Palermo & Jenkins 64)

• Text Corpora yield similar associations

• One measure: Mutual Information (Church and Hanks 89)





$I(x, y) = \log_2 \frac{P(x, y)}{P(x)P(y)}$



• If word occurrences were independent, the numerator and

denominator would be equal (if measured across a large

collection)

Statistical Independence

• Compute for a window of words

$P(x)P(y) = P(x, y)$ if independent.

$P(x) = f(x) / N$

We'll approximate $P(x, y)$ as follows:

$P(x, y) = \frac{1}{N} \sum_{i=1}^{N-|w|} w_i(x, y)$

$|w|$ = length of window w (say 5)
$w_i$ = words within window starting at position i
$w(x, y)$ = number of times x and y co-occur in w
$N$ = number of words in collection

[Figure: a sliding window over the character sequence "abcdefghij klmnop", with window positions labeled w1, w11 and w21.]
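
A sketch of this windowed estimate is given below. It makes one assumption the slide leaves open: each window contributes 1 if both words appear in it and 0 otherwise. The function name and the zero-based indexing are also choices made for the example.

```python
def cooccurrence_probability(tokens, x, y, window=5):
    """Approximate P(x, y) by sliding a fixed-size window over the token list."""
    n = len(tokens)
    hits = 0
    # Window start positions 0 .. n - window - 1 (the slide's i = 1 .. N - |w|).
    for start in range(max(n - window, 0)):
        span = tokens[start:start + window]
        if x in span and y in span:  # assumption: count each window at most once
            hits += 1
    return hits / n if n else 0.0


if __name__ == "__main__":
    toks = "a b c a b c".split()
    print(cooccurrence_probability(toks, "a", "b", window=2))
```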

Interesting Associations with “Doctor”

(AP Corpus, N=15 million, Church & Hanks 89)



I(x,y) f(x,y) f(x) x f(y) y

11.3 12 111 Honorary 621 Doctor

11.3 8 1105 Doctors 44 Dentists

10.7 30 1105 Doctors 241 Nurses

9.4 8 1105 Doctors 154 Treating

9.0 6 275 Examined 621 Doctor

8.9 11 1105 Doctors 317 Treat

8.7 25 621 Doctor 1407 Bills
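
The scores in this table can be approximately reproduced from the counts with the mutual information formula, taking P(x) = f(x)/N and, as a simplifying assumption, P(x, y) = f(x, y)/N. Church & Hanks's exact counting procedure may differ in detail, so small deviations from the published values are expected.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """I(x, y) = log2( P(x, y) / (P(x) P(y)) ) with probabilities as counts / n."""
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return math.log2(p_xy / (p_x * p_y))


if __name__ == "__main__":
    n = 15_000_000
    print(round(mutual_information(12, 111, 621, n), 2))    # Honorary / Doctor
    print(round(mutual_information(30, 1105, 241, n), 2))   # Doctors / Nurses
```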

Un-Interesting Associations with

“Doctor”

(AP Corpus, N=15 million, Church & Hanks 89)



I(x,y) f(x,y) f(x) x f(y) y

0.96 6 621 doctor 73785 with

0.95 41 284690 a 1105 doctors

0.93 12 84716 is 1105 doctors





These associations were likely to happen because

the non-doctor words shown here are very common

and therefore likely to co-occur with any noun.

Associations Are Important Because…



• We may be able to discover phrases that should be treated as a
single word, e.g. “data mining”.

• We may be able to automatically discover synonyms, e.g. “Bike”
and “Bicycle”.

Content Analysis Summary

• Content Analysis: transforming raw text into more

computationally useful forms

• Words in text collections exhibit interesting

statistical properties

– Word frequencies have a Zipf distribution

– Word co-occurrences exhibit dependencies

• Text documents are transformed to vectors

– Pre-processing includes tokenization, stemming,

collocations/phrases

[Figure: the IR pipeline diagram, with the callout “How is the index constructed?”]

The section that follows is about Index Construction.

Inverted Index

• This is the primary data structure for text indexes

• Main Idea:

– Invert documents into a big index

• Basic steps:

– Make a “dictionary” of all the tokens in the collection

– For each token, list all the docs it occurs in.

– Do a few things to reduce redundancy in the data structure

Inverted Indexes

We have seen “Vector files” conceptually. An

Inverted File is a vector file “inverted” so

that rows become columns and columns

become rows

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0
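
The "rows become columns" step is a matrix transpose. A minimal sketch, using the first seven documents of the table above as input (the function name is hypothetical):

```python
def invert(doc_term_rows):
    """Transpose a document-by-term matrix into a term-by-document matrix."""
    return [list(column) for column in zip(*doc_term_rows)]


if __name__ == "__main__":
    docs = [  # rows D1..D7, columns t1..t3
        [1, 0, 1],
        [1, 0, 0],
        [0, 1, 1],
        [1, 0, 0],
        [1, 1, 1],
        [1, 1, 0],
        [0, 1, 0],
    ]
    for term_row in invert(docs):
        print(term_row)
```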

How Are Inverted Files Created

• Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2

How Inverted Files are Created

• After all documents have been parsed the inverted file is sorted alphabetically.

Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2

How Inverted Files are Created

• Multiple term entries for a single document are merged.

• Within-document term frequency information is compiled.

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2

How Inverted Files are Created



• Then the file can be split into

– A Dictionary file

and

– A Postings file
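
The steps on these slides (parse, sort, merge duplicates with frequencies, split into dictionary and postings) can be sketched as below. This is a minimal in-memory illustration; the tokenizer regex and the function name `build_index` are assumptions, and a real system would write both structures to disk.

```python
import re
from collections import Counter, defaultdict


def build_index(docs):
    """docs maps doc_id -> text. Returns (dictionary, postings).

    dictionary: term -> (number of docs, total frequency)
    postings:   term -> list of (doc_id, within-document frequency)
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        tokens = re.findall(r"[a-z0-9]+", docs[doc_id].lower())
        # Counter merges repeated terms within one document.
        for term, freq in Counter(tokens).items():
            postings[term].append((doc_id, freq))
    dictionary = {
        term: (len(plist), sum(freq for _, freq in plist))
        for term, plist in sorted(postings.items())  # alphabetical order
    }
    return dictionary, dict(postings)


if __name__ == "__main__":
    docs = {
        1: "Now is the time for all good men to come to the aid of their country",
        2: "It was a dark and stormy night in the country manor. "
           "The time was past midnight",
    }
    dictionary, postings = build_index(docs)
    print(dictionary["the"], postings["the"])
```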

How Inverted Files are Created

Dictionary                        Postings
Term      N docs  Tot Freq        (Doc #, Freq)
a         1       1               (2, 1)
aid       1       1               (1, 1)
all       1       1               (1, 1)
and       1       1               (2, 1)
come      1       1               (1, 1)
country   2       2               (1, 1) (2, 1)
dark      1       1               (2, 1)
for       1       1               (1, 1)
good      1       1               (1, 1)
in        1       1               (2, 1)
is        1       1               (1, 1)
it        1       1               (2, 1)
manor     1       1               (2, 1)
men       1       1               (1, 1)
midnight  1       1               (2, 1)
night     1       1               (2, 1)
now       1       1               (1, 1)
of        1       1               (1, 1)
past      1       1               (2, 1)
stormy    1       1               (2, 1)
the       2       4               (1, 2) (2, 2)
their     1       1               (1, 1)
time      2       2               (1, 1) (2, 1)
to        1       2               (1, 2)
was       1       2               (2, 2)

Inverted Indexes

• Permit fast search for individual terms

• For each term, you get a list consisting of:

– document ID

– frequency of term in doc (optional)

– position of term in doc (optional)

• These lists can be used to solve Boolean queries:

• country -> d1, d2

• manor -> d2

• country AND manor -> d2

• Also used for statistical ranking algorithms
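
Solving a Boolean AND query is a set intersection over posting lists. The sketch below assumes postings are held as Python sets of document IDs; the variable and function names are hypothetical.

```python
def boolean_and(index, *terms):
    """Return the IDs of documents that contain every one of the given terms."""
    posting_sets = [index.get(term, set()) for term in terms]
    if not posting_sets:
        return set()
    return set.intersection(*posting_sets)


def boolean_or(index, *terms):
    """Return the IDs of documents that contain at least one of the given terms."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result


if __name__ == "__main__":
    index = {"country": {1, 2}, "manor": {2}, "time": {1, 2}, "dark": {2}}
    print(boolean_and(index, "country", "manor"))
    print(boolean_or(index, "country", "manor"))
```

Production systems usually keep posting lists sorted and merge them in one pass, starting with the shortest list, rather than building sets.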

How Inverted Files are Used

Query on “time” AND “dark”

Dictionary                    Postings
Term   N docs  Tot Freq       (Doc #, Freq)
dark   1       1              (2, 1)
time   2       2              (1, 1) (2, 1)

• 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file
• 1 doc with “dark” in dictionary -> ID 2 from posting file

Therefore, only doc 2 satisfied the query.

[Figure: the IR pipeline diagram, with the callout “How is the index constructed?”]

The section that follows is about Querying (and ranking).

Simple query language: Boolean

– Terms + Connectors (or operators)
– terms
• words
• normalized (stemmed) words
• phrases
– connectors
• AND
• OR
• NOT
• NEAR (Pseudo Boolean)

Word     Doc
Cat      x
Dog
Collar   x
Leash

Boolean Queries

• Cat

• Cat OR Dog

• Cat AND Dog

• (Cat AND Dog)

• (Cat AND Dog) OR Collar

• (Cat AND Dog) OR (Collar AND Leash)

• (Cat OR Dog) AND (Collar OR Leash)

Boolean Searching

Information need: “Measurement of the width of cracks in prestressed concrete beams”

Formal Query:
cracks AND beams AND Width_measurement AND Prestressed_concrete

[Figure: Venn diagram of four overlapping sets: Cracks, Beams, Width measurement, Prestressed concrete.]

Relaxed Query:
(C AND B AND P) OR
(C AND B AND W) OR
(C AND W AND P) OR
(B AND W AND P)

Ordering of Retrieved Documents

• Pure Boolean has no ordering

• In practice:

– order chronologically

– order by total number of “hits” on query terms

• What if one term has more hits than others?

• Is it better to have one of each term or many of one term?

Boolean Model

• Advantages

– simple queries are easy to understand

– relatively easy to implement

• Disadvantages

– difficult to specify what is wanted

– too much returned, or too little

– ordering not well determined

• Dominant language in commercial Information

Retrieval systems until the WWW





Since the Boolean model is limited, let's consider a generalization…

Vector Model

• Documents are represented as “bags of words”

• Represented as vectors when used computationally

– A vector is like an array of floating-point numbers

– Has direction and magnitude

– Each vector holds a place for every term in the collection

– Therefore, most vectors are sparse







• Smithers secretly loves Monty Burns

• Monty Burns secretly loves Smithers

Both map to…

[ Burns, loves, Monty, secretly, Smithers]
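
The loss of word order can be demonstrated in a few lines. This sketch assumes whitespace tokenization and lowercasing, and uses `collections.Counter` as the bag.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to its bag of words: token -> count, with word order discarded."""
    return Counter(text.lower().split())


if __name__ == "__main__":
    a = bag_of_words("Smithers secretly loves Monty Burns")
    b = bag_of_words("Monty Burns secretly loves Smithers")
    print(a == b)      # the two sentences are indistinguishable as bags
    print(sorted(a))
```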

Document Vectors

One location for each word

Document ids



nova galaxy heat h’wood film role diet fur

A 10 5 3

B 5 10

C 10 8 7

D 9 10 5

E 10 10

F 9 10

G 5 7 9

H 6 10 2 8

I 7 5 1 3

We Can Plot the Vectors

[Figure: documents plotted against two term axes, Star and Diet. Labeled points: a doc about movie stars, a doc about astronomy, and a doc about mammal behavior.]

Documents in 3D Vector Space

[Figure: documents D1 to D11 plotted in a 3D space with term axes t1, t2 and t3. Illustration from Jurafsky & Martin.]

Vector Space Model

Note that the query is projected into the same vector space as the documents.

The query here is for “Marge”.

We can use a vector similarity model to determine the best match to our query (details in a few slides).

But what weights should we use for the terms?

docs Homer Marge Bart
D1 * *
D2 *
D3 * *
D4 *
D5 * * *
D6 * *
D7 *
D8 *
D9 *
D10 * *
D11 * *
Q *

Assigning Weights to Terms

• Binary Weights

• Raw term frequency

• tf x idf

– Recall the Zipf distribution

– Want to weight terms highly if they are

• frequent in relevant documents … BUT

• infrequent in the collection as a whole

Binary Weights

• Only the presence (1) or absence (0) of a

term is included in the vector

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1

We have already seen and discussed this model.

Raw Term Weights

• The frequency of occurrence for the term in

each document is included in the vector

docs  t1  t2  t3
D1    2   0   3
D2    1   0   0
D3    0   4   7
D4    3   0   0
D5    1   6   3
D6    3   5   0
D7    0   8   0
D8    0   10  0
D9    0   0   1
D10   0   3   5
D11   4   0   1

This model is open to exploitation by websites that repeat a keyword many times, e.g. “sex sex sex sex sex …”.

Counts can be normalized by document lengths.

tf * idf Weights

• tf * idf measure:

– term frequency (tf)

– inverse document frequency (idf) -- a way to

deal with the problems of the Zipf distribution

• Goal: assign a tf * idf weight to each term

in each document
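
A minimal sketch of this weight, w = tf * log(N / n), where N is the number of documents in the collection and n the number that contain the term. The base-10 logarithm is an assumption here (any base gives the same ranking up to a constant factor), and the function names are hypothetical.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency log10(N / n); doc_freq must be at least 1."""
    return math.log10(n_docs / doc_freq)


def tf_idf(term_freq, n_docs, doc_freq):
    """Weight of a term that occurs term_freq times in one document."""
    return term_freq * idf(n_docs, doc_freq)


if __name__ == "__main__":
    # A term in every document gets weight 0 however often it occurs.
    print(tf_idf(25, 10000, 10000))
    # A rare term gets a high weight even with a small term frequency.
    print(tf_idf(3, 10000, 20))
```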

tf * idf

$w_{ik} = tf_{ik} \cdot \log(N / n_k)$

$T_k$ = term k in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in C
$N$ = total number of documents in the collection C
$n_k$ = the number of documents in C that contain $T_k$

$idf_k = \log\left(\frac{N}{n_k}\right)$

Inverse Document Frequency

• IDF provides high values for rare words and
low values for common words

$idf_k = \log\left(\frac{N}{n_k}\right)$

For a collection of 10000 documents (base-10 logarithm):

$\log\left(\frac{10000}{10000}\right) = 0$

$\log\left(\frac{10000}{5000}\right) = 0.301$

$\log\left(\frac{10000}{20}\right) = 2.699$

$\log\left(\frac{10000}{1}\right) = 4$

Similarity Measures

Simple matching (coordination level match): $|Q \cap D|$

Dice’s Coefficient: $2 \frac{|Q \cap D|}{|Q| + |D|}$

Jaccard’s Coefficient: $\frac{|Q \cap D|}{|Q \cup D|}$

Cosine Coefficient: $\frac{|Q \cap D|}{|Q|^{1/2} \, |D|^{1/2}}$

Overlap Coefficient: $\frac{|Q \cap D|}{\min(|Q|, |D|)}$
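
For binary (set-valued) queries and documents, all five coefficients can be computed from the set sizes. A minimal sketch, assuming both sets are non-empty; the function name is hypothetical.

```python
import math


def similarity_measures(query_terms, doc_terms):
    """Set-based similarity coefficients between a query and a document."""
    q, d = set(query_terms), set(doc_terms)
    common = len(q & d)
    return {
        "simple": common,
        "dice": 2 * common / (len(q) + len(d)),
        "jaccard": common / len(q | d),
        "cosine": common / math.sqrt(len(q) * len(d)),
        "overlap": common / min(len(q), len(d)),
    }


if __name__ == "__main__":
    print(similarity_measures({"cat", "dog"}, {"dog", "collar"}))
```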

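Cosine

$D_1 = (0.8, 0.3)$
$D_2 = (0.2, 0.7)$
$Q = (0.4, 0.8)$

$\cos \alpha_1 = 0.73$
$\cos \alpha_2 = 0.98$

[Figure: Q, D1 and D2 drawn as vectors in the unit square (both axes from 0 to 1.0); $\alpha_1$ is the angle between Q and D1, and $\alpha_2$ is the angle between Q and D2.]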
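
The cosine of the angle between a query vector and a document vector is their dot product divided by the product of their lengths. A minimal sketch for dense vectors of equal length (the function name is hypothetical), applied to the example vectors of this slide:

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    q = (0.4, 0.8)
    d1 = (0.8, 0.3)
    d2 = (0.2, 0.7)
    print(cosine_similarity(q, d1))
    print(cosine_similarity(q, d2))  # larger: D2 points in nearly Q's direction
```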

Vector Space Similarity Measure

$D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})$

$Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$, with $w = 0$ if a term is absent

If term weights are normalized:

$sim(Q, D_i) = \sum_{j=1}^{t} w_{qj} \cdot w_{d_{ij}}$

Otherwise, normalize in the similarity comparison:

$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \cdot w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{qj})^2 \cdot \sum_{j=1}^{t} (w_{d_{ij}})^2}}$

Problems with Vector Space

• There is no real theoretical basis for the

assumption of a term space

– it is more for visualization than having any real

basis

– most similarity measures work about the same

regardless of model

• Terms are not really orthogonal dimensions

– Terms are not independent of all other terms

Probabilistic Models

• Rigorous formal model attempts to predict

the probability that a given document will

be relevant to a given query

• Ranks retrieved documents according to this

probability of relevance (Probability

Ranking Principle)

• Rely on accurate estimates of probabilities

Relevance Feedback

• Main Idea:

– Modify existing query based on relevance judgements

• Query Expansion: Extract terms from relevant documents and

add them to the query

• Term Re-weighting: and/or re-weight the terms already in the

query

– Two main approaches:

• Automatic (pseudo-relevance feedback)

• Users select relevant documents

– Users/system select terms from an automatically-

generated list

Definition: Relevance Feedback is the reformulation of a search query in response

to feedback provided by the user for the results of previous versions of the query.



Suppose you are interested in bovine agriculture on

the banks of the river Jordan…



Term Vector [Jordan , Bank, Bull, River]

Term Weights [ 1 , 1 , 1 , 1 ]





Search

Display Results

Gather Feedback

Update Weights

Term Vector [Jordan , Bank, Bull, River]

Term Weights [ 1.1 , 0.1 , 1.3 , 1.2 ]

Rocchio Method

$Q_1 = Q_0 + \beta \sum_{i=1}^{n_1} \frac{R_i}{n_1} - \gamma \sum_{i=1}^{n_2} \frac{S_i}{n_2}$

where
$Q_0$ = the vector for the initial query
$R_i$ = the vector for the relevant document i
$S_i$ = the vector for the non-relevant document i
$n_1$ = the number of relevant documents chosen
$n_2$ = the number of non-relevant documents chosen
$\beta$ and $\gamma$ tune the importance of relevant and nonrelevant terms
(in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25)
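
The update can be sketched for dense vectors as follows. The parameter names `beta` and `gamma` and the defaults 0.75 and 0.25 follow the weights mentioned on this slide; the function name and the handling of an empty feedback set (treated as a zero vector) are assumptions.

```python
def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """Move the query toward the mean relevant vector and away from the
    mean non-relevant vector."""

    def centroid(vectors):
        if not vectors:  # assumption: no feedback contributes nothing
            return [0.0] * len(q0)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    r = centroid(relevant)
    s = centroid(nonrelevant)
    return [q + beta * ri - gamma * si for q, ri, si in zip(q0, r, s)]


if __name__ == "__main__":
    # Terms: [Jordan, Bank, Bull, River]
    q0 = [1, 1, 1, 1]
    relevant = [[1, 0, 1, 1]]     # a judged-relevant document
    nonrelevant = [[0, 1, 0, 0]]  # a judged non-relevant document
    print(rocchio(q0, relevant, nonrelevant))
```

Many implementations also clip negative weights to zero after the update; that step is omitted here.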

Rocchio Illustration

Although we usually work in vector space for text, it is

easier to visualize Euclidean space









Original Query Term Re-weighting Query Expansion

Note that both the location of

the center, and the shape of

the query have changed

Rocchio Method

• Rocchio automatically

– re-weights terms

– adds in new terms (from relevant docs)

• Most methods perform similarly

– results heavily dependent on test collection

• Machine learning methods are proving to

work better than standard IR approaches

like Rocchio

Using Relevance Feedback

• Known to improve results

• People don’t seem to like giving feedback!

[Figure: the IR pipeline diagram, repeated.]

The section that follows is about Evaluation.

Evaluation

• Why Evaluate?

• What to Evaluate?

• How to Evaluate?

Why Evaluate?

• Determine if the system is desirable

• Make comparative assessments

What to Evaluate?

• How much of the information need is

satisfied.

• How much was learned about a topic.

• Incidental learning:

– How much was learned about the collection.

– How much was learned about other topics.

• How inviting the system is.

What to Evaluate?

What can be measured that reflects users’ ability

to use system? (Cleverdon 66)

– Coverage of Information

– Form of Presentation

– Effort required/Ease of Use

– Time and Space Efficiency

– Recall (effectiveness)
• proportion of relevant material actually retrieved
– Precision (effectiveness)
• proportion of retrieved material actually relevant

Relevant vs. Retrieved





[Figure: Venn diagram of the Retrieved set and the Relevant set inside the set of all docs.]

Precision vs. Recall

$Precision = \frac{|RelRetrieved|}{|Retrieved|}$

$Recall = \frac{|RelRetrieved|}{|Rel\ in\ Collection|}$

[Figure: Venn diagram of the Retrieved set and the Relevant set inside the set of all docs.]
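
Both measures follow directly from the two sets. A minimal sketch, assuming documents are identified by hashable IDs; the function name is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    print(precision_recall({1, 2, 3, 4}, {2, 4, 6}))
```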

Why Precision and Recall?

Intuition:



Get as much good stuff while at the same time getting

as little junk as possible.

Retrieved vs. Relevant Documents

[Figure series: four diagrams showing the retrieved set against the relevant set.]

• Very high precision, very low recall
• Very low precision, very low recall (0 in fact)
• High recall, but low precision
• High precision, high recall (at last!)

Precision/Recall Curves

• There is a tradeoff between Precision and Recall

• So measure Precision at different levels of Recall

• Note: this is an AVERAGE over MANY queries



[Figure: precision plotted against recall, with measured points marked x.]

Precision/Recall Curves

• Difficult to determine which of these two hypothetical

results is better:







[Figure: precision plotted against recall for two hypothetical systems, with measured points marked x.]

Document Cutoff Levels

• Another way to evaluate:

– Fix the number of documents retrieved at several levels:

• top 5

• top 10

• top 20

• top 50

• top 100

• top 500

– Measure precision at each of these levels

– Take (weighted) average over results

• This is a way to focus on how well the system ranks the

first k documents.
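
Precision at a cutoff k can be sketched as follows, assuming the system returns a ranked list of document IDs and that the denominator is k even when fewer than k documents were returned (conventions differ on this point). The function name is hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


if __name__ == "__main__":
    ranking = [3, 1, 7, 5, 9]
    relevant = {1, 5, 8}
    for k in (1, 2, 5):
        print(k, precision_at_k(ranking, relevant, k))
```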

Problems with Precision/Recall

• Can’t know true recall value

– except in small collections

• Precision/Recall are related

– A combined measure sometimes more appropriate

• Assumes batch mode

– Interactive IR is important and has different criteria for

successful searches

– Assumes a strict rank ordering matters.

Relation to Contingency Table

                       Doc is Relevant          Doc is NOT relevant
Doc is retrieved       a = N(ret, rel)          b = N(ret, not rel)
Doc is NOT retrieved   c = N(not ret, rel)      d = N(not ret, not rel)



• Accuracy: (a+d) / (a+b+c+d)

• Precision: a/(a+b)

• Recall: a/(a+c)

• Why don’t we use Accuracy for IR?

– (Assuming a large collection)

– Most docs aren’t relevant

– Most docs aren’t retrieved

– Inflates the accuracy value
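
A small numeric illustration of the point above. The counts are made up for the example: a collection of 10000 documents with 10 relevant ones, and a system that retrieves nothing at all.

```python
def contingency_metrics(a, b, c, d):
    """Accuracy, precision and recall from the contingency table counts
    a (retrieved, relevant), b (retrieved, not relevant),
    c (not retrieved, relevant), d (not retrieved, not relevant)."""
    total = a + b + c + d
    return {
        "accuracy": (a + d) / total,
        "precision": a / (a + b) if (a + b) else 0.0,
        "recall": a / (a + c) if (a + c) else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical counts: the system returns no documents.
    print(contingency_metrics(a=0, b=0, c=10, d=9990))
```

The useless system scores an accuracy of 0.999 while its precision and recall are both 0.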

The E-Measure

Combine Precision and Recall into one number (van
Rijsbergen 79)

$E = 1 - \frac{1 + b^2}{\frac{b^2}{R} + \frac{1}{P}}$

P = precision

R = recall

b = measure of relative importance of P or R



For example,

b = 0.5 means user is twice as interested in

precision as recall
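
The E-measure can be computed directly from P, R and b. In this sketch, returning 1.0 (the worst value) when either precision or recall is zero is an assumption made to avoid division by zero; the function name is hypothetical.

```python
def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E = 1 - (1 + b^2) / (b^2 / R + 1 / P). Lower is better."""
    if precision == 0 or recall == 0:
        return 1.0  # assumption: treat a zero P or R as the worst case
    return 1.0 - (1.0 + b * b) / (b * b / recall + 1.0 / precision)


if __name__ == "__main__":
    print(e_measure(0.5, 0.5))          # equal weight on P and R
    print(e_measure(0.9, 0.3, b=0.5))   # b < 1 emphasizes precision
    print(e_measure(0.3, 0.9, b=0.5))   # same pair swapped scores worse
```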

How to Evaluate?

Test Collections

TREC

• Text REtrieval Conference/Competition

– Run by NIST (National Institute of Standards & Technology)

– 2004 (November) will be 13th year

• Collection: >6 Gigabytes (5 CD-ROMs), >1.5

Million Docs

– Newswire & full text news (AP, WSJ, Ziff, FT)

– Government documents (federal register, Congressional

Record)

– Radio Transcripts (FBIS)

– Web “subsets”

TREC (cont.)

• Queries + Relevance Judgments

– Queries devised and judged by “Information Specialists”

– Relevance judgments done only for those documents

retrieved -- not entire collection!

• Competition

– Various research and commercial groups compete (TREC

6 had 51, TREC 7 had 56, TREC 8 had 66)

– Results judged on precision and recall, going up to a

recall level of 1000 documents

TREC

• Benefits:

– made research systems scale to large collections (pre-

WWW)

– allows for somewhat controlled comparisons

• Drawbacks:

– emphasis on high recall, which may be unrealistic for

what most users want

– very long queries, also unrealistic

– comparisons still difficult to make, because systems are

quite different on many dimensions

– focus on batch ranking rather than interaction

– no focus on the WWW

TREC is changing

• Emphasis on specialized “tracks”

– Interactive track

– Natural Language Processing (NLP) track

– Multilingual tracks (Chinese, Spanish)

– Filtering track

– High-Precision

– High-Performance

• http://trec.nist.gov/

Homework…


