Text Mining



Dr. Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
eamonn@cs.ucr.edu

Text Mining / Information Retrieval
• Task Statement:

  Build a system that retrieves documents that users are likely to find relevant to their queries.

• This assumption underlies the field of Information Retrieval.
[Figure: overview of the IR pipeline. An information need is expressed as text input, which is parsed into a query; the document collection is pre-processed and indexed; the query is run against the index, results are ranked, and the system is evaluated. Annotations ask: How is the query constructed? How is the text processed?]
Terminology

Token: A natural language word, e.g., “Swim”, “Simpson”, “92513”, etc.

Document: Usually a web page, but more generally any file.
Some IR History
– Roots in the scientific “Information Explosion” following WWII
– Interest in computer-based IR from the mid-1950s
   • H.P. Luhn at IBM (1958)
   • Probabilistic models at RAND (Maron & Kuhns) (1960)
   • Boolean system development at Lockheed (1960s)
   • Vector Space Model (Salton at Cornell, 1965)
   • Statistical weighting methods and theoretical advances (1970s)
   • Refinements and advances in application (1980s)
   • User interfaces, large-scale testing and application (1990s)
                  Relevance
• In what ways can a document be relevant
  to a query?
  – Answer precise question precisely.
       – Who is Homer’s Boss? Montgomery Burns.
  – Partially answer question.
       – Where does Homer work? Power Plant.
  – Suggest a source for more information.
       – What is Bart’s middle name? Look in Issue 234 of
         Fanzine
  – Give background information.
  – Remind the user of other knowledge.
  – Others ...
[Figure: the IR pipeline again, highlighting the Pre-process step: how is the query constructed, and how is the text processed?]

The section that follows is about Content Analysis
(transforming raw text into a computationally more manageable form)
Document Processing Steps

[Figure from Baeza-Yates & Ribeiro-Neto.]
Stemming and Morphological Analysis

• Goal: “normalize” similar words
• Morphology (“form” of words)
  – Inflectional Morphology
     • E.g., inflect verb endings and noun number
     • Never changes grammatical class
        – dog, dogs
        – Bike, Biking
        – Swim, Swimmer, Swimming

What about… build, building?
Examples of Stemming (using Porter's algorithm)

   Original Words    Stemmed Words
   …                 …
   consign           consign
   consigned         consign
   consigning        consign
   consignment       consign
   consist           consist
   consisted         consist
   consistency       consist
   consistent        consist
   consistently      consist
   consisting        consist
   consists          consist
   …

Porter's algorithm is available in Java, C, Lisp, Perl, Python, etc. from
http://www.tartarus.org/~martin/PorterStemmer/
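As a quick sanity check, a minimal sketch using NLTK's port of Porter's algorithm (assuming the nltk package is installed; it is one of the many implementations linked above):

```python
# A minimal sketch, assuming the nltk package is installed
# (pip install nltk). NLTK ships a port of Porter's algorithm.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["consign", "consigned", "consigning", "consignment",
         "consist", "consisted", "consistency", "consistent"]
for w in words:
    print(w, "->", stemmer.stem(w))  # e.g. consigned -> consign
```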
Errors Generated by the Porter Stemmer (Krovetz 93)

   Too Aggressive           Too Timid
   organization/organ       european/europe
   policy/police            cylinder/cylindrical
   execute/executive        create/creation
   arm/army                 search/searcher
 Statistical Properties of Text
• Token occurrences in text are not
  uniformly distributed
• They are also not normally
  distributed
• They do exhibit a Zipf distribution
Government documents, 157734 tokens, 32259 unique


 8164 the       969 on            1 ABC
 4771 of        915 FT            1 ABFT
 4005 to        883 Mr            1 ABOUT
 2834 a         860 was           1 ACFT
 2827 and       855 be            1 ACI
 2802 in        849 Pounds        1 ACQUI
 1592 The       798 TEXT          1 ACQUISITIONS
 1370 for       798 PUB           1 ACSIS
 1326 is        798 PROFILE       1 ADFT
 1324 s         798 PAGE          1 ADVISERS
 1194 that      798 HEADLINE      1 AE
  973 by        798 DOCNO
Plotting Word Frequency by Rank
• Main idea: count how many times each token occurs, over all texts in the collection
• Now sort the tokens by how often they occur; a token's position in this ordering is called its rank.
   The Corresponding Zipf Curve
Rank   Freq
1      37     system
2      32     knowledg
3      24     base
4      20     problem
5      18     abstract
6      15     model
7      15     languag
8      15     implem
9      13     reason
10     13      inform
11     11      expert
12     11      analysi
13     10      rule
14     10      program
15     10      oper
16     10      evalu
17     10      comput
18     10      case
19     9      gener
20     9      form
          Zipf Distribution

• The Important Points:
  – a few elements occur very frequently
  – a medium number of elements have medium
    frequency
  – many elements occur very infrequently
                Zipf Distribution
• The product of the frequency of words (f) and
  their rank (r) is approximately constant
   – Rank = order of words’ frequency of occurrence



                       f  C 1 / r
                       C  N / 10
• Another way to state this is with an approximately correct
  rule of thumb:
   –   Say the most common term occurs C times
   –   The second most common occurs C/2 times
   –   The third most common occurs C/3 times
   –   …
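A minimal sketch of how this rule of thumb can be checked on a corpus ("corpus.txt" is a placeholder file name):

```python
# A minimal sketch: count token frequencies, rank them, and check
# that freq * rank stays roughly constant (Zipf's rule of thumb).
# "corpus.txt" is a placeholder for any large text file.
from collections import Counter

with open("corpus.txt") as f:
    tokens = f.read().lower().split()

counts = Counter(tokens)
for rank, (token, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:>4} {token:<12} freq={freq:<6} freq*rank={freq * rank}")
# For Zipfian text the freq*rank column is roughly constant (~N/10).
```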
Zipf Distribution (linear and log scale)

[Figure: the Zipf curve plotted on linear and on log scales. Illustration by Jacob Nielsen.]
    What Kinds of Data Exhibit a
         Zipf Distribution?
• Words in a text collection
    – Virtually any language usage
•   Library book checkout patterns
•   Incoming Web Page Requests
•   Outgoing Web Page Requests
•   Document Size on Web
•   City Sizes
•   …
         Consequences of Zipf
• There are always a few very frequent tokens
  that are not good discriminators.
  – Called “stop words” in IR
     • English examples: to, from, on, and, the, ...
• There are always a large number of tokens
  that occur once and can mess up algorithms.
• Medium-frequency words are the most descriptive
 Word Frequency vs. Resolving
  Power (from van Rijsbergen 79)
The most frequent words are not the most descriptive.
Statistical Independence

Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together:

          P(x) P(y) = P(x, y)
      Statistical Independence
          and Dependence
• What are examples of things that are
  statistically independent?

• What are examples of things that are
  statistically dependent?
             Lexical Associations
• Subjects write first word that comes to mind
   – doctor/nurse; black/white (Palermo & Jenkins 64)
• Text Corpora yield similar associations
• One measure: Mutual Information (Church and Hanks 89)


                  I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]

• If word occurrences were independent, the numerator and
  denominator would be equal (if measured across a large
  collection)
Statistical Independence
• Compute for a window of words (e.g., overlapping windows w1, w11, w21, … sliding over the text)

P(x) P(y) = P(x, y) if independent

P(x) = f(x) / N

We'll approximate P(x, y) as follows:

P(x, y) = (1/N) Σ_{i=1}^{N−|w|} w_i(x, y)

|w| = length of window w (say 5)
w_i = words within the window starting at position i
w_i(x, y) = number of times x and y co-occur in w_i
N = number of words in the collection
  Interesting Associations with “Doctor”
(AP Corpus, N=15 million, Church & Hanks 89)

 I(x,y)   f(x,y)   f(x)   x          f(y)   y
 11.3     12       111    Honorary   621    Doctor
 11.3     8        1105   Doctors    44     Dentists
 10.7     30       1105   Doctors    241    Nurses
 9.4      8        1105   Doctors    154    Treating
 9.0      6        275    Examined 621      Doctor
 8.9      11       1105   Doctors    317    Treat
 8.7      25       621    Doctor     1407   Bills
Un-Interesting Associations with
            “Doctor”
 (AP Corpus, N=15 million, Church & Hanks 89)

I(x,y)   f(x,y)   f(x)     x        f(y)    y
0.96     6        621      doctor   73785   with
0.95     41       284690   a        1105    doctors
0.93     12       84716    is       1105    doctors


These associations were likely to happen because
the non-doctor words shown here are very common
and therefore likely to co-occur with any noun.
Associations Are Important Because…

• We may be able to discover phrases that should be treated as a single word, e.g., “data mining”.

• We may be able to automatically discover synonyms, e.g., “Bike” and “Bicycle”.
    Content Analysis Summary
• Content Analysis: transforming raw text into more
  computationally useful forms
• Words in text collections exhibit interesting
  statistical properties
   – Word frequencies have a Zipf distribution
   – Word co-occurrences exhibit dependencies
• Text documents are transformed to vectors
   – Pre-processing includes tokenization, stemming,
     collocations/phrases
[Figure: the IR pipeline again, highlighting the Index step: how is the index constructed?]

The section that follows is about

Index Construction
                 Inverted Index
• This is the primary data structure for text indexes
• Main Idea:
   – Invert documents into a big index
• Basic steps:
   – Make a “dictionary” of all the tokens in the collection
   – For each token, list all the docs it occurs in.
   – Do a few things to reduce redundancy in the data structure
How Are Inverted Files Created

• Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

   Term       Doc #        Term       Doc #
   now          1          it           2
   is           1          was          2
   the          1          a            2
   time         1          dark         2
   for          1          and          2
   all          1          stormy       2
   good         1          night        2
   men          1          in           2
   to           1          the          2
   come         1          country      2
   to           1          manor        2
   the          1          the          2
   aid          1          time         2
   of           1          was          2
   their        1          past         2
   country      1          midnight     2
How Inverted Files are Created

• After all documents have been parsed, the inverted file is sorted alphabetically.

   Unsorted              Sorted
   Term       Doc #      Term       Doc #
   now          1        a            2
   is           1        aid          1
   the          1        all          1
   time         1        and          2
   for          1        come         1
   all          1        country      1
   good         1        country      2
   men          1        dark         2
   to           1        for          1
   come         1        good         1
   to           1        in           2
   the          1        is           1
   aid          1        it           2
   of           1        manor        2
   their        1        men          1
   country      1        midnight     2
   it           2        night        2
   was          2        now          1
   a            2        of           1
   dark         2        past         2
   and          2        stormy       2
   stormy       2        the          1
   night        2        the          1
   in           2        the          2
   the          2        the          2
   country      2        their        1
   manor        2        time         1
   the          2        time         2
   time         2        to           1
   was          2        to           1
   past         2        was          2
   midnight     2        was          2
How Inverted Files are Created

• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.

   Term       Doc #   Freq
   a            2      1
   aid          1      1
   all          1      1
   and          2      1
   come         1      1
   country      1      1
   country      2      1
   dark         2      1
   for          1      1
   good         1      1
   in           2      1
   is           1      1
   it           2      1
   manor        2      1
   men          1      1
   midnight     2      1
   night        2      1
   now          1      1
   of           1      1
   past         2      1
   stormy       2      1
   the          1      2
   the          2      2
   their        1      1
   time         1      1
   time         2      1
   to           1      2
   was          2      2
  How Inverted Files are Created

• Then the file can be split into
  – A Dictionary file
  and
  – A Postings file
How Inverted Files are Created

   Dictionary                       Postings
   Term      N docs  Tot Freq       (Doc #, Freq)
   a           1        1           (2, 1)
   aid         1        1           (1, 1)
   all         1        1           (1, 1)
   and         1        1           (2, 1)
   come        1        1           (1, 1)
   country     2        2           (1, 1) (2, 1)
   dark        1        1           (2, 1)
   for         1        1           (1, 1)
   good        1        1           (1, 1)
   in          1        1           (2, 1)
   is          1        1           (1, 1)
   it          1        1           (2, 1)
   manor       1        1           (2, 1)
   men         1        1           (1, 1)
   midnight    1        1           (2, 1)
   night       1        1           (2, 1)
   now         1        1           (1, 1)
   of          1        1           (1, 1)
   past        1        1           (2, 1)
   stormy      1        1           (2, 1)
   the         2        4           (1, 2) (2, 2)
   their       1        1           (1, 1)
   time        2        2           (1, 1) (2, 1)
   to          1        2           (1, 2)
   was         1        2           (2, 2)
              Inverted Indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
   – document ID
   – frequency of term in doc (optional)
   – position of term in doc (optional)
• These lists can be used to solve Boolean queries:
      • country -> d1, d2
      • manor -> d2
      • country AND manor -> d2
• Also used for statistical ranking algorithms
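A minimal sketch of the whole construction, using the two example documents from the slides (the plain-dictionary representation is an illustrative simplification):

```python
# A minimal sketch of inverted-index construction and a Boolean AND
# query, using the two example documents from the slides.
from collections import defaultdict, Counter

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

# index[term][doc_id] = within-document frequency
index = defaultdict(Counter)
for doc_id, text in docs.items():
    for token in text.split():
        index[token][doc_id] += 1

def boolean_and(t1, t2):
    """Docs containing both terms: intersect the two postings lists."""
    return sorted(index[t1].keys() & index[t2].keys())

print(sorted(index["country"]))          # [1, 2]
print(sorted(index["manor"]))            # [2]
print(boolean_and("country", "manor"))   # [2]
```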
How Inverted Files are Used

Query on “time” AND “dark”
(using the Dictionary and Postings files built above)

• 2 docs with “time” in the dictionary -> IDs 1 and 2 from the postings file
• 1 doc with “dark” in the dictionary -> ID 2 from the postings file

Therefore, only doc 2 satisfies the query.
[Figure: the IR pipeline again, highlighting the Query and Rank steps.]

The section that follows is about

Querying (and ranking)
Simple query language: Boolean
– Terms + Connectors (or operators)
– terms
   • words
   • normalized (stemmed) words
   • phrases
– connectors
   • AND
   • OR
   • NOT
   • NEAR (Pseudo Boolean)

Example: a document containing “Cat” and “Collar” but not “Dog” or “Leash”:

   Word     In Doc?
   Cat        x
   Dog
   Collar     x
   Leash
           Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
   – Each of the following combinations works:

              1   2   3   4   5   6   7
     Cat      x   x       x   x
     Dog          x   x   x       x   x
     Collar   x           x       x   x
     Leash        x   x       x       x
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
   – None of the following combinations works:

              1   2   3   4   5   6
     Cat      x       x
     Dog      x           x
     Collar       x           x
     Leash        x               x
Boolean Searching

Information need: “Measurement of the width of cracks in prestressed concrete beams”

Formal Query:
cracks AND beams AND Width_measurement AND Prestressed_concrete

[Venn diagram of the four concepts: Cracks, Beams, Width measurement, Prestressed concrete.]

Relaxed Query:
(C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
 Ordering of Retrieved Documents
• Pure Boolean has no ordering
• In practice:
  – order chronologically
  – order by total number of “hits” on query terms
     • What if one term has more hits than others?
     • Is it better to have one of each term, or many of one term?
                    Boolean Model
   • Advantages
       – simple queries are easy to understand
       – relatively easy to implement
   • Disadvantages
       – difficult to specify what is wanted
       – too much returned, or too little
       – ordering not well determined
   • Dominant language in commercial Information
     Retrieval systems until the WWW


Since the Boolean model is limited, let's consider a generalization…
Vector Model
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
   – A vector is like an array of floating-point numbers
   – Has direction and magnitude
   – Each vector holds a place for every term in the collection
   – Therefore, most vectors are sparse

  • Smithers secretly loves Monty Burns
  • Monty Burns secretly loves Smithers

Both map to…
[ Burns, loves, Monty, secretly, Smithers ]
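A minimal sketch of this mapping (the vocabulary and tokenization are simplified for illustration):

```python
# A minimal sketch: both sentences map to the same bag-of-words
# vector because word order is discarded.
def bag_of_words(sentence, vocabulary):
    tokens = sentence.split()
    return [float(tokens.count(term)) for term in vocabulary]

vocab = ["Burns", "loves", "Monty", "secretly", "Smithers"]
s1 = "Smithers secretly loves Monty Burns"
s2 = "Monty Burns secretly loves Smithers"

print(bag_of_words(s1, vocab))  # [1.0, 1.0, 1.0, 1.0, 1.0]
print(bag_of_words(s2, vocab))  # identical vector
```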
                  Document Vectors
               One location for each word
Document ids

    nova       galaxy heat   h’wood   film   role   diet   fur
  A   10          5     3
  B   5           10
  C                            10      8      7
  D                            9       10     5
  E                                                  10     10
  F                                                  9      10
  G 5            7                     9
  H              6      10    2        8
  I                           7        5               1    3
We Can Plot the Vectors

[Figure: 2D plot with axes “Star” and “Diet”. A doc about astronomy and a doc about movie stars lie high on the Star axis; a doc about mammal behavior lies high on the Diet axis.]
Documents in 3D Vector Space

[Figure: documents D1–D11 plotted in a 3D term space with axes t1, t2, t3. Illustration from Jurafsky & Martin.]
Vector Space Model

   docs   Homer   Marge   Bart
   D1       *               *
   D2       *
   D3               *       *
   D4       *
   D5       *       *       *
   D6       *       *
   D7               *
   D8               *
   D9                       *
   D10              *       *
   D11      *               *
   Q                *

Note that the query is projected into the same vector space as the documents. The query here is for “Marge”.

We can use a vector similarity model to determine the best match to our query (details in a few slides).

But what weights should we use for the terms?
 Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• tf x idf
  – Recall the Zipf distribution
  – Want to weight terms highly if they are
     • frequent in relevant documents … BUT
     • infrequent in the collection as a whole
            Binary Weights
• Only the presence (1) or absence (0) of a
  term is included in the vector
                docs   t1   t2   t3
                 D1     1    0    1
                 D2     1    0    0
                 D3     0    1    1
                 D4     1    0    0
                 D5     1    1    1   We have already
                 D6     1    1    0   seen and discussed
                 D7     0    1    0
                 D8     0    1    0   this model.
                 D9     0    0    1
                D10     0    1    1
                D11     1    0    1
Raw Term Weights
• The frequency of occurrence for the term in each document is included in the vector

   docs   t1   t2   t3
   D1      2    0    3
   D2      1    0    0
   D3      0    4    7
   D4      3    0    0
   D5      1    6    3
   D6      3    5    0
   D7      0    8    0
   D8      0   10    0
   D9      0    0    1
   D10     0    3    5
   D11     4    0    1

This model is open to exploitation by websites…
sex sex sex sex sex sex sex sex sex sex …

Counts can be normalized by document lengths.
            tf * idf Weights
• tf * idf measure:
  – term frequency (tf)
  – inverse document frequency (idf) -- a way to
    deal with the problems of the Zipf distribution
• Goal: assign a tf * idf weight to each term
  in each document
tf * idf

    w_ik = tf_ik × log(N / n_k)

T_k = term k in document D_i
tf_ik = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in collection C
N = total number of documents in the collection C
n_k = the number of documents in C that contain T_k

    idf_k = log(N / n_k)
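A minimal sketch of these formulas (base-10 logs, matching the IDF examples on the next slide):

```python
# A minimal sketch of the tf*idf weight w_ik = tf_ik * log(N / n_k),
# using base-10 logs to match the IDF examples that follow.
import math

def tf_idf(tf_ik, N, n_k):
    """Weight of term k in document i.

    tf_ik: frequency of the term in the document
    N:     total number of documents in the collection
    n_k:   number of documents containing the term
    """
    return tf_ik * math.log10(N / n_k)

print(tf_idf(3, 10000, 20))    # rare term, high weight: ~8.1
print(tf_idf(3, 10000, 5000))  # common term, low weight: ~0.9
```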
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words

For a collection of 10000 documents, idf_k = log(N / n_k):

   log(10000 / 10000) = 0
   log(10000 / 5000)  = 0.301
   log(10000 / 20)    = 2.698
   log(10000 / 1)     = 4
Similarity Measures

   |Q ∩ D|                               Simple matching (coordination level match)

   2 |Q ∩ D| / (|Q| + |D|)               Dice's Coefficient

   |Q ∩ D| / |Q ∪ D|                     Jaccard's Coefficient

   |Q ∩ D| / (|Q|^(1/2) · |D|^(1/2))     Cosine Coefficient

   |Q ∩ D| / min(|Q|, |D|)               Overlap Coefficient
Cosine

[Figure: 2D plot of the two document vectors and the query vector.]

   D1 = (0.8, 0.3)
   D2 = (0.2, 0.7)
   Q  = (0.4, 0.8)

   cos θ1 = 0.74   (angle between Q and D1)
   cos θ2 = 0.98   (angle between Q and D2)
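A minimal sketch that reproduces the two cosines above:

```python
# A minimal sketch of cosine similarity, reproducing the numbers in
# the figure: cos(Q, D1) ~ 0.74, cos(Q, D2) ~ 0.98.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

D1, D2, Q = (0.8, 0.3), (0.2, 0.7), (0.4, 0.8)
print(round(cosine(Q, D1), 2))  # 0.73 (the slide reports 0.74)
print(round(cosine(Q, D2), 2))  # 0.98
```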
    Problems with Vector Space
• There is no real theoretical basis for the
  assumption of a term space
  – it is more for visualization than having any real basis
  – most similarity measures work about the same
    regardless of model
• Terms are not really orthogonal dimensions
  – Terms are not independent of all other terms
        Probabilistic Models
• Rigorous formal model attempts to predict
  the probability that a given document will
  be relevant to a given query
• Ranks retrieved documents according to this
  probability of relevance (Probability
  Ranking Principle)
• Rely on accurate estimates of probabilities
Relevance Feedback
• Main Idea:
  – Modify existing query based on relevance judgments
     • Query Expansion: Extract terms from relevant documents and add them to the query
     • Term Re-weighting: and/or re-weight the terms already in the query
  – Two main approaches:
     • Automatic (pseudo-relevance feedback)
     • Users select relevant documents
  – Users/system select terms from an automatically-generated list
Definition: Relevance Feedback is the reformulation of a search query in response
to feedback provided by the user for the results of previous versions of the query.

Suppose you are interested in bovine agriculture on
the banks of the river Jordan…

   Term Vector    [ Jordan, Bank, Bull, River ]
   Term Weights   [   1,     1,    1,    1   ]

[Figure: the feedback loop — Search → Display Results → Gather Feedback → Update Weights → back to Search.]

After feedback, the weights are updated:
   Term Vector    [ Jordan, Bank, Bull, River ]
   Term Weights   [  1.1,   0.1,  1.3,  1.2  ]
Rocchio Method

    Q1 = Q0 + (α / n1) Σ_{i=1..n1} Ri − (β / n2) Σ_{i=1..n2} Si

where
Q0 = the vector for the initial query
Ri = the vector for relevant document i
Si = the vector for non-relevant document i
n1 = the number of relevant documents chosen
n2 = the number of non-relevant documents chosen
α and β tune the importance of relevant and non-relevant terms
(in some studies it is best to set α to 0.75 and β to 0.25)
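A minimal sketch of the update with the α = 0.75, β = 0.25 values the slide mentions (the vectors and vocabulary are illustrative):

```python
# A minimal sketch of the Rocchio update with alpha=0.75, beta=0.25
# (the values the slide mentions). Vectors are plain Python lists.
def rocchio(q0, relevant, nonrelevant, alpha=0.75, beta=0.25):
    n1, n2 = len(relevant), len(nonrelevant)
    centroid_rel = [sum(v[j] for v in relevant) / n1 for j in range(len(q0))]
    centroid_non = [sum(v[j] for v in nonrelevant) / n2 for j in range(len(q0))]
    return [q + alpha * r - beta * s
            for q, r, s in zip(q0, centroid_rel, centroid_non)]

# Toy example over a 3-term vocabulary (illustrative numbers):
q0 = [1.0, 1.0, 0.0]
relevant = [[1.0, 0.0, 1.0], [0.8, 0.2, 1.0]]
nonrelevant = [[0.0, 1.0, 0.0]]
print(rocchio(q0, relevant, nonrelevant))  # [1.675, 0.825, 0.75]
```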
Rocchio Illustration
Although we usually work in vector space for text, it is easier to visualize Euclidean space

[Figure: three panels — Original Query, Term Re-weighting, Query Expansion. Note that both the location of the center and the shape of the query have changed.]
               Rocchio Method
• Rocchio automatically
   – re-weights terms
   – adds in new terms (from relevant docs)
      • have to be careful when using negative terms
      • Rocchio is not a machine learning algorithm
• Most methods perform similarly
   – results heavily dependent on test collection
• Machine learning methods are proving to work
  better than standard IR approaches like Rocchio
    Using Relevance Feedback
• Known to improve results
• People don’t seem to like giving feedback!
   Relevance Feedback for Time Series
The original query




The weight vector.
Initially, all weights
are the same.




                        Note: In this example we are using a piecewise linear
                        approximation of the data. We will learn more about this
                        representation later.
The initial query is
executed, and the five
best matches are
shown (in the
dendrogram)




One by one the 5 best
matching sequences
will appear, and the
user will rank them
from very bad (-3)
to very good (+3)
Based on the user
feedback, both the
shape and the weight
vector of the query are
changed.




The new query can be executed. The hope is that the query shape and weights will converge to the optimal query.

Two papers consider relevance feedback for time series.
Query Expansion:
L. Wu, C. Faloutsos, K. Sycara, T. Payne. FALCON: Feedback Adaptive Loop for Content-Based Retrieval. VLDB 2000: 297-306.
Term Re-weighting:
Keogh, E. & Pazzani, M. Relevance feedback retrieval of time series data. In Proceedings of SIGIR 99.
     Document Space has High
         Dimensionality
• What happens beyond 2 or 3 dimensions?
• Similarity still has to do with how many
  tokens are shared in common.
• More terms -> harder to understand which
  subsets of words are shared among similar
  documents.
• One approach to handling high
  dimensionality: clustering
            Text Clustering
• Finds overall similarities among groups of
  documents.
• Finds overall similarities among groups of
  tokens.
• Picks out some themes, ignores others.
                  Scatter/Gather
                  Hearst & Pedersen 95
• Cluster sets of documents into general “themes”, like
  a table of contents (using K-means)
• Display the contents of the clusters by showing topical
  terms and typical titles
• User chooses subsets of the clusters and re-clusters the
  documents within
• Resulting new groups have different “themes”
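A minimal sketch of the clustering step (scikit-learn's TfidfVectorizer and KMeans stand in for the original system; the documents are placeholders):

```python
# A minimal sketch of K-means document clustering, as in
# Scatter/Gather. scikit-learn stands in for the original system;
# the documents are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "star of the film won an award",
    "the tv star appeared in a new film",
    "astronomers observed a distant star and galaxy",
    "the galaxy contains billions of stars",
]

X = TfidfVectorizer().fit_transform(docs)       # docs -> tf-idf vectors
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: film/tv docs vs. astronomy docs
```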
S/G Example: query on “star”

Encyclopedia text. Initial clusters:
   8  symbols
   68 film, tv (p)
   97 astrophysics
   67 astronomy (p)
   10 flora/fauna

After choosing clusters and re-clustering:
   14 sports
   47 film, tv
   7  music
   12 stellar phenomena
   49 galaxies, stars
   29 constellations
   7  miscellaneous

Clustering and re-clustering is entirely automated
Ego Surfing!   http://vivisimo.com/
[Figure: the IR pipeline again, highlighting the Evaluate step.]

The section that follows is about

Evaluation
              Evaluation
• Why Evaluate?
• What to Evaluate?
• How to Evaluate?
            Why Evaluate?
• Determine if the system is desirable
• Make comparative assessments
• Others?
          What to Evaluate?
• How much of the information need is
  satisfied.
• How much was learned about a topic.
• Incidental learning:
  – How much was learned about the collection.
  – How much was learned about other topics.
• How inviting the system is.
What to Evaluate?
What can be measured that reflects users' ability to use the system? (Cleverdon 66)
– Coverage of Information
– Form of Presentation
– Effort required/Ease of Use
– Time and Space Efficiency
– Recall
   • proportion of relevant material actually retrieved
– Precision
   • proportion of retrieved material actually relevant

(Recall and precision together measure effectiveness.)
Relevant vs. Retrieved

[Venn diagram: within the set of all docs, the Retrieved set and the Relevant set partially overlap.]
Precision vs. Recall

   Precision = |Relevant ∩ Retrieved| / |Retrieved|

   Recall = |Relevant ∩ Retrieved| / |Relevant in Collection|

[Venn diagram as before: the Retrieved and Relevant sets within all docs.]
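A minimal sketch of both measures as set operations (the doc IDs are illustrative):

```python
# A minimal sketch of precision and recall as set operations.
def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)

retrieved = {1, 2, 3, 4}
relevant = {3, 4, 5, 6, 7, 8}
print(precision(retrieved, relevant))  # 2/4 = 0.5
print(recall(retrieved, relevant))     # 2/6 ~ 0.33
```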
       Why Precision and Recall?
Intuition:

Get as much good stuff while at the same time getting
  as little junk as possible.
Retrieved vs. Relevant Documents
Very high precision, very low recall
[Venn diagram: a tiny retrieved set entirely inside the relevant set.]

Retrieved vs. Relevant Documents
Very low precision, very low recall (0 in fact)
[Venn diagram: the retrieved set does not intersect the relevant set at all.]

Retrieved vs. Relevant Documents
High recall, but low precision
[Venn diagram: a large retrieved set covering the relevant set plus many non-relevant docs.]

Retrieved vs. Relevant Documents
High precision, high recall (at last!)
[Venn diagram: the retrieved set nearly coincides with the relevant set.]
Precision/Recall Curves
• There is a tradeoff between Precision and Recall
• So measure Precision at different levels of Recall
• Note: this is an AVERAGE over MANY queries

[Figure: precision (y-axis) falls as recall (x-axis) rises.]
Precision/Recall Curves
• Difficult to determine which of these two hypothetical results is better:

[Figure: two hypothetical precision/recall curves, neither of which dominates the other.]
Precision/Recall Curves

Recall under various retrieval assumptions

[Figure: recall (y-axis, 0.0–1.0) vs. proportion of documents retrieved (x-axis, 0.0–1.0), for a collection of 1000 documents with 100 relevant. Curves shown: Perfect, Tangent Parabolic Recall, Parabolic Recall, random, and Perverse.]
Precision under various assumptions

[Figure: precision (y-axis, 0.0–1.0) vs. proportion of documents retrieved (x-axis, 0.0–1.0), for a collection of 1000 documents with 100 relevant. Curves shown: Perfect, Tangent Parabolic Recall, Parabolic Recall, random, and Perverse.]
            Document Cutoff Levels
• Another way to evaluate:
   – Fix the number of documents retrieved at several levels:
       •   top 5
       •   top 10
       •   top 20
       •   top 50
       •   top 100
       •   top 500
   – Measure precision at each of these levels
   – Take (weighted) average over results
• This is a way to focus on how well the system ranks the
  first k documents.
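A minimal sketch of precision at fixed cutoffs (the ranking and relevance judgments are illustrative):

```python
# A minimal sketch of precision at fixed document cutoff levels.
def precision_at_k(ranked_doc_ids, relevant, k):
    top_k = ranked_doc_ids[:k]
    return sum(1 for d in top_k if d in relevant) / k

ranking = [7, 2, 9, 4, 1, 5, 8, 3, 6, 10]  # system's ranked output
relevant = {2, 4, 5, 6}
for k in (5, 10):
    print(f"P@{k} = {precision_at_k(ranking, relevant, k)}")
# P@5 = 0.4 (docs 2 and 4 in top 5); P@10 = 0.4 (docs 2, 4, 5, 6 in top 10)
```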
 Problems with Precision/Recall
• Can’t know true recall value
   – except in small collections
• Precision/Recall are related
   – A combined measure sometimes more appropriate
• Assumes batch mode
   – Interactive IR is important and has different criteria for
     successful searches
   – Assumes a strict rank ordering matters.
Relation to Contingency Table

                          Doc is Relevant   Doc is NOT relevant
   Doc is retrieved             a                   b
   Doc is NOT retrieved         c                   d

(Equivalently, the cells can be written as counts: N_ret∧rel, N_ret∧¬rel, N_¬ret∧rel, N_¬ret∧¬rel.)

• Accuracy: (a+d) / (a+b+c+d)
• Precision: a / (a+b)
• Recall: a / (a+c)
• Why don't we use Accuracy for IR?
   – (Assuming a large collection)
   – Most docs aren't relevant
   – Most docs aren't retrieved
   – Inflates the accuracy value
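A minimal worked example of the inflation (the counts are illustrative):

```python
# A minimal worked example: on a large collection, a system that
# retrieves nothing still scores near-perfect accuracy. Counts are
# illustrative.
a, b = 0, 0          # retrieved: none
c, d = 100, 999_900  # 100 relevant docs, ~1M non-relevant

accuracy = (a + d) / (a + b + c + d)
print(accuracy)  # 0.9999 -- yet the system found nothing (recall = 0)
```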
The E-Measure
Combine Precision and Recall into one number (van Rijsbergen 79)

    E = 1 − (1 + b²) / (b²/R + 1/P)

P = precision
R = recall
b = measure of relative importance of P or R

   For example,
   b = 0.5 means user is twice as interested in
           precision as recall
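A minimal sketch of the formula (the P and R values are illustrative; lower E is better):

```python
# A minimal sketch of van Rijsbergen's E-measure (lower is better).
def e_measure(precision, recall, b=1.0):
    return 1 - (1 + b**2) / (b**2 / recall + 1 / precision)

# Here precision (0.8) is strong and recall (0.4) is weak.
print(e_measure(0.8, 0.4, b=0.5))  # ~0.33: emphasizing precision
print(e_measure(0.8, 0.4, b=2.0))  # ~0.56: emphasizing (weak) recall
```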
How to Evaluate?
Test Collections
             Test Collections
• Cranfield 2
   – 1400 Documents, 221 Queries
   – 200 Documents, 42 Queries
• INSPEC – 542 Documents, 97 Queries
• UKCIS – > 10000 Documents, multiple sets, 193 Queries
• ADI – 82 Documents, 35 Queries
• CACM – 3204 Documents, 50 Queries
• CISI – 1460 Documents, 35 Queries
• MEDLARS (Salton) – 273 Documents, 18 Queries
                         TREC
• Text REtrieval Conference/Competition
  – Run by NIST (National Institute of Standards & Technology)
  – 2002 (November) will be 11th year
• Collection: >6 Gigabytes (5 CD-ROMs), >1.5 Million Docs
  – Newswire & full text news (AP, WSJ, Ziff, FT)
  – Government documents (federal register, Congressional
    Record)
  – Radio Transcripts (FBIS)
  – Web “subsets”
                TREC (cont.)
• Queries + Relevance Judgments
  – Queries devised and judged by “Information Specialists”
  – Relevance judgments done only for those documents
    retrieved -- not entire collection!
• Competition
  – Various research and commercial groups compete (TREC
    6 had 51, TREC 7 had 56, TREC 8 had 66)
  – Results judged on precision and recall, going up to a
    recall level of 1000 documents
                        TREC
• Benefits:
   – made research systems scale to large collections (pre-
     WWW)
   – allows for somewhat controlled comparisons
• Drawbacks:
   – emphasis on high recall, which may be unrealistic for
     what most users want
   – very long queries, also unrealistic
   – comparisons still difficult to make, because systems are
     quite different on many dimensions
   – focus on batch ranking rather than interaction
   – no focus on the WWW
           TREC is changing
• Emphasis on specialized “tracks”
   – Interactive track
   – Natural Language Processing (NLP) track
   – Multilingual tracks (Chinese, Spanish)
   – Filtering track
   – High-Precision
   – High-Performance
• http://trec.nist.gov/
          What to Evaluate?
• Effectiveness
  – Difficult to measure
  – Recall and Precision are one way
  – What might be others?

				