Lecture 2, by Pengxiang


                                    Indexing and Document Analysis

             CSC 575 / IS 575
      Intelligent Information Retrieval
 •      Indexing is the process of transforming items (documents)
        into a searchable data structure
          –     creation of document surrogates to represent each document
          –     requires analysis of original documents
                   •    simple: identify meta-information (e.g., author, title, etc.)
                   •    complex: linguistic analysis of content

 •      The search process involves correlating user queries with
        the documents represented in the index

 •      Choices for accessing data during query evaluation
          –     Scan the entire collection
                   •    Typical in early (batch) retrieval systems
                   •    Computational and I/O costs are O(characters in collection)
                   •    Practical for only “small” collections
          –     Use indexes for direct access
                   •    Evaluation time is O(query term occurrences in collection)
                   •    Practical for “large” collections
                   •    Many opportunities for optimization
          –     Hybrids: use small index, then scan subset of the collection

             What should the index contain?
 •      Database systems index primary and secondary keys
          –     This is the hybrid approach
          –     Index provides fast access to a subset of database records
          –     Scan subset to find solution set

 •      IR Problem:
          –     Can’t predict the keys that people will use in queries
          –     Every word in a document is a potential search term

 •      IR Solution: Index by all keys (words)

                                    Accessing “Features”
 •      The index is accessed by the atoms of a query language
 •      The atoms are called “features” or “keys” or “terms”
 •      Most common feature types:
          –     Words in text, punctuation
          –     Manually assigned terms (controlled & uncontrolled vocabulary)
          –     Document structure (sentence & paragraph boundaries)
          –     Inter- or intra-document links (e.g., citations)

 •      Composed features
          –     Feature sequences (phrases, names, dates, monetary amounts)
          –     Feature sets (e.g., synonym classes)

                                    Indexing Languages
 •      An index is constructed on the basis of an indexing
        language or vocabulary
          –     The vocabulary may be controlled or uncontrolled
                   •    Controlled: limited to a predefined set of index terms
                   •    Uncontrolled: allows the use of any terms fitting some broad criteria
 •      Indexing may be done manually or automatically
          –     Manual or human indexing:
                   •    Indexers decide which keywords to assign to a document based on a
                        controlled vocabulary (e.g., MEDLINE, Yahoo)
                   •    Significant cost
          –     Automatic indexing:
                   •    Indexing program decides which words, phrases or other features to use
                        from text of document
                   •    Indexing speeds are approximately 500MB/hour for a single processor

                  Types of Indexing Languages
   •     Uncontrolled Keyword Indexing
   •     Controlled Keyword Indexing
            –    Controlled, but not structured
   •     Thesauri
            –    Controlled and structured
   •     Classification Systems
            –    Controlled, structured, and coded
            –    Often uses a predefined hierarchy
            –    Faceted Classification Systems: group terms into facets that
                 describe basic structure of a domain
                     •   Less rigid than predefined hierarchy
   •     Examples of indexing languages:
            –    Library of Congress Subject Headings
            –    Yellow Pages Topics
                                     Thesauri
•    A Thesaurus is a collection of selected vocabulary with links
     among Synonymous, Equivalent, Broader, Narrower and
     other Related Terms
       –     Used in a variety of ways during searching and indexing
       –     e.g. MeSH, LoC Subject Headings
•    General thesauri such as Roget or WordNet have not been
     shown to be useful in most retrieval situations
       –     not domain-specific
       –     sense ambiguity
•    Automatic thesauri based on corpus analysis can be used to
     expand queries
       –     e.g. adding synonyms to queries, etc.
        –     simple technique looks at frequent words and phrases in top retrieved documents
                               Classification Systems
•    A classification system is often based on a broad ordering of
     topical areas.
       –     Thesauri and classification systems both use this broad ordering and
             maintain a structure of broader, narrower, and related topics.
        –     Classification schemes commonly use a coded notation for
              representing a topic and its place in relation to other terms.
•    Examples:
       –     The Library of Congress Classification System
       –     The Dewey Decimal Classification System
       –     The ACM Computing Reviews Categories
       –     The American Mathematical Society Classification System

             Criteria for Indexing Languages
   •     Exhaustivity
            –    number of different topics indexed; completeness
   •     Specificity
            –    level of accuracy of indexing; how detailed
                 (e.g., “dogs” vs. “dog breeds”)
   Note: the trade-off between exhaustivity and specificity is similar
   to that between precision and recall.
   •     Pre-coordinate indexing
            –    combinations of related index terms (e.g. phrases) used
                 as an indexing label
            –    term linkages are created at index creation time
   •     Post-coordinate indexing
            –    combinations generated at search time
            –    usually implemented by ANDing index terms together

    Automatic Indexing & Classification
•    Automatic indexing is typically based on simply deriving keywords
     from documents.
•    More complex Automatic Indexing Systems attempt to select
     controlled vocabulary terms based on terms in the document.
•    Automatic classification attempts to automatically group similar
     documents using either:
       –     A fully automatic clustering method.
       –     An established classification scheme and set of documents already
             indexed by that scheme.
•    Manual vs. Automatic Indexing
       –     Experimental evidence shows that retrieval effectiveness using automatic
             indexing can be at least as good as manual indexing with controlled vocabularies
       –     Experiments have also shown that using both manual and automatic
             indexing improves performance

                        Basic Automatic Indexing
   •     Parse documents to recognize structure
            –    e.g. title, date, other fields
   • Scan for word tokens
            –    lexical analysis using finite state automata
            –    numbers, special characters, hyphenation, capitalization, etc.
             –    languages like Chinese need segmentation since there is no
                  explicit word separation
            –    record positional information for proximity operators
   •     Stopword removal
            –    based on short list of common words such as “the”, “and”, “or”
            –    saves storage overhead of very long indexes
            –    can be dangerous (e.g. “Mr. The”, “and-or gates”)
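The tokenization and stopword steps above can be sketched as follows. This is a minimal illustration: the stoplist and the `tokenize` name are invented for this example, and real systems use much longer stoplists.

```python
import re

# Illustrative stoplist; real systems use lists of a few hundred common words.
STOPWORDS = {"the", "and", "or", "a", "an", "of", "in", "is", "to", "for"}

def tokenize(text):
    """Lowercase, split into word tokens, and drop stopwords while
    keeping each surviving token's original position, so proximity
    operators can still be supported later."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Positions are NOT renumbered after stopword removal, so word
    # distances between surviving tokens stay meaningful.
    return [(i, t) for i, t in enumerate(tokens) if t not in STOPWORDS]

tokenize("The cat and the dog in the house")
# -> [(1, 'cat'), (4, 'dog'), (7, 'house')]
```

Note that an aggressive stoplist applied this way would indeed lose “Mr. The” and “and-or gates”, as the slide warns.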

                        Basic Automatic Indexing
 •      Stem words
          –     morphological processing to group word variants such as plurals
          –     better than string matching (e.g. comput*)
          –     can make mistakes but generally preferred
 •      Weight words
          –     using frequency in documents and database
          –     frequency data is independent of retrieval model
 •      Optional
          –     phrase indexing
          –     thesaurus classes / concept indexing

                                    Phrase Indexing
 •      Both statistical and syntactic methods have been used to
        identify “good” phrases
 •      Proven techniques include finding all word pairs that occur
        more than n times in the corpus or using a part-of-speech
        tagger to identify simple noun phrases
          –     1,100,000 phrases extracted from all TREC data (more than 1,000,000
                WSJ, AP, SJMS, FT, Ziff, CNN documents)
 •      Phrases can have an impact on effectiveness and efficiency
          –     phrase indexing will speed up phrase queries
          –     finding documents containing “Black Sea” better than finding
                documents containing both words
          –     effectiveness not straightforward and depends on retrieval model
                   •    e.g. for “information retrieval”, how much do individual words count?

                                    Concept Indexing
 • Simple indexing is based on words or word stems
 • More complex indexing could include phrases or thesaurus classes
 • Index term is the general name for a word, phrase, or other feature
        used for indexing
 •      Concept-based retrieval is often used to imply something beyond word matching
 •      In virtually all systems, a concept is a name given to a set of
        recognition criteria or rules
          –     similar to a thesaurus class
 • Words, phrases, synonyms, linguistic relations can all be evidence
        used to infer presence of the concept
          –     e.g. the concept “information retrieval” can be inferred based on the
                presence of the words “information”, or “retrieval”, or the phrase
                “information retrieval” or maybe the phrase “text retrieval”

                                Information Extraction
   •     Special recognizers for specific concepts
            –    people, organizations, places, dates
   •     “Meta” terms such as #COMPANY, #PERSON can be
         added to indexing
            –    e.g. a query could include a restriction like “…the document must
                 specify the location of the companies involved…”
   •     Could potentially customize indexing by adding more
            –    difficult to build
            –    problems with accuracy
            –    adds considerable overhead
   • Document Summarization
            –    extract a summary based on recognizers or characteristics of terms

     Stemming and Morphological Analysis
   •     Goal: “normalize” similar words
   •     Morphology (“form” of words)
            –    Inflectional Morphology
                      •   E.g., inflected verb endings
                     •   Never change grammatical class
                           – dog, dogs
            –    Derivational Morphology
                     •   Derive one word from another,
                     •   Often change grammatical class
                           – build, building; health, healthy
   • Porter’s stemmer uses a collection of rules
   • KSTEM uses lists of words plus rules for inflectional and
         derivational morphology

                Porter’s Stemming Algorithm
•     Based on a measure of vowel-consonant sequences
        –     measure m for a stem is [C](VC)^m[V], where C is a sequence of consonants,
              V is a sequence of vowels (including “y”), and [ ] indicates optional
        –     m=0 (tree, by), m=1 (trouble, oats, trees, ivy), m=2 (troubles, private)
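The measure m can be computed by mapping letters to consonant/vowel classes and collapsing runs. This sketch follows the slide in always treating “y” as a vowel (Porter's full algorithm treats “y” as a vowel only when it follows a consonant, which gives the same results on these examples).

```python
import re

def measure(stem):
    """Porter's m: the number of VC sequences in [C](VC)^m[V].
    Simplified per the slide: 'y' is always treated as a vowel."""
    pattern = "".join("v" if ch in "aeiouy" else "c" for ch in stem.lower())
    # Collapse runs of the same class: "ccvvc" -> "cvc"
    collapsed = re.sub(r"(.)\1+", r"\1", pattern)
    # Each "vc" transition is one (VC) group.
    return collapsed.count("vc")
```

For example, `measure("tree")` is 0, `measure("trouble")` is 1, and `measure("troubles")` is 2, matching the slide.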

• Some Notation:
       –     *<X>                   -->   stem ends with letter X
       –     *v*                    -->   stem contains a vowel
       –     *d                     -->   stem ends in double consonant
       –     *o                     -->   stem ends with a cvc sequence where the final
                                          consonant is not w, x, y
• Algorithm is based on a set of condition-action rules
       –     old suffix --> new suffix
       –     rules are divided into steps and are examined in sequence

• Good average recall and precision
                Porter’s Stemming Algorithm
                      • A selection of rules from Porter’s algorithm:

                  Step  Condition        Old suffix  New suffix      Example
                  1a    NULL             sses        ss              stresses -> stress
                        NULL             ies         i               ponies -> poni
                        NULL             ss          ss              caress -> caress
                        NULL             s           NULL            cats -> cat
                  1b    *v*              ing         NULL            making -> mak
                        ...              ...         ...             ...
                  1b1   NULL             at          ate             inflat(ed) -> inflate
                        ...              ...         ...             ...
                  1c    *v*              y           i               happy -> happi
                  2     m>0              aliti       al              formaliti -> formal
                        m>0              izer        ize             digitizer -> digitize
                        ...              ...         ...             ...
                  3     m>0              icate       ic              duplicate -> duplic
                        ...              ...         ...             ...
                  4     m>1              able        NULL            adjustable -> adjust
                        m>1              ic          NULL            microscopic -> microscop
                        ...              ...         ...             ...
                  5a    m>1              e           NULL            inflate -> inflat
                        ...              ...         ...             ...
                  5b    m>1, *d, *<L>    NULL        single letter   controll -> control, roll -> roll

                Porter’s Stemming Algorithm
   •     The algorithm:
            1. apply step 1a to word
            2. apply step 1b to stem
            3. If (2nd or 3rd rule of step 1b was used)
                      apply step 1b1 to stem
            4. apply step 1c to stem
            5. apply step 2 to stem
            6. apply step 3 to stem
            7. apply step 4 to stem
            8. apply step 5a to stem
            9. apply step 5b to stem
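Step 1a from the rule table can be sketched directly as an ordered list of suffix rules, applying the first rule whose suffix matches. This is an illustrative fragment, not Porter's full implementation.

```python
def step_1a(word):
    """Porter step 1a: try the suffix rules in table order and apply
    the first one whose old suffix matches the end of the word."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for old, new in rules:
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word  # no rule applies; word is unchanged

step_1a("stresses")  # -> 'stress'
step_1a("ponies")    # -> 'poni'
```

Note how the `("ss", "ss")` rule exists only to stop `"caress"` from falling through to the bare `("s", "")` rule.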

                                    Stemming Example
 •      Original text:
                marketing strategies carried out by U.S. companies for their
                agricultural chemicals, report predictions for market share of such
                chemicals, or report market statistics for agrochemicals, pesticide,
                herbicide, fungicide, insecticide, fertilizer, predicted sales, market
                share, stimulate demand, price cut, volume of sales

 • Porter stemmer results:
                market strateg carr compan agricultur chemic report predict market
                share chemic report market statist agrochem pesticid herbicid
                fungicid insecticid fertil predict sale stimul demand price cut
                volum sale

                             Problems with Stemming
•      Lack of domain-specificity and context can lead to occasional serious
       retrieval failures
•      Stemmers are often difficult to understand and modify
•      Sometimes too aggressive in conflation
          –    e.g. “policy”/“police”, “university”/“universe”, “organization”/“organ”
               are conflated by Porter
•      Miss good conflations
          –    e.g. “European”/“Europe”, “matrices”/“matrix”, “machine”/“machinery”
               are not conflated by Porter
•      Produce stems that are not words or are difficult for a user to interpret
          –    e.g. “iteration” produces “iter” and “general” produces “gener”
•      Corpus analysis can be used to improve a stemmer or replace it

                                    Content Analysis
 •      Automated indexing relies on some form of content
        analysis to identify index terms
 •      Content analysis: automated transformation of raw text
        into a form that represents some aspect(s) of its meaning
 •      Including, but not limited to:
          –     Automated Thesaurus Generation
          –     Phrase Detection
          –     Categorization
          –     Clustering
          –     Summarization

          Techniques for Content Analysis
   •     Statistical
            –    Single Document
            –    Full Collection
            (these generally rely on statistical properties of text, such as
            term frequency and document frequency)
   •     Linguistic
            –    Syntactic
                     •   analyzing the syntactic structure of documents
            –    Semantic
                     •   identifying the semantic meaning of concepts within documents
            –    Pragmatic
                     •   using information about how the language is used (e.g., co-occurrence
                         patterns among words and word classes)
   •     Knowledge-Based (Artificial Intelligence)
   •     Hybrid (Combinations)

                                     Text Processing
   •     Standard Steps:
            –    Recognize document structure
                     •   titles, sections, paragraphs, etc.
            –    Break into tokens
                      •   usually delimited by spaces and punctuation
                     •   special issues with Asian languages
            –    Apply stoplists and stop algorithms
            –    Stemming/morphological analysis
            –    Store in inverted index (to be discussed later)
   •     To determine the index terms in the last step, term
         frequencies are taken into account

                       Statistical Properties of Text
• Token occurrences in text are not uniformly distributed
• They are also not normally distributed
• They do exhibit a Zipf distribution

•      What Kinds of Data Exhibit a
       Zipf Distribution?
          –    Words in a text collection
          –    Library book checkout patterns
          –    Incoming Web page requests (Nielsen)
           –    Outgoing Web page requests (Cunha & Crovella)
          –    Document Size on Web (Cunha & Crovella)
          –    Length of Web page references (Cooley, Mobasher, Srivastava)
          –    Item popularity in E-Commerce
                                     Zipf Distribution
 •      The product of the frequency of words (f) and their rank (r)
        is approximately constant
         – Rank = order of words in terms of decreasing frequency of occurrence

                                    f ≈ C · (1/r)
                                    C ≈ N / 10
                  where N is the total number of term occurrences

 •      Main Characteristics
          –     a few elements occur very frequently
          –     many elements occur very infrequently
          –     frequency of words in the text falls very rapidly
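Under these approximations, the expected frequency of the rank-r word follows directly. A small sketch (the function name is illustrative):

```python
def zipf_frequency(rank, total_occurrences):
    """Estimated frequency of the rank-r word under Zipf's law:
    f ~= C / r, with C ~= N / 10 (N = total term occurrences)."""
    c = total_occurrences / 10
    return c / rank

zipf_frequency(1, 1_000_000)   # -> 100000.0
zipf_frequency(10, 1_000_000)  # -> 10000.0
```

This captures both characteristics above: the top-ranked word alone accounts for roughly a tenth of all occurrences, and frequency falls off rapidly with rank.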

                   Example of Frequent Words
                                    Frequent    Number of    Percentage
                                     Word      Occurrences    of Total
                                       the      7,398,934       5.9
                                        of      3,893,790       3.1
                                        to      3,364,653       2.7
                                      and       3,320,687       2.6
                                        in      2,311,785       1.8
                                        is      1,559,147       1.2
                                       for      1,313,561        1
                                      The       1,144,860       0.9
                                      that      1,066,503       0.8
                                      said      1,027,713       0.8

          Frequencies from 336,310 documents in the 1 GB TREC Volume 3 Corpus
              • 125,720,891 total word occurrences
              • 508,209 unique words

             Housing Listing Frequency Data
           6208 tokens, 1318 unique (very small collection)

             Bin      Frequency
              1          295
              6.72       216
             12.44        28
             18.16         7
             23.88        29
             29.6          7
             35.32        10
             41.04         7
             46.76        14
             52.48         2
             58.2         26
             63.92         9
             69.64         1
             75.36         1
             81.08         0
             86.8           2
             92.52         0
             98.24         0

           (A histogram of these bin frequencies appeared here in the original slide.)

                     A More Standard Collection
 Government documents, 157,734 tokens, 32,259 unique
    8164 the                        969 on         1 ABC
    4771 of                         915 FT         1 ABFT
    4005 to                         883 Mr         1 ABOUT
    2834 a                          860 was        1 ACFT
    2827 and                        855 be         1 ACI
    2802 in                         849 Pounds     1 ACQUI
    1592 The                        798 TEXT       1 ACQUISITIONS
    1370 for                        798 PUB        1 ACSIS
    1326 is                         798 PROFILE    1 ADFT
    1324 s                          798 PAGE
                                                   1 ADVISERS
    1194 that                       798 HEADLINE   1 AE
     973 by                         798 DOCNO
                          Zipf’s Law and Indexing
 •      The most frequent words are poor index terms
          –     they occur in almost every document
          –     they usually have no relationship to the concepts and ideas
                represented in the document
 •      Extremely infrequent words are poor index terms
          –     may be significant in representing the document
          –     but, very few documents will be retrieved when indexed by terms
                with the frequency of one or two
 •      Index terms in between
          –     a high and a low frequency threshold are set
          –     only terms within the threshold limits are considered good
                candidates for index terms
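A minimal sketch of this thresholding idea, selecting candidate index terms by document frequency. The cut-off values here are illustrative and untuned; as the next slide notes, real cut-offs are found by trial and error per collection.

```python
from collections import Counter

def candidate_index_terms(docs, low=2, high_frac=0.5):
    """Keep terms whose document frequency falls between a low absolute
    threshold and a fraction of the collection size (both illustrative)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each doc at most once
    high = high_frac * len(docs)
    return sorted(t for t, n in df.items() if low <= n <= high)

candidate_index_terms(
    ["the cat sat", "the dog sat", "the cat ran", "the bird flew"]
)
# -> ['cat', 'sat']   ('the' is too frequent; the rest occur only once)
```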

                                    Resolving Power
   •      Zipf (and later H.P. Luhn) postulated that the resolving
          power of significant words reaches a peak at a rank-order
          position halfway between the two cut-offs
               –     Resolving Power: the ability of words to discriminate content

          (Figure: resolving power of significant words plotted against rank,
          peaking between an upper and a lower frequency cut-off. The actual
          cut-offs are determined by trial and error, and often depend on the
          specific collection.)
                 Collocation (Co-Occurrence)
•     Co-occurrence patterns of words and word classes reveal
      significant information about how a language is used
         –    pragmatics
•     Used in building dictionaries (lexicography) and for IR tasks
      such as phrase detection, query expansion, etc.
•     Co-occurrence based on text windows
         –    typical window may be 100 words
         –    smaller windows used for lexicography, e.g. adjacent pairs or 5 words
•     Typical measure is the expected mutual information measure
         –    compares probability of occurrence assuming independence to
              probability of co-occurrence.

                 Independence vs. Dependence
•     How likely is a red car to drive by given we’ve seen a
      black one?

•     How likely is word W to appear, given that we’ve seen
      word V?
•     Colors of cars driving by are independent (although more
      frequent colors are more likely)
•     Words in text are (in general) not independent (although
      again more frequent words are more likely)

                Probability of Co-Occurrence
 •      Compute for a window of words

               P(x) · P(y) = P(x, y)   if x and y are independent
               P(x) = f(x) / N

        We’ll approximate P(x, y) as follows:

               P(x, y) ≈ (1/N) · Σ_{i=1..N-|w|} w_i(x, y)

        |w| = length of window w (say 5)
        w_i = window of |w| words starting at position i
        w_i(x, y) = number of times x and y co-occur in window w_i
        N = number of words in the collection

        (The original slide illustrated overlapping windows w1, w11, w21
        sliding over a stream of text.)
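The windowed estimate can be sketched as follows. This is a simplification: each window contributes 1 when both words appear in it, rather than the full co-occurrence count w_i(x, y), and the function name is illustrative.

```python
def cooccurrence_prob(tokens, x, y, w=5):
    """Estimate P(x, y) by sliding a window of length w over the token
    stream, counting windows in which both x and y appear, and
    normalizing by N (the number of tokens)."""
    n = len(tokens)
    hits = sum(
        1
        for i in range(n - w + 1)
        if x in tokens[i : i + w] and y in tokens[i : i + w]
    )
    return hits / n

cooccurrence_prob(["a", "b", "c", "d", "e"], "a", "b", w=2)  # -> 0.2
```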

                                    Lexical Associations
 •      Subjects write first word that comes to mind
          –     doctor/nurse; black/white (Palermo & Jenkins 64)
 • Text Corpora yield similar associations
 • One measure: Mutual Information (Church and Hanks 89)
                        I(x, y) = log2 [ P(x, y) / (P(x) · P(y)) ]
 •      If word occurrences were independent, the numerator and
        denominator would be equal (if measured across a large enough sample)
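Given the estimated probabilities, the measure itself is a direct translation of the formula:

```python
import math

def mutual_information(p_xy, p_x, p_y):
    """I(x, y) = log2( P(x, y) / (P(x) * P(y)) ).
    Roughly zero when x and y are independent; positive when they
    co-occur more often than chance would predict."""
    return math.log2(p_xy / (p_x * p_y))

# Two words each with P = 0.01 that co-occur with P = 0.001:
# chance predicts 0.0001, so they co-occur 10x more than expected.
mutual_information(0.001, 0.01, 0.01)  # -> log2(10), about 3.32
```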

                    Interesting Associations with “Doctor”
         (AP Corpus, N=15 million, Church & Hanks 89)
 I(x,y)                   f(x,y)    f(x)   x          f(y)   y
 11.3                     12        111    Honorary   621    Doctor
 11.3                     8         1105   Doctors    44     Dentists
 10.7                     30        1105   Doctors    241    Nurses
 9.4                      8         1105   Doctors    154    Treating
 9.0                      6         275    Examined 621      Doctor
 8.9                      11        1105   Doctors    317    Treat
 8.7                      25        621    Doctor     1407   Bills

         Un-Interesting Associations with “Doctor”
      (AP Corpus, N=15 million, Church & Hanks 89)
 I(x,y)                   f(x,y)    f(x)     x        f(y)    y
 0.96                     6         621      doctor   73785   with
 0.95                     41        284690   a        1105    doctors
 0.93                     12        84716    is       1105    doctors

 These associations were likely to happen because the non-
 doctor words shown here are very common and therefore
 likely to co-occur with any noun.

                                    Indexing Models
    •     Basic issue: which terms should be used to index a document
   •     Sometimes seen as term weighting
   •     Some approaches
            –    binary weights
            –    simple term frequency
            –    TF.IDF (inverse document frequency model)
            –    probabilistic weighting
            –    term discrimination model
            –    signal-to-noise ratio (based on information theory)
            –    Bayesian models
            –    Language models

                           Indexing Implementation
• Common implementations of indexes
       –     Bitmaps
                •    For each term, allocate vector with 1 bit per document
                •    If feature present in document n, set nth bit to 1, otherwise 0
       –     Signature files (Also called superimposed coding)
                •    For each term, allocate fixed size s-bit vector (signature)
                 •    Define hash function: word --> 1..2^s
                •    Each term then has s-bit signature (may not be unique)
                •    OR the term signatures to form document signature
                •    Lookup signature for query term. If all corresponding 1-bits on in document
                     signature, document probably contains that term
       –     Inverted files
                •    Source file: collection, organized by document
                •    Inverted file: collection organized by term (one record per term, listing
                     locations where term occurs)
                •    Query: traverse lists for each query term
                        – OR: the union of component lists
                        – AND: an intersection of component lists
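The inverted-file organization and the AND/OR list operations can be sketched as follows. This is illustrative: postings here are just document ids, with no positional information or frequency data.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """One postings list per term: the ids of documents containing it
    (collection organized by term, not by document)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def query_and(index, terms):
    """AND: the intersection of the component postings lists."""
    sets = [index[t] for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

def query_or(index, terms):
    """OR: the union of the component postings lists."""
    return sorted(set().union(*(index[t] for t in terms)))

idx = build_inverted_file(["black sea port", "black cat", "sea breeze"])
query_and(idx, ["black", "sea"])  # -> [0]
query_or(idx, ["black", "sea"])   # -> [0, 1, 2]
```

Note that only the postings lists of the query terms are touched, which is why evaluation time is O(query term occurrences) rather than O(collection size).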
                                     Summary
 •      Indexing Languages
 •      Content Analysis: transforming raw text into more
        computationally useful forms
 •      Words in text collections exhibit interesting statistical properties
          –     Word frequencies have a Zipf distribution
          –     Word co-occurrences exhibit dependencies
 •      Text documents are transformed to vectors
           –     pre-processing includes tokenization, stemming, and stopword removal
           –     Documents occupy a multi-dimensional space
          –     tf x idf as a measure for weighting index terms
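The tf x idf weighting mentioned above can be sketched as follows. This is one common variant, with idf(t) = log(N / df(t)); exact definitions vary across systems, and the function name is illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight term t in document d by tf(t, d) * idf(t),
    where idf(t) = log(N / df(t)) and N is the number of documents."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency: one count per doc
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]

tfidf_vectors(["cat cat dog", "dog bird"])
```

A term that appears in every document gets idf = log(1) = 0, echoing Zipf's lesson that the most frequent words are poor index terms.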

