					                  Chapter 2 Modeling

                           Hsin-Hsi Chen
                Department of Computer Science and
                      Information Engineering
                    National Taiwan University

• indexing: assign identifiers to text items.
• assign: manual vs. automatic indexing
• identifiers:
     – objective vs. nonobjective text identifiers
         cataloging rules define, e.g., author names, publisher
         names, dates of publications, …
     – controlled vs. uncontrolled vocabularies
         instruction manuals, terminological schedules, …
     – single-term vs. term phrase
                       Two Issues
• Issue 1: indexing exhaustivity
     – exhaustive: assign a large number of terms
     – nonexhaustive
• Issue 2: term specificity
     – broad terms (generic)
         cannot distinguish relevant from nonrelevant items
     – narrow terms (specific)
         retrieve relatively fewer items, but most of them are
         relevant
                     Parameters of
                retrieval effectiveness
• Recall
    Recall = (number of relevant items retrieved) /
             (total number of relevant items in collection)
• Precision
    Precision = (number of relevant items retrieved) /
                (total number of items retrieved)
• Goal
    high recall and high precision

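[Figure: the collection split into relevant and nonrelevant items, with the retrieved part marked;
 a = relevant items retrieved, b = nonrelevant items retrieved,
 c = nonrelevant items not retrieved, d = relevant items not retrieved]

          $\text{Recall} = \frac{a}{a+d}$        $\text{Precision} = \frac{a}{a+b}$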
                    A Joint Measure
• F-score
    $F = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}$

    – β is a parameter that encodes the relative importance of
      recall and precision.
    – β=1: equal weight
    – β>1: recall is more important
    – β<1: precision is more important
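The measures above can be sketched in a few lines of Python. This is only an illustration: the counts a, b, d follow the figure on the previous slide, the F-score is assumed to take the common form (β² + 1)·P·R / (β²·P + R), and the sample counts are made up.

```python
def recall(a: int, d: int) -> float:
    """a = relevant items retrieved, d = relevant items not retrieved."""
    return a / (a + d) if (a + d) else 0.0


def precision(a: int, b: int) -> float:
    """a = relevant items retrieved, b = nonrelevant items retrieved."""
    return a / (a + b) if (a + b) else 0.0


def f_score(p: float, r: float, beta: float = 1.0) -> float:
    """Assumed form: (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    denom = beta ** 2 * p + r
    return (beta ** 2 + 1) * p * r / denom if denom else 0.0


# Hypothetical counts, for illustration only.
a, b, d = 20, 30, 60
p = precision(a, b)   # 20 / 50
r = recall(a, d)      # 20 / 80
print(p, r, f_score(p, r))
```

With beta = 1 the score reduces to the harmonic mean of precision and recall.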
  Choices of Recall and Precision
• Both recall and precision vary from 0 to 1.
• In principle, the average user wants to
  achieve both high recall and high precision.
• In practice, a compromise must be reached
  because simultaneously optimizing recall
  and precision is not normally achievable.

Choices of Recall and Precision (Continued)

• Particular choices of indexing and search
  policies have produced variations in
  performance ranging from 0.8 precision and
  0.2 recall to 0.1 precision and 0.8 recall.
• In many circumstances, recall and precision
  values that both lie between 0.5 and 0.6
  are more satisfactory for the average user.
   Term-Frequency Consideration
• Function words
   – for example, "and", "or", "of", "but", …
   – the frequencies of these words are high in all
     texts
• Content words
   – words that actually relate to document content
   – varying frequencies in the different texts of a
     collection
   – indicate term importance for content
    A Frequency-Based Indexing Method

• Eliminate common function words from the
  document texts by consulting a special dictionary,
  or stop list, containing a list of high frequency
  function words.
• Compute the term frequency tfij for all remaining
  terms Tj in each document Di, specifying the
  number of occurrences of Tj in Di.
• Choose a threshold frequency T, and assign to
  each document Di all terms Tj for which tfij > T.
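The three steps above might look as follows in Python. This is a minimal sketch: the stop list is a tiny hypothetical sample, and tokenization is a naive whitespace split with punctuation stripped.

```python
from collections import Counter

# Hypothetical stop list; a real one would hold many more function words.
STOP_LIST = {"and", "or", "of", "but", "the", "a", "in", "for", "are"}


def index_terms(text: str, threshold: int) -> dict[str, int]:
    """Return the terms Tj of one document whose frequency tfij exceeds threshold."""
    tokens = [w.strip('.,;:!?"').lower() for w in text.split()]
    # Steps 1 and 2: drop function words, then count the remaining terms.
    tf = Counter(w for w in tokens if w and w not in STOP_LIST)
    # Step 3: keep only terms with tfij > T.
    return {term: n for term, n in tf.items() if n > threshold}


print(index_terms("the cat and the cat or a dog", threshold=1))
```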
• high-frequency terms
  favor recall
• high precision
  requires the ability to distinguish individual
  documents from each other
• high-frequency terms
  are good for precision only when their term frequency
  is not equally high in all documents.
     Inverse Document Frequency
• Inverse Document Frequency (IDF) for term Tj
                $idf_j = \log \frac{N}{df_j}$
    where N is the total number of documents in the collection and
    dfj (document frequency of term Tj) is the
    number of documents in which Tj occurs.
    – good index terms serve both recall and precision
    – they occur frequently in individual documents but
      rarely in the remainder of the collection
  New Term Importance Indicator
• weight wij of a term Tj in a document Di
                $w_{ij} = tf_{ij} \times \log \frac{N}{df_j}$
• Eliminating common function words
• Computing the value of wij for each term Tj in
  each document Di
• Assigning to the documents of a collection all
  terms with sufficiently high (tf x idf) factors
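As a rough illustration of the tf x idf weight, the sketch below assumes the weight is tf times log(N/df) with a natural logarithm, and that each document has already been tokenized and stripped of function words. The sample documents are invented.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Compute wij = tfij * log(N / dfj) for every term of every document."""
    n_docs = len(docs)
    # dfj: number of documents in which term Tj occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [["retrieval", "retrieval", "system"], ["system", "index"]]
for w in tf_idf(docs):
    print(w)
```

A term that occurs in every document ("system" here) gets weight 0, so it would never pass a "sufficiently high" cutoff.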

         Term-discrimination Value
• Useful index terms
  distinguish the documents of a collection from
  each other
• Document Space
     – when two documents are assigned very similar term sets,
       the corresponding points in the document
       configuration appear close together
     – when a high-frequency term without discrimination is
       assigned, it will increase the document space density
                  A Virtual Document Space

[Figure: three panels showing the document space in its original state,
 after assignment of a good discriminator, and after assignment of a poor discriminator]
                Good Term Assignment
• When a term is assigned to the documents
  of a collection, the few items to which the
  term is assigned will be distinguished from
  the rest of the collection.
• This should increase the average distance
  between the items in the collection and
  hence produce a document space less dense
  than before.
                Poor Term Assignment
• A high-frequency term is assigned that does
  not discriminate between the items of a
  collection.
• Its assignment will render the documents
  more similar.
• This is reflected in an increase in document
  space density.
        Term Discrimination Value
• definition
                dvj = Q - Qj
    where       Q and Qj are space densities before and
                after the assignment of term Tj.
                $Q = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{k=1, k \neq i}^{N} sim(D_i, D_k)$
• dvj>0, Tj is a good term;
  dvj<0, Tj is a poor term.
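A small sketch of this definition, under two assumptions that the slide does not fix: sim is taken to be the cosine measure, and "before assignment" is modeled by deleting the term from every document vector. Documents are hypothetical term-to-weight dictionaries.

```python
import math
from itertools import permutations


def cosine(x: dict[str, float], y: dict[str, float]) -> float:
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0


def density(docs: list[dict[str, float]]) -> float:
    """Q = 1 / (N (N - 1)) * sum over ordered pairs i != k of sim(Di, Dk); needs N >= 2."""
    n = len(docs)
    return sum(cosine(a, b) for a, b in permutations(docs, 2)) / (n * (n - 1))


def discrimination_value(docs: list[dict[str, float]], term: str) -> float:
    """dvj = Q - Qj: density without the term minus density with it."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return density(without) - density(docs)


# "x" occurs in every document, so assigning it only pulls documents together.
poor = [{"a": 1, "x": 1}, {"b": 1, "x": 1}, {"c": 1, "x": 1}]
# "a" occurs in one document only, so it separates that document from the rest.
good = [{"s": 1, "a": 1}, {"s": 1}, {"s": 1}]
print(discrimination_value(poor, "x"), discrimination_value(good, "a"))
```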

Variations of Term-Discrimination Value
       with Document Frequency

Thesaurus                       Phrase
transformation                  transformation
Low frequency    Medium frequency   High frequency
dvj=0            dvj>0              dvj<0

           Another Term Weighting
• wij = tfij x dvj
• compared with $w_{ij} = tf_{ij} \times \log \frac{N}{df_j}$
     – $\log \frac{N}{df_j}$: decreases steadily with increasing document
       frequency
     – dvj: increases from zero to positive as the document
       frequency of the term increases, then
       decreases sharply as the document frequency
       becomes still larger.
   Term Relationships in Indexing
• Single-term indexing
     – Single terms are often ambiguous.
     – Many single terms are either too specific or too
       broad to be useful.
• Complex text identifiers
     – subject experts and trained indexers
     – linguistic analysis algorithms, e.g., NP chunker
     – term-grouping or term clustering methods

  Term Classification (Clustering)
                 T1    T2    T3    ...   Tt
           D1    d11   d12   ...         d1t
           D2    d21   d22   ...         d2t
           ...
           Dn    dn1   dn2   ...         dnt
 Term Classification (Clustering)
• Column part
  Group terms whose corresponding column
  representations reveal similar assignments to the
  documents of the collection.
• Row part
  Group documents that exhibit sufficiently similar
  term assignment.

           Linguistic Methodologies
• Indexing phrases:
  nominal constructions including adjectives and
  nouns
     – Assign syntactic class indicators (i.e., part of speech) to
       the words occurring in document texts.
     – Construct word phrases from sequences of words
       exhibiting certain allowed syntactic markers (noun-
       noun and adjective-noun sequences).

                  Term-Phrase Formation
   • Term Phrase
     a sequence of related text words that carries a more
     specific meaning than the single terms
     e.g., “computer science” vs. “computer”
Thesaurus                           Phrase
transformation                      transformation
Low frequency        Medium frequency   High frequency
dvj=0                dvj>0              dvj<0
Simple Phrase-Formation Process
• the principal phrase component (phrase head)
  a term with a document frequency exceeding a
  stated threshold, or exhibiting a negative
  discriminator value
• the other components of the phrase
  medium- or low- frequency terms with stated co-
  occurrence relationships with the phrase head
• common function words
  not used in the phrase-formation process
                  An Example
• Effective retrieval systems are essential for
  people in need of information.
     – “are”, “for”, “in” and “of”:
       common function words
     – “system”, “people”, and “information”:
       phrase heads

       The Formatted Term-Phrases
 effective retrieval systems essential people need information

     Phrase Heads and Components   Phrase Heads and Components
     Must Be Adjacent              Co-occur in Sentence
     1. retrieval system*          6. effective systems
     2. systems essential          7. systems need
     3. essential people           8. effective people
     4. people need                9. retrieval people
     5. need information*          10. effective information*
                                   11. retrieval information*
                                   12. essential information*
               2/5                         5/12
  *: phrases assumed to be useful for content identification
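The left-hand column of the table (heads and components must be adjacent) can be reproduced with a short sketch. It assumes function words have already been removed and that the phrase heads are given; the function name is hypothetical.

```python
def adjacent_phrases(content_words: list[str], heads: set[str]) -> list[tuple[str, str]]:
    """Pair each phrase head with its immediate left or right neighbour."""
    return [
        (w1, w2)
        for w1, w2 in zip(content_words, content_words[1:])
        if w1 in heads or w2 in heads
    ]


words = ["effective", "retrieval", "systems", "essential",
         "people", "need", "information"]
heads = {"systems", "people", "information"}
for pair in adjacent_phrases(words, heads):
    print(" ".join(pair))
```

The sketch yields the five adjacent candidates; deciding which of them are useful (the starred entries) still needs the judgment or extra criteria discussed next.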
                The Problems
• A phrase-formation process controlled only by
  word co-occurrences and the document
  frequencies of certain words is not likely to
  generate a large number of high-quality phrases.
• Additional syntactic criteria for phrase heads and
  phrase components may provide further control in
  phrase formation.

Additional Term-Phrase Formation Steps

• Syntactic class indicators are assigned to the terms,
  and phrase formation is limited to sequences of
  specified syntactic markers, such as adjective-noun
  and noun-noun sequences.
      Adverb-adjective  adverb-noun 
• The phrase elements are all chosen from within
  the same syntactic unit, such as subject phrase,
  object phrase, and verb phrase.

                Consider Syntactic Unit
• effective retrieval systems are essential for
  people in need of information
• subject phrase
     – effective retrieval systems
• verb phrase
     – are essential
• object phrase
     – people in need of information
    Phrases within Syntactic Components
       [subj effective retrieval systems] [vp are essential ]
       for [obj people need information]
• Adjacent phrase heads and components within
  syntactic components
   – retrieval systems*
   – people need                  2/3
   – need information*
• Phrase heads and components co-occur within
  syntactic components
   – effective systems
• More stringent phrase formation criteria produce
  fewer phrases, both good and bad, than less
  stringent methodologies.
• Prepositional phrase attachment, e.g.,
           The man saw the girl with the telescope.
• Anaphora resolution
           He dropped the plate on his foot and broke it.

                 Problems (Continued)
• Any phrase matching system must be able to deal
  with the problems of
     – synonym recognition
     – differing word orders
     – intervening extraneous words
• Example
     – retrieval of information vs. information retrieval

   Equivalent Phrase Formulation
• Base form: text analysis system
• Variants:
     –   system analyzes the text
     –   text is analyzed by the system
     –   system carries out text analysis
     –   text is subjected to system analysis
• Related term substitution
     – text: documents, information items
     – analysis: processing, transformation, manipulation
     – system: program, process

        Thesaurus-Group Generation
 • Thesaurus transformation
      – broadens index terms whose scope is too narrow to be
        useful in retrieval
      – a thesaurus must assemble groups of related specific
        terms under more general, higher-level class indicators
Thesaurus                          Phrase
transformation                     transformation
Low frequency     Medium frequency      High frequency
dvj=0             dvj>0                 dvj<0
    Sample Classes of Roget's Thesaurus
    Class Indicator   Entries
    760               permission, leave, sanction, allowance, tolerance, authorization
    761               prohibition, veto, disallowance, injunction, ban, taboo
    762               consent, acquiescence, compliance
    763               offer, presentation, tender, overture, advance, submission,
                      proposal, proposition, invitation
    764               refusal, declining, noncompliance, rejection, denial
                The Indexing Prescription (1)

• Identify the individual words in the document
• Use a stop list to delete from the texts the function
  words.
• Use a suffix-stripping routine to reduce each
  remaining word to word-stem form.
• For each remaining word stem Tj in document Di,
  compute wij.
• Represent each document Di by
      Di=(T1, wi1; T2, wi2; …, Tt, wit)
                Word Stemming
• effectiveness --> effective --> effect
• picnicking --> picnic
• king -/-> k (the “ing” here is not a suffix and must not be stripped)
        Some Morphological Rules
• Restore a silent e after suffix removal from certain
  words to produce “hope” from “hoping” rather
  than “hop”
• Delete certain doubled consonants after suffix
  removal, so as to generate “hop” from “hopping”
  rather than “hopp”.
• Use a final y for an i in forms such as “easier”, so
  as to generate “easy” instead of “easi”.
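The three rules can be tried out with a toy stemmer. This is a deliberately crude sketch, not a real stemming algorithm: it knows only the suffixes "ing" and "er", and it stands in for the "certain words" of the silent-e rule with a consonant-vowel-consonant heuristic, which is an assumption and will misfire on many words.

```python
VOWELS = set("aeiou")


def stem(word: str) -> str:
    """Toy suffix stripper illustrating the three repair rules."""
    for suffix in ("ing", "er"):
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            break
    else:
        return word  # no known suffix
    if not any(c in VOWELS for c in base):
        return word  # e.g. "king": what remains is not a plausible stem
    if len(base) >= 2 and base[-1] == base[-2] and base[-1] not in VOWELS:
        return base[:-1]  # delete doubled consonant: "hopp" -> "hop"
    if base.endswith("i"):
        return base[:-1] + "y"  # final y for i: "easi" -> "easy"
    if (len(base) == 3 and base[0] not in VOWELS
            and base[1] in VOWELS and base[2] not in VOWELS):
        return base + "e"  # restore silent e (heuristic): "hop" -> "hope"
    return base


for w in ("hoping", "hopping", "easier", "king"):
    print(w, "->", stem(w))
```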

                The Indexing Prescription (2)
• Identify individual text words.
• Use stop list to delete common function words.
• Use automatic suffix stripping to produce word stems.
• Compute term-discrimination value for all word stems.
• Use thesaurus class replacement for all low-frequency
  terms with discrimination values near zero.
• Use phrase-formation process for all high-frequency terms
  with negative discrimination values.
• Compute weighting factors for complex indexing units.
• Assign to each document single term weights, term phrases,
  and thesaurus classes with weights.

                Query vs. Document
• Differences
     – Query texts are short.
     – Fewer terms are assigned to queries.
     – The occurrence of query terms rarely exceeds 1.

  Q=(wq1, wq2, …, wqt) where wqj: inverse document frequency
  Di=(di1, di2, …, dit)
     where dij: term frequency*inverse document frequency
                $sim(Q, D_i) = \sum_{j=1}^{t} w_{qj} \cdot d_{ij}$
                Query vs. Document
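• When non-normalized documents are used, the longer
  documents with more assigned terms have a greater chance
  of matching particular query terms than do the shorter
  document vectors.
• Normalizing by document length:
    $sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \cdot d_{ij}}{\sqrt{\sum_{j=1}^{t} (d_{ij})^2}}$
• Normalizing by both document and query length:
    $sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \cdot d_{ij}}{\sqrt{\sum_{j=1}^{t} (d_{ij})^2} \cdot \sqrt{\sum_{j=1}^{t} (w_{qj})^2}}$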
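The effect of length normalization can be seen in a short sketch. Vectors are plain lists over the same t terms, and the sample weights are invented for illustration.

```python
import math


def inner_product(q: list[float], d: list[float]) -> float:
    """Non-normalized similarity: sum over j of wqj * dij."""
    return sum(wq * wd for wq, wd in zip(q, d))


def cosine(q: list[float], d: list[float]) -> float:
    """Inner product divided by the lengths of both vectors."""
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return inner_product(q, d) / norm if norm else 0.0


query = [1.0, 0.0, 1.0]
short_doc = [1.0, 0.0, 1.0]   # matches the query exactly
long_doc = [2.0, 4.0, 2.0]    # longer document with an extra heavy term

print(inner_product(query, short_doc), inner_product(query, long_doc))
print(cosine(query, short_doc), cosine(query, long_doc))
```

Without normalization the longer document scores higher; with normalization the short, exactly matching document ranks first.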
                Relevance Feedback
• Terms present in previously retrieved documents that have
  been identified as relevant to the user's query are added to
  the original formulations.
• The weights of the original query terms are altered by
  replacing the inverse document frequency portion of the
  weights with term-relevance weights obtained by using the
  occurrence characteristics of the terms in the previously
  retrieved relevant and nonrelevant documents of the
  collection.
                Relevance Feedback
• Q = (wq1, wq2, ..., wqt)
• Di = (di1, di2, ..., dit)
• New query may be the following form
  Q' = a{wq1, wq2, ..., wqt}+{w'qt+1, w'qt+2, ..., w'qt+m}
• The weights of the newly added terms Tt+1
  to Tt+m may consist of a combined term-
  frequency and term-relevance weight.
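A minimal sketch of this query reformulation, assuming queries are term-to-weight dictionaries and that the weights of the added terms have already been computed elsewhere. The function name and sample weights are hypothetical.

```python
def expand_query(query: dict[str, float],
                 new_terms: dict[str, float],
                 a: float = 1.0) -> dict[str, float]:
    """Scale the original weights by a, then append terms not yet in the query."""
    expanded = {term: a * w for term, w in query.items()}
    for term, w in new_terms.items():
        # Only genuinely new terms are added; original terms keep their scaled weight.
        expanded.setdefault(term, w)
    return expanded


original = {"retrieval": 2.0, "system": 1.0}
from_relevant_docs = {"indexing": 0.5, "system": 9.9}
print(expand_query(original, from_relevant_docs, a=0.5))
```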

                   Final Indexing

• Identify individual text words.
• Use a stop list to delete common words.
• Use suffix stripping to produce word stems.
• Replace low-frequency terms with thesaurus classes.
• Replace high-frequency terms with phrases.
• Compute term weights for all single terms, phrases, and
  thesaurus classes.
• Compare query statements with document vectors.
• Identify some retrieved documents as relevant and some as
  nonrelevant to the query.

                  Final Indexing

• Compute term-relevance factors based on available
  relevance assessments.
• Construct new queries with added terms from relevant
  documents and term weights based on combined frequency
  and term-relevance weight.
• Return to step (7).
  Compare query statements with document vectors ……..

        Summary of expected effectiveness of
               automatic indexing
• Basic single-term automatic indexing       -
• Use of thesaurus to group related terms in the given topic
  area                                       +10% to +20%
• Use of automatically derived term associations obtained
  from joint term assignments found in sample document
  collections                                0% to -10%
• Use of automatically derived term phrases obtained by
  using co-occurring terms found in the texts of sample
  collections                                +5% to +10%
• Use of one iteration of relevance feedback to add new query
  terms extracted from previously retrieved relevant
  documents                                  +30% to +60%
• central problem of IR
     – Predict which documents are relevant and which are not
• Ranking
     – Establish an ordering of the documents retrieved
• IR models
     – Different models provide distinct sets of premises to
       deal with document relevance
    Information Retrieval Models
 • Classic Models
      – Boolean model
          • set theoretic
          • documents and queries are represented as sets of index terms
          • compare Boolean query statements with the term sets used to
            identify document content.
      – Vector model
          • algebraic model
          • documents and queries are represented as vectors in a t-
            dimensional space
          • compute global similarities between queries and documents.
      – Probabilistic model
          • probabilistic
          • documents and queries are represented on the basis of
            probabilistic theory
          • compute the relevance probabilities for the documents of a
            collection
     Information Retrieval Models

• Structured Models
     – reference to the structure present in written text
     – non-overlapping list model
     – proximal nodes model
• Browsing
     – flat
     – structured guided
     – hypertext

        Taxonomy of Information Retrieval Models
[Figure: taxonomy diagram; the recoverable labels are listed below]
• User Task
     – Retrieval: Adhoc, Filtering
     – Browsing
• Classic Models: boolean, vector, probabilistic
     – Set Theoretic: Fuzzy, Extended Boolean
     – Algebraic: Generalized Vector, Lat. Semantic Index, Neural Network
     – Probabilistic: Inference Network, Belief Network
• Structured Models
• Browsing: Flat, Structured Guided
         Issues of a retrieval system
• Models
     – boolean
     – vector
     – probabilistic
• Logical views of documents
     – full text
     – set of index terms
• User task
     – retrieval
     – browsing
          Combinations of these issues
                    LOGICAL VIEW OF DOCUMENTS
USER TASK    Index Terms        Full Text          Full Text + Structure
Retrieval    Classic,           Classic,           Structured
             Set Theoretic,     Set Theoretic,
             Algebraic,         Algebraic,
             Probabilistic      Probabilistic
Browsing     Flat               Flat,              Structure Guided,
                                Hypertext          Hypertext
  Retrieval: Ad hoc and Filtering
• Ad hoc retrieval
    – Documents remain relatively static while new queries
      are submitted
• Filtering
    – Queries remain relatively static while new documents
      come into the system
          • e.g., news wire services in the stock market
    – User profile describes the user's preferences
          • The filtering task indicates to the user which documents might be
            of interest to him
          • Deciding which ones are really relevant is left entirely to the user
    – Routing: a variation of filtering
          • Rank the filtered documents and show this ranking to the user
                   User profile
• Simplistic approach
     – The profile is described through a set of keywords
     – The user provides the necessary keywords
• Elaborate approach
     – Collect information from the user
     – initial profile + relevance feedback (relevant
       information and nonrelevant information)
  Formal Definition of IR Models
• [D, Q, F, R(qi, dj)]
     – D: a set composed of logical views (or representations)
       for the documents in the collection
     – Q: a set composed of logical views (or representations)
       for the user information needs
     – F: a framework for modeling document
       representations, queries, and their relationships
     – R(qi, dj): a ranking function which associates a real
       number with qi ∈ Q and dj ∈ D
  Formal Definition of IR Models

• classic Boolean model
   – set of documents
   – standard operations on sets
• classic vector model
   – t-dimensional vector space
   – standard linear algebra operations on vectors
• classic probabilistic model
      – sets
      – standard probabilistic operations, and Bayes' theorem
      Basic Concepts of Classic IR
• index terms (usually nouns): index and summarize
  the document contents
• weight of index terms
• Definition
     – K={k1, …, kt}: a set of all index terms
     – wi,j: a weight of an index term ki of a document dj
     – dj=(w1,j, w2,j, …, wt,j): an index term vector for the
       document dj
     – gi(dj)= wi,j
• assumption
     – index term weights are mutually independent:
       wi,j associated with (ki,dj) tells us nothing
       about wi+1,j associated with (ki+1,dj)
     – counterexample: the terms computer and network in the area of computer networks
                 Boolean Model
• The index term weight variables are all
  binary, i.e., wi,j ∈ {0,1}
• A query q is a Boolean expression (and, or, not)
• qdnf: the disjunctive normal form for q
• qcc: conjunctive components of qdnf
• sim(dj,q): similarity of dj to q
     – 1: if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ki, gi(dj)=gi(qcc))
          (dj is relevant to q)
     – 0: otherwise
                Boolean Model (Continued)
• Example
     – q = ka ∧ (kb ∨ ¬kc)
     – derivation of the disjunctive normal form:
         (ka ∧ kb) ∨ (ka ∧ ¬kc)
         = (ka ∧ kb ∧ kc) ∨ (ka ∧ kb ∧ ¬kc) ∨
           (ka ∧ kb ∧ ¬kc) ∨ (ka ∧ ¬kb ∧ ¬kc)
         = (ka ∧ kb ∧ kc) ∨ (ka ∧ kb ∧ ¬kc) ∨
           (ka ∧ ¬kb ∧ ¬kc)
     – qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
     [Figure: Venn diagram of ka, kb, kc with the component (1,0,0) marked]
                Boolean Model (Continued)
• advantage: simple
• disadvantage
     – binary decision (relevant or non-relevant)
       without grading scale
     – exact match (no partial match)
           • e.g., dj=(0,1,0) is non-relevant to q = ka ∧ (kb ∨ ¬kc)
     – retrieve too few or too many documents
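The exact-match behaviour of the Boolean model can be checked with a small sketch. It assumes a vocabulary of only three index terms (ka, kb, kc), derives the conjunctive components by brute-force enumeration, and uses hypothetical function names.

```python
from itertools import product


def to_dnf(query, n_terms: int) -> set[tuple[int, ...]]:
    """Enumerate every binary weight vector (conjunctive component) satisfying the query."""
    return {bits for bits in product((0, 1), repeat=n_terms) if query(*bits)}


def sim(doc: tuple[int, ...], qdnf: set[tuple[int, ...]]) -> int:
    """1 if the document's binary vector equals some conjunctive component, else 0."""
    return 1 if tuple(doc) in qdnf else 0


def q(ka: int, kb: int, kc: int) -> bool:
    # The example query: ka AND (kb OR NOT kc)
    return bool(ka and (kb or not kc))


qdnf = to_dnf(q, 3)
print(sorted(qdnf, reverse=True))
print(sim((0, 1, 0), qdnf), sim((1, 0, 0), qdnf))
```

The document (0,1,0) contains kb but scores 0 because ka is missing, which is the "no partial match" limitation listed above.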

                Basic Vector Space Model
• Term vector representation of
        documents Di=(ai1, ai2, …, ait)
        queries Qj=(qj1, qj2, …, qjt)
• t distinct terms are used to characterize content.
• Each term is identified with a term vector T.
• t vectors are linearly independent.
• Any vector is represented as a linear combination of the t
  term vectors.
• The rth document Dr can be represented as a document
  vector, written as
                      $D_r = \sum_{i=1}^{t} a_{ri} T_i$
Document representation in vector space
 a document vector in a two-dimensional vector space

                 Similarity Measure
• measure by the inner product of two vectors
     x • y = |x| |y| cos α
• document-query similarity
    document vector:  $D_r = \sum_{i=1}^{t} a_{ri} T_i$
    query vector:     $Q_s = \sum_{j=1}^{t} q_{sj} T_j$
    $D_r \cdot Q_s = \sum_{i,j=1}^{t} a_{ri} q_{sj} T_i \cdot T_j$
• how to determine the vector components and term
  correlations
            Similarity Measure (Continued)
• vector components
                      T1    T2    T3    ...   Tt
                D1    a11   a12   ...         a1t
                D2    a21   a22   ...         a2t
                ...
                Dn    an1   an2   ...         ant
          Similarity Measure (Continued)
• term correlations Ti • Tj are not available
  assumption: term vectors are orthogonal
      Ti • Tj = 0 (i≠j), Ti • Tj = 1 (i=j)
• Assume that terms are uncorrelated.
                $sim(D_r, Q_s) = \sum_{i=1}^{t} a_{ri} q_{si}$
• Similarity measurement between documents
                $sim(D_r, D_s) = \sum_{i=1}^{t} a_{ri} a_{si}$
                Sample query-document
                 similarity computation
• D1=2T1+3T2+5T3         D2=3T1+7T2+1T3
  Q=0T1+0T2+2T3
• similarity computations for uncorrelated terms
  sim(D1,Q)=2•0+3•0+5•2=10
  sim(D2,Q)=3•0+7•0+1•2=2
• D1 is preferred

          Sample query-document
       similarity computation (Continued)
• term correlations Ti • Tj
                   T1    T2     T3
             T1    1     0.5    0
             T2    0.5   1      -0.2
             T3    0     -0.2   1
• similarity computations for correlated terms
  sim(D1,Q)=(2T1+3T2+5T3) • (0T1+0T2+2T3)
             =4T1•T3+6T2•T3+10T3•T3
             =4•0+6•(-0.2)+10•1=8.8
  sim(D2,Q)=(3T1+7T2+1T3) • (0T1+0T2+2T3)
             =6T1•T3+14T2•T3+2T3•T3
             =6•0+14•(-0.2)+2•1=-0.8
• D1 is preferred
                Vector Model
• wi,j: a positive, non-binary weight for (ki,dj)
• wi,q: a positive, non-binary weight for (ki,q)
• q=(w1,q, w2,q, …, wt,q): a query vector,
  where t is the total number of index terms in
  the system
• dj= (w1,j, w2,j, …, wt,j): a document vector

  Similarity of document dj w.r.t. query q

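• The correlation between vectors dj and q
      $\mathrm{sim}(d_j, q) = \cos(\vec{d_j}, \vec{q}) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}$

• | q | does not affect the ranking
• | dj | provides a normalization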
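A minimal sketch of the cosine ranking follows. The helper names `cosine` and `rank` and the toy vectors are assumptions made for illustration; documents and the query are plain lists of term weights of equal length t.

```python
import math

def cosine(d, q):
    """Cosine of the angle between weight vectors d and q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_d * norm_q)

def rank(docs, q):
    """Document names sorted by decreasing similarity to q."""
    return sorted(docs, key=lambda name: cosine(docs[name], q), reverse=True)

docs = {"d1": [2, 3, 5], "d2": [3, 7, 1]}
order = rank(docs, [0, 0, 2])
order_scaled = rank(docs, [0, 0, 20])  # same direction, larger |q|
```

Scaling the query changes every score by the same factor, so the order is unchanged, which is why | q | does not affect the ranking.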
                document ranking
• Similarity (i.e., sim(q, dj)) varies from 0 to 1.
• Retrieve the documents with a degree of
  similarity above a predefined threshold
  (allow partial matching)

        term weighting techniques
• IR problem: one of clustering
    – user query: a specification of a set A of objects
    – clustering problem: determine which documents are in
      the set A (relevant), which ones are not (non-relevant)
    – intra-cluster similarity
         • the features better describe the objects in the set A
         • tf factor in vector model
           the raw frequency of a term ki inside a document dj
    – inter-cluster similarity
           • the features better distinguish the objects in the set A from
              the remaining objects in the collection C
           • idf factor (inverse document frequency) in vector model
              the inverse of the frequency of a term ki among the documents
              in the collection
                Definition of tf
• N: total number of documents in the system
• ni: the number of documents in which the
  index term ki appears
• freqi,j: the raw frequency of term ki in the
  document dj
• fi,j: the normalized frequency of term ki in
  document dj
      $f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$
  where the maximum is taken over all terms kl in dj (the term with the
  maximum frequency in the document dj)
                Definition of idf and
                   tf-idf scheme
• idfi: inverse document frequency for ki
      $idf_i = \log \frac{N}{n_i}$
• wi,j: term-weighting by tf-idf scheme
      $w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$
• query term weight (Salton and Buckley)
      $w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i}$

  freqi,q: the raw frequency of the term ki in q
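The weighting scheme can be sketched as below. The slides do not fix the base of the logarithm, so the natural logarithm is an assumption here, as are the function names and the token-list representation of documents.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """n_i: number of documents in which each term appears."""
    return Counter(term for doc in docs for term in set(doc))

def tf_idf(docs):
    """One {term: w_ij} dict per document, w_ij = f_ij * log(N / n_i)."""
    N = len(docs)
    n = document_frequencies(docs)
    weights = []
    for doc in docs:
        freq = Counter(doc)                      # raw frequencies freq_ij
        max_freq = max(freq.values()) if freq else 1
        weights.append({k: (f / max_freq) * math.log(N / n[k])
                        for k, f in freq.items()})
    return weights

def query_weights(query, docs):
    """Salton-Buckley query weights; terms unseen in the collection are skipped."""
    N = len(docs)
    n = document_frequencies(docs)
    freq = Counter(query)
    max_freq = max(freq.values())
    return {k: (0.5 + 0.5 * f / max_freq) * math.log(N / n[k])
            for k, f in freq.items() if n[k] > 0}

docs = [["a", "a", "b"], ["a", "c"], ["c", "c", "c", "b"], ["d"]]
w = tf_idf(docs)
wq = query_weights(["a", "d"], docs)
```

A term that occurs in every document gets weight 0, since log(N / n_i) = log 1 = 0.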
           Analysis of vector model
• advantages
     – its term-weighting scheme improves retrieval
     – its partial matching strategy allows retrieval of
       documents that approximate the query conditions
     – its cosine ranking formula sorts the documents
       according to their degree of similarity to the query
• disadvantages
     – index terms are assumed to be mutually
       independent

                Probabilistic Model
• Given a query, there is an ideal answer set
     – a set of documents which contains exactly the
       relevant documents and no other
• query process
     – a process of specifying the properties of an
       ideal answer set
• problem: what are the properties?

          Probabilistic Model (Continued)
• Generate a preliminary probabilistic
  description of the ideal answer set
• Initiate an interaction with the user
     – User looks at the retrieved documents and
       decides which ones are relevant and which ones
       are not
     – System uses this information to refine the
       description of the ideal answer set
     – Repeat the process many times.
                Probabilistic Principle
• Given a user query q and a document dj in the
  collection, the probabilistic model estimates the
  probability that the user will find dj relevant
• assumptions
     – The probability of relevance depends on query and
       document representations only
     – There is a subset of all documents which the user
       prefers as the answer set for the query q
• Given a query, the probabilistic model assigns to
  each document dj a measure of its similarity to the
  query
      $\frac{P(d_j \text{ relevant to } q)}{P(d_j \text{ nonrelevant to } q)}$
                Probabilistic Principle
• wi,j ∈ {0,1}, wi,q ∈ {0,1}: the index term weight variables
  are all binary
• q: a query which is a subset of index terms
• R: the set of documents known to be relevant
• R̄ (complement of R): the set of non-relevant documents
• P(R|dj): the probability that the document dj is relevant
  to the query q
• P(R̄|dj): the probability that dj is non-relevant to q
• sim(dj,q): the similarity of the document dj
  to the query q
      $\mathrm{sim}(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}$          (by definition)

      $\mathrm{sim}(d_j, q) = \frac{P(d_j \mid R) \times P(R)}{P(d_j \mid \bar{R}) \times P(\bar{R})}$          (Bayes' rule)

      $\mathrm{sim}(d_j, q) \sim \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})}$          (P(R) and P(R̄) are the same for all documents)

     P(dj|R): the probability of randomly selecting the document
              dj from the set R of relevant documents
     P(R): the probability that a document randomly selected from
           the entire collection is relevant
      $\mathrm{sim}(d_j, q) \sim \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})}$

      $\sim \log \frac{\prod_{i=1}^{t} P(k_i \mid R)^{g_i(d_j)}\, P(\bar{k}_i \mid R)^{1 - g_i(d_j)}}{\prod_{i=1}^{t} P(k_i \mid \bar{R})^{g_i(d_j)}\, P(\bar{k}_i \mid \bar{R})^{1 - g_i(d_j)}}$          (independence assumption of index terms)

      $= \sum_{i=1}^{t} \log \frac{P(k_i \mid R)^{g_i(d_j)}\, P(\bar{k}_i \mid R)^{1 - g_i(d_j)}}{P(k_i \mid \bar{R})^{g_i(d_j)}\, P(\bar{k}_i \mid \bar{R})^{1 - g_i(d_j)}}$

      $= \sum_{i=1}^{t} \log \frac{\left(P(k_i \mid R) / P(\bar{k}_i \mid R)\right)^{g_i(d_j)}\, P(\bar{k}_i \mid R)}{\left(P(k_i \mid \bar{R}) / P(\bar{k}_i \mid \bar{R})\right)^{g_i(d_j)}\, P(\bar{k}_i \mid \bar{R})}$

      $= \sum_{i=1}^{t} g_i(d_j) \log \frac{P(k_i \mid R) / P(\bar{k}_i \mid R)}{P(k_i \mid \bar{R}) / P(\bar{k}_i \mid \bar{R})} + \sum_{i=1}^{t} \log \frac{P(\bar{k}_i \mid R)}{P(\bar{k}_i \mid \bar{R})}$

      $= \sum_{i=1}^{t} g_i(d_j) \log \frac{P(k_i \mid R) / (1 - P(k_i \mid R))}{P(k_i \mid \bar{R}) / (1 - P(k_i \mid \bar{R}))} + \sum_{i=1}^{t} \log \frac{P(\bar{k}_i \mid R)}{P(\bar{k}_i \mid \bar{R})}$

     P(ki|R): the probability that the index term ki is present in a document
              randomly selected from the set R.
     P(k̄i|R): the probability that the index term ki is not present in a document
              randomly selected from the set R.
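      $\mathrm{sim}(d_j, q) \sim \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})}$

      $\sim \sum_{i=1}^{t} g_i(d_j) \log \frac{P(k_i \mid R) / (1 - P(k_i \mid R))}{P(k_i \mid \bar{R}) / (1 - P(k_i \mid \bar{R}))} + \sum_{i=1}^{t} \log \frac{P(\bar{k}_i \mid R)}{P(\bar{k}_i \mid \bar{R})}$

      $= \sum_{i=1}^{t} g_i(d_j) \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right) + \sum_{i=1}^{t} \log \frac{P(\bar{k}_i \mid R)}{P(\bar{k}_i \mid \bar{R})}$

      $\sim \sum_{i=1}^{t} g_i(d_j) \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$

     (the second sum is the same for every document, so it can be dropped
      without changing the ranking)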

     Problem: where is the set R?

                Initial guess
• P(ki|R) is constant for all index terms ki.
      $P(k_i \mid R) = 0.5$

• The distribution of index terms among the
  non-relevant documents can be
  approximated by the distribution of index
  terms among all the documents in the
  collection.
      $P(k_i \mid \bar{R}) = \frac{n_i}{N}$
                Initial ranking
• V: a subset of the documents initially retrieved
  and ranked by the probabilistic model (top r
  documents)
• Vi: subset of V composed of documents which
  contain the index term ki
  (in the formulas, V and Vi also denote the sizes of these sets)
• Approximate P(ki|R) by the distribution of the
  index term ki among the documents retrieved so
  far.
      $P(k_i \mid R) = \frac{V_i}{V}$
• Approximate P(ki|R̄) by considering that all the
  non-retrieved documents are not relevant.
      $P(k_i \mid \bar{R}) = \frac{n_i - V_i}{N - V}$
           Small values of V and Vi
      $P(k_i \mid R) = \frac{V_i}{V}$          $P(k_i \mid \bar{R}) = \frac{n_i - V_i}{N - V}$
  a problem when V=1 and Vi=0
• alternative 1
      $P(k_i \mid R) = \frac{V_i + 0.5}{V + 1}$

      $P(k_i \mid \bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}$

• alternative 2
      $P(k_i \mid R) = \frac{V_i + \frac{n_i}{N}}{V + 1}$

      $P(k_i \mid \bar{R}) = \frac{n_i - V_i + \frac{n_i}{N}}{N - V + 1}$
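The estimation loop can be sketched as follows: an initial ranking with P(ki|R)=0.5 and P(ki|R̄)=ni/N, then one re-estimation using alternative 1. This is a sketch under stated assumptions: documents are sets of terms, the sum is restricted to query terms (the usual reading of the formula, not spelled out on the slides), and all names and the toy collection are hypothetical.

```python
import math

def term_weight(p, u):
    """log(p / (1 - p)) + log((1 - u) / u) for one index term."""
    return math.log(p / (1 - p)) + math.log((1 - u) / u)

def score(doc, query, p_rel, p_non):
    """Sum the term weights over query terms present in the document."""
    return sum(term_weight(p_rel[k], p_non[k]) for k in query if k in doc)

def initial_estimates(docs, query):
    """Initial guess; assumes 0 < n_i < N so that the logs are defined."""
    N = len(docs)
    p_rel = {k: 0.5 for k in query}
    p_non = {k: sum(k in d for d in docs) / N for k in query}
    return p_rel, p_non

def reestimate(docs, query, retrieved):
    """Alternative 1: add 0.5 and 1 to avoid zero or one probabilities."""
    N, V = len(docs), len(retrieved)
    p_rel, p_non = {}, {}
    for k in query:
        n_i = sum(k in d for d in docs)
        V_i = sum(k in docs[j] for j in retrieved)
        p_rel[k] = (V_i + 0.5) / (V + 1)
        p_non[k] = (n_i - V_i + 0.5) / (N - V + 1)
    return p_rel, p_non

DOCS = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}, {"c", "d"}]
QUERY = {"a", "b"}

p_rel, p_non = initial_estimates(DOCS, QUERY)
first = [score(d, QUERY, p_rel, p_non) for d in DOCS]
top = sorted(range(len(DOCS)), key=lambda j: first[j], reverse=True)[:2]
p_rel2, p_non2 = reestimate(DOCS, QUERY, top)
second = [score(d, QUERY, p_rel2, p_non2) for d in DOCS]
```

Here the top two documents of the first ranking play the role of V; in an interactive system the user's relevance judgments would select them instead.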
  Analysis of Probabilistic Model
• advantage
     – documents are ranked in decreasing order of
       their probability of being relevant
• disadvantages
     – the need to guess the initial separation of
       documents into relevant and non-relevant sets
     – does not consider the frequency with which an
       index term occurs inside a document
     – the independence assumption for index terms
     Comparison of classic models
• Boolean model: the weakest classic model
• Vector model is expected to outperform the
  probabilistic model with general collections
  (Salton and Buckley)

Alternative Set Theoretic Models
       -Fuzzy Set Model
• Model
     – a query term: a fuzzy set
     – a document: degree of membership in this set
     – membership function
           • Associate membership function with the elements of
             the class (the documents)
           • 0: no membership in the set
           • 1: full membership
           • 0~1: marginal elements of the set

                 Fuzzy Set Theory

• A fuzzy subset A (a class) of a universe of discourse
  U is characterized by a membership
  function µA: U→[0,1] which associates with
  each element u of U (a document) a number µA(u) in the
  interval [0,1]
     – complement: $\mu_{\bar{A}}(u) = 1 - \mu_A(u)$
     – union: $\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$
     – intersection: $\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$
• Assume U={d1, d2, d3, d4, d5, d6}
• Let A and B be {d1, d2, d3} and {d2, d3, d4}, respectively
• Assume A={d1:0.8, d2:0.7, d3:0.6, d4:0, d5:0, d6:0} and
    B={d1:0, d2:0.6, d3:0.8, d4:0.9, d5:0, d6:0}
• $\mu_{\bar{A}}(u) = 1 - \mu_A(u)$ ={d1:0.2, d2:0.3, d3:0.4, d4:1, d5:1, d6:1}
• $\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$ ={d1:0.8, d2:0.7, d3:0.8, d4:0.9,
    d5:0, d6:0}
• $\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$ ={d1:0, d2:0.6, d3:0.6, d4:0,
    d5:0, d6:0}
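The three fuzzy set operations can be checked on this example with a small sketch. Membership functions are represented as dicts from document name to degree, which is an illustrative choice.

```python
A = {"d1": 0.8, "d2": 0.7, "d3": 0.6, "d4": 0.0, "d5": 0.0, "d6": 0.0}
B = {"d1": 0.0, "d2": 0.6, "d3": 0.8, "d4": 0.9, "d5": 0.0, "d6": 0.0}

def complement(mu):
    """Membership in the complement: 1 - mu(u)."""
    return {u: 1 - m for u, m in mu.items()}

def union(mu_a, mu_b):
    """Membership in the union: max of the two degrees."""
    return {u: max(mu_a[u], mu_b[u]) for u in mu_a}

def intersection(mu_a, mu_b):
    """Membership in the intersection: min of the two degrees."""
    return {u: min(mu_a[u], mu_b[u]) for u in mu_a}

not_a = complement(A)
a_or_b = union(A, B)
a_and_b = intersection(A, B)
```

Since d1 has degree 0 in B, its degree in the intersection is min(0.8, 0) = 0.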
       Fuzzy Information Retrieval
• basic idea
     – Expand the set of index terms in the query with
       related terms (from the thesaurus) such that
       additional relevant documents can be retrieved
     – A thesaurus can be constructed by defining a
       term-term correlation matrix c (a keyword connection
       matrix) whose rows and columns are associated to the
       index terms in the document collection
       Fuzzy Information Retrieval

• normalized correlation factor ci,l between
  two terms ki and kl (0~1)
      $c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$
  where ni is # of documents containing term ki
        nl is # of documents containing term kl
        ni,l is # of documents containing ki and kl
• In the fuzzy set associated to each index
  term ki, a document dj has a degree of
  membership µi,j
      $\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$
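Both formulas can be sketched directly from document counts. Documents are modeled as sets of terms; the function names and the four-document collection are assumptions made for the example.

```python
def correlation(docs, ki, kl):
    """c_il = n_il / (n_i + n_l - n_il), computed from document counts."""
    n_i = sum(ki in d for d in docs)
    n_l = sum(kl in d for d in docs)
    n_il = sum(ki in d and kl in d for d in docs)
    denom = n_i + n_l - n_il
    return n_il / denom if denom else 0.0

def membership(docs, ki, doc):
    """mu_ij = 1 - product over terms kl of doc of (1 - c_il)."""
    prod = 1.0
    for kl in doc:
        prod *= 1 - correlation(docs, ki, kl)
    return 1 - prod

DOCS = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
mu_direct = membership(DOCS, "a", DOCS[0])    # document contains "a" itself
mu_related = membership(DOCS, "a", DOCS[2])   # only related through "b"
mu_unrelated = membership(DOCS, "a", DOCS[3]) # "c" never co-occurs with "a"
```

A document that contains ki itself gets membership 1 because c_ii = 1, while a document whose terms never co-occur with ki gets membership 0.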
       Fuzzy Information Retrieval

• physical meaning
     – A document dj belongs to the fuzzy set associated to the
       term ki if its own terms are related to ki, i.e., µi,j=1.
     – If there is at least one index term kl of dj which is
       strongly related to the index ki, then µi,j ≈ 1:
       ki is a good fuzzy index
     – When all index terms of dj are only loosely related to ki,
       ki is not a good fuzzy index

 • q = ka ∧ (kb ∨ ¬kc)
     = (ka ∧ kb ∧ kc) ∨ (ka ∧ kb ∧ ¬kc) ∨ (ka ∧ ¬kb ∧ ¬kc)
 [Figure: diagram of the fuzzy sets Da, Db, Dc with the conjunctive components cc1, cc2, cc3 marked]
 • Da: the fuzzy set of documents associated to the index ka
     dj ∈ Da has a degree of membership µa,j > a predefined threshold K
 • D̄a: the fuzzy set of documents associated to the index ¬ka
     (the negation of index term ka)
Query q = ka ∧ (kb ∨ ¬kc)

 disjunctive normal form qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
(1) the degree of membership in a disjunctive fuzzy set is computed
using an algebraic sum (instead of the max function), which varies more smoothly
(2) the degree of membership in a conjunctive fuzzy set is computed
using an algebraic product (instead of the min function)
      $\mu_{q,j} = \mu_{cc_1 + cc_2 + cc_3,\, j}$
      $= 1 - \prod_{i=1}^{3} (1 - \mu_{cc_i, j})$
      $= 1 - (1 - \mu_{a,j}\mu_{b,j}\mu_{c,j}) \times (1 - \mu_{a,j}\mu_{b,j}(1 - \mu_{c,j})) \times (1 - \mu_{a,j}(1 - \mu_{b,j})(1 - \mu_{c,j}))$
Recall $\mu_{\bar{A}}(u) = 1 - \mu_A(u)$
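The membership of a document in the query's fuzzy set follows mechanically from the three conjunctive components. A minimal sketch, with `query_membership` as a hypothetical name and the three arguments standing for µa,j, µb,j and µc,j:

```python
def query_membership(mu_a, mu_b, mu_c):
    """Membership of a document in q = ka AND (kb OR NOT kc)."""
    # Algebraic product for each conjunctive component of the DNF.
    cc1 = mu_a * mu_b * mu_c              # (1,1,1)
    cc2 = mu_a * mu_b * (1 - mu_c)        # (1,1,0)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)  # (1,0,0)
    # Algebraic sum over the disjunction: 1 - product of complements.
    prod = 1.0
    for cc in (cc1, cc2, cc3):
        prod *= 1 - cc
    return 1 - prod

crisp_match = query_membership(1.0, 0.0, 0.0)   # ka, not kb, not kc
crisp_miss = query_membership(1.0, 0.0, 1.0)    # ka, not kb, kc
partial = query_membership(0.8, 0.5, 0.5)
```

With crisp degrees (0 or 1) the function agrees with the Boolean query; with intermediate degrees it returns a value strictly between 0 and 1.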
 Alternative Algebraic Model:
Generalized Vector Space Model
• independence of index terms
     – ki: a vector associated with the index term ki
     – the set of vectors {k1, k2, …, kt} is linearly independent
           • orthogonal: $\vec{k}_i \cdot \vec{k}_j = 0$ for i ≠ j
     – The index term vectors are assumed linearly
       independent but are not pairwise orthogonal in
       generalized vector space model
     – The index term vectors, which are not seen as the basis
       of the space, are composed of smaller components
       derived from the particular collection.
    Generalized Vector Space Model
  • {k1, k2, …, kt}: index terms in a collection
  • wi,j: binary weights associated with the term-document pair
     {ki, dj}
  • The patterns of term co-occurrence (inside documents) can
     be represented by a set of 2^t minterms
m1=(0, 0, …, 0): point to documents containing none of index terms
m2=(1, 0, …, 0): point to documents containing the index term k1 only
m3=(0,1,…,0): point to documents containing the index term k2 only
m4=(1,1,…,0): point to documents containing the index terms k1 and k2
m2^t=(1, 1, …, 1): point to documents containing all the index terms
   • gi(mj): return the weight {0,1} of the index term ki in the
      minterm mj (1 ≤ i ≤ t)
Generalized Vector Space Model
    $\vec{m}_1 = (1,0,\ldots,0,0)$
    $\vec{m}_2 = (0,1,\ldots,0,0)$
    ...
    $\vec{m}_{2^t} = (0,0,\ldots,0,1)$
    $\vec{m}_i \cdot \vec{m}_j = 0$ for i ≠ j   (the set of vectors $\vec{m}_i$ is pairwise orthogonal)
• the vector $\vec{m}_i$ (a 2^t-tuple) is associated with the minterm mi
    (a t-tuple)
• e.g., $\vec{m}_4$ is associated with m4 containing k1 and k2,
    and no others
• co-occurrence of index terms inside documents:
    dependencies among index terms
t=3
    minterm mr     vector mr                documents
    m1=(0,0,0)     m1=(1,0,0,0,0,0,0,0)     d1 (t1)       d11 (t1 t2)
    m2=(0,0,1)     m2=(0,1,0,0,0,0,0,0)     d2 (t3)       d12 (t1 t3)
    m3=(0,1,0)     m3=(0,0,1,0,0,0,0,0)     d3 (t3)       d13 (t1 t2)
    m4=(0,1,1)     m4=(0,0,0,1,0,0,0,0)     d4 (t1)       d14 (t1 t2)
    m5=(1,0,0)     m5=(0,0,0,0,1,0,0,0)     d5 (t2)       d15 (t1 t2 t3)
    m6=(1,0,1)     m6=(0,0,0,0,0,1,0,0)     d6 (t2)       d16 (t1 t2)
    m7=(1,1,0)     m7=(0,0,0,0,0,0,1,0)     d7 (t2 t3)    d17 (t1 t2)
    m8=(1,1,1)     m8=(0,0,0,0,0,0,0,1)     d8 (t2 t3)    d18 (t1 t2)
                                            d9 (t2)       d19 (t1 t2 t3)
                                            d10 (t2 t3)   d20 (t1 t2)

    $\vec{k}_1 = \frac{c_{1,5}\vec{m}_5 + c_{1,6}\vec{m}_6 + c_{1,7}\vec{m}_7 + c_{1,8}\vec{m}_8}{\sqrt{c_{1,5}^2 + c_{1,6}^2 + c_{1,7}^2 + c_{1,8}^2}}$
    $c_{1,5} = w_{1,1} + w_{1,4}$          $c_{1,6} = w_{1,12}$
    $c_{1,7} = w_{1,11} + w_{1,13} + w_{1,14} + w_{1,16} + w_{1,17} + w_{1,18} + w_{1,20}$
    $c_{1,8} = w_{1,15} + w_{1,19}$
    (minterms mr, vectors mr and documents d1 to d20 as in the table for k1)

    $\vec{k}_2 = \frac{c_{2,3}\vec{m}_3 + c_{2,4}\vec{m}_4 + c_{2,7}\vec{m}_7 + c_{2,8}\vec{m}_8}{\sqrt{c_{2,3}^2 + c_{2,4}^2 + c_{2,7}^2 + c_{2,8}^2}}$
    $c_{2,3} = w_{2,5} + w_{2,6} + w_{2,9}$          $c_{2,4} = w_{2,7} + w_{2,8} + w_{2,10}$
    $c_{2,7} = w_{2,11} + w_{2,13} + w_{2,14} + w_{2,16} + w_{2,17} + w_{2,18} + w_{2,20}$
    $c_{2,8} = w_{2,15} + w_{2,19}$
    (minterms mr, vectors mr and documents d1 to d20 as in the table for k1)

    $\vec{k}_3 = \frac{c_{3,2}\vec{m}_2 + c_{3,4}\vec{m}_4 + c_{3,6}\vec{m}_6 + c_{3,8}\vec{m}_8}{\sqrt{c_{3,2}^2 + c_{3,4}^2 + c_{3,6}^2 + c_{3,8}^2}}$
    $c_{3,2} = w_{3,2} + w_{3,3}$          $c_{3,4} = w_{3,7} + w_{3,8} + w_{3,10}$          $c_{3,6} = w_{3,12}$
    $c_{3,8} = w_{3,15} + w_{3,19}$
  Generalized Vector Space Model

  • Determine the index vector ki associated
    with the index term ki

      $\vec{k}_i = \frac{\sum_{\forall r,\; g_i(m_r)=1} c_{i,r}\, \vec{m}_r}{\sqrt{\sum_{\forall r,\; g_i(m_r)=1} c_{i,r}^{2}}}$
    Collect all the vectors mr in which the index term ki is in state 1.

      $c_{i,r} = \sum_{d_j \,\mid\, g_l(d_j) = g_l(m_r) \text{ for all } l} w_{i,j}$
    Sum up wi,j associated with the index term ki and document dj whose
    term occurrence pattern coincides with minterm mr
 Generalized Vector Space Model
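• ki • kj quantifies a degree of correlation
  between ki and kj
      $\vec{k}_i \cdot \vec{k}_j = \sum_{\forall r \,\mid\, g_i(m_r)=1 \,\wedge\, g_j(m_r)=1} c_{i,r} \times c_{j,r}$

• standard cosine similarity is adopted
      $\vec{d}_j = \sum_i w_{i,j}\, \vec{k}_i$          $\vec{q} = \sum_i w_{i,q}\, \vec{k}_i$

      $\vec{k}_i = \frac{\sum_{\forall r,\; g_i(m_r)=1} c_{i,r}\, \vec{m}_r}{\sqrt{\sum_{\forall r,\; g_i(m_r)=1} c_{i,r}^{2}}}$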
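      $\vec{k}_1 = \frac{c_{1,5}\vec{m}_5 + c_{1,6}\vec{m}_6 + c_{1,7}\vec{m}_7 + c_{1,8}\vec{m}_8}{\sqrt{c_{1,5}^2 + c_{1,6}^2 + c_{1,7}^2 + c_{1,8}^2}}$

      $\vec{k}_2 = \frac{c_{2,3}\vec{m}_3 + c_{2,4}\vec{m}_4 + c_{2,7}\vec{m}_7 + c_{2,8}\vec{m}_8}{\sqrt{c_{2,3}^2 + c_{2,4}^2 + c_{2,7}^2 + c_{2,8}^2}}$

      $\vec{k}_3 = \frac{c_{3,2}\vec{m}_2 + c_{3,4}\vec{m}_4 + c_{3,6}\vec{m}_6 + c_{3,8}\vec{m}_8}{\sqrt{c_{3,2}^2 + c_{3,4}^2 + c_{3,6}^2 + c_{3,8}^2}}$

      $\vec{k}_1 \cdot \vec{k}_2 = c_{1,7} \times c_{2,7} + c_{1,8} \times c_{2,8}$
      $\vec{k}_1 \cdot \vec{k}_3 = c_{1,6} \times c_{3,6} + c_{1,8} \times c_{3,8}$
      $\vec{k}_2 \cdot \vec{k}_3 = c_{2,4} \times c_{3,4} + c_{2,8} \times c_{3,8}$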
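The correlation factors for the 20-document example can be computed with a short sketch. It assumes binary weights (wi,j = 1 whenever ki occurs in dj) and the minterm numbering of the t=3 table, where the pattern (g1, g2, g3) is read as a binary number and 1 is added; the function names are hypothetical.

```python
import math
from collections import defaultdict

T = 3
# Term indices contained in each document d1..d20 (from the example table).
DOCS = {
    1: {1}, 2: {3}, 3: {3}, 4: {1}, 5: {2}, 6: {2}, 7: {2, 3}, 8: {2, 3},
    9: {2}, 10: {2, 3}, 11: {1, 2}, 12: {1, 3}, 13: {1, 2}, 14: {1, 2},
    15: {1, 2, 3}, 16: {1, 2}, 17: {1, 2}, 18: {1, 2}, 19: {1, 2, 3},
    20: {1, 2},
}

def minterm_index(terms):
    """r such that m_r matches the occurrence pattern, e.g. {3} -> 2."""
    bits = 0
    for i in range(1, T + 1):
        bits = bits * 2 + (1 if i in terms else 0)
    return bits + 1

def correlation_factors(docs):
    """c[(i, r)]: sum of w_ij over documents whose pattern is minterm m_r."""
    c = defaultdict(float)
    for terms in docs.values():
        r = minterm_index(terms)
        for i in terms:
            c[(i, r)] += 1.0  # binary weight w_ij = 1 (assumption)
    return c

def index_vector(c, i):
    """Normalized 2^t-dimensional vector for index term k_i."""
    vec = [c.get((i, r), 0.0) for r in range(1, 2 ** T + 1)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def term_dot(c, i, j):
    """Sum of c_ir * c_jr over minterms in which both terms are active."""
    return sum(c[(i, r)] * c[(j, r)] for r in range(1, 2 ** T + 1)
               if (i, r) in c and (j, r) in c)

C = correlation_factors(DOCS)
k1 = index_vector(C, 1)
```

With these weights k1 and k2 are the most strongly correlated pair, which reflects the many documents that contain both t1 and t2.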
    Comparison with Standard Vector Space Model

d1 (t1): (w1,1,0,0)            d11 (t1 t2)
d2 (t3): (0,0,w3,2)            d12 (t1 t3)
d3 (t3): (0,0,w3,3)            d13 (t1 t2)
d4 (t1): (w1,4,0,0)            d14 (t1 t2)
d5 (t2): (0,w2,5,0)            d15 (t1 t2 t3)
d6 (t2): (0,w2,6,0)            d16 (t1 t2)
d7 (t2 t3): (0,w2,7,w3,7)      d17 (t1 t2)
d8 (t2 t3): (0,w2,8,w3,8)      d18 (t1 t2)
d9 (t2): (0,w2,9,0)            d19 (t1 t2 t3)
d10 (t2 t3): (0,w2,10,w3,10)   d20 (t1 t2)
Latent Semantic Indexing Model
• representation of documents and queries by
  index terms
   – problem 1: many unrelated documents might be
     included in the answer set
   – problem 2: relevant documents which are not
     indexed by any of the query keywords are not
     retrieved
• possible solution: concept matching instead
  of index term matching
     – application in cross-language information
       retrieval
                basic idea
• Map each document and query vector into a
  lower dimensional space which is
  associated with concepts
• Retrieval in the reduced space may be
  superior to retrieval in the space of index
  terms

• t: the number of index terms in the collection
• N: the total number of documents
• M=(Mij): a term-document association
  matrix with t rows and N columns
• Mij: a weight wi,j associated with the term-
  document pair [ki, dj] (e.g., using tf-idf)

      Singular Value Decomposition
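$A \in R^{n \times n}$
(1) $A = A^T$
$Q \in R^{n \times n}$ s.t. $QQ^T = I$ {$Q^TQ = I$}                orthogonal
singular value decomposition:
$A = QDQ^T$           {$A^T = (QDQ^T)^T = (Q^T)^T D^T Q^T = QDQ^T = A$}

   where $D = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ is a diagonal matrix
   with $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n \geq 0$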
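$A \in R^{n \times n}$
(2) $A \neq A^T$
$U, V \in R^{n \times n}$ s.t. $U^TU = I$, $V^TV = I$            orthogonal

singular value decomposition: $A = UDV^T$
$AA^T = (UDV^T)(UDV^T)^T = (UDV^T)(VDU^T) = UD^2U^T$          (using $(AB)^T = B^TA^T$)
where $D = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$ is a diagonal matrix
with $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_n \geq 0$
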
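where $Q = [q_1\; q_2\; \cdots\; q_n]$, $q_i$: a column vector
$A[q_1\; q_2\; \cdots\; q_n] = [q_1\; q_2\; \cdots\; q_n]\, \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$

$[Aq_1\; Aq_2\; \cdots\; Aq_n] = [\lambda_1 q_1\; \lambda_2 q_2\; \cdots\; \lambda_n q_n]$
$Aq_1 = \lambda_1 q_1$, $Aq_2 = \lambda_2 q_2$, …, $Aq_n = \lambda_n q_n$
λ1, λ2, …, λn are the eigenvalues of A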
     Singular Value Decomposition
   M: a term-document matrix with t rows and N columns, $M \in R^{t \times N}$
   According to the singular value decomposition, $M = K S D^t$
   $M^t M$: an N × N document-to-document matrix
   $M M^t$: a t × t term-to-term matrix
   K: the matrix of eigenvectors derived from $M M^t$, with $K^t K = I$
   D: the matrix of eigenvectors derived from $M^t M$, with $D^t D = I$
$M^t M$: document-to-document matrix
 $= (K S D^t)^t (K S D^t)$
 $= (D S K^t)(K S D^t)$
 $= D S^2 D^t$
 so D is the matrix of eigenvectors derived from $M^t M$
$M M^t$: term-to-term matrix
 $= (K S D^t)(K S D^t)^t$
 $= (K S D^t)(D S K^t)$
 $= K S^2 K^t$
 so K is the matrix of eigenvectors derived from $M M^t$
(compare $A = QDQ^T$, where Q is the matrix of eigenvectors of A and D is the
 diagonal matrix of singular values)
S: an r × r diagonal matrix of singular values, where r = min(t, N)
                            s < r (Concept space is reduced)
   Consider only the s largest singular values of S
      $S_s = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_s)$, where $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r \geq 0$ are the singular values in S
 The resultant Ms matrix is the matrix of rank s which is closest
 to the original matrix M in the least square sense.
      $M_s = K_s S_s D_s^t$

                              (s<<t, s<<N)
                Ranking in LSI
• query: a pseudo-document in the original M
     – query is modeled as the document with number 0
     – MstMs: its first row gives the ranks of all documents w.r.t. this
       query

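A rough sketch of ranking in the reduced space, using NumPy's `numpy.linalg.svd`. It assumes the query is added to M as an extra column in position 0, which is one common reading of the pseudo-document idea; the function name `lsi_scores` and the toy matrix are illustrative only.

```python
import numpy as np

def lsi_scores(M, q, s):
    """Score documents against q using a rank-s approximation.

    M: t x N term-document matrix, q: length-t query vector,
    s: number of singular values to keep.
    """
    M0 = np.column_stack([q, M])                    # query becomes column 0
    K, sigma, Dt = np.linalg.svd(M0, full_matrices=False)
    Ms = K[:, :s] @ np.diag(sigma[:s]) @ Dt[:s, :]  # Ms = Ks Ss Ds^t
    gram = Ms.T @ Ms                                # document-to-document matrix
    return gram[0, 1:], Ms                          # row 0: query vs. documents

M = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 1.0, 0.0, 0.0])

reduced_scores, Ms = lsi_scores(M, q, 2)
full_scores, _ = lsi_scores(M, q, 4)   # no reduction: plain inner products
ranking = np.argsort(-reduced_scores)  # best-matching document first
```

Keeping all singular values reproduces the ordinary inner products between q and the document columns, so any change in the ranking comes from the reduction to s concepts.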
 Structured Text Retrieval Models
• Definition
   – Combine information on text content with information on the
     document structure
   – e.g., same-page(near('atomic holocaust', Figure(label('earth'))))
• Expressive power vs. evaluation efficiency
   – a model based on non-overlapping lists
   – a model based on proximal nodes
• Terminology
   – match point: position in the text of a sequence of words that
       matches the user query
   – region: a contiguous portion of the text
   – node: a structural component of the document (chap, sec, …)
                      Non-Overlapping Lists
      • divide the whole text of each document into non-
        overlapping text regions (lists)
      • example (indexing lists; numbers are text positions)
        L0  Chapters:        Chapter 1 [1, 5000]
            a list of all chapters in the document
        L1  Sections:        1.1 [1, 3000]   1.2 [3001, 5000]
            a list of all sections in the document
        L2  Subsections:     1.1.1 [1, 1000]   [1001, 3000]   1.2.1 [3001, 5000]
            a list of all subsections in the document
        L3  Subsubsections:  [1, 500]   [501, 1000]   [1001, …]
            a list of all subsubsections in the document
      • Text regions from distinct lists might overlap
                Non-Overlapping Lists

• Data structure
    – a single inverted file
      (recall that there is another inverted file for the words in the text)
    – each structural component stands as an entry
    – for each entry, there is a list of text regions as a list
      of occurrences
• Operations
    – Select a region which contains a given word
    – Select a region A which does not contain any other
      region B (where B belongs to a list distinct from the list
      for A)
    – Select a region not contained within any other region
    – …
                           Inverted Files
• File is represented as an array of indexed records.

                           Term 1 Term 2 Term 3 Term 4

                Record 1     1      1      0      1

                Record 2     0      1      1      1

                Record 3     1      0      1      1

                Record 4     0      0      1      1

                Inverted-file process
• The record-term array is inverted (transposed).

                    Record 1 Record 2 Record 3 Record 4

           Term 1      1        0        1        0

           Term 2      1        1        0        0

           Term 3      0        1        1        1

           Term 4      1        1        1        1

      Inverted-file process (Continued)
• Take two or more rows of an inverted term-record
  array, and produce a single combined list of record
  identifiers
       Query: (term2 and term3)
       term2:   1      1       0       0
       term3:   0      1       1       1
       AND:     0      1       0       0     <-- R2
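The inversion and the combination of rows can be sketched with the record-term array from the slides. The names `invert` and `and_query` are hypothetical helpers.

```python
TERMS = ["term1", "term2", "term3", "term4"]
RECORDS = {
    "R1": [1, 1, 0, 1],
    "R2": [0, 1, 1, 1],
    "R3": [1, 0, 1, 1],
    "R4": [0, 0, 1, 1],
}

def invert(records, terms):
    """Transpose the record-term array into one row of bits per term."""
    names = list(records)
    inverted = {term: [records[name][col] for name in names]
                for col, term in enumerate(terms)}
    return names, inverted

def and_query(names, inverted, *terms):
    """Records whose bit is 1 in every requested term row."""
    rows = [inverted[term] for term in terms]
    return [name for name, bits in zip(names, zip(*rows)) if all(bits)]

NAMES, INVERTED = invert(RECORDS, TERMS)
hits = and_query(NAMES, INVERTED, "term2", "term3")
```

Only the rows for the query terms are read, which is the point of inverting the array: the cost depends on the query terms, not on scanning every record.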
      Extensions of Inverted Index Operations
              (Distance Constraints)

• Distance Constraints
     – (A within sentence B)
       terms A and B must co-occur in a common
       sentence
     – (A adjacent B)
       terms A and B must occur adjacently in the text

      Extensions of Inverted Index Operations
              (Distance Constraints)
• Implementation
     – include term-location in the inverted indexes
       information: {R345, R348, R350, …}
       retrieval:    {R123, R128, R345, …}
     – include sentence-location in the indexes
        {R345, 25; R345, 37; R348, 10; R350, 8; …}
        {R123, 5; R128, 25; R345, 37; R345, 40; …}

      Extensions of Inverted Index Operations
              (Distance Constraints)
     – include paragraph numbers in the indexes
       sentence numbers within paragraphs
       word numbers within sentences
       information: {R345, 2, 3, 5; …}
       retrieval: {R345, 2, 3, 6; …}
     – query examples
       (information adjacent retrieval)
       (information within five words retrieval)
     – cost: the size of indexes
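The two query examples can be sketched in Python over postings that carry paragraph, sentence, and word numbers. This is a sketch under stated assumptions: "adjacent" is taken to mean that B immediately follows A, and "within n words" is restricted to a single sentence. The function names are hypothetical, and the postings contain only the entries shown on the slide.

```python
# Postings as (record, paragraph, sentence, word) tuples from the slide.
information = [("R345", 2, 3, 5)]
retrieval = [("R345", 2, 3, 6)]


def within_words(postings_a, postings_b, max_gap):
    """Records where A and B share a sentence, at most max_gap words apart."""
    hits = set()
    for rec_a, par_a, sen_a, word_a in postings_a:
        for rec_b, par_b, sen_b, word_b in postings_b:
            same_sentence = (rec_a, par_a, sen_a) == (rec_b, par_b, sen_b)
            if same_sentence and 0 < abs(word_b - word_a) <= max_gap:
                hits.add(rec_a)
    return sorted(hits)


def adjacent(postings_a, postings_b):
    """Records where B is the word immediately after A (assumed meaning)."""
    hits = set()
    for rec_a, par_a, sen_a, word_a in postings_a:
        for rec_b, par_b, sen_b, word_b in postings_b:
            same_sentence = (rec_a, par_a, sen_a) == (rec_b, par_b, sen_b)
            if same_sentence and word_b - word_a == 1:
                hits.add(rec_a)
    return sorted(hits)


print(adjacent(information, retrieval))         # ['R345']
print(within_words(information, retrieval, 5))  # ['R345']
```

The nested loops keep the sketch short; a real index would merge sorted postings instead, and storing a location tuple per occurrence is the index-size cost noted above.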
       Model Based on Proximal Nodes
• hierarchical vs. flat indexing structures
     – hierarchical index (e.g., paragraphs, pages, lines)
       nodes: position in the text
     – flat index
       an inverted list for holocaust
       holocaust    10    256    …    48,324
       entries: positions in the text

 Model Based on Proximal Nodes

• query language
     – Specification of regular expressions
     – Reference to structural components by name
     – Combination
     – Example
           • Search for sections, subsections, or subsubsections
             which contain the word 'holocaust'
           • [(*section) with ('holocaust')]

 Model Based on Proximal Nodes

• Basic algorithm
     – Traverse the inverted list for the term 'holocaust'
     – For each entry in the list (i.e., an occurrence), search
       the hierarchical index looking for sections, subsections,
       and sub-subsections
• Revised algorithm
     – For the first entry, search as before
     – Let the last matching structural component be the
       innermost matching component
     – Verify that the innermost matching component also matches
       the second entry (nearby nodes).
       If it does, the larger structural components above it also do.

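The revised algorithm can be sketched in Python as follows. This is an illustration under assumptions, not the model's reference implementation: the `Node` class, the functions `path_to` and `match_sections`, and the sample tree are all hypothetical, and sibling nodes are assumed not to overlap. Positions 10 and 256 come from the slide's inverted list; the others are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str    # e.g. "chapter", "section", "subsection"
    start: int   # first text position covered by the node
    end: int     # last text position covered by the node (inclusive)
    children: list = field(default_factory=list)

    def contains(self, pos):
        return self.start <= pos <= self.end


def path_to(node, pos):
    """Return the chain of nodes containing pos, outermost first."""
    path = []
    while node is not None and node.contains(pos):
        path.append(node)
        node = next((c for c in node.children if c.contains(pos)), None)
    return path


def match_sections(root, positions):
    """Collect every *section node that contains an occurrence."""
    found, path = [], []
    for pos in positions:
        if path and path[-1].contains(pos):
            # Innermost component still matches, so its ancestors do too;
            # only look below it instead of searching from the root.
            path = path[:-1] + path_to(path[-1], pos)
        else:
            path = path_to(root, pos)
        for node in path:
            key = (node.kind, node.start, node.end)
            if node.kind.endswith("section") and key not in found:
                found.append(key)
    return found


doc = Node("chapter", 0, 999, [
    Node("section", 0, 499, [
        Node("subsection", 0, 99),
        Node("subsection", 100, 499),
    ]),
    Node("section", 500, 999),
])
matches = match_sections(doc, [10, 20, 256, 700])
print(matches)
```

For the entry at position 20 the previous innermost node still matches, so the search from the root is skipped. That saving is the point of the revised algorithm when consecutive entries fall in nearby nodes.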
                Models for Browsing
• Browsing vs. searching
     – The goal of a searching task is clearer in the
       mind of the user than the goal of a browsing task
• Models
     – Flat browsing
     – Structure guided browsing
     – The hypertext model
                Models for Browsing
• Flat organization
     – Documents are represented as dots in a 2-D plane
     – Documents are represented as elements in a 1-D list, e.g.,
       the results of a search engine
• Structure guided browsing
     – Documents are organized in a directory, which groups
       documents covering related topics
• Hypertext model
     – Navigating the hypertext: a traversal of a directed graph
     Trends and Research Issues
• Library systems
   – Cognitive and behavioral issues, aimed particularly at
     a better understanding of the criteria users adopt
     to judge relevance
• Specialized retrieval systems
   – e.g., legal and business documents
   – how to retrieve all relevant documents without
     retrieving a large number of unrelated documents
• The Web
     – Users do not know what they want or have great
       difficulty formulating their requests
     – How the paradigm adopted for the user interface affects
       the ranking
     – The indexes maintained by various Web search engines
       are almost disjoint
