                   INFO 4300 / CS4300
                  Information Retrieval

    slides adapted from Hinrich Schütze’s,
linked from http://informationretrieval.org/
            IR 1: Boolean Retrieval

                   Paul Ginsparg

             Cornell University, Ithaca, NY


                   26 Aug 2010


                                               1 / 43
Plan for today




      Course overview
      Administrativa
      Boolean retrieval




                          2 / 43
Overview




  “After change, things are different . . .”




                                              3 / 43
“Plan”



         Search full text: basic concepts
         Web search
         Probabilistic Retrieval
         Interfaces
         Metadata / Semantics

  IR ⇔ NLP ⇔ ML

  Prereqs: Introductory courses in data structures and algorithms, in
  linear algebra and in probability theory




                                                                        4 / 43
Administrativa

   Course Webpage:
   http://www.infosci.cornell.edu/Courses/info4300/2010fa/

       Lectures: Tuesday and Thursday 11:40-12:55, Olin Hall 165
       Instructor: Paul Ginsparg, ginsparg@..., 255-7371,
       Cornell Information Science, 301 College Avenue
       Instructor’s Assistant: Corinne Russell, crussell@cs...,
       255-5925, Cornell Information Science, 301 College Avenue
       Instructor’s Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail
       instructor to schedule an appointment
       Teaching Assistant: Niranjan Sivakumar, ns253@...
       The Teaching Assistant does not have scheduled office hours
       but is available to help you by email.
   Course text at: http://informationretrieval.org/

                                                                     5 / 43
Tentative Assignment and Exam Schedules

   During this course there will be four assignments which require
   programming
       Assignment 1 due Sun 19 Sep
       Assignment 2 due Sat 9 Oct
       Assignment 3 due Sun 7 Nov
       Assignment 4 due Fri 3 Dec
   and two examinations:
       Midterm on Thu 14 Oct
       Final exam on Fri 17 Dec
   The course grade will be based on course assignments,
   participation in discussions, and examinations, with rough
   weightings:
   Assignments 40%, Participation 20%, Examinations 40%.

                                                                     6 / 43
Outline



   1   Introduction


   2   Inverted index


   3   Processing Boolean queries


   4   Discussion Section (next week)




                                        7 / 43
Definition of information retrieval


   Information retrieval (IR) is finding material (usually documents) of
   an unstructured nature (usually text) that satisfies an information
   need from within large collections (usually stored on computers).


   Formerly the domain of reference librarians, paralegals, and other search
   professionals.

   Now hundreds of millions of people (billions?) engage in information
   retrieval every day when they use a web search engine or search their email.


   Three scales: web search; enterprise, institutional, and domain-specific
   search; personal search



                                                                          8 / 43
Clustering and Classification


   IR also covers supporting users in browsing or filtering document
   collections or further processing a set of retrieved documents.


   Clustering: find a good grouping of the documents based on their contents
   (cf. arranging books on a bookshelf according to their topic).

   Classification: given a set of topics, standing information needs, or
   other categories (such as suitability of texts for different age groups),
   decide which class(es), if any, each of a set of documents belongs to.




                                                                          9 / 43
Structured vs Unstructured


   “unstructured data”: no clear, semantically overt
   (easy-for-a-computer) structure.

   structured data: e.g., a relational database (product inventories and
   personnel records)


   But no data is truly “unstructured”:
   text data has latent linguistic structure, and headings, paragraphs, and
   footnotes carry explicit markup.
   IR also facilitates “semistructured” search: e.g., find documents whose
   title contains Java and whose body contains threading



                                                                         10 / 43
   [Bar chart: relative significance of unstructured vs. structured data,
   compared by data volume and by market capitalization]
Boolean retrieval



       The Boolean model is among the simplest models on which to
       base an information retrieval system.
       Queries are Boolean expressions, e.g., Caesar and Brutus
       The search engine returns all documents that satisfy the
       Boolean expression.




                               Does Google use the Boolean model?




                                                                    14 / 43
Outline



   1   Introduction


   2   Inverted index


   3   Processing Boolean queries


   4   Discussion Section (next week)




                                        15 / 43
Unstructured data in 1650: Shakespeare



      Which plays of Shakespeare contain the words Brutus and
      Caesar, but not Calpurnia?
      One could grep all of Shakespeare’s plays for Brutus and
      Caesar, then strip out lines containing Calpurnia.
      Why is grep not the solution?
          Slow (for large collections)
          grep is line-oriented, IR is document-oriented
          “not Calpurnia” is non-trivial
          Other operations (e.g., find the word Romans near
          countryman) not feasible
          Ranked retrieval (best documents to return) — later in course




                                                                          16 / 43
Term-document incidence matrix
                 Anthony     Julius     The     Hamlet   Othello   Macbeth     ...
                    and      Caesar   Tempest
                 Cleopatra
   Anthony           1         1         0          0       0         1
   Brutus            1         1         0          1       0         0
   Caesar            1         1         0          1       1         1
   Calpurnia         0         1         0          0       0         0
   Cleopatra         1         0         0          0       0         0
   mercy             1         0         1          1       1         1
   worser            1         0         1          1       1         0
   ...

   Entry is 1 if the term occurs. Example: Calpurnia occurs in Julius Caesar.
   Entry is 0 if the term doesn’t occur. Example: Calpurnia doesn’t occur in
   The Tempest.

   (Shakespeare used about 32,000 different words)




                                                                                17 / 43
Binary–valued vector for Brutus
               Anthony     Julius     The     Hamlet   Othello   Macbeth   ...
                  and      Caesar   Tempest
               Cleopatra
   Anthony         1         1         0        0        0          1
   Brutus          1         1         0        1        0          0
   Caesar          1         1         0        1        1          1
   Calpurnia       0         1         0        0        0          0
   Cleopatra       1         0         0        0        0          0
   mercy           1         0         1        1        1          1
   worser          1         0         1        1        1          0
   ...




                                                                            18 / 43
Incidence vectors




       So we have a binary–valued vector for each term.
       To answer the query Brutus and Caesar and not
       Calpurnia:
           Take the vectors for Brutus, Caesar, and Calpurnia
           Complement the vector of Calpurnia
           Do a (bitwise) and on the three vectors
           110100 and 110111 and 101111 = 100100
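   A minimal sketch of this computation in Python (the plays and the 0/1 rows
   are taken from the incidence matrix above; encoding each row as a 6-bit
   integer is just one convenient representation):

      # Incidence vectors over the six plays, most significant bit =
      # Anthony and Cleopatra, least significant bit = Macbeth.
      brutus    = 0b110100
      caesar    = 0b110111
      calpurnia = 0b010000

      plays = ["Anthony and Cleopatra", "Julius Caesar", "The Tempest",
               "Hamlet", "Othello", "Macbeth"]

      # Brutus AND Caesar AND NOT Calpurnia as a bitwise expression;
      # mask to 6 bits so the complement stays within the collection.
      result = brutus & caesar & (~calpurnia & 0b111111)

      print(format(result, "06b"))    # 100100
      print([p for i, p in enumerate(plays) if result & (1 << (5 - i))])
      # ['Anthony and Cleopatra', 'Hamlet']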




                                                                19 / 43
Answers to query



   Anthony and Cleopatra, Act III, Scene ii
   Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
                            When Antony found Julius Caesar dead,
                            He cried almost to roaring; and he wept
                            When at Philippi he found Brutus slain.

   Hamlet, Act III, Scene ii
   Lord Polonius:            I did enact Julius Caesar: I was killed i’ the
                             Capitol; Brutus killed me.




                                                                         20 / 43
Ad hoc retrieval

   Provide documents from within the collection that are relevant to
   an arbitrary user information need, communicated to the system by
   means of a one-off, user-initiated query.
   Information need (the topic about which the user desires to know more) vs.
   query (what the user conveys to the computer in an attempt to communicate
   that need)


   To assess the effectiveness of an IR system, we use two key statistics:


   Precision: What fraction of the returned results are relevant to
   the information need?


   Recall: What fraction of the relevant documents in the collection
   were returned by the system?
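   In symbols (the standard definitions, restating the prose above in LaTeX):

      \mathrm{Precision} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{retrieved}\}|}
      \qquad
      \mathrm{Recall} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{relevant}\}|}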

                                                                       21 / 43
Bigger collections




        Consider N = 10^6 documents, each with about 1000 tokens
        On average 6 bytes per token, including spaces and
        punctuation ⇒ size of document collection is about 6 GB
       Assume there are M = 500,000 distinct terms in the collection
       (Notice that we are making a term/token distinction.)




                                                                       22 / 43
Can’t build the incidence matrix




        M × N = 500,000 × 10^6 = half a trillion 0s and 1s.
        But the matrix has no more than one billion 1s.
            Matrix is extremely sparse. (10^9 / (5 · 10^11) = 0.2%)
       What is a better representation?
           We only record the 1s.
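   A quick back-of-the-envelope check of these numbers (a sketch in Python;
   the document, token, and term counts are the assumptions stated on the
   previous two slides):

      N = 10**6              # documents
      tokens_per_doc = 1000
      bytes_per_token = 6
      M = 500_000            # distinct terms

      print(N * tokens_per_doc * bytes_per_token / 1e9)   # 6.0, i.e. about 6 GB

      cells = M * N                     # 0/1 entries in the incidence matrix
      ones  = N * tokens_per_doc        # at most one 1 per token, so ~10^9
      print(cells)                      # 500000000000 (half a trillion)
      print(ones / cells)               # 0.002, i.e. about 0.2% of entries are 1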




                                                               23 / 43
Inverted Index


   For each term t, we store a list of all documents that contain t.
      Brutus       −→ 1        2     4     11 31 45 173 174

      Caesar       −→     1    2    4     5     6   16   57    132     ...

    Calpurnia      −→     2   31   54   101

          .
          .
          .

     dictionary                               postings




                                                                         24 / 43
Inverted index construction


     1   Collect the documents to be indexed:
         Friends, Romans, countrymen. So let it be with Caesar . . .
     2   Tokenize the text, turning each document into a list of tokens:
          Friends Romans countrymen So . . .
     3   Do linguistic preprocessing, producing a list of normalized
         tokens, which are the indexing terms: friend roman
         countryman so . . .
     4   Index the documents that each term occurs in by creating an
         inverted index, consisting of a dictionary and postings.
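   A compact sketch of these four steps in Python, using the two documents
   that appear on the next slides (the regular-expression tokenizer and plain
   lowercasing are crude stand-ins for real linguistic preprocessing):

      import re
      from collections import defaultdict

      docs = {
          1: "I did enact Julius Caesar: I was killed i' the Capitol; "
             "Brutus killed me.",
          2: "So let it be with Caesar. The noble Brutus hath told you "
             "Caesar was ambitious:",
      }

      def tokenize(text):
          # Steps 2 and 3 (crudely): split on non-letters, lowercase.
          return [t.lower() for t in re.findall(r"[A-Za-z']+", text)]

      # Step 4: inverted index as term -> sorted list of docIDs.
      postings = defaultdict(set)
      for doc_id, text in docs.items():
          for term in tokenize(text):
              postings[term].add(doc_id)

      for term in sorted(postings):
          ids = sorted(postings[term])
          print(term, len(ids), ids)    # term, document frequency, postings list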




                                                                           25 / 43
Tokenization and preprocessing
    Doc 1. I did enact Julius Caesar: I was killed i’ the Capitol; Brutus
    killed me.
    Doc 2. So let it be with Caesar. The noble Brutus hath told you Caesar
    was ambitious:

    =⇒

    Doc 1. i did enact julius caesar i was killed i’ the capitol brutus
    killed me
    Doc 2. so let it be with caesar the noble brutus hath told you caesar
    was ambitious




                                                                                             26 / 43
Generate postings
   Doc 1. i did enact julius caesar i was killed i’ the capitol brutus killed me
   Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious

   =⇒

   term      docID
   i         1
   did       1
   enact     1
   julius    1
   caesar    1
   i         1
   was       1
   killed    1
   i’        1
   the       1
   capitol   1
   brutus    1
   killed    1
   me        1
   so        2
   let       2
   it        2
   be        2
   with      2
   caesar    2
   the       2
   noble     2
   brutus    2
   hath      2
   told      2
   you       2
   caesar    2
   was       2
   ambitious 2




                                                               27 / 43
Sort postings
   term docID         term docID
   i         1        ambitious 2
   did       1        be        2
   enact     1        brutus    1
   julius    1        brutus    2
   caesar    1        capitol   1
   i         1        caesar    1
   was       1        caesar    2
   killed    1        caesar    2
   i’        1        did       1
   the       1        enact     1
   capitol   1        hath      2
   brutus    1        i         1
   killed    1        i         1
   me        1        i’        1
   so        2
                 =⇒   it        2
   let       2        julius    1
   it        2        killed    1
   be        2        killed    1
   with      2        let       2
   caesar    2        me        1
   the       2        noble     2
   noble     2        so        2
   brutus    2        the       1
   hath      2        the       2
   told      2        told      2
   you       2        you       2
   caesar    2        was       1
   was       2        was       2
   ambitious 2        with      2




                                    28 / 43
Create postings lists, determine document frequency
   term docID
   ambitious 2
   be        2        term doc. freq.   →   postings lists
   brutus    1
                       ambitious 1      →    2
   brutus    2
                       be 1             →    2
   capitol   1
   caesar    1         brutus 2         →    1 → 2
   caesar    2         capitol 1        →    1
   caesar    2         caesar 2         →    1 → 2
   did       1         did 1            →    1
   enact     1         enact 1          →    1
   hath      1         hath 1           →    2
   i         1         i 1              →    1
   i         1         i’ 1             →    1
   i’        1
                 =⇒    it 1             →    2
   it        2
                       julius 1         →    1
   julius    1
   killed    1         killed 1         →    1
   killed    1         let 1            →    2
   let       2         me 1             →    1
   me        1         noble 1          →    2
   noble     2         so 1             →    2
   so        2         the 2            →    1 → 2
   the       1         told 1           →    2
   the       2
                       you 1            →    2
   told      2
                       was 2            →    1 → 2
   you       2
   was       1         with 1           →    2
   was       2
   with      2




                                                             29 / 43
Split the result into dictionary and postings file



      Brutus      −→   1   2     4   11    31   45    173   174

      Caesar      −→   1   2     4     5   6    16     57   132   ...

    Calpurnia     −→   2   31   54   101

         .
         .
         .

     dictionary                        postings file
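   One way to realize this split on disk: keep the (small) dictionary in
   memory, mapping each term to its document frequency and a byte offset into
   a separate postings file. A rough sketch under that assumption, with JSON
   as a stand-in encoding for the postings:

      import json

      index = {"brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
               "caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
               "calpurnia": [2, 31, 54, 101]}

      dictionary = {}   # term -> (document frequency, offset in postings file)
      with open("postings.bin", "wb") as f:
          for term, plist in index.items():
              dictionary[term] = (len(plist), f.tell())
              f.write((json.dumps(plist) + "\n").encode())

      def fetch(term):
          # Query time: one in-memory lookup, then one seek + read on disk.
          df, offset = dictionary[term]
          with open("postings.bin", "rb") as f:
              f.seek(offset)
              return json.loads(f.readline())

      print(fetch("calpurnia"))         # [2, 31, 54, 101]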




                                                                    30 / 43
Later in this course




       Index construction: how can we create inverted indexes for
       large collections?
       How much space do we need for dictionary and index?
       Index compression: how can we efficiently store and process
       indexes for large collections?
       Ranked retrieval: what does the inverted index look like when
       we want the “best” answer?




                                                                       31 / 43
Outline



   1   Introduction


   2   Inverted index


   3   Processing Boolean queries


   4   Discussion Section (next week)




                                        32 / 43
Simple conjunctive query (two terms)




      Consider the query: Brutus AND Calpurnia
      To find all matching documents using inverted index:
        1   Locate Brutus in the dictionary
        2   Retrieve its postings list from the postings file
        3   Locate Calpurnia in the dictionary
        4   Retrieve its postings list from the postings file
        5   Intersect the two postings lists
        6   Return intersection to user
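   As a sketch, with the index held as a Python dict from term to sorted
   postings list (set intersection stands in here for the linear merge shown
   on the next slides):

      index = {"brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
               "calpurnia": [2, 31, 54, 101]}

      def boolean_and(term1, term2, index):
          # Steps 1-4: look up each term and retrieve its postings list.
          p1 = index.get(term1, [])
          p2 = index.get(term2, [])
          # Step 5: intersect (quick stand-in for the merge-based version).
          return sorted(set(p1) & set(p2))

      print(boolean_and("brutus", "calpurnia", index))   # [2, 31]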




                                                               33 / 43
Intersecting two postings lists




    Brutus         −→      1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
    Calpurnia      −→      2 → 31 → 54 → 101

    Intersection   =⇒      2 → 31
        This is linear in the combined length of the postings lists.
       Note: This only works if postings lists are sorted.




                                                              34 / 43
Intersecting two postings lists


   Intersect(p1, p2)
     1 answer ← ⟨ ⟩
     2 while p1 ≠ nil and p2 ≠ nil
     3 do if docID(p1) = docID(p2)
     4       then Add(answer, docID(p1))
     5             p1 ← next(p1)
     6             p2 ← next(p2)
     7       else if docID(p1) < docID(p2)
     8                 then p1 ← next(p1)
     9                 else p2 ← next(p2)
    10 return answer
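   A direct Python transcription of this merge (a sketch; the postings are
   assumed to be sorted lists of docIDs, as the note on the previous slide
   requires):

      def intersect(p1, p2):
          """Linear-time merge of two sorted postings lists."""
          answer = []
          i, j = 0, 0
          while i < len(p1) and j < len(p2):
              if p1[i] == p2[j]:
                  answer.append(p1[i])
                  i += 1
                  j += 1
              elif p1[i] < p2[j]:
                  i += 1
              else:
                  j += 1
          return answer

      print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))
      # [2, 31]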




                                               35 / 43
Query processing: Exercise

    france    −→     1 → 2 → 3 → 4 → 5 → 7 → 8 → 9 → 11 → 12 → 13 → 14 → 15
    paris     −→     2 → 6 → 10 → 12 → 14
    lear      −→     12 → 15
   Compute hit list for ((paris AND NOT france) OR lear)
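   The same merging idea extends to OR (union) and AND NOT (difference). A
   sketch of the two helpers, which can be used to check a hand-computed hit
   list for the query above:

      def union(p1, p2):
          # Merge two sorted postings lists, keeping docIDs in either list.
          answer, i, j = [], 0, 0
          while i < len(p1) and j < len(p2):
              if p1[i] == p2[j]:
                  answer.append(p1[i])
                  i += 1
                  j += 1
              elif p1[i] < p2[j]:
                  answer.append(p1[i])
                  i += 1
              else:
                  answer.append(p2[j])
                  j += 1
          return answer + p1[i:] + p2[j:]

      def and_not(p1, p2):
          # Keep docIDs of p1 that do not appear in p2 (both lists sorted).
          answer, i, j = [], 0, 0
          while i < len(p1):
              if j == len(p2) or p1[i] < p2[j]:
                  answer.append(p1[i])
                  i += 1
              elif p1[i] == p2[j]:
                  i += 1
                  j += 1
              else:
                  j += 1
          return answer

      france = [1, 2, 3, 4, 5, 7, 8, 9, 11, 12, 13, 14, 15]
      paris  = [2, 6, 10, 12, 14]
      lear   = [12, 15]
      print(union(and_not(paris, france), lear))   # ((paris AND NOT france) OR lear)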




                                                                        36 / 43
Boolean queries


      The Boolean retrieval model can answer any query that is a
      Boolean expression.
          Boolean queries are queries that use and, or and not to join
          query terms.
          Views each document as a set of terms.
          Is precise: Document matches condition or not.
      Primary commercial retrieval tool for 3 decades
      Many professional searchers (e.g., lawyers) still like Boolean
      queries.
          You know exactly what you are getting.
       Many search systems you use are also Boolean: Spotlight,
       email, intranet, etc.



                                                                         37 / 43
Commercially successful Boolean retrieval: Westlaw



      Largest commercial legal search service in terms of the
      number of paying subscribers
      Over half a million subscribers performing millions of searches
      a day over tens of terabytes of text data
      The service was started in 1975.
      In 2005, Boolean search (called “Terms and Connectors” by
      Westlaw) was still the default, and used by a large percentage
      of users . . .
      . . . although ranked retrieval has been available since 1992.




                                                                        38 / 43
Westlaw: Example queries


   Information need: Information on the legal theories involved in
   preventing the disclosure of trade secrets by employees formerly
   employed by a competing company
   Query: “trade secret” /s disclos! /s prevent /s employe!

   Information need: Requirements for disabled people to be able to
   access a workplace
   Query: disab! /p access! /s work-site work-place (employment /3
   place)

   Information need: Cases about a host’s responsibility for drunk
   guests
   Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest


                                                                        39 / 43
Westlaw: Comments

     /s = within sentence
     /p = within paragraph
      /k = within k words (proximity operators augment pure Boolean
      queries)
     Space is disjunction (OR), not conjunction (AND)! (This was
     the default in search pre-Google.)
     Long, precise queries: proximity operators, incrementally
     developed, not like web search
     Why professional searchers often like Boolean search:
     precision, transparency, control
     (But: high precision, low recall?)
     When are Boolean queries the best way of searching? Depends
     on: information need, searcher, document collection, . . .

                                                                   40 / 43
Outline



   1   Introduction


   2   Inverted index


   3   Processing Boolean queries


   4   Discussion Section (next week)




                                        41 / 43
Discussion 1



   In preparation, explore three information retrieval systems and
   compare them:
       Bing — a Web search engine (http://bing.com/).
       The Library of Congress catalog — a very large bibliographic
       catalog (http://catalog.loc.gov/).
       PubMed — an indexing and abstracting service for medicine
       and related fields
       (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi).




                                                                      42 / 43
Use each service separately for the following information discovery
task:
    What is the medical evidence that cell phone usage can cause
    cancer?
Evaluate each search service. What do you consider the strengths
and weaknesses of each service? When would you use them?
(a) Does the service search full text or surrogates? What is the
underlying corpus? What effect does this have on your results?
(b) Is fielded searching offered? What Boolean operators are
supported? What regular expressions? How does it handle
non-Roman character sets? What is the stop list? How are results
ranked? Are they sorted, if so in what order?
(c) From a usability viewpoint: What style of user interface(s) is
provided? What training or help services are available? If there are basic
and advanced user interfaces, what does each offer?


                                                                      43 / 43

				