Indexing

Accessing Data During Query Evaluation

   Scan the entire collection
     Typical in early batch retrieval systems
     Still used today, in hardware form (e.g., Fast
      Data Finder)
     Computational and I/O costs are
      O(characters in collection)
     Practical only for small collections




Accessing Data During Query Evaluation

     Use indexes for direct access
         Evaluation time is O(query term occurrences
          in collection)
       Practical for large collections
       Many opportunities for optimization
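
The contrast between scanning and indexed access can be sketched as follows. This is a minimal illustration, not a production design: the three-document collection and the function names `scan_search` and `index_search` are made up for the example.

```python
from collections import defaultdict

# Hypothetical three-document collection, used only for illustration.
docs = {
    1: "fast data finder scans text",
    2: "inverted index gives direct access",
    3: "index terms point to documents",
}

def scan_search(term):
    """Full scan: touches every document, so cost grows with collection size."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.split())

# Build the index once; afterwards a lookup touches only the entries for the term.
index = defaultdict(list)
for doc_id, text in docs.items():
    for word in set(text.split()):
        index[word].append(doc_id)

def index_search(term):
    """Index lookup: cost grows with the number of entries for the query term."""
    return sorted(index.get(term, []))
```

Both functions return the same answer; the difference is how much of the collection is touched per query.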




What should the Index contain?

    Database systems index primary and secondary
     keys
        Index provides fast access to a subset of database
         records
        Scan subset to find solution set
    IR Problem: Cannot predict keys that people will
     use in queries
        Every word in a document is a potential search term
        Solution: Index by all keys (words)


Some vocabulary about Indexing
    File organizations or indexes are used to
     increase performance of system
    Text indexing is the process of deciding what
     will be used to represent a given document
  Index terms are used to build indexes for the
   documents
  The retrieval model describes how the index
   terms are incorporated into a model
        Relationship between retrieval model and indexing
         model

Accessing the Index

    Index accessed through features or keys or
     terms
        Keys/terms can be atomic or complex
    Most common atomic keys/terms:
        Words in text, punctuation
        Manually assigned terms (controlled and
         uncontrolled vocabulary)
        Document structure: sentence and paragraph
         boundaries
        Inter- or intra-document links (e.g., citations)

Accessing the Index

    Composed features
      Sequences: phrases, names, dates,
       monetary amounts
      Sets : synonym classes




Manual vs. Automatic Indexing

    Manual or human indexing:
        Indexers decide which keywords to assign to a
         document based on a controlled vocabulary
             e.g. MEDLINE, Yahoo
        Significant cost
    Automatic indexing:
        Indexing program decides which words, phrases or
         other features to use from the text of the document
        Indexing speeds range widely



Manual vs. Automatic Indexing

                          Manual               Automatic

   Controlled vocabulary  Current indexing     Text categorization,
                          practice             “Intelligent” IR

   Free text              Current indexing     Text search engines,
                          practice             “Statistical” IR




Manual vs. Automatic Indexing

  Experimental evidence shows that retrieval
   with automatic indexing can be at least as
   effective as retrieval with manual indexing
   and controlled vocabularies
  Experiments have also shown that using both
   manual and automatic indexing improves
   performance




Some vocabulary words

    Index language
        Language used to describe documents and queries
    Exhaustivity
        Number of different topics indexed, completeness
    Specificity
        Level of accuracy of indexing
    Pre-coordinate indexing
        Combinations of index terms used as an indexing label
        E.g., author lists key phrases of paper
    Post-coordinate indexing
        Combinations generated at search time
        Most common and the focus of this course

Indexing Choices

    What is a word?
        Embedded punctuation (e.g. MD-11, hard-core)
        Case folding (e.g., New vs new, Apple vs apple)
        Stopwords (e.g., the, an, a, on)
        Morphology (e.g., computer, compute, computing)
    Index granularity has a large impact on speed
     and effectiveness
        Index term?
        Index surface forms?
        Both?

Basic automatic Indexing

    Parse documents to recognize structure
        E.g., title, date, other fields
    Scan for word tokens
        Numbers, special characters, hyphenation,
         capitalization, etc.
    Stopword removal
    Stem words
    Weight words
        Want more important words to have higher weight
    Optional
        Phrase indexing
        Thesaurus classes
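
The core steps above (tokenize, remove stopwords, stem, weight) can be sketched roughly as follows. Everything here is an assumption made for illustration: the stopword list is tiny, `crude_stem` is a toy suffix stripper rather than a real stemming algorithm, and the weight is just the raw term frequency.

```python
import re
from collections import Counter

STOPWORDS = {"the", "an", "a", "on", "of", "and", "is"}  # tiny illustrative list

def crude_stem(word):
    """Toy suffix stripper; a stand-in for a real stemmer, not the Porter algorithm."""
    for suffix in ("ing", "er", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Tokenize, case-fold, drop stopwords, stem, and weight by raw frequency."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(stems)
```

With this toy stemmer, "computer", "compute" and "computing" all map to the same index term, which is the effect the morphology step is after.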
Words vs. Terms vs. Concepts

    Simple indexing is based on words or word
     stems
        More complex indexing could include phrases or
         thesaurus classes
  Concept-based retrieval is often used to imply
   something beyond word indexing
  Words, phrases, synonyms, and linguistic features
   can all be evidence used to infer the presence
   of the concept
        E.g., the concept “information retrieval” can be
         inferred based on the presence of the words
         “information”, “retrieval”, the phrase “information
         retrieval” and maybe the phrase “text retrieval”
Phrases

  Both statistical and syntactic methods have
   been used to identify good phrases
  Proven techniques include finding all word
   pairs that occur more than n times in the
   corpus or using a part of speech tagger to
   identify simple noun phrases
        1,100,000 phrases extracted from all TREC data
    Phrases can have an impact on both
     effectiveness and efficiency
        Phrase indexing will speed up phrase queries
        Finding documents containing “White House” is better
         than finding documents containing both words
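
The statistical approach mentioned above (word pairs that occur more than n times) might look like this in its simplest form. The corpus is a made-up example, and the function counts only adjacent pairs with no stopword filtering or part-of-speech tagging.

```python
from collections import Counter

def frequent_pairs(documents, n):
    """Return adjacent word pairs that occur more than n times in the corpus."""
    counts = Counter()
    for text in documents:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))
    return {pair: c for pair, c in counts.items() if c > n}

# Hypothetical corpus: "white house" is a phrase in two documents,
# while the third contains both words without the phrase.
corpus = [
    "the white house issued a statement",
    "a white house spokesman replied",
    "the house has one white fence",
]
phrases = frequent_pairs(corpus, 1)
```

Here only the pair ("white", "house") passes the threshold; the third document contains both words but does not contribute to the phrase count.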

Information Extraction

    Special recognizers for specific concepts
        People, organizations, places, dates,
         amounts, products
    Meta terms such as #COMPANY,
     #PERSON can be added to indexing
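
As a rough sketch of how recognizers could feed meta terms into an index, the example below uses a few regular expressions. The patterns are deliberately simplistic assumptions, and the tags `#DATE` and `#AMOUNT` are hypothetical names formed by analogy with `#COMPANY` and `#PERSON`.

```python
import re

# Hypothetical, deliberately simplistic recognizers; real systems use far richer rules.
RECOGNIZERS = {
    "#DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "#AMOUNT": re.compile(r"\$\d+(?:\.\d{2})?"),
    "#COMPANY": re.compile(r"\b[A-Z][a-z]+ (?:Inc|Corp|Ltd)\b"),
}

def meta_terms(text):
    """Return the sorted meta terms whose recognizer matches somewhere in the text."""
    return sorted(tag for tag, pattern in RECOGNIZERS.items() if pattern.search(text))
```

The returned tags would be indexed alongside the ordinary words of the document, so a query could ask for any company or any date.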




Indexing Example




Implementations

    Common implementations of indexes
      Bitmaps
      Signature files

      Inverted files

      Hashing
      N-grams




 N-grams: more information is available at
http://catadmin.cattelecom.com/km/blog/kittichonm/category/search-engine/n-gram/
   A character-level N-gram builder for Thai
   The file used to try building N-grams was a Thai news file containing several thousand
   news articles, with 28,694,548 characters in total (77 MB). These characters include
   punctuation, digits, and any other characters that occur in the news.
    After the program finished running, these were the 10 most frequent characters:
         า_1901143
         น_1553522
         (space)_1493261
         ร_1445651
         ่_1214212
         ก_1182815
         อ_1089453
         เ_1006035
         ง_984559
         ม_927818
    Casual observations:
         - Sara aa (า) occurs most often, with a frequency of 1901143
         - Space ranks 3rd; at first I expected it to be rare in Thai
         - The most frequent bigram is าร (sara aa followed by ro ruea), with a frequency of 311818
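
Character-level N-gram counting of the kind described above can be sketched in a few lines. This is not the program from the blog post, only a minimal illustration; `char_ngrams` is a hypothetical name. Python strings are sequences of Unicode code points, so Thai vowels and tone marks are counted as separate characters.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count every run of n consecutive characters, including spaces and punctuation."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

unigrams = char_ngrams("banana", 1)
bigrams = char_ngrams("banana", 2)
```

Calling `most_common(10)` on the resulting counter gives a top-10 ranking like the one listed above.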
Indexes: Inverted Lists
  Inverted lists are today the most common
   indexing technique
  Source file: collection organized by document
  Inverted file: collection organized by term
        One record per term, listing locations where term
         occurs
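
A minimal sketch of turning a source file into an inverted file is shown below. The two-document collection is made up, and recording locations as (document, word position) pairs is one possible choice of granularity, assumed here for illustration.

```python
from collections import defaultdict

def build_inverted_file(source):
    """Map each term to a sorted list of (doc_id, word_position) locations."""
    inverted = defaultdict(list)
    for doc_id in sorted(source):
        for position, term in enumerate(source[doc_id].lower().split()):
            inverted[term].append((doc_id, position))
    return dict(inverted)

# Hypothetical source file: organized by document.
source_file = {1: "index by term", 2: "scan by document"}
# Inverted file: the same information, organized by term.
inverted_file = build_inverted_file(source_file)
```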




Inverted Lists

    During query evaluation, traverse lists for each
     query term
        OR: the union of component lists
        AND: an intersection of component lists
        Proximity: an intersection of component lists
        SUM: the union of component lists; each entry has
         a score
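
The list operations above can be sketched as follows, assuming each inverted list is a sorted list of document ids (or, for SUM, a mapping from document id to score). The function names are illustrative, and proximity is omitted because it also needs word positions.

```python
def union(a, b):
    """OR: documents that appear in either sorted posting list."""
    return sorted(set(a) | set(b))

def intersect(a, b):
    """AND: linear merge of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def score_sum(a, b):
    """SUM: union of {doc_id: score} lists, adding scores for shared documents."""
    total = dict(a)
    for doc_id, score in b.items():
        total[doc_id] = total.get(doc_id, 0) + score
    return total
```

The merge in `intersect` visits each entry at most once, which is why evaluation time depends on the lengths of the lists for the query terms rather than on the size of the collection.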




Inverted Files

    Example text: each line is a document




Inverted Files




Word-Level Inverted File




Index Construction Methods

  Memory-based inversion
  Sort-based inversion
  All above, combined with compression
  FAST-INV
  Based on text partitioning
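
The idea behind sort-based inversion can be sketched as below: emit (term, document) pairs, sort them, and group equal terms into lists. This is a simplified in-memory illustration under the assumption that everything fits in memory; an implementation for a collection larger than main memory would sort runs on disk and merge them.

```python
from itertools import groupby

def sort_based_inversion(documents):
    """Emit (term, doc_id) pairs, sort them, then group runs into posting lists."""
    pairs = set()
    for doc_id, text in documents.items():
        for term in text.lower().split():
            pairs.add((term, doc_id))
    ordered = sorted(pairs)  # sorts by term, then by doc_id
    return {term: [doc_id for _, doc_id in group]
            for term, group in groupby(ordered, key=lambda p: p[0])}
```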




Index Construction: Overview

    Total text size 5 GB, with 5 million documents,
     40 MB main memory




Expanding the Index

    Simplest way to handle document insertion
     for the inverted file index
        Accumulate updates in a stop-press file
        For each query issued, the stop-press file is
         checked
        When the stop-press file grows too large, re-index
         the entire collection
        Major disadvantage: to keep performance up to
         scratch, stop-press files must be kept small, so
         re-indexing needs to be done often, while it takes
         longer as the data set grows
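
The stop-press scheme can be sketched roughly as follows. The class name `StopPressIndex`, the `limit` threshold, and the in-memory dictionaries standing in for files are all assumptions made for the example.

```python
class StopPressIndex:
    """Main inverted index plus a small stop-press buffer of recent documents."""

    def __init__(self, limit=2):
        self.limit = limit      # hypothetical threshold that triggers re-indexing
        self.all_docs = {}      # doc_id -> text, the full collection
        self.main = {}          # term -> set of doc_ids, built in bulk
        self.stop_press = {}    # doc_id -> text, not yet in self.main
        self.rebuilds = 0       # how many times the collection was re-indexed

    def insert(self, doc_id, text):
        self.all_docs[doc_id] = text
        self.stop_press[doc_id] = text
        if len(self.stop_press) > self.limit:
            self._reindex()

    def _reindex(self):
        """Rebuild the main index from the entire collection and clear the buffer."""
        self.main = {}
        for doc_id, text in self.all_docs.items():
            for term in set(text.lower().split()):
                self.main.setdefault(term, set()).add(doc_id)
        self.stop_press = {}
        self.rebuilds += 1

    def search(self, term):
        """Look up the main index, then scan the stop-press buffer on every query."""
        hits = set(self.main.get(term, set()))
        hits |= {d for d, text in self.stop_press.items()
                 if term in text.lower().split()}
        return sorted(hits)
```

The cost trade-off is visible in the code: `search` scans the whole buffer on every query, and `_reindex` walks the whole collection, which grows more expensive as the collection grows.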
Indexes: Signature Files

  Bag of words only
  For each term, allocate fixed size s-bit vector
   (signature)
  Define hash function:
  Each term has an s-bit signature
        May not be unique
  OR the term signatures to form document
   signature
  Long documents are a problem
        Usually segment them into smaller pieces
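
A minimal sketch of signature construction and lookup follows. The hash function here is purely an assumption chosen for illustration (MD5 of the term, used to pick bit positions), as are the signature width `S_BITS` and the number of bits set per term, `K_BITS`.

```python
import hashlib

S_BITS = 64   # signature width s; value chosen only for illustration
K_BITS = 3    # bits set per term; also an assumption

def term_signature(term):
    """Set up to K_BITS positions of an S_BITS-wide signature by hashing the term."""
    sig = 0
    for k in range(K_BITS):
        digest = hashlib.md5(f"{k}:{term}".encode("utf-8")).digest()
        sig |= 1 << (int.from_bytes(digest, "big") % S_BITS)
    return sig

def document_signature(text):
    """OR together the signatures of all terms in the document (bag of words)."""
    sig = 0
    for term in set(text.lower().split()):
        sig |= term_signature(term)
    return sig

def may_contain(doc_sig, term):
    """True if every bit of the term is set; false positives are possible."""
    t = term_signature(term)
    return (doc_sig & t) == t
```

A term that is in the document always passes `may_contain`, but a term that is not in the document can pass too, because signatures are not unique. Long documents set more bits and so produce more false matches, which is one reason to segment them.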

Encoding and Compression

    Encoding transforms data from one representation to
     another



  Compression is an encoding that takes less space
  Lossless: decoder can reproduce message exactly
  Lossy: can reproduce message approximately
  Degree of compression:
        (Original – Encoded)/Encoded
        Example: (125MB-25MB)/25 MB = 400%
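
The degree-of-compression formula above can be checked with a one-line function; the name `degree_of_compression` is made up for this sketch, and both sizes must be in the same unit.

```python
def degree_of_compression(original_size, encoded_size):
    """Return (original - encoded) / encoded as a percentage."""
    return 100.0 * (original_size - encoded_size) / encoded_size

example = degree_of_compression(125, 25)  # sizes in MB, as in the example above
```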


Compression

    Advantage of Compression
      Save space in memory (e.g., compressed
       cache)
      Save space when storing (e.g., disk,
       CD-ROM)
      Save time when accessing (e.g., I/O)

      Save time when communicating (e.g., over
       network)


Compression

    Disadvantages of Compression
      Costs time and computation to compress
       and uncompress
      Complicates or prevents random access

      May involve loss of information (e.g., JPEG,
       MP3)
      Makes data corruption much more costly.
       Small errors may make all of the data
       inaccessible.


						