Performance of Compressed Inverted Indexes

Reasons for Compression
 Compression reduces the size of the index
 Compression can increase the
  performance of query evaluation
Factors Affecting Index Performance
 Retrieval time for index lists (index size)
 Complexity of decoding index lists
Standard Techniques
 Translate absolute location of terms into
  differences between locations
 Use bitwise encoding schemes such as
  Golomb-Rice or Elias coding
 Usually reduce an index to about 15% of
  the size of the collection
 Performance is generally equal or better
  than an uncompressed index
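
The two standard steps above can be sketched as follows. This is a minimal illustration, not the code used in either article: the function names are hypothetical, the bit stream is held as a string of '0'/'1' characters for readability, and Elias gamma is used as one example of an Elias code (Golomb-Rice is not shown).

```python
def to_gaps(positions):
    """Turn sorted absolute positions into differences between locations."""
    gaps = []
    previous = 0
    for p in positions:
        gaps.append(p - previous)
        previous = p
    return gaps


def from_gaps(gaps):
    """Rebuild absolute positions by summing the differences."""
    positions = []
    total = 0
    for g in gaps:
        total += g
        positions.append(total)
    return positions


def gamma_encode(n):
    """Elias gamma code of an integer n >= 1, as a string of bits."""
    if n < 1:
        raise ValueError("Elias gamma encodes integers >= 1 only")
    binary = bin(n)[2:]                       # binary digits, no '0b' prefix
    return "0" * (len(binary) - 1) + binary   # unary length prefix + value


def gamma_decode_all(bits):
    """Decode a concatenation of Elias gamma codes."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                 # count the unary prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


positions = [3, 7, 8, 20]                     # made-up term locations
gaps = to_gaps(positions)
encoded = "".join(gamma_encode(g) for g in gaps)
assert from_gaps(gamma_decode_all(encoded)) == positions
```

Small gaps get short codes, which is why the difference transform is applied before the bitwise code.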
Articles Reviewed
   Compression of Inverted Indexes For Fast
    Query Evaluation
       Scholer, Williams, Yiannis and Zobel, 2002
       School of Computer Science and Information
        Technology, RMIT University, Melbourne, Australia
   Index Compression vs. Retrieval Time of
    Inverted Files for XML Documents
       Fuhr and Gövert, 2002
       University of Dortmund, Germany
Article 1: Improving Performance
   Two techniques were chosen to attempt to
    improve the performance of compressed
    indexes:
       Optimization of existing bitwise compression
       Implementation of bytewise compression
Optimized Bitwise Compression
 Improved existing code developed by
  Williams and Zobel
 Optimized for the Intel / Linux platform
 Decoding time reduced to about 60% of that
  achieved by Williams and Zobel
Bytewise Compression Routines
 Integers are stored in standard binary
  form using only 7 bits of a byte
 Each integer only takes up as many bytes
  as necessary to store the integer
 1 bit per byte is used as a flag to indicate
  that a byte is the final byte for the integer
 Decoding of the integers is much simpler
  than the complex bitwise encodings
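
A minimal sketch of such a bytewise scheme is shown below. The slides only state that 7 bits per byte carry data and 1 bit flags the final byte; the choice of the high bit as the flag, a set flag meaning "final byte", the most-significant-group-first order, and the function names are all assumptions made for illustration.

```python
def vbyte_encode(n):
    """Encode a non-negative integer using 7 data bits per byte."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    groups = [n & 0x7F]                # lowest 7 bits
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()                   # most significant group first
    groups[-1] |= 0x80                 # flag bit marks the final byte
    return bytes(groups)


def vbyte_decode_all(data):
    """Decode a concatenation of bytewise-encoded integers."""
    values = []
    n = 0
    for b in data:
        n = (n << 7) | (b & 0x7F)      # append the 7 data bits
        if b & 0x80:                   # final byte of this integer
            values.append(n)
            n = 0
    return values


gaps = [5, 130, 16384]                 # made-up gap values
data = b"".join(vbyte_encode(g) for g in gaps)
assert vbyte_decode_all(data) == gaps
```

Decoding needs only a shift, a mask and one test per byte, with no bit-level bookkeeping across byte boundaries.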
Bitwise vs. Bytewise
 A bytewise-encoded index takes up nearly
  20% of the size of the collection (about
  33% more than bitwise encodings)
 Bytewise encoding provides query
  performance that is double that of the
  optimized bitwise encodings
 Even when the index is small enough to
  be stored in memory, bytewise encoding
  shows small improvements over
  uncompressed indexes
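
The space cost of rounding every integer up to whole bytes can be seen by counting code lengths directly. The sketch below is illustrative only: the gap values are invented, Elias gamma stands in for "bitwise encodings", and the helper names are hypothetical. It does not reproduce the percentages reported in the article.

```python
def gamma_bits(n):
    """Length in bits of the Elias gamma code of n >= 1."""
    return 2 * (n.bit_length() - 1) + 1


def vbyte_bytes(n):
    """Length in bytes of a 7-bits-per-byte code of n >= 0."""
    return max(1, -(-n.bit_length() // 7))   # ceiling division, at least 1


gaps = [1, 3, 12, 130, 5, 70000]             # made-up gap values

bitwise_bits = sum(gamma_bits(g) for g in gaps)
bytewise_bits = 8 * sum(vbyte_bytes(g) for g in gaps)

print("bitwise :", bitwise_bits, "bits")
print("bytewise:", bytewise_bits, "bits")
```

On small gaps the bytewise code always spends a full byte where a bitwise code may need only a few bits, which is the price paid for the simpler decoding loop.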
Article 2: Structured Indexes
 Most IR approaches in the past have
  ignored the structure and formatting of
  documents
 The widespread adoption of HTML and
  XML has created the need for
  improvements in structured IR
Inverted Indexes of XML Documents
 The document structure must be stored or
  referenced from the inverted index
 Standard schemes use a Path-In-List (PIL)
  approach; structure data is stored within
  the inverted list for each term
 Indexes are generally much larger than
  the original text when uncompressed
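
The Path-In-List idea can be sketched as an inverted list whose entries carry the element path alongside the document identifier. The slides do not give the exact PIL entry layout, so the tuple format, the path strings and the function names below are assumptions; sibling positions and term offsets are left out.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_pil_index(docs):
    """Map each term to a list of (doc_id, path) entries.

    docs is a dict of doc_id -> XML text (hypothetical input format).
    """
    index = defaultdict(list)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        _walk(root, "/" + root.tag, doc_id, index)
    return index


def _walk(elem, path, doc_id, index):
    # Terms directly inside this element are stored with its full path.
    for term in (elem.text or "").lower().split():
        index[term].append((doc_id, path))
    for child in elem:
        _walk(child, path + "/" + child.tag, doc_id, index)
        # Text following a child still belongs to the current element.
        for term in (child.tail or "").lower().split():
            index[term].append((doc_id, path))


docs = {
    1: "<article><title>index compression</title>"
       "<body><p>compression helps</p></body></article>",
}
index = build_pil_index(docs)
print(index["compression"])
```

Because every entry repeats a full path, the lists grow quickly, which matches the observation that the uncompressed index is larger than the text.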
Compression of Inverted Lists
 Problem: the uncompressed PIL approach
  generates an index that is too large
 Two possible solutions were explored:
       Use bitwise compression schemes to compress
        the existing PIL representation
       Store only a pointer in the list that points into
        another data structure that models the
        document structure
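
The second solution can be sketched by storing only a small integer in each list entry and keeping the structure in a separate per-document table. This is a simplified stand-in for a structure model, not the article's actual data structure: the (tag, parent) table, the node numbering and the function names are assumptions, and text following a child element is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_pointer_index(docs):
    """Return (index, structures).

    index:      term -> list of (doc_id, node_id) entries
    structures: doc_id -> list of (tag, parent_id), one row per element
    """
    index = defaultdict(list)
    structures = {}
    for doc_id, xml_text in docs.items():
        nodes = []
        structures[doc_id] = nodes
        stack = [(ET.fromstring(xml_text), -1)]    # -1 means "no parent"
        while stack:
            elem, parent_id = stack.pop()
            node_id = len(nodes)                   # ids follow document order
            nodes.append((elem.tag, parent_id))
            for term in (elem.text or "").lower().split():
                index[term].append((doc_id, node_id))
            for child in reversed(list(elem)):
                stack.append((child, node_id))
    return index, structures


def path_of(structures, doc_id, node_id):
    """Recover the element path by following parent pointers."""
    nodes = structures[doc_id]
    tags = []
    while node_id != -1:
        tag, node_id = nodes[node_id]
        tags.append(tag)
    return "/" + "/".join(reversed(tags))


docs = {
    1: "<article><title>index compression</title>"
       "<body><p>compression helps</p></body></article>",
}
index, structures = build_pointer_index(docs)
doc_id, node_id = index["compression"][1]
print(path_of(structures, doc_id, node_id))
```

The list entries shrink to a pair of integers, but every lookup now pays for a walk through the separate structure, which is the decoding cost the later slides refer to.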
XML Structure (XS) Tree
 The XS Tree is a compact representation
  of the structure of an XML document
 Size of XS Tree is generally 1-2% of the
  original document size
 XS Trees for an entire document collection
  can usually be kept in memory
Performance of PIL vs. XS Trees
   The XS Tree index, including the XS Trees, is
    generally 2-3 times smaller than the compressed
    PIL approach
   Both approaches yield indexes that are smaller
    than the document collection
   In both cases, compression results in retrieval
    performance that is far worse than that of the
    uncompressed index
   Retrieval performance of the XS Tree approach is
    10-100 times worse than that of the
    uncompressed PIL
   Retrieval performance is dependent on:
       the retrieval time of the index (index size)
       the complexity of decoding the index entries
   Scholer et al. find the ideal balance with
    bytewise compression, which results in
    optimal retrieval times
 The XS Tree’s goal of compressing the size
  of the index is successful
 The complexity of decoding the XS Tree
  structure results in nearly unusable
  retrieval performance
 Future research should be undertaken to
  find a structure that is quicker to decode
  than the XS Tree
