Performance of Compressed Inverted Indexes
Reasons for Compression
Compression reduces the size of the index
Compression can increase the performance of query evaluation operations
Factors Affecting Index Performance
Retrieval time for index lists (index size)
Complexity of decoding index lists
Standard Techniques
Translate absolute locations of terms into differences between successive locations
Use bitwise encoding schemes such as Golomb-Rice or Elias coding
These techniques usually reduce an index to about 15% of the size of the collection
Performance is generally equal to or better than that of an uncompressed index
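The two standard steps above can be sketched in a few lines. This is a minimal illustration, not the code used in either article: it assumes locations are numbered from 1 and strictly increasing, uses Elias gamma as the example bitwise code, and represents the bit stream as a string of '0' and '1' characters for readability. All function names are hypothetical.

```python
def to_gaps(positions):
    """Replace sorted absolute positions with differences between neighbours."""
    gaps = []
    previous = 0
    for position in positions:
        gaps.append(position - previous)
        previous = position
    return gaps


def from_gaps(gaps):
    """Rebuild absolute positions by summing the gaps."""
    positions = []
    total = 0
    for gap in gaps:
        total += gap
        positions.append(total)
    return positions


def gamma_encode(n):
    """Elias gamma code for n >= 1: (length - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma only encodes integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


positions = [3, 7, 8, 20]
encoded = "".join(gamma_encode(g) for g in to_gaps(positions))
decoded = from_gaps(gamma_decode_all(encoded))
```

Small gaps get short codes, which is why the difference step comes first: the four positions above need 16 bits in total.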
Articles Reviewed
Compression of Inverted Indexes For Fast Query Evaluation
  Scholer, Williams, Yiannis and Zobel, 2002
  School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
Index Compression vs. Retrieval Time of Inverted Files for XML Documents
  Fuhr and Gövert, 2002
  University of Dortmund, Germany
Article 1: Improving Performance
Two techniques were tried to improve the performance of compressed indexes:
  Optimization of existing bitwise compression routines
  Implementation of bytewise compression routines
Optimized Bitwise Compression Routines
Improved existing code developed by
Williams and Zobel
Optimized for the Intel / Linux platform
Decoding time reduced to 60% of that achieved by Williams and Zobel
Bytewise Compression Routines
Integers are stored in standard binary form using only 7 bits of each byte
Each integer takes up only as many bytes as are necessary to store it
The remaining bit of each byte is a flag that indicates whether the byte is the final byte for the integer
Decoding the integers is much simpler than with the complex bitwise encodings
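A minimal sketch of such a bytewise code is shown below. The slide does not say which bit is the flag or in which order the 7-bit groups are written, so this sketch assumes the high bit is set on the final byte and the least significant group is written first. The function names are hypothetical.

```python
def vbyte_encode(n):
    """Encode a non-negative integer using 7 payload bits per byte."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)  # flag bit clear: more bytes follow
        n >>= 7
    out.append(n | 0x80)      # flag bit set: this is the final byte
    return bytes(out)


def vbyte_decode_all(data):
    """Decode a concatenation of bytewise codes back into integers."""
    values = []
    n = 0
    shift = 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:       # final byte of this integer
            values.append(n)
            n = 0
            shift = 0
        else:
            shift += 7
    return values


gaps = [5, 300, 127, 128]
encoded = b"".join(vbyte_encode(g) for g in gaps)
decoded = vbyte_decode_all(encoded)
```

The decoder needs only a mask, a shift and one test per byte, with no bit-level bookkeeping across byte boundaries.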
Bitwise vs. Bytewise
A bytewise-encoded index takes up nearly 20% of the size of the collection (about 33% more than the roughly 15% needed with bitwise encodings)
Bytewise encoding provides query performance that is double that of the optimized bitwise encodings
Even when the index is small enough to be stored in memory, bytewise encoding shows small improvements over uncompressed indexes
Article 2: Structured Indexes
Most IR approaches in the past have ignored the structure and formatting of documents
The widespread adoption of HTML and XML has created the need for improvements in structured IR
Inverted Indexes of XML Documents
The document structure must be stored or referenced from the inverted index
Standard schemes use a Path-In-List (PIL) approach; structure data is stored within the inverted list for each term
Indexes are generally much larger than the original text when uncompressed
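To make the PIL idea concrete, here is a rough sketch in which every posting carries the full path of the element that contains the term. The posting layout, the path notation and the function name are assumptions made for illustration; the slides do not describe the actual format used by Fuhr and Gövert.

```python
from collections import defaultdict


def build_pil_index(documents):
    """Build a toy Path-In-List index.

    documents maps a document id to a list of (path, text) pairs,
    where path is a tuple of element steps from the root.
    """
    index = defaultdict(list)
    for doc_id, elements in documents.items():
        for path, text in elements:
            for term in text.lower().split():
                # The whole path is stored inside the term's list.
                index[term].append((doc_id, path))
    return index


title_path = ("book", "chapter[1]", "title")
para_path = ("book", "chapter[1]", "para[1]")
documents = {
    1: [
        (title_path, "Index compression"),
        (para_path, "compression of index lists"),
    ],
}
index = build_pil_index(documents)
```

Because the same path is repeated in the list of every term that occurs in an element, the structure data is duplicated many times, which is consistent with the large uncompressed index size noted above.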
Compression of Inverted Lists
Problem: the uncompressed PIL approach generates an index that is too large
Two possible solutions were explored:
  Use bitwise compression schemes to compress the existing PIL representation
  Store only a pointer in the list that points into another data structure that models the document structure
XML Structure (XS) Tree
The XS Tree is a compact representation of the structure of an XML document
Size of XS Tree is generally 1-2% of the original document size
XS Trees for an entire document collection can usually be kept in memory
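The pointer-based alternative can be sketched as follows. This is a hypothetical layout, not the XS Tree format from the article: it assumes the structure is stored once per document as an array of nodes with parent links, and that each posting holds only a document id and a node number.

```python
class StructureTree:
    """Toy per-document structure tree with parent links."""

    def __init__(self):
        self.labels = []   # element step for each node
        self.parents = []  # parent node number, or None for the root

    def add_node(self, label, parent=None):
        """Append a node and return its number."""
        self.labels.append(label)
        self.parents.append(parent)
        return len(self.labels) - 1

    def path_to(self, node_id):
        """Recover the full path by walking up to the root."""
        steps = []
        while node_id is not None:
            steps.append(self.labels[node_id])
            node_id = self.parents[node_id]
        return tuple(reversed(steps))


tree = StructureTree()
book = tree.add_node("book")
chapter = tree.add_node("chapter[1]", book)
title = tree.add_node("title", chapter)

# A posting now stores a small node number instead of the whole path.
posting = (1, title)
resolved = tree.path_to(posting[1])
```

The list entries shrink to a single number, but every posting that is inspected requires an extra walk through the tree to recover its path.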
Performance of PIL vs. XS Trees
The XS Tree index, including the XS Trees, is generally 2-3 times smaller than the compressed PIL approach
Both approaches yield indexes that are smaller than the document collection
In both cases, compression results in retrieval performance that is far worse than uncompressed PIL
Retrieval performance of the XS Tree approach is 10-100 times worse than that of the uncompressed PIL
Conclusions
Retrieval performance is dependent on:
  the retrieval time of the index (index size)
  the complexity of decoding the index entries
Scholer et al. find the best balance with bytewise compression, which gave the fastest retrieval times in their experiments
Conclusions
The XS Tree succeeds in its goal of reducing the size of the index
The complexity of decoding the XS Tree structure results in nearly unusable performance
Future research should be undertaken to find a structure that is quicker to decode than the XS Tree