Techniques for Gigabyte-Scale N-gram Based Information Retrieval on PCs
Ethan L. Miller
University of Maryland Baltimore County
elm@csee.umbc.edu
What’s the problem?
N-gram based IR is becoming more important
Language-independent
Garble-tolerant
Better accuracy (phrases, etc.)?
Scalability of n-gram IR now necessary
Adapt traditional (word-based) IR techniques to n-grams
More unique terms per corpus
More unique terms per document
Avoid use of language-dependent techniques
CADIP PI meeting - 9/9/99 Gigabyte-scale N-gram IR on PCs 2
What did we do about it?
Scaled n-gram based IR system to handle a
gigabyte on a commodity (
Many potential weighting schemes for terms
Find documents in corpus “similar” to query
Break query into terms
Similarity between query and a given document is a function of the term vectors for each
Results ranked
Function often looks like a dot product
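As a minimal sketch of the ranking step described on this slide, the following treats each query and document as a vector of n-gram counts and scores a document by the dot product of the two vectors. The function names, the use of raw counts as weights, and the choice of overlapping character 5-grams are assumptions made for illustration; the slides leave the weighting scheme open.

```python
from collections import Counter

def term_vector(text, n=5):
    """Count overlapping character n-grams in text (raw counts as weights)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(query_vec, doc_vec):
    """Dot product over the terms that the two vectors share."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

def rank(query, docs, n=5):
    """Return (score, doc_index) pairs, highest score first."""
    q = term_vector(query, n)
    scores = [(similarity(q, term_vector(d, n)), i) for i, d in enumerate(docs)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))
```

A real system would look up postings lists for the query terms instead of rebuilding every document vector, but the scoring function has the same shape.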
N-grams vs. words as terms
Fewer unique words
Differences of orders of magnitude
5-grams => 4x words
6-grams => 10x words
Longer n-grams => even higher ratios
More postings per document
(5-gram postings) / (word postings) ~ 10
Most 5-gram postings have a count of 1
[Figure: number of unique terms (log scale, 1x10^3 to 1x10^7) vs. corpus size (0 to 35 MB) for words and for 3-grams through 8-grams]
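The ratios on this slide can be measured on any corpus with a few lines of code. The sketch below counts unique terms and postings (one posting per distinct term per document) for both words and character n-grams. The function name is hypothetical, and it assumes whitespace-delimited words and n-grams that span spaces; the slides do not say how the original system tokenized text.

```python
def index_stats(docs, n=5):
    """Compare word terms and character n-gram terms over a list of documents."""
    unique_words, unique_grams = set(), set()
    word_postings = gram_postings = 0
    for text in docs:
        words = set(text.split())
        grams = {text[i:i + n] for i in range(len(text) - n + 1)}
        unique_words |= words
        unique_grams |= grams
        # One posting per distinct term per document.
        word_postings += len(words)
        gram_postings += len(grams)
    return {
        "unique_words": len(unique_words),
        "unique_ngrams": len(unique_grams),
        "word_postings": word_postings,
        "ngram_postings": gram_postings,
    }
```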
N-gram IR: memory usage
Postings lists
Naïve: 12 bytes per entry
Better: compression!
N-gram (term) table
1 entry per n-gram
~40 bytes per entry
Document & file information
Large structures
Relatively few instances!
Most memory used by postings list & n-gram hash table
[Figure: percentage of memory used (0% to 100%) by postings, n-gram table, document table, and file table vs. corpus size (1, 10, 40, 180, 1000 MB)]
Corpus compression
Compress integers in postings to reduce corpus size
Posting count
Document identifier (use difference from previous one in sorted list)
Try different compression techniques & adjust parameters to best fit n-grams
Simple compression
Easy to code
Effective enough?
Gamma compression
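The document identifier trick mentioned above (storing the difference from the previous identifier in a sorted list) can be sketched as follows. The function names are hypothetical, and the sketch assumes strictly increasing, non-negative identifiers with the first gap measured from zero.

```python
def to_gaps(doc_ids):
    """Replace each sorted document id with its difference from the previous one."""
    gaps = []
    prev = 0
    for i, doc_id in enumerate(doc_ids):
        if i > 0 and doc_id <= prev:
            raise ValueError("document ids must be strictly increasing")
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def from_gaps(gaps):
    """Rebuild the original document ids by summing the gaps."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids
```

Gaps are usually much smaller than the identifiers themselves, which is what lets a variable-length integer code save space.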
5-gram posting counts
Q: What’s the count for a particular posting in a document?
A: Almost certainly 1!
80% of all postings have a count of 1
98% have a count of 5 or less
Distribution is more skewed for n-grams than for words
[Figure: cumulative percentage of postings (0% to 100%) vs. count in posting (log scale, 1 to 1000)]
Document identifier gaps
Curve less steep than that of posting counts
Curve less steep than corresponding curve for words
Compression may be less effective
Parameters may need to be changed
[Figure: cumulative percentage (0% to 100%) vs. gap from previous document (log scale, 1x10^0 to 1x10^4)]
Simple compression
Raw (uncompressed) index requires 6x storage of documents themselves
Represent numbers in
8 bits: 0-127 (2^7 - 1)
16 bits: 128-16383 (2^14 - 1)
32 bits: everything else (up to 30 bits)
Simple compression effectiveness
960 MB of text -> 1085 MB index
Factor of 6 reduction from no compression
gzip compressed index by another factor of 2
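One way to realize the 8/16/32-bit scheme above is to spend the leading bits of the first byte on a length tag, which leaves 7, 14, and 30 payload bits, matching the ranges on the slide. The tag layout (0, 10, 11), the big-endian byte order, and the function names below are assumptions for illustration; the slides do not specify the actual encoding.

```python
def encode_simple(value):
    """Encode a non-negative integer below 2**30 in 1, 2, or 4 bytes."""
    if value < 0 or value >= 1 << 30:
        raise ValueError("value must fit in 30 bits")
    if value < 1 << 7:
        return bytes([value])                       # tag 0,  7-bit payload
    if value < 1 << 14:
        return (0x8000 | value).to_bytes(2, "big")  # tag 10, 14-bit payload
    return (0xC0000000 | value).to_bytes(4, "big")  # tag 11, 30-bit payload

def decode_simple(data, pos=0):
    """Decode one integer starting at pos; return (value, next_pos)."""
    first = data[pos]
    if first < 0x80:
        return first, pos + 1
    if first < 0xC0:
        return int.from_bytes(data[pos:pos + 2], "big") & 0x3FFF, pos + 2
    return int.from_bytes(data[pos:pos + 4], "big") & 0x3FFFFFFF, pos + 4
```

Because most posting counts are 1 and many gaps are small, most integers fit in a single byte under a scheme like this.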
Gamma compression
Represent numbers as unary n followed by m-bit binary
n to m translation table can be tuned
Adjust translation to minimize number of bits used
Posting counts
Represent “1” in 1 bit
Small numbers have very few bits
Document gaps
Small numbers have small representations, but...
Shallower curve: don’t weight as much towards small numbers
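The tunable code described above can be sketched as a unary selector n followed by an m-bit binary offset, where a vector of widths maps each n to its m. This is one plausible reading of the slide, not a confirmed description of the original implementation: the bucket layout, the use of bit strings instead of packed bits, and all names are assumptions. With widths 0, 1, 2, 3, ... the code lengths match the classic Elias gamma code, so the value 1 takes a single bit.

```python
def gamma_encode(value, widths):
    """Encode value >= 1 as n in unary, then an m-bit offset, with m = widths[n]."""
    if value < 1:
        raise ValueError("value must be >= 1")
    base = 1  # smallest value covered by bucket n
    for n, m in enumerate(widths):
        if value < base + (1 << m):
            offset = format(value - base, "b").zfill(m) if m else ""
            return "1" * n + "0" + offset
        base += 1 << m
    raise ValueError("value too large for this width vector")

def gamma_decode(bits, widths, pos=0):
    """Decode one value starting at pos; return (value, next_pos)."""
    n = 0
    while bits[pos] == "1":
        n += 1
        pos += 1
    pos += 1  # skip the "0" that terminates the unary part
    m = widths[n]
    base = 1 + sum(1 << w for w in widths[:n])
    offset = int(bits[pos:pos + m], 2) if m else 0
    return base + offset, pos + m

def total_bits(values, widths):
    """Total encoded size, used to compare candidate width vectors."""
    return sum(len(gamma_encode(v, widths)) for v in values)
```

Tuning then amounts to evaluating `total_bits` on sampled posting counts or document gaps for several candidate vectors and keeping the smallest. A vector that starts with width 0 favors data dominated by 1s, while a vector that starts wider suits a shallower distribution.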
Gamma compression results
Use single vector for simplicity
Select for minimal sum of posting counts, document gap sizes
Vector of worked best
Within 3% of minimum of each set compressed separately
Posting counts compressed far more than document gaps
960 MB of text -> 647 MB of index
Postings lists = 485 MB
Overhead (doc info, n-gram headers) = 150 MB
Postings lists: memory vs. disk
Construct indices for 257 MB corpus
Run queries with postings lists
In memory
On disk
On disk lists slower, as expected, but…
Less than 2x slowdown
Decompression not much slower than disk I/O
Seek time less critical than we thought
[Figure: running time (0 to 50 seconds) vs. query size (0 to 5 KB) for in-memory and on-disk postings lists]
N-gram library rewrite
Build more efficient data structures
Better dynamic storage
Reduction in memory consumption
Make on-disk storage work better
More efficient
Independent of underlying byte order
Build to standard API
Reusable component
Fit with legacy apps
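The goal of on-disk storage that is independent of the underlying byte order comes down to fixing the byte order in the file format rather than writing structures in host order. A minimal sketch follows, assuming a hypothetical 16-byte term record (64-bit term hash, 32-bit occurrence count, 32-bit document count); the actual on-disk layout of the library is not given in the slides.

```python
import struct

# Hypothetical record layout: 64-bit term hash, 32-bit nOccs, 32-bit nDocs.
# The "<" prefix fixes little-endian order and standard field sizes, so the
# bytes written are the same regardless of the host CPU.
RECORD = struct.Struct("<QII")

def pack_term(term_hash, n_occs, n_docs):
    """Serialize one term record to 16 bytes."""
    return RECORD.pack(term_hash, n_occs, n_docs)

def unpack_term(data):
    """Parse 16 bytes back into (term_hash, n_occs, n_docs)."""
    return RECORD.unpack(data)
```

An index written this way on one machine can be read on another with a different native byte order.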
Data structure design
Main data structures
TermTable
Maintain per-term information
Store term text as hashed 64-bit value
PostingsList
Keep compressed postings lists
Dynamically allocate chunks as needed
Other structures
DocTable
Corpus (includes other structures)
Structures use templates extensively
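The two main structures can be sketched in a few lines to show how they fit together. This is an illustration under stated assumptions, not the library's design: the original uses templates (so it is presumably C++), the hash function here (BLAKE2b truncated to 64 bits) is a stand-in, the chunk size is tiny on purpose, and postings are kept uncompressed for readability. The field names n_occs and n_docs mirror the per-term counters, and the class TermEntry is hypothetical.

```python
import hashlib

def term_hash(term):
    """Hash term text to an unsigned 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

class PostingsList:
    """Postings stored in fixed-size chunks that are allocated as needed."""
    CHUNK_SIZE = 4  # deliberately small so chunking is visible

    def __init__(self):
        self.chunks = []

    def append(self, doc_id, count):
        if not self.chunks or len(self.chunks[-1]) >= self.CHUNK_SIZE:
            self.chunks.append([])  # allocate a new chunk
        self.chunks[-1].append((doc_id, count))

    def __iter__(self):
        for chunk in self.chunks:
            yield from chunk

class TermEntry:
    """Per-term information: totals plus the term's postings list."""
    def __init__(self):
        self.n_occs = 0
        self.n_docs = 0
        self.postings = PostingsList()

class TermTable:
    """Maps the 64-bit hash of each term to its entry; term text is not stored."""
    def __init__(self):
        self.entries = {}

    def add(self, term, doc_id, count):
        entry = self.entries.setdefault(term_hash(term), TermEntry())
        entry.n_occs += count
        entry.n_docs += 1
        entry.postings.append(doc_id, count)

    def lookup(self, term):
        return self.entries.get(term_hash(term))
```

Storing only the hash keeps every table entry the same size whatever the n-gram length, at the cost of a small chance that two different terms collide.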
Data structures: connections
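[Diagram: TermTable entries are keyed by hashed n-grams, e.g. H(‘ellta’) and H(‘lltal’). Each entry holds nOccs, nDocs, … and a PostList. A PostList points to chunks (chunk1, chunk2, ...) holding Count0, DocId0, Count1, DocId1, ... The diagram also shows the DocTable alongside the DocId fields.]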
Current status
Basic data structures working
PostingsList
HashTable (for documents, terms)
Structures need to be tied together
Corpus data structure
Term generation (parsing)
Future work
Currently rewriting IR system from scratch
Better memory & posting list management
Support for trying different term weighting schemes & reduction mechanisms
Support for excluding n-grams that won’t matter
Explore tradeoff between disk and memory
Try new weighting algorithms with n-grams
Parallelize the IR engine (-> Linux clusters)
Gauge IR performance for n-grams on large corpora
Conclusions
Demonstrated an n-gram based IR system indexing a gigabyte on a commodity PC
Used compression & disk storage for scaling
Preserved properties of n-gram based retrieval
Found source of performance improvement in scalable IR systems
Compression more helpful than memory residence
Disk access isn’t so bad if the file system is fast