Techniques for Gigabyte-Scale N-gram Based Information Retrieval on PCs



Ethan L. Miller

University of Maryland Baltimore County

elm@csee.umbc.edu

What’s the problem?

 N-gram based IR is becoming more important

 Language-independent

 Garble-tolerant

 Better accuracy (phrases, etc.)?

 Scalability of n-gram IR now necessary

 Adapt traditional (word-based) IR techniques to n-grams

 More unique terms per corpus

 More unique terms per document

 Avoid use of language-dependent techniques





CADIP PI meeting - 9/9/99 Gigabyte-scale N-gram IR on PCs 2

What did we do about it?

 Scaled n-gram based IR system to handle a gigabyte on a commodity (

 Many potential weighting schemes for terms

 Find documents in corpus “similar” to query

 Break query into terms

 Similarity between query and a given document is a function of the term vectors for each

 Results ranked

 Function often looks like a dot product





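The ranking idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the system described in the talk: the function names (`ngrams`, `dot_similarity`, `rank`), the choice of character 5-grams, and the use of raw counts as term weights are all assumptions; the slides note that many weighting schemes are possible.

```python
from collections import Counter

def ngrams(text: str, n: int = 5) -> Counter:
    """Count overlapping character n-grams (the 'terms') in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def dot_similarity(query: Counter, doc: Counter) -> int:
    """Dot product of two sparse term-count vectors."""
    # Iterate over the smaller vector; a Counter returns 0 for missing terms.
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(count * doc[term] for term, count in query.items())

def rank(query_text: str, docs: list[str], n: int = 5) -> list[tuple[int, int]]:
    """Return (score, document index) pairs, best match first."""
    q = ngrams(query_text, n)
    scores = [(dot_similarity(q, ngrams(d, n)), i) for i, d in enumerate(docs)]
    return sorted(scores, reverse=True)
```

A real system would not rebuild document vectors per query; it would walk the postings list of each query term, which is why the size of those lists matters in the slides that follow.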

N-grams vs. words as terms

 Fewer unique words

 Differences of orders of magnitude

 5-grams => 4x words

 6-grams => 10x words

 Longer n-grams => even higher ratios

 More postings per document

 (5-gram postings) / (word postings) ~ 10

 Most 5-gram postings have a count of 1

[Figure: number of unique terms (log scale, 1x10^3 to 1x10^7) vs. corpus size (0 to 35 MB) for words and for 3-grams through 8-grams]
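To see where the extra terms come from, the sketch below counts unique words and unique character n-grams in a piece of text. The helper names are hypothetical and the whitespace-based word split is an assumption; the ratios quoted on the slide come from the author's corpus, not from this code.

```python
def unique_words(text: str) -> int:
    """Number of distinct whitespace-separated words."""
    return len(set(text.split()))

def unique_ngrams(text: str, n: int) -> int:
    """Number of distinct overlapping character n-grams."""
    return len({text[i:i + n] for i in range(len(text) - n + 1)})
```

Applied to one document, `unique_ngrams` is also the number of postings that document contributes to an n-gram index, since each distinct term in a document yields one posting.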




N-gram IR: memory usage

 Postings lists

 Naïve: 12 bytes per entry

 Better: compression!

 N-gram (term) table

 1 entry per n-gram

 ~40 bytes per entry

 Document & file information

 Large structures

 Relatively few instances!

 Most memory used by postings list & n-gram hash table

[Figure: percentage of memory used (0% to 100%) by postings, n-gram table, document table, and file table vs. corpus size (1, 10, 40, 180, 1000 MB)]


Corpus compression

 Compress integers in postings to reduce corpus size

 Posting count

 Document identifier (use difference from previous one in sorted list)

 Try different compression techniques & adjust parameters to best fit n-grams

 Simple compression

 Easy to code

 Effective enough?

 Gamma compression



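Storing each document identifier as the difference from the previous one in the sorted list can be sketched as follows. The function names are hypothetical, and treating the first identifier as a gap from 0 is an assumption of this sketch.

```python
def to_gaps(doc_ids: list[int]) -> list[int]:
    """Turn a sorted list of document ids into gaps from the previous id."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def from_gaps(gaps: list[int]) -> list[int]:
    """Rebuild the document ids by summing the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids
```

The gaps are never larger than the ids they replace and are usually much smaller, which is what lets a variable-length integer code save space.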

5-gram posting counts

 Q: What’s the count for a particular posting in a document?

 A: Almost certainly 1!

 80% of all postings have a count of 1

 98% have a count of 5 or less

 Distribution is more skewed for n-grams than for words

[Figure: cumulative percentage of postings (0% to 100%) vs. count in posting (log scale, 1 to 1000)]










Document identifier gaps

 Curve less steep than that of posting counts

 Curve less steep than corresponding curve for words

 Compression may be less effective

 Parameters may need to be changed

[Figure: cumulative percentage (0% to 100%) vs. gap from previous document (log scale, 1x10^0 to 1x10^4)]








Simple compression

 Raw (uncompressed) index requires 6x storage of documents themselves

 Represent numbers in

 8 bits: 0-127 (2^7 - 1)

 16 bits: 128-16383 (2^14 - 1)

 32 bits: everything else (up to 30 bits)

 Simple compression effectiveness

 960 MB of text -> 1085 MB index

 Factor of 6 reduction from no compression

 gzip compressed index by another factor of 2





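One way to realize the three widths is sketched below. The slide gives only the payload sizes (7, 14, and up to 30 bits); the tag-bit layout used here (a leading `0`, `10`, or `11` to mark the width) and the big-endian byte order are assumptions that happen to fit those sizes, not details confirmed by the talk.

```python
def encode(value: int) -> bytes:
    """Encode a non-negative integer in 1, 2, or 4 bytes."""
    if value < 0 or value >= 1 << 30:
        raise ValueError("value must fit in 30 bits")
    if value < 1 << 7:
        return value.to_bytes(1, "big")                 # 0xxxxxxx
    if value < 1 << 14:
        return (0x8000 | value).to_bytes(2, "big")      # 10xxxxxx xxxxxxxx
    return (0xC0000000 | value).to_bytes(4, "big")      # 11xxxxxx + 3 more bytes

def decode(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one integer starting at pos; return (value, next position)."""
    first = buf[pos]
    if first < 0x80:
        return first, pos + 1
    if first < 0xC0:
        return int.from_bytes(buf[pos:pos + 2], "big") & 0x3FFF, pos + 2
    return int.from_bytes(buf[pos:pos + 4], "big") & 0x3FFFFFFF, pos + 4
```

Because most posting counts are 1 and many gaps are small, most values would take a single byte under this scheme instead of a fixed-width field.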

Gamma compression

 Represent numbers as unary n followed by m-bit binary

 n to m translation table can be tuned

 Adjust translation to minimize number of bits used

 Posting counts

 Represent “1” in 1 bit

 Small numbers have very few bits

 Document gaps

 Small numbers have small representations, but...

 Shallower curve: don’t weight as much towards small numbers
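A minimal sketch of a unary prefix followed by an m-bit binary field, with a tunable n-to-m table, might look like this. The table is passed in as `widths`, where `widths[n]` is the number of binary bits that follow a unary prefix of n. The function names, the convention of writing the prefix as n ones and a terminating zero, and the example tables are all assumptions; the vector actually used in the talk is not reproduced here. Bit strings are used instead of packed bytes to keep the sketch readable.

```python
def gamma_encode(value: int, widths: list[int]) -> str:
    """Encode value >= 1 as a unary bucket number plus a binary offset."""
    if value < 1:
        raise ValueError("values start at 1")
    base = 1                                  # smallest value in the current bucket
    for n, m in enumerate(widths):
        size = 1 << m                         # bucket n holds 2**m values
        if value < base + size:
            offset = value - base
            binary = format(offset, "b").zfill(m) if m else ""
            return "1" * n + "0" + binary
        base += size
    raise ValueError("value too large for this table")

def gamma_decode(bits: str, widths: list[int], pos: int = 0) -> tuple[int, int]:
    """Decode one value starting at pos; return (value, next position)."""
    n = 0
    while bits[pos] == "1":                   # read the unary prefix
        n += 1
        pos += 1
    pos += 1                                  # skip the terminating 0
    m = widths[n]
    base = 1 + sum(1 << w for w in widths[:n])
    offset = int(bits[pos:pos + m], 2) if m else 0
    return base + offset, pos + m
```

With `widths = [0, 1, 2, 3, 4]` the value 1 costs a single bit, which suits posting counts. A table with wider early buckets, such as the hypothetical `[0, 3, 6, 12]`, spends more bits on 2 and 3 but fewer on mid-sized values, which is the kind of trade-off a shallower gap distribution calls for.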




Gamma compression results

 Use single vector for simplicity

 Select for minimal sum of posting counts, document gap sizes

 Vector of worked best

 Within 3% of minimum of each set compressed separately

 Posting counts compressed far more than document gaps

 960 MB of text -> 647 MB of index

 Postings lists = 485 MB

 Overhead (doc info, n-gram headers) = 150 MB




Postings lists: memory vs. disk

 Construct indices for 257 MB corpus

 Run queries with postings lists

 In memory

 On disk

 On disk lists slower, as expected, but…

 Less than 2x slowdown

 Decompression not much slower than disk I/O

 Seek time less critical than we thought

[Figure: running time (seconds, 0 to 50) vs. query size (KB, 0 to 5) for in-memory and on-disk postings lists]




N-gram library rewrite

 Build more efficient data structures

 Better dynamic storage

 Reduction in memory consumption

 Make on-disk storage work better

 More efficient

 Independent of underlying byte order

 Build to standard API

 Reusable component

 Fit with legacy apps








Data structure design

 Main data structures

 TermTable

 Maintain per-term information



 Store term text as hashed 64-bit value



 PostingsList

 Keep compressed postings lists



 Dynamically allocate chunks as needed



 Other structures

 DocTable

 Corpus (includes other structures)

 Structures use templates extensively


Data structures: connections

[Diagram: TermTable entries keyed by H(‘ellta’) and H(‘lltal’), each with nOccs, nDocs, … and PostList fields; PostList chunks (chunk1, chunk2, ...) holding Count0, DocId0, Count1, DocId1, ...; DocTable]






Current status

 Basic data structures working

 PostingsList

 HashTable (for documents, terms)

 Structures need to be tied together

 Corpus data structure

 Term generation (parsing)










Future work

 Currently rewriting IR system from scratch

 Better memory & posting list management

 Support for trying different term weighting schemes & reduction mechanisms

 Support for excluding n-grams that won’t matter

 Explore tradeoff between disk and memory

 Try new weighting algorithms with n-grams

 Parallelize the IR engine (-> Linux clusters)

 Gauge IR performance for n-grams on large corpora




Conclusions

 Demonstrated an n-gram based IR system indexing a gigabyte on a commodity PC

 Used compression & disk storage for scaling

 Preserved properties of n-gram based retrieval

 Found source of performance improvement in scalable IR systems

 Compression more helpful than memory residence

 Disk access isn’t so bad if the file system is fast








