Indexing
Document Sample


Indexing
Accessing Data During Query Evaluation
Scan the entire collection
Typical in early batch retrieval systems
Still used today, in hardware form (e.g., Fast Data Finder)
Computational and I/O costs are O(characters in collection)
Practical only for small collections
Accessing Data During Query Evaluation
Use indexes for direct access
Evaluation time is O(query term occurrences in collection)
Practical for large collections
Many opportunities for optimization
What should the Index contain?
Database systems index primary and secondary
keys
Index provides fast access to a subset of database records
Scan subset to find solution set
IR Problem: Cannot predict keys that people will
use in queries
Every word in a document is a potential search term
Solution: Index by all keys (words)
Some vocabulary about Indexing
File organizations or indexes are used to increase system performance
Text indexing is the process of deciding what
will be used to represent a given document
Index terms are used to build indexes for the
documents
The retrieval model describes how the indexed terms are incorporated into a model
Relationship between retrieval model and indexing
model
Accessing the Index
Index accessed through features or keys or
terms
Keys/terms can be atomic or complex
Most common atomic keys/terms:
Words in text, punctuation
Manually assigned terms (controlled and
uncontrolled vocabulary)
Document structure: sentence and paragraph
boundaries
Inter- or intra-document links (e.g., citations)
Accessing the Index
Composed features
Sequences: phrases, names, dates,
monetary amounts
Sets: synonym classes
Manual vs. Automatic Indexing
Manual or human indexing:
Indexers decide which keywords to assign to a document based on a controlled vocabulary
e.g. MEDLINE, Yahoo
Significant cost
Automatic indexing:
Indexing program decides which words, phrases or other features to use from the text of the document
Indexing speeds range widely
Manual vs. Automatic Indexing
                       Manual                      Automatic
Controlled vocabulary  Current indexing practice   Text categorization, “Intelligent” IR
Free text              Current indexing practice   Text search engines, “Statistical” IR
Manual vs. Automatic Indexing
Experimental evidence is that retrieval using automatic indexing can be at least as effective as retrieval using manual indexing with controlled vocabularies
Experiments have also shown that using both
manual and automatic indexing improves
performance
Some vocabulary words
Index language
Language used to describe documents and queries
Exhaustivity
Number of different topics indexed, completeness
Specificity
Level of accuracy of indexing
Pre-coordinate indexing
Combinations of index terms used as indexing labels
E.g., author lists key phrases of a paper
Post-coordinate indexing
Combinations generated at search time
Most common and the focus of this course
Indexing Choices
What is a word?
Embedded punctuation (e.g. MD-11, hard-core)
Case folding (e.g., New vs new, Apple vs apple)
Stopwords (e.g., the, an, a, on)
Morphology (e.g., computer, compute, computing)
Index granularity has a large impact on speed
and effectiveness
Index term?
Index surface forms?
Both?
Basic automatic Indexing
Parse documents to recognize structure
E.g., title, date, other fields
Scan for word tokens
Numbers, special characters, hyphenation, capitalization, etc.
Stopword removal
Stem words
Weight words
Want more important words to have higher weight
Optional
Phrase indexing
Thesaurus classes
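The steps listed on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the stopword list, the suffix-stripping `stem` function, and the use of raw term frequency as the weight are all assumptions made for the example, not the method of any particular system. A real indexer would use a full stopword list, a proper stemmer (e.g., Porter's), and a more informative weighting scheme.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list used only for this illustration.
STOPWORDS = {"the", "an", "a", "on", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Scan for word tokens: fold case, then split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "er", "es", "ed", "e", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Return {stem: weight}, where weight is the raw term frequency."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return Counter(stem(t) for t in tokens)
```

With these assumptions, "computer", "compute" and "computing" all map to the same index term, which is the effect the morphology bullet on the previous slide asks about.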
Words vs. Terms vs. Concepts
Simple indexing is based on words or word
stems
More complex indexing could include phrases or
thesaurus classes
Concept-based retrieval is often used to imply something beyond word indexing
Words, phrases, synonyms, linguistics can all be evidence used to infer the presence of the concept
E.g., the concept “information retrieval” can be inferred based on the presence of the words “information”, “retrieval”, the phrase “information retrieval” and maybe the phrase “text retrieval”
Phrases
Both statistical and syntactic methods have
been used to identify good phrases
Proven techniques include finding all word
pairs that occur more than n times in the
corpus or using a part of speech tagger to
identify simple noun phrases
1,100,000 phrases extracted from all TREC data
Phrases can have an impact on both
effectiveness and efficiency
Phrase indexing will speed up phrase queries
Finding documents containing “White House” is better than finding documents containing both words
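The statistical technique mentioned above (keep all word pairs that occur more than n times in the corpus) might look like the following sketch. Whitespace tokenization and case folding are assumptions made for brevity, and `frequent_word_pairs` is a hypothetical name.

```python
from collections import Counter


def frequent_word_pairs(documents, n):
    """Return adjacent word pairs that occur more than n times in the corpus."""
    counts = Counter()
    for text in documents:
        words = text.lower().split()
        # zip(words, words[1:]) yields every pair of neighbouring words.
        counts.update(zip(words, words[1:]))
    return {pair: c for pair, c in counts.items() if c > n}
```

A part-of-speech tagger, the other technique named on the slide, would instead filter candidate pairs by their syntactic pattern; that is not shown here.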
Information Extraction
Special recognizers for specific concepts
People, organizations, places, dates, amounts, products
Meta terms such as #COMPANY,
#PERSON can be added to indexing
Indexing Example
Implementations
Common implementations of indexes
Bitmaps
Signature files
Inverted files
Hashing
N-grams
N-grams: further reading is available at
http://catadmin.cattelecom.com/km/blog/kittichonm/category/search-engine/n-gram/
A program that builds character-level N-grams for Thai
The file used to try building N-grams is a Thai news file containing several thousand news items, with a total of 28,694,548 characters (77 MB). These characters include punctuation, digits, and any other characters that occur in the news.
After the program finished running, these were the 10 most frequent characters:
า_1901143
น_1553522
(space)_1493261
ร_1445651
่_1214212
ก_1182815
อ_1089453
เ_1006035
ง_984559
ม_927818
Casual observations:
- The vowel sara aa (า) occurs most often, with a frequency of 1,901,143
- The space is the 3rd most frequent character; at first I expected it to be rare in Thai
- The most frequent bigram is าร (sara aa followed by ro ruea), with a frequency of 311,818
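A character-level N-gram counter of the kind described on this slide can be sketched as follows. This is not the program from the blog post; `char_ngrams` is a hypothetical name. Python strings are sequences of Unicode code points, so Thai vowels and tone marks are counted as separate characters, which matches the way a tone mark appears as its own entry in the list above.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count every run of n consecutive characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

Calling `char_ngrams(text, 1).most_common(10)` on a corpus gives a top-10 list like the one above, and `char_ngrams(text, 2)` gives bigram counts.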
Indexes: Inverted Lists
Inverted lists are today the most common
indexing technique
Source file: collection organized by document
Inverted file: collection organized by term
One record per term, listing locations where term
occurs
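As a rough sketch of the idea, the function below turns a source file (a list of documents) into an inverted file with one record per term. Whitespace tokenization, 1-based document ids, and 1-based word positions are assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each term to a sorted list of (doc_id, [word positions])."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents, start=1):
        for pos, term in enumerate(text.lower().split(), start=1):
            # Record where in this document the term occurs.
            index[term].setdefault(doc_id, []).append(pos)
    return {term: sorted(postings.items()) for term, postings in index.items()}
```

Dropping the position lists and keeping only the document ids gives a document-level inverted file; keeping them gives a word-level one.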
Inverted Lists
During query evaluation, traverse lists for each
query term
OR: the union of component lists
AND: an intersection of component lists
Proximity: an intersection of component lists
SUM: the union of component lists; each entry has a score
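The list operations above can be illustrated with a short sketch. It assumes that AND and OR work on sorted lists of document ids and that SUM works on `{doc_id: score}` mappings; these representations and the function names are choices made for the example. Proximity is not shown: it would start from the same intersection and then compare word positions within each matching document.

```python
def intersect(a, b):
    """AND: linear merge of two sorted doc-id lists, keeping shared ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: every doc id that appears in either list, in sorted order."""
    return sorted(set(a) | set(b))


def score_sum(a, b):
    """SUM: union of two scored lists, adding the scores of shared doc ids."""
    out = dict(a)
    for doc_id, score in b.items():
        out[doc_id] = out.get(doc_id, 0) + score
    return out
```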
Inverted Files
Example text: each line is a document
Inverted Files
Word-Level Inverted File
Index Construction Methods
Memory-based inversion
Sort-based inversion
All above, combined with compression
FAST-INV
Based on text partitioning
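To give a feel for sort-based inversion, here is a minimal in-memory sketch: emit (term, document) pairs in document order, sort them by term, then group each term's pairs into its list. Everything here is held in memory, which is an assumption made for brevity; an implementation for a large collection would write the pairs to disk and sort them in runs that fit in main memory. The function name is hypothetical.

```python
from itertools import groupby


def sort_based_inversion(documents):
    """Build {term: [doc ids]} by sorting (term, doc_id) pairs."""
    # 1. Emit one (term, doc_id) pair per distinct term in each document.
    pairs = [
        (term, doc_id)
        for doc_id, text in enumerate(documents, start=1)
        for term in set(text.lower().split())
    ]
    # 2. Sort by term, then by document id.
    pairs.sort()
    # 3. Each run of equal terms becomes that term's inverted list.
    return {
        term: [doc_id for _, doc_id in group]
        for term, group in groupby(pairs, key=lambda pair: pair[0])
    }
```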
Index Construction: Overview
Total text size 5 GB, with 5 million documents,
40 MB main memory
Expanding the Index
Simplest way to handle document insertion for the inverted file index
Accumulate updates in a stop-press file
For each query issued, the stop-press file is checked
When the stop-press file grows too large, re-index the entire collection
Major disadvantage: to keep performance up to scratch, stop-press files must be kept small, so re-indexing needs to be done often, while it takes longer as the data set keeps growing
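The stop-press scheme can be sketched as a small class. The class name, the `max_stop_press` threshold, and the in-memory list standing in for the stop-press file are all assumptions made for illustration.

```python
class StopPressIndex:
    """Inverted index plus a small list of documents not yet indexed."""

    def __init__(self, documents, max_stop_press=2):
        self.documents = list(documents)
        self.max_stop_press = max_stop_press
        self.stop_press = []
        self._reindex()

    def _reindex(self):
        """Rebuild the main index over the entire collection."""
        self.main = {}
        for doc_id, text in enumerate(self.documents):
            for term in set(text.lower().split()):
                self.main.setdefault(term, []).append(doc_id)
        self.stop_press = []  # everything is now in the main index

    def insert(self, text):
        """Add a document; it is searchable at once via the stop-press list."""
        doc_id = len(self.documents)
        self.documents.append(text)
        self.stop_press.append((doc_id, text))
        if len(self.stop_press) > self.max_stop_press:
            self._reindex()
        return doc_id

    def search(self, term):
        """Look up the main index, then scan the stop-press list as well."""
        term = term.lower()
        hits = list(self.main.get(term, []))
        hits += [d for d, text in self.stop_press if term in text.lower().split()]
        return hits
```

The sketch makes the disadvantage visible: every `search` scans the whole stop-press list, and every `_reindex` walks the whole collection.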
Indexes: Signature Files
Bag of words only
For each term, allocate a fixed-size s-bit vector (signature)
Define hash function:
Each term has an s-bit signature
May not be unique
OR the term signatures to form document
signature
Long documents are a problem
Usually segment them into smaller pieces
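A signature file entry can be sketched as follows. The choice of MD5 as the hash function, s = 64, and three bits set per term are assumptions made for this example; the slide does not specify them. Because different terms may share bits, a match on the signature only means the document may contain the term, so candidate documents still have to be checked.

```python
import hashlib


def term_signature(term, s=64, bits_per_term=3):
    """Hash a term to an s-bit signature with up to bits_per_term bits set."""
    sig = 0
    for k in range(bits_per_term):
        digest = hashlib.md5(f"{k}:{term}".encode("utf-8")).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % s)
    return sig


def document_signature(text, s=64):
    """OR together the signatures of all terms in the document."""
    sig = 0
    for term in set(text.lower().split()):
        sig |= term_signature(term, s)
    return sig


def may_contain(doc_sig, term, s=64):
    """True if every bit of the term's signature is set in doc_sig."""
    t = term_signature(term, s)
    return (doc_sig & t) == t
```

The longer the document, the more bits are set in its signature and the more false matches occur, which is why long documents are usually segmented.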
Encoding and Compression
Encoding transforms data from one representation to
another
Compression is an encoding that takes less space
Lossless: decoder can reproduce message exactly
Lossy: can reproduce message approximately
Degree of compression:
(Original – Encoded) / Encoded
Example: (125 MB - 25 MB) / 25 MB = 400%
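The formula above, and the meaning of lossless, can be checked with a few lines of Python. The use of the standard `zlib` module is an assumption made for the example; any lossless compressor would do, and the function names are hypothetical.

```python
import zlib


def degree_of_compression(original_size, encoded_size):
    """(Original - Encoded) / Encoded, as defined on the slide."""
    return (original_size - encoded_size) / encoded_size


def roundtrip_is_lossless(data):
    """A lossless decoder reproduces the original message exactly."""
    return zlib.decompress(zlib.compress(data)) == data


print(degree_of_compression(125, 25))  # 4.0, i.e. 400%
```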
Compression
Advantages of Compression
Save space in memory (e.g., compressed cache)
Save space when storing (e.g., disk, CD-ROM)
Save time when accessing (e.g., I/O)
Save time when communicating (e.g., over
network)
Compression
Disadvantages of Compression
Costs time and computation to compress
and uncompress
Complicates or prevents random access
May involve loss of information (e.g., JPEG,
MP3)
Makes data corruption much more costly.
Small errors may make all of the data
inaccessible.
Get documents about "