# Lucene


Brian Nisonger, Feb 08, 2006

What is it?
- Doug Cutting’s grandmother’s middle name
- An open source set of Java classes
  - Search Engine/Document Classifier/Indexer
  - Developed by Doug Cutting, 1996
    - Xerox/Apple/Excite/Nutch
    - Wrote several papers in IR
  - http://lucene.sourceforge.net/talks/pisa/

What is it: Nuts and Bolts
- Modules for IR
  - Analysis
    - Tokenization
    - Where tokens are indexed
  - Document
    - Where the Document ID is created
    - Date of document is extracted
    - Title of document is extracted
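
The Analysis step above can be pictured with a small Python sketch. It is a toy illustration only, not Lucene's analyzer classes: the function name `analyze`, the regular expression, and the tiny stopword list are all assumptions made for this example.

```python
import re

# Tiny illustrative stopword list (an assumption, not Lucene's list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is"}


def analyze(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


print(analyze("The Title of the Document is extracted."))
# ['title', 'document', 'extracted']
```

The surviving tokens are what would be handed to the indexer.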

Nuts and Bolts-II
- Modules, cont’d
  - Index
    - Provides ...
  - Query Parser
    - Where the magic of query happens
  - Search
    - Searches across indexes

Nuts and Bolts-III
- Modules, cont’d
  - Search Spans
    - Spans K+/- words
    - Example: Find me a document that has Rachael Ray and Alton Brown within 100 words of each other that also has the term cooking
  - Store/Util
    - Store the indexes and other housekeeping
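
The span example above amounts to a proximity check on token positions. The sketch below is a toy illustration, not Lucene's span query API: the names `positions` and `within_k_words`, the whitespace tokenizer, and the restriction to single-word terms are assumptions made to keep it short.

```python
def positions(tokens, term):
    """Return every position at which `term` occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def within_k_words(text, term_a, term_b, k, required=None):
    """True if term_a and term_b occur within k words of each other
    and (optionally) the required term also appears in the text."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    if required is not None and required not in tokens:
        return False
    pos_a = positions(tokens, term_a)
    pos_b = positions(tokens, term_b)
    return any(abs(a - b) <= k for a in pos_a for b in pos_b)


# Invented sample text: "ray" is at position 0 and "brown" at position 6.
doc = "ray shared a cooking tip and brown explained the science behind it"
print(within_k_words(doc, "ray", "brown", k=100, required="cooking"))  # True
print(within_k_words(doc, "ray", "brown", k=3, required="cooking"))    # False
```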

Theory
- Space Optimization for Total Ranking
  - Cutting et al 1996
  - RIAO (Computer Assisted IR) 1997
  - http://lucene.sf.net/papers/riao97.ps
- Lucene lecture at Pisa
  - Doug Cutting
  - Slides from lecture at University of Pisa, 2004
  - See ...

Vector
- Vectors give a mathematical distance between terms
  - Uses a cosine distance to determine how close terms/documents are
  - This distance can then be used for WSD/Clustering/IR
  - Example:
    - Bass,fishing: .6506
    - Bass,guitar: .000423
    - This tells us the document is about fishing, not about guitars
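
The cosine measure above can be computed directly from term counts. The sketch below is only an illustration: the toy documents are invented, and the function name `cosine_similarity` is an assumption. It returns a similarity, so a larger value means the two vectors are closer, which matches the way the Bass example is read.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-count vectors."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents, invented for illustration.
doc = Counter("the bass took the bait near the fishing boat".split())
fishing = Counter("fishing boat bait lake bass".split())
guitar = Counter("guitar amp strings bass".split())

# The document shares more terms with the fishing vocabulary.
print(cosine_similarity(doc, fishing) > cosine_similarity(doc, guitar))  # True
```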

Vectors-IR
- “Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.”
  - http://www.perl.com/pub/a/2003/02/19/engine.html
- Intro to Comp Ling and its applications to IR
  - Nisonger 2005 :P

Inverted Index
- Term/Doc Id/Weight
  - Term
    - “A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stopword elimination, stemming, filtering, term normalization, or language translation -- has been applied.”
    - http://www.javaworld.com/javaworld/jw-092000/jw-0915-lucene-p2.html

Inverted Index, cont’d
- Doc Id
  - A unique “key” that identifies each document
- Weight
  - Binary
  - Freq Count
  - Weighting Algorithm
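
The Term/Doc Id/Weight layout above can be sketched as a dictionary of postings. This is a minimal illustration, not Lucene's on-disk format: the function name `build_inverted_index`, the whitespace tokenizer, and the sample documents are assumptions, and the weight used is the plain frequency count.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a postings dict {doc_id: weight}.

    The weight here is the raw frequency count; a binary weight
    would store 1 instead.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return index


# Invented sample documents keyed by doc id.
docs = {
    1: "bass fishing on the lake",
    2: "bass guitar and bass amp",
}
index = build_inverted_index(docs)
print(index["bass"])     # {1: 1, 2: 2}
print(index["fishing"])  # {1: 1}
```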

Index Merge
- Only keeps track of the differences between words
- Periodically merges indexes
- Allows new documents to be added easily
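
One way to read the merge idea above: new documents go into a small separate index, which is folded into the main index from time to time. The sketch below shows only that reading and is not Lucene's segment-merging code. The function name `merge_indexes` and the dictionary layout ({term: {doc_id: weight}}) are assumptions.

```python
def merge_indexes(main, delta):
    """Merge the postings of `delta` into a copy of `main` and return it."""
    merged = {term: dict(postings) for term, postings in main.items()}
    for term, postings in delta.items():
        merged.setdefault(term, {}).update(postings)
    return merged


main = {"bass": {1: 1, 2: 2}, "guitar": {2: 1}}
delta = {"bass": {3: 1}, "lake": {3: 2}}  # small index holding only new doc 3

main = merge_indexes(main, delta)
print(main["bass"])  # {1: 1, 2: 2, 3: 1}
```

Adding a document touches only the small index, so the large one is rewritten only when a merge runs.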

Query
- Boolean Search
  - Only searches documents with at least 1 term in query
  - “Boolean Search Engine”
- Parallel Search
  - Each term in query is searched in parallel
  - Partial scores added to queue of docs
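
The two points above can be pictured as per-document score accumulators. The sketch below is an illustration under assumptions, not Lucene's scorer: it walks the query terms one after another rather than truly in parallel, and the name `score_query` and the sample weights are made up.

```python
from collections import defaultdict


def score_query(index, query_terms):
    """Accumulate a partial score per document, one query term at a time.

    Only documents that contain at least one query term ever get an
    accumulator, which mirrors the Boolean behaviour described above.
    """
    partial = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            partial[doc_id] += weight
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(partial.items(), key=lambda item: (-item[1], item[0]))


# Invented postings: {term: {doc_id: weight}}.
index = {
    "bass":    {1: 1.0, 2: 2.0},
    "fishing": {1: 3.0},
    "guitar":  {2: 1.0, 3: 1.0},
}
print(score_query(index, ["bass", "fishing"]))  # [(1, 4.0), (2, 2.0)]
```

Document 3 contains neither query term, so it is never scored.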

Query-II
- Threshold
  - If partial score is too low and will not be part of N-best then the document is ignored even before search is complete
  - Example
    - Potential New Doc [0,0,0,0,0,0,i]
    - Document ranked 14 [233,202,109,100,i]
    - Potential New Doc is ignored
  - Small loss of recall greatly increases speed of search
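
The threshold idea above can be sketched as a rule that stops admitting new documents once they can no longer catch the current N-th best partial score. This is a simplified illustration, not the algorithm from the RIAO paper or Lucene's implementation. The name `top_n_with_threshold`, the index layout, and the use of each term's maximum weight as an upper bound are assumptions. This variant only skips documents that provably cannot overtake the current N-th best, whereas the slide describes a more aggressive cut-off that trades a little recall for speed.

```python
import heapq
from collections import defaultdict


def top_n_with_threshold(index, query_terms, n):
    """Term-at-a-time scoring that stops admitting new documents once
    they can no longer exceed the current n-th best partial score."""
    # Upper bound on what each term can add to any single document.
    max_weight = {t: max(index[t].values()) for t in query_terms if index.get(t)}
    terms = sorted(max_weight, key=max_weight.get, reverse=True)

    partial = defaultdict(float)
    for i, term in enumerate(terms):
        # Most that a not-yet-seen document could still collect.
        remaining = sum(max_weight[t] for t in terms[i:])
        best = heapq.nlargest(n, partial.values())
        admit_new = len(best) < n or best[-1] < remaining
        for doc_id, weight in index[term].items():
            if doc_id in partial or admit_new:
                partial[doc_id] += weight
    return heapq.nlargest(n, partial.items(), key=lambda item: item[1])


# Invented postings: {term: {doc_id: weight}}.
index = {
    "lucene": {1: 5.0, 2: 4.0},
    "search": {1: 1.0, 3: 1.0},
}
print(top_n_with_threshold(index, ["lucene", "search"], n=2))
# [(1, 6.0), (2, 4.0)]
```

Document 3 is never given an accumulator here: when "search" is processed, the second-best partial score is already 4.0 and a new document could collect at most 1.0.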

Evaluation of Lucene
- Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering
  - Tellex et al, MIT AI Lab 2003
- Question
  - <Who is the president?> <George W. Bush .76>

Evaluation-II
- Prise
  - An IR system developed by NIST that according to the paper uses “modern” search engine techniques
- Findings
  - Found Prise was better than Lucene since “Boolean” query engines are considered old school and its answers to questions were better

Eval-III
- Lucene
  - Found that although Prise had better correct answers, Lucene found more documents containing relevant information

Eval-Conclusion
- External Knowledge Sources for Question Answering
  - Katz et al, MIT Lab 2005
  - http://people.csail.mit.edu/gremio/publications/TREC2005.ps
- MIT used Lucene in their 2005 TREC submission, not Prise

Users
- Lucene is used widely
  - TREC
  - Document Retrieval Enterprise Systems
  - Part of Database/Web engine
  - Part of Nutch
  - Used by academics for large projects
    - MIT AI Lab
    - Know-It-All Project (UW)

Conclusions
- Lucene is a good set of classes
  - Designed to allow customization without having to “reinvent the wheel”
  - Robust
  - Fast
  - Large development groups
  - Used widely in academia and industry

Questions?
