Lucene and Solr

Lucene
    ◦ Doug Cutting
         Created in 1999
         Donated to Apache in 2001
   Features
    ◦   Highly scalable
    ◦   Java (1.4)
    ◦   Ports to many other languages
    ◦   No crawler
    ◦   No document parsing
    ◦   No “PageRank”
Lucene
 ◦ Powered by Lucene
     IBM Omnifind Y! Edition
     Technorati
     Wikipedia
     Internet Archive
     LinkedIn
     monster.com
Indexing
   Logical structure
    ◦ Index is a collection of documents
    ◦ Documents are collections of fields
    ◦ Fields are the content
       Stored – Stored verbatim for retrieval with results
       Indexed – Tokenized and made searchable
    ◦ Indexed terms are stored in an inverted index
   Physical structure
    ◦ Multiple documents (with all fields) are stored in
      segments
       mergeFactor – how many segments accumulate before they are merged
    ◦ All segments together make up the index
   IndexWriter is the interface object for the entire index
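Indexing
   Example: inverted index over three documents
    ◦ Documents
       0   Little Red Riding Hood
       1   Robin Hood
       2   Little Women
    ◦ Term dictionary with postings (document ids)
       aardvark
       hood       0   1
       little     0   2
       red        0
       riding     0
       robin      1
       women      2
       zoo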
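
The lookup structure on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's implementation: the function name build_inverted_index is made up for this example, and "analysis" is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Stand-in for analysis: lowercase, then split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort the terms so the result reads like a term dictionary.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


titles = ["Little Red Riding Hood", "Robin Hood", "Little Women"]
index = build_inverted_index(titles)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

A search for "hood" then becomes a dictionary lookup that returns documents 0 and 1, without scanning any document text.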
Indexing
   Analysis
    ◦ Extract tokens from text (tokenizer)
      Whitespace
      Hyphens
    ◦ Manipulate or modify tokens (token filter)
      Stemming
      Removal
    ◦ Tokenizer / Token Filter chains are called
      analyzers
Indexing
      LexCorp BFG-9000

        WhitespaceTokenizer

       LexCorp      BFG-9000

      WordDelimiterFilter catenateWords=1

       Lex       Corp    BFG    9000
              LexCorp

              LowercaseFilter

       lex       corp     bfg   9000
              lexcorp
Searching
   Query Creation
    ◦ Query parser
    ◦ Manual query construction from terms
    ◦ title:"Bell" author:"Hemingway"^3.0

   Query terms are analyzed
    ◦ Same analyzer for indexing and searching on
      each field
           Searching
LexCorp BFG-9000                            Lex corp bfg9000

  WhitespaceTokenizer                         WhitespaceTokenizer

 LexCorp      BFG-9000                       Lex     corp      bfg9000

WordDelimiterFilter catenateWords=1         WordDelimiterFilter catenateWords=0

 Lex       Corp    BFG    9000              Lex      corp       bfg      9000
        LexCorp

        LowercaseFilter                            LowercaseFilter

 lex       corp     bfg   9000               lex     corp       bfg      9000
        lexcorp
                                 A Match!
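
The two chains in the diagram above can be imitated with a small Python sketch. It is a rough approximation under stated assumptions, not Solr's WordDelimiterFilter: the function names are invented for this example, the splitting rules cover only case changes, letter/digit boundaries and punctuation, and token positions are ignored.

```python
import re

# Sub-word pieces: capitalized or lowercase runs, all-caps runs, digit runs.
_PARTS = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+")


def whitespace_tokenizer(text):
    return text.split()


def word_delimiter_filter(tokens, catenate_words=False):
    out = []
    for token in tokens:
        parts = _PARTS.findall(token)
        out.extend(parts)
        if catenate_words:
            run = []
            # The empty-string sentinel flushes the final run of word parts.
            for part in parts + [""]:
                if part.isalpha():
                    run.append(part)
                else:
                    if len(run) > 1:
                        out.append("".join(run))
                    run = []
    return out


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def analyze(text, catenate_words):
    tokens = whitespace_tokenizer(text)
    tokens = word_delimiter_filter(tokens, catenate_words)
    return lowercase_filter(tokens)


index_terms = analyze("LexCorp BFG-9000", catenate_words=True)
query_terms = analyze("Lex corp bfg9000", catenate_words=False)
print(index_terms)  # ['lex', 'corp', 'lexcorp', 'bfg', '9000']
print(query_terms)  # ['lex', 'corp', 'bfg', '9000']
print(set(query_terms) <= set(index_terms))  # True
```

Every query term appears among the indexed terms, which is why the differently written query still matches the document.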
Searching
   Many query types
      Term
      Phrase
          "bad wolf"
      Proximity
          "quick fox"~4
      Wildcard and prefix
          pla?e                (plate or place or plane)
          practic*             (practice or practical or practically)
      Fuzzy (edit distance)
          planting~0.75        (granting or planning)
          roam~                (default is 0.5)
      Range
          date:[05072007 TO 05232007]    (inclusive)
          author:{king TO mason}         (exclusive)
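
As a rough illustration of the wildcard examples, the sketch below expands a pattern against a small list of indexed terms. The term list and the function name wildcard_matches are made up for this example. Python's fnmatch happens to use the same ? and * wildcards, but it also treats [ ] specially, so this is an analogy rather than Lucene's matching code.

```python
from fnmatch import fnmatchcase


def wildcard_matches(pattern, terms):
    """Return the indexed terms that match a ?/* wildcard pattern."""
    return [term for term in terms if fnmatchcase(term, pattern)]


# Hypothetical slice of a term dictionary, in sorted order.
terms = ["place", "plane", "plate", "played",
         "practical", "practically", "practice", "prance"]

print(wildcard_matches("pla?e", terms))     # ['place', 'plane', 'plate']
print(wildcard_matches("practic*", terms))  # ['practical', 'practically', 'practice']
```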
Searching
   Multiple searchers at once
    ◦ Thread safe
   Additions or deletions to the index are not
    reflected in already open searchers
    ◦ Searchers must be closed and reopened to see changes
   Use commit or optimize on the IndexWriter
Lucene Sub-projects
   Nutch
    ◦ Web crawler with document parsing
   Hadoop
    ◦ Distributed data processor
    ◦ Implements MapReduce
   Solr
Solr
    ◦ Yonik Seeley
         Developed at CNET
         Donated to Apache in 2006
   Features
    ◦   Servlet
    ◦   Web Administration Interface
    ◦   XML/HTTP, JSON Interfaces
    ◦   Faceting
    ◦   Schema to define types and fields
    ◦   Highlighting
    ◦   Caching
    ◦   Index Replication (Master / Slaves)
    ◦   Pluggable
    ◦   Java 5
Solr
 ◦ Powered by Solr
     Netflix
     CNET
     Smithsonian
     AOL: sports and music
     RightNow ??
     Drupal module
     GameSpot
Configuration (solrconfig.xml)
<mainIndex>
   <useCompoundFile>false</useCompoundFile>
   <mergeFactor>10</mergeFactor>
   <maxBufferedDocs>1000</maxBufferedDocs>
   <maxMergeDocs>2147483647</maxMergeDocs>
   <maxFieldLength>10000</maxFieldLength>
</mainIndex>


<requestHandler name="standard" class="solr.StandardRequestHandler" />
<requestHandler name="custom" class="your.package.CustomRequestHandler" />


<autoCommit>
   <maxDocs>10000</maxDocs>
   <maxTime>1000</maxTime>
</autoCommit>


<queryResponseWriter name="xml" class="org.apache.solr.request.XMLResponseWriter"
   default="true"/>
Schema (schema.xml)
Fields
<uniqueKey>id</uniqueKey>


<field name="products" type="text" indexed="true" stored="true"/>
<field name="keywords" type="text_ws" indexed="true" stored="true"/>
<field name="keywordsSorted" type="text_sorted" indexed="true" stored="false"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>


<dynamicField name="*_i" type="integer" indexed="true" stored="true"/>
<dynamicField name="desc_*" type="string" indexed="true" stored="false"/>


<copyField source="keywords" dest="keywordsSorted"/>
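
One way to picture how a field name is resolved against this schema is the sketch below. It is a simplified model, not Solr's code: the names FIELDS, DYNAMIC_FIELDS and resolve_field_type are invented here, and the rule "explicit fields first, then dynamic patterns in list order" is an assumption of the sketch.

```python
from fnmatch import fnmatchcase

# Explicit <field> declarations from the slide: name -> type.
FIELDS = {
    "products": "text",
    "keywords": "text_ws",
    "keywordsSorted": "text_sorted",
    "timestamp": "date",
}

# <dynamicField> declarations from the slide: (pattern, type).
DYNAMIC_FIELDS = [("*_i", "integer"), ("desc_*", "string")]


def resolve_field_type(name):
    """Return the type for a field name, or None if the schema has no match."""
    if name in FIELDS:
        return FIELDS[name]
    for pattern, type_name in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return type_name
    return None


print(resolve_field_type("keywords"))   # text_ws
print(resolve_field_type("price_i"))    # integer
print(resolve_field_type("desc_long"))  # string
print(resolve_field_type("color"))      # None
```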
Schema
Analyzers
<fieldtype name="nametext" class="solr.TextField">
   <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>


<fieldtype name="text" class="solr.TextField">
   <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StandardFilterFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory"/>
          <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
</fieldtype>


<fieldtype name="myfieldtype" class="solr.TextField">
   <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SnowballPorterFilterFactory" language="German" />
   </analyzer>
</fieldtype>
Insertion
◦   HTTP POST to http://localhost:8983/solr/update/
<add>
    <doc>
             <field name="employeeId">05991</field>
             <field name="office">Bridgewater</field>
             <field name="skills">Perl</field>
             <field name="skills">Java</field>
    </doc>
    [<doc> ... </doc>[<doc> ... </doc>]]
</add>



Documents or fields can have boosts attached
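
A minimal sketch of building the add message above with Python's standard library. The helper name build_add_message is made up for this example. The request object is only constructed, never sent, so the code runs without a Solr server. The Content-Type value is an assumption.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Build an <add> message; each doc is a list of (field name, value) pairs."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields:
            # A repeated name (skills) simply yields several <field> elements.
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


payload = build_add_message([
    [("employeeId", "05991"), ("office", "Bridgewater"),
     ("skills", "Perl"), ("skills", "Java")],
])

request = urllib.request.Request(
    "http://localhost:8983/solr/update/",
    data=payload.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8"},  # assumed header value
    method="POST",
)
# urllib.request.urlopen(request) would send it; omitted so no server is needed.
print(payload)
```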
Update / Delete
 Inserting a document with already present
  uniqueKey will erase the original
 Deleting
    ◦ By uniqueKey field
     <delete><id>05991</id></delete>

    ◦ By query
     <delete><query>name:Anthony</query></delete>

  <commit/>
  <optimize/>
Search
   Core parameters
      qt – query type (request handler)
      wt – writer type (response writer)
   Common parameters
        q
        sort
        start
        rows
        fq – filters
        fl – return fields
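
The parameters listed above are sent as an ordinary URL query string. The sketch below assembles one with the standard library. The helper name build_select_url and the field names used in fl, fq and sort are hypothetical. The /solr/select path is the one used elsewhere in these slides.

```python
from urllib.parse import urlencode


def build_select_url(base, q, **params):
    """Build a select URL; list values become repeated parameters (e.g. several fq)."""
    query = [("q", q)]
    for key, value in params.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        query.extend((key, v) for v in values)
    return base + "?" + urlencode(query)


url = build_select_url(
    "http://localhost:8983/solr/select",
    "ipod",
    start=0,
    rows=10,
    fl="id,name",        # hypothetical field names
    fq="inStock:true",   # hypothetical filter
    sort="price asc",    # hypothetical sort field
    wt="xml",
)
print(url)
```

urlencode takes care of escaping spaces, colons and commas, which are common in sort and fq values.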
Search
   Faceting
    ◦ Available in StandardRequestHandler and
      DisMaxRequestHandler
Search
http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.field=cat&facet.mincount=1&facet.field=inStock

<response>
   <responseHeader>
      <status>0</status>
      <QTime>3</QTime>
   </responseHeader>
   <result numFound="4" start="0"/>
   <lst name="facet_counts">
      <lst name="facet_queries"/>
      <lst name="facet_fields">
         <lst name="cat">
            <int name="music">1</int>
            <int name="connector">2</int>
            <int name="electronics">3</int>
         </lst>
         <lst name="inStock">
            <int name="false">3</int>
            <int name="true">1</int>
         </lst>
      </lst>
   </lst>
</response>
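
A client can turn this response into plain dictionaries with a few lines of standard-library Python. This is a sketch only: the function name parse_facet_fields is made up for this example, and RESPONSE is a trimmed copy of the XML shown above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the facet response from the slide.
RESPONSE = """
<response>
   <result numFound="4" start="0"/>
   <lst name="facet_counts">
      <lst name="facet_queries"/>
      <lst name="facet_fields">
         <lst name="cat">
            <int name="music">1</int>
            <int name="connector">2</int>
            <int name="electronics">3</int>
         </lst>
         <lst name="inStock">
            <int name="false">3</int>
            <int name="true">1</int>
         </lst>
      </lst>
   </lst>
</response>
"""


def parse_facet_fields(xml_text):
    """Return {facet field: {value: count}} from a Solr XML facet response."""
    root = ET.fromstring(xml_text)
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    facets = {}
    for field in root.findall(path):
        facets[field.get("name")] = {
            count.get("name"): int(count.text) for count in field.findall("int")
        }
    return facets


facets = parse_facet_fields(RESPONSE)
print(facets["cat"])      # {'music': 1, 'connector': 2, 'electronics': 3}
print(facets["inStock"])  # {'false': 3, 'true': 1}
```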
Many more features
   Replication
    ◦ Master / Slave architecture for load balancing
      and backups
 More-like-this
 Easy to add RequestHandlers and
  ResponseWriters
 Responses in many formats
 Hit highlighting
Sources
   http://lucene.apache.org/
   http://lucene.apache.org/solr/
   http://people.apache.org/~yonik/presentations/

				