Docstoc

Sudarshan Gaikaiwari - Lucene @ Yelp

Document Sample
Sudarshan Gaikaiwari - Lucene @ Yelp Powered By Docstoc
					Lucene @ Yelp

Sudarshan Gaikaiwari
Bio

1. Over a decade of experience in information retrieval
2. Used IR techniques at Symantec's DLP group
3. Search Engineer at Yelp
Outline

1. Overview of search services at Yelp
2. Federation Motivation
3. Lucy Indexing
4. Lucy Searching
5. Efficiently Retrieving top k hits
The services we provide
Lucy: business search
Lucy also powers phone search
Cathy: she 'talks' a lot
Listsearch: it searches lists....
Reviewsearch: it searches reviews....
DYM: did you really mean that?
Suggest: auto completion
Federation Motivation
Problem




      Search is too slow
Hard Disk Seek Latency
                     Disk seek 10,000,000 ns




                         Source Software Engineering Advice from
                         Building Large-Scale Distributed Systems
                         Jeffery Dean
RAM read latency

                   Main memory
                   reference
                   100 ns
Pinning Index in RAM

● vmtouch
● mlock
● http://hoytech.com/vmtouch/
Problem

Index is too large fit in memory on a single machine
Geographical sharding
Geographical Sharding drawbacks

1. Cumbersome manual process to determine shard boundary
2. No guarantee that a boundary can be found.
Federation

1. Split index across multiple machines
2. Shard on business id
3. TF-IDF scores from different machines should be
   comparable
Mapping businesses to shards

 1. Assigning businesses to shards

shard = shardlist[hash(business_id) % len(shardlist)]

Problems
1. Involves re-indexing all the businesses if we want to add a
new shard
Virtual Nodes
Advantages

1. Flexibility (move vbuckets from one shard to another)
2. Split hot spot shards
Lucy Master Slave Architecture

Separate indexing (masters)
A master for each shard of a service

Searching (slaves)
A slave for every replica of a service
Lucy Indexing
Lucy Searching
Federator: Combining results across
shards
1. Once we distribute an index across shards we need a
   component which will search all these shards and combine
   their results.
2. Written in Python (runs inside a python web process).
3. Uses Tornado IO loop to send requests to all shards.
4. The transfer protocol for the requests in JSON RPC
Lucy Server
Tokens to Business Attributes
Executing queries

1. Gather the top results for a query
2. Collect attribute statitics for attributes like places, categories
Lucene

1. Efficiently executes queries over the index
2. Provides how relevant the business is to the words in the
   query (word score)
3. Upgrading lucene to 2.9/3.1 is WIP
Successive geobounds relaxation
Successive geobounds relaxation
Federation
Efficiently Retrieving top k hits

 1. When user moves through multiple pages the number of
    hits to be returned increases

num hits = start + count

2. So if we need to retrieve 500 hits the naive way would be to
retrieve 500 hits from each shard and then sort them
Distribution of hits in shards
Probability a hit is in a shard
Binomial Distribution
Probability (r of top k hits) are in a particular shard




Mean



Variance
Formula

Std Deviation




Formula
Simulation


    Formula   Hits selected from each   Results Missed (%)
              shard
              k = 100
              p = 0.2

                          24                      0.017




                          32                    0.0001407




                          44                    0.00000
Simulation Graph
Results

1. ~ 50% savings over 100 hits (44 hits requested from each
   shard)
2. 77% savings over 1000 hits (228 hits requested from each
   shard)
Future work

1. In memory index
2. Move towards real time search
Come Join Us!
Thank You




            smg@yelp.com

				
DOCUMENT INFO
Categories:
Stats:
views:30
posted:6/14/2011
language:English
pages:54