MINERVA Infinity:
A Scalable Efficient Peer-to-Peer
Search Engine
Sebastian Michel, Max-Planck-Institut für Informatik, Saarbrücken, Germany, smichel@mpi-inf.mpg.de
Peter Triantafillou, University of Patras, Rio, Greece, peter@ceid.upatras.gr
Gerhard Weikum, Max-Planck-Institut für Informatik, Saarbrücken, Germany, weikum@mpi-inf.mpg.de
Middleware 2005
Grenoble, France
Vision
• Today: Web search is dominated by centralized engines ("to google")
– censorship?
– single point of attack/abuse
– coverage of the web?
• Ultimate goal: "Distributed Google" to break information monopolies
• P2P approach is best suited
– large number of peers
– exploit mostly idle resources
– intellectual input of user community
MINERVA Infinity: A Scalable Efficient P2P Search Engine
Challenges
• large scale networks
– 100,000 to 10,000,000 users
• large collections
– > 10^10 documents
– 1,000,000 terms
• high dynamics
Questions
• Network Organization
– structured?
– hierarchical?
– unstructured?
• Data Placement
– move data around?
– data remains at the owner?
• Scalability?
• Query Routing/Execution
– Routing indexes?
– Message flooding?
Overview
• Motivation (Vision/Challenges/Questions)
• Introduction to IR and P2P Systems
• P2P-IR
• Minerva Infinity
• Network Organization
• Data Placement
• Query Processing
• Data Replication
• Experiments
• Conclusion
Information Retrieval Basics
[Figure: a document and its terms, annotated with term frequencies (# of terms), e.g. 5x, 7x, 4x]
Information Retrieval Basics (2)
Top-k Query Processing: find k documents with the highest total score
Query Execution: Usually using B+ tree on terms
Some kind of threshold algorithm*:
- sequential scans over the index lists (round-robin)
- (random accesses to fetch missing scores)
- aggregate scores
- stop when the threshold is reached
* e.g. Fagin's algorithm TA or a variant without random accesses
[Figure: index lists with (DocId: tf*idf) entries, sorted by score; sample entries include d53: 0.8, d51: 0.6, d28: 0.7]
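The steps listed on this slide can be sketched in a few lines of Python. This is a minimal, centralized illustration of a threshold-style algorithm with random accesses, not the implementation used in the talk; the function name `threshold_algorithm`, the in-memory list layout, and summation as the score aggregation are assumptions made for the example.

```python
import heapq

def threshold_algorithm(index_lists, k):
    """Return the k (doc_id, total_score) pairs with the highest summed score.

    index_lists: one list per query term, each holding (doc_id, score)
    pairs sorted by score in descending order (assumed layout).
    """
    # Dictionaries stand in for random accesses to fetch missing scores.
    random_access = [dict(lst) for lst in index_lists]
    totals = {}
    max_depth = max((len(lst) for lst in index_lists), default=0)

    for depth in range(max_depth):
        last_seen = []
        # Round-robin sequential scan: one entry from every list per round.
        for lst in index_lists:
            if depth < len(lst):
                doc_id, score = lst[depth]
                last_seen.append(score)
                if doc_id not in totals:
                    # Aggregate the document's scores over all lists.
                    totals[doc_id] = sum(ra.get(doc_id, 0.0) for ra in random_access)
            else:
                last_seen.append(0.0)

        # No unseen document can score higher than this threshold.
        threshold = sum(last_seen)
        top_k = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        if len(top_k) == k and top_k[-1][1] >= threshold:
            return top_k

    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])
```

The scan stops as soon as the k-th best aggregated score reaches the threshold, so low-scoring tails of the index lists are usually never read.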
P2P Systems
• Peer:
– "one that is of equal standing with another"
(source: Merriam-Webster Online Dictionary)
• Benefits:
– no single point of failure
– resource/data sharing
• Problems/Challenges:
– authority/trust/incentives
– high dynamics
– …
• Applications:
– File Sharing
– IP Telephony
– Web Search
– Digital Libraries
Structured P2P Systems based on
Distributed Hash Tables (DHTs)
• “structured” P2P networks
• provide one simple method:
lookup: key -> peer
• robustness to load skew, failures, dynamics
• CAN [SIGCOMM 2001]
• CHORD [SIGCOMM 2001]
• Pastry [Middleware 2001]
• P-Grid [CoopIS 2001]
Chord
• Peers and keys are mapped to the same cyclic ID space using a hash function
• Key k (e.g., hash(file name)) is assigned to the node with key p (e.g., hash(IP address)) such that k ≤ p and there is no node p' with k ≤ p' and p' < p
[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56 and keys k10, k24, k30, k38, k54]
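The assignment rule on this slide (a key goes to the first peer whose ID is greater than or equal to the key, wrapping around the ring) can be illustrated with a short sketch. The function name `responsible_peer` is hypothetical, and the peer IDs are taken from the ring figure.

```python
from bisect import bisect_left

def responsible_peer(key, peer_ids):
    """Return the ID of the peer that stores `key` on a Chord-style ring."""
    ring = sorted(peer_ids)
    # First peer ID that is >= key; past the largest ID we wrap to the smallest.
    i = bisect_left(ring, key)
    return ring[i % len(ring)]

PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
```

With these peers, key 10 lands on peer 14, keys 24 and 30 on peer 32, key 38 on peer 38, and key 54 on peer 56.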
Chord (2)
• Using finger tables to speed up lookup process
• Store pointers to few distant peers
• Lookup in O(log n) steps
[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p51, p56; Lookup(54) is routed to the peer holding k54]
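A small simulation can make the finger-table idea concrete. The sketch below assumes a 6-bit ID space and the peer set from the first Chord figure; the helper names (`successor`, `fingers`, `lookup`) are hypothetical, and the whole ring is held in one process purely for illustration.

```python
from bisect import bisect_left

M = 6                      # assumed number of ID bits
RING = 2 ** M              # size of the cyclic ID space
PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]  # sorted peer IDs

def successor(ident):
    """First peer at or after `ident` on the ring."""
    i = bisect_left(PEERS, ident % RING)
    return PEERS[i % len(PEERS)]

def between(x, a, b):
    """True if x lies strictly inside the ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b

def fingers(node):
    """Finger i points to the successor of node + 2^i."""
    return [successor(node + 2 ** i) for i in range(M)]

def lookup(start, key):
    """Return the list of peers visited while resolving `key` from `start`."""
    path = [start]
    node = start
    while True:
        succ = successor(node + 1)
        if key == succ or between(key, node, succ):
            path.append(succ)      # the successor is responsible for the key
            return path
        nxt = succ                 # fallback: plain successor step
        for f in reversed(fingers(node)):
            if between(f, node, key):
                nxt = f            # farthest finger that does not overshoot
                break
        node = nxt
        path.append(node)
```

Each hop jumps to the farthest known peer that still precedes the key, which roughly halves the remaining distance and gives the O(log n) bound stated on the slide.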
Overview
• Motivation (Vision/Challenges/Questions)
• Introduction to IR and P2P Systems
• P2P-IR
• Minerva Infinity
• Network Organization
• Data Placement
• Query Processing
• Data Replication
• Experiments
• Conclusion
P2P-IR
• Share documents (e.g. Web pages) in an
efficient and scalable way
• Ranked retrieval
– simple DHT is insufficient
Possible Approaches
• Each peer is responsible for storing the COMPLETE index list for a subset of terms.
• Query Routing: DHT lookups
• Query Execution: Distributed Top-k [TPUT '04, KLEE '05]
• Drawback: capacity overload of peers with highly frequent / popular terms (data load AND query load)
[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56]
Possible Approaches (2)
• Each peer has its own local index (e.g., created by web crawls)
• Distributed Directory: Term -> List of Peers
• Query Routing:
1. DHT lookups
2. Retrieve Metadata
3. Find most promising peers
• Query Execution:
- Send the complete Query and merge the incoming results
• capacity overload of peers with
- highly frequent terms
- high-quality collections
[Figure: peers P1 to P6 arranged around the distributed directory]
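The routing and execution steps of this directory-based approach can be sketched as follows. The slide does not say how "most promising" is measured, so the sketch assumes the directory metadata is a per-term document frequency and ranks peers by the sum over the query terms; the names `select_peers` and `merge_results` are hypothetical.

```python
from collections import defaultdict

def select_peers(directory, query_terms, max_peers):
    """Pick the peers that look most promising for the query.

    directory: term -> {peer: document_frequency}, standing in for the
    metadata retrieved through DHT lookups (assumed layout).
    """
    promise = defaultdict(int)
    for term in query_terms:
        for peer, doc_freq in directory.get(term, {}).items():
            promise[peer] += doc_freq
    # Highest summed frequency first; peer name breaks ties deterministically.
    ranked = sorted(promise, key=lambda peer: (-promise[peer], peer))
    return ranked[:max_peers]

def merge_results(per_peer_results, k):
    """Merge (doc_id, score) lists returned by the selected peers."""
    best = {}
    for results in per_peer_results:
        for doc_id, score in results:
            best[doc_id] = max(score, best.get(doc_id, 0.0))
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))[:k]
```

Peers holding frequent terms or strong collections are selected by almost every query in this scheme, which matches the overload concern named on the slide.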
Overview
• Motivation (Vision/Challenges/Questions)
• Introduction to IR and P2P Systems
• P2P-IR
• Minerva Infinity
• Network Organization
• Data Placement
• Query Processing
• Data Replication
• Experiments
• Conclusion
Minerva Infinity
• Idea:
– assign (term, docId, score) triplets to the peers
• order preserving
• load balancing
– hash(score) + hash(term) as offset
– guarantee 100% recall
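One way to read "hash(score) + hash(term) as offset" is sketched below: the term selects a starting point on the ring and the score selects a position relative to it. The ring size, the segment length, the linear score mapping, and the names `term_offset` and `placement` are all assumptions for illustration; the slide does not specify them.

```python
import hashlib

RING_SIZE = 2 ** 16   # assumed size of the ID space
SEGMENT = 2 ** 10     # assumed number of ring positions spanned by one term

def term_offset(term):
    """Pseudo-random starting point on the ring for a term."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE

def placement(term, score):
    """Ring position of a (term, docId, score) triplet, score in [0, 1].

    Higher scores are placed closer to the term's offset, so walking the
    ring from the offset visits entries in descending score order.
    """
    position = int((1.0 - score) * (SEGMENT - 1))
    return (term_offset(term) + position) % RING_SIZE
```

Different terms start at different offsets, which spreads their lists over the ring, while the order of scores within one term is kept.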
Hash Function
• Requirements:
– Load balancing (to avoid overloading peers)
– Order preserving (to make the QP work)
• One without the other is trivial ...
– Load balancing: apply a pseudo random hash function
– Order preserving: $\frac{S - S_{min}}{S_{max} - S_{min}} \cdot N$
• Both together is challenging …
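The tension between the two requirements can be seen in a small experiment. The sketch below implements both trivial hash functions and counts the load a linear order-preserving hash produces on skewed scores; the function names and the synthetic scores 1, 1/2, 1/3, ... are assumptions chosen only to make the skew visible.

```python
import hashlib

def pseudo_random_peer(score, n_peers):
    """Load balancing only: neighbouring scores land on unrelated peers."""
    digest = hashlib.sha1(repr(score).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_peers

def linear_peer(score, s_min, s_max, n_peers):
    """Order preserving only: (S - Smin) / (Smax - Smin) * N, clipped to N - 1."""
    position = int((score - s_min) / (s_max - s_min) * n_peers)
    return min(position, n_peers - 1)

def load_per_peer(scores, n_peers):
    """Count how many scores the linear hash assigns to each peer."""
    load = [0] * n_peers
    for score in scores:
        load[linear_peer(score, 0.0, 1.0, n_peers)] += 1
    return load

# Synthetic, heavily skewed scores: most values are close to zero.
SKEWED_SCORES = [1.0 / (i + 1) for i in range(1000)]
```

On these scores and four peers, the linear hash sends 996 of the 1000 entries to the peer in charge of the lowest score range, which is the overload the load-balancing requirement is meant to prevent.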
Hash Function (2)
• Assume an exponential score distribution
• Place the first half of the data on the first peer
• The next quarter on the next peer
• and so on …
[Figure: score axis running from 1 down to 0]
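The halving scheme can be sketched under one possible reading of the bullets: the score axis from 1 down to 0 is split repeatedly, so scores in (0.5, 1] go to the first peer, scores in (0.25, 0.5] to the next, and so on. Whether the slide halves the score range or the data volume is not stated, so this interpretation, the function name `halving_peer`, and the clipping to the last peer are assumptions.

```python
import math

def halving_peer(score, n_peers):
    """Map a score in [0, 1] to a peer index using repeatedly halved ranges.

    Peer 0 covers (0.5, 1], peer 1 covers (0.25, 0.5], and so on; the last
    peer takes everything that remains, including a score of 0.
    """
    if score <= 0.0:
        return n_peers - 1
    score = min(score, 1.0)
    segment = math.floor(-math.log2(score))
    return min(segment, n_peers - 1)
```

The mapping is order preserving (a higher score never maps to a later peer), and the narrow ranges near zero are where a skewed score distribution puts most of its entries.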
Term Index Networks (TINs)
• Reduce # of hops during QP by reducing the number of peers that maintain the index list for a particular term
• Only a small subset of peers is used to store an index list.
[Figure: a global network of peers and three term index networks (A, B, C), each formed by a small subset of the peers]
How to Create/Find a TIN
• Use u beacon peers to bootstrap the TIN for term t
p = 1/u
for i = 0 to i < n' do
    id = hash(t, i*p)
    if (i > 0) use hash(t, (i-1)*p) as a gateway to the TIN
    else node with id creates the TIN
end for
• Beacon nodes act as gateways to the TIN
[Figure: global network with the TIN for term T]
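The bootstrap loop can be written out as a runnable sketch. It assumes the loop runs over all u beacons (the slide writes the bound as n'), and it uses a hypothetical `ring_id(term, score)` as a stand-in for hash(t, score); the ring size and segment length are likewise assumptions.

```python
import hashlib

RING_SIZE = 2 ** 16   # assumed size of the ID space
SEGMENT = 2 ** 10     # assumed number of ring positions spanned by one term

def ring_id(term, score):
    """Hypothetical stand-in for hash(t, score): term offset plus score position."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    offset = int.from_bytes(digest[:4], "big") % RING_SIZE
    return (offset + int((1.0 - score) * (SEGMENT - 1))) % RING_SIZE

def bootstrap_tin(term, u):
    """Return the bootstrap steps as (beacon_id, action, gateway_id) tuples."""
    p = 1.0 / u
    steps = []
    for i in range(u):
        beacon = ring_id(term, i * p)
        if i == 0:
            # The first beacon creates the TIN and needs no gateway.
            steps.append((beacon, "create", None))
        else:
            # Every later beacon joins through the previous beacon.
            gateway = ring_id(term, (i - 1) * p)
            steps.append((beacon, "join", gateway))
    return steps
```

Because the beacon IDs are computed from the term alone, any peer can recompute them later and use a beacon as its entry point to the TIN.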
Publish Data / Join a TIN
• If the peer with id = hash(t, score) is not yet in the TIN for term t:
– Randomly select a beacon node (beacon nodes act as gateways to the TIN)
– Call the join method
• Store the item (docId, t, score)
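The publish steps can be sketched with a small in-memory model of a TIN. The class `TermIndexNetwork`, its fields, and the `publish` function are hypothetical; `peer_id` is assumed to have been computed already as hash(t, score).

```python
import random

class TermIndexNetwork:
    """Toy model of a TIN: its beacons, its member peers, and stored items."""

    def __init__(self, term, beacons):
        self.term = term
        self.beacons = list(beacons)
        self.members = set(beacons)
        self.items = {}   # peer id -> list of (doc_id, term, score)

    def join(self, peer_id, gateway):
        """Add a peer to the TIN through a gateway that is already a member."""
        if gateway not in self.members:
            raise ValueError("gateway is not part of the TIN")
        self.members.add(peer_id)

def publish(tin, peer_id, doc_id, score, rng=random):
    """Store (doc_id, term, score) at peer_id, joining the TIN first if needed."""
    if peer_id not in tin.members:
        gateway = rng.choice(tin.beacons)   # randomly selected beacon node
        tin.join(peer_id, gateway)
    tin.items.setdefault(peer_id, []).append((doc_id, tin.term, score))
```

A peer therefore joins a TIN only when it first has to store an item for that term, which keeps each TIN limited to the peers that actually hold part of the index list.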
Query Processing
[Figure: 2-keyword query, showing data peers and a coordinator]
Alternative: Collect data and send in one batch.
QP with Moving Coordinator
[Figure: 3-keyword query, showing data peers and a moving coordinator]
Data Replication
• Vertical: Replicate data inside a TIN via a 'reverse' communication.
[Figure: vertical replication inside a TIN]
• Horizontal: Replicate complete TINs
[Figure: horizontal replication with several copies of TINs A, B and C]
Experiments
Test bed:
10,000 peers
Benchmarks:
• GOV: TREC .GOV collection + 50 TREC-2003 Web
queries, e.g. juvenile delinquency
• XGOV: TREC .GOV collection + 50 manually expanded
queries, e.g. juvenile delinquency youth minor crime law
jurisdiction offense prevention
• SCALABILITY: One query executed multiple times
……….
Experiments: Metrics
Metrics
• Network traffic (in KB)
• Query response time (in s)
- network cost (150 ms RTT, 800 Kb/s data transfer rate)
- local I/O cost (8 ms rotation latency + 8 MB/s transfer delay)
- processing cost
• Number of Hops
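To show how the listed parameters could turn into a response-time estimate, the sketch below combines them in a simple additive model. The slide gives only the constants; charging one RTT per message, one rotation latency per disk access, and reading "Kb" as kilobits are assumptions, as are the function names.

```python
RTT_SECONDS = 0.150            # 150 ms round-trip time
NETWORK_KBIT_PER_S = 800.0     # 800 Kb/s data transfer rate (read as kilobits)
ROTATION_SECONDS = 0.008       # 8 ms rotation latency
DISK_MB_PER_S = 8.0            # 8 MB/s transfer rate

def network_time(messages, kilobytes):
    """Assumed model: one RTT per message plus transfer time for the payload."""
    return messages * RTT_SECONDS + (kilobytes * 8.0) / NETWORK_KBIT_PER_S

def io_time(disk_accesses, megabytes):
    """Assumed model: one rotation latency per access plus transfer time."""
    return disk_accesses * ROTATION_SECONDS + megabytes / DISK_MB_PER_S

def response_time(messages, kilobytes, disk_accesses, megabytes, processing=0.0):
    """Sum of network, local I/O, and processing cost, in seconds."""
    return network_time(messages, kilobytes) + io_time(disk_accesses, megabytes) + processing
```

Under these assumptions, two messages carrying 100 KB in total cost about 1.3 s of network time, so the transfer rate rather than the RTT dominates for larger payloads.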
Scalability Experiment
• Measure time for different query loads.
– identical queries
– inserted into a queue
[Figure: total execution time in seconds (100 to 10,000,000) versus query load as queue size (1 to 10,000), with curves labeled "Minerva Infinity" and "no parallel processing"]
Experiments: Results
[Figure: two panels for the GOV benchmark, total bandwidth in KB and total time in seconds, each plotted against the number of query terms (2, 3, 4)]
Conclusion
• Novel architecture for P2P web search.
• High level of distribution both in data and processing.
• Novel algorithms to create the networks, place data, and execute queries.
• Support of two different data replication strategies.
Future Work
• Support of different score distributions
• Adapt TIN sizes to the actual load
• Different top-k query processing algorithms
Thank you for your attention