MINERVA Infinity:
A Scalable Efficient Peer-to-Peer Search Engine

Sebastian Michel, Max-Planck-Institut für Informatik, Saarbrücken, Germany, smichel@mpi-inf.mpg.de
Peter Triantafillou, University of Patras, Rio, Greece, peter@ceid.upatras.gr
Gerhard Weikum, Max-Planck-Institut für Informatik, Saarbrücken, Germany, weikum@mpi-inf.mpg.de




Middleware 2005
Grenoble, France
Vision

 • Today: Web search is dominated
   by centralized engines (“to google”)
       – censorship?
       – single point of attack/abuse
       – coverage of the web?

 • Ultimate goal: “Distributed Google” to
   break information monopolies

 • A P2P approach is best suited
       – large number of peers
       – exploit mostly idle resources
       – intellectual input of user community

MINERVA Infinity: A Scalable Efficient P2P Search Engine
Challenges

• large scale networks
    – 100,000 to 10,000,000 users
• large collections
    – > 10^10 documents
    – 1,000,000 terms
• high dynamics




Questions
• Network Organization
    – structured?
    – hierarchical?
    – unstructured?
• Data Placement
    – move data around?
    – data remains at the owner?
• Scalability?
• Query Routing/Execution
    – Routing indexes?
    – Message flooding?

Overview
•   Motivation (Vision/Challenges/Questions)
•   Introduction to IR and P2P Systems
•   P2P-IR
•   Minerva Infinity
•   Network Organization
•   Data Placement
•   Query Processing
•   Data Replication
•   Experiments
•   Conclusion
Information Retrieval Basics



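[Figure: a document is broken into its terms; each term is annotated with its number of occurrences in the document (term frequency), e.g. 5x, 7x, 4x.]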



Information Retrieval Basics (2)

  Top-k Query Processing: find the k documents with
  the highest total score

  Query Execution: usually using
  some kind of threshold algorithm*:
    - sequential scans over
       the index lists (round-robin)
    - (random accesses to fetch
       missing scores)
    - aggregate scores
    - stop when the threshold is
      reached

  * e.g. Fagin's algorithm TA or a variant without random accesses

  [Figure: a B+ tree on terms points to the index lists; each list holds
  (DocId: tf*idf) entries sorted by score.]

      List 1       List 2       List 3
      d53: 0.8     d51: 0.6     d28: 0.7
      d55: 0.6     d12: 0.5     d11: 0.6
      d44: 0.4     d14: 0.4     d17: 0.1
      d17: 0.3     d52: 0.3     ...
      d52: 0.1     d44: 0.2
      ...          d28: 0.1
                   ...
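
The query execution steps above can be sketched in Python as follows. This is a minimal, single-machine illustration of a threshold-style top-k scan with random accesses, not the implementation used in the talk. The function name `threshold_algorithm` is hypothetical, and the toy lists contain only the entries visible on the slide (the elided tails are ignored), with missing scores treated as 0.

```python
import heapq


def threshold_algorithm(index_lists, k):
    """Return k (doc_id, total_score) pairs with the highest summed score.

    index_lists: one list per query term of (doc_id, score) pairs,
    each sorted by descending score.
    """
    # Random access structure: per list, doc_id -> score.
    random_access = [dict(lst) for lst in index_lists]
    totals = {}  # documents whose full score has been aggregated
    max_depth = max(len(lst) for lst in index_lists)

    for depth in range(max_depth):
        threshold = 0.0
        # Round-robin: read one entry from every list at this depth.
        for lst in index_lists:
            if depth >= len(lst):
                continue  # exhausted list: unseen documents score 0 here
            doc_id, score = lst[depth]
            threshold += score
            if doc_id not in totals:
                # Random accesses fetch the scores missing from other lists.
                totals[doc_id] = sum(m.get(doc_id, 0.0) for m in random_access)
        top = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        # Stop once no unseen document can beat the current k-th score.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


# Toy data: visible entries of the three index lists on the slide.
index_lists = [
    [("d53", 0.8), ("d55", 0.6), ("d44", 0.4), ("d17", 0.3), ("d52", 0.1)],
    [("d51", 0.6), ("d12", 0.5), ("d14", 0.4), ("d52", 0.3), ("d44", 0.2), ("d28", 0.1)],
    [("d28", 0.7), ("d11", 0.6), ("d17", 0.1)],
]
top2 = threshold_algorithm(index_lists, k=2)
print(top2)
```

On this toy input the scan stops at the fourth row, well before the lists are exhausted, which is the point of the threshold test.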

P2P Systems

• Peer:
       – “one that is of equal standing with another”
         (source: Merriam-Webster Online Dictionary)

• Benefits:
       – no single point of failure
       – resource/data sharing

• Problems/Challenges:
       – authority/trust/incentives
       – high dynamics
       – …

• Applications:
       – File Sharing
       – IP Telephony
       – Web Search
       – Digital Libraries


Structured P2P Systems based on
Distributed Hash Tables (DHTs)

• “structured” P2P networks
• provide one simple method:

                            lookup: key -> peer

• robust to load skew, failures, and dynamics
•    CAN [SIGCOMM 2001]
•    Chord [SIGCOMM 2001]
•    Pastry [Middleware 2001]
•    P-Grid [CoopIS 2001]


 Chord

• Peers and keys are mapped to
  the same cyclic ID space using a
  hash function

• Key k (e.g., hash(file name))
  is assigned to the node with
  key p (e.g., hash(IP address))
  such that k ≤ p and there is
  no node p' with k ≤ p' and p' < p

[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56 and keys k10, k24, k30, k38, k54.]
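
The assignment rule above can be sketched in a few lines of Python. The peer and key IDs are taken from the ring figure; the function name `responsible_peer` and the 6-bit ID space (IDs 0 to 63) are assumptions made for this illustration. The wrap-around case, where a key is larger than every peer ID, is handled by returning the first peer on the ring.

```python
from bisect import bisect_left

ID_SPACE = 2 ** 6  # assumed size of the cyclic ID space
PEERS = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])  # peer IDs from the figure


def responsible_peer(key, peers=PEERS, id_space=ID_SPACE):
    """Return the first peer p on the ring with key <= p (wrapping around)."""
    i = bisect_left(peers, key % id_space)
    return peers[i % len(peers)]


# Keys shown in the figure and the peers they are assigned to.
assignment = {k: responsible_peer(k) for k in (10, 24, 30, 38, 54)}
print(assignment)
```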



    Chord (2)

•    Using finger tables to speed
     up the lookup process
•    Store pointers to a few distant
     peers
•    Lookup in
     O(log n) steps

[Figure: Chord ring (peers p1, p8, p14, p21, p32, p38, p42, p51, p56) illustrating Lookup(54) for key k54.]
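
A finger-table lookup can be sketched as below. This is a simplified, in-memory illustration rather than a networked Chord node: the 6-bit ID space, the helper names, and the choice of p8 as the peer that starts Lookup(54) are assumptions; the peer IDs come from the ring figures. Finger i of a node n points to the peer responsible for n + 2^i.

```python
from bisect import bisect_left

M = 6            # identifier bits (assumption; IDs 0..63)
SIZE = 2 ** M
PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]  # sorted peer IDs


def successor(ident):
    """First peer whose ID is equal to or follows ident on the ring."""
    i = bisect_left(PEERS, ident % SIZE)
    return PEERS[i % len(PEERS)]


def finger_table(node):
    """Pointers to peers at exponentially growing distances from node."""
    return [successor(node + 2 ** i) for i in range(M)]


def in_open_interval(x, a, b):
    """True if x lies strictly between a and b, walking clockwise from a."""
    return 0 < (x - a) % SIZE < (b - a) % SIZE


def lookup(start, key):
    """Return (responsible peer, list of peers visited)."""
    node, path = start, [start]
    while True:
        succ = finger_table(node)[0]
        # The key belongs to succ if it lies in (node, succ].
        if key == succ or succ == node or in_open_interval(key, node, succ):
            path.append(succ)
            return succ, path
        # Otherwise jump to the farthest finger that still precedes the key.
        node = next(
            (f for f in reversed(finger_table(node)) if in_open_interval(f, node, key)),
            succ,
        )
        path.append(node)


owner, path = lookup(8, 54)
print(owner, path)
```

Each jump at least halves the remaining clockwise distance to the key, which is where the O(log n) bound comes from.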


P2P - IR

• Share documents (e.g. Web pages) in an
  efficient and scalable way
• Ranked retrieval
      – simple DHT is insufficient




 Possible Approaches

  • Each peer is responsible for storing the
    COMPLETE index list for a subset of terms.

  • Query Routing: DHT lookups
  • Query Execution: Distributed Top-k
    [TPUT '04, KLEE '05]

  • Drawback: capacity overload of peers with
    highly frequent / popular terms
    (data load AND query load)

[Figure: ring of peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56.]

Possible Approaches (2)

 • Each peer has its own local index
   (e.g., created by web crawls)

 • Query Routing:
       1. DHT lookups
       2. Retrieve Metadata
       3. Find most promising peers
 • Query Execution:
       - Send the complete Query
         and merge the incoming results

 • Drawback: capacity overload of peers with
       - highly frequent terms
       - high-quality collections

[Figure: peers P1 to P6 arranged around a Distributed Directory that maps each term to a list of peers.]
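
The routing and execution steps of this approach can be sketched as follows. The slide does not say what the metadata contains or how peers are ranked, so this sketch assumes the directory stores a per-peer document frequency for each term and ranks peers by the sum over the query terms. The names `select_peers` and `merge_results`, and all numbers, are hypothetical.

```python
from collections import defaultdict

# Hypothetical directory: term -> {peer: number of local documents with the term}.
directory = {
    "juvenile": {"P1": 120, "P3": 40, "P5": 5},
    "delinquency": {"P1": 30, "P4": 60},
}


def select_peers(query_terms, directory, max_peers):
    """Steps 1 to 3: look up each term, read the metadata, rank the peers."""
    quality = defaultdict(int)
    for term in query_terms:
        for peer, doc_freq in directory.get(term, {}).items():
            quality[peer] += doc_freq
    return sorted(quality, key=lambda p: (-quality[p], p))[:max_peers]


def merge_results(per_peer_results, k):
    """Merge the ranked lists returned by the selected peers into one top-k."""
    best = {}
    for results in per_peer_results:
        for doc_id, score in results:
            # The same document may be indexed by several peers.
            best[doc_id] = max(best.get(doc_id, 0.0), score)
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


chosen = select_peers(["juvenile", "delinquency"], directory, max_peers=2)
merged = merge_results([[("d1", 0.9), ("d2", 0.4)], [("d2", 0.7), ("d3", 0.5)]], k=2)
print(chosen, merged)
```

The sketch also makes the drawback visible: a peer such as P1 with a large, high-quality collection is selected for almost every query.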

Minerva Infinity

• Idea:
       – assign (term, docId, score)
         triplets to the peers
             • order preserving
             • load balancing
       – hash(score) + hash(term) as offset
       – guarantee 100% recall




Hash Function

• Requirements:
       – Load balancing (to avoid overloading peers)
       – Order preserving (to make the QP work)

• One without the other is trivial ...
       – Load balancing: apply a pseudo random hash function
       – Order preserving: (S - Smin) / (Smax - Smin) * N


• Achieving both together is challenging …
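
The two "trivial" hash functions can be contrasted with a small Python sketch. It assumes N = 8 peers, scores in [0, 1], and SHA-1 as the pseudo-random function; these choices and the function names are illustrative only. The order-preserving variant is the linear formula from the slide, clamped so that S = Smax maps to the last peer.

```python
import hashlib

N = 8  # number of peers (illustrative)


def load_balancing_hash(score, n=N):
    """Pseudo-random placement: spreads load but destroys the score order."""
    digest = hashlib.sha1(repr(score).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def order_preserving_hash(score, s_min=0.0, s_max=1.0, n=N):
    """Linear placement (S - Smin) / (Smax - Smin) * N, clamped to n - 1."""
    position = (score - s_min) / (s_max - s_min) * n
    return min(int(position), n - 1)


# Skewed toy scores: most items have a low score.
scores = [0.9, 0.5, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05, 0.02, 0.01]
ordered = [order_preserving_hash(s) for s in scores]
scattered = [load_balancing_hash(s) for s in scores]
print(ordered)    # follows the score order, but crowds the low-score peer
print(scattered)  # spread pseudo-randomly, score order is lost
```

With skewed scores the linear hash keeps the order but sends six of the ten items to peer 0, which is the load-balancing problem the slide refers to.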
MINERVA Infinity: A Scalable Efficient P2P Search Engine
                                                                 18
Hash Function (2)

•       Assume an exponential score distribution
•       Place the first half of the data to the first peer
•       The next quarter to the next peer
•       and so on …
[Figure: partitioning of the score range between 0 and 1.]
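
A literal reading of the bullets above can be sketched as a rank-based placement: the first half of the items goes to the first peer, the next quarter to the second, and so on, with the last peer taking whatever remains. The slide does not state which end of the score range comes "first", so the sketch works on ranks in whatever order the items are sorted. The function name `peer_for_rank` is hypothetical.

```python
from collections import Counter


def peer_for_rank(rank, total, n_peers):
    """Map the item at 0-based position rank (of total) to a peer index."""
    remaining = total - rank  # items not yet placed, including this one
    peer = 0
    # Each peer takes half of what is still left; the last peer takes the rest.
    while peer < n_peers - 1 and remaining * 2 ** (peer + 1) <= total:
        peer += 1
    return peer


placement = [peer_for_rank(r, total=16, n_peers=4) for r in range(16)]
counts = Counter(placement)
print(placement)
print(sorted(counts.items()))
```

Because items keep their relative order across peers, a scan that visits peers 0, 1, 2, … still reads the items in sorted order.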

Term Index Networks (TINs)

• Reduce # of hops during query processing (QP) by reducing the
  number of peers that maintain the index list for a
  particular term
       – Only a small subset of peers is used to store an
         index list.

[Figure: a Global Network of peers with three Term Index Networks A, B, and C, each formed by a small subset of the peers.]
How to Create/Find a TIN

• Use u Beacon-Peers to bootstrap
  the TIN for term T

      p = 1/u
      For i=0 to i<n' do
             id = hash(t, i*p)
             if (i>0) use hash(t,(i-1)*p)
                      as a gateway to the TIN
             else node with id creates the TIN
      End for

• Beacon nodes act as gateways to the TIN

[Figure: the Global Network with the TIN for term T.]
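
The bootstrap loop can be sketched in Python under several assumptions that are not on the slide: hash(t, s) is modelled as an order-preserving position for s in [0, 1) shifted by a pseudo-random per-term offset (following the "hash(score) + hash(term) as offset" idea), the ID space has 2^16 positions, and the loop bound n' is taken to be u. The sketch only records which ID creates the TIN and which gateway each later beacon would use; it does not perform any network operation.

```python
import hashlib

RING = 2 ** 16  # size of the ID space (assumption)


def term_offset(term):
    """Pseudo-random per-term offset on the ring."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % RING


def tin_hash(term, score):
    """Assumed hash(t, s): order preserving in s, shifted by the term offset."""
    return (int(score * RING) + term_offset(term)) % RING


def bootstrap_tin(term, u):
    """Return (beacon id, action, gateway id) for each of the u beacons."""
    p = 1.0 / u
    steps = []
    for i in range(u):
        beacon_id = tin_hash(term, i * p)
        if i == 0:
            steps.append((beacon_id, "create", None))
        else:
            # Later beacons join through the previous beacon.
            steps.append((beacon_id, "join", tin_hash(term, (i - 1) * p)))
    return steps


steps = bootstrap_tin("minerva", u=4)
for step in steps:
    print(step)
```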

Publish Data / Join a TIN

• If the peer with id = hash(t, score) is not yet in the TIN for
  term t:
• Randomly select a beacon node
       (Beacon nodes act as gateways to the TIN)

• Call the join method
• Store the item (docId, t, score)



Query Processing
[Figure: query processing for a 2-keyword query, showing the Data Peers and the Coordinator.]

• Alternative: Collect data and send in one batch.

QP with Moving Coordinator
[Figure: query processing for a 3-keyword query, showing the Data Peers and a moving Coordinator.]


Data Replication

• Vertical: Replicate data inside a TIN via a 'reverse'
  communication.
  [Figure: three peers holding items 1, 2, and 3 respectively each end up holding 1, 2, 3.]

• Horizontal: Replicate complete TINs
  [Figure: several copies of the TINs A, B, and C within the global network.]
Experiments
Test bed:
  10,000 peers

Benchmarks:
• GOV: TREC .GOV collection + 50 TREC-2003 Web
    queries, e.g. juvenile delinquency
• XGOV: TREC .GOV collection + 50 manually expanded
    queries, e.g. juvenile delinquency youth minor crime law
    jurisdiction offense prevention
• SCALABILITY: One query executed multiple times
    ……….


Experiments: Metrics

Metrics
• Network traffic (in KB)
• Query response time (in s)
          - network cost (150ms RTT,
                    800Kb/s data transfer rate)
          - local I/O cost (8ms rotation latency
                    + 8MB/s transfer delay)
          - processing cost

• Number of Hops
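
The network and local I/O parameters above can be turned into a back-of-the-envelope estimate. The slide lists only the parameters, not the exact cost formula, so the formulas below are assumptions: one round trip per message plus transfer time, "Kb" read as kilobits with 1 KB = 8 Kb, and one rotation latency per disk access plus transfer time. Processing cost is left out.

```python
RTT_S = 0.150            # 150 ms round-trip time (from the slide)
NET_RATE_KBIT_S = 800.0  # 800 Kb/s, read here as kilobits per second (assumption)
DISK_LATENCY_S = 0.008   # 8 ms rotation latency (from the slide)
DISK_RATE_MB_S = 8.0     # 8 MB/s disk transfer (from the slide)


def network_cost(size_kb, round_trips=1):
    """Seconds to ship size_kb kilobytes, assuming 1 KB = 8 Kb."""
    return round_trips * RTT_S + (size_kb * 8.0) / NET_RATE_KBIT_S


def local_io_cost(size_mb, accesses=1):
    """Seconds to read size_mb megabytes from disk."""
    return accesses * DISK_LATENCY_S + size_mb / DISK_RATE_MB_S


net = network_cost(100)  # a 100 KB message
io = local_io_cost(1)    # a 1 MB index list read
print(round(net, 3), round(io, 3))
```

Under these assumptions a 100 KB message costs 0.15 s of latency plus 1 s of transfer, so transfer volume dominates once messages exceed roughly 15 KB.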

Scalability Experiment

• Measure time for different
  query loads.
       – identical queries
       – inserted into a queue

[Figure: log-log plot of Total Execution Time in Seconds (100 to 10,000,000) against Query Load: Queue Size (1 to 10,000), comparing Minerva Infinity with no parallel processing.]
Experiments: Results


[Figure: two charts for the GOV benchmark, Total Time in Seconds (axis 0 to 1200) and Total Bandwidth in KB (axis 0 to 60,000), each plotted against the Number of Query Terms (2, 3, 4).]




Conclusion

  • Novel architecture for P2P web search.
  • High level of distribution both in data and
    processing.
  • Novel algorithms to create the networks, place
    data, and execute queries.
  • Support of two different data replication
    strategies.


Future Work

  • Support of different score distributions

  • Adapt TIN sizes to the actual load

  • Different top-k query processing algorithms




                   Thank you for your attention



