					Associative Peer to Peer Networks:
         Harnessing Latent Semantics


      Edith Cohen     Amos Fiat   Haim Kaplan
   AT&T Labs-research   Tel-Aviv University




 May 15, 2002    Stanford Networking Seminar
  “Traditional” Client-server Web
 Web server




                 Peer-to-peer Networks
Distributed network for sharing content (music,
video, software, etc.), where each host acts as both
a server and a client
• Harness vast resources
• Scalability/Robustness to
  failures/shutdowns




                    P2P Search
Overall performance of a P2P network highly
depends on the efficiency and versatility of
search

          What features are important?
• Scope: ability to locate “rare” items
  “Find the 10th episode of Star Trek Voyager”
• Partial-match/complex queries:
  “Find an Indiana Jones movie”
   …Or “Indiana Joens” movie…..

     (search in) Basic P2P Architectures
[Table columns: Partial-Matches, Scope]
Centralized (Napster): central index service.
Decentralized: peers are connected by a low-degree overlay network.
• Blind Search (Gnutella, FastTrack – Morpheus/KaZaA, …): search via
  flooding, multi-head random walks, ...
• Distributed Hash Tables (Freenet, CAN, Chord, …): induced topology,
  routed search.
          Associative P2P networks
  Retain Gnutella’s desirable properties:
  • Distributed overlay network
  • Peers store only what they need (“common good”
    at par with “own welfare”)
  • No tight control of topology/content
  • Support partial-match queries
  AND
  • Have search scope (orders of magnitude
    improvement over Gnutella)

       • Make implicit use of latent semantics
               – Provably good on a reasonable model
               – Very good on simulations

               P2P search framework
   • Search queries are propagated on the
     overlay (from peer to a neighbor peer).
   • When a peer receives a query, it checks if
     it can satisfy it; decreases hop count; and
     forwards it to a subset of its neighbors.
   • Each search includes query and a
     “propagation rule”, which determines which
     neighbors the search is propagated to.

“DHTs”:      propagation rule = hash of query
“Gnutella”:  propagation rule independent of query
Associative: propagation rules are predicates (guide rules)
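
The framework above can be sketched as follows. This is a minimal in-memory illustration, not the authors' implementation: the `Peer` class, `propagate`, `blind_rule` and `possession_rule` are hypothetical names, and forwarding is done depth-first for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Peer:
    # Hypothetical peer model: a name, a local index, and overlay neighbors.
    name: str
    items: Set[str] = field(default_factory=set)
    neighbors: List["Peer"] = field(default_factory=list)


# A propagation rule maps (current peer, query) to the neighbors to forward to.
PropagationRule = Callable[[Peer, str], List[Peer]]


def propagate(peer: Peer, query: str, rule: PropagationRule, hops: int,
              seen: Optional[Set[str]] = None) -> Optional[str]:
    """Return the name of the first peer found that holds `query`, else None."""
    seen = set() if seen is None else seen
    if peer.name in seen:
        return None
    seen.add(peer.name)
    if query in peer.items:           # the peer checks whether it can satisfy the query
        return peer.name
    if hops == 0:                     # hop count exhausted: stop forwarding
        return None
    for nxt in rule(peer, query):     # forward to the subset chosen by the rule
        hit = propagate(nxt, query, rule, hops - 1, seen)
        if hit is not None:
            return hit
    return None


def blind_rule(peer: Peer, query: str) -> List[Peer]:
    # "Gnutella"-style: the choice of neighbors does not depend on the query.
    return peer.neighbors


def possession_rule(item: str) -> PropagationRule:
    # Associative: a predicate on the neighbor ("do you possess `item`?").
    def rule(peer: Peer, query: str) -> List[Peer]:
        return [n for n in peer.neighbors if item in n.items]
    return rule
```

Only the rule passed to `propagate` changes between the blind and the associative variants; the forwarding loop stays the same.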
                            Overview
•     What do we mean by “latent semantics” ?
•     Challenges in using latent semantics in P2P setting
•     Our proposal: search propagation via Possession rules
•     Possession rules overlays
•     Search strategies
       – Possession rules search strategies: Rapier, GAS
       – Models for “blind search” strategies (gnutella)
• Analysis in the Itemsets model
• Experimental evaluation
• More on GAS search strategy


 View of P2P file sharing network




         What is latent semantics?
    Selections people make are dependent:
    •If you buy baby formula, you are more likely to
    buy diapers.
    •If two people loved a show, they are more likely
    to agree on other shows.
  • Peer/Item matrix is “Market Basket” dataset.
    Similar to buyers/items, Document/terms, Web-
    pages/hyperlinks, movies/viewers.
  • Applications for extracting patterns from market
    basket data: Information Retrieval, Collaborative
    Filtering, Web search, Marketing, Recommendation
    Systems,…. (clustering, search, association rules)
→ P2P search: direct queries to peers with interests
that match yours
                         Challenges



     • Overlay topology (“networking aspects”) must be
       coupled with search strategy (“Information
       Retrieval/Data-Mining”)
     • “Traditional” IR and data-mining tools are not
       adapted to the highly distributed P2P setting.
          – Similarity metrics/clustering/ranking involve matrix
            operations on the “market basket” data: principal
            component analysis (LSI), eigenvalue computations,
            association rules…


                     Possession Rules
 • Rule(O): do you possess item O ?
 • Peer maintains a possession rule for each
   item in its index (subset if index is large)
 • Search strategy: a sequence of possession
   rules (with “hop counts”/search size limit)

Making this work:
               • “Network”: How to build overlay that supports
               possession rules
               • “IR/DM”: design search strategies that use
               possession rules (and work!)
           Possession-rules overlays
Peer26 (P26)

  Index of P26, Rules/Items: Rule(A), Rule(B), Rule(C), Rule(D)

  item   Rule(item) neighbors
  A      P11, P7, P3
  B      P2, P6, P9
  C      P13, P15, P1
  D      P4, P5, P10

  Example Search Strategy of P26:
    2 hops in rule(A)
    4 hops in rule(B)
    6 hops in rule(C)
    4 hops in rule(A)
    3 hops in rule(D)
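
P26's state in this example can be written down as two small structures: a map from each rule to the neighbors kept for it, and a search strategy given as a sequence of (rule, hops) pairs. The sketch below uses the values from the slide; the variable and function names are illustrative assumptions.

```python
from collections import Counter

# Rule(item) -> neighbors of P26 within that rule (values from the slide).
RULE_NEIGHBORS = {
    "A": ["P11", "P7", "P3"],
    "B": ["P2", "P6", "P9"],
    "C": ["P13", "P15", "P1"],
    "D": ["P4", "P5", "P10"],
}

# A search strategy is a sequence of possession rules with hop counts.
STRATEGY = [("A", 2), ("B", 4), ("C", 6), ("A", 4), ("D", 3)]


def hop_budget(strategy):
    """Total number of hops the strategy spends across all rules."""
    return sum(hops for _, hops in strategy)


def hops_per_rule(strategy):
    """Hops spent in each rule; a rule may appear more than once."""
    totals = Counter()
    for rule, hops in strategy:
        totals[rule] += hops
    return dict(totals)


if __name__ == "__main__":
    print(hop_budget(STRATEGY))
    print(hops_per_rule(STRATEGY))
```

Note that rule(A) appears twice in the strategy, so a strategy is an ordered sequence rather than a per-rule budget.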
      Blind searching for O takes 13 probes
      Searching with rule(O) takes 2 probes
               Possession-rule overlay
Network is “Gnutella-like” within each rule
 • Coverage: the overlay induced on the peers that satisfy
   each rule consists of large connected components.
 • Small degree: each peer participates in a limited
   number of rules (yet overall there is a large number of
   rules). For each rule it “participates” in, the peer
   maintains several participating neighbors.
 • Overlay and search boost each other (easy to find
   appropriate neighbors for each rule):
  • When you find O, you often discover multiple peers
    that have O; when you give O, the searcher informs
    you of other peers with O.
  • Peers that have O can find other peers that have O
(… can use “super-peer” overlay within each rule!)
                Search strategies
• To beat blind search, associative search should probe
  peers that are more likely to answer than “random
  peers”
Associative search:
• RAPIER: Random Possession Rule – crudest strategy
• GAS: Greedy Selection – refined strategy

Blind search:
• Urand: (“gnutella”) all peers have same likelihood of
being probed in each query
• Prand: (“gnutella modified”) peers are probed
proportionally to their index size (RAPIER has same bias)

    RAPIER – Random Possession Rule
  simplest possession-rule based strategy

   RAPIER Search strategy:
   • Repeat until found:
        – Pick a random item O from your index
        – Search peers that have this item (using rule(O))



  Straightforward to implement on top of a
  possession-rule overlay network
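
A minimal sketch of the RAPIER loop, under stated assumptions: `rule_members` stands in for the overlay (it maps an item O to the peers reachable through rule(O)), `has_item` stands in for probing a peer, and each iteration probes one random peer of the chosen rule. These names and the one-probe-per-pick choice are illustrative, not taken from the talk.

```python
import random


def rapier_search(my_index, query, rule_members, has_item, max_probes, rng=None):
    """Probe peers via random possession rules until `query` is found.

    my_index     -- items the searching peer holds
    rule_members -- dict: item O -> list of peers in rule(O) (assumed overlay view)
    has_item     -- callable (peer, query) -> bool, stands in for a probe
    Returns (peer or None, number of probes used).
    """
    rng = rng or random.Random()
    # Usable rules: items in the index whose rule has at least one other member.
    rules = [o for o in sorted(my_index) if o != query and rule_members.get(o)]
    probes = 0
    while rules and probes < max_probes:
        item = rng.choice(rules)                # pick a random item O from the index
        peer = rng.choice(rule_members[item])   # search a peer that has O, via rule(O)
        probes += 1
        if has_item(peer, query):
            return peer, probes
    return None, probes
```

The `max_probes` cap replaces the open-ended "repeat until found" so that a search for an item nobody holds terminates.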



          Analysis: Itemsets Model

   • Items belong to “topics.” There are very
     many topics, but each peer selects items
     only from a fixed set of topics. Topic
     popularities can vary widely, but each peer
     has equal interest in each of “its” topics.
   We show that
   • RAPIER is at least as good as Prand
   • RAPIER is better than Prand when peers
     have fewer topics
   A simple model that hints at what is going on…

                      Experiments
      Data: used Client/Hostname matrix from proxy
      logs as peer/item matrix. Each entry, in turn, is
      treated as a search item.
        – Similarly-structured “market basket” data
        – Has rare items (which current P2P networks don’t
          support)
        – No universal model for market basket data
        – Can’t get a full index for many peers from current P2P
          networks… and these networks don’t reflect well on rare
          items.
   • Metric: ESS (Expected Search Size – number of
     peers probed till search is resolved). CDF of
     fraction of “searches” that have ESS below “x”.


      ESS – Expected Search Size
   • ESS: 1/(success probability in each probe) (when
      probes are “independent” – not true for GAS)
   • Probe success probability:
   • Urand: fraction of peers that have the item in their
     index
   • Prand: weight of each peer is its index size divided
     by sum of index sizes of all peers.
      – Success prob: sum of the weights of the peers that
        have the item (the weights of all peers sum to 1)
   • RAPIER: the average, over possession rules peer
     participates in, of fraction of peers in rule that have
     the item.

  Peer-Item Matrix - Experiment
           Items
         0   0   1?  1?  1?  0   0   0   0   0
         0   0   0   0   0   1?  0   0   1?  1?
         1?  1?  0   0   0   0   1   0   0   0
         0   0   1   0   1   0   0   0   1   0
         0   0   0   0   0   0   1   1   1   0
         1   1   0   0   0   0   0   0   1   0
         0   0   0   1   1   0   0   1   1   1
         0   0   1   1   0   0   0   0   1   0
         1   1   0   0   0   1   0   0   0   0
         0   1   0   0   1   0   0   0   1   0
                      Urand and Prand
  Urand: Ps = 3/9, ESS = 3                         Prand: ESS = 29/9

  Urand                  Items                     Prand
  1/9     0   0   1   1   1   0   0   0   0   0    3/29
  1/9     0   0   0   0   0   1   0   0   1   1    3/29
          1   1?  0   0   0   0   1   0   0   0
  1/9     0   0   1   0   1   0   0   0   1   0    3/29
  1/9     0   0   0   0   0   0   1   1   1   0    3/29
  1/9     1   1   0   0   0   0   0   0   1   0    3/29
  1/9     0   0   0   1   1   0   0   1   1   1    5/29
  1/9     0   0   1   1   0   0   0   0   1   0    3/29
  1/9     1   1   0   0   0   1   0   0   0   0    3/29
  1/9     0   1   0   0   1   0   0   0   1   0    3/29

  (The row without weights is the searching peer; 1? marks the item searched for.)

      RAPIER (Random Possession Rule)
        rule                    rule
        0.5       Items         0.5
        0   0   1   1   1   0   0   0   0   0
        0   0   0   0   0   1   0   0   1   1
        1   1?  0   0   0   0   1   0   0   0
        0   0   1   0   1   0   0   0   1   0
0.5     0   0   0   0   0   0   1   1   1   0
0.25    1   1   0   0   0   0   0   0   1   0
        0   0   0   1   1   0   0   1   1   1
        0   0   1   1   0   0   0   0   1   0
0.25    1   1   0   0   0   1   0   0   0   0
        0   1   0   0   1   0   0   0   1   0
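
The three ESS definitions can be checked on this 10x10 example. The sketch below assumes, as the slide weights suggest, that the third row is the searching peer, that the entry marked 1? (second item) is the query, and that the searcher itself is never probed. The Urand and Prand values are the ones shown on the slides; the RAPIER value is computed by the sketch and is not stated in the talk.

```python
from fractions import Fraction

# Peer/item matrix from the example (rows = peers, columns = items).
MATRIX = [
    [0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],  # assumed searching peer
    [0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
]
SEARCHER, QUERY = 2, 1  # 0-based row and column of the entry marked 1?


def other_peers():
    return [row for i, row in enumerate(MATRIX) if i != SEARCHER]


def ess_urand():
    rows = other_peers()
    ps = Fraction(sum(r[QUERY] for r in rows), len(rows))  # fraction of peers with item
    return 1 / ps


def ess_prand():
    rows = other_peers()
    total = sum(sum(r) for r in rows)                      # sum of all index sizes
    ps = Fraction(sum(sum(r) for r in rows if r[QUERY]), total)
    return 1 / ps


def ess_rapier():
    rows = other_peers()
    # Rules the searcher participates in: its items other than the query.
    rules = [j for j, v in enumerate(MATRIX[SEARCHER]) if v and j != QUERY]
    fractions = []
    for j in rules:
        members = [r for r in rows if r[j]]                # peers in rule(j)
        fractions.append(Fraction(sum(r[QUERY] for r in members), len(members)))
    ps = sum(fractions) / len(fractions)                   # average over rules
    return 1 / ps


if __name__ == "__main__":
    print(ess_urand(), ess_prand(), ess_rapier())
```

Under these assumptions the sketch gives ESS 3 for Urand and 29/9 for Prand, matching the slides, and ESS 2 for RAPIER: one of the searcher's two rules leads only to peers that have the item, the other to a peer that does not.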
Caveat: comparing apples and oranges

   • When searching by possession rules we have bias
     towards peers that participate in more rules/ have
     more items.
   • But, with this bias, a strategy has better chance
     of finding what it is looking for! So…
   • We show that the likelihood of being probed is
     proportional to number of rules you participate in.
   • Prand “blind search” strategy has same bias.
   • Thus, it is “fair” to compare Prand search with
     possession-rule based RAPIER


           GAS …Refining RAPIER
Ideas:
• Some rules are better than others (e.g., possession
  of a very popular item carries weaker information)
• Unsuccessful search carries information: suppose
  you lost something, you think you lost it at home. You
  search home going through various closets and
  drawers and don’t find it, then you may decide to go
  search the office, even if you have not completed an
  exhaustive search at home. What happened? The
  posterior distribution on the item’s location had
  changed as a result of the search.


Strategies compared in the plots: Urand (blind search, Gnutella),
Prand (Gnutella modified), RAPIER and GAS (our algorithms)
[Figure: All Items]
[Figure: Rare Items: present in 1% of peers]
[Figure: Rarer items: 0.1% of peers]
[Figure: Even Rarer Item: 0.01% of peers]

                GAS – Greedy Strategy
• Idea: use the search strategy that would have optimized
  your search on previous queries.
• Caveat: this is NP-Complete
• Can do: greedy approximation strategy: GAS
 GAS:
• initialize the “query vector” to a uniform distribution on
  previous selections.
• Iterate the following:
   – Apply the possession rule that maximizes success
     probability with respect to the query posterior
   – update the query posterior.

 Theorem: GAS is a constant factor approximation
 of the optimal strategy
                 Building GAS strategies
• GAS:
  – Take a sample of items currently in your index (D, E, F, G).
  – “Search” for these items in each possession rule you
    participate in (A, B, C).
  – Obtain a matrix: fraction of peers with item x in rule(y)

                  Item      D      E      F      G
                  Rule()
                  rule(A)   0.03   *      0.2    *
                  rule(B)   *      0.04   *      0.1
                  rule(C)   0.1    0.2    0.03   *
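
A sketch of how such a matrix could be computed. It assumes a global view of every peer's index (`peer_indexes`, a hypothetical dict), which a real peer would not have; in practice the fractions would be estimated by probing within each rule. The function name is illustrative.

```python
def rule_item_fractions(my_rules, sample_items, peer_indexes):
    """Return {(rule y, item x): fraction of peers in rule(y) that have x}.

    peer_indexes -- dict: peer name -> set of items it holds (assumed global view)
    """
    table = {}
    for y in my_rules:
        # Peers that satisfy rule(y), i.e. that possess item y.
        members = [index for index in peer_indexes.values() if y in index]
        for x in sample_items:
            if members:
                table[(y, x)] = sum(x in index for index in members) / len(members)
            else:
                table[(y, x)] = 0.0   # empty rule: nothing to probe
    return table


if __name__ == "__main__":
    peers = {
        "p1": {"A", "D"},
        "p2": {"A"},
        "p3": {"B", "D"},
        "p4": {"A", "D", "E"},
    }
    print(rule_item_fractions(["A", "B"], ["D", "E"], peers))
```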

               GAS strategy (example)
                  Item      D      E      F      G
                  Rule()
                  rule(A)   0.03   *      0.2    *
                  rule(B)   *      0.04   *      0.1
                  rule(C)   0.1    0.2    0.03   *

         C,C,C,A,C,C,A,C,A,C,B,B,A,C,B,B,C,A,B,B,C

GAS search of size 21:                RAPIER search of size 21:
 10 probes in rule(C)                  7 probes in rule(C)
  6 probes in rule(B)                  7 probes in rule(B)
  5 probes in rule(A)                  7 probes in rule(A)
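
The greedy selection can be sketched on this matrix. Two assumptions are made here that the slides do not spell out: entries shown as * are treated as 0, and after an unsuccessful probe in a rule the query vector is updated by Bayes' rule (each item's weight is multiplied by the probability that the probe would have failed for it, then renormalized). With these assumptions the sketch reproduces the 21-probe sequence shown above.

```python
# Fraction of peers with item x in rule(y); "*" entries are assumed to be 0.
P = {
    "A": {"D": 0.03, "E": 0.0, "F": 0.2, "G": 0.0},
    "B": {"D": 0.0, "E": 0.04, "F": 0.0, "G": 0.1},
    "C": {"D": 0.1, "E": 0.2, "F": 0.03, "G": 0.0},
}


def gas_strategy(p, size):
    """Greedy sequence of `size` rules for the rule/item matrix `p`."""
    rules = sorted(p)
    items = sorted({x for row in p.values() for x in row})
    # Query vector: uniform distribution over the sampled previous selections.
    q = {x: 1.0 / len(items) for x in items}
    sequence = []
    for _ in range(size):
        # Apply the rule with the highest success probability under q.
        best = max(rules, key=lambda r: sum(q[x] * p[r][x] for x in items))
        sequence.append(best)
        # Posterior after a failed probe in `best` (assumed Bayes update).
        q = {x: q[x] * (1.0 - p[best][x]) for x in items}
        total = sum(q.values())
        q = {x: v / total for x, v in q.items()}
    return sequence


if __name__ == "__main__":
    print(",".join(gas_strategy(P, 21)))
```

The output mixes rules in an order that shifts toward rule(B) as rule(C) and rule(A) keep failing, whereas RAPIER spreads the same 21 probes evenly in expectation.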

                    Summary
   • We proposed a general framework for associative
     P2P search: exploit patterns inherent in human
     selections to boost search. Adapted to the P2P
     setting.
   • Search strategies and the overlay structure are
     “symbiotic” and guided/boosted by previous
     selections/queries.
   • “Common good” on a par with “own welfare”: All data
     maintained by each peer has direct personal benefit
     (like gnutella). Helping others helps you…
   • Possession rules:
      – Strategies are “approximations” to “standard”
         similarity metrics… that work!
      – Easy to find other sources of desired item (for
         alternative/parallel downloads)

                        Related work
• IR-DM: association rules/collaborative filtering/Web
  search
• P2P networks: unstructured networks; DHTs
   – DHTs have “symbiotic” overlay/search strategy
   – Caching at peers (Freenet) adapts the overlay according to search
• Intersection:
   – Crespo/Garcia-Molina 02 (CG02): routing indexes
       • System isolates “topics” and maps queries/items to topics.
       • Peer knows a “summary” of what can be reached through it/each neighbor.
       • Query keywords are used to select the neighbor that is the best match.
       • Differences from our approach:
          – No connection between search and overlay topology
          – Uses only text/keywords. We use co-location associations between
            items.
       • CG02: tradeoff between topic divergence (all nodes ending up with
         similar index “summary”) and restricted coverage (number of peers
         included in each peer summary).
   – neurogrid.net (Sam Joseph, U. Tokyo): “agent” text-based approach
       • Peers learn and remember content of other peers

                       Future…
  • Integrate text matching (of query keywords) in
    search strategy (use rule(O) if query keywords
    match O’s metadata)
  • Select which possession rules to participate in (e.g.,
    using item popularity heuristic or GAS-like selection)
  • Search strategy gives more weight to more recent
    selections (are more indicative of next query)
  • Explore other types of propagation rules
  • P2P “communities” ?
  • Integrate “Recommendation Systems” in P2P ?
  • Implementation …

               Some Extra Comments…

   • Issues with straightforward importing of
     IR techniques
        – Vector space approach
        – Similarity metrics
   • Why we need to use several propagation
     rules in a search? (when searching
     according to “examples” in the index)




 “Straight” IR vector-space approach
  • Peers are mapped to vectors, according to their
    index content. Queries are mapped to the vectors
    in the same space.
  • Overlay topology is correlated with distances in
    this vector space (bias towards closer peers)
  • Search propagation targets regions of the space
    that are “closest” to the query.

   • #neighbors=O(dimension) - want small dimension
   • Yet, matrix operations, e.g., principal component
     analysis (LSI), are hard in our distributed setting
   • Yet, each peer should be able to compute the
     mapping for its queries and/or index
   • Proximity metric alone is insufficient (Need
     different propagation rules)
     Why we need several propagation
       rules for the same query –
       ”decision-tree like” search
    propagation rule = approx. interest area
   Each peer covers several interest areas; peers have
      different sets of interest areas.
   Peer Query: 80% basketball, 20% polo
   “World” Index: 5% basketball 0.1% polo
   All “basketball” lovers would be close matches; but
      need to direct search to more “polo” lovers
   multi-rule search strategy: “basketball” 200 peers;
      “polo” 200 peers
