					     Topics in Database Systems: Data Management in
                   Peer-to-Peer Systems




                  Search in Unstructured P2P




p2p, Fall 05
                              Outline

         Search Strategies in Unstructured p2p


         Routing Indexes




     Topics in Database Systems: Data Management in
                   Peer-to-Peer Systems




        D. Tsoumakos and N. Roussopoulos, “A Comparison of Peer-to-
        Peer Search Methods”, WebDB03




                                     Overview

       Centralized
      A constantly updated directory hosted at central locations (does
      not scale well, costly updates, single point of failure)
       Decentralized but structured
      The overlay topology is tightly controlled, and files (or their
      metadata/index) are not placed at random nodes but at
      specified locations
      “Loosely” vs. “highly” structured (DHTs)
       Decentralized and unstructured
               Peers connect in an ad hoc fashion
               The location of documents/metadata is not controlled by the system
               No guarantee that a search succeeds
               No bounds on search time


                 Flooding on Overlays

[Figure sequence: one node holds xyz.mp3; another node issues the query “xyz.mp3 ?”, which is flooded across the overlay.]

                     Search in Unstructured P2P



     Must find a way to stop the search: Time-to-Live (TTL)


     Exponential Number of Messages


     Cycles (?)
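
The three points above (TTL, message growth, cycles) can be illustrated with a minimal sketch of TTL-limited flooding. It assumes the overlay is given as an adjacency dictionary; the function name `flood` and its parameters are hypothetical, and duplicate queries are dropped by remembering which nodes have already seen the query.

```python
from collections import deque

def flood(graph, source, holders, ttl):
    """TTL-limited flooding over an overlay given as {node: [neighbors]}.

    Returns (found, messages): whether a node in `holders` was reached
    within `ttl` hops, and how many query messages were sent in total.
    """
    seen = {source}                       # nodes that already processed the query
    frontier = deque([(source, None, ttl)])
    messages = 0
    found = source in holders
    while frontier:
        node, sender, remaining = frontier.popleft()
        if remaining == 0:
            continue                      # TTL exhausted: do not forward
        for neighbor in graph[node]:
            if neighbor == sender:
                continue                  # never send back to the sender
            messages += 1                 # every forwarded copy is one message
            if neighbor in seen:
                continue                  # duplicate caused by a cycle: dropped
            seen.add(neighbor)
            if neighbor in holders:
                found = True              # other paths keep flooding anyway
            frontier.append((neighbor, node, remaining - 1))
    return found, messages
```

Note that a duplicate is still counted as a message: cycles do not cause infinite forwarding here, but they do waste bandwidth.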




                       Search in Unstructured P2P


         BFS vs. DFS
         BFS gives better response time but contacts a larger number of
         nodes (higher message overhead per node and overall)


       Note: a BFS search continues (until the TTL is exhausted), even
       if the object has already been located on a different path


         Recursive vs. Iterative
         During the search, either the node issuing the query contacts
         the other nodes directly (iterative), or the query is passed
         on from node to node (recursive).
         Does the result follow the same path back?


                            Iterative vs. Recursive Routing
      Iterative: Originator requests IP address of each hop
          • Message transport is actually done via direct IP
      Recursive: Message transferred hop-by-hop

[Figure: overlay nodes, each holding key-value (K, V) pairs; a retrieve(K1) request is routed among them.]

                       Search in Unstructured P2P



      Two general types of search in unstructured p2p:
      Blind: try to propagate the query to a sufficient number
      of nodes (example Gnutella)

      Informed: utilize information about document locations
      (example Routing Indexes)



               Informed search accepts a higher join cost in exchange
               for a lower search cost




                          Blind Search Methods

         Gnutella:
         Uses flooding (BFS) to contact all accessible nodes within the
         TTL value
         Huge overhead for a large number of peers, plus high
         overall network traffic
         Hard to find unpopular items
         Bandwidth consumption: up to 60% of the total Internet
         traffic




                                 Overlay Networks
 •    P2P applications need to:
       – Track identities & (IP) addresses of peers
           • May be many!
           • May have significant Churn (update rate)
           • Best not to have $n^2$ ID references

       – Route messages among peers
          • If you don’t keep track of all peers, this is “multi-hop”

 This is an overlay network
       – Peers are doing both naming and routing
       – IP becomes “just” the low-level transport

               • All the IP routing is opaque



                        P2P Cooperation Models
•   Centralized model
     – global index held by a central authority
       (single point of failure)
     – direct contact between requestors and providers
     – Example: Napster
•   Decentralized model
     – Examples: Freenet, Gnutella
     – no global index, no central coordination, global behavior emerges from
       local interactions, etc.
     – direct contact between requestors and providers (Gnutella) or
       mediated by a chain of intermediaries (Freenet)
•   Hierarchical model
     – introduction of “super-peers”
     – mix of centralized and decentralized model
     – Example: DNS


                       Free-riding on Gnutella [Adar00]

        •      24 hour sampling period:
                – 70% of Gnutella users share no files
                – 50% of all responses are returned by top 1% of sharing
                   hosts
        •      A social problem not a technical one
        •      Problems:
                – Degradation of system performance: collapse?
                – Increase of system vulnerability
                – “Centralized” (“backbone”) Gnutella -> copyright issues?
        •      Verified hypotheses:
                – H1: A significant portion of Gnutella peers are free riders
                – H2: Free riders are distributed evenly across domains
                – H3: Hosts often share files that nobody is interested in
                   (files that are never downloaded)



               Free-riding Statistics - 1 [Adar00]




 H1: Most Gnutella users are free riders
 Of 33,335 hosts:
      – 22,084 (66%) of the peers share no files
      – 24,347 (73%) share ten or fewer files
      – Top 1 percent (333) hosts share 37% (1,142,645) of total files shared
      – Top 5 percent (1,667) hosts share 70% (2,182,087) of total files shared
      – Top 10 percent (3,334) hosts share 87% (2,692,082) of total files shared


                 Free-riding Statistics - 2 [Adar00]




        H3: Many servents share files nobody downloads
        Of 11,585 sharing hosts:
            – Top 1% of sites provide nearly 47% of all answers
            – Top 25% of sites provide 98% of all answers
            – 7,349 (63%) never provide a query response


                            Free Riders

  • File sharing studies
     – Lots of people download
     – Few people serve files

  • Is this bad?
     – If there’s no incentive to serve, why do people do so?
     – What if there are strong disincentives to being a major
        server?




                   Simple Solution: Thresholds

  • Many programs allow a threshold to be set
     – Don’t upload a file to a peer unless it shares > k files

  • Problems:
     – What’s k?
     – How to ensure the shared files are interesting?




               Categories of Queries [Sripanidkulchai01]


  Categorized top 20 queries




               Popularity of Queries [Sripanidkulchai01]




  •    Very popular documents are approximately equally popular
  •    Less popular documents follow a Zipf-like distribution (i.e., the
       probability of seeing the i-th most popular query is
       proportional to $1/i^{\alpha}$)
  •    Access frequency of web documents also follows Zipf-like distributions
       -> caching might also work for Gnutella

                   Caching in Gnutella [Sripanidkulchai01]

•      Average bandwidth consumption in tests: 3.5Mbps
•      Best case: trace 2 (73% hit rate = 3.7 times traffic reduction)
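
The link between the 73% hit rate and the 3.7-fold reduction can be checked with a small calculation. This is a sketch under the assumption that a cache hit suppresses the query entirely, so only misses generate traffic; the function name is hypothetical.

```python
def traffic_reduction(hit_rate):
    """Factor by which traffic shrinks if only cache misses are forwarded."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - hit_rate)

print(round(traffic_reduction(0.73), 1))  # 3.7
```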




                    Topology of Gnutella [Jovanovic01]

   • Power-law properties verified (“find everything close by”)
   • Backbone + outskirts

  Power-Law Random Graph (PLRG):

  The node degrees follow a power-law distribution:
  if one ranks all nodes from the most connected to the least
  connected, then the i-th most connected node has $\omega / i^{a}$
  neighbors, where $\omega$ is a constant.




               Gnutella Backbone [Jovanovic01]




                Why does it work? It’s a small World! [Hong01]


  •    Milgram: 42 out of 160 letters from Oregon to Boston (~ 6 hops)
  •    Watts: between order and randomness
        – short-distance clustering + long-distance shortcuts




  Regular graph:                 n nodes, k nearest neighbors
                                 path length ~ n/2k   (4096/16 = 256)
  Rewired graph (1% of nodes):   path length ~ random graph
                                 clustering ~ regular graph
  Random graph:                  path length ~ log(n)/log(k)   (~ 4)
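
The two path-length estimates can be reproduced numerically. The sketch below assumes n = 4096 and k = 8, which is what the figures 4096/16 = 256 and ~4 suggest; the function names are hypothetical.

```python
import math

def regular_path_length(n, k):
    """Approximate path length in a regular graph: n / 2k."""
    return n / (2 * k)

def random_path_length(n, k):
    """Approximate path length in a random graph: log(n) / log(k)."""
    return math.log(n) / math.log(k)

n, k = 4096, 8                               # assumed values, see lead-in
print(regular_path_length(n, k))             # 256.0
print(round(random_path_length(n, k), 2))    # 4.0
```

A rewired (small-world) graph combines the short path length of the random graph with the clustering of the regular one.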

                       Links in the small World [Hong01]


   •   “Scale-free” link distribution
         – Scale-free: independent of the total number of nodes
         – Characteristic for small-world networks
         – The proportion of nodes having a given number of links n is:
                                     $P(n) = 1 / n^{k}$
         – Most nodes have only a few connections
         – Some have a lot of links: important for binding disparate regions
           together




               Freenet: Links in the small World [Hong01]



                                          $P(n) \sim 1 / n^{1.5}$




               Freenet: “Scale-free” Link Distribution [Hong01]




                         Gnutella: New Measurements
 [1] Stefan Saroiu, P. Krishna Gummadi, Steven D. Gribble:
 A Measurement Study of Peer-to-Peer File Sharing Systems,
 Proceedings of Multimedia Computing and Networking (MMCN)
 2002, San Jose, CA, USA, January 2002.

 [2] M. Ripeanu, I. Foster, and A. Iamnitchi.
 Mapping the gnutella network: Properties of large-scale peer-to-peer systems and implications for
 system design.
 IEEE Internet Computing Journal, 6(1), 2002

 [3] Evangelos P. Markatos,
 Tracing a large-scale Peer to Peer System: an hour in the life of Gnutella,
 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2002.

 [4] Y. Chawathe, S. Ratnasamy, L. Breslau, and S. Shenker.
 Making Gnutella-like P2P Systems Scalable. In Proc. ACM SIGCOMM (Aug. 2003).

 [5] Qin Lv, Pei Cao, Edith Cohen, Kai Li, Scott Shenker:
 Search and replication in unstructured peer-to-peer networks. ICS 2002: 84-95
                        Gnutella: Bandwidth Barriers

 •    Clip2 measured Gnutella over 1 month:
       – a typical query is 560 bits long (including TCP/IP headers)
       – 25% of the traffic is queries, 50% pings, 25% other
       – on average, each peer seems to have 3 other peers actively connected
 •    Clip2 found a scalability barrier, with substantial performance degradation if
      queries/sec > 10:
                10 queries/sec
             * 560 bits/query
             * 4 (to account for the other 3 quarters of message traffic)
             * 3 simultaneous connections
             = 67,200 bps
             -> 10 queries/sec is the maximum in the presence of many dial-up users
             -> this won’t improve (more bandwidth -> larger files)
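
The barrier computation above can be written out as a short calculation. This is only a sketch of the arithmetic; the function and parameter names are hypothetical, and `query_share` is the fraction of all traffic that consists of queries (25% in the Clip2 measurement).

```python
def required_bandwidth_bps(queries_per_sec, bits_per_query, query_share, connections):
    """Bandwidth one peer needs to keep up with the given query rate."""
    # Scale query traffic up to total traffic, then multiply by the
    # number of simultaneously open connections.
    return queries_per_sec * bits_per_query / query_share * connections

print(required_bandwidth_bps(10, 560, 0.25, 3))  # 67200.0
```

The result exceeds what a 56 kbps dial-up modem can carry, which is why 10 queries/sec acts as a ceiling.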




                                 Gnutella: Summary

•   Completely decentralized
•   Hit rates are high
•   High fault tolerance
•   Adapts well and dynamically to changing peer populations
•   Protocol causes high network traffic (e.g., 3.5 Mbps). For example:
     – 4 connections C per peer, TTL = 7
     – 1 ping packet can cause $2 \sum_{i=0}^{TTL} C \, (C-1)^{i} = 26{,}240$ packets
•   No estimates on the duration of queries can be given
•   No probability for successful queries can be given
•   Topology is unknown -> algorithms cannot exploit it
•   Free riding is a problem
•   Reputation of peers is not addressed
•   Simple, robust, and scalable (at the moment)
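
The packet count for a single ping (C = 4 connections per peer, TTL = 7) can be verified by evaluating the sum directly. The function name is hypothetical; the formula is the one given in the list above.

```python
def flood_packets(connections, ttl):
    """Evaluate 2 * sum_{i=0}^{TTL} C * (C - 1)^i."""
    c = connections
    return 2 * sum(c * (c - 1) ** i for i in range(ttl + 1))

print(flood_packets(4, 7))  # 26240
```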




                 Hierarchical Networks (& Queries)



   • DNS
      – Hierarchical name space (“clients” + hierarchy of servers)
      – Hierarchical routing w/aggressive caching
         • 13 managed “root servers”

   • Traditional pros/cons of Hierarchical data mgmt
      – Works well for things aligned with the hierarchy
         • Esp. physical locality
      – Inflexible
         • No data independence!




                            Commercial Offerings

  •    JXTA
        – Java/XML Framework for p2p applications
        – Name resolution and routing is done with floods & superpeers
           • Can always add your own if you like

  •    MS WinXP p2p networking
        – An unstructured overlay, flooded publication and caching
        – “does not yet support distributed searches”

  •    Both have some security support
        – Authentication via signatures (assumes a trusted authority)
        – Encryption of traffic




                             Lessons and Limitations

  •    Client-Server performs well
        – But not always feasible
            • Ideal performance is often not the key issue!

  •    Things that flood-based systems do well
        – Organic scaling
        – Decentralization of visibility and liability
        – Finding popular stuff (e.g., caching)
        – Fancy local queries

  •    Things that flood-based systems do poorly
        – Finding unpopular stuff [Loo, et al VLDB 04]
        – Fancy distributed queries
        – Vulnerabilities: data poisoning, tracking, etc.
        – Guarantees about anything (answer quality, privacy, etc.)


                   Summary and Comparison of Approaches



System     Paradigm                          Search Type          Search Cost (messages)                 Autonomy
Gnutella   Breadth-first search on graph     String comparison    $2 \sum_{i=0}^{TTL} C \, (C-1)^{i}$    very high
FreeNet    Depth-first search on graph       String comparison    O(log n) ?                             very high
Chord      Implicit binary search trees      Equality             O(log n)                               restricted
CAN        d-dimensional space               Equality             O(d n^(1/d))                           high
P-Grid     Binary prefix trees               Prefix               O(log n)                               high




                            More on Search


  Search Options
     – Query Expressiveness (type of queries)
     – Comprehensiveness (all results, or just the first (or first k) results)
     – Topology
     – Data Placement
     – Message Routing




                        Comparison


                      Gnutella        CAN         Others?
   Expressiveness     [rating symbols lost]
   Comprehensiveness  [rating symbols lost]
   Autonomy           [rating symbols lost]
   Efficiency         [rating symbols lost]
   Robustness         [rating symbols lost]
   Topology           power law       grid
   Data Placement     arbitrary       hashing
   Message Routing    flooding        directed

                                     Parallel Clusters




[Figure: parallel clusters; links out of these clusters are not shown.]

        Search at only a fraction of the nodes!

               Other Open Problems besides Search: Security


  •    Availability (e.g., coping with DoS attacks)
  •    Authenticity
  •    Anonymity
  •    Access Control (e.g., IP protection, payments,...)




                             Trustworthy P2P
  •    Many challenges here. Examples:
        – Authenticating peers

        – Authenticating/validating data
           • Stored (poisoning) and in flight

        – Ensuring communication

        – Validating distributed computations

        – Avoiding Denial of Service
           • Ensuring fair resource/work allocation

        – Ensuring privacy of messages
           • Content, quantity, source, destination


                   Authenticity



                       title: origin of species
                       author: charles darwin

               ?             date: 1859

                       body: In an island far,
                            far away ...

                                  ...



               More than Just File Integrity



                           title: origin of species
                           author: charles darwin

                ?                date: 1859 00

                           body: In an island far,
                                far away ...

                                  checksum


               More than Fetching One File

                                             T=origin
                                               Y=?
                                             A=darwin
                                               B=?



 T=origin        T=origin T=origin                T=origin
 Y=1800          Y=1859   Y=1859                  Y=1859
 A=darwin        A=darwin A=darwin                A=darwin
                  B=abcd



                                  Solutions


  •    Authenticity function A(doc): T or F
       – at expert sites, or at all sites?
       – can use signatures: expert -> sig(doc) -> user
  •    Voting based
       – authentic is what the majority says
  •    Time based
       – e.g., the oldest (available) version is authentic
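
As an illustration of the voting-based option, the sketch below returns the value reported by a strict majority of peers. The function name is hypothetical, and the sample reports reuse the publication years from the earlier example (one peer reports Y=1800, three report Y=1859).

```python
from collections import Counter

def majority_value(reports):
    """Return the value reported by a strict majority of peers, else None."""
    if not reports:
        return None
    value, count = Counter(reports).most_common(1)[0]
    return value if count > len(reports) / 2 else None

print(majority_value([1800, 1859, 1859, 1859]))  # 1859
```

Such a vote is only as trustworthy as the voters: colluding bad peers can outvote the good ones.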




                      Added Challenge: Efficiency


  •    Example: Current music sharing
       – everyone has authenticity function
       – but downloading files is expensive




 • Solution: track peer behavior

[Figure: a network containing good peers and a bad peer]

                             Issues


  •     Trust computations in dynamic system
  •     Overloading good nodes
  •     Bad nodes can provide good content sometimes
  •     Bad nodes can build up reputation
  •     Bad nodes can form collectives
  •     ...




                      Security & Privacy


  • Issues:
     – Anonymity
     – Reputation
     – Accountability
     – Information Preservation
     – Information Quality
     – Trust
     – Denial of service attacks



                             Blind Search Methods

       Modified BFS:

       Forward the query to only a fraction of the neighbors (a random subset)

               Iterative Deepening:


               Start a BFS with a small TTL and, if it fails, repeat the
               BFS at successively larger depths
               Works well when there is some stop condition and a
               “small” flood will satisfy the query
               Otherwise it causes even bigger loads than standard flooding


                                                    (more later …)
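
A minimal sketch of iterative deepening, assuming the overlay is an adjacency dictionary and the depth schedule is chosen by the caller. All names are hypothetical. For simplicity each round restarts the flood from the source, and a node also re-sends the query to the neighbor it received it from.

```python
from collections import deque

def bfs_within(graph, source, holders, depth):
    """Flood up to `depth` hops; return (found, messages sent)."""
    seen = {source}
    frontier = deque([(source, 0)])
    messages = 0
    found = source in holders
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue                      # depth limit reached
        for neighbor in graph[node]:
            messages += 1
            if neighbor not in seen:
                seen.add(neighbor)
                found = found or neighbor in holders
                frontier.append((neighbor, dist + 1))
    return found, messages

def iterative_deepening(graph, source, holders, depths):
    """Try each depth in turn; return (successful depth or None, total messages)."""
    total = 0
    for depth in depths:
        found, messages = bfs_within(graph, source, holders, depth)
        total += messages
        if found:
            return depth, total
    return None, total
```

On a four-node path 0-1-2-3 with the object at node 3, the schedule (1, 2, 3) costs 1 + 3 + 5 = 9 messages, whereas a single flood of depth 3 costs 5: when small floods do not satisfy the query, the total load is higher than with standard flooding.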

                            Blind Search Methods
 Random Walks:
 The node that poses the query sends out k query messages to an
 equal number of randomly chosen neighbors

 Each message then follows its own path; at each step, the current node
 forwards it to one randomly chosen neighbor

 Each such path is called a walker

 Two methods to terminate each walker:
  • TTL-based, or
  • checking (the walkers periodically check with the query source whether the
    stop condition has been met)


 This reduces the number of messages to k x TTL in the worst case
 Provides some kind of local load balancing
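
A minimal sketch of k-walker random-walk search with TTL-based termination. It assumes an adjacency-dictionary overlay in which every node has at least one neighbor; the names are hypothetical, and the walkers are simulated one after another rather than concurrently.

```python
import random

def random_walk_search(graph, source, holders, k, ttl, rng=None):
    """Return (found, messages) for k walkers, each limited to `ttl` hops."""
    rng = rng or random.Random()
    messages = 0
    found = source in holders
    neighbors = graph[source]
    # The source hands one walker to each of k distinct random neighbors
    # (or to all neighbors if it has fewer than k).
    starts = rng.sample(neighbors, min(k, len(neighbors)))
    for position in starts:
        messages += 1                     # message from the source to the first hop
        hops = 1
        while True:
            if position in holders:
                found = True              # this walker stops on a hit
                break
            if hops == ttl:
                break                     # TTL exhausted
            position = rng.choice(graph[position])
            messages += 1
            hops += 1
    return found, messages
```

Each walker sends at most `ttl` messages, so the total never exceeds k x TTL.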
                      Blind Search Methods


 Random Walks:

 In addition, the protocol biases its walks towards high-degree
 nodes (each step chooses the highest-degree neighbor)




                            Blind Search Methods


  Using Super-nodes:


  Super (or ultra) peers are connected to each other
  Each super-peer is also connected to a number of leaf nodes
  Queries are routed among the super-peers
               The super-peers then contact their leaf nodes




                       Blind Search Methods


Using Super-nodes:

Gnutella2
When a super-peer (or hub) receives a query from a leaf, it
forwards it to its relevant leaves and to neighboring super-peers
The hubs process the query locally and forward it to their
relevant leaves
Neighboring super-peers regularly exchange local repository
tables to filter out traffic between them




                       Blind Search Methods

  Ultrapeers can be installed (KaZaA) or self-promoted (Gnutella)




                                                 Interconnection between
                                                      the superpeers



                      Informed Search Methods


   Local Index

   Each node indexes all files stored at all nodes within a certain
   radius r and can answer queries on their behalf


   The search proceeds in steps of r: the hop distance between two
   consecutive searches is 2r+1


   Increased cost for join/leave
    • A node joining or leaving the network floods its neighborhood with TTL = r
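
A minimal sketch of how one node could build its local index of radius r. It assumes an adjacency-dictionary overlay and a dictionary mapping each node to the file names it stores; all names are hypothetical.

```python
from collections import deque

def build_local_index(graph, files, node, r):
    """Map each file name to the set of nodes within r hops of `node` storing it."""
    index = {}
    seen = {node}
    frontier = deque([(node, 0)])
    while frontier:
        current, dist = frontier.popleft()
        for name in files.get(current, ()):
            index.setdefault(name, set()).add(current)
        if dist < r:                      # do not expand beyond radius r
            for neighbor in graph[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, dist + 1))
    return index
```

With such an index, a node can answer a query for every file held within r hops, which is why only some of the nodes on the search path need to process the query.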




                        Informed Search Methods


 Intelligent BFS


               query          ...        ?

    Each node stores simple statistics about its neighbors:
    (query, NeighborID) tuples for requests recently answered by or
    through each neighbor,
    so that it can rank the neighbors

    For each new query, a node finds similar past queries and selects a direction

    How?


                        Informed Search Methods


 Intelligent or Directed BFS


               query          ...      ?

     •   Heuristics for Selecting Direction
          >RES: Returned most results for previous queries
          <TIME: Shortest satisfaction time
          <HOPS: Min hops for results
          >MSG: Forwarded the largest number of messages (all types),
            suggests that the neighbor is stable
          <QLEN: Shortest queue
          <LAT: Shortest latency
          >DEG: Highest degree

                           Informed Search Methods

 Intelligent or Directed BFS


               • No negative feedback
               • Depends on the assumption that nodes specialize in certain
               documents
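
The direction-selection heuristics of directed BFS can be sketched as a ranking over per-neighbor statistics. The statistics dictionary, its field names, and the function name are hypothetical; the leading ">" or "<" of a heuristic name decides whether larger or smaller values are preferred.

```python
def pick_neighbors(stats, heuristic, fanout):
    """Return the `fanout` best neighbors according to one heuristic.

    `stats` maps neighbor -> {statistic name: value}; `heuristic` is a
    string such as ">RES" (largest first) or "<LAT" (smallest first).
    """
    key = heuristic[1:].lower()
    prefer_large = heuristic.startswith(">")
    ranked = sorted(stats, key=lambda n: stats[n][key], reverse=prefer_large)
    return ranked[:fanout]

stats = {
    "A": {"res": 12, "lat": 80},
    "B": {"res": 3, "lat": 20},
    "C": {"res": 7, "lat": 50},
}
print(pick_neighbors(stats, ">RES", 2))  # ['A', 'C']
print(pick_neighbors(stats, "<LAT", 1))  # ['B']
```

Since the statistics only record successes, a neighbor that stops answering keeps its rank until others overtake it, which is the "no negative feedback" limitation.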




                        Informed Search Methods

 APS

 Again, each node keeps a local index with one entry per neighbor for each
 object it has requested;
 the entry reflects the relative probability that this neighbor is chosen to
 forward a query for that object

 k independent walkers and probabilistic forwarding
 Each node forwards the query to one of its neighbors based on the local
 index (for each object, choose a neighbor using the stored probability)

 If a walker succeeds, the probability is increased; otherwise it is decreased.
 The update takes the reverse path back to the requester, either after a
 walker miss (optimistic update) or after a hit (pessimistic update)
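
The forwarding and update steps can be sketched as follows. This is a simplified illustration, not the published algorithm: the constants `INITIAL`, `STEP`, and `FLOOR`, the index layout, and the function names are all assumptions, and a single update is applied along the reverse path after the walk ends.

```python
import random

INITIAL = 1.0   # assumed starting weight of every (object, neighbor) entry
STEP = 0.5      # assumed amount added on a hit or removed on a miss
FLOOR = 0.1     # assumed lower bound that keeps every neighbor selectable

def choose_next(indexes, node, obj, neighbors, rng):
    """Pick one neighbor with probability proportional to its stored weight."""
    weights = [indexes[node].get((obj, n), INITIAL) for n in neighbors]
    return rng.choices(neighbors, weights=weights, k=1)[0]

def update_reverse_path(indexes, obj, path, hit):
    """Walk the path backwards, rewarding (hit) or penalizing (miss) each hop.

    `path` is the list of (node, next_hop) pairs the walker followed.
    """
    delta = STEP if hit else -STEP
    for node, next_hop in reversed(path):
        old = indexes[node].get((obj, next_hop), INITIAL)
        indexes[node][(obj, next_hop)] = max(FLOOR, old + delta)
```

Over repeated queries, neighbors that led to hits accumulate weight and are chosen more often for the same object.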



     Topics in Database Systems: Data Management in
                   Peer-to-Peer Systems




        Q. Lv et al, “Search and Replication in Unstructured Peer-to-
        Peer Networks”, ICS’02




               Search and Replication in Unstructured Peer-to-Peer
                                       Networks


         Type of replication depends on the search strategy used


         (i)   A number of blind-search variations of flooding
         (ii) A number of (metadata) replication strategies




               Evaluation Method: Study how they work for a number of
               different topologies and query distributions




                                Methodology
 Three aspects of P2P

 Performance of search depends on

  • Network topology: graph formed by the p2p overlay network

  • Query distribution: the distribution of query frequencies for
    individual files

  • Replication: number of nodes that have a particular file

     Assumption: fixed network topology and fixed query distribution
     Results still hold, if one assumes that the time to complete a search
     is short compared to the time of change in network topology and in
     query distribution


               Network Topology




                        Network Topology
 (1) Power-Law Random Graph
 A 9239-node random graph
 Node degrees follow a power law distribution
               when the nodes are ranked from the most connected to the least,
               the i-th ranked node has $\omega / i^{a}$ neighbors,
               where $\omega$ is a constant
 Once the node degrees are chosen, the nodes are connected
 randomly




                       Network Topology

 (2) Normal Random Graph

 A 9836-node random graph




                            Network Topology

(3) Gnutella Graph (Gnutella)

A 4736-node graph obtained in Oct 2000
Node degrees roughly follow a two-segment power-law distribution




                         Network Topology

(4) Two-Dimensional Grid (Grid)

A two dimensional 100x100 grid




                           Query Distribution

    Assume m objects
    Let $q_i$ be the relative popularity of the i-th object (in terms of
    queries issued for it)

    Values are normalized: $\sum_{i=1}^{m} q_i = 1$

     (1) Uniform: All objects are equally popular
                                $q_i = 1/m$


     (2) Zipf-like
                                $q_i \propto 1 / i^{\alpha}$
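
A short sketch that produces a normalized Zipf-like query distribution for m objects. The function name is hypothetical; setting alpha = 0 gives the uniform case.

```python
def zipf_popularity(m, alpha):
    """Return [q_1, ..., q_m] with q_i proportional to 1 / i**alpha, summing to 1."""
    weights = [1.0 / (i ** alpha) for i in range(1, m + 1)]
    total = sum(weights)
    return [w / total for w in weights]

q = zipf_popularity(4, 1.0)
print([round(x, 2) for x in q])  # [0.48, 0.24, 0.16, 0.12]
```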



                             Replication
    Each object i is replicated on $r_i$ nodes and the total number of
    object copies stored is R, that is

                               $\sum_{i=1}^{m} r_i = R$
    (1) Uniform: All objects are replicated at the same number of
        nodes
                                 $r_i = R/m$

    (2) Proportional: The replication of an object is proportional to
        the query probability of the object
                                  $r_i \propto q_i$
    (3) Square-root: The replication of an object i is proportional to
        the square root of its query probability $q_i$
                                 $r_i \propto \sqrt{q_i}$
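
    The three replication strategies can be sketched as one function that
    splits a budget of R copies according to a weight per object. The function
    name is hypothetical, and the result is left fractional; how to round to
    whole replicas is not specified here and is left out.

```python
import math

def replication(q, R, strategy):
    """Return r_i for each object so that sum(r_i) == R.

    q        -- query probabilities q_i (assumed to sum to 1)
    strategy -- "uniform", "proportional" or "square-root"
    """
    if strategy == "uniform":
        weights = [1.0] * len(q)
    elif strategy == "proportional":
        weights = list(q)
    elif strategy == "square-root":
        weights = [math.sqrt(qi) for qi in q]
    else:
        raise ValueError("unknown strategy: " + strategy)
    total = sum(weights)
    return [R * w / total for w in weights]

q = [0.8, 0.2]  # toy example with two objects
print(replication(q, 100, "proportional"))
print(replication(q, 100, "square-root"))
```

    With the square-root rule, an object queried four times as often gets only
    twice as many copies.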

                 Query Distribution & Replication

    When the replication is uniform, the query distribution is
    irrelevant (since all objects are replicated by the same amount,
    search times are equivalent for both hot and cold items)
    When the query distribution is uniform, all three replication
    distributions are equivalent (uniform!)
    Thus, there are three relevant query-distribution/replication combinations


                (1) Uniform/Uniform
                (2) Zipf-like/Proportional
                (3) Zipf-like/Square-root




                              Metrics


    Pr(success): probability of finding the queried object before the
    search terminates

    #hops: delay in finding an object as measured in number of hops




                             Metrics


    #msgs per node: Overhead of an algorithm as measured in
    average number of search messages each node in the p2p has to
    process

    #nodes visited

    Percentage of message duplication

    Peak #msgs: the number of messages that the busiest node has
    to process (to identify hot spots)


    These are per-query measures
    An aggregated performance measure weights each query's result by
    its query probability
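
    One reading of the aggregated measure is a probability-weighted sum of the
    per-query values. The sketch below assumes exactly that; the function name
    and the sample numbers are illustrative only.

```python
def aggregate(per_object_metric, q):
    """Weight the per-query metric of each object by its query probability q_i."""
    return sum(qi * mi for qi, mi in zip(q, per_object_metric))

# Hypothetical example: object 1 takes 2 hops, object 2 takes 4 hops.
hops = [2, 4]
q = [0.75, 0.25]
print(aggregate(hops, q))
```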
                         Simulation Methodology

    For each experiment,
    First select the topology and the query/replication distributions


    For each object i with replication $r_i$, generate numPlace different sets
    of random replica placements (each set contains $r_i$ random nodes on
    which to place the replicas of object i)

    For each replica placement, randomly choose numQuery different nodes
    from which to initiate the query for object i


    Thus, we get numPlace x numQuery queries
    In the paper, numPlace = 10 and numQuery = 100 -> 1000 different
    queries per object
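
    The methodology for a single object can be sketched as two nested loops.
    The parameter names mirror numPlace and numQuery; the function name, the
    seed, and the node list are assumptions made for the example.

```python
import random

def generate_queries(nodes, r_i, num_place=10, num_query=100, seed=0):
    """Return (source, replica_set) pairs for one object.

    For each of num_place random placements of r_i replicas, pick
    num_query distinct source nodes that will issue the query.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(num_place):
        replicas = set(rng.sample(nodes, r_i))
        for source in rng.sample(nodes, num_query):
            trials.append((source, replicas))
    return trials

trials = generate_queries(list(range(500)), r_i=5)
print(len(trials))  # num_place * num_query queries for this object
```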



                      Limitation of Flooding
   Choice of TTL
    Too low, the node may not find the object, even if it
   exists
    Too high, burdens the network unnecessarily



                                               Search for an object
                                               that is replicated at
                                               0.125% of the nodes (~11
                                               nodes if total 9000)
                                               Note that TTL depends
                                               on the topology
                                               Also depends on
                                               replication (which is,
                                               however, unknown)

                    Limitation of Flooding

    Choice of TTL


                                             Overhead
                                             also depends on
                                             the topology




                     Limitation of Flooding


  There are many duplicate messages (due to cycles),
  particularly in high-connectivity graphs
  Multiple copies of a query are sent to a node by multiple
  neighbors


  Duplicated messages can be detected and not forwarded
  BUT, the number of duplicate messages can still be
  excessive and worsens as TTL increases
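
  A small sketch of TTL-limited flooding with duplicate detection, counting
  how many of the transmitted messages are duplicates. It assumes a node never
  sends the query back to the neighbor it first received it from and forwards
  a query only once; the function name and the toy graph are hypothetical.

```python
def flood(neighbors, source, holders, ttl):
    """Flood a query for up to ttl hops.

    Returns (found, messages, duplicates). A duplicate is a message that
    arrives at a node which has already seen the query; it is detected
    and not forwarded again, but it still costs a transmission.
    """
    seen = {source}
    frontier = [(source, None)]          # (node, neighbor it got the query from)
    messages = duplicates = 0
    found = source in holders
    for _ in range(ttl):
        next_frontier = []
        for node, parent in frontier:
            for nb in neighbors[node]:
                if nb == parent:
                    continue
                messages += 1
                if nb in seen:
                    duplicates += 1
                else:
                    seen.add(nb)
                    next_frontier.append((nb, node))
                    found = found or nb in holders
        frontier = next_frontier
    return found, messages, duplicates

# A 4-node cycle: the query reaches node 2 along two different paths.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(flood(cycle, source=0, holders={2}, ttl=2))
```

  Raising the TTL on the same cycle adds only duplicates once every node has
  been reached.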




               Limitation of Flooding




                                        Different nodes




          Limitation of Flooding: Comparison of the topologies

    Power-law and Gnutella-style graphs are particularly bad with
    flooding
               Highly connected nodes mean more duplicate
               messages, because many nodes’ neighbors overlap
    Random graph is best,
               because in a truly random graph the duplication ratio
               (the likelihood that the next node already received
               the query) is the same as the fraction of nodes visited
               so far, as long as that fraction is small


    Random graph also gives better load distribution among nodes



                   Two New Blind Search Strategies



        1. Expanding Ring – not a fixed TTL (iterative
           deepening)


        2. Random Walks (more details) – reduce number of
           duplicate messages




               Expanding Ring or Iterative Deepening

        Note that since flooding queries nodes in parallel, the search
        may not stop even if the object has been located

        Use successive floods with increasing TTL

         A node starts a flood with a small TTL
         If the search is not successful, the node increases the
        TTL and starts another flood
         The process repeats until the object is found


        Works well when hot objects are replicated more widely
        than cold objects


               Expanding Ring or Iterative Deepening (details)

        Need to define
         A policy: at which depths the iterations are to occur (i.e.
        the successive TTLs)
        A time period W between successive iterations
                after waiting for a time period W, if it has not
               received a positive response (i.e. the requested
               object), the query initiator resends the query with a
               larger TTL

        Nodes keep the IDs of queries for a period of W + ε
        A node that receives the same message as in the previous
        round does not process it again; it just forwards it
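
        A minimal sketch of expanding ring as successive floods driven by a
        policy (a list of TTLs). It is a simplification: each iteration is
        simulated as a fresh flood, the waiting period W is not modeled, and
        every transmission is counted, including ones sent back to the sender.
        Function names and the toy graph are hypothetical.

```python
def flood(neighbors, source, holders, ttl):
    """TTL-limited flood; returns (found, messages sent)."""
    seen, frontier, messages = {source}, [source], 0
    for _ in range(ttl):
        next_frontier = []
        for node in frontier:
            for nb in neighbors[node]:
                messages += 1
                if nb not in seen:       # duplicates are counted but not forwarded
                    seen.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    return bool(seen & holders), messages

def expanding_ring(neighbors, source, holders, policy):
    """Try each TTL in policy in turn; stop at the first successful flood.

    Returns (ttl that succeeded or None, total messages over all floods).
    """
    total = 0
    for ttl in policy:
        found, messages = flood(neighbors, source, holders, ttl)
        total += messages
        if found:
            return ttl, total
    return None, total

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(expanding_ring(path, source=0, holders={3}, policy=[1, 3, 5]))
```

        The total includes the cost of the failed small-TTL floods, which is
        the price paid for not having to guess a single TTL in advance.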



                            Expanding Ring

        Start with TTL = 1 and increase it linearly, by a step of 2,
        at each iteration

                                               For replication over
                                               10%, search stops at
                                               TTL 1 or 2




                          Expanding Ring

   Comparison of message overhead between flooding and
   expanding ring




    Even for objects that are replicated at 0.125% of the
    nodes, and even if flooding uses the best TTL for each topology,
    expanding ring still halves the per-node message overhead

                            Expanding Ring


        More pronounced improvement for Random and Gnutella
        graphs than for the PLRG, partly because the very high
        degree nodes in PLRG reduce the opportunity for
        incremental retries in the expanding ring


        Introduces a slight increase in the delay of finding an object:
        from 2-4 hops in flooding to 3-6 hops in expanding ring




                           Random Walks

     Forward the query to a randomly chosen neighbor at each step
     Each query message is a walker
     k-walkers:
     The requesting node sends k query messages and each query
     message takes its own random walk


     k walkers after T steps should reach roughly the same number of
     nodes as 1 walker after kT steps
     So k walkers cut the delay by a factor of k
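
     A sketch of a k-walker random walk with TTL-based termination. The
     walkers move in lock-step and the search stops at the first step in which
     any walker lands on a node holding the object; both are simplifying
     assumptions, as are the function name and the toy graph.

```python
import random

def k_random_walk(neighbors, source, holders, k, ttl, seed=0):
    """Run k independent walkers for at most ttl steps.

    Returns (hops until the object is found or None, messages sent).
    """
    if source in holders:
        return 0, 0
    rng = random.Random(seed)
    walkers = [source] * k
    messages = 0
    for step in range(1, ttl + 1):
        # Each walker is forwarded to one randomly chosen neighbor.
        walkers = [rng.choice(neighbors[w]) for w in walkers]
        messages += k
        if any(w in holders for w in walkers):
            return step, messages
    return None, messages

# Star graph: every leaf's only neighbor is the center node 0.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(k_random_walk(star, source=1, holders={0}, k=4, ttl=5))
```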


     16 to 64 walkers give good results

                            Random Walks

     When to terminate the walks
      TTL-based
      Checking: the walker periodically checks with the original
     requestor before walking to the next node (a larger TTL is
     still used, just to prevent loops)


     Experiments show that
               checking once every 4th step strikes a good balance
               between the overhead of the checking message and the
               benefits of checking
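
     The checking method can be sketched by letting walkers contact the
     requestor every check_every steps and stop once the object has been
     found. This is a simplified model: walkers move in lock-step, a walker
     that reaches the object stops immediately, and each check costs one
     message. All names and the toy graph are hypothetical.

```python
import random

def walk_with_checking(neighbors, source, holders, k, ttl, check_every=4, seed=0):
    """k-walker random walk with periodic checking.

    Returns (hops until found or None, walk messages, check messages).
    ttl is only a safety bound on the walk length.
    """
    if source in holders:
        return 0, 0, 0
    rng = random.Random(seed)
    walkers = [source] * k
    found_at = None
    walk_msgs = check_msgs = 0
    for step in range(1, ttl + 1):
        if not walkers:
            break
        moved = [rng.choice(neighbors[w]) for w in walkers]
        walk_msgs += len(moved)
        if found_at is None and any(w in holders for w in moved):
            found_at = step
        # Walkers that reached the object stop; the others keep going.
        walkers = [w for w in moved if w not in holders]
        if step % check_every == 0:
            check_msgs += len(walkers)
            if found_at is not None:
                walkers = []             # requestor tells them to terminate
    return found_at, walk_msgs, check_msgs

star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(walk_with_checking(star, source=1, holders={0}, k=4, ttl=20))
```

     A smaller check_every stops surviving walkers sooner but sends more
     checking messages, which is the trade-off described above.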




                          Random Walks

     When compared to flooding:
     The 32-walker random walk reduces message overhead by roughly
     two orders of magnitude for all queries across all network
     topologies at the expense of a slight increase in the number of
     hops (increasing from 2-6 to 4-15)


     When compared to expanding ring,
     The 32-walker random walk outperforms expanding ring as well,
     particularly in PLRG and Gnutella graphs




                          Random Walks

     Keeping State

      Each query has a unique ID and its k-walkers are tagged with
     this ID
      For each ID, a node remembers the neighbors to which it has
     forwarded the query
      When a new query with the same ID arrives, the node forwards
     it to a different neighbor (randomly chosen)

     Improves Random and Grid by reducing the message overhead by up
     to 30% and the number of hops by up to 30%
     Small improvements for Gnutella and PLRG
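
     A sketch of the per-node state: for each query ID, a node records the
     neighbors it has already used and prefers an unused one for the next
     walker with that ID. Falling back to any neighbor once all have been used
     is an assumption; the slides do not say what happens in that case. The
     function name and data layout are hypothetical.

```python
import random
from collections import defaultdict

def next_hop(neighbors, forwarded, node, query_id, rng):
    """Pick the neighbor to forward a walker of query_id to.

    forwarded maps (node, query_id) to the set of neighbors already used.
    """
    tried = forwarded[(node, query_id)]
    fresh = [nb for nb in neighbors[node] if nb not in tried]
    choice = rng.choice(fresh or neighbors[node])  # fallback is an assumption
    tried.add(choice)
    return choice

neighbors = {0: [1, 2, 3]}
forwarded = defaultdict(set)
rng = random.Random(0)
# Three walkers of the same query arriving at node 0 leave on three different links.
print(sorted(next_hop(neighbors, forwarded, 0, "q1", rng) for _ in range(3)))
```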



                            Principles of Search



     Adaptive termination is very important
               Expanding ring or the checking method
     Message duplication should be minimized
               Preferably, each query should visit a node just once

     Granularity of the coverage should be small
               Each additional step should not significantly
               increase the number of nodes visited




                 Replication

     Next time



