
									        Part II
Infrastructure of P2P
 Professor Shiao-Li Tsao
 Dept. Computer Science
 National Chiao Tung University

   Centralized P2P
   Unstructured P2P
   Structured P2P
   Hybrid P2P
   Hierarchical P2P

Major Operations of P2P

   Connect
       How to join a P2P overlay network?
   Search
       How to search an object in the P2P network?
   Download
       How to download the object?

     Part II-1
  Centralized P2P
Professor Shiao-Li Tsao
Dept. Computer Science
National Chiao Tung University

   Centralized index model
   Case study - Napster
Centralized Index Model (1/3)

   Utilize a central directory for object location,
    ID assignment, etc.
   For file-sharing P2P, peers inquire object
    locations from the central servers, then download
    directly from the peers that hold the objects
Centralized Index Model (2/3)

   (Figure: peer A uploads its index to the centralized
    repository when it connects (1); peer B queries the
    repository to search (2); B then downloads the object
    directly from A (3))
Centralized Index Model (3/3)

   Benefits
       Simplicity
       Efficient search
       Limited bandwidth usage
   Drawbacks
       Unreliable (single point of failure)
       Performance bottleneck
       Scalability limits (scale the central directory)
       Vulnerable to DoS attacks
       Copyright infringement
Case Study - Napster
   When and who
       January 1999
       Shawn Fanning, a freshman at Northeastern University
   Why
       Difficult to find and download music over networks
       Want to share music with friends
   How
       A program that allowed computer users to share and
        swap files, specifically music, through a centralized
        file server
       Napster – the first popular P2P file-sharing application
Napster: A Brief History(1/2)
   May 1999: Shawn Fanning (freshman, Northeastern
    Univ., dropout) founds Napster Inc.
   Success of Napster
       The first massively popular peer-to-peer file-sharing system
       Simultaneous users: 640,000 (November 2000)
       More than 60 million users downloaded the software
       Universities began to block Napster due to overwhelming
        bandwidth consumption
Napster: A Brief History(2/2)
   Law cases of Napster
      Dec. 7, 1999: Recording Industry Association of America (RIAA) sues Napster for
       copyright infringement
      April 13, 2000: Heavy metal rock group Metallica sues Napster for copyright
       infringement
      April 27, 2000: Rapper Dr. Dre sues Napster
      July 2000: Court orders Napster to shut down
      Oct. 2000: German media firm Bertelsmann becomes a partner and drops lawsuit
      July 2001: Napster’s server shut down completely
      May 17, 2002: Napster announced that its assets would be acquired by
       Bertelsmann for $8 million
   Reborn of Napster
      Nov. 2002: Napster's brand and logos were acquired at bankruptcy auction by the
       company Roxio, Inc.
      May 2003: Roxio acquired pressplay for the re-launch of Napster as a music
       pay service
      Oct. 2003: Napster 2.0 was announced
      2005: The second most popular legal music service overall (behind Apple iTunes)
      Revenue $94.69M in 2005, $108.73M in 2006.
Napster: System Overview

   A large cluster of dedicated central servers
    maintains an index of shared files
   The centralized servers also monitor the state
    of each peer and keep track of metadata
       The metadata is returned with the results of a query
   Each peer maintains a connection to one of
    the central servers
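The centralized index can be sketched as a simple in-memory directory. A minimal Python sketch (class and method names are illustrative, not the actual Napster protocol):

```python
# Toy sketch of a Napster-style central index (hypothetical names,
# not the real Napster protocol).

class CentralIndex:
    """Central directory mapping file names to the peers sharing them."""

    def __init__(self):
        self.index = {}          # file name -> set of peer addresses

    def connect(self, peer_addr, file_list):
        """A peer uploads its shared-file list when it connects."""
        for name in file_list:
            self.index.setdefault(name, set()).add(peer_addr)

    def search(self, keyword):
        """Return peers sharing files whose name contains the keyword."""
        return {name: sorted(peers)
                for name, peers in self.index.items()
                if keyword in name}

server = CentralIndex()
server.connect("10.0.0.1:6699", ["song_a.mp3", "song_b.mp3"])
server.connect("10.0.0.2:6699", ["song_b.mp3"])
print(server.search("song_b"))   # both peers share song_b.mp3
```

The actual file transfer then happens directly between the two peers; only the lookup goes through the server.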
Napster Operation: Connect (1/4)
A file list and IP address
are uploaded
Napster Operation: Search (2/4)
User requests search
at server


Napster Operation: Download (3/4)
User pings hosts that
apparently have the data to
find the best transfer rate
Napster Operation: Download (4/4)
User chooses to initiate a
file exchange directly
Napster: Summary
   Napster is not a pure P2P system but it was the first one that
    raised important issues to the P2P community
   Hybrid decentralized unstructured system
      File transfer is decentralized but locating content is centralized

      Combination of client/server and P2P approaches

    The Napster protocol is proprietary
       Stanford University senior David Weekly posted the protocol
        online
       Napster requested that he remove it, but Weekly created the
        OpenNap project instead
    Napster introduces two major problems
       Unreliable: the central indexing server represents a single
        point of failure
       Legal responsibility for music file sharing
     Part II-2
 Unstructured P2P
Professor Shiao-Li Tsao
Dept. Computer Science
National Chiao Tung University

   Introduction
   Flooded requests model
   Gnutella
   Structella: Gnutella on a structured overlay
   Supernode model
   FastTrack and Kazaa

   Without a central index, a peer may blindly
    flood a query to the network, among peers or
    among supernodes
       Flooded requests model
       Supernode model (hierarchical)
Flooded Requests Model (1/3)

   A pure P2P model
   Each request is flooded (broadcast) to
    directly connected peers, which in turn flood
    their own neighbors
       Until the request is answered or a certain
        scope (TTL limit) is exceeded
   This model is used by Gnutella
Flooded Requests Model (2/3)


Flooded Requests Model (3/3)

   Benefits
       Highly decentralized
       Reliability and fault-tolerance
   Drawbacks
       Excessive query traffic
       Not scalable
        May fail to find content that is actually in the system
Case Study - Gnutella

   Gnutella is an open, decentralized P2P
    search protocol
   Gnutella uses a fully distributed architecture
    without any central entity
   It relies on flooding to route messages through
    the Gnutella network
Gnutella: A Brief History
   Developed by Justin Frankel and Tom Pepper of
    Nullsoft (Winamp)
   Gnutella = GNU + Nutella
   March 14, 2000: released and thousands downloaded
    that day
   The next day, AOL (the owner of Nullsoft) withdrew it
   Within a few days the protocol had been reverse
    engineered by Bryan Mayland and became open source
   The development of the Gnutella protocol is currently
    led by the GDF (Gnutella Developer Forum)
   Several open source Gnutella clients
      LimeWire, Morpheus, Gnucleus, etc.
Gnutella System Overview

   Built at the application level
   Peers self-organize into an application-level
    overlay network
   Each peer maintains connections with other
    peers (e.g., 4)
   Search: each peer initiates a controlled
    flooding through the network by sending a
    query packet to all of its neighbors
       TTL is decremented on each hop
Gnutella Protocol (1/4)
   Discovery of peers and searching for files are implemented by
    passing five descriptors (message types) between nodes
     Group membership message: Ping and Pong

     Search message: Query and QueryHit

     File transfer message: PUSH

   Each message has a randomly generated identifier
   Messages can be broadcast or simply back-propagated
   To prevent re-broadcasting and to implement back-propagation,
    each node keeps a short memory of recently routed messages
   Messages are flagged with time-to-live (TTL) and “hops passed”
    counters
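The rules above (random message IDs, TTL decremented per hop, a short memory of seen messages) can be sketched as TTL-limited flooding in Python. Names are illustrative, and QueryHit back-propagation is shortcut as a direct callback to the origin:

```python
# Minimal sketch of Gnutella-style query flooding with duplicate
# suppression; not the wire protocol, just the routing idea.
import uuid

class Peer:
    def __init__(self, name):
        self.name = name
        self.neighbors = []
        self.files = set()
        self.seen = set()        # short memory of routed message IDs
        self.hits = []

    def query(self, keyword, ttl=3):
        msg_id = uuid.uuid4().hex        # random message identifier
        self.receive(msg_id, keyword, ttl, origin=self)

    def receive(self, msg_id, keyword, ttl, origin):
        if msg_id in self.seen:          # prevent re-broadcasting
            return
        self.seen.add(msg_id)
        if any(keyword in f for f in self.files):
            origin.hits.append(self.name)   # QueryHit (shortcut callback)
        if ttl > 1:                      # TTL decremented on each hop
            for n in self.neighbors:
                n.receive(msg_id, keyword, ttl - 1, origin)

# line topology a - b - c - d; c and d both share the file
a, b, c, d = (Peer(x) for x in "abcd")
a.neighbors = [b]; b.neighbors = [a, c]; c.neighbors = [b, d]; d.neighbors = [c]
c.files = {"song.mp3"}; d.files = {"song.mp3"}
a.query("song", ttl=3)
print(a.hits)   # ['c'] -- d is 3 hops away, beyond the TTL
```

Raising the TTL widens the search scope at the cost of more traffic, which is exactly the scalability trade-off discussed below.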
Gnutella Protocol Message Types

Type       Description                   Contained Information
Ping       Announce availability and     None
           probe for other servents
Pong       Response to a Ping            IP address and port # of responding
                                         servent; number and total KB of
                                         files shared
Query      Search request                Minimum network bandwidth of
                                         responding servent; search criteria
QueryHit   Returned by servents that     IP address, port # and network
           have the requested file       bandwidth of responding servent;
                                         number of results and result set
Push       File download requests        Servent identifier; index of requested
           for firewalled servents       file; IP address and port to send file to
Gnutella Protocol: Connect (2/4)

   Peer discovery
       IRC (Internet Relay Chat), web pages, Ping-Pong
       Send GNUTELLA CONNECT to a node at a known
        address, then wait for GNUTELLA OK
Gnutella Connect Operation

   Each peer connects to any peer already on the Gnutella
    network
   Upon connecting,
      The peer announces its presence to its neighboring peers
      The neighboring peers propagate the announcement
       until it reaches all peers on the network
   Upon receiving the announcement
      The contacted peer responds with a bit of information
       about itself
      For example, the number of files and the amount of disk
       space the particular peer shares with the network
Gnutella Discovery Operation

   (Figure: Ping messages are flooded to neighbors; Pong
    messages are back-propagated along the reverse path)
Ping/Pong Routing Example

Gnutella Protocol: Search (3/4)

   Search
       Query messages, which contain a user-specified
        search string, are broadcast
       The neighboring peers propagate the search
        query as well as match it against locally stored files
       QueryHit messages are back-propagated replies
        to Query messages and include information
        necessary to download a file
Gnutella Search & Transfer Operations
Gnutella Protocol: Download (4/4)
   File download
       The file download protocol is HTTP, i.e., HTTP
        GET request
           Each Gnutella peer has web browser functions built-in
       A Push request is sent to the file provider when
        the provider is behind a firewall

                        Peer is behind a firewall
Query/QueryHit/Push Routing Example

Gnutella Measurement (1/3)

   In November 2000
       36% user-generated traffic
        (QUERY messages)
       55% overhead traffic (PING
        and PONG messages)
   In June 2001, these
    problems were solved with
    the arrival of newer Gnutella
    implementations
       92% QUERY messages
       8% PING messages
Gnutella Measurement (2/3)

   In power-law networks,
    most nodes have few
    links and a tiny number
    of hubs have a large
    number of links
   Gnutella is similar to a
    power-law network,
    thus being able to
    operate in highly
    dynamic environments
Gnutella Measurement (3/3)
   Jovanović et al. developed a
    Gnutella network crawler
    called gnutcrawl
       Obtained several instances of
        the Gnutella network topology
        between Nov. 13 and Dec. 28,
           One topology instance consists
            of 1026 nodes and 3752 edges
           The diameter of the graph is 8
       Results
           The Gnutella network topology
            exhibits strong small-world properties
           A power-law distribution of
            node degrees was discovered
Free Riding on Gnutella

   Peers that free ride on
    Gnutella are those that
    only download files for
    themselves without
    ever providing files for
    download by others
   Most Gnutella users
    are free riders
       22,084 (66%) of the
        peers share no files
       24,347 (73%) share ten
        or fewer files
Gnutella Summary

   A fully distributed peer-to-peer protocol
       Reliability and fault-tolerance properties
       Flooding raises questions of cost and scalability
   The current Gnutella protocol can not scale
    beyond a network size of a few thousand
    nodes without becoming fragmented
[1] The Gnutella Protocol Specification v4.0.
[2] M. Ripeanu, “Peer-to-Peer Architecture Case Study: Gnutella
    Network,” University of Chicago Technical Report TR-2001-26.
[3] M. Jovanović, F. Annexstein, and K. Berman “Scalability issues
    in large peer-to-peer networks - a case study of Gnutella,”
    Technical report, University of Cincinnati, 2001.
[4] E. Adar and B. Huberman, “Free Riding on Gnutella,” First
    Monday 5, October 2000.
[5] M. Portmann, P. Sookavatana, S. Ardon, and A. Seneviratne,
    “The cost of peer discovery and searching in the Gnutella peer-
    to-peer file sharing protocol,” in Proc. of ICON’01, Vol. 1, pp.
    263-268, 2001.
   Part II-3
Structured P2P
    Professor Shiao-Li Tsao
   Dept. Computer Science
National Chiao Tung University

   Introduction
   Document routing model
   CAN
   Chord
   Pastry

   Most P2P file-sharing systems before had limitations
       Napster is expensive and vulnerable
       Gnutella is not scalable and may fail to find content
    How to make a scalable peer-to-peer file distribution
     system?
       Phase 1: find the peer from whom to retrieve the file
         Needs scalable indexing mechanisms
       Phase 2: the peer-to-peer file transfer process
         It is inherently scalable
Document Routing Model (1/5)

   Each peer is assigned a random or hashed
    ID and knows a given number of peers
   An ID is assigned to every shared document
    based on a hash function
   A document is published (shared) to the peer
    with the ID that is most similar to the
    document ID
   A request will go to the peer with the ID most
    similar to the document ID
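The ID-matching idea above can be sketched in a few lines. The hashing scheme and the "numerically closest ID" metric here are simplifications; real systems (CAN, Chord, Pastry) route through a structured overlay instead of scanning all peers:

```python
# Sketch of the document routing model: peers and documents get hashed
# IDs, and a document lives at the peer whose ID is closest to its own.
import hashlib

def make_id(data, bits=16):
    """Hash a string into a fixed-size identifier space."""
    return int(hashlib.sha1(data.encode()).hexdigest(), 16) % (2 ** bits)

peer_ids = sorted(make_id(f"peer-{i}") for i in range(8))

def responsible_peer(doc):
    doc_id = make_id(doc)
    return min(peer_ids, key=lambda p: abs(p - doc_id))

publish_to = responsible_peer("report.pdf")   # publish goes here
lookup_at  = responsible_peer("report.pdf")   # a later request hashes the
assert publish_to == lookup_at                # same name to the same peer
```

Because publisher and requester apply the same hash, both deterministically arrive at the same peer without any central index.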
Document Routing Model (2/5)
   (Figure: the file with File ID = h(data) = 0008 is shared
    by peer ID 5000; the index entry “0008 at ID 5000” is
    held by peer ID 0100, whose ID is closest to the file ID;
    a request from peer ID 1000 is routed via peer ID 0200 to
    ID 0100, and the file is then transferred from peer ID
    5000)
Document Routing Model (3/5)

   Benefits
       Scalability: more efficient searching
       Logarithmic bounds to locate a document
       Fault tolerance
   Drawbacks
       Routing table maintenance
       Network partitioning may cause an islanding problem
Document Routing Model (4/5)

   P2P systems that implement the document
    routing model
       CAN: S. Ratnasamy et al., UC Berkeley, 2001
       Chord: I. Stoica et al., MIT and UC Berkeley, 2001
       Pastry: A. Rowstron and P. Druschel, Microsoft
        Research, UK, 2001
       Tapestry: Ben Y. Zhao et al., UC Berkeley, 2001
   These systems are also known as DHT-
    based P2P systems
Document Routing Model (5/5)

   The hash table is a data structure that
    efficiently maps keys onto values
   The distributed hash table (DHT)
       A distributed, Internet-scale hash table
       Lookup, insertion and deletion of (key, value) pairs
       Only supports exact-match search, rather than
        keyword search
Content-Addressable Network (CAN)
   CAN resembles a hash
    table and performs
    operations such as
    insertion, lookup and
    deletion of (key,value) pairs
   CAN space
       A virtual d-dimensional
        Cartesian coordinate space
        on a d-torus
       This virtual coordinate
        space is used to store
        (key,value) pairs
       The entire coordinate space
        is dynamically partitioned
        among all the nodes in the
        system
CAN System Overview

   Every CAN node owns its distinct zone
   Every CAN node learns and maintains a
    coordinate routing table that holds the IP
    addresses and virtual coordinate zone of each
    of its immediate neighbors
   A key is mapped onto a point P in the
    coordinate space using uniform hash function
   The (key, value) pair is stored at the node that
    owns the zone within which the point P lies
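The key-to-point mapping can be sketched in Python. The zone layout below is the 5-node example used in the next slides; the hash-to-coordinate scheme is an assumption (any uniform hash works):

```python
# Sketch of CAN's key-to-point mapping: a uniform hash turns a key into
# a point P in the 2-d unit square; the node owning the zone containing
# P stores the (key, value) pair.
import hashlib

def key_to_point(key):
    h = hashlib.sha1(key.encode()).digest()
    x = int.from_bytes(h[:4], "big") / 2**32     # hash -> [0,1) coordinates
    y = int.from_bytes(h[4:8], "big") / 2**32
    return (x, y)

# node -> zone, zone = (x_lo, x_hi, y_lo, y_hi), as in the slides' example
zones = {1: (0.0, 0.5, 0.5, 1.0), 2: (0.5, 0.75, 0.5, 1.0),
         3: (0.5, 1.0, 0.0, 0.5), 4: (0.75, 1.0, 0.5, 1.0),
         5: (0.0, 0.5, 0.0, 0.5)}

def owner(point):
    """Return the node whose zone contains the point."""
    x, y = point
    for node, (xl, xh, yl, yh) in zones.items():
        if xl <= x < xh and yl <= y < yh:
            return node

p = key_to_point("song.mp3")
print("key maps to point", p, "stored at node", owner(p))
```

Any node can evaluate the same hash, so a lookup for the key routes to the same point P and hence the same owner.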
CAN Example (1/4)

   Every CAN node owns
    a zone in the overall
    coordinate space
   Example
       2-dimensional [0,1]×[0,1]
        space with 5 nodes joining
        in succession
       For a 2-d space, split
        along the X-axis first, then Y,
        then X again, followed by
        Y, and so forth
CAN Example (2/4)


   (Figure: node 1 first owns the entire space; when node 2
    joins, the space is split along X between nodes 1 and 2)
CAN Example (3/4)

   (Figure: after all 5 joins the zones are
    node 1: (0.0-0.5, 0.5-1.0), node 2: (0.5-0.75, 0.5-1.0),
    node 4: (0.75-1.0, 0.5-1.0), node 3: (0.5-1.0, 0.0-0.5),
    node 5: (0.0-0.5, 0.0-0.5))
CAN Example (4/4)

   (Figure: the 5 CAN nodes and the corresponding binary
    partition tree; think of each existing zone as a leaf.
    The root splits along x into (0.0-0.5, *) and (0.5-1.0, *);
    the left subtree splits along y into leaves 5 and 1; the
    right subtree splits along y into leaf 3 and (0.5-1.0,
    0.5-1.0), which splits along x into leaves 2 and 4)
CAN Construction (1/4)

   When a new node joins, an existing node
    splits its allocated zone in half, retains half
    and hands the other half to the new one
       First find a node already in the CAN
       Next, find a node whose zone will be split
       Finally, neighbors of the split zone are notified
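The halving step can be sketched as follows (axis order and tuple layout are illustrative choices, matching the x, y, x, y ... split order from the example):

```python
# Sketch of the CAN zone split: the occupant halves its zone along the
# current axis and hands one half to the joining node.

def split_zone(zone, axis):
    """zone = (x_lo, x_hi, y_lo, y_hi); axis 0 = x, 1 = y.
    Returns (retained_half, handed_over_half)."""
    xl, xh, yl, yh = zone
    if axis == 0:
        mid = (xl + xh) / 2
        return (xl, mid, yl, yh), (mid, xh, yl, yh)
    mid = (yl + yh) / 2
    return (xl, xh, yl, mid), (xl, xh, mid, yh)

whole = (0.0, 1.0, 0.0, 1.0)
old, new = split_zone(whole, axis=0)      # first join: split along x
print(old, new)   # (0.0, 0.5, 0.0, 1.0) (0.5, 1.0, 0.0, 1.0)
```

The (key, value) pairs whose points fall in the handed-over half migrate with it, and both nodes then update their neighbor sets.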
CAN Construction (2/4)

   Bootstrapping
       A new node looks up the
        CAN domain name in
        DNS to retrieve a
        bootstrap node’s IP address
       The bootstrap node then
        supplies the IP
        addresses of nodes
        currently in the system
CAN Construction (3/4)
   Finding a zone
       The new node randomly
        chooses a point P and
        sends a JOIN request
        destined for point P
       Each existing CAN node
        uses the CAN routing
        mechanism to forward the
        request
       The current occupant node
        splits its zone in half and
        assigns one half to the new
        node
       The (key, value) pairs from
        the half zone to be handed
        over are transferred to the
        new node
CAN Construction (4/4)

   Joining the routing
       The new node learns the IP
        addresses of its coordinate
        neighbor set from the
        previous occupant
       The previous occupant
        updates its neighbor set
       Both the new and old
        nodes’ neighbors are
        informed of this reallocation
        of space
           Through immediate and
            periodic update messages
CAN Routing
   Data stored in the CAN is addressed by name (i.e. key) not
    location (i.e. IP address)
   A node routes a message towards its destination by greedy
    forwarding to the neighbor with closest coordinates
      If nodes crash, many different paths exist (routing fault tolerance)
      If a node loses all its neighbors, and repair mechanisms haven’t
       yet rebuilt neighbor states, it performs an expanding ring search
       to locate and forward to a closer node
    For d dimensions partitioned into n equal zones
       Each node maintains 2d neighbors
       The average routing path length: (d/4)(n^(1/d)) hops
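Greedy forwarding can be sketched on a toy grid of equal zones (the grid and Euclidean metric are illustrative stand-ins for real CAN state):

```python
# Sketch of CAN greedy forwarding: each step moves the message to the
# neighbor whose zone centre is closest to the destination point.

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

# 4x4 grid of unit zones; each node is identified by its zone centre
nodes = [(x + 0.5, y + 0.5) for x in range(4) for y in range(4)]

def neighbors(n):
    return [m for m in nodes if dist(n, m) == 1.0]   # adjacent zones only

def route(src, dest):
    path, cur = [src], src
    while dist(cur, dest) > 0:
        cur = min(neighbors(cur), key=lambda m: dist(m, dest))
        path.append(cur)
    return path

path = route((0.5, 0.5), (3.5, 2.5))
print(path)          # moves one zone per hop toward the destination
print(len(path) - 1) # 5 hops in this toy grid
```

On this 4x4 grid (d = 2, n = 16) the sketch takes the Manhattan distance in hops, consistent with the (d/4)(n^(1/d)) average for real CAN.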
CAN Routing Example

   A CAN message includes
    the destination coordinates
   Using the neighbor
    coordinate set, a node
    routes a message towards
    its destination by greedy
    forwarding to the neighbor
    with closest coordinates
   Example: 2-dimensional
    space
CAN Node Departure

   When a node leaves, it explicitly hands over its
    zone and the associated (key, value) database
    to one of its neighbors
      Preferably a neighbor whose zone can merge with the
       departing node’s zone into a single valid zone
      Otherwise, the neighbor whose zone volume is smallest
       takes over
          A node may then hold more than one zone, which
           results in fragmentation of the space
CAN Maintenance (1/3)

   The (key, value) pairs are periodically
    refreshed by the holders of the data
   Each node sends periodic update messages to
    its neighbors giving its zone coordinates and a
    list of its neighbors and their zone coordinates
      The absence of an update message signals the
       failure of a neighbor
CAN Maintenance (2/3)

   When node or network failures happen, the takeover
    mechanism ensures one of the failed node’s
    neighbors can take over the zone
       A neighboring node is chosen that is still alive and has a
        small zone volume
       However, the (key, value) pairs held by the failed node are
        lost until the state is refreshed
   When the simultaneous failure of multiple adjacent
    nodes are involved
       First perform an expanding ring search to rebuild sufficient
        neighbor states
       Initiate the takeover mechanism
CAN Maintenance (3/3)

   A background zone-reassignment algorithm retains
    the one-to-one node-to-zone assignment and
    prevents fragmentation
    When a leaf x is removed
       Find a leaf node y that is
            Either x’s sibling, so that zones x and y merge into a single
             valid zone
            Or a descendant of x’s sibling where y’s sibling is also a leaf
       y takes over x’s zone
       y’s sibling takes over y’s previous zone
    Zone Reassignment Example

   (Figure: the takeover algorithm picks the neighbor with
    the smaller zone volume. Leaf node 5 is node 4’s sibling
    and can merge zones 4 and 5. Node 10 is a descendant of
    node 9, and the sibling of node 10 (node 11) is also a
    leaf, so node 11 takes over the combined zone; a DFS
    finds the two sibling leaves. Node 6 discovers sibling
    nodes 10 and 11 by background reassignment)
CAN Design Improvements (1/5)

   The hops in the CAN path are application-
    level hops, not IP-level hops
       The latency of each hop might be substantial
   The average latency of a lookup is the
    average number of CAN hops times the
    average latency of each CAN hop
   The improvements aim to reduce the latency
    of CAN routing
       Reduce either the path length or the per-CAN-hop
        latency
CAN Design Improvements (2/5)

   Multiple dimensions
       Increasing the number of
        dimensions implies that a
        node has more neighbors
   Multiple coordinate spaces
       Each coordinate space is
        called a reality
       Increasing the number of
        realities implies that distant
        portions of the coordinate
        space may be reached in
        a single hop
CAN Design Improvements (3/5)

   Multiple dimensions vs.
    multiple realities
   Both yield shorter path
    lengths, but higher per-node
    neighbor state and
    maintenance traffic
       With r realities, a single node is assigned
        r zones, one on every coordinate space
       The path length scales as O(d(n^(1/d)))
   Increasing d results in
    shorter path lengths than
    increasing r
   But multiple realities improve
    data availability and
    routing fault-tolerance
   (Figure: # hops vs. # neighbors maintained per node)
CAN Design Improvements (4/5)

   Better CAN routing metrics
       RTT-weighted routing
   Overloading coordinate zones
       Multiple nodes share the same zone
CAN Design Improvements (5/5)
   Multiple hash functions
       Use k hash functions to map a single key onto k points
   Topologically-sensitive construction of the CAN overlay network
       Distributed binning based on distances from landmarks
   More uniform partitioning
   Caching and replication techniques for “hot spot” management
CAN Summary

   CAN is completely distributed, scalable, and
    fault-tolerant
   For d dimensions and n nodes
       Per-node neighbor state: O(d)
       A node insertion affects O(d) existing nodes
            Independent of the number of nodes in the system
       The routing path length: O(d·n^(1/d)) hops
       When d = log n, CAN resembles the other DHT
        algorithms (O(log n))

Chord

   Chord is a distributed lookup protocol that tries to
    efficiently find the location of the node that stores a
    desired data item
   Chord protocol supports just one operation: given a
    key, it maps the key onto a node
   Chord maps its nodes to a one-dimensional space
   In an N-node network, each node maintains
    information about only O(logN) other nodes, and a
    lookup requires only O(logN) messages
Chord System Overview

   m-bit key/node identifiers using the SHA-1 hash function
    (m must be large enough)
   These identifiers are ordered on an identifier circle
    modulo 2^m
       Chord ring: one-dimensional circular key space
   Key k is assigned to its successor, the first node
    whose identifier is equal to or follows k in the
    identifier space
   Each node maintains
       A routing table with (at most) m entries, called the finger table
       A pointer to the previous node on the identifier circle, called
        the predecessor
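The key-to-node assignment can be sketched in a few lines, using the 10-node, m = 6 ring from the slides' example:

```python
# Sketch of Chord's key assignment: key k is stored at successor(k),
# the first node whose identifier equals or follows k on the ring
# (mod 2^m). The node list matches the m = 6 example ring.
m = 6
node_ids = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(k):
    k %= 2**m
    for n in node_ids:
        if n >= k:
            return n
    return node_ids[0]        # wrap around the identifier circle

print(successor(54))   # 56
print(successor(10))   # 14
print(successor(38))   # 38
print(successor(60))   # 1 (wraps around)
```

These match the example ring below: successor(K54) is N56, successor(K10) is N14, successor(K38) is N38.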
Example: Chord Ring with m=6

  successor(K54) is N56

                                                        successor(K10) is N14
                                                             predecessor(N14) is N8

    successor(K38) is N38

An identifier circle consisting of 10 nodes storing 5 keys
The Finger Table

   The i-th entry at node n contains the identity of
    the first node s that succeeds n by at least 2^(i-1)
    on the identifier circle, where 1 ≤ i ≤ m
   i-th finger: s = successor((n + 2^(i-1)) mod 2^m)
   A table entry includes both the Chord identifier
    and the IP address of the relevant node
   The first finger of node n is its immediate
    successor, which is also called the successor
 Example: The Finger Table

Finger table entries point to the first node greater than
or equal to a distance 2^(i-1) away from the node, for
1 ≤ i ≤ m, modulo 2^m
   N14 is the first node that succeeds (8+2^(1-1)) mod 2^6 = 9
   N42 is the first node that succeeds (8+2^(6-1)) mod 2^6 = 40
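Building a whole finger table from this rule can be sketched as follows (same 10-node, m = 6 example ring; fingers are recomputed rather than maintained incrementally):

```python
# Sketch of finger-table construction: entry i points at
# successor((n + 2^(i-1)) mod 2^m).
m = 6
node_ids = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(k):
    """First node whose ID is >= k, wrapping around the circle."""
    return next((n for n in node_ids if n >= k), node_ids[0])

def finger_table(n):
    return [successor((n + 2**(i - 1)) % 2**m) for i in range(1, m + 1)]

print(finger_table(8))    # [14, 14, 14, 21, 32, 42]
```

The first entry is N8's immediate successor N14, and the last entry (N8+32) is N42, matching the example above.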
Chord Ring Construction

   Decide the value of m
   Map identifiers (0 to 2^m - 1) to nodes or keys
   Identifiers are ordered on an identifier circle
    modulo 2^m
   The first node joins the Chord ring
Chord Node Join

   When node n first starts
       It calls n.join(n’); this
        function asks n’ to find the
        immediate successor of n
       Or n.create() to create a
        new Chord network
Node Join Steps

   Node n tries to join an existing Chord ring
       s = Lookup(n)
       Copy the keys in (s.predecessor, n] from s
        to n
       n.predecessor = s.predecessor
       n.successor = s
       s.predecessor.successor = n
       s.predecessor = n
       Update finger tables
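The steps above can be sketched as pointer updates on a doubly linked ring (wrap-around key intervals are ignored for simplicity, and finger-table updates are omitted):

```python
# Sketch of Chord node join: splice node n in front of its successor s
# and move the keys n is now responsible for.

class Node:
    def __init__(self, ident):
        self.id = ident
        self.predecessor = self
        self.successor = self
        self.keys = set()

def join(n, s):
    """Insert node n just before its successor s on the ring."""
    p = s.predecessor
    # copy the keys in (p.id, n.id] from s to n (no wrap-around handling)
    moved = {k for k in s.keys if p.id < k <= n.id}
    n.keys |= moved
    s.keys -= moved
    n.predecessor, n.successor = p, s     # splice n between p and s
    p.successor = n
    s.predecessor = n

a, b = Node(3), Node(14)
a.successor = a.predecessor = b
b.successor = b.predecessor = a
b.keys = {5, 9, 12}
n = Node(10)
join(n, b)                   # successor(10) is node 14
print(sorted(n.keys))        # [5, 9] move to the new node; 12 stays at 14
```

Real Chord does not rely on an atomic join like this; the stabilization protocol described later repairs successor and predecessor pointers incrementally.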
 Example: Chord Node Join

   (Figure: a Chord ring with m=4, identifiers 0-15; node
    N14 joins and takes over the keys less than 14 from its
    successor. Finger tables shown in the figure:
    N14: N14+1 [15,0) N3; N14+2 [0,2) N3; N14+4 [2,6) N3; N14+8 [6,14) N3
    N3:  N3+1 [4,5) N5; N3+2 [5,7) N5; N3+4 [7,11) N10; N3+8 [11,3) N3
    N5:  N5+1 [6,7) N10; N5+2 [7,9) N10; N5+4 [9,13) N10; N5+8 [13,5) N3
    N10: N10+1 [11,12) N14; N10+2 [12,14) N14; N10+4 [14,2) N14;
    N10+8 [2,10) N3; the N10 entries previously pointing to N3
    are updated to N14 after the join)
Simple Key Location

   Key lookup: determine the
    successor of the key
   Each node only knows how
    to contact its current
    successor node on the
    identifier circle
   A lookup uses a number of
    messages linear in the
    number of nodes
   Example
       Path taken by a query from
        node 8 for key 54
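With only successor pointers, a lookup must walk the ring one node at a time; a minimal sketch (interval test written out as a helper):

```python
# Sketch of simple (linear) key location: O(N) messages per lookup.
node_ids = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # m = 6 ring
successor_of = {n: node_ids[(i + 1) % len(node_ids)]
                for i, n in enumerate(node_ids)}

def between(key, a, b):
    """Is key in the circular interval (a, b]?"""
    return a < key <= b if a < b else key > a or key <= b

def linear_lookup(start, key):
    hops, cur = [], start
    while not between(key, cur, successor_of[cur]):
        cur = successor_of[cur]      # hand the query to the next node
        hops.append(cur)
    return successor_of[cur], hops

owner, hops = linear_lookup(8, 54)
print(owner, len(hops))   # 56 7 -- visits most of the 10-node ring
```

Seven intermediate hops on a ten-node ring is exactly the linear cost the finger table is introduced to avoid.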
Scalable Key Location
   Lookup using the finger table
       Node n calls
        find_successor(id) to find
        the successor node of an
        identifier id
       If id falls between n and
        its successor, the lookup is
        done; otherwise, n searches
        its finger table for the node
        n’ whose ID most immediately
        precedes id
       The closer n’ is to id, the
        more it will know about the
        identifier circle in the region
        of id
   Theorem: the number of
    nodes that must be
    contacted to find a
    successor in an N-node
    network is O(logN)
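The finger-based lookup can be sketched as follows (fingers are recomputed on the fly instead of stored, and the path of visited nodes is tracked for illustration):

```python
# Sketch of scalable key location: find_successor uses the finger table
# to roughly halve the remaining distance per step, giving O(log N) hops.
m = 6
node_ids = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(k):
    k %= 2**m
    return next((n for n in node_ids if n >= k), node_ids[0])

def fingers(n):
    return [successor(n + 2**(i - 1)) for i in range(1, m + 1)]

def between(x, a, b):
    """Is x in the open circular interval (a, b)?"""
    return a < x < b if a < b else x > a or x < b

def closest_preceding_node(n, key):
    for f in reversed(fingers(n)):           # scan fingers far to near
        if between(f, n, key):
            return f
    return n

def find_successor(n, key, path=()):
    succ = fingers(n)[0]                     # first finger = successor
    if between(key, n, succ) or key == succ:
        return succ, path + (n,)
    return find_successor(closest_preceding_node(n, key), key, path + (n,))

owner, path = find_successor(8, 54)
print(owner, path)   # 56 (8, 42, 51) -- the N8 -> N42 -> N51 -> N56 path
```

This reproduces the lookup-for-K54 example: N8 jumps +2^5 to N42, then +2^3 to N51, whose successor N56 holds the key.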
   Example: Lookup for Key 54 from node N8

       N8’s fingers:  +1 → N14, +2 → N14, +4 → N14, +8 → N21, +16 → N32, +32 → N42
       N42’s fingers: +1 → N48, +2 → N48, +4 → N48, +8 → N51, +16 → N1,  +32 → N14
       N51’s fingers: +1 → N56, +2 → N56, +4 → N56, +8 → N1,  +16 → N8,  +32 → N21

       54 - 8 = 46 = 101110two = 2^5 + 0 + 2^3 + 2^2 + 2^1 + 0
       N8 forwards the query to N42 (its +2^5 finger, chosen by
        closest_preceding_node(K54)), N42 forwards it to N51 (its +2^3
        finger), and N51 returns its successor N56
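A hedged sketch of the finger-based lookup, with the example's finger tables built by brute force (the helper names are illustrative):

```python
M = 6  # identifier bits (ring modulo 64), as in the example

def between(x, a, b):
    """True if x lies strictly inside the circular interval (a, b)."""
    a, b, x = a % 2**M, b % 2**M, x % 2**M
    return a < x < b if a < b else (x > a or x < b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.finger = []  # finger[i] = successor of (id + 2**i) mod 2**M

    def closest_preceding_node(self, key):
        # Scan fingers farthest-first for one inside (id, key).
        for f in reversed(self.finger):
            if between(f.id, self.id, key):
                return f
        return self

    def find_successor(self, key):
        # Each hop at least halves the remaining distance: O(log N) hops.
        node = self
        while not (key == node.successor.id
                   or between(key, node.id, node.successor.id)):
            node = node.closest_preceding_node(key)
        return node.successor

# Assemble the example ring and populate fingers by exhaustive search.
ids = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
ring = {i: Node(i) for i in ids}
def naive_successor(k):
    k %= 2**M
    return ring[min(ids, key=lambda n: (n - k) % 2**M)]
for n in ring.values():
    n.successor = naive_successor(n.id + 1)
    n.finger = [naive_successor(n.id + 2**i) for i in range(M)]
```

From N8, the query for key 54 is forwarded to N42, then to N51, whose successor N56 is returned, matching the path in the example.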
Chord Node Departure

   A node transfers its keys to its successor
    before it departs
   It also notifies its predecessor and successor
    before leaving
       Assume that departing node n sends its predecessor to its
        successor s, and the last node in its successor list
        to its predecessor p
       p removes n from its successor list, and adds
        the last node in n’s successor list to its own list
       s replaces its predecessor with n’s predecessor
Chord Stabilization

   Must ensure each
    node’s successor
    pointer is up to date
   Each node periodically runs a
    stabilization protocol in the
    background that updates
    Chord’s finger tables and
    successor pointers
       If the predecessor has failed, n
        accepts a new predecessor in notify
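The stabilize/notify pair can be sketched as below, following the pseudocode in the Chord paper; the `Node` fields and the join scenario are illustrative:

```python
M = 6  # identifier bits for the toy ring

def between(x, a, b):
    """True if x lies strictly inside the circular interval (a, b)."""
    a, b, x = a % 2**M, b % 2**M, x % 2**M
    return a < x < b if a < b else (x > a or x < b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

def stabilize(n):
    """Run periodically: adopt any node that slipped in between n and
    its successor, then tell the successor about n."""
    x = n.successor.predecessor
    if x is not None and between(x.id, n.id, n.successor.id):
        n.successor = x
    notify(n.successor, n)

def notify(s, n):
    """n claims to be s's predecessor; s accepts if it is an improvement."""
    if s.predecessor is None or between(n.id, s.predecessor.id, s.id):
        s.predecessor = n

# Scenario: N5 joins a ring consisting of N1 and N8.
n1, n8 = Node(1), Node(8)
n1.successor, n8.successor = n8, n1
n1.predecessor, n8.predecessor = n8, n1
n5 = Node(5)
n5.successor = n8          # learned via a lookup at join time
stabilize(n5)              # n8 accepts n5 as its predecessor
stabilize(n1)              # n1 adopts n5 as its successor
```

After one round of stabilization all successor and predecessor pointers around the new node are repaired.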
Chord Stabilization when Node Joins






Chord Failure and Replication

   To increase robustness, each node maintains a
    successor list containing its first r successors
       if successor fails, substitute the first live entry in its list
       r = Ω(logN) by theorems
   If a node fails during find_successor procedure, the
    lookup proceeds, after a timeout, by trying the next
    best predecessor among the nodes in the finger
    table and the successor list
   Store replicas of the data associated with a key at
    the k nodes succeeding the key
Chord Summary

   Chord provides
       Efficiency: O(logN) messages per lookup
       Scalability: O(logN) state per node
       Robustness: survives massive failures
   Chord consists of
       Consistent hashing
       Small routing tables: O(logN) entries
       Fast join/leave protocol

Pastry

   Pastry is a peer-to-peer content location and
    routing system based on a self-organizing
    overlay network of nodes
   A circular nodeId space
   Each node maintains some routing states
   Pastry takes into account network locality to
    reduce the routing latency
Pastry System Overview

   A 128-bit circular nodeId space ranging from 0 to 2^128 - 1
   Each node is randomly assigned a unique identifier
   NodeIds and keys are a sequence of digits with
    base 2b
   Each node maintains a routing table, neighborhood
    set and leaf set
   A node routes the message to the node with a
    nodeId that is numerically closest to the given key
The Routing Table R
   The routing table has log_(2^b) N
    rows with 2^b - 1 entries each
       The top row is row zero
   The entries at row n refer to
    nodes whose nodeIds share the
    present node’s id in the first n
    digits, but whose (n+1)th digit
    differs from the (n+1)th digit of
    the present node’s id
   Each entry contains the IP address
    of one node with the
    appropriate prefix
   Example: 16-bit nodeIds with
    base 4 (b=2), a node with
    nodeId 10233102

                  Routing table format: matched digits–column number–rest of ID
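The row/column rule above can be sketched as follows, treating nodeIds as base-4 digit strings for b = 2 (helper names are illustrative, and the key is assumed to differ from the nodeId):

```python
B = 2  # b = 2, so nodeIds and keys are strings of base-4 digits

def shl(a: str, b: str) -> int:
    """Length of the prefix shared by two ids (Pastry's shl)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def routing_cell(node_id: str, key: str):
    """The (row, column) of the routing-table entry used to route from
    node_id toward key: row = shared-prefix length, column = the key's
    next digit. Assumes key != node_id."""
    row = shl(node_id, key)
    return row, int(key[row], 2**B)
```

For the slide's node 10233102, a key sharing the prefix 10233 is served by row 5, in the column named by the key's sixth digit.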
The Neighborhood Set M

   The neighborhood set contains the nodeIds
    and IP addresses of the |M| closest nodes
    (according to the proximity metric)
       |M| is typically 2^b or 2 x 2^b
   The neighborhood set is not used in routing,
    but is used in maintaining locality properties
   Example: a node with nodeId 10233102
The Leaf Set L

   The leaf set contains the |L|/2 nodes with
    numerically closest larger nodeIds and the |L|/2
    nodes with numerically closest smaller nodeIds
       |L| is typically 2^b or 2 x 2^b
   The leaf set serves as a fallback for routing
   Example: a node with nodeId 10233102, |L| = 8
Pastry Routing

   When a message with key D arrives at a node with
    nodeId A, there are three cases
       If key D is within the range of the leaf set, the message is
        forwarded directly to the numerically closest node in the leaf set
       Else, search the routing table and forward the message to
        a node whose nodeId shares a prefix with the key that is at
        least one digit longer
       If no such node exists, forward the message to a node whose
        nodeId shares a prefix with the key at least as long as the
        current node’s, and is numerically closer to the key than the
        present node’s id
   The expected number of routing steps is O(logN),
    where N is the number of Pastry nodes
Pseudo Code for Pastry Routing

   (1) D is within the leaf set: deliver to the closest node in the leaf set
   (2) Forward the message to a closer node (better prefix match)
   (3) Forward towards a numerically closer node (not a better match)

   Notation
       D: message key
       L_i: ith closest nodeId in the leaf set
       shl(A, B): length of the prefix shared by nodes A and B
       R[j][i]: the (j, i)th entry of the routing table
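The three cases can be rendered as a forwarding function; this is a hedged sketch (leaf set as a list, routing table as a matrix with None for empty cells, all toy values made up), not Pastry's actual implementation:

```python
def shl(a, b):
    """Length of the prefix shared by ids a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def num(s):
    return int(s, 4)  # ids are base-4 digit strings (b = 2)

def next_hop(A, D, L, R):
    """Forwarding decision at node A for message key D."""
    # (1) D falls within the leaf set: deliver to the numerically
    #     closest node among the leaves and A itself.
    if L and min(map(num, L)) <= num(D) <= max(map(num, L)):
        return min(L + [A], key=lambda x: abs(num(x) - num(D)))
    # (2) Use the routing-table entry sharing one more digit with D.
    l = shl(A, D)
    if R[l][int(D[l], 4)] is not None:
        return R[l][int(D[l], 4)]
    # (3) Rare case: any known node with an equally long prefix that is
    #     numerically closer to D than A is.
    known = [x for row in R for x in row if x] + list(L)
    better = [x for x in known
              if shl(x, D) >= l and abs(num(x) - num(D)) < abs(num(A) - num(D))]
    return min(better, key=lambda x: abs(num(x) - num(D))) if better else A

# Toy state for node A = "1023": a 4x4 routing table with one entry,
# and a small leaf set (all values invented for illustration).
A = "1023"
R = [[None] * 4 for _ in range(4)]
R[2][3] = "1031"
L = ["1021", "1022", "1030"]
```

A key inside the leaf-set range is delivered in one hop (case 1); otherwise the routing table supplies a node with a strictly longer shared prefix (case 2).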
Pastry Routing Example
Pastry Node Join
       A nearby Pastry node A (according to the proximity metric) is
        located by using expanding ring IP multicast or out-of-band means
       A new node with nodeId X asks A to route a join message to the
        existing node Z whose id is numerically closest to X
            A forwards the join message via B, C, …, to Z
      X initializes its own state tables
         Obtains the ith row of its routing table from the ith node encountered
          along the route from A to Z
               X’s Row 0 = A’s row 0 (X0 = A0)
               X’s Row 1 = B’s row 1 (X1 = B1)
               …etc.
          A’s neighborhood set is the basis for X’s
          Z’s leaf set is the basis for X’s
      X transmits its resulting state to nodes that need to be aware of
       its arrival, and the state of all affected nodes is updated
Pastry Maintenance (1/2)

   Nodes in the Pastry network may fail or
    depart without warning
   A Pastry node is considered failed when its
    immediate neighbors can no longer
    communicate with it
   To replace a failed node in the leaf set
       Contact the live node with the largest index on the
        side of the failed node, and ask for that node’s
        leaf set
Pastry Maintenance (2/2)

   To replace a failed node in the neighborhood set
       Ask other members for their neighborhood sets,
        check the distance of each discovered node,
        and update the neighborhood set accordingly
   To repair a failed routing table entry
       Contact a node referred to by another entry of
        the same row, and ask for that node’s entry
       If no live node exists in the same row, contact an entry
        in the next row
Repair a Failed Routing Table Entry
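A minimal sketch of this repair rule, where `query` stands in for the RPC that asks a live peer for one of its routing-table entries (all names hypothetical):

```python
def repair_entry(R, row, col, query):
    """Replace the failed entry R[row][col] by asking peers in the same
    row, then in later rows, for their (row, col) entry.
    query(peer, r, c) is a stand-in RPC returning a nodeId or None."""
    for r in range(row, len(R)):
        for c, peer in enumerate(R[r]):
            if peer is None or (r == row and c == col):
                continue  # skip empty cells and the failed entry itself
            candidate = query(peer, row, col)
            if candidate is not None:
                R[row][col] = candidate
                return candidate
    return None  # no peer knew a replacement

# Toy repair: the failed cell is (1, 1); peer "a" has no entry for it,
# peer "b" does.
R = [[None] * 4 for _ in range(3)]
R[1] = ["a", None, "b", "c"]
replies = {"a": None, "b": "x", "c": "y"}
fixed = repair_entry(R, 1, 1, lambda peer, r, c: replies[peer])
```

Peers in the same row share the same prefix as the failed entry, so their (row, col) entries are valid replacements, which is why the search starts there.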
Routing Performance

   [Figure: |L|=16, b=4, |M|=32, 200,000 lookups]

 Source: Rowstron & Druschel, 2001
                                     CMPT 880: P2P Systems - SFU

Pastry Routing

   Source: Rowstron & Druschel, 2001

Routing with Failures

 Source: Rowstron & Druschel, 2001

Pastry Locality

   [Figure: |L|=16, b=4, |M|=32, 200,000 lookups]
  Source: Rowstron & Druschel, 2001
Pastry Summary

   Pastry routes to any node in the overlay
    network in O(logN) steps
   Pastry maintains routing tables with O(logN)
    entries
   Several applications have been built on top of Pastry
       PAST uses the fileId as the key
       SCRIBE uses the topicId as the key
[1] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Schenker,
    “A Scalable Content-Addressable network,” in Proc. SIGCOMM, San
    Diego, CA, Aug. 2001, pp. 161-172.
[2] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan,
    “Chord: A Scalable Peer-to-Peer Lookup Service for Internet
    Applications,” in Proc. SIGCOMM, San Diego, CA, Aug. 2001, pp. 149-160.
[3] A. Rowstron and P. Druschel, “Pastry: Scalable, Distributed Object
    Location and Routing for Large-Scale Peer-to-Peer Systems,” in
    Proc. Middleware, Heidelberg, Germany, Nov. 2001, pp. 329-350.
       Part II-4
      Hybrid P2P
Professor Shiao-Li Tsao
Dept. Computer Science
National Chiao Tung University
Structella (1/2)

   Two types of peer-to-peer overlays
     Unstructured overlays support complex queries, but
       discovery is inefficient
     Structured overlays provide efficient discovery, but
       don’t support complex queries
   Castro et al. proposed Structella to build Gnutella on a
    structured overlay
     Use Pastry’s overlay construction and maintenance algorithms
     Retain the content placement and discovery
       mechanisms of Gnutella to support complex queries
Structella (2/2)

   Maintenance overhead
       Compare Structella and
        the optimized Gnutella
       It is commonly believed
        that unstructured
        overlays have lower
        maintenance overhead
        than structured overlays
       In contrast, Structella
        exploits overlay
        structure to reduce
        maintenance overhead
Supernode Model (1/3)

   Each peer is either designated as a supernode or
    assigned to a supernode
   A supernode acts both as a local central index for files
    shared by local peers and as an equal in a network
    of supernodes
   Peers connect to their local supernode to upload
    information about the files they share, and to
    perform searches
   Supernodes are equal in search while all peers are
    equal in download
   Examples: FastTrack and Kazaa
Supernode Model (2/3)


                           peer node
Supernode Model (3/3)

   Benefits
       No single point of failure
   Drawbacks
       A supernode may become overloaded
       Copyright infringement
FastTrack

   A proprietary and encrypted protocol used by the Kazaa, Grokster, and
    iMesh file-sharing programs
   Presumed architecture
      Supernodes index data, provide search capabilities for a set of
       peers, and forward queries to other superpeers
      A central super-superpeer

   Employs the UUHash hashing algorithm to allow downloading from
    multiple sources
      The RIAA took advantage of UUHash (which hashes files very quickly,
       but covers only part of the file) to spread false files on the network
   Reverse engineering clients
      giFT-FastTrack, iMesh, Grokster, Kazaa, etc. connect to the
       decentralized network
UUHash

   Hash the first 300 kilobytes using MD5
   Then apply a smallhash function (identical to the CRC32 checksum used by
    PNG) to 300 KB blocks at file offsets of 2^n MB.
       offset 1 MB, 300 KB hashed
       offset 2 MB, 300 KB hashed
       offset 4 MB, 300 KB hashed
       offset 8 MB, 300 KB hashed
       ...
       last 300 KB of file hashed
   Finally the last 300 KB of the file are hashed. If the last 300 KB of the file
    overlap with the last block of the 2^n sequence, that block is ignored in favor
    of the file-end block.
   The actual hash used on the FastTrack network is the concatenation of the
    128-bit MD5 of the first 300 KB of the file and the sparse 32-bit smallhash
    calculated as described above. The resulting 160 bits, when
    encoded using Base64, become the UUHash.
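A hedged sketch of the scheme described above; `zlib.crc32` is the same CRC-32 used by PNG, but the seed, byte order, and exact overlap-skipping rule here are assumptions rather than a verified reimplementation of the FastTrack wire format:

```python
import base64
import hashlib
import zlib

CHUNK = 300 * 1024   # 300 KB
MB = 1024 * 1024

def uuhash(data: bytes) -> str:
    """Sketch of the UUHash construction described above; seed, byte
    order, and the overlap rule are assumptions."""
    md5 = hashlib.md5(data[:CHUNK]).digest()
    crc = 0
    offset = MB
    last_start = max(len(data) - CHUNK, 0)
    # smallhash 300 KB blocks at offsets 1 MB, 2 MB, 4 MB, ...,
    # skipping any block that would overlap the final 300 KB.
    while offset + CHUNK <= last_start:
        crc = zlib.crc32(data[offset:offset + CHUNK], crc)
        offset *= 2
    crc = zlib.crc32(data[last_start:], crc)  # the file-end block
    # 128-bit MD5 + 32-bit smallhash = 160 bits, then Base64.
    return base64.b64encode(md5 + crc.to_bytes(4, "little")).decode()
```

Because only the first 300 KB plus a sparse set of blocks are hashed, most of a large file never influences the digest, which is what made UUHash both fast and easy to spoof.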
Kazaa: A Brief History

   Both Kazaa and the FastTrack protocol are the
    brainchild of Niklas Zennström and Janus Friis and
    were introduced in March 2001 by their Dutch
    company Consumer Empowerment
   When the owners were sued in the Netherlands in
    2001, Consumer Empowerment sold the Kazaa
    application to Sharman Networks, headquartered in
    Australia and incorporated in Vanuatu
   The client application is called Kazaa Media Desktop
Kazaa System Overview
   Kazaa is proprietary
      Few details are publicly available

      Traffic on the control channel is encrypted
      Traffic on the download channel uses unencrypted HTTP transfers

   Two-tier hierarchical system with two classes of peers
      Ordinary Node (ON)

      Super Node (SN): mini Napster-like hub

   Each supernode knows where the other supernodes are
   List of potential supernodes included within software download
   Query
      Node first sends query to supernode

      Supernode may forward query to a subset of supernodes
Kazaa Measurement
   A measurement study in 2003
   There are roughly 30,000 SNs
   The SNs form the backbone of the Kazaa network, which is self-
    organizing and is managed with a distributed, but proprietary, gossip protocol
   The average supernode lifetime is about 2.5 hours
   Each supernode maintains a list of SNs
   The SNs frequently exchange (possibly subsets of) these lists with each other
   SNs establish both short-lived and long-lived TCP connections with
    each other
   Each SN has about 100 to 200 children ONs at any given time
   Each SN maintains a database, storing the metadata of the files its
    children are sharing
   SNs do not exchange metadata with each other
Kazaa Summary

   The supernodes form the backbone of the
    Kazaa network
   Combine the efficiency of a centralized
    search with the autonomy, load balancing
    and robustness to attacks
   No central point of failure
   Copyright infringement

[1] J. Liang, R. Kumar and K.W. Ross, “Understanding KaZaA,”
    submitted, 2004.
[2] Miguel Castro, Manuel Costa, and Antony Rowstron, “Should We
    Build Gnutella on a Structured Overlay?” 2nd Workshop on Hot
    Topics in Networks (HotNets-II), November 2003.
      Part II-5
  Hierarchical P2P
Professor Shiao-Li Tsao
Dept. Computer Science
National Chiao Tung University

   Introduction
   Superpeer network
   Hierarchical P2P system

   Motivation
       Traditional P2P systems organize peers into a flat overlay
       However, peers differ in up times, bandwidth connectivity,
        and CPU power
       Large-scale P2P systems should exploit the heterogeneity
        of peers
       To exploit the heterogeneity, peers should be organized in
        a hierarchy
   We introduce some hierarchical P2P systems
       Superpeer network
       HIERAS: A DHT based hierarchical P2P routing algorithm
       Hierarchical peer-to-peer system
Superpeer Network
Superpeer Network (1/3)

   A P2P network consisting of superpeers and
    their clients
   Superpeer
       A node that acts both as a server to a set of clients
        and as an equal in a network of superpeers
       Keeps an index of its clients’ data
Superpeer Network (2/3)

   Superpeer redundancy
       A superpeer is k-redundant if there are k nodes sharing
        the super-peer load
       Superpeer redundancy improves both reliability and
        performance of the superpeer network, and reduces
        individual load

No redundancy                             2-redundancy
Superpeer Network (3/3)

   Layer management
       Superpeer P2P systems divide peers into two layers:
        superlayer and leaf-layer
       The lack of appropriate layer-size-ratio maintenance
            The ratio of the number of leaf-peers to the number of superpeers
       Xiao et al. propose a dynamic layer management algorithm
        (DLM) to adaptively adjust peers between superlayer and leaf-layer
            Step 1: Information collection
            Step 2: Maintain appropriate layer-size-ratio
            Step 3: Scaled comparisons of metric values of peers
            Step 4: Promotion or demotion
Hierarchical P2P Routing

   In current DHT based routing algorithms, a
    routing hop could happen between two peers
    with long network link latency
   A multi-layer DHT based P2P routing
    algorithm called HIERAS
       Combines a hierarchical structure with current
        DHT based routing algorithms
       Keeps the scalability property of current DHT
        algorithms, i.e., routing finishes within O(logN) steps
HIERAS Hierarchical Architecture (1/2)

   Hierarchical P2P layers
       Many P2P rings coexist in different layers
       In each P2P layer, all the peers are grouped into several
        disjoint P2P rings
           The biggest P2P ring, in the highest layer, contains all the peers
           Topologically adjacent peers are grouped into other P2P rings
            in lower layers
       A peer must belong to one P2P ring in each layer
           If the hierarchy depth is k, a peer belongs to k P2P rings,
            one in each layer
       The average link latency between two peers in lower level rings is
        much smaller than in higher level rings

HIERAS System Example
                   3 layer-2 P2P rings: P1, P2 and P3

P is the biggest
layer-1 ring

                             Node A is a member of P3
                             and also belongs to P

HIERAS Hierarchical Architecture (2/2)

   The distributed binning scheme is used to
    determine to which rings a new node should
    be added
       The scheme is a topology measurement
       A well-known set of machines serve as the landmarks
       Nodes partition into bins such that nodes that fall
        within a given bin are relatively closer to each
        other in terms of network latency (ping)

Level 0 for latencies in [0, 20] ms         The order information is created
Level 1 for latencies in (20, 100] ms       according to the measured latencies
Level 2 for latencies larger than 100 ms    to the 4 landmarks L1, L2, L3, and L4
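The quantization step can be sketched directly from the slide's thresholds (the function name and the string encoding of the bin are illustrative):

```python
def bin_levels(latencies_ms):
    """Map each measured landmark RTT to a level (thresholds from the
    slide: level 0 for [0, 20] ms, level 1 for (20, 100] ms, level 2
    above 100 ms) and join them into a bin/ring name like "012"."""
    def level(rtt):
        if rtt <= 20:
            return 0
        if rtt <= 100:
            return 1
        return 2
    return "".join(str(level(rtt)) for rtt in latencies_ms)
```

Two nodes measuring, say, [12, 45, 250] ms and [18, 90, 150] ms to the same three landmarks both land in bin "012", so topologically close nodes end up sharing a ring name.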

HIERAS System Design (1/4)

   HIERAS is a multi-layer DHT based routing algorithm
   HIERAS is built on top of an existing DHT
    routing algorithm
       Each node or file is given a unique identifier
        generated by using a hash algorithm
       Use the Chord algorithm as the underlying routing algorithm
       Assume that the identifier length is k bits and the
        hierarchy depth is m
HIERAS System Design (2/4)

   Finger table
       Each node has a finger table with at most k entries
       The highest layer finger table in HIERAS is the
        same as the Chord finger table
       Each node in HIERAS creates m-1 other finger
        tables for the lower layer P2P rings it belongs to
           Only the peers within the corresponding lower layer ring
            can be put into this finger table
       Example: a two-layer system with 3 landmarks and a 2^8
        identifier space
HIERAS System Design (3/4)

    The layer-1 successor nodes can   All layer-2 successor nodes
    be chosen from all system peers   belong to layer-2 P2P ring
                                      “012” as node 121

HIERAS System Design (4/4)
   Landmark table
       Records the IP addresses of all landmark nodes
   Ring table
       Used to maintain information about different P2P rings and to find a node in
        a particular P2P ring when a new node is added to the system
       Stored on the node whose nodeid is numerically closest to the ringid
       Duplicated on several nodes for fault tolerance
       The ringname is defined by the landmark order information (such as “012”);
        the ringid is generated by hashing the ringname

HIERAS Routing Algorithm

   The higher the layer, the more nodes are
    included, and the closer the routing gets to the destination
   In an m-layer HIERAS system, a routing
    procedure has m loops
       At the lowest layer, route within the ring the request
        originator belongs to, to the node closest to the destination
       Move up a layer, do the same thing, and repeat
       Eventually reach the biggest P2P ring and the
        destination node, and the location information of
        the requested file is returned to the originator
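The m loops above can be modeled with a toy in which each ring is a plain id list and routing within a ring is a single step (a real deployment would run Chord-style lookups inside each ring; all names and the layout are illustrative):

```python
SPACE = 2 ** 8  # toy 8-bit identifier space

def closest_preceding(ring, key):
    """Member of `ring` whose id most closely precedes `key` clockwise."""
    return min(ring, key=lambda n: (key - n) % SPACE)

def successor(ring, key):
    """Member of `ring` responsible for `key` (its successor)."""
    return min(ring, key=lambda n: (n - key) % SPACE)

def hieras_route(layers, key):
    """layers[0] is the global ring containing every peer; layers[-1] is
    the originator's smallest, topologically closest ring. Route bottom-
    up, finishing with the key's successor on the global ring."""
    path = [closest_preceding(ring, key) for ring in reversed(layers[1:])]
    path.append(successor(layers[0], key))
    return path
```

The early hops stay inside small, low-latency rings, so only the final hops pay wide-area latency, which is the source of the latency savings in the example that follows.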

HIERAS Routing Example
Destination node
                                            Request originator

                                                   Layer 1 ring
                                                   100ms latency each hop

                                                    3 layer-2 rings
                                                    25ms latency each hop

   Assume 4 routing hops
   Original Chord needs 4 x 100ms = 400ms, but HIERAS needs 25ms + 3 x 100ms = 325ms
HIERAS Node Join (1/2)

   When a new node n joins, it sends a join
    message to a nearby node n’
   n gets information about the landmarks from n’ and
    creates its landmark table
   n uses the distributed binning scheme to
    determine the suitable P2P rings (ringnames) it
    should join
   Creates the highest layer finger table
       Learns fingers by asking n’ to look them up in the whole P2P
        overlay network

HIERAS Node Join (2/2)
   Create finger tables in lower layers
       Assume that node c stores the ring table of a specific ring and
        node p is in that particular ring
       n calculates the ringid of this ring and sends a ring table request
        message to c using the highest layer finger table
       c responds with the stored ring table, so n learns several
        nodes in this ring
       n sends a finger table creation request to p
       p updates its own table and creates the finger table for n, then sends
        it back to n
       n compares its nodeid with the nodeids in the ring table
           If n should replace one of them, it sends a ring table modification
            message back to c and c modifies the ring table
       The above procedure repeats m times in an m-layer HIERAS system

HIERAS Maintenance

   A node may leave or fail silently
   As in Chord algorithm, a node keeps a
    successor list of its r nearest successors in
    each layer
   The maintenance overhead is affordable
    because the nodes within the low layer rings
    are topologically closer

HIERAS Performance Evaluation (1/3)
                Both have good scalability.
 HIERAS Performance Evaluation (2/3)

The average routing latency per hop in HIERAS is much smaller than in Chord
HIERAS Performance Evaluation (3/3)

 HIERAS spends 1.55% more hops per request than Chord on average, but the
 average routing latency in HIERAS is only 54.07% of Chord’s

 Only 1.887% of hops per request are taken in the highest layer
HIERAS Summary

   HIERAS creates many small P2P rings in
    different layers by grouping topologically
    adjacent nodes together
   The hierarchical structure improves the
    routing performance of current DHT routing
    algorithms

Hierarchical P2P System
Hierarchical P2P System (1/2)

   The Internet uses hierarchical routing, which has several
    benefits such as scalability and administrative autonomy
   In contrast, Chord, CAN, Pastry and Tapestry are all
    flat DHT designs without hierarchical routing
   Garces-Erice et al. present a framework for
    hierarchical DHTs
       Peers are organized into groups
       Each group has its own autonomous intra-group overlay
        network and lookup service
       A top-level overlay is defined among the groups
Hierarchical P2P System (2/2)

   Several advantages compared to a flat overlay
       Exploiting heterogeneous peers: more stable peers in the
        top-level overlay
       Transparency: when a key moves within a group, only the
        intra-group lookup changes
       Faster lookup time: the number of groups << the total number
        of peers
       Fewer messages in the wide area: most overlay
        reconstruction messages happen inside groups
Hierarchical Framework (1/6)
   Focus on a two-tier hierarchy
   The peers are organized into groups
     Peers in the same group are topologically close

   The groups are organized into a top-level overlay network
   Each group has one or more superpeers
   The superpeers act as gateways and are used for inter-group queries
   Within each group there is an overlay network that is used for
    query communication among peers in the group
     Each of the groups operates autonomously from the other groups

   The peers send lookup query messages to each other using a
    hierarchical overlay network
Hierarchical Framework (2/6)
Hierarchical Framework (3/6)

   Hierarchy and group management
       The joining peer p which belongs to group g contacts another
        peer p’ to lookup p’s group using key g
            If the group id of the returned superpeer is g, p joins the group and
             notifies the superpeers of its capability
            If not g, a new group g is created and p is the only superpeer
       Superpeers are the most stable and powerful group nodes
       Superpeer keep an ordered list of the superpeer candidates
            This list is sent periodically to the regular peers of the group
       When a superpeer fails or disconnects
            The first regular peer in the list becomes superpeer, joins the top-
             level overlay, and informs all peers in its group and the superpeers
             of the neighboring groups
Hierarchical Framework (4/6)

   Hierarchical lookup service
       The querying peer sends
        query message to one of
        the superpeers in its group
       The top-level overlay first
        determines the group
        responsible for the key
       The responsible group then
        uses its intra-group overlay
        to determine the specific
        peer that is responsible for
        the key
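The two-step lookup can be sketched with both tiers modeled as successor-on-a-circle overlays (the hash truncation, the names, and the group layout are all illustrative):

```python
import hashlib

SPACE = 2 ** 16

def h(name: str) -> int:
    """Hash a name onto the toy identifier circle."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest()[:2], "big")

def successor(ids, point):
    """The id responsible for `point` (its clockwise successor)."""
    return min(ids, key=lambda i: (i - point) % SPACE)

def lookup(key, groups):
    """groups maps a group id to the ids of its member peers. Step 1:
    the top-level overlay finds the responsible group; step 2: that
    group's intra-group overlay finds the responsible peer."""
    point = h(key)
    gid = successor(list(groups), point)
    return gid, successor(groups[gid], point)
```

Only the first step crosses the wide area; the second resolves inside a single group, which is where the latency and message savings come from.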
Hierarchical Framework (5/6)

   Intra-group lookup
       At the intra-group level, the groups can use
        different overlays
       For a small group, the intra-group lookup can take
        O(1) steps when CARP is used within the group
       For a large group, the intra-group lookup takes
        O(log M) steps when a DHT overlay structure is used
        within the group, where M is the number of peers
        in the group
Hierarchical Framework (6/6)

   Cooperative caching
       Expect peers in the same group to be topologically close and
        to be interconnected by high-speed links
       To reduce file transfer delays, files are cached in the
        groups where they have been previously requested
           When peer p wants to obtain a file with key k, it first uses intra-
            group lookup to find the peer p’ that would be responsible for k
           If p’ has a local copy, it returns the file to p
           Otherwise, p’ uses the hierarchical DHT to obtain the file,
            caches a copy, and forwards the file to p
Chord Instantiation (1/3)

   A particular instantiation of a two-tier
    hierarchy that uses Chord for the top-level
   Regular Chord ring
       Each peer and each key has a m-bit id, ids are
        ordered on a circle modulo 2m
       Key is assigned to the successor
       The successor, predecessor, and fingers make up
        the finger table
Chord Instantiation (2/3)
   Inter-group Chord ring
       In the top-level overlay network,
        each node is a group of peers
       Each node in the top-level Chord
        ring has predecessor and
        successor vectors which hold the
        IP addresses of superpeers
       Each finger is also a vector
       When the superpeers of a group
        change, they immediately
        update the vectors of the
        predecessor and successor
       However, fingers are updated
        lazily, when invalid entries
        are detected
       To route a request to a group,
        choose a random superpeer
        from the vector and forward to it

Chord Instantiation (3/3)

                     P peers, I groups
                     a stable peer fails with probability ps
                     an unstable peer fails with probability pr
Hierarchical P2P System: Summary

   The hierarchical organization improves
    overall system scalability and offers various
    advantages over a flat organization
   By gathering peers into groups based on
    topological proximity, it generates fewer
    messages and can significantly improve
    lookup performance
[1] B. Yang and H. Garcia-Molina, “Designing a super-peer network,”
    in Proceedings of 19th IEEE International Conference on Data
    Engineering (ICDE’03), pp. 49 - 60, 5-8 March 2003.
[3] Li Xiao, Z. Zhuang, and Liu Yunhao, “Dynamic Layer
    Management in Superpeer Architectures,” Transactions on
    Parallel and Distributed Systems, vol.16, no.11, pp. 1078-1091,
    Nov. 2005.
[4] Zhiyong Xu, Rui Min and Yiming Hu, “HIERAS: A DHT Based
    Hierarchical P2P Routing Algorithm,” in Proceedings of the 2003
    International Conference on Parallel Processing (ICPP’03), 2003.
[5] L. Garces-Erice, E.W. Biersack, P.A. Felber, K.W. Ross, and G.
    Urvoy-Keller, “Hierarchical Peer-to-peer Systems,” in
    Proceedings of ACM/IFIP International Conference on Parallel
    and Distributed Computing (Euro-Par 2003), August 26-29, 2003.
