					       Distributed k-ary System
     Algorithms for Distributed Hash Tables




                           Ali Ghodsi
                          aligh@kth.se
                  http://www.sics.se/~ali/thesis/

PhD Defense, 7th December 2006, KTH/Royal Institute of Technology   1
Presentation Overview

• Gentle introduction to DHTs
• Contributions
• The future




                                3
What’s a Distributed Hash Table (DHT)?
• An ordinary hash table, which is distributed
     Key        Value
     Alexander Berlin
     Ali        Stockholm
     Marina     Gothenburg
     Peter      Louvain la neuve
     Seif       Stockholm
     Stefan     Stockholm

• Every node provides a lookup operation
    •Provide the value associated with a key

• Nodes keep routing pointers
    •If item not found, route to another node
                                                  4
So what?

• Characteristic properties
   • Scalability
      • Number of nodes can be huge
      • Number of items can be huge
   • Self-manage in presence of joins/leaves/failures
      • Routing information
      • Data items

• Scalability in number of nodes
   • Time to find data is logarithmic
   • Size of routing tables is logarithmic
   • Example: log2(1000000)≈20
   • EFFICIENT!

• Scalability in number of items
   • Store number of items proportional to number of nodes
   • Typically: with D items and n nodes
      • Store D/n items per node
      • Move D/n items when nodes join/leave/fail
   • EFFICIENT!

• Self-management of routing info
   • Ensure routing information is up-to-date

• Self-management of items
   • Ensure that data is always replicated and available




                                                                 5
Presentation Overview


•…
•…
 • What’s been the general motivation for DHTs?
•…
•…



                                                  6
Traditional Motivation (1/2)
• Peer-to-Peer filesharing very popular

• Napster (figure: central index)
  • Completely centralized
  • Central server knows who has what
  • Judicial problems

• Gnutella (figure: decentralized index)
  • Completely decentralized
  • Ask everyone you know to find data
  • Very inefficient
                                                               7
Traditional Motivation (2/2)
• Grand vision of DHTs
   • Provide efficient file sharing

• Quote from Chord: ”In particular, [Chord] can help
  avoid single points of failure or control that systems
  like Napster possess, and the lack of scalability that
  systems like Gnutella display because of their
  widespread use of broadcasts.” [Stoica et al. 2001]

• Hidden assumptions
   •   Millions of unreliable nodes
   •   User can switch off computer any time (leave=failure)
   •   Extreme dynamism (nodes joining/leaving/failing)
   •   Heterogeneity of computers and latencies
   •   Untrusted nodes

                                                               8
Our philosophy
• DHT is a useful data structure

• Assumptions might not be true
  • Moderate amount of dynamism
  • Leave not same thing as failure

• Dedicated servers
  • Nodes can be trusted
  • Less heterogeneity

• Our goal is to achieve more given stronger
  assumptions
                                               9
Presentation Overview


•…
•…
 • How to construct a DHT?
•…
•…



                             10
How to construct a DHT (Chord)?
• Use a logical name space, called the identifier
  space, consisting of identifiers {0,1,2,…, N-1}

• Identifier space is a logical ring modulo N

• Every node picks a random identifier

• Example:
   • Space N=16 {0,…,15}
   • Five nodes a, b, c, d, e
       •   a picks 6
       •   b picks 5
       •   c picks 0
       •   d picks 11
       •   e picks 2

(Figure: identifier ring with positions 0 to 15 arranged clockwise)
                                                                     11
Definition of Successor

• The successor of an identifier is the first node met going in
  clockwise direction starting at the identifier

• Example
  • succ(12)=14
  • succ(15)=2
  • succ(6)=6

(Figure: identifier ring with positions 0 to 15 arranged clockwise)
                                                                    12
Where to store data (Chord)?
• Use globally known hash function, H
• Each item <key,value> gets identifier H(key)
• Store each item at its successor
   • Node n is responsible for item k

     Key        Value
     Alexander  Berlin
     Marina     Gothenburg
     Peter      Louvain la neuve
     Seif       Stockholm
     Stefan     Stockholm

• Example
   • H(“Marina”)=12
   • H(“Peter”)=2
   • H(“Seif”)=9
   • H(“Stefan”)=14

• Store number of items proportional to number of nodes
   • Typically: with D items and n nodes
      • Store D/n items per node
      • Move D/n items when nodes join/leave/fail
   • EFFICIENT!

(Figure: identifier ring with positions 0 to 15 arranged clockwise)
                                                                    13
Where to point (Chord) ?
• Each node points to its successor
   • The successor of a node n is succ(n+1)
   • Known as a node’s succ pointer

• Each node points to its predecessor
   • First node met in anti-clockwise direction starting at n-1
   • Known as a node’s pred pointer

• Example
   • 0’s successor is succ(1)=2
   • 2’s successor is succ(3)=5
   • 5’s successor is succ(6)=6
   • 6’s successor is succ(7)=11
   • 11’s successor is succ(12)=0

(Figure: identifier ring with positions 0 to 15 arranged clockwise)
                                                                    14
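
The succ and pred pointers of this example can be reproduced with a few lines of Python. This is only a sketch: it assumes the node set {0, 2, 5, 6, 11} implied by the example, keeps the whole ring in one sorted list (a real node only knows its own pointers), and the names `succ`, `pred`, `succ_pointer` and `pred_pointer` are chosen here for illustration.

```python
from bisect import bisect_left

N = 16                             # size of the identifier space {0, ..., N-1}
NODES = sorted([0, 2, 5, 6, 11])   # node identifiers assumed from the example

def succ(ident):
    """First node met going clockwise, starting at (and including) ident."""
    i = bisect_left(NODES, ident % N)   # index of first node with id >= ident
    return NODES[i % len(NODES)]        # wrap around past the highest node

def pred(ident):
    """First node met going anti-clockwise, starting at (and including) ident."""
    i = bisect_left(NODES, ident % N + 1) - 1   # index of last node with id <= ident
    return NODES[i]                             # index -1 wraps to the highest node

# Each node n points to succ(n+1) and to the first node anti-clockwise from n-1.
succ_pointer = {n: succ(n + 1) for n in NODES}
pred_pointer = {n: pred(n - 1) for n in NODES}
```

With these definitions, `succ_pointer` maps 0 to 2, 2 to 5, 5 to 6, 6 to 11 and 11 to 0, matching the example.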
DHT Lookup

     Key        Value
     Alexander  Berlin
     Marina     Gothenburg
     Peter      Louvain la neuve
     Seif       Stockholm
     Stefan     Stockholm

• To lookup a key k
   • Calculate H(k)
   • Follow succ pointers until item k is found

• Example
   • Lookup ”Seif” at node 2
   • H(”Seif”)=9
   • Traverse nodes:
       • 2, 5, 6, 11 (BINGO)
   • Return “Stockholm” to initiator

(Figure: identifier ring with positions 0 to 15 arranged clockwise)
                                                                    15
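
The lookup example can be traced in code. The sketch below assumes the succ pointers of the five-node example ring and hard-codes H(”Seif”)=9 as given on the slide; the dictionaries `SUCC`, `PRED`, `STORE` and the helper `is_responsible` are illustrative names, not part of the original material.

```python
N = 16
SUCC = {0: 2, 2: 5, 5: 6, 6: 11, 11: 0}      # succ pointers of the example ring
PRED = {s: p for p, s in SUCC.items()}       # pred pointers derived from them
H = {"Seif": 9}                              # hash value as given in the example
STORE = {11: {"Seif": "Stockholm"}}          # items per node (only "Seif" shown)

def is_responsible(node, ident):
    """True if ident lies in the ring interval (pred(node), node]."""
    span = (node - PRED[node]) % N or N      # a lone node owns the whole ring
    dist = (node - ident) % N                # clockwise distance from ident to node
    return dist < span

def lookup(start, key):
    """Follow succ pointers from start until the responsible node is reached."""
    ident = H[key]
    node, path = start, [start]
    while not is_responsible(node, ident):
        node = SUCC[node]
        path.append(node)
    return STORE.get(node, {}).get(key), path

value, path = lookup(2, "Seif")
```

Here `path` is the list of traversed nodes 2, 5, 6, 11 and `value` is "Stockholm".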
Speeding up lookups
• If only pointer to succ(n+1) is used
   • Worst case lookup time is N, for N nodes

• Improving lookup time
   • Point to succ(n+1)
   • Point to succ(n+2)
   • Point to succ(n+4)
   • Point to succ(n+8)
   • …
   • Point to succ(n+2^M)

• Distance always halved to the destination
   • Time to find data is logarithmic
   • Size of routing tables is logarithmic
   • Example: log2(1000000)≈20
   • EFFICIENT!

(Figure: identifier ring with positions 0 to 15 arranged clockwise)
                                                                    16
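
A minimal sketch of lookups with the additional pointers, assuming the five-node example ring (N=16, nodes 0, 2, 5, 6, 11) and pointers at offsets 1, 2, 4, 8. The routing rule used here (take the farthest pointer that does not pass the target) is the usual Chord-style rule; the function names are illustrative.

```python
from bisect import bisect_left

N = 16
NODES = sorted([0, 2, 5, 6, 11])   # assumed example ring

def succ(ident):
    i = bisect_left(NODES, ident % N)
    return NODES[i % len(NODES)]

def fingers(n):
    """Pointers succ(n+1), succ(n+2), succ(n+4), ... for offsets below N."""
    out, step = [], 1
    while step < N:
        out.append(succ(n + step))
        step *= 2
    return out

def in_open(x, a, b):
    """x in the ring interval (a, b); (a, a) means every identifier except a."""
    return 0 < (x - a) % N < ((b - a) % N or N)

def lookup(start, ident):
    """Return (responsible node, nodes visited before reaching it)."""
    node, path = start, [start]
    while True:
        s = succ(node + 1)
        if ident == s or in_open(ident, node, s):
            return s, path               # node's successor is responsible
        # farthest pointer that lies strictly between node and the target
        node = next((f for f in reversed(fingers(node)) if in_open(f, node, ident)), s)
        path.append(node)
```

Looking up identifier 9 from node 2 now visits only nodes 2 and 6 before reaching the responsible node 11, instead of walking 2, 5, 6, 11.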
Dealing with failures
• Each node keeps a successor-list
   • Pointer to f closest successors
      •   succ(n+1)
      •   succ(succ(n+1)+1)
      •   succ(succ(succ(n+1)+1)+1)
      •   ...

• If successor fails
   • Replace with closest alive successor

• If predecessor fails
   • Set pred to nil

(Figure: identifier ring with positions 0 to 15 arranged clockwise)

                                                                           17
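
The successor-list and its use on failure can be sketched as follows, again assuming the five-node example ring. How a node learns that a successor has failed is outside this sketch; the set `failed` simply stands in for a failure detector, and the function names are illustrative.

```python
from bisect import bisect_left

N = 16
NODES = sorted([0, 2, 5, 6, 11])   # assumed example ring

def succ(ident):
    i = bisect_left(NODES, ident % N)
    return NODES[i % len(NODES)]

def successor_list(n, f):
    """succ(n+1), succ(succ(n+1)+1), ... : the f closest successors of n."""
    out, cur = [], n
    for _ in range(f):
        cur = succ(cur + 1)
        out.append(cur)
    return out

def repair_successor(succ_list, failed):
    """Closest successor not known to have failed, or None if all have failed."""
    return next((s for s in succ_list if s not in failed), None)

slist = successor_list(2, 3)                    # successor-list of node 2 with f=3
new_succ = repair_successor(slist, failed={5})  # node 5 has crashed
```

The `None` case corresponds to more failures than the successor-list can absorb.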
    Handling Dynamism

•    Periodic stabilization used to make pointers
     eventually correct
    •   Try pointing succ to closest alive successor

    •   Try pointing pred to closest alive predecessor




                                                         18
Presentation Overview

• Gentle introduction to DHTs
• Contributions
• The future




                                19
Outline


•…
•…
 • Lookup consistency
•…
•…




                        20
Problems with periodic stabilization

• Joins and leaves can result in
  inconsistent lookup results
  • At node 12, lookup(14)=14
  • At node 10, lookup(14)=15




(Figure: ring segment with nodes 10, 12, 14, 15)




                                       21
Problems with periodic stabilization

• Leaves can result in routing failures




(Figure: ring segment with nodes 10, 13, 16)




                                          22
Problems with periodic stabilization

• Too many leaves destroy the system
   • System survives only if (#leaves + #failures) per round < |successor-list|




(Figure: ring segment with nodes 10, 11, 12, 14, 15)




                                                  23
Outline


•…
•…
 • Atomic Ring Maintenance
•…
•…




                             24
Atomic Ring Maintenance

• Differentiate leaves from failures
  • Leave is a synchronized departure

  • Failure is a crash-stop


• Initially assume no failures
• Build a ring initially



                                        25
Atomic Ring Maintenance

• Separate parts of the problem
  • Concurrency control
    • Serialize neighboring joins/leaves


  • Lookup consistency




                                           26
Naïve Approach

• Each node i hosts a lock called Li

  • For p to join or leave:
     • First acquire Lp.pred
     • Second acquire Lp
     • Third acquire Lp.succ
     • Thereafter update relevant pointers

• Can lead to deadlocks


                                             27
Our Approach to Concurrency Control

• Each node i hosts a lock called Li
  • For p to join or leave:
     • First acquire Lp
     • Thereafter acquire Lp.succ
     • Thereafter update relevant pointers

• Each lock has a lock queue
  • Nodes waiting to acquire the lock




                                             28
Safety

• Non-interference theorem:
  • When node p acquires both locks:
    • Node p’s successor cannot leave

    • Node p’s ”predecessor” cannot leave

    • Other joins cannot affect ”relevant”
      pointers



                                             29
Dining Philosophers

• Problem similar to the
  Dining philosophers’
  problem

• Five philosophers around a table
  • One fork between each philosopher (5)
  • Philosophers eat and think
  • To eat:
    • grab left fork
    • then grab right fork

                                            30
Deadlocks

• Can result in a deadlock
  • If all nodes acquire their first lock
  • Every node waiting indefinitely for second lock


• Solution from Dining philosophers’
  • Introduce asymmetry
  • One node acquires locks in reverse order


• Node with highest identifier reverses
  • If n>n.succ, then n has highest identity
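
The asymmetry can be illustrated with a small sketch. It assumes a static ring with at least two nodes and uses the example identifiers 0, 2, 5, 6, 11; `lock_order` is a hypothetical helper, and the point where the successor has a smaller identifier than the node itself is the wrap-around of the ring, that is, the node with the highest identifier.

```python
def lock_order(n, succ):
    """Order in which node n acquires the two locks it needs (L_n and L_succ).

    Normally the local lock comes first, then the successor's. The node whose
    successor wraps around the ring (n > succ) acquires them in reverse order.
    """
    return [succ, n] if n > succ else [n, succ]

ring = [0, 2, 5, 6, 11]   # assumed example ring, sorted by identifier
orders = {n: lock_order(n, ring[(i + 1) % len(ring)]) for i, n in enumerate(ring)}
```

With the reversal, every node acquires its locks in increasing identifier order (for instance node 11 takes L0 before L11), so no cycle of nodes each holding one lock and waiting for the next can form.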


                                                      31
Pitfalls
• Join adds node/“philosopher”
  • Solution: some requests in the lock queue
    forwarded to new node


(Figure: lock queues on a ring segment with nodes 10, 12, 14, 15)




                                                32
Pitfalls

• Leave removes a node/“philosopher”
  • Problem:
    if leaving node gives lock queue to its
    successor, nodes can get worse position in
    queue: starvation


• Use forwarding to avoid starvation
  • Lock queue empty after local leave request



                                                 33
Correctness

• Liveness Theorem:
  • Algorithm is starvation free
    • Also free from deadlocks and livelocks


• Every joining/leaving node will
  eventually succeed getting both locks




                                               34
Performance drawbacks
• If many neighboring nodes leaving
  • All grab local lock
  • Sequential progress


(Figure: ring segment with nodes 10, 12, 14, 15)


• Solution
  • Randomized locking
  • Release locks and retry
  • Liveness with high probability
                                      35
Lookup consistency: leaves

• So far dealt with concurrent joins/leaves
  • Look at concurrent join/leaves/lookups


• Lookup consistency (informally):
  • At any time, only one node responsible for
    any key

  • Joins/leaves should “not affect”
    functionality of lookups


                                                 36
Lookup consistency

• Goal is to make joins and leaves appear as if
  they happened instantaneously

• Every leave has a leave point
  • A point in global time, where the whole system
    behaves as if the node instantaneously left


• Implemented with a LeaveForward flag
  • The leaving node forwards messages to successor if
    LeaveForward is true


                                                     37
Leave Algorithm
(Message diagram between node p, node q (leaving), and node r; steps in order)
  1. Node q: LeaveForward=true (leave point)
  2. Node r: pred:=p
  3. Node p: succ:=r
  4. Node q: LeaveForward=false
                                                          38
Lookup consistency: joins

• Every join has a join point
  • A point in global time, where the whole
    system behaves as if the node
    instantaneously joined


• Implemented with a JoinForward flag
  • The successor of a joining node forwards
    messages to new node if JoinForward is
    true


                                               39
Join Algorithm
(Message diagram between node p, node q (joining), and node r; steps in order)
  1. Node r: JoinForward=true, oldpred=pred, pred=q (join point)
  2. Node q: pred:=p, succ:=r
  3. Node p: succ:=q
  4. Node r: JoinForward=false



                                                       40
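
The pointer updates of the join and leave diagrams can be replayed in a single-threaded sketch. It deliberately leaves out messages, locks and the actual forwarding of lookups; it only applies the updates in the order shown and checks that the ring stays closed. The `Node` class and helper names are assumptions made for this illustration.

```python
class Node:
    def __init__(self, ident):
        self.id = ident
        self.pred = self.succ = self   # a single node forms a ring by itself
        self.join_forward = False
        self.leave_forward = False
        self.old_pred = None

def join(q, r):
    """q joins between r.pred and r, following the join diagram."""
    p = r.pred
    r.join_forward, r.old_pred, r.pred = True, r.pred, q   # join point at r
    q.pred, q.succ = p, r                                  # q learns its neighbours
    p.succ = q                                             # p starts pointing at q
    r.join_forward, r.old_pred = False, None               # forwarding no longer needed

def leave(q):
    """q leaves, following the leave diagram."""
    p, r = q.pred, q.succ
    q.leave_forward = True     # leave point: q forwards messages to r from now on
    r.pred = p
    p.succ = r
    q.leave_forward = False

def ring_ids(start):
    """Identifiers met when following succ pointers once around the ring."""
    out, node = [], start
    while True:
        out.append(node.id)
        node = node.succ
        if node is start:
            return out
```

For example, starting from a ring of nodes 10 and 15, `join(Node(12), n15)` yields the ring 10, 12, 15, and a later `leave` of node 12 restores 10, 15.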
Outline


•…
•…
 • What about failures?
•…
•…




                          41
Dealing with Failures

• We prove it is impossible to provide
  lookup consistency on the Internet

• Assumptions
  • Availability (always eventually answer)
  • Lookup consistency
  • Partition tolerance

• Failure detectors can behave as if the
  network were partitioned
                                              42
Dealing with Failures

• We provide fault-tolerant atomic ring
  • Locks leased
  • Guarantees locks are always released


• Periodic stabilization ensures
  • Eventually correct ring
  • Eventual lookup consistency




                                           43
Contributions

• Lookup consistency in presence of
  joins/leaves
  • System not affected by joins/leaves
  • Inserts do not “disappear”


• No routing failures when nodes leave

• Number of leaves not bounded

                                          44
Related Work

• Li, Misra, Plaxton (’04, ’06) have a similar
  solution

• Advantages
  • Assertional reasoning
  • Almost machine verifiable proofs


• Disadvantages
  • Starvation possible
  • Not used for lookup consistency
  • Failure-free environment assumed
                                                 45
Related Work

• Lynch, Malkhi, Ratajczak (’02), position paper
  with pseudo code in appendix

• Advantages
  • First to propose atomic lookup consistency


• Disadvantages
  •   No proofs
  •   Message might be sent to a node that left
  •   Does not work for both joins and leaves together
  •   Failures not dealt with
                                                         46
Outline


•…
•…
 • Additional Pointers on the Ring
•…
•…




                                     47
Routing

• Generalization of Chord to provide
  arbitrary arity


• Provide logk(n) hops per lookup
   • k being a configurable parameter
   • n being the number of nodes
• Instead of only log2(n)


                                        48
Achieving log_k(n) lookup
• Each node has log_k(N) levels, N=k^L
• Each level contains k intervals
• Example, k=4, N=64 (4^3), node 0

     Node 0    I0      I1       I2       I3
     Level 1   0…15    16…31    32…47    48…63

(Figure: ring of 64 identifiers divided into Interval 0 to Interval 3)
                                                                     49
Achieving log_k(n) lookup
• Each node has log_k(N) levels, N=k^L
• Each level contains k intervals
• Example, k=4, N=64 (4^3), node 0

     Node 0    I0      I1       I2       I3
     Level 1   0…15    16…31    32…47    48…63
     Level 2   0…3     4…7      8…11     12…15

(Figure: identifiers 0 to 15 of the ring divided into Interval 0 to Interval 3)
                                                                     50
Achieving log_k(n) lookup
• Each node has log_k(N) levels, N=k^L
• Each level contains k intervals
• Example, k=4, N=64 (4^3), node 0

     Node 0    I0      I1       I2       I3
     Level 1   0…15    16…31    32…47    48…63
     Level 2   0…3     4…7      8…11     12…15
     Level 3   0       1        2        3

(Figure: ring of 64 identifiers)
                                                                     51
Arity important

• Maximum number of hops can be configured
  $k = \sqrt[r]{N} = N^{1/r}$
  $\log_k(N) = \log_{N^{1/r}}(N) = \log_{N^{1/r}}\bigl((N^{1/r})^{r}\bigr) = r$

• Example, a 2-hop system
  $k = \sqrt{N}$
  $\log_{\sqrt{N}}(N) = 2$

                                                             52
Placing pointers
• Each node has (k-1)log_k(N) pointers
  • Node p’s pointers point at
    $f(i) = p \oplus \bigl(1 + ((i-1) \bmod (k-1))\bigr)\, k^{\lfloor (i-1)/(k-1) \rfloor}$
    (⊕ is addition modulo N)

• Node 0’s pointers
      f(1)=1
      f(2)=2
      f(3)=3
      f(4)=4
      f(5)=8
      f(6)=12
      f(7)=16
      f(8)=32
      f(9)=48

(Figure: ring of 64 identifiers with node 0’s pointers at 4, 8, 12, 16, 32, 48)
                                                                          53
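
The pointer positions can be computed directly from the formula for f(i). The sketch below assumes N is an exact power of k and that the operator between p and the offset is addition modulo N; the function name `pointers` is chosen here for illustration.

```python
def pointers(p, k, N):
    """Identifiers f(1), ..., f((k-1)*log_k(N)) that node p points at."""
    levels, size = 0, 1
    while size < N:            # levels = log_k(N), computed without floating point
        size *= k
        levels += 1
    out = []
    for i in range(1, (k - 1) * levels + 1):
        multiple = 1 + (i - 1) % (k - 1)          # 1 .. k-1 within a level
        power = k ** ((i - 1) // (k - 1))         # k^0, k^1, ... per level
        out.append((p + multiple * power) % N)    # addition modulo N
    return out
```

For k=4 and N=64, `pointers(0, 4, 64)` gives 1, 2, 3, 4, 8, 12, 16, 32, 48, the nine pointers listed for node 0. With k=2 the same function gives the Chord-style offsets 1, 2, 4, 8, ...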
Greedy Routing
• lookup(i) algorithm
  • Use pointer closest to i, without
    “overshooting” i

  • If no such pointer exists, succ is responsible
    for i
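
The effect of the arity on the hop count can be checked with a small sketch of greedy routing. To keep it short it assumes a fully populated ring (every identifier is a node), so a pointer at offset d from node p leads to node p+d; on a sparse ring the pointers would go through succ instead. The names `offsets` and `hops` are illustrative.

```python
def offsets(k, N):
    """Pointer offsets j * k^l for j = 1..k-1 and k^l < N."""
    out, power = [], 1
    while power < N:
        out.extend(j * power for j in range(1, k))
        power *= k
    return out

def hops(src, dst, k, N):
    """Number of greedy hops from src to dst on a fully populated ring."""
    offs = offsets(k, N)
    cur, count = src, 0
    while cur != dst:
        dist = (dst - cur) % N
        # pointer closest to dst without overshooting it
        cur = (cur + max(o for o in offs if o <= dist)) % N
        count += 1
    return count

worst_k4 = max(hops(0, d, 4, 64) for d in range(64))
worst_k2 = max(hops(0, d, 2, 64) for d in range(64))
```

Under these assumptions the worst case is 3 hops for k=4 and 6 hops for k=2 with N=64, in line with log_k(N).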





                                                 54
Routing with Atomic Ring Maintenance

• Invariant of lookup
  • Last hop is always predecessor of
    responsible node


• Last step in lookup
  • If JoinForward is true, forward to pred
  • If LeaveForward is true, forward to succ




                                               55
Avoiding Routing Failures
• If nodes leave, routing failures
  can occur

• Accounting algorithm
  • Simple Algorithm
     • No routing failures of ordinary messages

  • Fault-free Algorithm
     • No routing failures


• Many cases and interleavings
  • Concurrent joins and leaves,
    pointers in both directions
                                                  56
General Routing
• Three lookup styles

  • Recursive

  • Iterative

  • Transitive




                        57
Reliable Routing
• Reliable lookup for each style
   • If initiator doesn’t crash, responsible node reached
   • No redundant delivery of messages

• General strategy
   • Repeat operation until success
   • Filter duplicates using unique identifiers

• Iterative lookup
   • Reliability easy to achieve

• Recursive lookup
   • Several algorithms possible

• Transitive lookup
   • Efficient reliability hard to achieve
                                                            58
Outline


•…
•…
 • One-to-many Communication
•…
•…




                               59
Group Communication on an Overlay

• Use existing routing pointers
  • Group communication


• DHT only provides key lookup
  • Complex queries by searching the overlay
  • Limited horizon broadcast
  • Iterative deepening


• More efficient than Gnutella-like systems
  • No unintended graph partitioning
  • Cheaper topology maintenance [castro04]
                                               60
Group Communication on an Overlay

• DHT builds a graph
  • Why not use general graph algorithms?


• Can use the specific structure of DHTs
  • More efficient
  • Avoids redundant messages




                                            61
Broadcast Algorithms

• Correctness conditions:
  • Termination
     • Algorithm should eventually terminate


  • Coverage
     • All nodes should receive the broadcast message


  • Non-redundancy
     • Each node receives the message at most once


• Initially assume no failures
                                                        62
Naïve Broadcast

• Naive Broadcast Algorithm
  send message to succ until:
      initiator reached or overshot

(Figure: identifier ring with positions 0 to 15; the initiator is marked)
                                                                 63
Naïve Broadcast

• Naive Broadcast Algorithm
  send message to succ until:
      initiator reached or overshot

• Improvement
  • Initiator delegates half the space to neighbor

• Idea applied recursively
  • log(n) time and n messages

(Figure: identifier ring with positions 0 to 15; the initiator is marked)
                                                                     64
Simple Broadcast in the Overlay

• Dissertation assumes general DHT model

event n.SimpleBcast(m, limit)       % initially limit = n
  for i:=M downto 1 do
      if u(i) ∈ (n,limit) then
          sendto u(i) : SimpleBcast(m, limit)
          limit := u(i)
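
The pseudocode above can be simulated to check coverage and non-redundancy. The sketch assumes the five-node example ring (N=16, nodes 0, 2, 5, 6, 11) with pointers u(i)=succ(n+2^(i-1)), replaces message passing by a local queue, and assumes no failures; all names other than the pseudocode's `limit` and `u` are illustrative.

```python
from bisect import bisect_left
from collections import Counter, deque

N = 16
NODES = sorted([0, 2, 5, 6, 11])   # assumed example ring

def succ(ident):
    i = bisect_left(NODES, ident % N)
    return NODES[i % len(NODES)]

def u(n):
    """Pointers u(1..M) of node n: succ(n+1), succ(n+2), succ(n+4), succ(n+8)."""
    return [succ(n + 2 ** i) for i in range(N.bit_length() - 1)]

def in_open(x, a, b):
    """x in the ring interval (a, b); (a, a) means every identifier except a."""
    return 0 < (x - a) % N < ((b - a) % N or N)

def simple_bcast(initiator):
    """Run SimpleBcast and count how often each node receives the message."""
    delivered = Counter()
    queue = deque([(initiator, initiator)])    # (node, limit); initially limit = n
    while queue:
        n, limit = queue.popleft()
        for ptr in reversed(u(n)):             # for i := M downto 1
            if in_open(ptr, n, limit):         # u(i) in (n, limit)
                queue.append((ptr, limit))     # sendto u(i) : SimpleBcast(m, limit)
                delivered[ptr] += 1
                limit = ptr                    # limit := u(i)
    return delivered
```

In this setting `simple_bcast(0)` delivers the message exactly once to each of the nodes 2, 5, 6 and 11, using four messages.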




                                                            65
”Advanced” Broadcast
• Old algorithm on k-ary trees




                                 66
Getting responses
• Getting a reply
   • Nodes send directly back to initiator
   • Not scalable

• Simple Broadcast with Feedback
   • Collect responses back to initiator
   • Broadcast induces a tree, feedback in reverse direction

• Similar to simple broadcast algorithm
   • Keeps track of parent (par)
   • Keeps track of children (Ack)
   • Accumulate feedback from children, send to parent

• Atomic ring maintenance
   • Acquire local lock to ensure nodes do not leave
                                                               67
Outline


•…
•…
 • Advanced One-to-many Communication
•…
•…




                                        68
Motivation for Bulk Operation

• Building MyriadStore in 2005
  • Distributed backup using the DKS DHT

• Restoring a 4 MB file
  • Each block (4 KB) indexed in DHT
  • Requires 1000 items in DHT

• Expensive
  • One node making 1000 lookups
  • Marshaling/unmarshaling 1000 requests

                                            69
Bulk Operation

• Define a bulk set: I
  • A set of identifiers

• bulk_operation(m, I)
  • Send message m to every node i ∈ I

• Similar correctness to broadcast
  • Coverage: all nodes with identifier in I
  • Termination
  • Non-redundancy
                                               70
Bulk Owner Operation with Feedback

• Define a bulk set: I
  • A set of identifiers


• bulk_own(m, I)
  • Send m to every node responsible for an identifier
    i∈I


• Example
  • Bulk set I={4}
  • Node 4 might not exist
  • Some node is responsible for identifier 4
                                                         71
Bulk Operation with Feedback

• Define a bulk set: I
  • A set of identifiers

• bulk_feed(m, I)
  • Send message m to every node i ∈ I
  • Accumulate responses back to initiator


• bulk_own_feed(m, I)
  • Send message m to every node responsible
    for i ∈ I
  • Accumulate responses back to initiator
                                               72
Bulk Properties (1/2)

• No redundant messages

• Maximum log(n) messages per node




                                     73
Bulk Properties (2/2)
• Two extreme cases

• Case 1
  •   Bulk set is all identifiers
  •   Identical to simple broadcast
  •   Message complexity is n
  •   Time complexity is log(n)

• Case 2
  •   Bulk set is a singleton with one identifier
  •   Identical to ordinary lookup
  •   Message complexity is log(n)
  •   Time complexity is log(n)

                                                    74
Pseudo Reliable Broadcast
• Pseudo-reliable broadcast to deal with crash failures

• Coverage property
   • If initiator is correct, every node gets the message

• Similar to broadcast with feedback

• Use failure detectors on children
   • If child with responsibility to cover I fails
   • Use bulk to retry covering interval I

• Filter redundant messages using unique identifiers

• Eventually perfect failure detector for termination
   • Inaccuracy results in redundant messages

                                                            75
Applications of bulk operation

• Bulk operation
  • Topology maintenance: update nodes in bulk set
  • Pseudo-reliable broadcast: re-covering intervals


• Bulk owner
  • Multiple inserts into a DHT


• Bulk owner with feedback
  • Multiple lookups in a DHT
  • Range queries


                                                       76
Outline


•…
•…
 • Replication
•…
•…




                 77
Successor-list replication

• Successor-list replication
  • Replicate a node’s item on its f successors
  • Used by DKS, Chord, Pastry, Koorde, etc.


• Was abandoned in favor of symmetric
  replication because …




                                                  78
Motivation: successor-lists
• If a node joins or leaves
  • f replicas need to be updated

[Figure: ring of nodes; each color represents a data item. Replication degree 3: every color is replicated three times.]




                                                          79
Motivation: successor-lists
• If a node joins or leaves
  • f replicas need to be updated

[Figure: the same ring after a node leaves; each color represents a data item. The yellow, green, red, and blue items need to be re-distributed.]
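
A rough sketch of why a single leave touches f replicas under successor-list replication. It assumes each item is stored on the responsible node and the nodes that follow it, f copies in total (matching "replication degree 3" in the figure). The ring size, node set, and function names are hypothetical.

```python
N = 16  # identifier space size (assumed)

def replica_group(nodes, ident, f):
    """The f consecutive nodes, starting at the responsible node, that store ident."""
    ring = sorted(nodes)
    start = next((k for k, n in enumerate(ring) if n >= ident % N), 0)
    return [ring[(start + j) % len(ring)] for j in range(min(f, len(ring)))]

def affected_nodes(nodes, leaving, f):
    """Nodes that must receive new copies when `leaving` departs."""
    after = [n for n in nodes if n != leaving]
    changed = set()
    for ident in range(N):
        old = set(replica_group(nodes, ident, f))
        new = set(replica_group(after, ident, f))
        changed |= new - old
    return changed

# Node 5 leaves a five-node ring with replication degree 3.
print(affected_nodes([0, 3, 5, 10, 14], leaving=5, f=3))
```

In this example three different nodes each have to fetch data, one for every replica group the departed node belonged to.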


                                                        80
Multiple hashing
• Rehashing
  • Store each item <k,v> at
     •   succ( H(k) )
     •   succ( H(H(k)) )
     •   succ( H(H(H(k))) )
     •   …


• Multiple hash functions
  • Store each item <k,v> at
     •   succ( H1(k) )
     •   succ( H2(k) )
     •   succ( H3(k) )
     •   …

• Advocated by CAN and Tapestry
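
A small sketch of both placement schemes, assuming SHA-1 reduced modulo N as a stand-in for H and index-salted SHA-1 as a stand-in for H1, H2, H3. Neither choice comes from the slides, and the rehashing variant hashes the full digest rather than the reduced identifier.

```python
import hashlib

N = 16  # identifier space size (assumed)

def _digest(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def rehash_ids(key: str, f: int) -> list[int]:
    """Identifiers H(k), H(H(k)), ... obtained by hashing repeatedly."""
    ids, data = [], key.encode()
    for _ in range(f):
        data = _digest(data)
        ids.append(int.from_bytes(data, "big") % N)
    return ids

def multi_hash_ids(key: str, f: int) -> list[int]:
    """Identifiers H1(k), H2(k), ..., simulated by salting with the index."""
    return [int.from_bytes(_digest(f"{j}:{key}".encode()), "big") % N
            for j in range(1, f + 1)]

print(rehash_ids("Seif", 3))
print(multi_hash_ids("Seif", 3))
```

In both schemes each replica is then stored at succ() of the computed identifier. Going from an identifier back to the key is not possible with a one-way hash, which is the problem the next slide describes.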

                                  81
Motivation: multiple hashing
• Example
  • Item <"Seif", "Stockholm">
     • H("Seif")=7
     • succ(7)=9


  • Node 9 crashes
     • Node 12 should get item from replica
     • Need hash inverse $H^{-1}(7)$ = "Seif" (impossible)
     • Items dispersed all over nodes (inefficient)

[Figure: ring segment showing identifiers 5, 7, 9, and 12 and the item <"Seif", "Stockholm">]


                                                           82
Symmetric Replication
•Basic Idea
   •Replicate identifiers, not nodes


•Associate each identifier i with f other identifiers:
   • $r(k) = i \oplus k \cdot \frac{N}{f}$, for $0 \le k < f$

•Identifier space partitioned into m
 equivalence classes
   •Cardinality of each class is f, m = N/f

•Each node replicates the equivalence class of
 all identifiers it is responsible for
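
A minimal sketch of the replica identifiers, assuming addition modulo N and the illustrative values N=16 and f=4. The node set and helper names are hypothetical.

```python
N = 16  # size of the identifier space (illustrative)
f = 4   # replication degree; assumed to divide N

def replicas(i):
    """Identifiers associated with i: i + k*N/f (mod N) for 0 <= k < f."""
    return [(i + k * (N // f)) % N for k in range(f)]

def equivalence_class(i):
    """The class of i; every member of the class yields the same set."""
    return sorted(replicas(i))

def responsible(nodes, i):
    """First node at or after identifier i, wrapping around the ring."""
    return min((n for n in nodes if n >= i % N), default=min(nodes))

def replica_nodes(nodes, i):
    """Nodes that hold a copy of the item with identifier i."""
    return [responsible(nodes, r) for r in replicas(i)]

print(replicas(7))                           # [7, 11, 15, 3]
print(equivalence_class(7))                  # [3, 7, 11, 15]
print(replica_nodes([0, 3, 5, 10, 14], 7))   # [10, 14, 0, 3]
```

Any node can compute every replica location of an item from its identifier alone, without knowing which nodes are currently in the system.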

                                                         83
Symmetric replication
 Replication degree f=4, Space={0,…,15}
 • Congruence classes modulo 4:
    •   {0, 4, 8, 12}
    •   {1, 5, 9, 13}
    •   {2, 6, 10, 14}
    •   {3, 7, 11, 15}

 [Figure: ring of identifiers 0 to 15 with five data labels: "Data: 15, 0", "Data: 1, 2, 3", "Data: 4, 5", "Data: 6, 7, 8, 9, 10", "Data: 14, 13, 12, 11"]
                                                                                                                 84
Ordinary Chord
 Replication degree f=4, Space={0,…,15}
 • Congruence classes modulo 4
    •   {0, 4, 8, 12}
    •   {1, 5, 9, 13}
    •   {2, 6, 10, 14}
    •   {3, 7, 11, 15}

 [Figure: ring of identifiers 0 to 15 with five groups of data labels, one row per group]
    •   Data: 15, 0 | 11, 12 | 7, 8 | 3, 4
    •   Data: 1, 2, 3 | 13, 14, 15 | 9, 10, 11 | 5, 6, 7
    •   Data: 4, 5 | 0, 1 | 12, 13 | 8, 9
    •   Data: 6, 7, 8, 9, 10 | 2, 3, 4, 5, 6 | 14, 15, 0, 1, 2 | 10, 11, 12, 13, 14
    •   Data: 14, 13, 12, 11 | 10, 9, 8, 7 | 6, 5, 4, 3 | 2, 1, 0, 15
                                                                                                                     85
Cheap join/leave
 Replication degree f=4, Space={0,…,15}
 • Congruence classes modulo 4
    •   {0, 4, 8, 12}
    •   {1, 5, 9, 13}
    •   {2, 6, 10, 14}
    •   {3, 7, 11, 15}

 [Figure: ring of identifiers 0 to 15 with groups of data labels, one row per group]
    •   Data: 15, 0 | 11, 12 | 7, 8 | 3, 4
    •   Data: 1, 2, 3 | 13, 14, 15 | 9, 10, 11 | 5, 6, 7
    •   Data: 4, 5 | 0, 1 | 12, 13 | 8, 9
    •   Data: 6, 7, 8, 9, 10 | 2, 3, 4, 5, 6 | 14, 15, 0, 1, 2 | 10, 11, 12, 13, 14
    •   Data: 14, 13, 12, 11 | 10, 9, 8, 7 | 6, 5, 4, 3 | 2, 1, 0, 15
    •   Additional group near identifiers 12 to 13: Data: 11, 12 | 7, 8 | 3, 4 | 0, 15
        (also shown combined as "Data: 11, 12, 7, 8, 3, 4, 0, 15")
                                                                                                                      86
Contributions
• Message complexity for join/leave O(1)
  • Bit complexity remains unchanged

• Handling failures more complex
  • Bulk operation to fetch data
  • On average log(n) complexity

• Can do parallel lookups
  •   Decreasing latencies
  •   Increasing robustness
  •   Distributed voting
  •   Erasure codes

                                           87
Presentation Overview


•…
•…
 • Summary
•…
•…



                        88
Summary (1/3)

• Atomic ring maintenance
  • Lookup consistency for joins/leaves
  • No routing failures as nodes join/leave
  • No bound on number of leaves
  • Eventual consistency with failures

• Additional routing pointers
  • k-ary lookup
  • Reliable lookup
  • No routing failures with additional pointers

                                                   89
Summary (2/3)
• Efficient Broadcast
  • log(n) time and n message complexity
  • Used in overlay multicast


• Bulk operations
  • Efficient parallel lookups
  • Efficient range queries




                                           90
Summary (3/3)

• Symmetric Replication
  • Simple, O(1) message complexity for joins/leaves
    • O(log f) for failures


  • Enables parallel lookups
    • Decreasing latencies
    • Increasing robustness
    • Distributed voting




                                              91
Presentation Overview

• Gentle introduction to DHTs
• Contributions
• The future




                                92
Future Work (1/2)

• Periodic stabilization
  • Prove it is self-stabilizing




                                   93
Future Work (2/2)

• Replication Consistency
  • Atomic consistency impossible in
    asynchronous systems
  • Assume partial synchrony
  • Weaker consistency models?
  • Using virtual synchrony




                                       94
Speculative long-term agenda
• Overlay today provides
   •   Dynamic membership
   •   Identities (max/min avail)
   •   Only know subset of nodes
   •   Shared memory registers

• Revisit distributed computing
   •   Assuming an overlay as basic primitive
   •   Leader election
   •   Consensus
   •   Shared memory consistency (started)
   •   Transactions
   •   Wave algorithms (started)


• Implement middleware providing these…
                                                95
Acknowledgments

• Seif Haridi
• Luc Onana Alima

•   Cosmin Arad
•   Per Brand
•   Sameh El-Ansary
•   Roland Yap


                      96
THANK YOU




            97
98
Handling joins
• When n joins
  • Find n’s successor with lookup(n)
  • Set succ to n’s successor
  • Stabilization fixes the rest

[Figure: ring segment with nodes 11, 13, and 15]

Periodically at n:
1. set v:=succ.pred
2. if v≠nil and v is in (n,succ]
3.    set succ:=v
4. send a notify(n) to succ

When receiving notify(p) at n:
1. if pred=nil or p is in (pred,n]
2.    set pred:=p
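
The two routines on this slide can be simulated in a single process as below. This is a sketch under assumptions: notify is a direct function call instead of a message, the intervals are closed on the right as written on the slide, and the Node class and the scenario (13 joining between 11 and 15) are illustrative.

```python
class Node:
    def __init__(self, ident):
        self.id = ident
        self.succ = self   # a lone node is its own successor
        self.pred = None   # nil until a notify arrives

def between(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b   # the interval wraps around the ring

def stabilize(n):
    """Periodically at n."""
    v = n.succ.pred
    if v is not None and between(v.id, n.id, n.succ.id):
        n.succ = v
    notify(n.succ, n)

def notify(n, p):
    """When receiving notify(p) at n."""
    if n.pred is None or between(p.id, n.pred.id, n.id):
        n.pred = p

# Existing ring 11 -> 15 -> 11; node 13 joins with succ set to lookup(13) = 15.
a, b, c = Node(11), Node(15), Node(13)
a.succ, a.pred = b, b
b.succ, b.pred = a, a
c.succ = b

for _ in range(3):            # a few stabilization rounds
    for node in (a, b, c):
        stabilize(node)

print(a.succ.id, c.succ.id, b.succ.id)   # 13 15 11
print(c.pred.id, b.pred.id, a.pred.id)   # 11 13 15
```

The joining node only sets its own successor; the remaining pointers are repaired by the periodic rounds.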

                                                                                      99
Handling leaves
• When n leaves
  • Just disappear (like failure)

• When pred detected failed
  • Set pred to nil

• When succ detected failed
  • Set succ to closest alive in successor list

[Figure: ring segment with nodes 11, 13, and 15]

Periodically at n:
1. set v:=succ.pred
2. if v≠nil and v is in (n,succ]
3.    set succ:=v
4. send a notify(n) to succ

When receiving notify(p) at n:
1. if pred=nil or p is in (pred,n]
2.    set pred:=p

                                                                                   100

				