Distributed Hash Tables: A Modular Approach
Ph.D. Thesis at Stanford University, 2004
Gurmeet Singh Manku
Advisor: Prof. Rajeev Motwani
manku@cs.stanford.edu
http://www-db.stanford.edu/~manku
27 May 2004

Design Principle I: Flat Address Space
- Each host acquires an ID in [0, 1) and "manages" a "partition".
[Figure: a unit circle of hosts with IDs 0.00, 0.010, 0.011, 0.100, 0.101, 0.110, 0.111]

Design Principle II: Structured Overlay
- Overlay = "Circle" + "Routing Network".
- Problem: how to insert/lookup hash entries? Issue: no global knowledge of who manages what!
- Solution: an overlay of TCP links; insert/lookup hash entries by "routing".
- The "Circle" is a strong core, for "correctness"; the "Routing Network" supplies short routes, for "efficiency".

A Modular Framework for DHTs
- Applications
- DHT Data Management Layer
- DHT Routing Layer:
  - Choice of Routing Network (description of any family of deterministic/randomized ICNs defined over powers-of-two nodes)
  - "Circle" Emulation Engine (handles dynamism & scale; aware of physical network proximity)
  - ID Management (ensures fine-grained partition balance)
- Low-level Routines

The ID Assignment Problem
- How does a new host acquire an ID in [0, 1)? There is no global knowledge of the current set of IDs.
- Can a system enjoy these properties simultaneously?
  - Low cost (# of messages)
  - Almost equi-sized partitions

Naïve ID Assignment
- Choose r, a random number in [0, 1) [SMK+01, RD01, RFHK01] [MNR02, FG03, KK03] [MBR03, HJS+03, ZHS+04].
- Let σ denote the partition-balance ratio: the ratio of the largest to the smallest partition.
- σ = Θ(log² n) with n hosts in the system; σ > 100 when n = 4K.
- Can we do better? Perhaps... if we could "learn" a few IDs.

Tree-based assignment [NW03]:
- One RANDOM PROBE: split the manager encountered. Equivalently, one "random walk" down the tree: split the leaf encountered.
- A host with a k-bit ID "manages" an interval of size 2^-k.
- Leaves end up in O(log log n) different levels, so σ = Θ(log n) with n hosts in the system [NW03].
[Figure: a binary tree whose leaves carry IDs such as .01101, .01100, .0001]
- Can we make σ = Θ(1)?
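The imbalance of the naïve scheme is easy to check numerically. The sketch below is illustrative code, not from the thesis (the function name `partition_balance` is mine): it assigns n uniformly random IDs and computes σ, the ratio of the largest to the smallest arc on the circle.

```python
import random

def partition_balance(n, seed=0):
    """Assign n hosts uniformly random IDs in [0, 1) and return sigma,
    the ratio of the largest to the smallest partition (each host
    manages the clockwise arc up to its successor on the circle)."""
    rng = random.Random(seed)
    ids = sorted(rng.random() for _ in range(n))
    # Circular gaps between consecutive IDs; the last one wraps around.
    gaps = [(ids[(i + 1) % n] - ids[i]) % 1.0 for i in range(n)]
    return max(gaps) / min(gaps)

print(partition_balance(4096))  # sigma for n = 4K hosts
```

For n = 4K the computed σ typically comes out far above 100, in line with the slide's claim that random IDs leave partitions badly unbalanced.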
(Only the leaf nodes of the tree correspond to IDs in [0, 1).)

A New ID Assignment Algorithm [M04]
1. One RANDOM walk down the tree.
2. Walk up until the sub-tree has (c log n) leaves.
3. Split the "shallowest leaf" below the sub-tree.
- Leaves lie in at most 3 different levels w.h.p.: [log2 n] and [log2 n] ± 1, where [x] denotes the integer closest to x. So σ ≤ 4.

Implementation: Improved ID Assignment [KM04*]
(R denotes the cost of one RANDOM PROBE, i.e., one routing operation.)
- [M04]: 1 RANDOM PROBE followed by 1 LOCAL PROBE of size (c log n). σ ≤ 4; cost = Θ(R + log n).
- [AAABMP03] [NW03]: (c log n) RANDOM PROBES. σ ≤ 4; cost = Θ(R log n).
- [KM04*]: "a" RANDOM PROBES followed by LOCAL PROBES of size "b". σ ≤ 4 when ab > c log n; cost = Θ(aR + b); a_opt ≈ log log n.

Emulation Engine in Action
(Each DHT below is obtained by plugging a topology description into the framework: Choice of Topology + Circle Emulation Engine + ID Assignment.)
- Hypercube → Pastry [RD01], Tapestry [ZHS+04]
- Butterfly → Ulysses [KMXY03]
- Variant of hypercube → Chord [SMK+01]
- Multi-dimensional mesh → CAN [RFHK01]
- de Bruijn graph → D2B [FG03], Koorde [KK03], [NW03], [AAA+03], [LKRG03]
- Kleinberg's construction [Kle00] → Symphony [MBR03]
- Randomized "standard" butterfly → Viceroy [MNR02]
- Randomized "Kleinberg-style" butterfly → [M03]
- Randomized Chord & hypercubes → [CDHR03], [GGG03], [ZGG03]
- Skip-lists / skip-graphs → [AS03], [HJS+03]

The Emulation Engine [M03, M04]
- Input: a description of a parallel network over static sets of nodes of successive powers of two (2^0, 2^1, 2^2, 2^3, 2^4, ...).
- The ID assignment schemes of [M04] and [KM04*] guarantee that leaves lie in at most 3 levels: [log2 n] and [log2 n] ± 1, where [x] denotes the integer closest to x.
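Steps 1-3 of the [M04] algorithm can be mimicked in a toy simulation. Everything below is an illustrative sketch, not the thesis's implementation (the names `join_hosts` and `covers`, and the constant c = 8, are my choices): each ID is the binary path to a leaf of the trie, and hosts join one at a time.

```python
import math
import random

def covers(leaf, x):
    # True if point x in [0, 1) lies in the interval managed by `leaf`,
    # a binary string: e.g. leaf "011" manages [.011, .100) in binary.
    lo = int(leaf, 2) / 2 ** len(leaf) if leaf else 0.0
    return lo <= x < lo + 2.0 ** -len(leaf)

def join_hosts(n_hosts, c=8, seed=1):
    """Toy version of the [M04] join: (1) random walk down the trie,
    (2) walk up until the sub-tree has ~c*log2(n) leaves, (3) split
    the shallowest leaf below that sub-tree."""
    rng = random.Random(seed)
    leaves = {""}                      # one host managing all of [0, 1)
    for n in range(1, n_hosts):
        x = rng.random()
        target = next(l for l in leaves if covers(l, x))        # step 1
        need = c * max(1, math.ceil(math.log2(n + 1)))
        prefix = target
        while prefix and sum(l.startswith(prefix) for l in leaves) < need:
            prefix = prefix[:-1]                                # step 2
        victim = min((l for l in leaves if l.startswith(prefix)), key=len)
        leaves.remove(victim)                                   # step 3
        leaves.update({victim + "0", victim + "1"})
    return leaves

leaves = join_hosts(512)
print(sorted({len(l) for l in leaves}))  # distinct leaf depths (levels)
```

In runs like this the leaf depths stay in a narrow band around log2 n, in the spirit of the at-most-3-levels guarantee of [M04].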
Emulation with "naïve ID assignment" was shown in [M03].

Emulation Engine Details
- THREE sets of links per node. A node with a d-bit ID makes:
  - 1 set of links assuming its d-bit ID (top d bits),
  - 1 set assuming its (d-1)-bit ID (top d-1 bits),
  - 1 set assuming its (d-2)-bit ID (top d-2 bits).
- Example (HYPERCUBE): a node with 5-bit ID .01101 makes links with the managers of .111, .001, .010 (assuming 3-bit ID .011); of .1110, .0010, .0100, .0111 (assuming 4-bit ID .0110); and of .11101, .00101, .01001, .01111, .01100 (assuming its full 5-bit ID).
- These three sets correspond to levels [log2 n] - 1, [log2 n], and [log2 n] + 1; EVERY node makes a set of links assuming a ([log2 n] - 1)-bit ID.
- ROUTING: if the source has a d-bit ID, routing starts off at level (d - 2), moving down one level at an intermediate node, if required.

Physical-Network Proximity [MG04*]
- The problem: all links in the "Routing Network" have high latency (ping times)! How do we make these links low-latency?
- Existing solutions [GGG+03] [ZGG03] are tied to specific networks (Chord/hypercube).
- A generic solution [MG04*]: a generic scheme to incorporate network proximity into any deterministic/randomized family of ICNs ("Circle" + "Routing Network").

Symphony [MBR03]: A Randomized DHT Routing Network
- Each host makes k links. The clockwise distance of each link is drawn from the probability density function p(x) = 1/(x ln N) on [1/N, 1], which integrates to 1:
  ∫_{1/N}^{1} p(x) dx = ∫_{1/N}^{1} dx / (x ln N) = 1.
- Kleinberg [K00]: with k = 1, GREEDY routing requires O(log² n) hops.

Greedy Routing
- Chord (log N links per node): clockwise distances N/2, N/4, N/8, ..., 1.
- R-Chord (with log N links per node) [ZGG03, GGG+03]: clockwise distances drawn uniformly at random from [N/2, N), [N/4, N/2), [N/8, N/4), and so on.
- GREEDY: "choose the link that makes 'most progress' towards the destination."
- GREEDY with 1-LOOKAHEAD: "GREEDY augmented by taking IDs of neighbors' neighbors into account."
- How good are these?

The Story that Emerges [MNW04]
In topologies with n nodes and log n links per node:
A. Deterministic topologies (Hypercube/Chord) have diameter Θ(log n); GREEDY routing is optimal; 1-LOOKAHEAD is no help.
B. Randomization reduces the diameter to Θ(log n / log log n) in expectation: R-Hypercube, R-Chord and Symphony have diameter Θ(log n / log log n).
C. GREEDY is unable to discover the short routes: path lengths are Θ(log n) steps on average.
D. GREEDY with 1-LOOKAHEAD is able to discover short routes: path lengths reduce to Θ(log n / log log n) steps on average.
E. GREEDY with 1-LOOKAHEAD is competitive with de Bruijn networks: average path lengths are within 10% of each other.
More results in [M03] [GM04] [AMM04*].

Papers & Manuscripts
[BMR03] "SETS: Search Enhanced by Topic Segmentation" by Bawa, Manku and Raghavan, SIGIR 2003.
[MBR03] "Symphony: Distributed Hashing in a Small-World" by Manku, Bawa and Raghavan, USITS 2003.
[M03] "Routing Networks for Distributed Hash Tables" by Manku, PODC 2003.
[GM04] "Optimal Routing in Chord" by Ganesan and Manku, SODA 2004.
[MNW04] "Know thy Neighbor's Neighbor: The Power of Lookahead in Randomized P2P Routing" by Manku, Naor and Wieder, STOC 2004.
[M04] "Balanced Binary Trees for ID Management in Distributed Hash Tables" by Manku, PODC 2004.
[AMM04*] "The Degree-Diameter Greedy Routing Problem" by Abraham, Malkhi and Manku, submitted to DISC 2004.
[KM04*] "Structured Coupon Collection over Cliques" by Kenthapadi and Manku, manuscript.
[MG04*] "Low-Latency DHTs based on Arbitrary Geometries" by Manku and Ganesan, manuscript.
http://www.cs.stanford.edu/~manku
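As a closing illustration of points B-D above, here is a small self-contained simulation of GREEDY versus GREEDY with 1-LOOKAHEAD on a Symphony-style ring. It is a sketch under my own assumptions (ring size N = 1024, log2 N long links per node, hypothetical helper names); long-link distances follow the harmonic pdf p(x) = 1/(x ln N), sampled by inverse CDF as x = N^(u-1) for uniform u.

```python
import math
import random

def build_symphony(N, seed=7):
    """Ring of nodes 0..N-1: each node links to its successor plus
    k = log2(N) long links whose clockwise distance is N**u for
    uniform u in [0, 1) -- i.e. x = N**(u-1) scaled up to [1, N)."""
    rng = random.Random(seed)
    k = int(math.log2(N))
    links = {}
    for v in range(N):
        out = {(v + 1) % N}                       # circle successor
        for _ in range(k):
            dist = max(1, int(N ** rng.random()))
            out.add((v + dist) % N)
        links[v] = sorted(out)
    return links

def cw(a, b, N):
    return (b - a) % N                            # clockwise distance

def route(links, s, t, N, lookahead=False):
    """Clockwise greedy routing. Only moves that strictly reduce the
    clockwise distance are allowed (the successor always qualifies,
    so routing terminates). With lookahead, neighbors are ranked by
    the best distance reachable via their own non-overshooting links."""
    cur, hops = s, 0
    while cur != t:
        cands = [v for v in links[cur] if cw(v, t, N) < cw(cur, t, N)]
        if lookahead:
            score = lambda v: min([cw(v, t, N)] +
                                  [cw(w, t, N) for w in links[v]
                                   if cw(v, w, N) <= cw(v, t, N)])
        else:
            score = lambda v: cw(v, t, N)
        cur = min(cands, key=score)
        hops += 1
    return hops

N = 1024
links = build_symphony(N)
rng = random.Random(1)
pairs = [(rng.randrange(N), rng.randrange(N)) for _ in range(200)]
greedy_avg = sum(route(links, s, t, N) for s, t in pairs) / len(pairs)
look_avg = sum(route(links, s, t, N, lookahead=True)
               for s, t in pairs) / len(pairs)
print(greedy_avg, look_avg)
```

On rings this small the gap between the two averages is modest; the Θ(log n) versus Θ(log n / log log n) separation of [MNW04] only emerges asymptotically.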