Embed
Email

041504

Document Sample

Shared by: liwenting
Categories
Tags
Stats
views:
0
posted:
11/23/2011
language:
English
pages:
24
G22.3250-001

Distributed Data Structures

for Internet Services









Robert Grimm

New York University

(with some slides by Steve Gribble)

Altogether Now:

The Three Questions



 What is the problem?

 What is new or different?

 What are the contributions and limitations?

Clusters, Clusters, Clusters





 Let’s broaden the goals for cluster-based services

 Incremental scalability

 High availability

 Operational manageability

 And also data consistency

 But what to do if the data has to be persistent?

 TACC works best for read-only data

 Porcupine works best for a limited group of services

 Email, news, bulletin boards, calendaring

Enter Distributed Data Structures

(DDS)



 In-memory, single site application interface

 Persistent, distributed, replicated implementation

 Clean consistency model

 Atomic operations (but no transactions)

 Independent of accessing nodes (functional homogeneity)

DDS’s as an Intermediate Design

Point



 Relational databases

 Strong guarantees (ACID)

 But also high overhead, complexity

 Logical structure very much independent of physical layout



 Distributed data structures

 Atomic operations, one-copy equivalence

 Familiar, frequently used interface: hash table, tree, log

 Distributed file systems

 Weak guarantees (e.g., close/open consistency)

 Low-level interface with little data independence

 Applications impose structure on directories, files, bytes

Design Principles





 Separate concerns

 Service code implements application

 Storage management is reusable, recoverable

 Appeal to properties of clusters

 Generally secure and well-administered

 Fast network, uninterruptible power



 Design for high throughput and high concurrency

 Use event-driven implementation

 Make it easy to compose components

 Make it easy to absorb bursts (in event queues)

Assumptions





 No network partitions within cluster

 Highly redundant network

 DDS components are fail-stop

 Components implemented to terminate themselves

 Failures are independent

 Messaging is synchronous

 Bounded time for delivery

 Workload has no extreme hotspots (for hash table)

 Population density over key space is even

 Working set of hot keys is larger than # of cluster nodes

Distributed Hash Tables

(in a Cluster…)

DHT Architecture

Cluster-Wide Metadata Structures

Metadata Maps









Why is two-phase

commit acceptable

for DDS’s?

Recovery

Experimental Evaluation





 Cluster of 28 2-way SMPs and 38 4-way SMPs

 To a total of 208 500 MHZ Pentium CPUs

 2-way SMPs: 500 MB RAM, 100 Mbs switched Ethernet

 4-way SMPs: 1 GB RAM, 1 Gbs switched Ethernet

 Implementation written in Jāvā

 Sun’s JDK 1.1.7v3, OpenJIT, Linux user-level threads

 Load generators run within cluster

 80 nodes necessary to saturate 128 storage bricks

Scalability: Reads and Writes

Graceful Degradation (Reads)

Unexpected Imbalance (Writes)









What’s going on?

Capacity

Recovery Behavior









Normal

GC in action

1 brick fails | Recovery |

Buffer cache warm up

So, All Is Good?

Assumptions Considered Harmful!





 Central insight, based on experience with DDS

 “Any system that attempts to gain robustness solely

through precognition is prone to fagility”

 In other words

 Complex systems are so complex that they are

impossible to understand completely, especially when

operating outside their regular range

Assumptions in Action





 Bounded synchrony

 Timeout four orders of magnitude higher than common

case round trip time

 But garbage collection can take a very long time

 The result is a catastrophic drop in throughput

 Independent failures

 Race condition in two-phase commit caused latent

memory leak (10 KB/minute under normal operation)

 All bricks failed predictably within 10-20 minutes of

each other

 The result is a catastrophic loss of data

Assumptions in Action (cont.)





 Fail-stop components

 Session layer uses synchronous connect() method

 Another graduate student adds firewalled machine to

cluster, resulting in nodes locking up for 15 minutes at

a time

 The result is a catastrophic corruption of data

What Can We Do?





 Systematically overprovision the system

 But doesn’t that mean predicting the future, again?

 Use admission control

 But this can still result in livelock, only later…

 Build introspection into the system

 Need to easily quantify behavior in order to adapt

 Close the control loop

 Make the system adapt automatically (but see previous)

 Plan for failures

 Use transactions, checkpoint frequently, reboot proactively

What Do You Think?



Related docs
Other docs by liwenting
第04章 类的重用
Views: 89  |  Downloads: 0
摘要
Views: 81  |  Downloads: 0
摘要
Views: 85  |  Downloads: 0
摘要_2_
Views: 68  |  Downloads: 0
國泰醫院2012年紙本期刊到刊總表
Views: 134  |  Downloads: 0
”Lyme_disease”_-_the_European_history
Views: 66  |  Downloads: 0
تعریف و تاریخچهPRP
Views: 77  |  Downloads: 0
_C6C28D15-9903-407A-8FEE-77A0422212B0_
Views: 113  |  Downloads: 0
__________
Views: 96  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!