The Seven Deadly Sins of Distributed Systems
by Steve Muir, Princeton University
Workshop on Real Large Distributed Systems, 2004
Presentation: Charles Yang 2005/3/17
Introduction
PlanetLab
Time of paper: 400+ nodes, 175 sites
Now: 537 machines, 254 sites
What does it do?
IA-32 machines running Linux
Distributed virtualization
Experiment with a variety of planetary-scale services
File sharing and network-embedded storage
Content distribution networks
Routing and multicast overlays
Network measurement tools
Etc.
The Paper (and this Presentation)
Node Manager (NM)
Each user creates a slice
NM configures slivers on actual nodes
Paper describes 7 challenges encountered
Applicable to all large-scale heterogeneous environments
The se7en Sins
1. Networks are unreliable in the worst way
2. DNS is not a good naming system
3. Local clocks are inaccurate/unreliable
4. Large-scale systems are always inconsistent
5. The improbable will happen
6. Over-utilization is the steady-state condition
7. Limited system transparency hampers debugging
Large Heterogeneous Networks are Fundamentally Unreliable
The specifications of IP, TCP, and UDP say so
They really, really mean it
Packets are not just lost
Delayed, duplicated, corrupted
Highly variable latency – 24 hours to download a small file
Unexpected termination – RPC interrupted
Unreliable Networks – Solutions
ALL possible errors should be handled gracefully
It will happen
For variable latency
Multithreading or async I/O + timeouts
For RPC operations
Transactions may be too heavyweight
NM: acquire & bind
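The "async I/O + timeouts" advice above can be sketched roughly as follows. This is a minimal illustration, not the Node Manager's code: `call_with_timeout` and its `operation` argument are hypothetical names, and the retry count and back-off are arbitrary assumptions.

```python
import asyncio


async def call_with_timeout(operation, timeout_s, retries=3):
    """Await `operation()` with a per-attempt deadline, retrying on failure.

    `operation` is a hypothetical zero-argument coroutine function that
    stands in for any network call (an RPC, a download, ...).
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            # wait_for cancels the call if it runs past the deadline,
            # so a stalled peer cannot block the caller forever.
            return await asyncio.wait_for(operation(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            # Short, growing pause before the next attempt (assumed policy).
            await asyncio.sleep(0.01 * attempt)
    raise RuntimeError(f"gave up after {retries} attempts") from last_error
```

Retrying like this is only safe when the remote operation can be repeated without harm. An operation that is not repeatable needs a different design.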
Plus: interference from other users
Someone might port-scan you
SEARCH \x90\x90\x90\x90\x90….
DNS Names Make Poor Node Identifiers
Suffers from ambiguity & instability
Human errors
Network reorganizations, renaming of hosts
DNS servers may be overloaded -> secondary servers
Network asymmetry: internal names, NAT, etc.
Non-static addresses & multihoming
NM: unique numeric IDs (MAC address)
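A rough sketch of turning a MAC address into a numeric node ID, to illustrate the idea above. The function names are hypothetical and this is not how the NM actually derives its IDs. Note that `uuid.getnode()` may fall back to a random 48-bit number when no hardware address can be read.

```python
import uuid


def mac_to_node_id(mac: str) -> int:
    """Convert 'aa:bb:cc:dd:ee:ff' (or '-' separated) to a 48-bit integer."""
    parts = mac.replace("-", ":").split(":")
    if len(parts) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    node_id = 0
    for part in parts:
        octet = int(part, 16)
        if not 0 <= octet <= 0xFF:
            raise ValueError(f"bad octet {part!r} in {mac!r}")
        # Shift previous octets left and append this one.
        node_id = (node_id << 8) | octet
    return node_id


def local_node_id() -> int:
    """Numeric ID of this machine, taken from its hardware address."""
    return uuid.getnode()
```

Unlike a hostname, an ID built this way does not change when a host is renamed or its DNS records move.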
Local Clocks are Unreliable and Untrustworthy
Local clocks are bad
NTP helps, but some sites block it
Bottom line: make sure your application knows about the problem
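One way an application can "know about the problem" is to measure intervals with a monotonic clock and to sanity-check wall-clock timestamps that arrive from other nodes. The sketch below assumes this approach. `Deadline`, `timestamp_is_plausible`, and the 300-second skew limit are illustrative choices, not taken from the paper.

```python
import time


class Deadline:
    """A timeout measured on the monotonic clock.

    time.monotonic() never goes backwards, so the deadline is unaffected
    if NTP or an administrator adjusts the wall clock.
    """

    def __init__(self, seconds: float):
        self._expires = time.monotonic() + seconds

    def remaining(self) -> float:
        return max(0.0, self._expires - time.monotonic())

    def expired(self) -> bool:
        return self.remaining() == 0.0


def timestamp_is_plausible(remote_ts: float, local_ts: float,
                           max_skew_s: float = 300.0) -> bool:
    """Reject remote wall-clock timestamps that differ wildly from ours."""
    return abs(remote_ts - local_ts) <= max_skew_s
```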
Inconsistent Node Configuration is the Norm
Multiple versions of software packages
Multiple versions of your own application
Updates not well ordered
NM
Incorporate failsafe behavior
No slices in XML file -> probably major format change
Version numbering
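The failsafe and version-numbering ideas above might look like this in practice. The XML layout (a `version` attribute on the root and `<slice name="..."/>` children) is invented for illustration and is not the actual PlanetLab format. `load_slices` and `SUPPORTED_VERSION` are hypothetical names.

```python
import xml.etree.ElementTree as ET

SUPPORTED_VERSION = "1"  # hypothetical format version this code understands


def load_slices(xml_text, current_slices):
    """Return the slice names from `xml_text`, or keep `current_slices`
    whenever the update looks suspicious."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return current_slices  # truncated or corrupted download
    if root.get("version") != SUPPORTED_VERSION:
        return current_slices  # written by a newer or older release
    names = [s.get("name") for s in root.findall("slice") if s.get("name")]
    if not names:
        # An empty list more likely means the format changed than that
        # every slice should really be deleted, so do nothing.
        return current_slices
    return names
```

The design choice is that a doubtful update leaves the node in its last known-good state.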
There’s No Such Thing as “One-in-a-Million”
Hundreds of nodes, 24/7
Murphy’s Law
Unexpected reboots happen more than you’d think
Power outages all over the world
Must not cut corners when handling errors
No PlanetLab Node is Under-Utilized
The Norm:
100% utilization, load in 5-10 range
20+ concurrent slices
Several hundred processes are not uncommon
Hence:
Applications run much slower
So make your apps aware
System-wide solution
Smart scheduler
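As one possible reading of "make your apps aware", an application could stretch its own timeouts when the node is heavily loaded. The sketch below assumes that policy. `scaled_timeout`, `current_timeout`, and the factor-of-10 cap are illustrative choices. `os.getloadavg()` is not available on every platform, so the code falls back to the base timeout there.

```python
import os


def scaled_timeout(base_s, load_avg, cpu_count, max_factor=10.0):
    """Stretch `base_s` in proportion to the load per CPU.

    Below a load of 1.0 per CPU the base timeout is used unchanged;
    above it the timeout grows linearly, capped at `max_factor`.
    """
    per_cpu = load_avg / max(cpu_count, 1)
    factor = min(max(1.0, per_cpu), max_factor)
    return base_s * factor


def current_timeout(base_s):
    """Timeout adjusted for this machine's 1-minute load average."""
    try:
        load_1min = os.getloadavg()[0]
    except (AttributeError, OSError):
        load_1min = 0.0  # load average not available on this platform
    return scaled_timeout(base_s, load_1min, os.cpu_count() or 1)
```

With a load of 8 on a single CPU, a 2-second base timeout becomes 16 seconds.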
Limited System Transparency Hampers Debugging
Virtualization -> not a complete view
System solutions
Develop debugging tools
App side:
Make sure you debug thoroughly before deploying
Report as much info as possible in a readable format
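"Report as much info as possible in a readable format" could be approached with Python's standard `logging` module, as sketched below. The `make_logger` helper and the exact field layout are assumptions for illustration. The point is that every line carries the time, host, and process ID, so logs gathered from many nodes can still be told apart.

```python
import logging
import socket
import sys


def make_logger(name, stream=None):
    """Build a logger whose every line carries time, host and process ID.

    Call once per logger name; repeated calls would add duplicate handlers.
    """
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s host=%(host)s pid=%(process)d "
        "%(name)s: %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    # LoggerAdapter attaches the hostname to every record it emits.
    return logging.LoggerAdapter(logger, {"host": socket.gethostname()})
```

A call such as `log.error("fetch failed url=%s attempt=%d", url, attempt)` then produces one self-describing line per event.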
Guidelines
Assumptions made in non-distributed apps are not valid in large-scale and/or heterogeneous environments
Distributed apps must gracefully handle a broad range of unlikely errors
Resource management in a distributed environment is much different than in a non-distributed one
Even local operations can behave radically differently in a heavily utilized system