Dynamo: Amazon's Highly
Available Key-value Store
   Giuseppe DeCandia, et al.
          SOSP '07
               Introduction
• Dynamo: used to manage applications that
  require only primary-key access to data
• Dynamo applications need scalability, high
  availability, fault tolerance, but don't need
  the complexity of a relational DB
  – ACID properties => little parallelism, low
    availability
               Assumptions:
• Applications perform simple read/write ops on
  single, small ( < 1MB) data objects which are
  identified by a unique key.
  – Example: the shopping cart
• Replace ACID properties with weaker
  guarantees: eventual consistency, no isolation
  promises
• Services must operate efficiently on commodity
  hardware
• Used only by internal services, so security isn't
  an issue
 Service Level Agreements (SLA)
• Clients and servers negotiate SLAs to
  establish the kind of service and the
  expected performance
• Amazon expects the guarantees to apply
  to 99.9% of requests
  – Claim that most industry systems express
    SLAs in terms of “average”, “median”, and
    “expected variance” – much weaker than
    Amazon's requirements
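As a quick illustration of why a 99.9th-percentile guarantee is stricter than an "average" SLA, the sketch below computes both statistics over a made-up latency sample (the numbers are purely illustrative, not from the paper): the mean looks healthy while the tail clearly violates a per-request bound.

# Hypothetical latency samples in milliseconds; values are illustrative only.
latencies_ms = sorted([12, 14, 15, 15, 16, 18, 20, 25, 40, 900])

mean_ms = sum(latencies_ms) / len(latencies_ms)

# 99.9th percentile via the nearest-rank method.
rank = max(1, round(0.999 * len(latencies_ms)))
p999_ms = latencies_ms[rank - 1]

print(f"mean = {mean_ms:.1f} ms, p99.9 = {p999_ms} ms")
# An SLA stated as "under 300 ms on average" passes (mean = 107.5 ms),
# while "under 300 ms for 99.9% of requests" fails (p99.9 = 900 ms).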
        Design Considerations
• Services control properties such as durability
  and consistency, evaluate tradeoffs (cost v
  performance, for example)
• Replicated databases cannot guarantee
  strong consistency and high availability at the
  same time
  – Optimistic replication updates replicas as a
    background process to get eventual consistency
    Design Considerations:
  Resolving Conflicting Updates
• When
 – Since Dynamo targets services that require
   “always writeable” data storage (e.g., users
   must always be able to add to or delete from
   the shopping cart), conflicts are resolved
   during reads, not writes
• By Whom
 – Let each application decide for itself
 – But … the default is “last write wins”.
   Other Key Design Principles
• Incremental scalability: adding a single
  node should not affect the system
  significantly
• Symmetry: all nodes have the same
  responsibilities
• Decentralization: favor P2P techniques
  over centralized control
• Heterogeneity: take advantage of
  differences in server capabilities.
 Comparison to Other Systems
• Peer-to-Peer (Freenet, Chord, …)
  – Structured v unstructured: access times
  – Conflict resolution for concurrent updates
    without wide-area file locking
• Distributed File Systems and Databases
  (Google, Bayou, Coda, …)
  – Treatment of system partitions
  – Conflict resolution, eventual consistency
  – Strong consistency v eventual consistency
   Dynamo v Other Decentralized
        Storage Systems
• “always writeable”;
  – updates won't be rejected because of failure
    or concurrent updates
• One administrative domain; nodes are
  assumed to be trustworthy
• Don't require hierarchical name spaces or
  relational schema
• Operations must be performed within a
  few hundred milliseconds.
        System Architecture
• The Dynamo data storage system
  contains items that are associated with a
  single key
• Operations that are implemented: get( )
  and put( ).
  – get(key)
  – put(key, context, object) where context refers
    to various kinds of system metadata
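A minimal sketch of the get( )/put( ) interface described above. The class and helper names are hypothetical; the shape of the calls (get returns stored object(s) plus an opaque context, put passes that context back with the write) follows the slide.

from dataclasses import dataclass, field

@dataclass
class Context:
    # Opaque system metadata (e.g., version information).
    metadata: dict = field(default_factory=dict)

class KeyValueStore:
    # Hypothetical client-facing interface mirroring Dynamo's get()/put().
    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns (objects, context): possibly several conflicting
        # versions, plus the metadata needed to write them back.
        return self._data.get(key, ([], Context()))

    def put(self, key, context, obj):
        # The caller passes back the context it received from get(),
        # so the store can attach version metadata to the write.
        self._data[key] = ([obj], context)

store = KeyValueStore()
_, ctx = store.get("cart:alice")
store.put("cart:alice", ctx, {"items": ["book"]})
print(store.get("cart:alice")[0])   # [{'items': ['book']}]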
Problem             Technique                   Advantage

Partitioning        Consistent Hashing          Incremental scalability

High availability   Vector clocks, reconciled   Version size is decoupled
for writes          during reads                from update rates

Temporary           Sloppy Quorum,              Provides high availability &
failures            hinted handoff              durability guarantee when
                                                some of the replicas are
                                                not available

Permanent           Anti-entropy using          Synchronizes divergent replicas
failures            Merkle trees                in the background

Membership &        Gossip-based protocol       Preserves symmetry and avoids
failure detection                               having a centralized registry for
                                                storing membership and node
                                                liveness information

Table 1: Summary of techniques used in Dynamo and their advantages
       Partitioning Algorithm
• Partitioning = dividing data storage across
  all nodes. Supports scalability
• Very similar to Chord-based schemes
• Consistent hashing scheme distributes
  content across multiple nodes
  – In consistent hashing the effect of adding a
    node is localized – on average, K/n objects
    must be remapped (K = # of keys, n = # of
    nodes)
       Partitioning Algorithm
• Hash function produces an m-bit number
  which defines a circular name space (like
  Chord)
• Nodes are assigned numbers randomly in
  the name space
• The data key is hashed onto the ring and assigned
  to a node using a successor function, as in Chord
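A minimal sketch of the consistent-hashing lookup described above, assuming an MD5-based ring (node names are hypothetical): keys and nodes are hashed into the same circular space, and a key is stored at the first node encountered clockwise from its hash, i.e., its successor.

import bisect
import hashlib

def ring_hash(value: str) -> int:
    # A 128-bit MD5 hash defines the circular name space.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        # Each node gets one (pseudo-)random position on the ring.
        self._ring = sorted((ring_hash(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def successor(self, key: str) -> str:
        # First node clockwise from hash(key); wrap around at the end.
        idx = bisect.bisect_right(self._positions, ring_hash(key))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.successor("cart:alice"))   # the node responsible for this key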
           Load Distribution
• Random assignment of node to position in
  ring may produce non-uniform distribution
  of data.
• Solution: virtual nodes
  – Assign several random positions (“tokens”) to
    each physical node; the physical node is then
    responsible for the data of all of its virtual
    nodes
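A small extension of the same sketch showing virtual nodes: each physical node is hashed at several ring positions ("tokens"), which evens out the share of keys each machine owns. The token count and node names are illustrative.

import bisect
import hashlib
from collections import Counter

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(physical_nodes, tokens_per_node=8):
    # Each physical node appears at several positions on the ring.
    return sorted(
        (ring_hash(f"{node}#{token}"), node)
        for node in physical_nodes
        for token in range(tokens_per_node)
    )

def owner(ring, key):
    positions = [pos for pos, _ in ring]
    idx = bisect.bisect_right(positions, ring_hash(key))
    return ring[idx % len(ring)][1]

ring = build_ring(["node-a", "node-b", "node-c"])
load = Counter(owner(ring, f"key-{i}") for i in range(10000))
print(load)   # key counts per physical node are roughly even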
                Replication
• Data is replicated at N nodes
• Succ(key) = coordinator node
  – The coordinator replicates the object at its N-1
    clockwise successors in the ring, skipping positions
    that belong to physical nodes already holding a
    replica, to increase fault tolerance
  – Preference list: the list of nodes that store a
    particular key
  – There are actually > N nodes on the preference
    list, in order to ensure N “healthy” nodes at all
    times.
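A sketch of how a preference list can be built on such a ring (continuing the virtual-node assumptions above, with hypothetical node names): walk clockwise from the key's hash and keep the first N distinct physical nodes, skipping further positions of machines already in the list.

import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# Virtual-node ring: (position, physical node), several tokens per node.
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = sorted((ring_hash(f"{n}#{t}"), n) for n in nodes for t in range(8))
positions = [pos for pos, _ in ring]

def preference_list(key: str, n_replicas: int = 3):
    # Walk clockwise from hash(key); skip positions whose physical
    # node was already chosen, so replicas land on distinct hosts.
    start = bisect.bisect_right(positions, ring_hash(key))
    chosen = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == n_replicas:
            break
    return chosen   # chosen[0] acts as the coordinator

print(preference_list("cart:alice"))   # three distinct physical nodes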
            Data Versioning
• Updates can be propagated to replicas
  asynchronously – the put( ) call may return
  before all updates have been applied.
  – Implication: a subsequent get( ) may return
    stale data.
• Barring failure, most updates are applied
  within bounded time, but server or network
  failure can delay updates “for an extended
  period of time”.
            Data Versioning
• Some applications can be designed to work in
  this environment; e.g., the “add-to/delete-
  from-cart” operation.
  – It's okay to add to an old cart, as long as all
    versions of the cart are eventually reconciled
• Dynamo treats each modification as a new
  (& immutable) version of the object.
  – Multiple versions can exist at the same time
              Reconciliation
• Usually, new versions contain the old
  versions – no problem
• Sometimes concurrent updates and
  failures generate conflicting versions
• Typically this is handled by merging
  – For add-to-cart operations, nothing is lost
  – For delete-from-cart, deleted items might
    reappear after the reconciliation
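A toy illustration of the merge behaviour described above, assuming carts are reconciled by taking the union of their item sets: additions are never lost, but an item deleted on one branch reappears if another branch still holds it.

# Two divergent replicas of the same cart, e.g., after a partition.
cart_a = {"book", "pen"}                   # replica A: user added "pen"
cart_b = {"book", "lamp"} - {"book"}       # replica B: added "lamp", deleted "book"

# Union-style reconciliation: no add-to-cart operation is lost...
merged = cart_a | cart_b
print(merged)   # {'book', 'pen', 'lamp'}

# ...but "book", deleted on replica B, reappears in the merged cart
# because replica A never saw the deletion.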
     Parallel Version Branches
• There may be multiple versions of the same
  data, each coming from a different path (e.g., if
  there's been a network partition)
• Vector clocks are used to identify causally
  related versions and parallel (concurrent)
  versions
   – For causally related versions, accept the final version
     as the “true” version
   – For parallel (concurrent) versions, use some
     reconciliation technique to resolve the conflict
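A minimal vector-clock comparison sketch for the versioning just described (node names illustrative): each version carries per-node counters; if one clock dominates the other, the versions are causally related and the newer one supersedes, otherwise they are parallel and need application-level reconciliation.

def descends(a: dict, b: dict) -> bool:
    # True if clock `a` has seen every event that clock `b` has seen.
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    a_over_b, b_over_a = descends(a, b), descends(b, a)
    if a_over_b and not b_over_a:
        return "a supersedes b"     # causally related
    if b_over_a and not a_over_b:
        return "b supersedes a"     # causally related
    if a_over_b and b_over_a:
        return "equal"
    return "concurrent"             # parallel versions; must reconcile

v1 = {"node-a": 2, "node-b": 1}
v2 = {"node-a": 2, "node-b": 2}     # saw everything v1 saw, plus more
v3 = {"node-a": 3, "node-b": 1}     # diverged from v1 on node-a

print(compare(v2, v1))   # a supersedes b
print(compare(v2, v3))   # concurrent -> reconciliation needed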
   Execution of get( ) and put( )
• Operations can originate at any node in the
  system.
• Clients may
  – Route request through a load-balancing coordinator
    node
  – Use client software that routes the request directly to
    the coordinator for that object
• The coordinator contacts R nodes for reading
  and W nodes for writing, where R + W > N
                “Sloppy Quorum”
• put( ): the coordinator writes to the first N healthy
  nodes on the preference list. If W writes succeed,
  the write is considered to be successful
• get( ): coordinator reads from N nodes; waits for R
  responses.
   – If they agree, return value.
   – If they disagree, but are causally related, return the
     most recent value
   – If they are causally unrelated apply reconciliation
     techniques and write back the corrected version
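A schematic of the quorum logic above, assuming for illustration N=3, R=2, W=2 (so R + W > N). It is a single-process simulation, not the real protocol, and a plain version counter stands in for the vector clock.

N, R, W = 3, 2, 2    # replicas, read quorum, write quorum

# Simulated replicas: node -> (value, version counter).
replicas = {"node-a": ("v5", 5), "node-b": ("v5", 5), "node-c": ("v4", 4)}

def put(preference_list, value, version):
    acks = 0
    for node in preference_list[:N]:          # first N healthy nodes
        replicas[node] = (value, version)     # pretend the write landed
        acks += 1
        if acks >= W:
            return True                       # write quorum reached
    return False

def get(preference_list):
    responses = [replicas[node] for node in preference_list[:N]][:R]
    value, _ = max(responses, key=lambda vv: vv[1])   # newest version wins
    return value

nodes_for_key = ["node-a", "node-b", "node-c"]
put(nodes_for_key, "v6", 6)
print(get(nodes_for_key))   # v6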
              Hinted Handoff
• What if a write operation can't reach some of the
  nodes on the preference list?
• To preserve availability and durability, store the
  replica temporarily on another node,
  accompanied by a metadata “hint” that
  remembers where the replica should be stored.
• Hinted handoff ensures that read and write
  operations don't fail because of network
  partitioning or node failures.
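A sketch of the hinted-handoff idea under the same assumptions: if a node on the preference list is unreachable, write the replica to another available node together with a hint naming the intended home, and deliver it back once that node recovers. The data structures and names are illustrative.

# Nodes currently reachable; "node-b" is down.
alive = {"node-a": {}, "node-c": {}, "node-d": {}}
hinted = []   # (holding node, intended node, key, value)

def write_replica(intended, key, value, fallback_nodes):
    if intended in alive:
        alive[intended][key] = value
        return
    # Intended node unreachable: park the replica elsewhere, with a hint.
    substitute = next(n for n in fallback_nodes if n in alive)
    alive[substitute][f"hint:{intended}:{key}"] = value
    hinted.append((substitute, intended, key, value))

def on_node_recovered(node):
    # Hand any hinted replicas back to their intended home.
    alive[node] = {}
    for holder, intended, key, value in [h for h in hinted if h[1] == node]:
        alive[node][key] = value
        del alive[holder][f"hint:{intended}:{key}"]
        hinted.remove((holder, intended, key, value))

write_replica("node-b", "cart:alice", {"items": ["book"]}, ["node-d"])
on_node_recovered("node-b")
print(alive["node-b"])   # {'cart:alice': {'items': ['book']}}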
  Handling Permanent Failures
• Hinted replicas may be lost before they
  can be returned to the original node. Other
  problems may cause replicas to be lost or
  fall out of agreement
• Merkle trees allow two nodes to compare
  a set of replicas and determine fairly easily
  – Whether or not they are consistent
  – Where the inconsistencies are
  Handling Permanent Failures
• Merkle trees have leaves whose values are
  hashes of the values associated with keys (one
  key/leaf)
  – Parent nodes contain hashes of their children
  – Eventually, root contains a hash that represents
    everything in that replica
• To detect inconsistency between two sets of
  replicas, compare the roots
  – Source of inconsistency can be detected by looking at
    internal nodes
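A minimal Merkle-tree sketch for the comparison described above: leaves hash individual key/value pairs, parents hash the concatenation of their children, and equal roots mean the replicas agree. (Dynamo keeps a tree per key range and walks down differing subtrees; to stay short, this toy version compares roots and then leaf hashes directly.)

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(kv: dict):
    # Leaves: hash of each key/value pair, in sorted key order.
    level = [h(f"{k}={v}".encode()) for k, v in sorted(kv.items())]
    tree = [level]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last hash if odd
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                            # tree[-1][0] is the root

def differing_leaves(tree_a, tree_b):
    if tree_a[-1][0] == tree_b[-1][0]:
        return []                          # roots match: replicas consistent
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]

replica_1 = {"k1": "x", "k2": "y", "k3": "z", "k4": "w"}
replica_2 = {"k1": "x", "k2": "stale", "k3": "z", "k4": "w"}
print(differing_leaves(build_tree(replica_1), build_tree(replica_2)))   # [1]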
                  Failures
• Like Google, Amazon has a number of
  data centers, each with many commodity
  machines.
  – Individual machines fail regularly
  – Sometimes entire data centers fail due to
    power outages, network partitions, tornados,
    etc.
• To handle failure of entire centers, replicas
  are spread across multiple data centers.
Membership and Failure Detection
• Temporary failures or accidental additions
  of nodes are possible but shouldn't cause
  load re-balancing.
• Additions and deletions of nodes are
  explicitly executed by an administrator.
• A gossip-based protocol is used to ensure
  that every node eventually has a
  consistent view of the membership list.
      Gossip-based Protocol
• Periodically, each node contacts another
  node in the network, randomly selected.
• Nodes compare their membership
  histories and reconcile them.
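A toy round of the gossip exchange described above: each node keeps a membership view with a per-entry version number, and two nodes reconcile by keeping the newer entry for every member. The view structure and node names are illustrative, not Dynamo's actual format.

import random

# Each node's view: member -> (status, version).  Higher version wins.
views = {
    "node-a": {"node-a": ("up", 3), "node-b": ("up", 1), "node-c": ("up", 2)},
    "node-b": {"node-a": ("up", 2), "node-b": ("up", 4), "node-c": ("down", 5)},
}

def reconcile(view_x, view_y):
    merged = dict(view_x)
    for member, (status, version) in view_y.items():
        if member not in merged or merged[member][1] < version:
            merged[member] = (status, version)
    return merged

def gossip_round(node):
    peer = random.choice([n for n in views if n != node])
    merged = reconcile(views[node], views[peer])
    views[node] = dict(merged)       # both sides leave with the same view
    views[peer] = dict(merged)

gossip_round("node-a")
print(views["node-a"] == views["node-b"])   # True: the views have converged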
          Load Balancing for
        Additions and Deletions
• When a node is added, it acquires key values
  from other nodes in the network.
  – Existing nodes learn of the addition through the
    gossip protocol and offer the new node the keys it
    is now responsible for; the keys are transferred
    once the offer is accepted
  – When a node is removed, a similar process happens
    in reverse
• Experience has shown that this approach leads
  to a relatively uniform distribution of key/value
  pairs across the system
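A sketch of which keys move when a node joins, under the consistent-hashing scheme sketched earlier (names illustrative): a key is transferred only if the new node has become its successor, so roughly K/n of the keys move, matching the earlier slide.

import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def owner(ring_nodes, key):
    ring = sorted((ring_hash(n), n) for n in ring_nodes)
    positions = [pos for pos, _ in ring]
    idx = bisect.bisect_right(positions, ring_hash(key))
    return ring[idx % len(ring)][1]

old_nodes = ["node-a", "node-b", "node-c"]
new_nodes = old_nodes + ["node-d"]            # node-d joins the ring

all_keys = [f"key-{i}" for i in range(1000)]
moved = [k for k in all_keys if owner(old_nodes, k) != owner(new_nodes, k)]

# Only keys whose successor is now node-d need to be transferred.
print(len(moved), all(owner(new_nodes, k) == "node-d" for k in moved))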
                  Summary
• Experience with Dynamo indicates that it meets
  the requirements of scalability and availability.
• Service owners are able to customize their
  storage system to emphasize performance,
  durability, or consistency. The primary
  parameters are N, R, and W.
• The developers conclude that decentralization
  and eventual consistency can provide a
  satisfactory platform for hosting highly-available
  applications.

				