Dynamo Amazons Highly Available Key-Value Store

Document Sample
Dynamo Amazons Highly Available Key-Value Store Powered By Docstoc
					Dynamo: Amazon's Highly Available
       Key-Value Store

 CScI 8101 : Advanced Operating Systems
        Presented by: Chaithra KN
    Clouded Data: Comprehending Scalable Data Management

    Managing petabytes of data for millions of users - challenge for big
    internet based enterprises such as Google, Yahoo!, and Amazon.

    Traditional Relational Database Systems – unsuitable (expensive,
    complex data model, repartitioning is complex)

    An architectural redesign of data management systems with
    requirements like:

             - High scalability
             - High availability
             - Low latency
             - Weaker consistency
             - Lower application generality
    Clouded Data: Comprehending Scalable Data Management

    Some services do not require complex querying and
    management functionality offered by RDBMS - access data by
    primary-key only

    Downgrade some of the service guarantees of traditional

    There is a shift towards NoSQL or non-relational data stores -
    key-value stores. But relational databases are still needed.

     Example - in online stores, shopping cart and session
    management can be handled with a key-value system but the
    checkout process needs a traditional database
                 Amazon's Dynamo in a nutshell

    Highly decentralized – architecture similar to Peer-to-peer

    Provides key-value only interface. The value could be just a
    blob, the data store is not aware of what the contents of value

    Targeted mainly at applications that need an “always writeable”
    data store where no updates are rejected

    Built for latency sensitive applications that require atleast 99.9%
    of read and write operations to be performed within a few
    hundred milliseconds

    Sacrifices consistency under certain failure conditions to
             System Assumptions and Requirements

     Query Model: simple read and write operations to a data item
     that is uniquely identified by a key – no complex querying or
     accesses ranging multiple data items

     ACID Properties: Does not provide consistency and isolation

     Efficiency: latency requirements which are generally measured
     at the 99.9th percentile of the distribution – built for applications
     that have stringent SLA requirements
    Ex: A service that guarantees to provide a resonse within 300ms
     for 99.9% of its requests for a peak client load of 500 requests
     per second
                        Design requirements

   Incremental scalability
   Symmetry
   Decentralization
   Heterogenity

    Conflict resolution is pushed to reads in order to ensure that writes
    are never rejected.
                       Consistent Hashing
 Partitioning algorithm used to dynamically partition the data.
 The output range of a hash function is treated as a fixed circular space

 or ring.
 Each node in the system is assigned a random value within this space

which represents its position in the ring.
 Data item's key hashed to determine the node which will process it.

 Virtual nodes – each node assigned to multiple positions

 Advantage - departure or arrival of a node only affects its immediate

neighbors and other nodes remain unaffected.

No need to store system state – i.e. no trackers needed to keep track of
where my data is going into.
Consistent hashing
                   Advantages of virtual nodes

    If a node becomes unavailable the load handled by this node is evenly
    dispersed across the remaining available nodes.

    When a node becomes available again, the newly available node
    accepts a roughly equivalent amount of load from each of the other
    available nodes.

    The number of virtual nodes that a node is responsible can be
    decided based on its capacity, accounting for heterogeneity in the
    physical infrastructure

   Incremental scalability and heterogenity achieved.
   Needed to achieve high availability and durability

   Replicates data on N hosts – N is configurable

Data stored locally + replicated on coordinator's N-1 clockwise
successor nodes.

   Preference list – list of nodes responsible for storing a particular key.
                           Vector clocks
    Eventual consistency – updates propagated to all replicas
    All replicas eventually become consistent through reconciliation
    measures taken at the time of read operations.
    Multiple versions of data exist – updates can happen to older versions
    – divergent versions reconciled later.
     When newer versions subsume previous versions – system decides
    the final version.
     Version branching – conflicting versions of an object – the client
    performs the reconciliation (collapse).
    One vector clock associated with every version of every object –
    (node,counter) pair.
    While updating, a client must specify which version it is updating –
    using context obtained from an earlier read.

    Reconciliation during reads
            System interface, execution of get and put
    get(key)
    put(key, context, object) – context contains information like version of
    the object
    Uses consistency protocol similar to quorum-like systems.
    R – minimum number of nodes that must participate in a successful
    read operation.
    W - minimum number of nodes that must participate in a successful
    write operation.
    Quorum-like system : R + W > N
    N – top N nodes from the preference list
    Strict quorum affects availablility and durability.
                           Selecting a node

Two strategies that a client can use to select a node:
(1) route its request through a generic load balancer that will select a
node based on load information, or

(2) use a partition-aware client library that routes requests directly to the
   appropriate coordinator nodes.

The advantage of the first approach - client does not have to link any
  code specific to Dynamo in its application

Second strategy - can achieve lower latency because it skips a potential
  forwarding step.
              Sloppy quorum and hinted handoff

    Sloppy quorum used to handle temporary server failures and network

     All read and write operations are performed on first N healthy nodes
    from the preference list, which may or may not always be the first N
    nodes encountered while walking the consistent hashing ring. *

    The replacement node will get a hint in its metadata using which it will
    send replica to the intended recipient has recovered – Hinted handoff

* To account for node failures, preference list contains more than N
                        Other techniques
   Merkel trees – to detect inconsistencies between replicas and to
    minimize the amount of transferred data.

    Gossipping – Nodes can be added and removed multiple times –
    Gossip-based protocol propagates membership changes and
    maintains an eventually consistent view of membership.

  Each storage node has 3 software components:
- Request coordination
- Membership and failure detection
- Local persistence engine
         Balancing performance and durability

   Buffered read and writes
    Missing writes if server crashes - One of the N replicas can perform a
    durable write without affecting performance
Balancing performance and durability
                    Positives and Negatives
+ No need to store system data separately.No concept of dynamic
  system state with consistency guarantees on that.
+ A single node is mapped onto multiple virtual nodes. This leads to
  uniform data and load distribution.
+ Merkel tree approach – very less data transfer needed to maintain
  consistency of replicas.
+ Client applications can tune the value of N, W and R to achieve their
  desired level of performance, availability and durability.

- Static provisioning of hash space can lead to over-provisioning
- Conflict reconciliations at application layer to solve inconsistency
  problems might increase software development cost.
               Comparing PNUTS and Dynamo

PNUTS                          Dynamo

- Hashed / Ordered tables      - Key – value pairs
- Generation based             - Vector clocks used
versioning                     - Gossip based
- Communication through        - Consistent hashing
Pub / Sub YMB                  - Eventual consistency and
infrastructure (optimized      reconciliation
for geographically
separated replicas)
- Partitioning into tablets
- Timeline based consistency
                              Clouded Data

    Ad hoc systems - It remains to be seen if such application level
    coordination without support from the data management service will be
    sufficient with more complex operations and heavier workloads.

    Atomic objects to single data items – is this sufficient?

Shared By: