Dynamo Amazons Highly Available Key-Value Store
Document Sample


Dynamo: Amazon's Highly Available
Key-Value Store
CScI 8101 : Advanced Operating Systems
Presented by: Chaithra KN
Clouded Data: Comprehending Scalable Data Management
Systems
Managing petabytes of data for millions of users - challenge for big
internet based enterprises such as Google, Yahoo!, and Amazon.
Traditional Relational Database Systems – unsuitable (expensive,
complex data model, repartitioning is complex)
An architectural redesign of data management systems with
requirements like:
- High scalability
- High availability
- Low latency
- Weaker consistency
- Lower application generality
Clouded Data: Comprehending Scalable Data Management
Systems
Some services do not require complex querying and
management functionality offered by RDBMS - access data by
primary-key only
Downgrade some of the service guarantees of traditional
RDBMS
There is a shift towards NoSQL or non-relational data stores -
key-value stores. But relational databases are still needed.
Example - in online stores, shopping cart and session
management can be handled with a key-value system but the
checkout process needs a traditional database
Amazon's Dynamo in a nutshell
Highly decentralized – architecture similar to Peer-to-peer
systems
Provides key-value only interface. The value could be just a
blob, the data store is not aware of what the contents of value
are.
Targeted mainly at applications that need an “always writeable”
data store where no updates are rejected
Built for latency sensitive applications that require atleast 99.9%
of read and write operations to be performed within a few
hundred milliseconds
Sacrifices consistency under certain failure conditions to
System Assumptions and Requirements
Query Model: simple read and write operations to a data item
that is uniquely identified by a key – no complex querying or
accesses ranging multiple data items
ACID Properties: Does not provide consistency and isolation
guarantees
Efficiency: latency requirements which are generally measured
at the 99.9th percentile of the distribution – built for applications
that have stringent SLA requirements
Ex: A service that guarantees to provide a resonse within 300ms
for 99.9% of its requests for a peak client load of 500 requests
per second
Design requirements
Incremental scalability
Symmetry
Decentralization
Heterogenity
Conflict resolution is pushed to reads in order to ensure that writes
are never rejected.
Consistent Hashing
Partitioning algorithm used to dynamically partition the data.
The output range of a hash function is treated as a fixed circular space
or ring.
Each node in the system is assigned a random value within this space
which represents its position in the ring.
Data item's key hashed to determine the node which will process it.
Virtual nodes – each node assigned to multiple positions
Advantage - departure or arrival of a node only affects its immediate
neighbors and other nodes remain unaffected.
No need to store system state – i.e. no trackers needed to keep track of
where my data is going into.
Consistent hashing
Advantages of virtual nodes
If a node becomes unavailable the load handled by this node is evenly
dispersed across the remaining available nodes.
When a node becomes available again, the newly available node
accepts a roughly equivalent amount of load from each of the other
available nodes.
The number of virtual nodes that a node is responsible can be
decided based on its capacity, accounting for heterogeneity in the
physical infrastructure
Incremental scalability and heterogenity achieved.
Replication
Needed to achieve high availability and durability
Replicates data on N hosts – N is configurable
Data stored locally + replicated on coordinator's N-1 clockwise
successor nodes.
Preference list – list of nodes responsible for storing a particular key.
Vector clocks
Eventual consistency – updates propagated to all replicas
asynchronously.
All replicas eventually become consistent through reconciliation
measures taken at the time of read operations.
Multiple versions of data exist – updates can happen to older versions
– divergent versions reconciled later.
When newer versions subsume previous versions – system decides
the final version.
Version branching – conflicting versions of an object – the client
performs the reconciliation (collapse).
One vector clock associated with every version of every object –
(node,counter) pair.
While updating, a client must specify which version it is updating –
using context obtained from an earlier read.
Reconciliation during reads
System interface, execution of get and put
get(key)
put(key, context, object) – context contains information like version of
the object
Uses consistency protocol similar to quorum-like systems.
R – minimum number of nodes that must participate in a successful
read operation.
W - minimum number of nodes that must participate in a successful
write operation.
Quorum-like system : R + W > N
N – top N nodes from the preference list
Strict quorum affects availablility and durability.
Selecting a node
Two strategies that a client can use to select a node:
(1) route its request through a generic load balancer that will select a
node based on load information, or
(2) use a partition-aware client library that routes requests directly to the
appropriate coordinator nodes.
The advantage of the first approach - client does not have to link any
code specific to Dynamo in its application
Second strategy - can achieve lower latency because it skips a potential
forwarding step.
Sloppy quorum and hinted handoff
Sloppy quorum used to handle temporary server failures and network
partitions.
All read and write operations are performed on first N healthy nodes
from the preference list, which may or may not always be the first N
nodes encountered while walking the consistent hashing ring. *
The replacement node will get a hint in its metadata using which it will
send replica to the intended recipient has recovered – Hinted handoff
* To account for node failures, preference list contains more than N
nodes.
Other techniques
Merkel trees – to detect inconsistencies between replicas and to
minimize the amount of transferred data.
Gossipping – Nodes can be added and removed multiple times –
Gossip-based protocol propagates membership changes and
maintains an eventually consistent view of membership.
Implementation
Each storage node has 3 software components:
- Request coordination
- Membership and failure detection
- Local persistence engine
Balancing performance and durability
Buffered read and writes
Missing writes if server crashes - One of the N replicas can perform a
durable write without affecting performance
Balancing performance and durability
Positives and Negatives
+ No need to store system data separately.No concept of dynamic
system state with consistency guarantees on that.
+ A single node is mapped onto multiple virtual nodes. This leads to
uniform data and load distribution.
+ Merkel tree approach – very less data transfer needed to maintain
consistency of replicas.
+ Client applications can tune the value of N, W and R to achieve their
desired level of performance, availability and durability.
- Static provisioning of hash space can lead to over-provisioning
- Conflict reconciliations at application layer to solve inconsistency
problems might increase software development cost.
Comparing PNUTS and Dynamo
PNUTS Dynamo
- Hashed / Ordered tables - Key – value pairs
- Generation based - Vector clocks used
versioning - Gossip based
- Communication through - Consistent hashing
Pub / Sub YMB - Eventual consistency and
infrastructure (optimized reconciliation
for geographically
separated replicas)
- Partitioning into tablets
- Timeline based consistency
Clouded Data
Ad hoc systems - It remains to be seen if such application level
coordination without support from the data management service will be
sufficient with more complex operations and heavier workloads.
Atomic objects to single data items – is this sufficient?
Get documents about "