
Paper Insights #15 - Dynamo: Amazon's Highly Available Key-value Store

This groundbreaking paper, presented at SOSP 2007, has become a cornerstone in the field of computer systems, profoundly influencing subsequent research and development. It served as a blueprint for numerous NoSQL databases, including prominent examples like MongoDB, Cassandra, and Azure Cosmos DB.

Paper Link

A deep dive into this work is essential for anyone interested in distributed systems. It explores several innovative concepts that will captivate and enlighten readers.

Let's go over some fundamental ideas (with the caution that there are several of them!).

Distributed Hash Tables (DHTs)

A DHT is a decentralized system that provides a lookup service akin to a traditional hash table. Key characteristics of DHTs include:

  • Autonomy and Decentralization: Nodes operate independently, forming the system without centralized control.
  • Fault Tolerance: The system remains reliable even when nodes join, leave, or fail.
  • Scalability: It efficiently handles systems with thousands or millions of nodes.
  • Anonymity: Participant identities can be hidden.

Keyspace Partitioning

DHTs distribute ownership of the keyspace among participating nodes. Common partitioning methods include:

  • Consistent Hashing: This widely-used method partitions the keyspace based on a distance function δ(k1, k2) between keys. Each node is assigned an identifier; the node with identifier i_x owns all keys k for which δ(i_x, k) is minimal.
  • Rendezvous Hashing: All clients employ the same hash function h(server, key) and map each key to the available server with the highest hash value.
  • Locality-Preserving Hashing: This method assigns similar keys to similar nodes, enhancing data locality.

Overlay Network

Each node maintains a routing table to direct requests to the node responsible for the specific key. This is called an overlay network.

A trade-off exists between routing table size and the number of hops required for a request to reach its destination. Larger routing tables generally reduce the number of hops.

Consistent Hashing

As previously discussed, the keyspace is partitioned, with node assignments determined by a distance function. This concept exhibits several variations.

Modulo Hash

Assigns key x to node H(x) % N, where H(x) is the hash of the key and N is the number of nodes.
Issue: Changes in N (e.g., node additions/removals) remap almost every key, invalidating cached values.
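
A minimal sketch (in Python) of why modulo hashing remaps most keys when N changes; the key set and node counts are illustrative:

```python
import hashlib

def modulo_node(key: str, num_nodes: int) -> int:
    """Map a key to a node index using H(x) % N."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Adding a single node (N: 10 -> 11) remaps almost every key.
keys = [f"key-{i}" for i in range(10_000)]
before = {k: modulo_node(k, 10) for k in keys}
after = {k: modulo_node(k, 11) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 90%
```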

Chord Hash



Arranges hashed keys on a circular ring. Each node is responsible for a contiguous arc of the ring.
Advantage: Node failures only impact data within their specific arc.

Tree Hash

Uses the bit values of H(x) to traverse a tree and locate the responsible node.
Challenge: Node failures require efficient reassignment of the failed node's data.

Kademlia Hash

Stores key x on the node closest to H(x) by XOR distance. The XOR distance between two identifiers is computed bit by bit: each bit of the result is zero if the corresponding bits are equal and one if they differ, and the resulting bit string is read as an integer.
Flexibility: Handles node additions/removals by dynamically redistributing data.
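
A minimal sketch of XOR distance and closest-node selection as described above; the node identifiers are illustrative small integers:

```python
def xor_distance(a: int, b: int) -> int:
    """XOR distance: bitwise XOR of the two identifiers, read as an integer."""
    return a ^ b

def closest_node(key_id: int, node_ids: list[int]) -> int:
    """Node whose identifier is closest to the key by XOR distance."""
    return min(node_ids, key=lambda n: xor_distance(key_id, n))

nodes = [0b0001, 0b0110, 0b1010, 0b1111]
print(bin(closest_node(0b1011, nodes)))  # 0b1010 (distance 0b0001)
```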

Load Distribution in Consistent Hashing

Chord hashing faces a significant challenge with node failures: the loss of a node necessitates moving all its data to the successor node, leading to load imbalance.

Naive Solution: Hash Space Redistribution

This approach involves adjusting the hash space to evenly distribute data among the remaining nodes.
Drawback: This can trigger substantial data movement across the network, impacting performance.


Better Solution: Redundant Node Placement

Both Chord and Kademlia employ a strategy of placing a node at multiple locations within their respective structures (multiple points in Chord, multiple leaf positions in Kademlia). This redundancy ensures that upon node failure, the keyspace is more evenly distributed among the remaining nodes, minimizing load imbalance.


In the diagram above, the same node is placed at multiple positions on the hash ring. When the purple node goes down, the green and yellow nodes share the portion of the arc owned by the purple node.

Another Advantage: Note that not all machines have the same capacity. Nodes with higher capacity can be placed at more positions, effectively increasing their share of the keyspace.
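
A minimal sketch of placing each physical node at multiple ring positions (virtual nodes), with more positions for higher-capacity machines; the node names, hash function, and counts are illustrative:

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, capacities: dict[str, int]):
        """capacities maps a physical node to its number of virtual nodes
        (more virtual nodes = larger share of the keyspace)."""
        self._points = sorted(
            (h(f"{name}#{i}"), name)
            for name, vnodes in capacities.items()
            for i in range(vnodes)
        )
        self._hashes = [p for p, _ in self._points]

    def owner(self, key: str) -> str:
        """Walk clockwise from the key's hash to the next virtual node."""
        idx = bisect.bisect_right(self._hashes, h(key)) % len(self._points)
        return self._points[idx][1]

ring = Ring({"A": 8, "B": 8, "C": 16})  # C has twice the capacity of A and B
print(ring.owner("user:42"))
```

If a physical node is removed, only the arcs owned by its virtual nodes change hands, and they are spread across the remaining machines rather than dumped on a single successor.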

CARP (Cache Array Routing Protocol)

CARP is another assignment algorithm that offers good load distribution.

Each machine is assigned a unique hash value h_i, where i ranges from 1 to N. For an input x, calculate a hash value for each machine:

H_1 = H(h_1, x)
H_2 = H(h_2, x)
H_3 = H(h_3, x)
...

Map input x to the machine with the largest hash value H_i.

Pros: When a machine fails, its load is automatically redistributed. Since the hash function is independent of machine failures, each key owned by the failed machine falls to whichever remaining machine now has the largest hash value for it, spreading the load roughly evenly.

Cons: Mapping an input to a machine requires calculating N hash values, resulting in linear time complexity O(N). This can significantly impact performance, especially during high-traffic periods. In contrast, distributed hashing protocols like Chord and Kademlia offer logarithmic time complexity (O(log N)) for finding the responsible node, making them more efficient for large-scale systems.
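
A minimal sketch of this highest-hash assignment; the way the machine identifier and the key are mixed into a single hash is illustrative:

```python
import hashlib

def combined_hash(machine_id: str, key: str) -> int:
    """H(h_i, x): a hash that mixes the machine identifier with the key."""
    return int(hashlib.md5(f"{machine_id}:{key}".encode()).hexdigest(), 16)

def assign(key: str, machines: list[str]) -> str:
    """Map the key to the machine with the largest combined hash (N hashes, O(N))."""
    return max(machines, key=lambda m: combined_hash(m, key))

machines = ["m1", "m2", "m3", "m4"]
winner = assign("key-123", machines)
print(winner)
# If the winner fails, the key falls through to the machine with the next-largest
# hash; keys assigned to the surviving machines do not move at all.
print(assign("key-123", [m for m in machines if m != winner]))
```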

Routing Algorithm in Consistent Hashing

In decentralized systems with millions of machines, several critical challenges arise:

  • Dynamic Membership: Nodes frequently join and leave the network.
  • Decentralized Control: Nodes are owned and operated by independent entities.

Challenges of Early Peer-to-Peer Systems

  • Napster: Relied on a centralized server to maintain a global index and route requests. This single point of failure made the system vulnerable to disruption.
  • Gnutella: Employed a decentralized approach where each peer connected to a limited number of neighbors. However, locating resources often involved costly broadcasts to all connected peers, leading to significant overhead.

Consistent Hashing as a Solution

Consistent hashing provides an effective mechanism for routing requests in decentralized systems.

Routing Algorithm for Chord

Each node is assigned a unique identifier (e.g., a hash value).

Each node maintains a "finger table" with O(log N) entries. The i-th entry in node n's finger table points to the successor of (n + 2^(i-1)) on the ring.


During lookup, the current node examines its finger table to find the closest preceding node to the key's identifier and forwards the request to it; each hop roughly halves the remaining distance to the key.
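
A minimal sketch of building a Chord finger table, assuming a small m-bit identifier space and an illustrative set of node IDs:

```python
def successor(ident: int, nodes: list[int], m: int) -> int:
    """First node clockwise from ident on a 2^m ring."""
    ring = sorted(nodes)
    for n in ring:
        if n >= ident:
            return n
    return ring[0]  # wrap around

def finger_table(n: int, nodes: list[int], m: int) -> list[int]:
    """The i-th finger points to successor(n + 2^(i-1)), for i = 1..m."""
    return [successor((n + 2 ** (i - 1)) % 2 ** m, nodes, m)
            for i in range(1, m + 1)]

m = 6  # 64-slot ring
nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
print(finger_table(8, nodes, m))  # fingers for node 8: [14, 14, 14, 21, 32, 42]
# During a lookup, a node forwards to the closest preceding finger of the key,
# roughly halving the remaining ring distance each hop (O(log N) hops).
```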

Routing Algorithm for Kademlia

Each node is assigned a unique identifier (e.g., a hash value).

Each node maintains a routing table with buckets derived from its identifier's binary representation b1b2b3...:

  • b1': nodes whose first bit differs from the current node's.
  • b1b2': nodes that share the first bit but differ in the second.
  • b1b2b3': nodes that share the first two bits but differ in the third, and so on.

When a node receives a request for a specific hash, it checks if it owns the hash. If not, it forwards the request to a neighboring node that is "closer" in terms of their identifier's binary representation (i.e., differs in fewer bits).

The routing table has O(log N) entries. Finding the responsible node requires traversing the routing table, resulting in logarithmic lookup time (O(log N)).

In essence, this algorithm allows nodes to efficiently route requests towards the node responsible for the target hash by iteratively correcting one bit at a time.
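
A minimal sketch of how a node picks the routing-table bucket to forward to, i.e., the position of the first differing bit; the identifier length is illustrative:

```python
def bucket_index(my_id: int, target_id: int, bits: int) -> int:
    """Index of the first bit (from the most significant) where the IDs differ.
    That bucket holds neighbors sharing a longer prefix with the target."""
    for i in range(bits):
        mask = 1 << (bits - 1 - i)
        if (my_id & mask) != (target_id & mask):
            return i
    return bits  # identical IDs: this node owns the key

print(bucket_index(0b1010, 0b1001, 4))  # 2: the IDs agree on '10', differ on the third bit
```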

Problem

In a system with a million nodes, Chord typically requires around 20 hops to locate a specific key. However, the failure of a single node along the routing path can cause the entire lookup to fail.

Kademlia addresses this issue through replication. Kademlia employs r = 20, meaning each key is stored on the 20 nodes closest to it by XOR distance. To support replication, each node maintains up to 20 entries for each prefix of its own ID (a "k-bucket"), each pointing to one of the closest known nodes for that prefix.

Constant Size Routing Tables (Koorde Hash)

Koorde is a DHT algorithm that achieves constant-size routing tables independent of N. In tree-based distributed systems, efficient routing can be achieved by strategically placing nodes within the tree structure:


By placing nodes at every non-leaf position of the tree, each node only needs to maintain pointers to its immediate children.

Each node requires a fixed number of pointers, independent of the total number of nodes in the system.
Finding any node within the tree can be accomplished in logarithmic time (O(log N)).

SQL vs NoSQL Databases

SQL databases are built on the relational data model and prioritize strong data integrity guarantees:
  • Atomicity: Transactions execute as a single, indivisible unit, either succeeding or failing completely.
  • Consistency: Ensures that all data modifications maintain the integrity of the database schema (e.g., data type constraints, referential integrity).
  • Isolation: Transactions appear to execute serially, even when occurring concurrently, preventing unexpected interactions. This is one of the most important and distinctive features of SQL databases.
  • Durability: Committed data persists permanently, even in the event of system failures.
Examples: MySQL, PostgreSQL, Oracle, SQL Server.

However, SQL databases are limited in scalability - it is difficult to distribute data across multiple nodes while maintaining ACID properties. Additionally, vertical scaling is limited by the capacity of individual servers.

NoSQL databases were developed to address the limitations of SQL databases, particularly in terms of scalability and flexibility.

There are different types of NoSQL databases supporting different data models.

  • Key-Value Stores: (e.g., Redis, Memcached, DynamoDB) Simple data structures with fast read/write operations.
  • Document Stores: (e.g., MongoDB, CouchDB) Store data in flexible, document-like structures.
  • Wide-Column Stores: (e.g., Cassandra, HBase) Efficiently handle large datasets with sparse attributes.
  • Graph Databases: (e.g., Neo4j) Optimized for representing and querying relationships between data.
NoSQL databases have relaxed consistency models, often prioritizing availability, low latency, partition tolerance, and scalability (ALPS) over strong consistency. Data will eventually be consistent across the system, but there may be temporary inconsistencies. Many NoSQL databases offer limited transaction support compared to SQL: key-value stores typically support atomic operations on single keys, and some, like MongoDB, provide limited support for multi-document transactions.

NoSQL databases are inherently designed for distributed environments, enabling easier horizontal scaling. They utilize simpler consistency protocols (e.g., Dynamo's vector clocks) compared to more complex protocols like Paxos required by some SQL databases.

Quorum Systems

Quorum systems are used to ensure that multiple nodes in a system agree on a decision or state. Quorum systems are effective for linearizability, which ensures that operations on a single key appear to occur in a single, well-defined order. However, they are not sufficient for strict serializability, which guarantees that concurrent transactions (potentially spanning multiple keys) appear to execute in a serial order consistent with real time.

Transactions in Quorum Systems

Transactions within a quorum system exhibit the following characteristics:

  • Atomicity: Transactions either complete successfully or fail entirely.
  • Linearizability: Operations on individual keys are executed in a well-defined order.

To enable transactions, a two-phase commit protocol is typically employed:

  1. Write Phase:

    • The value for the key is written to a specific number of replicas (C).
    • Each replica provides a commit vote, essentially locking the key for other writes.
  2. Commit Phase:

    • Once enough votes are collected, the transaction is committed.
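
A minimal sketch of the two-phase flow described above, using in-memory stubs; the vote/lock bookkeeping and the commit-quorum parameter are illustrative:

```python
class Replica:
    def __init__(self):
        self.store, self.locks, self.staged = {}, set(), {}

    def prepare(self, key, value) -> bool:
        """Phase 1: vote to commit by locking the key and staging the value."""
        if key in self.locks:
            return False              # another write holds the lock: vote no
        self.locks.add(key)
        self.staged[key] = value
        return True

    def commit(self, key):
        """Phase 2: apply the staged value and release the lock."""
        self.store[key] = self.staged.pop(key)
        self.locks.discard(key)

def write(replicas, key, value, commit_quorum):
    voters = [r for r in replicas if r.prepare(key, value)]
    if len(voters) >= commit_quorum:
        for r in voters:
            r.commit(key)
        return True
    for r in voters:                  # abort: release locks on the voters
        r.staged.pop(key, None)
        r.locks.discard(key)
    return False

replicas = [Replica() for _ in range(3)]
print(write(replicas, "k", "v1", commit_quorum=2))  # True
```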

While atomicity is inherent to transactions, it's not always necessary in a quorum system. Maintaining consistent states across all replicas can be resource-intensive and unnecessary for achieving linearizability.

Voting Protocol for Quorums

The most common way of using quorum systems involves voting. In this scenario, atomicity of operations may not be guaranteed, but with proper voting configuration, the system can appear linearizable to the client, even if individual replicas are in inconsistent states.

Given N servers, W (write quorum), and R (read quorum), the system can achieve linearizability if:

R + W > N

Consider a system with three servers (A, B, C), where R = 2 and W = 2. Say, a write operation succeeds on server A but fails on the others.

  • The overall write operation fails. However, because this action is non-atomic, node A is left in an inconsistent state.
  • The client can read data from replicas A and B. Upon detecting conflicting values from A and B, it will read from C, receiving a version that matches B's.
  • Server A can then repair its state by fetching the correct value from another server.

Linearizability cannot be guaranteed when R + W <= N. In such cases, conflicts can arise.

Say a write W1 succeeded at replica A and B and a write W2 to the same key succeeded at replica B and C. Since B had observed both writes, it can break ties. But if B is down, then there would be a conflict. In this case, W = 2 but R = 1 and hence R + W = 3 <= N.
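
A minimal sketch showing why R + W > N forces every read quorum to overlap every write quorum, while R + W <= N allows reads that miss the latest write; replica counts are illustrative:

```python
from itertools import combinations

def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True iff every R-sized read set intersects every W-sized write set."""
    replicas = range(n)
    return all(
        set(read) & set(write)
        for read in combinations(replicas, r)
        for write in combinations(replicas, w)
    )

print(quorums_overlap(3, 2, 2))  # True:  R + W = 4 > 3
print(quorums_overlap(3, 1, 2))  # False: R + W = 3 <= 3, a read can miss a write
```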

Vector Clock

Vector Clocks are a powerful technique for establishing causal order among events in a distributed system. They enhance Lamport's clock by maintaining a separate logical clock for each node within the system.

In an N-node system, a vector clock is an N-dimensional vector, where each element represents the logical clock of a corresponding node.

Event Handling

  • Local Event: Increment the local node's clock (i.e., the i-th element of the vector).
V[i] = V[i] + 1
  • Send Event: Increment the local node's clock. Transmit the current vector clock (V[1...N]) along with the message.
  • Receive Event:
    • Increment the local node's clock.
V[i] = V[i] + 1
    • Update each element of the local vector by taking the maximum of the corresponding value in the received message and the current local value.
V[j] = max(Vmsg[j], V[j]) ∀ j ≠ i

Comparing Vector Clocks

  • Equality: Two vector clocks, V1 and V2, are equal if and only if all their corresponding elements are equal.
V1 = V2 iff V1[i] = V2[i] ∀ i ∈ {1, ..., N}
  • Less Than or Equal To: Vector clock V1 is less than or equal to V2 if and only if every element of V1 is less than or equal to the corresponding element in V2
V1 <= V2 iff V1[i] <= V2[i] ∀ i ∈ {1, ..., N}
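
A minimal sketch of the vector clock rules above (updates on local, send, and receive events, plus the comparison operators); the node count and indices are illustrative:

```python
class VectorClock:
    def __init__(self, n: int, i: int):
        self.v = [0] * n      # one logical clock per node
        self.i = i            # index of the local node

    def local_event(self):
        self.v[self.i] += 1   # V[i] = V[i] + 1

    def send(self) -> list[int]:
        self.local_event()
        return list(self.v)   # the full vector travels with the message

    def receive(self, v_msg: list[int]):
        self.local_event()
        # V[j] = max(Vmsg[j], V[j]) for all j != i
        self.v = [b if j == self.i else max(a, b)
                  for j, (a, b) in enumerate(zip(v_msg, self.v))]

    @staticmethod
    def leq(v1: list[int], v2: list[int]) -> bool:
        """V1 <= V2 iff V1[j] <= V2[j] for every j."""
        return all(a <= b for a, b in zip(v1, v2))

    @staticmethod
    def concurrent(v1: list[int], v2: list[int]) -> bool:
        return not VectorClock.leq(v1, v2) and not VectorClock.leq(v2, v1)

a, b = VectorClock(2, 0), VectorClock(2, 1)
a.local_event()            # a: [1, 0]
b.receive(a.send())        # a sends [2, 0]; b becomes [2, 1]
print(VectorClock.leq([1, 0], b.v))            # True: [1, 0] happened before
print(VectorClock.concurrent([2, 0], [0, 1]))  # True: neither dominates
```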

Vector Clocks in Quorum System

While quorum systems with R + W <= N cannot guarantee linearizability, vector clocks can be employed to effectively resolve conflicts and facilitate eventual reconciliation.

Approach

  • Each node maintains and updates its vector clock according to the rules described earlier.
  • When a write operation occurs on a key, the current state of the vector clock is associated with the written value.
Example: <key, value, <A: 1, B: 2>> where <A: 1, B: 2> represents the vector clock at the time of the write.

Conflict Resolution

  • Partial Ordering: Vector clocks enable partial ordering of values.
V1 <= V2 iff V1[i] <= V2[i] ∀ nodes i
  • Conflict Resolution Strategies:
    • Choose a value randomly.
    • Select the value associated with the most recent physical timestamp.

Neither of these conflict resolution strategies guarantees linearizability.

Anti-Entropy using Merkle Trees 

Say a set of replicas owns a keyspace. Keeping these replicas consistent is crucial; the background process of detecting and repairing divergence between replicas is often referred to as anti-entropy.

Merkle trees offer a robust mechanism for detecting conflicts and efficiently updating outdated values within this keyspace. This is particularly valuable in quorum systems, where replica inconsistencies can frequently arise.

A merkle tree is a binary tree-based data structure where each leaf node holds the hash of some items. Internal nodes, in turn, hold the hash of their respective child nodes. In this case, the leaf nodes hold the hash of values of all the keys. The hierarchical structure enables efficient verification of data integrity and facilitates the identification of discrepancies between replicas.


To compare the merkle trees of a keyspace across two replicas, we initiate a top-down traversal, starting at the root node, to find the keys which are not synced.
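
A minimal sketch of a merkle tree over a sorted keyspace and a top-down comparison that recurses only into differing subtrees; the hashing details and leaf layout are illustrative:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(leaves: list[bytes]) -> list[list[bytes]]:
    """Return tree levels, leaves first; each internal node hashes its children."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff(t1, t2, level=None, idx=0, out=None):
    """Top-down traversal from the roots; recurse only where hashes differ."""
    if out is None:
        out, level = [], len(t1) - 1
    if t1[level][idx] == t2[level][idx]:
        return out
    if level == 0:
        out.append(idx)                  # this leaf (key range) is out of sync
        return out
    diff(t1, t2, level - 1, 2 * idx, out)
    diff(t1, t2, level - 1, 2 * idx + 1, out)
    return out

keys = [f"k{i}".encode() for i in range(8)]
replica_a = build([h(k + b"=v1") for k in keys])
replica_b = build([h(k + b"=v1") for k in keys[:5]]
                  + [h(keys[5] + b"=v2")]
                  + [h(k + b"=v1") for k in keys[6:]])
print(diff(replica_a, replica_b))  # [5]: only that leaf needs to be synced
```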

Failure Detectors

Refer to Cassandra - A Decentralized Structured Storage System for a detailed description of failure detectors and gossip protocols.

Failure detectors are components that help identify faulty nodes in a distributed system. It is impossible to build a perfect failure detector; consequently, practical implementations rely on imperfect approximations. Heartbeat-based detectors, which leverage heartbeat information gossiped among nodes, are a prominent example.

Dynamo

Dynamo, a pioneering key-value NoSQL database, has significantly influenced the landscape of modern NoSQL systems. A key innovation of Dynamo lies in its deliberate trade-off of strict consistency for achieving low-latency and high availability.

Recognizing that numerous other Amazon services depend critically on Dynamo, the system prioritizes stringent tail latency requirements, specifically targeting the latency of the 99.9th percentile of requests. This emphasis stems from the understanding that, in many scenarios, a stale (possibly inconsistent) response is preferable to no response or a delayed one. Because latency SLAs add up across interdependent systems, even a small fraction of slow requests can cascade into severe end-to-end delays.

The API

Following the design principles of many successful distributed systems, Dynamo boasts a remarkably simple API:

  • Get(key) -> (context, list of values): Retrieves the value(s) associated with the given key, along with contextual information.
  • Put(key, context, value) -> void: Stores the specified value under the given key, utilizing the provided context for conflict resolution.

The context parameter in these operations is an opaque byte string. This context information is crucial for resolving write conflicts using a vector clock mechanism.
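
A minimal sketch of what such a client-facing interface could look like; the names and types here are illustrative, not Dynamo's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class GetResult:
    context: bytes        # opaque: encodes the vector clock(s) of the versions read
    values: list[bytes]   # more than one value if conflicting versions exist

class KeyValueStore:
    def get(self, key: bytes) -> GetResult:
        """Return all conflicting versions of the key plus an opaque context."""
        raise NotImplementedError

    def put(self, key: bytes, context: bytes, value: bytes) -> None:
        """Write a value; the context ties the write to the versions it supersedes."""
        raise NotImplementedError

# Typical read-modify-write cycle:
#   result = store.get(b"cart:123")
#   merged = reconcile(result.values)        # application-level reconciliation
#   store.put(b"cart:123", result.context, merged)
```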

Consistent Hashing in Dynamo

Dynamo was developed in 2007, during a period of intense interest in decentralized systems. Reflecting this trend, Dynamo employs a peer-to-peer architecture where all nodes are identical and utilize consistent hashing for efficient overlay network formation. This design adheres to the principles of incremental scalability, decentralization, symmetry, and heterogeneity.

Traditional consistent hashing presents challenges, such as key redistribution when a new node joins, potentially triggering expensive merkle tree recomputations across multiple nodes. That is why Dynamo employs a variant of consistent hashing.

Dynamo's Strategy

  • The keyspace is divided into Q equally sized partitions.
  • If the system has S nodes, each node is assigned roughly Q/S tokens; the number of tokens can also be weighted by a node's physical capacity.

This approach offers several advantages:

  • When a node leaves, only the new nodes responsible for its partitions need to recompute merkle trees, as the partitions themselves remain fixed.
  • Partitioning (defined by Q) and partition placement are decoupled.
  • The routing table maintains a constant number of entries (Q).

Additionally, we define a fairness metric as follows:

Load balancing efficiency = mean load / max load

A node is considered out of balance if its load exceeds 1.15 times the mean load. Identifying and addressing unbalanced nodes is critical, as they significantly impact tail latencies.

Note that the same physical machine may end up at multiple positions on the ring; these positions are known as virtual nodes.

Handling Client Operations

Each key k is assigned to a coordinator node. This coordinator node is responsible for replicating the keys that fall within its assigned range.

For redundancy, the coordinator replicates the key to N - 1 successor nodes within the ring. This results in a system where each node is responsible for the region of the ring between itself and its Nth predecessor.

The list of nodes responsible for storing a particular key is termed the preference list. Since each physical node may be responsible for multiple ranges on the ring, the preference list strategically skips positions on the ring to ensure that it only contains distinct physical nodes.


In the figure above, the key is placed on node 0 (the coordinator), 1, and 2.
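
A minimal sketch of building a preference list by walking the ring clockwise from the key's position and skipping virtual nodes of physical machines already chosen; the ring layout is illustrative:

```python
def preference_list(ring: list[tuple[int, str]], key_hash: int, n: int) -> list[str]:
    """ring is a sorted list of (position, physical_node). Returns N distinct
    physical nodes, starting from the key's coordinator and walking clockwise."""
    positions = [p for p, _ in ring]
    start = next((i for i, p in enumerate(positions) if p >= key_hash), 0)
    chosen: list[str] = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:        # skip other virtual nodes of the same machine
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen

ring = [(5, "A"), (20, "B"), (33, "A"), (47, "C"), (60, "B"), (80, "D")]
print(preference_list(ring, key_hash=25, n=3))  # ['A', 'C', 'B']: A (at 33) coordinates
```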

Clients can interact with the system through two primary methods:

  • Directly using the assignment map: This approach offers efficient and low-latency access.
  • Through a load balancer: This may introduce additional latency.

If a client initially contacts a different node (not the key's coordinator), the request will be redirected to the correct coordinator.

Unlike traditional quorum systems that enforce linearizability by adhering to the R + W > N rule, Dynamo deviates from this approach. This deviation is intentional, as adhering to R + W > N can lead to significant latency increases, especially when N is large.

Instead of strict quorum rules, Dynamo utilizes vector clocks associated with each key's value to handle conflicting writes. To ensure durability, the coordinator synchronously replicates the write to at least one other node before acknowledging the client's request.

For applications demanding high-performance and consistent reads, a configuration of R=1 and W=N may be suitable.

Section 4.4 provides a detailed example of how vector clocks are employed to reconcile conflicting values for a key. While the specific details are well-explained in the paper, the core concept is that D2 precedes D3 because Sx(D3) = Sx(D2) and Sy(D3) > Sy(D2). D3 and D4 are considered conflicting versions because Sy(D3) > Sy(D4) and Sz(D4) > Sz(D3). These conflicts are resolved through semantic reconciliation.

The paper also mentions that Dynamo truncates the vector clock to 10 elements by removing the oldest entry. This design decision raises concerns, as it may potentially lead to several unforeseen issues.

Handling Failures

Temporary Failure

Dynamo introduces "hinted handoff", a technique that improves fault tolerance. When a node is unavailable, writes are redirected to its successor node on the ring. These "hinted writes" are stored in a separate file, facilitating efficient transfer when the target node becomes available again.

This approach inherently breaks the traditional quorum algorithm, leading Dynamo to adopt a "sloppy quorum" strategy.
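
A minimal sketch of hinted handoff: when the intended replica is down, the write goes to another node along with a hint naming the intended recipient, and is handed back once that node recovers. All structure here is illustrative:

```python
class Node:
    def __init__(self, name: str):
        self.name, self.up = name, True
        self.store: dict[str, str] = {}
        self.hinted: dict[str, tuple[str, str]] = {}  # key -> (intended_node, value)

    def write(self, key: str, value: str, hint_for: str | None = None):
        if hint_for:                      # hold the write on behalf of a down node
            self.hinted[key] = (hint_for, value)
        else:
            self.store[key] = value

def replicate(key: str, value: str, preference: list[Node], fallback: Node):
    for node in preference:
        if node.up:
            node.write(key, value)
        else:                             # divert to the fallback, tagged with a hint
            fallback.write(key, value, hint_for=node.name)

def handoff(holder: Node, nodes: dict[str, Node]):
    """When the intended node recovers, deliver its hinted writes and drop them."""
    for key, (intended, value) in list(holder.hinted.items()):
        if nodes[intended].up:
            nodes[intended].write(key, value)
            del holder.hinted[key]

a, b, d = Node("A"), Node("B"), Node("D")
b.up = False
replicate("k", "v", preference=[a, b], fallback=d)
b.up = True
handoff(d, {"A": a, "B": b, "D": d})
print(b.store)  # {'k': 'v'}
```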

Permanent Failure

When a replica fails permanently, writes are still directed to at least N-1 other nodes, ensuring some level of data durability. This is where merkle trees prove invaluable.

Dynamo incorporates an anti-entropy replica synchronization protocol. In this protocol, the merkle tree serves as a data structure where leaf nodes represent the hashes of values of individual key-value pairs. During synchronization, a comparison of merkle trees between replicas reveals discrepancies, enabling the efficient transfer of only the out-of-sync values.

Ring Membership

Only administrators possess the authority to add or remove members from the ring. Upon any membership change, a gossip-based protocol propagates these updates throughout the system, ensuring eventual consistency in the membership view across all nodes. In this protocol, each node randomly selects a peer for communication every second.

Failure detection within the system also leverages a simple gossip-style mechanism. However, this detection is inherently local. Node A may deem node B as failed if it fails to receive responses from B, even though B might still be responsive to other nodes like C. To further reinforce failure detection, a periodic loop is implemented to actively check for the presence of node B.

Evaluation

  • Tail latencies: The 99.9th percentile latency (p999) hovers around 200ms, significantly higher than the average latency.
  • Load Balancing: The new consistent partitioning strategy demonstrates excellent load balancing efficiency, exceeding 0.9.
  • Version Divergence: Two primary scenarios contribute to version divergence: system failures and concurrent write operations.
    • 99.94% of requests encounter a single version of the data.
    • 0.00057% encounter two versions.
    • 0.00047% encounter three versions.
    • 0.00009% encounter four versions.
  • Client-Driven Coordination: Implementing client-driven coordination, where clients directly contact servers based on their knowledge of the key-to-preference list mapping, results in a 30ms reduction in both read and write latencies at the 99.9th percentile.

Paper Review

The Dynamo paper is a seminal work in the field of distributed systems. It introduces several groundbreaking ideas that have significantly influenced the design and development of numerous NoSQL databases.

Given the depth and breadth of the paper, these insights necessarily provide a high-level overview. To fully grasp the nuances and intricacies of Dynamo's design, I highly recommend a thorough and repeated reading of the original paper.

Recommended Read - Cassandra - A Decentralized Structured Storage System which is another NoSQL database similar to Dynamo.

PS: Towards Distributed SQL Databases

Even with the rise of NoSQL, the need for strong consistency and data integrity remained paramount for critical applications like financial systems.

Google Spanner emerged as a groundbreaking achievement, becoming the first distributed SQL database to offer strong consistency guarantees. It pioneered the concept of "DRDBMS" – systems that are:

  • Scalable: Capable of horizontal scaling to handle massive datasets and high throughput.
  • Consistent: Adhering to ACID properties, ensuring data integrity and transactional isolation in a distributed environment.
  • Geographically Replicated: Providing high availability and low latency by distributing data across multiple geographic locations.
  • SQL-compliant: Supporting the familiar SQL language for data querying and manipulation.

Amazon Aurora, another significant player, attempted to enhance consistency by optimizing consensus protocols on top of existing standalone SQL databases like MySQL. However, this approach often faced limitations, with a single write node frequently becoming a performance bottleneck.

Recommended Read: Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases

CockroachDB, drawing inspiration from Spanner, offers a scalable and consistently replicated transactional store. However, it differentiates itself by:

  • Independence from Atomic Clocks: Unlike Spanner, CockroachDB does not rely on tightly synchronized atomic clocks, making it more adaptable to diverse deployment environments.
  • Open Source and Platform Agnostic: CockroachDB is open-source and can be deployed outside of Google Cloud Platform (GCP), providing greater flexibility and choice for organizations.
