
Paper Insights #15 - Dynamo: Amazon's Highly Available Key-value Store

This groundbreaking paper, presented at SOSP 2007, has become a cornerstone in the field of computer systems, profoundly influencing subsequent research and development. It served as a blueprint for numerous NoSQL databases, including prominent examples like MongoDB, Cassandra, and Azure Cosmos DB.

Paper Link

A deep dive into this work is essential for anyone interested in distributed systems. It explores several innovative concepts that will captivate and enlighten readers.

Let's go over some fundamental ideas (with the caution that there are several of them!).

Distributed Hash Tables (DHTs)

A DHT is a decentralized system that provides a lookup service akin to a traditional hash table. Key characteristics of DHTs include:

  • Autonomy and Decentralization: Nodes operate independently, forming the system without centralized control.
  • Fault Tolerance: The system remains reliable even when nodes join, leave, or fail.
  • Scalability: It efficiently handles systems with thousands or millions of nodes.
  • Anonymity: Participant identities can be hidden.

Keyspace Partitioning

DHTs distribute ownership of the keyspace among participating nodes. Common partitioning methods include:

  • Consistent Hashing: This widely-used method partitions the keyspace based on a distance function δ(k1, k2) between keys. Each node is assigned an identifier; the node with identifier i_x owns all keys k for which δ(i_x, k) is minimal.
  • Rendezvous Hashing: All clients employ the same hash function h(server, key) and map each key to the available server with the highest hash value.
  • Locality-Preserving Hashing: This method assigns similar keys to similar nodes, enhancing data locality.

Overlay Network

Each node maintains a routing table to direct requests to the node responsible for the specific key. This is called an overlay network.

A trade-off exists between routing table size and the number of hops required for a request to reach its destination. Larger routing tables generally reduce the number of hops.

Consistent Hashing

As previously discussed, the keyspace is partitioned, with node assignments determined by a distance function. This concept exhibits several variations.

Modulo Hash

Assigns key x to node H(x) % N, where H(x) is the hash of the key and N is the number of nodes.
Issue: Changes in N (e.g., node additions/removals) remap almost every key, invalidating cached values.
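
A minimal sketch (in Python) of why modulo hashing remaps most keys when N changes; the key set and node counts are illustrative:

```python
import hashlib

def modulo_node(key: str, num_nodes: int) -> int:
    """Map a key to a node index using H(x) % N."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Adding a single node (N: 10 -> 11) remaps almost every key.
keys = [f"key-{i}" for i in range(10_000)]
before = {k: modulo_node(k, 10) for k in keys}
after = {k: modulo_node(k, 11) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 90%
```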

Chord Hash



Arranges hashed keys on a circular ring. Each node is responsible for a contiguous arc of the ring.
Advantage: Node failures only impact data within their specific arc.

Tree Hash

Uses the bit values of H(x) to traverse a tree and locate the responsible node.
Challenge: Node failures require efficient reassignment of the failed node's data.

Kademlia Hash

Stores key x on the node closest to H(x) by XOR distance. The XOR distance between two identifiers is computed bit by bit: each bit of the result is zero if the corresponding bits are equal and one if they differ, and the resulting bit string is read as an integer.
Flexibility: Handles node additions/removals by dynamically redistributing data.
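
A minimal sketch of XOR distance and closest-node selection as described above; the node identifiers are illustrative small integers:

```python
def xor_distance(a: int, b: int) -> int:
    """XOR distance: bitwise XOR of the two identifiers, read as an integer."""
    return a ^ b

def closest_node(key_id: int, node_ids: list[int]) -> int:
    """Node whose identifier is closest to the key by XOR distance."""
    return min(node_ids, key=lambda n: xor_distance(key_id, n))

nodes = [0b0001, 0b0110, 0b1010, 0b1111]
print(bin(closest_node(0b1011, nodes)))  # 0b1010 (distance 0b0001)
```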

Load Distribution in Consistent Hashing

Chord hashing faces a significant challenge with node failures: the loss of a node necessitates moving all its data to the successor node, leading to load imbalance.

Naive Solution: Hash Space Redistribution

This approach involves adjusting the hash space to evenly distribute data among the remaining nodes.
Drawback: This can trigger substantial data movement across the network, impacting performance.


Better Solution: Redundant Node Placement

Both Chord and Kademlia employ a strategy of placing a node at multiple locations within their respective structures (multiple points in Chord, multiple leaf positions in Kademlia). This redundancy ensures that upon node failure, the keyspace is more evenly distributed among the remaining nodes, minimizing load imbalance.


In the diagram above, the same node is placed at multiple positions on the hash ring. When the purple node goes down, the green and yellow nodes share the portion of the arc owned by the purple node.

Another Advantage: Note that not all machines have the same capacity. Nodes with higher capacity can be placed at more positions, effectively increasing their share of the keyspace.
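
A minimal sketch of placing each physical node at multiple ring positions (virtual nodes), with more positions for higher-capacity machines; the node names, hash function, and counts are illustrative:

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, capacities: dict[str, int]):
        """capacities maps a physical node to its number of virtual nodes
        (more virtual nodes = larger share of the keyspace)."""
        self._points = sorted(
            (h(f"{name}#{i}"), name)
            for name, vnodes in capacities.items()
            for i in range(vnodes)
        )
        self._hashes = [p for p, _ in self._points]

    def owner(self, key: str) -> str:
        """Walk clockwise from the key's hash to the next virtual node."""
        idx = bisect.bisect_right(self._hashes, h(key)) % len(self._points)
        return self._points[idx][1]

ring = Ring({"A": 8, "B": 8, "C": 16})  # C has twice the capacity of A and B
print(ring.owner("user:42"))
```

If a physical node is removed, only the arcs owned by its virtual nodes change hands, and they are spread across the remaining machines rather than dumped on a single successor.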

CARP (Cache Array Routing Protocol)

CARP is another assignment algorithm that offers good load distribution.

Each machine is assigned a unique hash value h_i, where i ranges from 1 to N. For an input x, calculate a hash value for each machine:

H_1 = H(h_1, x)
H_2 = H(h_2, x)
H_3 = H(h_3, x)
...

Map input x to the machine with the largest hash value H_i.

Pros: When a machine fails, its load is automatically redistributed. Since the hash function is independent of machine failures, each key owned by the failed machine falls to whichever remaining machine now has the largest hash value for it, spreading the load roughly evenly.

Cons: Mapping an input to a machine requires calculating N hash values, resulting in linear time complexity O(N). This can significantly impact performance, especially during high-traffic periods. In contrast, distributed hashing protocols like Chord and Kademlia offer logarithmic time complexity (O(log N)) for finding the responsible node, making them more efficient for large-scale systems.
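
A minimal sketch of this highest-hash assignment; the way the machine identifier and the key are mixed into a single hash is illustrative:

```python
import hashlib

def combined_hash(machine_id: str, key: str) -> int:
    """H(h_i, x): a hash that mixes the machine identifier with the key."""
    return int(hashlib.md5(f"{machine_id}:{key}".encode()).hexdigest(), 16)

def assign(key: str, machines: list[str]) -> str:
    """Map the key to the machine with the largest combined hash (N hashes, O(N))."""
    return max(machines, key=lambda m: combined_hash(m, key))

machines = ["m1", "m2", "m3", "m4"]
winner = assign("key-123", machines)
print(winner)
# If the winner fails, the key falls through to the machine with the next-largest
# hash; keys assigned to the surviving machines do not move at all.
print(assign("key-123", [m for m in machines if m != winner]))
```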

Routing Algorithm in Consistent Hashing

In decentralized systems with millions of machines, several critical challenges arise:

  • Dynamic Membership: Nodes frequently join and leave the network.
  • Decentralized Control: Nodes are owned and operated by independent entities.

Challenges of Early Peer-to-Peer Systems

  • Napster: Relied on a centralized server to maintain a global index and route requests. This single point of failure made the system vulnerable to disruption.
  • Gnutella: Employed a decentralized approach where each peer connected to a limited number of neighbors. However, locating resources often involved costly broadcasts to all connected peers, leading to significant overhead.

Consistent Hashing as a Solution

Consistent hashing provides an effective mechanism for routing requests in decentralized systems.

Routing Algorithm for Chord

Each node is assigned a unique identifier (e.g., a hash value).

Each node maintains a "finger table" with O(log N) entries. The i-th entry in node n's finger table points to the successor of (n + 2^(i-1)) on the ring.


During lookup, the current node examines its finger table to find the closest preceding node to the key's identifier and forwards the request to it; each hop roughly halves the remaining distance to the key.
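
A minimal sketch of building a Chord finger table, assuming a small m-bit identifier space and an illustrative set of node IDs:

```python
def successor(ident: int, nodes: list[int], m: int) -> int:
    """First node clockwise from ident on a 2^m ring."""
    ring = sorted(nodes)
    for n in ring:
        if n >= ident:
            return n
    return ring[0]  # wrap around

def finger_table(n: int, nodes: list[int], m: int) -> list[int]:
    """The i-th finger points to successor(n + 2^(i-1)), for i = 1..m."""
    return [successor((n + 2 ** (i - 1)) % 2 ** m, nodes, m)
            for i in range(1, m + 1)]

m = 6  # 64-slot ring
nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
print(finger_table(8, nodes, m))  # fingers for node 8: [14, 14, 14, 21, 32, 42]
# During a lookup, a node forwards to the closest preceding finger of the key,
# roughly halving the remaining ring distance each hop (O(log N) hops).
```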

Routing Algorithm for Kademlia

Each node is assigned a unique identifier (e.g., a hash value).

Each node maintains a routing table with buckets derived from its identifier's binary representation b1b2b3...:

  • b1': nodes whose first bit differs from the current node's.
  • b1b2': nodes that share the first bit but differ in the second.
  • b1b2b3': nodes that share the first two bits but differ in the third, and so on.

When a node receives a request for a specific hash, it checks if it owns the hash. If not, it forwards the request to a neighboring node that is "closer" in terms of their identifier's binary representation (i.e., differs in fewer bits).

The routing table has O(log N) entries. Finding the responsible node requires traversing the routing table, resulting in logarithmic lookup time (O(log N)).

In essence, this algorithm allows nodes to efficiently route requests towards the node responsible for the target hash by iteratively correcting one bit at a time.
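
A minimal sketch of how a node picks the routing-table bucket to forward to, i.e., the position of the first differing bit; the identifier length is illustrative:

```python
def bucket_index(my_id: int, target_id: int, bits: int) -> int:
    """Index of the first bit (from the most significant) where the IDs differ.
    That bucket holds neighbors sharing a longer prefix with the target."""
    for i in range(bits):
        mask = 1 << (bits - 1 - i)
        if (my_id & mask) != (target_id & mask):
            return i
    return bits  # identical IDs: this node owns the key

print(bucket_index(0b1010, 0b1001, 4))  # 2: the IDs agree on '10', differ on the third bit
```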

Problem

In a system with a million nodes, Chord typically requires around 20 hops to locate a specific key. However, the failure of a single node along the routing path can cause the entire lookup to fail.

Kademlia addresses this issue through replication. Kademlia employs r = 20, meaning each key is stored on the 20 nodes closest to it by XOR distance. To support replication, each node maintains up to 20 entries for each prefix of its own ID (a "k-bucket"), each pointing to one of the closest known nodes for that prefix.

Constant Size Routing Tables (Koorde Hash)

Koorde is a DHT algorithm that achieves constant-size routing tables independent of N. In tree-based distributed systems, efficient routing can be achieved by strategically placing nodes within the tree structure:


By placing nodes at every non-leaf position of the tree, each node only needs to maintain pointers to its immediate children.

Each node requires a fixed number of pointers, independent of the total number of nodes in the system.
Finding any node within the tree can be accomplished in logarithmic time (O(log N)).

SQL vs NoSQL Databases

SQL databases are built on the relational data model and prioritize strong data integrity guarantees:
  • Atomicity: Transactions execute as a single, indivisible unit, either succeeding or failing completely.
  • Consistency: Ensures that all data modifications maintain the integrity of the database schema (e.g., data type constraints, referential integrity).
  • Isolation: Transactions appear to execute serially, even when occurring concurrently, preventing unexpected interactions. This is one of the most important and distinctive features of SQL databases.
  • Durability: Committed data persists permanently, even in the event of system failures.
Examples: MySQL, PostgreSQL, Oracle, SQL Server.

However, SQL databases are limited in scalability - it is difficult to distribute data across multiple nodes while maintaining ACID properties. Additionally, vertical scaling is limited by the capacity of individual servers.

NoSQL databases were developed to address the limitations of SQL databases, particularly in terms of scalability and flexibility.

There are different types of NoSQL databases supporting different data models.

  • Key-Value Stores: (e.g., Redis, Memcached, DynamoDB) Simple data structures with fast read/write operations.
  • Document Stores: (e.g., MongoDB, CouchDB) Store data in flexible, document-like structures.
  • Wide-Column Stores: (e.g., Cassandra, HBase) Efficiently handle large datasets with sparse attributes.
  • Graph Databases: (e.g., Neo4j) Optimized for representing and querying relationships between data.
NoSQL databases have relaxed consistency models, often prioritizing availability, low latency, partition tolerance, and scalability (ALPS) over strong consistency. Data will eventually be consistent across the system, but there may be temporary inconsistencies. Many NoSQL databases offer limited transaction support compared to SQL: key-value stores typically support atomic operations on single keys, and some, like MongoDB, provide limited support for multi-document transactions.

NoSQL databases are inherently designed for distributed environments, enabling easier horizontal scaling. They utilize simpler consistency protocols (e.g., Dynamo's vector clocks) compared to more complex protocols like Paxos required by some SQL databases.

Quorum Systems

Quorum systems are used to ensure that multiple nodes in a system agree on a decision or state. Quorum systems are effective for linearizability, which ensures that operations on a single key appear to occur in a single, well-defined order. However, they are not sufficient for strict serializability, which guarantees that concurrent transactions (potentially spanning multiple keys) appear to execute in a serial order consistent with real time.

Transactions in Quorum Systems

Transactions within a quorum system exhibit the following characteristics:

  • Atomicity: Transactions either complete successfully or fail entirely.
  • Linearizability: Operations on individual keys are executed in a well-defined order.

To enable transactions, a two-phase commit protocol is typically employed:

  1. Write Phase:

    • The value for the key is written to a specific number of replicas (C).
    • Each replica provides a commit vote, essentially locking the key for other writes.
  2. Commit Phase:

    • Once enough votes are collected, the transaction is committed.
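
A minimal sketch of the two-phase flow described above, using in-memory stubs; the vote/lock bookkeeping and the commit-quorum parameter are illustrative:

```python
class Replica:
    def __init__(self):
        self.store, self.locks, self.staged = {}, set(), {}

    def prepare(self, key, value) -> bool:
        """Phase 1: vote to commit by locking the key and staging the value."""
        if key in self.locks:
            return False              # another write holds the lock: vote no
        self.locks.add(key)
        self.staged[key] = value
        return True

    def commit(self, key):
        """Phase 2: apply the staged value and release the lock."""
        self.store[key] = self.staged.pop(key)
        self.locks.discard(key)

def write(replicas, key, value, commit_quorum):
    voters = [r for r in replicas if r.prepare(key, value)]
    if len(voters) >= commit_quorum:
        for r in voters:
            r.commit(key)
        return True
    for r in voters:                  # abort: release locks on the voters
        r.staged.pop(key, None)
        r.locks.discard(key)
    return False

replicas = [Replica() for _ in range(3)]
print(write(replicas, "k", "v1", commit_quorum=2))  # True
```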

While atomicity is inherent to transactions, it's not always necessary in a quorum system. Maintaining consistent states across all replicas can be resource-intensive and unnecessary for achieving linearizability.

Voting Protocol for Quorums

The most common way of using quorum systems involves voting. In this scenario, atomicity of operations may not be guaranteed, but with proper voting configuration, the system can appear linearizable to the client, even if individual replicas are in inconsistent states.

Given N servers, W (write quorum), and R (read quorum), the system can achieve linearizability if:

R + W > N

Consider a system with three servers (A, B, C), where R = 2 and W = 2. Say, a write operation succeeds on server A but fails on the others.

  • The overall write operation fails. However, because this action is non-atomic, node A is left in an inconsistent state.
  • The client can read data from replicas A and B. Upon detecting conflicting values from A and B, it will read from C, receiving a version that matches B's.
  • Server A can then repair its state by fetching the correct value from another server.

Linearizability cannot be guaranteed when R + W <= N. In such cases, conflicts can arise.

Say a write W1 succeeded at replica A and B and a write W2 to the same key succeeded at replica B and C. Since B had observed both writes, it can break ties. But if B is down, then there would be a conflict. In this case, W = 2 but R = 1 and hence R + W = 3 <= N.
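
A minimal sketch showing why R + W > N forces every read quorum to overlap every write quorum, while R + W <= N allows reads that miss the latest write; replica counts are illustrative:

```python
from itertools import combinations

def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True iff every R-sized read set intersects every W-sized write set."""
    replicas = range(n)
    return all(
        set(read) & set(write)
        for read in combinations(replicas, r)
        for write in combinations(replicas, w)
    )

print(quorums_overlap(3, 2, 2))  # True:  R + W = 4 > 3
print(quorums_overlap(3, 1, 2))  # False: R + W = 3 <= 3, a read can miss a write
```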

Vector Clock

Vector Clocks are a powerful technique for establishing causal order among events in a distributed system. They enhance Lamport's clock by maintaining a separate logical clock for each node within the system.

In an N-node system, a vector clock is an N-dimensional vector, where each element represents the logical clock of a corresponding node.

Event Handling

  • Local Event: Increment the local node's clock (i.e., the i-th element of the vector).
V[i] = V[i] + 1
  • Send Event: Increment the local node's clock. Transmit the current vector clock (V[1...N]) along with the message.
  • Receive Event:
    • Increment the local node's clock.
V[i] = V[i] + 1
    • Update each element of the local vector by taking the maximum of the corresponding value in the received message and the current local value.
V[j] = max(Vmsg[j], V[j]) ∀ j ≠ i

Comparing Vector Clocks

  • Equality: Two vector clocks, V1 and V2, are equal if and only if all their corresponding elements are equal.
V1 = V2 iff V1[i] = V2[i] ∀ i ∈ {1, ..., N}
  • Less Than or Equal To: Vector clock V1 is less than or equal to V2 if and only if every element of V1 is less than or equal to the corresponding element in V2
V1 <= V2 iff V1[i] <= V2[i] ∀ i ∈ {1, ..., N}
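
A minimal sketch of the vector clock rules above (updates on local, send, and receive events, plus the comparison operators); the node count and indices are illustrative:

```python
class VectorClock:
    def __init__(self, n: int, i: int):
        self.v = [0] * n      # one logical clock per node
        self.i = i            # index of the local node

    def local_event(self):
        self.v[self.i] += 1   # V[i] = V[i] + 1

    def send(self) -> list[int]:
        self.local_event()
        return list(self.v)   # the full vector travels with the message

    def receive(self, v_msg: list[int]):
        self.local_event()
        # V[j] = max(Vmsg[j], V[j]) for all j != i
        self.v = [b if j == self.i else max(a, b)
                  for j, (a, b) in enumerate(zip(v_msg, self.v))]

    @staticmethod
    def leq(v1: list[int], v2: list[int]) -> bool:
        """V1 <= V2 iff V1[j] <= V2[j] for every j."""
        return all(a <= b for a, b in zip(v1, v2))

    @staticmethod
    def concurrent(v1: list[int], v2: list[int]) -> bool:
        return not VectorClock.leq(v1, v2) and not VectorClock.leq(v2, v1)

a, b = VectorClock(2, 0), VectorClock(2, 1)
a.local_event()            # a: [1, 0]
b.receive(a.send())        # a sends [2, 0]; b becomes [2, 1]
print(VectorClock.leq([1, 0], b.v))            # True: [1, 0] happened before
print(VectorClock.concurrent([2, 0], [0, 1]))  # True: neither dominates
```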

Vector Clocks in Quorum System

While quorum systems with R + W <= N cannot guarantee linearizability, vector clocks can be employed to effectively resolve conflicts and facilitate eventual reconciliation.

Approach

  • Each node maintains and updates its vector clock according to the rules described earlier.
  • When a write operation occurs on a key, the current state of the vector clock is associated with the written value.
Example: <key, value, <A: 1, B: 2>> where <A: 1, B: 2> represents the vector clock at the time of the write.

Conflict Resolution

  • Partial Ordering: Vector clocks enable partial ordering of values.
V1 <= V2 iff V1[i] <= V2[i] ∀ nodes i
  • Conflict Resolution Strategies:
    • Choose a value randomly.
    • Select the value associated with the most recent physical timestamp.

Neither of these conflict resolution strategies guarantees linearizability.

Anti-Entropy using Merkle Trees 

Say a set of replicas owns a keyspace. Keeping these replicas consistent is crucial; the background process of detecting and repairing divergence between replicas is often referred to as anti-entropy.

Merkle trees offer a robust mechanism for detecting conflicts and efficiently updating outdated values within this keyspace. This is particularly valuable in quorum systems, where replica inconsistencies can frequently arise.

A merkle tree is a binary tree-based data structure where each leaf node holds the hash of some items. Internal nodes, in turn, hold the hash of their respective child nodes. In this case, the leaf nodes hold the hash of values of all the keys. The hierarchical structure enables efficient verification of data integrity and facilitates the identification of discrepancies between replicas.


To compare the merkle trees of a keyspace across two replicas, we initiate a top-down traversal, starting at the root node, to find the keys which are not synced.
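
A minimal sketch of a merkle tree over a sorted keyspace and a top-down comparison that recurses only into differing subtrees; the hashing details and leaf layout are illustrative:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(leaves: list[bytes]) -> list[list[bytes]]:
    """Return tree levels, leaves first; each internal node hashes its children."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff(t1, t2, level=None, idx=0, out=None):
    """Top-down traversal from the roots; recurse only where hashes differ."""
    if out is None:
        out, level = [], len(t1) - 1
    if t1[level][idx] == t2[level][idx]:
        return out
    if level == 0:
        out.append(idx)                  # this leaf (key range) is out of sync
        return out
    diff(t1, t2, level - 1, 2 * idx, out)
    diff(t1, t2, level - 1, 2 * idx + 1, out)
    return out

keys = [f"k{i}".encode() for i in range(8)]
replica_a = build([h(k + b"=v1") for k in keys])
replica_b = build([h(k + b"=v1") for k in keys[:5]]
                  + [h(keys[5] + b"=v2")]
                  + [h(k + b"=v1") for k in keys[6:]])
print(diff(replica_a, replica_b))  # [5]: only that leaf needs to be synced
```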

Failure Detectors

Refer to Cassandra - A Decentralized Structured Storage System for a detailed description of failure detectors and gossip protocols.

Failure detectors are components that help identify faulty nodes in a distributed system. It is impossible to build a perfect failure detector; consequently, practical implementations rely on imperfect approximations. Heartbeat-based detectors, which leverage heartbeat information gossiped among nodes, are a prominent example.

Dynamo

Dynamo, a pioneering key-value NoSQL database, has significantly influenced the landscape of modern NoSQL systems. A key innovation of Dynamo lies in its deliberate trade-off of strict consistency for achieving low-latency and high availability.

Recognizing that numerous other Amazon services depend critically on Dynamo, the system prioritizes stringent tail latency requirements, specifically targeting the latency of the 99.9th percentile of requests. This emphasis stems from the understanding that, in many scenarios, a stale (possibly inconsistent) response is preferable to no response or a delayed one. Because latency SLAs add up across interdependent systems, even a small fraction of slow requests can cascade into severe end-to-end delays.

The API

Following the design principles of many successful distributed systems, Dynamo boasts a remarkably simple API:

  • Get(key) -> (context, list of values): Retrieves the value(s) associated with the given key, along with contextual information.
  • Put(key, context, value) -> void: Stores the specified value under the given key, utilizing the provided context for conflict resolution.

The context parameter in these operations is an opaque byte string. This context information is crucial for resolving write conflicts using a vector clock mechanism.
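
A minimal sketch of what such a client-facing interface could look like; the names and types here are illustrative, not Dynamo's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class GetResult:
    context: bytes        # opaque: encodes the vector clock(s) of the versions read
    values: list[bytes]   # more than one value if conflicting versions exist

class KeyValueStore:
    def get(self, key: bytes) -> GetResult:
        """Return all conflicting versions of the key plus an opaque context."""
        raise NotImplementedError

    def put(self, key: bytes, context: bytes, value: bytes) -> None:
        """Write a value; the context ties the write to the versions it supersedes."""
        raise NotImplementedError

# Typical read-modify-write cycle:
#   result = store.get(b"cart:123")
#   merged = reconcile(result.values)        # application-level reconciliation
#   store.put(b"cart:123", result.context, merged)
```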

Consistent Hashing in Dynamo

Dynamo was developed in 2007, during a period of intense interest in decentralized systems. Reflecting this trend, Dynamo employs a peer-to-peer architecture where all nodes are identical and utilize consistent hashing for efficient overlay network formation. This design adheres to the principles of incremental scalability, decentralization, symmetry, and heterogeneity.

Traditional consistent hashing presents challenges, such as key redistribution when a new node joins, potentially triggering expensive merkle tree recomputations across multiple nodes. That is why Dynamo employs a variant of consistent hashing.

Dynamo's Strategy

  • The keyspace is divided into Q equally sized partitions.
  • If the system has S nodes, each node is assigned roughly Q/S tokens; the number of tokens can also be weighted by a node's physical capacity.

This approach offers several advantages:

  • When a node leaves, only the new nodes responsible for its partitions need to recompute merkle trees, as the partitions themselves remain fixed.
  • Partitioning (defined by Q) and partition placement are decoupled.
  • The routing table maintains a constant number of entries (Q).

Additionally, we define a fairness metric as follows:

Load balancing efficiency = mean load / max load

A node is considered out of balance if its load exceeds 1.15 times the mean load. Identifying and addressing unbalanced nodes is critical, as they significantly impact tail latencies.

Note that the same physical machine may end up at multiple positions on the ring; these positions are known as virtual nodes.

Handling Client Operations

Each key k is assigned to a coordinator node. This coordinator node is responsible for replicating the keys that fall within its assigned range.

For redundancy, the coordinator replicates the key to N - 1 successor nodes within the ring. This results in a system where each node is responsible for the region of the ring between itself and its Nth predecessor.

The list of nodes responsible for storing a particular key is termed the preference list. Since each physical node may be responsible for multiple ranges on the ring, the preference list strategically skips positions on the ring to ensure that it only contains distinct physical nodes.


In the figure above, the key is placed on node 0 (the coordinator), 1, and 2.
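
A minimal sketch of building a preference list by walking the ring clockwise from the key's position and skipping virtual nodes of physical machines already chosen; the ring layout is illustrative:

```python
def preference_list(ring: list[tuple[int, str]], key_hash: int, n: int) -> list[str]:
    """ring is a sorted list of (position, physical_node). Returns N distinct
    physical nodes, starting from the key's coordinator and walking clockwise."""
    positions = [p for p, _ in ring]
    start = next((i for i, p in enumerate(positions) if p >= key_hash), 0)
    chosen: list[str] = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:        # skip other virtual nodes of the same machine
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen

ring = [(5, "A"), (20, "B"), (33, "A"), (47, "C"), (60, "B"), (80, "D")]
print(preference_list(ring, key_hash=25, n=3))  # ['A', 'C', 'B']: A (at 33) coordinates
```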

Clients can interact with the system through two primary methods:

  • Directly using the assignment map: This approach offers efficient and low-latency access.
  • Through a load balancer: This may introduce additional latency.

If a client initially contacts a different node (not the key's coordinator), the request will be redirected to the correct coordinator.

Unlike traditional quorum systems that enforce linearizability by adhering to the R + W > N rule, Dynamo deviates from this approach. This deviation is intentional, as adhering to R + W > N can lead to significant latency increases, especially when N is large.

Instead of strict quorum rules, Dynamo utilizes vector clocks associated with each key's value to handle conflicting writes. To ensure durability, the coordinator synchronously replicates the write to at least one other node before acknowledging the client's request.

For applications demanding high-performance and consistent reads, a configuration of R=1 and W=N may be suitable.

Section 4.4 provides a detailed example of how vector clocks are employed to reconcile conflicting values for a key. While the specific details are well-explained in the paper, the core concept is that D2 precedes D3 because Sx(D3) = Sx(D2) and Sy(D3) > Sy(D2). D3 and D4 are considered conflicting versions because Sy(D3) > Sy(D4) and Sz(D4) > Sz(D3). These conflicts are resolved through semantic reconciliation.

The paper also mentions that Dynamo truncates the vector clock to 10 elements by removing the oldest entry. This design decision raises concerns, as it may potentially lead to several unforeseen issues.

Handling Failures

Temporary Failure

Dynamo introduces "hinted handoff", a technique that improves fault tolerance. When a node is unavailable, writes are redirected to its successor node on the ring. These "hinted writes" are stored in a separate file, facilitating efficient transfer when the target node becomes available again.

This approach inherently breaks the traditional quorum algorithm, leading Dynamo to adopt a "sloppy quorum" strategy.
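
A minimal sketch of hinted handoff: when the intended replica is down, the write goes to another node along with a hint naming the intended recipient, and is handed back once that node recovers. All structure here is illustrative:

```python
class Node:
    def __init__(self, name: str):
        self.name, self.up = name, True
        self.store: dict[str, str] = {}
        self.hinted: dict[str, tuple[str, str]] = {}  # key -> (intended_node, value)

    def write(self, key: str, value: str, hint_for: str | None = None):
        if hint_for:                      # hold the write on behalf of a down node
            self.hinted[key] = (hint_for, value)
        else:
            self.store[key] = value

def replicate(key: str, value: str, preference: list[Node], fallback: Node):
    for node in preference:
        if node.up:
            node.write(key, value)
        else:                             # divert to the fallback, tagged with a hint
            fallback.write(key, value, hint_for=node.name)

def handoff(holder: Node, nodes: dict[str, Node]):
    """When the intended node recovers, deliver its hinted writes and drop them."""
    for key, (intended, value) in list(holder.hinted.items()):
        if nodes[intended].up:
            nodes[intended].write(key, value)
            del holder.hinted[key]

a, b, d = Node("A"), Node("B"), Node("D")
b.up = False
replicate("k", "v", preference=[a, b], fallback=d)
b.up = True
handoff(d, {"A": a, "B": b, "D": d})
print(b.store)  # {'k': 'v'}
```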

Permanent Failure

When a replica fails permanently, writes are still directed to at least N-1 other nodes, ensuring some level of data durability. This is where merkle trees prove invaluable.

Dynamo incorporates an anti-entropy replica synchronization protocol. In this protocol, the merkle tree serves as a data structure where leaf nodes represent the hashes of values of individual key-value pairs. During synchronization, a comparison of merkle trees between replicas reveals discrepancies, enabling the efficient transfer of only the out-of-sync values.

Ring Membership

Only administrators possess the authority to add or remove members from the ring. Upon any membership change, a gossip-based protocol propagates these updates throughout the system, ensuring eventual consistency in the membership view across all nodes. In this protocol, each node randomly selects a peer for communication every second.

Failure detection within the system also leverages a simple gossip-style mechanism. However, this detection is inherently local. Node A may deem node B as failed if it fails to receive responses from B, even though B might still be responsive to other nodes like C. To further reinforce failure detection, a periodic loop is implemented to actively check for the presence of node B.

Evaluation

  • Tail latencies: The 99.9th percentile latency (p999) hovers around 200ms, significantly higher than the average latency.
  • Load Balancing: The new consistent partitioning strategy demonstrates excellent load balancing efficiency, exceeding 0.9.
  • Version Divergence: Two primary scenarios contribute to version divergence: system failures and concurrent write operations.
    • 99.94% of requests encounter a single version of the data.
    • 0.00057% encounter two versions.
    • 0.00047% encounter three versions.
    • 0.00009% encounter four versions.
  • Client-Driven Coordination: Implementing client-driven coordination, where clients directly contact servers based on their knowledge of the key-to-preference list mapping, results in a 30ms reduction in both read and write latencies at the 99.9th percentile.

Paper Review

The Dynamo paper is a seminal work in the field of distributed systems. It introduces several groundbreaking ideas that have significantly influenced the design and development of numerous NoSQL databases.

Given the depth and breadth of the paper, these insights necessarily provide a high-level overview. To fully grasp the nuances and intricacies of Dynamo's design, I highly recommend a thorough and repeated reading of the original paper.

Recommended Read - Cassandra - A Decentralized Structured Storage System which is another NoSQL database similar to Dynamo.

PS: Towards Distributed SQL Databases

Even with the rise of NoSQL, the need for strong consistency and data integrity remained paramount for critical applications like financial systems.

Google Spanner emerged as a groundbreaking achievement, becoming the first distributed SQL database to offer strong consistency guarantees. It pioneered the concept of "DRDBMS" – systems that are:

  • Scalable: Capable of horizontal scaling to handle massive datasets and high throughput.
  • Consistent: Adhering to ACID properties, ensuring data integrity and transactional isolation in a distributed environment.
  • Geographically Replicated: Providing high availability and low latency by distributing data across multiple geographic locations.
  • SQL-compliant: Supporting the familiar SQL language for data querying and manipulation.

Amazon Aurora, another significant player, attempted to enhance consistency by optimizing consensus protocols on top of existing standalone SQL databases like MySQL. However, this approach often faced limitations, with a single write node frequently becoming a performance bottleneck.

Recommended Read: Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases

CockroachDB, drawing inspiration from Spanner, offers a scalable and consistently replicated transactional store. However, it differentiates itself by:

  • Independence from Atomic Clocks: Unlike Spanner, CockroachDB does not rely on tightly synchronized atomic clocks, making it more adaptable to diverse deployment environments.
  • Open Source and Platform Agnostic: CockroachDB is open-source and can be deployed outside of Google Cloud Platform (GCP), providing greater flexibility and choice for organizations.
