
Paper Insights #21 - ZooKeeper: Wait-free coordination for Internet-scale systems

This paper is a significant contribution to the field of Distributed Systems. While ZooKeeper's design is based on principles similar to Google's Chubby, it also incorporates key differences, which we will discuss. Authored by Patrick Hunt et al. from Yahoo Research and presented at USENIX ATC '10, this work can be seen as an open-source counterpart to Chubby.  At the time of its publication, decentralized approaches, such as those using consistent hashing (e.g., Dynamo), were waning in popularity, and centralized systems were gaining traction.  Both Google's Chubby and Apache ZooKeeper emerged during this shift.

Paper Link

Let's begin with some fundamental concepts.

Distributed Logs

Distributed systems rely on distributed logs, which serve as the single source of truth for the operations performed. These logs are crucial for replicated state machines. To maintain consistency across replicas of the same system, it's essential that they execute the same set of operations in a consistent order. Otherwise, inconsistencies will arise. Typically, the operations recorded in a distributed log correspond to modifications of data stored on a node.


Instructions 1, 3, and 4 are blind writes; instruction 2 is a read-write.

The diagram above depicts three replicas, each holding a versioned data item, X. These replicas sequentially apply instructions from the distributed log. While the state of the memory location holding X may differ across replicas at any given point in time, once all instructions from the log have been applied, all replicas will converge to the same final state.
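
To make the replicated-state-machine idea concrete, here is a minimal Java sketch (my own illustration, not from the paper) of replicas applying the same log in the same order and converging to the same state:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical log entry: a blind write assigning a value to a key.
record LogEntry(String key, int value) {}

// A replica is just the log applied, in order, to an initially empty state.
class Replica {
    private final Map<String, Integer> state = new HashMap<>();

    void apply(List<LogEntry> log) {
        for (LogEntry e : log) {
            state.put(e.key(), e.value()); // same log + same order => same final state
        }
    }

    Map<String, Integer> state() { return state; }
}

class ReplicatedLogDemo {
    public static void main(String[] args) {
        List<LogEntry> log = List.of(
                new LogEntry("X", 1), new LogEntry("X", 5), new LogEntry("Y", 2));
        Replica r1 = new Replica();
        Replica r2 = new Replica();
        r1.apply(log);
        r2.apply(log);
        System.out.println(r1.state().equals(r2.state())); // true: replicas converge
    }
}
```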

Single-Data Transactions

Transactions, in a distributed system, can take two forms:
  • Single-data transactions: These transactions operate on a single object atomically.
  • Multi-data transactions: These transactions affect multiple objects atomically.

This discussion focuses solely on single-data transactions. Multi-data transactions will be covered in a future article on databases, which I will link here.

Implementation using Distributed Logs

Single-data item transactions can themselves take two forms:

  • Direct write (W): A direct assignment of a value.
  • Read-then-write (R-W): Reading a value and subsequently writing a new value. There may be some decision making based on the value read.

Distributed logs can implement both types of single-data item transactions. For a direct write (W), the log entry might be represented as W(key, value), indicating that value should be assigned to the data item identified by key. For a read-then-write (R-W), the log entry could be CAS(key, version, value), signifying a "compare-and-set" operation: the value of the data item associated with key is updated to value only if the current version matches version. Here version corresponds to the version of the value that was read.
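
As a rough sketch of these two transaction forms (the store and method names are hypothetical, not ZooKeeper internals), a replica might apply W and CAS entries like this:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical versioned store illustrating the two single-item transaction forms.
class SingleItemStore {
    record Cell(int value, int version) {}

    private final Map<String, Cell> data = new HashMap<>();

    // W(key, value): blind write; always succeeds and bumps the version.
    void write(String key, int value) {
        Cell c = data.get(key);
        data.put(key, new Cell(value, c == null ? 1 : c.version() + 1));
    }

    // CAS(key, version, value): applies only if the stored version still matches
    // the version the client observed when it read the item.
    boolean compareAndSet(String key, int expectedVersion, int value) {
        Cell c = data.get(key);
        if (c == null || c.version() != expectedVersion) return false; // stale read
        data.put(key, new Cell(value, c.version() + 1));
        return true;
    }

    public static void main(String[] args) {
        SingleItemStore store = new SingleItemStore();
        store.write("X", 10);                                 // version becomes 1
        System.out.println(store.compareAndSet("X", 1, 11));  // true
        System.out.println(store.compareAndSet("X", 1, 12));  // false: version is now 2
    }
}
```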

It's important to note that distributed logs are effective for single-data item transactions only, where the entire data item resides on the node(s) responsible for it. For multi-item transactions (or general database transactions), a combination of distributed logging and two-phase commit (2PC) is required. However, a discussion of 2PC is beyond the scope of this article.

Consistency Models

Single-item transaction consistency models vary based on the ordering guarantees they provide. Several types of ordering are possible:

  • Total Order: All transactions are ordered linearly.

  • Causal Order: A partial order based on event causality. For example, if client C1 performs write W1 and then sends a message to clients C2 and C3, and subsequently C2 performs write W2 and C3 performs write W3, then, according to Lamport's happened-before relationship, W1 happened before both W2 and W3. Therefore, W1 must appear before W2 and W3 in any valid causal order. Both {W1, W2, W3} and {W1, W3, W2} are valid causal orderings, but {W2, W1, W3} is not valid.


  • FIFO Order: A partial order ensuring that transactions from the same client are ordered. For example, if a client performs write W1 and then write W2, the final order must have W1 before W2. Writes from other clients (like W3 in the previous example) can occur at any point in the sequence relative to W1 and W2.



Linearizability (External Consistency) - Temporal Total Order

The strongest single-object consistency model, linearizability guarantees immediate external visibility of transaction results (a.k.a. external consistency).

If a client executes transaction T1 and, only after its successful commit, executes transaction T2, then linearizability requires that the actual order of these transactions must reflect this client-observed real-time order: T1 must be ordered before T2.

Linearizability simplifies client reasoning about the system's behavior.

Sequential Consistency - Total Order

A strong consistency model ensuring that operations appear to have occurred in some total order.


A helpful analogy (attributed to Dr. D. Mazieres) illustrates the difference between linearizability and sequential consistency:

Suppose you successfully commit a transaction that adds a new item to a key-value store. You then call a friend, asking them to look up the new data item. If your friend's read fails, the store violates linearizability (external consistency).

Many distributed datastores, including ZooKeeper (with its relaxed reads, as we'll discuss), do not guarantee linearizability. However, they still maintain sequential consistency, ensuring that the final state reflects some total order of the executed transactions. For example, if your friend then attempts to add the same data item again (with the same key), the store must reject either your transaction or your friend's, depending on which one it decides to order first. This avoids duplicate keys and thereby makes the store at least sequentially consistent.

In other words, sequential consistency guarantees a total order of transactions without strict temporal (or externally-visible) semantics.

Causal Consistency - Causal Order

At this level, the total order of transactions is broken, and only a partial order can be maintained based on causality. Causal consistency orders transactions according to their causal relationships. This weaker consistency model is available during network partitions.

PRAM Consistency (FIFO Consistency) - FIFO Order

PRAM consistency maintains partial ordering based on the principle that writes executed by the same process are totally ordered.

Others

Other consistency models, such as release consistency, read-your-writes (RYW), and write-follows-reads (WFR), will be discussed elsewhere.

Bonus: Strict Serializability

While this discussion focuses on single-item reads and writes (and thus doesn't involve multi-item transactions), it's worth noting that extending linearizability to multi-item transactions yields strict serializability.

Distributed Log Algorithms

Now that we understand the role of distributed logs in implementing replicated systems capable of single-data item transactions, let's explore the algorithms that enable their creation.

In a distributed system, the participating nodes are called participants. Nodes that function correctly (i.e., are non-crashing and non-Byzantine) are considered correct participants. 

Consensus

A consensus algorithm addresses the problem of agreeing on a common value for a variable among these participants. Such an algorithm must satisfy the following properties:

  • Termination: Every correct process eventually decides on a value.
  • Integrity: If all correct processes initially proposed the same value v, then all correct processes must decide on v.
  • Agreement: All correct processes must agree on the same value.

Consensus algorithms are fundamental to building distributed logs. For example, a distributed log can be divided into slots, where each slot holds a single log entry (or instruction for the distributed system). Consensus can then be used to ensure agreement on the value (instruction) to be placed in each slot.

Atomic Broadcast

Atomic broadcast is a broadcast primitive for distributed systems that ensures:

  • Validity (Liveness): If a correct participant broadcasts a message, all correct participants eventually receive it.
  • Agreement (Safety): If one correct participant receives a message, all correct participants eventually receive it.
  • Integrity (Safety): Each participant receives a given message at most once.
  • Total Order (Safety): Messages are delivered to all correct participants in the same total order.

In essence, atomic broadcast guarantees that all correct participants eventually receive all broadcast messages in a consistent total order. The use of "eventually" is crucial in the definition.

Beyond atomic broadcast, other broadcast protocols exist, each corresponding to a specific ordering guarantee:

  • FIFO Broadcast: Messages originating from a single node are delivered to all participants in the same order they were sent out. No ordering guarantees are provided for messages from different senders.

  • Causal Broadcast: All messages are delivered in a causally consistent order.

Equivalence of Consensus and Atomic Broadcast

A distributed log, built using consensus, can carry the messages of an atomic broadcast, with each log entry corresponding to a broadcast message, thus ensuring total order. Conversely, an algorithm that solves atomic broadcast can also solve consensus: each participant broadcasts its proposed value, and all participants decide on the first value delivered. Likewise, a distributed log can be implemented via atomic broadcast, where each instruction becomes a message. Therefore, consensus and atomic broadcast are equivalent problems; an algorithm capable of solving one can solve the other.

No Perfect Algorithm Exists! - The FLP Impossibility

Of the many principles governing distributed systems, the FLP (Fischer-Lynch-Paterson) impossibility result is perhaps the most significant. It states that in an asynchronous network where even a single node can crash, no deterministic consensus algorithm can guarantee both safety (correctness) and liveness (termination) simultaneously.

Given that crash faults are always a possibility in large systems, and real-world networks are inherently asynchronous, this implies that no consensus algorithm can truly be considered universally applicable in practice. Even the most well-known algorithms, like Paxos and Raft, must make compromises and assumptions to function. Almost all practical consensus algorithms prioritize safety over liveness and assume non-Byzantine faults.

ZooKeeper

ZooKeeper is a system designed to help other distributed systems solve problems such as:

  • Distributed Locking: Analogous to mutexes in parallel computing, distributed locks allow a node to acquire and release a lock on a specific object.
  • Leader Election: A group of participants can elect a leader among themselves, with consistent knowledge of who that leader is.
  • Group Membership: Participants can maintain a consistent view of the other members in their group.

It's important to note that all of these problems are equivalent to consensus. If leader election could be solved in a true FLP setting, so could consensus. Similarly, a perfect distributed locking algorithm would make consensus trivial (which is why consensus is relatively straightforward in parallel computing).

However, it's crucial to understand that ZooKeeper relies on certain assumptions about the system and doesn't solve these problems perfectly. To achieve availability and performance, ZooKeeper compromises on both safety and liveness, as we'll explore later.

A Coordination Kernel

ZooKeeper itself doesn't directly solve end-user problems, nor is it a ready-to-use distributed application. Instead, it provides a "coordination kernel", a framework upon which higher-level distributed systems can be built.

For example, Apache Kafka is a higher-level system that leverages ZooKeeper for its underlying coordination. Because ZooKeeper isn't perfect, Kafka inherits some of its limitations. However, it functions effectively in practice. By the end of this discussion, you will see why ZooKeeper, though far from perfect, is still the best coordination kernel we have.

Z-Nodes

In ZooKeeper, data objects are called z-nodes. They are organized hierarchically, much like a file system. Each z-node contains data and can optionally have child z-nodes. ZooKeeper thus acts as a data store that supports single-data-item (z-node) transactions.

Z-nodes come in two types:

  1. Regular: These z-nodes are explicitly created and deleted by clients using API calls.
  2. Ephemeral: These z-nodes are also created explicitly by clients, but they are deleted either explicitly or automatically when the session that created them ends (for example, because the session timed out). Ephemeral z-nodes are thus tied to client sessions; if a client disconnects, the ephemeral z-nodes it created are automatically removed.

ZooKeeper maintains the following information for each z-node:

  • Type: Whether the z-node is regular or ephemeral.
  • Metadata: Timestamps and version numbers.
  • Data: The data stored in the z-node (limited to 1 MB by default; the limit is configurable).
  • Children: The z-nodes that are children of this z-node.
  • Counter: A monotonically increasing counter used to name sequential child z-nodes. Sequential children of the same parent are assigned strictly increasing sequence numbers.

Watcher

ZooKeeper provides a "watch" mechanism, allowing clients to monitor a z-node for changes (including creation and deletion). It's important to note that watch notifications only signal that an update has occurred; they do not include the actual modified data. The client must subsequently read the z-node to retrieve the updated information.
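
A minimal sketch using the Apache ZooKeeper Java client shows the pattern: register a watch, receive a data-free notification, and then read again. The connection string and the /app/config path are illustrative, and the z-node is assumed to exist.

```java
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class WatchExample {
    public static void main(String[] args) throws Exception {
        // Session timeout of 30 s; the default watcher ignores connection events here.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});

        Watcher watcher = event -> {
            // The notification carries only the path and event type, not the new data.
            System.out.println("z-node changed: " + event.getPath());
            // The client must issue another read to fetch the updated content
            // (and can re-register the watch while doing so).
        };

        // Read the current data and register a one-shot watch for future changes.
        Stat stat = new Stat();
        byte[] data = zk.getData("/app/config", watcher, stat);
        System.out.println("read " + data.length + " bytes at version " + stat.getVersion());
    }
}
```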

Sessions

ZooKeeper manages client connections using sessions, which are governed by timeouts. If a session times out, ZooKeeper assumes the client has closed and takes appropriate actions (e.g., deleting ephemeral nodes).

However, this assumption is flawed in real-world asynchronous environments, as network delays or transient failures can cause timeouts even if the client is still active. This is a key area where ZooKeeper compromises safety.

To mitigate session timeouts, the ZooKeeper client library sends heartbeats every s/3 seconds and switches to a new server if it hasn't received a response within 2s/3 seconds, where s represents the session timeout length.
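
To make the timing rule concrete, a trivial sketch with an illustrative session timeout (the real client library's internals are more involved):

```java
public class SessionTiming {
    public static void main(String[] args) {
        long sessionTimeoutMs = 30_000;                       // s: session timeout chosen by the client
        long heartbeatIntervalMs = sessionTimeoutMs / 3;      // send a heartbeat every s/3 = 10 s
        long switchServerAfterMs = 2 * sessionTimeoutMs / 3;  // abandon a silent server after 2s/3 = 20 s
        System.out.println(heartbeatIntervalMs + " ms, " + switchServerAfterMs + " ms");
    }
}
```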

Client API

ZooKeeper provides the following z-node APIs:

  • create (path, data, flags): Creates a z-node at the specified path with the given data. Flags are used to specify the z-node type (e.g., ephemeral, sequential).
  • delete (path, version): Deletes the z-node at the specified path, but only if its current version matches the provided version. Version numbers act like generation numbers, preventing accidental deletions.
  • exists (path, watch): Returns true if a z-node exists at the specified path. The optional watch flag enables a watch on the z-node.
  • getData (path, watch): Returns the data and metadata of the z-node at the specified path. The optional watch flag enables a watch on the z-node.
  • setData (path, data, version): Sets the data of the z-node at the specified path to the given data, but only if its current version matches the provided version. This allows the operation to function as a compare-and-set (CAS) operation for read-write transactions.
  • getChildren (path, watch): Returns the list of child z-nodes for the z-node at the specified path. The optional watch flag enables a watch on the child list.
  • sync (path): Waits for all pending updates initiated before this operation to propagate to the server the client is currently connected to.

A key characteristic of these APIs is that they don't return a file handle or descriptor. There are no Open() or Close() operations on z-nodes. All operations are atomic transactions in themselves. No locks are held on z-nodes beyond the duration of the operation. All reads and writes are complete: full reads and full writes.
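
Here is a short end-to-end sketch of these calls using the Apache ZooKeeper Java client; the connection string, paths, and data are illustrative, and error handling is kept minimal:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ApiExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});

        // create: a regular z-node with some data.
        zk.create("/app", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // getData: a full read of the data plus metadata (including the version).
        Stat stat = new Stat();
        byte[] data = zk.getData("/app", false, stat);

        // setData: conditional on the version we just read -- a compare-and-set.
        zk.setData("/app", "v2".getBytes(), stat.getVersion());

        // delete: also conditional on the version (-1 would skip the check).
        Stat latest = zk.exists("/app", false);
        zk.delete("/app", latest.getVersion());

        zk.close();
    }
}
```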

Instruction Idempotency

All client operations in ZooKeeper are designed to be idempotent. Idempotency is a highly desirable property in distributed systems. An idempotent operation can be executed multiple times without changing the result beyond the initial application. For example, the following operations are idempotent:

  • A := A * 1
  • A := A + 0
  • A := c (where c is a constant)

The following operation is not idempotent: A := A + 1

ZooKeeper client instructions are idempotent at the server level. For example, the setData() operation includes a version number. After the operation is executed once, subsequent identical executions will have no effect because the version number will have changed.

It's important to note that while the server-side processing of ZooKeeper operations is idempotent, the client API calls themselves are not always idempotent. For instance, using the sequential flag to create sequential nodes results in non-idempotent behavior at the API level. However, the server internally translates these requests into idempotent operations.
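
A hedged sketch of that translation (not ZooKeeper's actual internal wire format): the leader resolves the non-deterministic part, the sequence number, up front, so the resulting log entry can be replayed safely.

```java
// Client-level request: "create a child of pathPrefix with the SEQUENTIAL flag".
// The leader resolves the counter once, producing an idempotent log entry that
// names the exact path to create. Replaying that entry has no further effect.
record CreateTxn(String exactPath, byte[] data) {}

class Leader {
    private int nextSequence = 0;

    CreateTxn toIdempotentTxn(String pathPrefix, byte[] data, boolean sequential) {
        String path = sequential
                ? String.format("%s%010d", pathPrefix, nextSequence++)
                : pathPrefix;
        return new CreateTxn(path, data);
    }
}

class FollowerState {
    private final java.util.Set<String> existing = new java.util.HashSet<>();

    void apply(CreateTxn txn) {
        // Applying the same transaction twice leaves the state unchanged.
        existing.add(txn.exactPath());
    }
}
```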

Guarantees

ZooKeeper offers two guarantees:

  1. Linearizability: All writes to z-nodes are linearizable. Because ZooKeeper stores single data items (z-nodes), linearizability is equivalent to strict serializability in this context. For reads, ZooKeeper provides two modes:

    • Synchronous Reads (or Strong Reads): Offer linearizability but can be slower.
    • Relaxed Reads: Do not guarantee linearizability but are generally faster.
  2. FIFO Client Order: All requests originating from a single client are executed in the order they were sent by that client. This is relatively straightforward to implement on the client side.

Implementation Details

This section details ZooKeeper's internal workings.

For high availability, ZooKeeper is fully replicated across all its nodes.

At its core, ZooKeeper implements a distributed log, which contains a sequence of instructions to be applied to the z-node data structure. This entire z-node data structure is held in memory and periodically snapshotted to persistent storage. Each instruction in the distributed log corresponds to a modification of a specific z-node. A log entry is first committed to storage before it's applied to the in-memory data structure. A ZooKeeper replica only responds to a client request after all these steps are complete.

Implementing this distributed log requires a consensus algorithm. ZooKeeper uses the ZooKeeper Atomic Broadcast (ZAB) protocol for this purpose.

ZooKeeper Atomic Broadcast (ZAB)

ZAB is the atomic broadcast algorithm used by ZooKeeper to construct its distributed logs. ZAB, like other atomic broadcast algorithms, relies on certain assumptions about the system.

ZAB operates in two phases:

  1. Leader Election Phase: ZAB elects a leader among the ZooKeeper nodes. This process uses a Paxos-like approach. Each node proposes a value, and a consensus algorithm is run to select the node that proposed the highest value as the leader. It's important to note that this initial consensus, like Paxos itself, is not perfect and can fail to terminate, thus potentially compromising the system's liveness.

    For brevity, a detailed explanation of Paxos is omitted (due to its complexity and the abundance of existing resources). For our purposes, Paxos can be thought of as a non-perfect consensus algorithm that helps nodes agree on a value, in this case, the leader.

    The result of the initial leader election can be the first entry in the distributed log. Subsequent application-specific instructions follow.

  2. Broadcast Phase: The elected leader broadcasts messages containing log entries to all other ZooKeeper servers (followers). Because all modifications are routed through the leader, they are processed in a consistent order. A user request is processed by the leader and then sent to all replicas. A request is considered committed when a majority of followers respond successfully, i.e., f + 1 commits out of 2f + 1.
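
A toy sketch of the commit rule (purely illustrative; ZAB's real pipeline involves proposals, acknowledgements, and commit messages):

```java
public class ZabQuorum {
    // With 2f + 1 servers, a proposal is committed once f + 1 of them
    // (the leader plus f followers) have acknowledged it.
    static boolean isCommitted(int acks, int totalServers) {
        int quorum = totalServers / 2 + 1; // f + 1 out of 2f + 1
        return acks >= quorum;
    }

    public static void main(String[] args) {
        System.out.println(isCommitted(2, 5)); // false: need 3 of 5
        System.out.println(isCommitted(3, 5)); // true
    }
}
```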

Handling Partition

If the leader crashes or a majority of replicas become unreachable, a new leader election occurs.  The distributed log will then contain an entry reflecting this new leadership.  A node can only be elected leader if it has caught up with all log entries that were successfully committed before the view change (the transition to a new leader).

Prior to a view change, there are 2f + 1 nodes, of which at least f + 1 are up-to-date.

After a view change (when up to f nodes become unavailable), the remaining f + 1 (or more) nodes will include at least one that is fully caught up with the latest committed log entries.

Before a new leader is elected, the new set of replicas must have at least f + 1 nodes that are caught up to this latest position. The new leader must also ensure that all lagging replicas receive all the log entries before the leader change entry itself can be committed. This is accomplished by having the most up-to-date replica broadcast the logs to everyone else. 

This also means that duplicate delivery of messages is possible. However, because all ZooKeeper operations are idempotent, duplicate message delivery has no adverse effects.

Fuzzy Snapshots

Because ZooKeeper operations are idempotent, it's possible to create fuzzy snapshots. A fuzzy snapshot is taken without needing to acquire a full lock on the in-memory database. Imagine a distributed log where entries assign values to variables (assignments to constants are idempotent).

For example, consider variables A and B, initially with values {A: 1, B: 2} -

  • A fuzzy snapshot process begins, recording the current value of A (A: 1).
  • A transaction updates A := 4. State - {A: 4, B: 2}.
  • Then, another transaction updates B := 3. State - {A: 4, B: 3}.
  • Finally, the fuzzy snapshot process records the current value of B (B:3).

The resulting fuzzy snapshot appears as {A: 1, B: 3}. This state never actually existed at any point in time in the database.

However, due to the idempotent nature of the operations, the fuzzy snapshot can be brought up to the correct final state. By applying all log entries that occurred after the snapshot began, the fuzzy snapshot can be transformed to reflect the current state {A: 4, B: 3}.
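
A minimal sketch of this replay, reusing the A/B example above (my own illustration, not ZooKeeper's snapshot code):

```java
import java.util.Map;
import java.util.TreeMap;

public class FuzzySnapshotExample {
    record Assign(String key, int value) {}

    public static void main(String[] args) {
        // Fuzzy snapshot captured while writes were in flight: {A: 1, B: 3}.
        Map<String, Integer> snapshot = new TreeMap<>(Map.of("A", 1, "B", 3));

        // Every log entry issued after the snapshot began, in log order.
        Assign[] logSinceSnapshotStart = { new Assign("A", 4), new Assign("B", 3) };

        // Replaying idempotent assignments converges to the true state {A: 4, B: 3},
        // even though some of them are already reflected in the snapshot.
        for (Assign a : logSinceSnapshotStart) {
            snapshot.put(a.key(), a.value());
        }
        System.out.println(snapshot); // {A=4, B=3}
    }
}
```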

A more detailed example specific to ZooKeeper is provided in Section 4.3 of the relevant paper.

Relaxed v/s Linearizable Reads

By default, ZooKeeper read operations are directed to a nearby replica, which might not have the latest updates from the distributed log. These "relaxed reads" can return stale data and are therefore not linearizable. The zxid (transaction id) on a replica indicates the last committed log entry.  A client can detect a potentially stale read by comparing the zxid returned with the read to a previously known zxid.

To achieve linearizable reads, ZooKeeper provides the sync operation. This operation goes through the leader and is processed like any other instruction, although it's not actually committed to the log.  sync effectively performs a read-only transaction, ensuring the client receives the latest value at that specific point in time. Having a single leader to coordinate read-only transactions ensures that such transactions fetch the latest value and reflect all writes up to that point. This single-leader approach also offers the advantage of consistent transaction ordering, aligning with client observations (i.e., external consistency).

The sync operation could be implemented straightforwardly by using the same distributed logging process as other instructions.  However, ZooKeeper optimizes this process by assuming that the leader will not lose its leadership while the sync operation is in progress. This optimization relies on leader leases.  This assumption is, again, unsafe and can lead to non-linearizable reads, thus compromising correctness.
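
A sketch of the read-after-sync pattern with the Java client; wrapping the asynchronous sync call in a latch is my own simplification:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StrongRead {
    // A read that reflects all writes committed before this call: first ask the
    // connected replica to catch up with the leader, then read locally.
    static byte[] strongRead(ZooKeeper zk, String path) throws Exception {
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();
        return zk.getData(path, false, new Stat());
    }
}
```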

While exchanging heartbeats, the client learns the latest zxid from the server. When switching to a new server, the client is guaranteed to connect only to servers with a zxid that is at least as recent as the last known zxid. This ensures the client can always find a server that is sufficiently up-to-date. These mechanisms are primarily for performance optimization and do not affect the fundamental correctness (or lack thereof, due to the sync optimization) of the system.

"Wait-Free" v/s Chubby

The paper's title mentions "wait-free". Let's explore what that signifies.

As previously discussed, clients are notified of z-node modifications. These notifications are guaranteed to be delivered eventually: all replicas will eventually catch up and notify their respective clients. This holds true even if a client switches to a new replica based on the zxid.

This eventual delivery ensures that write operations don't have to wait for every client to be notified. Write requests are committed to the transaction log and the response is immediately returned to the client. Watchers are notified asynchronously and eventually. This asynchronous notification mechanism is what constitutes "wait-free" behavior.

A contrasting approach would require a write request to block until every client watching the modified z-node has been notified, failing the transaction otherwise. This is precisely how Chubby operates. The authors argue that this synchronous notification is the reason for Chubby's slower performance and its lack of the "wait-free" property. Chubby provides a very strong and consistent notification mechanism, but this comes at the cost of scalability, especially for fine-grained locking. Chubby only sends a response back to the client after notifying all watchers (though it does incorporate timeouts, which, while improving responsiveness, introduce correctness issues, especially as the timeouts are based on leases).

ZooKeeper, on the other hand, replies to the client before notifications are sent, achieving wait-free behavior. This is the key difference between Chubby and ZooKeeper.

ZooKeeper & FLP Impossibility

ZooKeeper cannot sidestep the FLP impossibility result; it makes compromises at several points:

  • ZAB itself does not guarantee liveness during faults.
  • ZooKeeper's lease mechanism, which relies on timeouts, also compromises safety. Timeouts depend on clocks, which are not guaranteed to be synchronized across processes in an asynchronous system. Therefore, there is no true consensus on clock values. Although this compromises safety, it's a widely accepted trade-off in practice.

Timeouts are one of two common ways to implement leases in asynchronous systems (the other being explicit lease relinquishment, often involving some variant of two-phase commit). A timeout can be conceptually viewed as a message received by the lease holder. In ZooKeeper, a timeout acts as a message to the leader to release locks on ephemeral nodes. This reliance on timeouts is where safety is compromised, but, again, this compromise is generally accepted in practice.

Primitives using ZooKeeper

ZooKeeper can be used to build a variety of distributed primitives.  Among the most interesting are leader election and distributed locks. The paper also explains other primitives, such as configuration management, group membership, and double barriers, which are relatively straightforward to understand.

Leader Election

A simple leader election can be implemented by having clients attempt to create a specific z-node. The first client to successfully create the z-node is designated the leader.

A more sophisticated approach might require the leader to submit new configurations before assuming leadership. In this case, a special ready z-node can be used. Clients create z-nodes containing their portion of the new configuration. Once all configuration z-nodes are created, a client creates the ready z-node.

ZooKeeper's FIFO client order guarantee ensures that all configuration z-nodes will be created before the ready z-node. Consequently, any client that observes the ready z-node is also guaranteed to have seen all the updated configuration z-nodes.
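
A minimal sketch of the "first to create the z-node wins" pattern with the Java client. The /election/leader path is illustrative (its parent is assumed to exist), and a production implementation would also watch the z-node to detect leader failure:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElection {
    // Returns true if this client became the leader.
    static boolean tryBecomeLeader(ZooKeeper zk, String myId) throws Exception {
        try {
            // Ephemeral: if the leader's session dies, the z-node disappears
            // and another candidate can take over.
            zk.create("/election/leader", myId.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            return false; // someone else won; watch the z-node and retry on deletion
        }
    }
}
```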

Distributed Locks

In its simplest form, a lock can be implemented using ephemeral z-nodes as lock files. A client releases the lock by either explicitly deleting the z-node or implicitly when its session terminates.

However, this simple approach suffers from the "herd effect": many clients might be waiting to acquire the lock, all contending simultaneously when it becomes available.  Also, this implementation provides only exclusive locks, not shared locks.

A more robust approach involves creating z-nodes with the sequential flag -

  • Each client is assigned a position in a queue and acquires the lock in sequence.
  • Each client watches the z-node of the client immediately preceding it in the queue. A crucial detail is the loop back to line 2 of the Lock() procedure (as referenced in the paper), which handles the case where the preceding client has crashed (effectively leaving the queue).

To implement a shared lock (ReadLock), the algorithm ignores any read z-nodes ahead in the queue and only waits on the closest preceding write z-node, as in the sketch below.
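
A sketch of this lock recipe with the Java client, following the queue-of-sequential-ephemeral-z-nodes pattern described above; the lock directory is assumed to exist and error handling is omitted:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WriteLock {
    // Acquires an exclusive lock under lockDir (e.g., "/locks/resource-1") by
    // joining a queue of ephemeral, sequential z-nodes.
    static String lock(ZooKeeper zk, String lockDir) throws Exception {
        String me = zk.create(lockDir + "/lock-", new byte[0],
                              ZooDefs.Ids.OPEN_ACL_UNSAFE,
                              CreateMode.EPHEMERAL_SEQUENTIAL);
        String myName = me.substring(me.lastIndexOf('/') + 1);

        while (true) {
            List<String> children = zk.getChildren(lockDir, false);
            Collections.sort(children);
            if (myName.equals(children.get(0))) {
                return me; // lowest sequence number: lock acquired
            }
            // Watch only the z-node immediately ahead of us (avoids the herd effect).
            String prev = children.get(children.indexOf(myName) - 1);
            CountDownLatch gone = new CountDownLatch(1);
            // If exists() returns null, the predecessor vanished between the two
            // calls; loop back and re-examine the queue (the "goto line 2" step).
            if (zk.exists(lockDir + "/" + prev, event -> gone.countDown()) != null) {
                gone.await();
            }
        }
    }

    static void unlock(ZooKeeper zk, String lockZnode) throws Exception {
        zk.delete(lockZnode, -1); // -1 skips the version check
    }
}
```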

Evaluation

Finally, let's briefly discuss the paper's evaluation results. It's important to note that ZooKeeper's performance has been significantly improved since its initial release, so the reported numbers may no longer be entirely accurate.

  • Read vs. Write Throughput (Figure 5): As expected, read throughput is higher than write throughput. This is because reads are typically relaxed (and thus faster), while writes must go through the atomic broadcast protocol and be logged.
  • Factors Affecting Write Throughput: The atomic broadcast process is identified as the main limiting factor for write throughput. However, other factors, such as client communication overhead and ACL checks, also contribute.
  • Fault Tolerance: The failure of a single follower does not significantly impact overall throughput. When the leader fails, the system recovers relatively quickly (within approximately 200ms).

Paper Review

This paper covers many interesting concepts. As software engineers working on distributed systems, we typically don't deal directly with consensus and atomic broadcast protocols. Instead, we build on top of a kernel that abstracts these complexities away. In fact, we often build on primitives built on top of those kernels. Nevertheless, understanding how these underlying systems work is still valuable.

The key takeaway is that no such coordination kernel, whether ZooKeeper or Chubby, is perfect. They all operate under certain assumptions. Despite their theoretical flaws, they perform remarkably well in practice and thus form a critical foundation for distributed systems.

This landmark paper, presented at ODSI '12, has become one of Google's most significant contributions to distributed computing. It didn't solve the long-standing core problem of scalability of 2PC in distributed systems, rather, it introduced  TrueTime  that revolutionized system assumptions. Authored by J.C. Corbett , with contributions from pioneers like Jeff Dean and Sanjay Ghemawat , this paper effectively ended my exploration of distributed SQL databases. It represents the leading edge of the field. Paper Link I would highly recommend reading the following before jumping into this article: 1.  Practical Uses of Synchronized Clocks in Distributed Systems where I introduced why clock synchronization is necessary but not sufficient for external consistency. 2.  A New Presumed Commit Optimization for Two Phase Commit where I introduced two-phase commits (2PC) and how it is solved in a distributed system. 3.  Amazon Aurora: Design Considerations for High Th...