
Paper Insights #21 - ZooKeeper: Wait-free coordination for Internet-scale systems

This paper is a significant contribution to the field of Distributed Systems. While ZooKeeper's design is based on principles similar to Google's Chubby, it also incorporates key differences, which we will discuss. Authored by Patrick Hunt et al. from Yahoo Research and presented at USENIX ATC '10, this work can be seen as an open-source counterpart to Chubby.  At the time of its publication, decentralized approaches, such as those using consistent hashing (e.g., Dynamo), were waning in popularity, and centralized systems were gaining traction.  Both Google's Chubby and Apache ZooKeeper emerged during this shift.

Paper Link

Let's begin with some fundamental concepts.

Distributed Logs

Distributed systems rely on distributed logs, which serve as the single source of truth for the operations performed. These logs are crucial for replicated state machines. To maintain consistency across replicas of the same system, it's essential that they execute the same set of operations in a consistent order. Otherwise, inconsistencies will arise. Typically, the operations recorded in a distributed log correspond to modifications of data stored on a node.


Instructions 1, 3, and 4 are blind writes; instruction 2 is a read-write.

The diagram above depicts three replicas, each holding a versioned data item, X. These replicas sequentially apply instructions from the distributed log. While the state of the memory location holding X may differ across replicas at any given point in time, once all instructions from the log have been applied, all replicas will converge to the same final state.
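
To make the replicated-state-machine idea concrete, here is a minimal Java sketch (my own illustration, not from the paper) of replicas applying the same log in the same order and converging to the same state:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical log entry: a blind write assigning a value to a key.
record LogEntry(String key, int value) {}

// A replica is just the log applied, in order, to an initially empty state.
class Replica {
    private final Map<String, Integer> state = new HashMap<>();

    void apply(List<LogEntry> log) {
        for (LogEntry e : log) {
            state.put(e.key(), e.value()); // same log + same order => same final state
        }
    }

    Map<String, Integer> state() { return state; }
}

class ReplicatedLogDemo {
    public static void main(String[] args) {
        List<LogEntry> log = List.of(
                new LogEntry("X", 1), new LogEntry("X", 5), new LogEntry("Y", 2));
        Replica r1 = new Replica();
        Replica r2 = new Replica();
        r1.apply(log);
        r2.apply(log);
        System.out.println(r1.state().equals(r2.state())); // true: replicas converge
    }
}
```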

Single-Data Transactions

Transactions, in a distributed system, can take two forms:
  • Single-data transactions: These transactions operate on a single object atomically.
  • Multi-data transactions: These transactions affect multiple objects atomically.

This discussion focuses solely on single-data transactions. Multi-data transactions will be covered in a future article on databases, which I will link here.

Implementation using Distributed Logs

Single-data item transactions can themselves take two forms:

  • Direct write (W): A direct assignment of a value.
  • Read-then-write (R-W): Reading a value and subsequently writing a new value. There may be some decision making based on the value read.

Distributed logs can implement both types of single-data item transactions. For a direct write (W), the log entry might be represented as W(key, value), indicating that value should be assigned to the data item identified by key. For a read-then-write (R-W), the log entry could be CAS(key, version, value), signifying a "compare-and-set" operation: the value of the data item associated with key is updated to value only if the current version matches version. Here version corresponds to the version of the value that was read.
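
As a rough sketch of these two transaction forms (the store and method names are hypothetical, not ZooKeeper internals), a replica might apply W and CAS entries like this:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical versioned store illustrating the two single-item transaction forms.
class SingleItemStore {
    record Cell(int value, int version) {}

    private final Map<String, Cell> data = new HashMap<>();

    // W(key, value): blind write; always succeeds and bumps the version.
    void write(String key, int value) {
        Cell c = data.get(key);
        data.put(key, new Cell(value, c == null ? 1 : c.version() + 1));
    }

    // CAS(key, version, value): applies only if the stored version still matches
    // the version the client observed when it read the item.
    boolean compareAndSet(String key, int expectedVersion, int value) {
        Cell c = data.get(key);
        if (c == null || c.version() != expectedVersion) return false; // stale read
        data.put(key, new Cell(value, c.version() + 1));
        return true;
    }

    public static void main(String[] args) {
        SingleItemStore store = new SingleItemStore();
        store.write("X", 10);                                 // version becomes 1
        System.out.println(store.compareAndSet("X", 1, 11));  // true
        System.out.println(store.compareAndSet("X", 1, 12));  // false: version is now 2
    }
}
```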

It's important to note that distributed logs are effective for single-data item transactions only, where the entire data item resides on the node(s) responsible for it. For multi-item transactions (or general database transactions), a combination of distributed logging and two-phase commit (2PC) is required. However, a discussion of 2PC is beyond the scope of this article.

Consistency Models

Single-item transaction consistency models vary based on the ordering guarantees they provide. Several types of ordering are possible:

  • Total Order: All transactions are ordered linearly.

  • Causal Order: A partial order based on event causality. For example, if client C1 performs write W1 and then sends a message to clients C2 and C3, and subsequently C2 performs write W2 and C3 performs write W3, then, according to Lamport's happened-before relationship, W1 happened before both W2 and W3. Therefore, W1 must appear before W2 and W3 in any valid causal order. Both {W1, W2, W3} and {W1, W3, W2} are valid causal orderings, but {W2, W1, W3} is not valid.


  • FIFO Order: A partial order ensuring that transactions from the same client are ordered. For example, if a client performs write W1 and then write W2, the final order must have W1 before W2. Writes from other clients (like W3 in the previous example) can occur at any point in the sequence relative to W1 and W2.



Linearizability (External Consistency) - Temporal Total Order

The strongest single-object consistency model, linearizability guarantees immediate external visibility of transaction results (a.k.a. external consistency).

If a client executes transaction T1 and, only after its successful commit, executes transaction T2, then linearizability requires that the actual order of these transactions must reflect this client-observed real-time order: T1 must be ordered before T2.

Linearizability simplifies client reasoning about the system's behavior.

Sequential Consistency - Total Order

A strong consistency model ensuring that operations appear to have occurred in some total order.


A helpful analogy (attributed to Dr. D. Mazieres) illustrates the difference between linearizability and sequential consistency:

Suppose you successfully commit a transaction that adds a new item to a key-value store. You then call a friend, asking them to look up the new data item. If your friend's read fails, the store violates linearizability (external consistency).

Many distributed datastores, including ZooKeeper (with its relaxed reads, as we'll discuss), do not guarantee linearizability. However, they still maintain sequential consistency, ensuring that the final state reflects some total order of the executed transactions. For example, if your friend then attempts to add the same data item again (with the same key), the store must reject either your transaction or your friend's, depending on which one it decides to order first. This avoids duplicate keys and thereby makes the store at least sequentially consistent.

In other words, sequential consistency guarantees a total order of transactions without strict temporal (or externally-visible) semantics.

Causal Consistency - Causal Order

At this level, the total order of transactions is broken, and only a partial order can be maintained based on causality. Causal consistency orders transactions according to their causal relationships. This weaker consistency model is available during network partitions.

PRAM Consistency (FIFO Consistency) - FIFO Order

PRAM consistency maintains partial ordering based on the principle that writes executed by the same process are totally ordered.

Others

Other consistency models, such as release consistency, read-your-writes (RYW), and write-follows-reads (WFR), will be discussed elsewhere.

Bonus: Strict Serializability

While this discussion focuses on single-item reads and writes (and thus doesn't involve multi-item transactions), it's worth noting that extending linearizability to multi-item transactions yields strict serializability.

Distributed Log Algorithms

Now that we understand the role of distributed logs in implementing replicated systems capable of single-data item transactions, let's explore the algorithms that enable their creation.

In a distributed system, the participating nodes are called participants. Nodes that function correctly (i.e., are non-crashing and non-Byzantine) are considered correct participants. 

Consensus

A consensus algorithm addresses the problem of agreeing on a common value for a variable among these participants. Such an algorithm must satisfy the following properties:

  • Termination: Every correct process eventually decides on a value.
  • Integrity: If all correct processes initially proposed the same value v, then all correct processes must decide on v.
  • Agreement: All correct processes must agree on the same value.

Consensus algorithms are fundamental to building distributed logs. For example, a distributed log can be divided into slots, where each slot holds a single log entry (or instruction for the distributed system). Consensus can then be used to ensure agreement on the value (instruction) to be placed in each slot.

Atomic Broadcast

Atomic broadcast is a broadcast primitive for distributed systems that ensures:

  • Validity (Liveness): If a correct participant broadcasts a message, all correct participants eventually receive it.
  • Agreement (Safety): If one correct participant receives a message, all correct participants eventually receive it.
  • Integrity (Safety): Each participant receives a given message at most once.
  • Total Order (Safety): Messages are delivered to all correct participants in the same total order.

In essence, atomic broadcast guarantees that all correct participants eventually receive all broadcast messages in a consistent total order. The use of "eventually" is crucial in the definition.

Beyond atomic broadcast, other broadcast protocols exist, each corresponding to a specific ordering guarantee:

  • FIFO Broadcast: Messages originating from a single node are delivered to all participants in the same order they were sent out. No ordering guarantees are provided for messages from different senders.

  • Causal Broadcast: All messages are delivered in a causally consistent order.

Equivalence of Consensus and Atomic Broadcast

A distributed log, built using consensus, can carry the messages of an atomic broadcast, with each log entry corresponding to a broadcast message, thus ensuring total order. Conversely, an algorithm that solves atomic broadcast can also solve consensus: each participant broadcasts its proposed value, and all participants decide on the first value delivered. Likewise, a distributed log can be implemented via atomic broadcast, where each instruction becomes a message. Therefore, consensus and atomic broadcast are equivalent problems; an algorithm capable of solving one can solve the other.

No Perfect Algorithm Exists! - The FLP Impossibility

Of the many principles governing distributed systems, the FLP (Fischer-Lynch-Paterson) impossibility result is perhaps the most significant. It states that in an asynchronous network where even a single node can crash, no deterministic consensus algorithm can guarantee both safety (correctness) and liveness (termination) simultaneously.

Given that crash faults are always a possibility in large systems, and real-world networks are inherently asynchronous, this implies that no consensus algorithm can truly be considered universally applicable in practice. Even the most well-known algorithms, like Paxos and Raft, must make compromises and assumptions to function. Almost all practical consensus algorithms prioritize safety over liveness and assume non-Byzantine faults.

ZooKeeper

ZooKeeper is a system designed to help other distributed systems solve problems such as:

  • Distributed Locking: Analogous to mutexes in parallel computing, distributed locks allow a node to acquire and release a lock on a specific object.
  • Leader Election: A group of participants can elect a leader among themselves, with consistent knowledge of who that leader is.
  • Group Membership: Participants can maintain a consistent view of the other members in their group.

It's important to note that all of these problems are equivalent to consensus. If leader election could be solved in a true FLP setting, so could consensus. Similarly, a perfect distributed locking algorithm would make consensus trivial (which is why consensus is relatively straightforward in parallel computing).

However, it's crucial to understand that ZooKeeper relies on certain assumptions about the system and doesn't solve these problems perfectly. To achieve availability and performance, ZooKeeper compromises on both safety and liveness, as we'll explore later.

A Coordination Kernel

ZooKeeper itself doesn't directly solve end-user problems, nor is it a ready-to-use distributed application. Instead, it provides a "coordination kernel", a framework upon which higher-level distributed systems can be built.

For example, Apache Kafka is a higher-level system that leverages ZooKeeper for its underlying coordination. Because ZooKeeper isn't perfect, Kafka inherits some of its limitations. However, it functions effectively in practice. By the end of this discussion, you will see why ZooKeeper, though far from perfect, is still the best coordination kernel we have.

Z-Nodes

In ZooKeeper, data objects are called z-nodes. They are organized hierarchically, much like a file system. Each z-node contains data and can optionally have child z-nodes. ZooKeeper thus acts as a data store that supports single-data-item (z-node) transactions.

Z-nodes come in two types:

  1. Regular: These z-nodes are explicitly created and deleted by clients using API calls.
  2. Ephemeral: These z-nodes are also created explicitly by clients, but they are deleted either explicitly or automatically when the session that created them ends (for example, because the session timed out). Ephemeral z-nodes are thus tied to client sessions; if a client disconnects, the ephemeral z-nodes it created are automatically removed.

ZooKeeper maintains the following information for each z-node:

  • Type: Whether the z-node is regular or ephemeral.
  • Metadata: Timestamps and version numbers.
  • Data: The data stored in the z-node (limited to 1 MB by default; the limit is configurable).
  • Children: The z-nodes that are children of this z-node.
  • Counter: A monotonically increasing counter used to name sequential child z-nodes. Sequential children of the same parent are assigned strictly increasing sequence numbers.

Watcher

ZooKeeper provides a "watch" mechanism, allowing clients to monitor a z-node for changes (including creation and deletion). It's important to note that watch notifications only signal that an update has occurred; they do not include the actual modified data. The client must subsequently read the z-node to retrieve the updated information.
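
A minimal sketch using the Apache ZooKeeper Java client shows the pattern: register a watch, receive a data-free notification, and then read again. The connection string and the /app/config path are illustrative, and the z-node is assumed to exist.

```java
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class WatchExample {
    public static void main(String[] args) throws Exception {
        // Session timeout of 30 s; the default watcher ignores connection events here.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});

        Watcher watcher = event -> {
            // The notification carries only the path and event type, not the new data.
            System.out.println("z-node changed: " + event.getPath());
            // The client must issue another read to fetch the updated content
            // (and can re-register the watch while doing so).
        };

        // Read the current data and register a one-shot watch for future changes.
        Stat stat = new Stat();
        byte[] data = zk.getData("/app/config", watcher, stat);
        System.out.println("read " + data.length + " bytes at version " + stat.getVersion());
    }
}
```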

Sessions

ZooKeeper manages client connections using sessions, which are governed by timeouts. If a session times out, ZooKeeper assumes the client has closed and takes appropriate actions (e.g., deleting ephemeral nodes).

However, this assumption is flawed in real-world asynchronous environments, as network delays or transient failures can cause timeouts even if the client is still active. This is a key area where ZooKeeper compromises safety.

To mitigate session timeouts, the ZooKeeper client library sends heartbeats every s/3 seconds and switches to a new server if it hasn't received a response within 2s/3 seconds, where s represents the session timeout length.
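
To make the timing rule concrete, a trivial sketch with an illustrative session timeout (the real client library's internals are more involved):

```java
public class SessionTiming {
    public static void main(String[] args) {
        long sessionTimeoutMs = 30_000;                       // s: session timeout chosen by the client
        long heartbeatIntervalMs = sessionTimeoutMs / 3;      // send a heartbeat every s/3 = 10 s
        long switchServerAfterMs = 2 * sessionTimeoutMs / 3;  // abandon a silent server after 2s/3 = 20 s
        System.out.println(heartbeatIntervalMs + " ms, " + switchServerAfterMs + " ms");
    }
}
```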

Client API

ZooKeeper provides the following z-node APIs:

  • create (path, data, flags): Creates a z-node at the specified path with the given data. Flags are used to specify the z-node type (e.g., ephemeral, sequential).
  • delete (path, version): Deletes the z-node at the specified path, but only if its current version matches the provided version. Version numbers act like generation numbers, preventing accidental deletions.
  • exists (path, watch): Returns true if a z-node exists at the specified path. The optional watch flag enables a watch on the z-node.
  • getData (path, watch): Returns the data and metadata of the z-node at the specified path. The optional watch flag enables a watch on the z-node.
  • setData (path, data, version): Sets the data of the z-node at the specified path to the given data, but only if its current version matches the provided version. This allows the operation to function as a compare-and-set (CAS) operation for read-write transactions.
  • getChildren (path, watch): Returns the list of child z-nodes for the z-node at the specified path. The optional watch flag enables a watch on the child list.
  • sync (path): Waits for all pending updates initiated before this operation to propagate to the server the client is currently connected to.

A key characteristic of these APIs is that they don't return a file handle or descriptor. There are no Open() or Close() operations on z-nodes. All operations are atomic transactions in themselves. No locks are held on z-nodes beyond the duration of the operation. All reads and writes are complete: full reads and full writes.
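
Here is a short end-to-end sketch of these calls using the Apache ZooKeeper Java client; the connection string, paths, and data are illustrative, and error handling is kept minimal:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ApiExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});

        // create: a regular z-node with some data.
        zk.create("/app", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // getData: a full read of the data plus metadata (including the version).
        Stat stat = new Stat();
        byte[] data = zk.getData("/app", false, stat);

        // setData: conditional on the version we just read -- a compare-and-set.
        zk.setData("/app", "v2".getBytes(), stat.getVersion());

        // delete: also conditional on the version (-1 would skip the check).
        Stat latest = zk.exists("/app", false);
        zk.delete("/app", latest.getVersion());

        zk.close();
    }
}
```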

Instruction Idempotency

All client operations in ZooKeeper are designed to be idempotent. Idempotency is a highly desirable property in distributed systems. An idempotent operation can be executed multiple times without changing the result beyond the initial application. For example, the following operations are idempotent:

  • A := A * 1
  • A := A + 0
  • A := c (where c is a constant)

The following operation is not idempotent: A := A + 1

ZooKeeper client instructions are idempotent at the server level. For example, the setData() operation includes a version number. After the operation is executed once, subsequent identical executions will have no effect because the version number will have changed.

It's important to note that while the server-side processing of ZooKeeper operations is idempotent, the client API calls themselves are not always idempotent. For instance, using the sequential flag to create sequential nodes results in non-idempotent behavior at the API level. However, the server internally translates these requests into idempotent operations.
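
A hedged sketch of that translation (not ZooKeeper's actual internal wire format): the leader resolves the non-deterministic part, the sequence number, up front, so the resulting log entry can be replayed safely.

```java
// Client-level request: "create a child of pathPrefix with the SEQUENTIAL flag".
// The leader resolves the counter once, producing an idempotent log entry that
// names the exact path to create. Replaying that entry has no further effect.
record CreateTxn(String exactPath, byte[] data) {}

class Leader {
    private int nextSequence = 0;

    CreateTxn toIdempotentTxn(String pathPrefix, byte[] data, boolean sequential) {
        String path = sequential
                ? String.format("%s%010d", pathPrefix, nextSequence++)
                : pathPrefix;
        return new CreateTxn(path, data);
    }
}

class FollowerState {
    private final java.util.Set<String> existing = new java.util.HashSet<>();

    void apply(CreateTxn txn) {
        // Applying the same transaction twice leaves the state unchanged.
        existing.add(txn.exactPath());
    }
}
```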

Guarantees

ZooKeeper offers two guarantees:

  1. Linearizability: All writes to z-nodes are linearizable. Because ZooKeeper stores single data items (z-nodes), linearizability is equivalent to strict serializability in this context. For reads, ZooKeeper provides two modes:

    • Synchronous Reads (or Strong Reads): Offer linearizability but can be slower.
    • Relaxed Reads: Do not guarantee linearizability but are generally faster.
  2. FIFO Client Order: All requests originating from a single client are executed in the order they were sent by that client. This is relatively straightforward to implement on the client side.

Implementation Details

This section details ZooKeeper's internal workings.

For high availability, ZooKeeper is fully replicated across all its nodes.

At its core, ZooKeeper implements a distributed log, which contains a sequence of instructions to be applied to the z-node data structure. This entire z-node data structure is held in memory and periodically snapshotted to persistent storage. Each instruction in the distributed log corresponds to a modification of a specific z-node. A log entry is first committed to storage before it's applied to the in-memory data structure. A ZooKeeper replica only responds to a client request after all these steps are complete.

Implementing this distributed log requires a consensus algorithm. ZooKeeper uses the ZooKeeper Atomic Broadcast (ZAB) protocol for this purpose.

ZooKeeper Atomic Broadcast (ZAB)

ZAB is the atomic broadcast algorithm used by ZooKeeper to construct its distributed logs. ZAB, like other atomic broadcast algorithms, relies on certain assumptions about the system.

ZAB operates in two phases:

  1. Leader Election Phase: ZAB elects a leader among the ZooKeeper nodes. This process uses a Paxos-like approach. Each node proposes a value, and a consensus algorithm is run to select the node that proposed the highest value as the leader. It's important to note that this initial consensus, like Paxos itself, is not perfect and can fail to terminate, thus potentially compromising the system's liveness.

    For brevity, a detailed explanation of Paxos is omitted (due to its complexity and the abundance of existing resources). For our purposes, Paxos can be thought of as a non-perfect consensus algorithm that helps nodes agree on a value, in this case, the leader.

    The result of the initial leader election can be the first entry in the distributed log. Subsequent application-specific instructions follow.

  2. Broadcast Phase: The elected leader broadcasts messages containing log entries to all other ZooKeeper servers (followers). Because all modifications are routed through the leader, they are processed in a consistent order. A user request is processed by the leader and then sent to all replicas. A request is considered committed when a majority of followers respond successfully, i.e., f + 1 commits out of 2f + 1.
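
A toy sketch of the commit rule (purely illustrative; ZAB's real pipeline involves proposals, acknowledgements, and commit messages):

```java
public class ZabQuorum {
    // With 2f + 1 servers, a proposal is committed once f + 1 of them
    // (the leader plus f followers) have acknowledged it.
    static boolean isCommitted(int acks, int totalServers) {
        int quorum = totalServers / 2 + 1; // f + 1 out of 2f + 1
        return acks >= quorum;
    }

    public static void main(String[] args) {
        System.out.println(isCommitted(2, 5)); // false: need 3 of 5
        System.out.println(isCommitted(3, 5)); // true
    }
}
```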

Handling Partition

If the leader crashes or a majority of replicas become unreachable, a new leader election occurs.  The distributed log will then contain an entry reflecting this new leadership.  A node can only be elected leader if it has caught up with all log entries that were successfully committed before the view change (the transition to a new leader).

Prior to a view change, there are 2f + 1 nodes, of which at least f + 1 are up-to-date.

After a view change (when up to f nodes become unavailable), the remaining f + 1 (or more) nodes will include at least one that is fully caught up with the latest committed log entries.

Before a new leader is elected, the new set of replicas must have at least f + 1 nodes that are caught up to this latest position. The new leader must also ensure that all lagging replicas receive all the log entries before the leader change entry itself can be committed. This is accomplished by having the most up-to-date replica broadcast the logs to everyone else. 

This also means that duplicate delivery of messages is possible. However, because all ZooKeeper operations are idempotent, duplicate message delivery has no adverse effects.

Fuzzy Snapshots

Because ZooKeeper operations are idempotent, it's possible to create fuzzy snapshots. A fuzzy snapshot is taken without needing to acquire a full lock on the in-memory database. Imagine a distributed log where entries assign values to variables (assignments to constants are idempotent).

For example, consider variables A and B, initially with values {A: 1, B: 2} -

  • A fuzzy snapshot process begins, recording the current value of A (A: 1).
  • A transaction updates A := 4. State - {A: 4, B: 2}.
  • Then, another transaction updates B := 3. State - {A: 4, B: 3}.
  • Finally, the fuzzy snapshot process records the current value of B (B:3).

The resulting fuzzy snapshot appears as {A: 1, B: 3}. This state never actually existed at any point in time in the database.

However, due to the idempotent nature of the operations, the fuzzy snapshot can be brought up to the correct final state. By applying all log entries that occurred after the snapshot began, the fuzzy snapshot can be transformed to reflect the current state {A: 4, B: 3}.
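
A minimal sketch of this replay, reusing the A/B example above (my own illustration, not ZooKeeper's snapshot code):

```java
import java.util.Map;
import java.util.TreeMap;

public class FuzzySnapshotExample {
    record Assign(String key, int value) {}

    public static void main(String[] args) {
        // Fuzzy snapshot captured while writes were in flight: {A: 1, B: 3}.
        Map<String, Integer> snapshot = new TreeMap<>(Map.of("A", 1, "B", 3));

        // Every log entry issued after the snapshot began, in log order.
        Assign[] logSinceSnapshotStart = { new Assign("A", 4), new Assign("B", 3) };

        // Replaying idempotent assignments converges to the true state {A: 4, B: 3},
        // even though some of them are already reflected in the snapshot.
        for (Assign a : logSinceSnapshotStart) {
            snapshot.put(a.key(), a.value());
        }
        System.out.println(snapshot); // {A=4, B=3}
    }
}
```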

A more detailed example specific to ZooKeeper is provided in Section 4.3 of the relevant paper.

Relaxed v/s Linearizable Reads

By default, ZooKeeper read operations are directed to a nearby replica, which might not have the latest updates from the distributed log. These "relaxed reads" can return stale data and are therefore not linearizable. The zxid (transaction id) on a replica indicates the last committed log entry.  A client can detect a potentially stale read by comparing the zxid returned with the read to a previously known zxid.

To achieve linearizable reads, ZooKeeper provides the sync operation. This operation goes through the leader and is processed like any other instruction, although it's not actually committed to the log.  sync effectively performs a read-only transaction, ensuring the client receives the latest value at that specific point in time. Having a single leader to coordinate read-only transactions ensures that such transactions fetch the latest value and reflect all writes up to that point. This single-leader approach also offers the advantage of consistent transaction ordering, aligning with client observations (i.e., external consistency).

The sync operation could be implemented straightforwardly by using the same distributed logging process as other instructions.  However, ZooKeeper optimizes this process by assuming that the leader will not lose its leadership while the sync operation is in progress. This optimization relies on leader leases.  This assumption is, again, unsafe and can lead to non-linearizable reads, thus compromising correctness.
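
A sketch of the read-after-sync pattern with the Java client; wrapping the asynchronous sync call in a latch is my own simplification:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StrongRead {
    // A read that reflects all writes committed before this call: first ask the
    // connected replica to catch up with the leader, then read locally.
    static byte[] strongRead(ZooKeeper zk, String path) throws Exception {
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();
        return zk.getData(path, false, new Stat());
    }
}
```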

While exchanging heartbeats, the client learns the latest zxid from the server. When switching to a new server, the client is guaranteed to connect only to servers with a zxid that is at least as recent as the last known zxid. This ensures the client can always find a server that is sufficiently up-to-date. These mechanisms are primarily for performance optimization and do not affect the fundamental correctness (or lack thereof, due to the sync optimization) of the system.

"Wait-Free" v/s Chubby

The paper's title mentions "wait-free". Let's explore what that signifies.

As previously discussed, clients are notified of z-node modifications. These notifications are guaranteed to be delivered eventually: all replicas will eventually catch up and notify their respective clients. This holds true even if a client switches to a new replica based on the zxid.

This eventual delivery ensures that write operations don't have to wait for every client to be notified. Write requests are committed to the transaction log and the response is immediately returned to the client. Watchers are notified asynchronously and eventually. This asynchronous notification mechanism is what constitutes "wait-free" behavior.

A contrasting approach would require a write request to block until every client watching the modified z-node has been notified, failing the transaction otherwise. This is precisely how Chubby operates. The authors argue that this synchronous notification is the reason for Chubby's slower performance and its lack of the "wait-free" property. Chubby provides a very strong and consistent notification mechanism, but this comes at the cost of scalability, especially for fine-grained locking. Chubby only sends a response back to the client after notifying all watchers (though it does incorporate timeouts, which, while improving responsiveness, introduce correctness issues, especially as the timeouts are based on leases).

ZooKeeper, on the other hand, replies to the client before notifications are sent, achieving wait-free behavior. This is the key difference between Chubby and ZooKeeper.

ZooKeeper & FLP Impossibility

ZooKeeper cannot sidestep the FLP impossibility result; it makes compromises at several points:

  • ZAB itself does not guarantee liveness during faults.
  • ZooKeeper's lease mechanism, which relies on timeouts, also compromises safety. Timeouts depend on clocks, which are not guaranteed to be synchronized across processes in an asynchronous system. Therefore, there is no true consensus on clock values. Although this compromises safety, it's a widely accepted trade-off in practice.

Timeouts are one of two common ways to implement leases in asynchronous systems (the other being explicit lease relinquishment, often involving some variant of two-phase commit). A timeout can be conceptually viewed as a message received by the lease holder. In ZooKeeper, a timeout acts as a message to the leader to release locks on ephemeral nodes. This reliance on timeouts is where safety is compromised, but, again, this compromise is generally accepted in practice.

Primitives using ZooKeeper

ZooKeeper can be used to build a variety of distributed primitives.  Among the most interesting are leader election and distributed locks. The paper also explains other primitives, such as configuration management, group membership, and double barriers, which are relatively straightforward to understand.

Leader Election

A simple leader election can be implemented by having clients attempt to create a specific z-node. The first client to successfully create the z-node is designated the leader.

A more sophisticated approach might require the leader to submit new configurations before assuming leadership. In this case, a special ready z-node can be used. Clients create z-nodes containing their portion of the new configuration. Once all configuration z-nodes are created, a client creates the ready z-node.

ZooKeeper's FIFO client order guarantee ensures that all configuration z-nodes will be created before the ready z-node. Consequently, any client that observes the ready z-node is also guaranteed to have seen all the updated configuration z-nodes.
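
A minimal sketch of the "first to create the z-node wins" pattern with the Java client. The /election/leader path is illustrative (its parent is assumed to exist), and a production implementation would also watch the z-node to detect leader failure:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElection {
    // Returns true if this client became the leader.
    static boolean tryBecomeLeader(ZooKeeper zk, String myId) throws Exception {
        try {
            // Ephemeral: if the leader's session dies, the z-node disappears
            // and another candidate can take over.
            zk.create("/election/leader", myId.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            return false; // someone else won; watch the z-node and retry on deletion
        }
    }
}
```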

Distributed Locks

In its simplest form, a lock can be implemented using ephemeral z-nodes as lock files. A client releases the lock by either explicitly deleting the z-node or implicitly when its session terminates.

However, this simple approach suffers from the "herd effect": many clients might be waiting to acquire the lock, all contending simultaneously when it becomes available.  Also, this implementation provides only exclusive locks, not shared locks.

A more robust approach involves creating z-nodes with the sequential flag -

  • Each client is assigned a position in a queue and acquires the lock in sequence.
  • Each client watches the z-node of the client immediately preceding it in the queue. A crucial detail is the loop back to line 2 of the Lock() procedure (as referenced in the paper), which handles the case where the preceding client has crashed (effectively leaving the queue).

To implement a shared lock (ReadLock), the algorithm ignores any read z-nodes ahead in the queue and only waits on the closest preceding write z-node, as in the sketch below.
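
A sketch of this lock recipe with the Java client, following the queue-of-sequential-ephemeral-z-nodes pattern described above; the lock directory is assumed to exist and error handling is omitted:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WriteLock {
    // Acquires an exclusive lock under lockDir (e.g., "/locks/resource-1") by
    // joining a queue of ephemeral, sequential z-nodes.
    static String lock(ZooKeeper zk, String lockDir) throws Exception {
        String me = zk.create(lockDir + "/lock-", new byte[0],
                              ZooDefs.Ids.OPEN_ACL_UNSAFE,
                              CreateMode.EPHEMERAL_SEQUENTIAL);
        String myName = me.substring(me.lastIndexOf('/') + 1);

        while (true) {
            List<String> children = zk.getChildren(lockDir, false);
            Collections.sort(children);
            if (myName.equals(children.get(0))) {
                return me; // lowest sequence number: lock acquired
            }
            // Watch only the z-node immediately ahead of us (avoids the herd effect).
            String prev = children.get(children.indexOf(myName) - 1);
            CountDownLatch gone = new CountDownLatch(1);
            // If exists() returns null, the predecessor vanished between the two
            // calls; loop back and re-examine the queue (the "goto line 2" step).
            if (zk.exists(lockDir + "/" + prev, event -> gone.countDown()) != null) {
                gone.await();
            }
        }
    }

    static void unlock(ZooKeeper zk, String lockZnode) throws Exception {
        zk.delete(lockZnode, -1); // -1 skips the version check
    }
}
```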

Evaluation

Finally, let's briefly discuss the paper's evaluation results. It's important to note that ZooKeeper's performance has been significantly improved since its initial release, so the reported numbers may no longer be entirely accurate.

  • Read vs. Write Throughput (Figure 5): As expected, read throughput is higher than write throughput. This is because reads are typically relaxed (and thus faster), while writes must go through the atomic broadcast protocol and be logged.
  • Factors Affecting Write Throughput: The atomic broadcast process is identified as the main limiting factor for write throughput. However, other factors, such as client communication overhead and ACL checks, also contribute.
  • Fault Tolerance: The failure of a single follower does not significantly impact overall throughput. When the leader fails, the system recovers relatively quickly (within approximately 200ms).

Paper Review

This paper covers many interesting concepts. As software engineers working on distributed systems, we typically don't deal directly with consensus and atomic broadcast protocols. Instead, we build on top of a kernel that abstracts these complexities away. In fact, we often build on primitives built on top of those kernels. Nevertheless, understanding how these underlying systems work is still valuable.

The key takeaway is that no such coordination kernel, whether ZooKeeper or Chubby, is perfect. They all operate under certain assumptions. Despite their theoretical flaws, they perform remarkably well in practice and thus form a critical foundation for distributed systems.

This landmark paper, presented at ODSI '12, has become one of Google's most significant contributions to distributed computing. It didn't solve the long-standing core problem of scalability of 2PC in distributed systems, rather, it introduced  TrueTime  that revolutionized system assumptions. Authored by J.C. Corbett , with contributions from pioneers like Jeff Dean and Sanjay Ghemawat , this paper effectively ended my exploration of distributed SQL databases. It represents the leading edge of the field. Paper Link I would highly recommend reading the following before jumping into this article: 1.  Practical Uses of Synchronized Clocks in Distributed Systems where I introduced why clock synchronization is necessary but not sufficient for external consistency. 2.  A New Presumed Commit Optimization for Two Phase Commit where I introduced two-phase commits (2PC) and how it is solved in a distributed system. 3.  Amazon Aurora: Design Considerations for High Th...