This paper is a significant contribution to the field of Distributed Systems. ZooKeeper's design is based on principles similar to Google's Chubby, however, it also incorporates key differences. This work was presented at Usenix Annual Technical Conference (ATC) '10. The work can be seen as an open-source counterpart to Chubby. At the time of its publication, decentralized approaches, such as those using consistent hashing (e.g., Dynamo), were waning in popularity, and centralized systems were gaining traction. Both Google's Chubby and Apache ZooKeeper emerged during this shift.
Recommended Read: Paper Insights - Paxos Made Simple where Paxos, a consensus algorithm, was discussed in detail.
Let's begin with some fundamental concepts.
1. Distributed Logs
Distributed systems rely on distributed logs, which serve as the single source of truth for the operations performed. These logs are crucial for replicated state machines, where, to maintain consistency across nodes, it's essential that they execute the same set of operations in a consistent order. Typically, the operations recorded in a distributed log correspond to modifications of data stored on a node.
Illustration 1: Distributed logs in RSM.
Illustration 1 above depicts three nodes, each holding a versioned data item, X. These replicas sequentially apply instructions from the distributed log. Instructions 1, 3, and 4 are blind-writes. Instruction 2 is read-write. While the state of the memory location holding X may differ across replicas at any given point in time, once all instructions from the log have been applied, all replicas will converge to the same final state.
2. Single-Data Transactions
- Single-data transactions: These transactions operate on a single object atomically.
- Multi-data transactions: These transactions affect multiple objects atomically.
This article focuses solely on single-data transactions.
2.1. Implementation using Distributed Logs
Single-data item transactions can itself take two forms:
- Blind write (W): A direct assignment of a value.
- Read-then-write (R-W): Reading a value and subsequently writing a new value. There may be some decision making based on the value read.
Distributed logs can implement both types of single-data item transactions. For a blind write (W), the log entry might be represented as
key := value
indicating that value should be assigned to the data item identified by the key. For a read-then-write (R-W), the log entry could be
IF v(key) = version THEN key := value
signifying a compare-and-set (CAS) operation: the value of the data item associated with key is updated to value only if the current version matches the version. The version signifies the value that was read.
Note that distributed logs are effective for single-data item transactions only, where the entire data item resides on the node(s) responsible for it. For multi-item transactions (or general database transactions), a combination of distributed logging and two-phase commit (2PC) is required. The discussion of 2PC is beyond the scope of this article.
2.2. Consistency Models
There are two ways to think about a consistency model in a single data item read-write system:
- Let's say there are clients observing all writes that are happening - what is the effective order in which the clients will see the writes?
- What would be the effective state of the objects in the data store once all the writes have been applied?
Several types of ordering are possible:
- 
Total Order: All transactions are ordered linearly and there is only a single valid order. All clients will see that same order and there will be only a single effective state. 
- 
Causal Order: A partial order based on event causality. For example (shown in illustration 2), if client C1 performs write W1 and then sends a message to clients C2 and C3, and subsequently C2 performs write W2 and C3 performs write W3, then, according to Lamport's happened-before relationship, W1 happened before both W2 and W3. Therefore, W1 must appear before W2 and W3 in any valid causal order. Both {W1, W2, W3} and {W1, W3, W2} are valid causal orderings, but {W2, W1, W3} is not valid. 
Illustration 2: Causal order.
- 
FIFO Order: A partial order ensuring that writes from the same client are ordered. For example (shown in illustration 3), if a client performs write W1 and then write W2, the final order must have W1 before W2. Writes from other clients (like W3 in the previous example) can occur at any point in the sequence relative to W1 and W2. 
Illustration 3: FIFO order.
2.2.1. Linearizability (External Consistency) - Temporal Total Order
The strongest single-object consistency model, linearizability, guarantees i
mmediate external visibility of transaction results (a.k.a. external consistency).2.2.2. Sequential Consistency - Total Order
Sequential consistency is a strong consistency model ensuring that operations appear to have occurred in some total order.
A helpful analogy (attributed to Dr. D. Mazieres) illustrates the difference between linearizability and sequential consistency:
Suppose you successfully commit a transaction that adds a new item to a key-value store. You then call a friend, asking them to lookup the new data item. If your friend's read fails, the store violates linearizability (external consistency).
Many distributed datastores, including ZooKeeper (with its relaxed reads, as we'll discuss), do not guarantee linearizability. However, they still maintain sequential consistency, ensuring that the final state reflects some total order of the executed transactions. For example, if your friend then attempts to add the same data items again (with the same key), the store must reject either yours or your friend's transaction depending on whose transaction the store decides to order first. This avoids duplicate keys and thereby makes the store at least sequentially consistent.
In other words, sequential consistency guarantees a total order of transactions without strict temporal (or externally-visible) semantics.
2.2.3. Causal Consistency - Causal Order
At this level, the total order of transactions is broken, and only a partial order can be maintained based on causality. Causal consistency orders transactions according to their causal relationships. This weaker consistency model is available during network partitions.
See Paper Insights - Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS to learn more about causal consistency.
2.2.4. PRAM Consistency (FIFO Consistency) - FIFO Order
PRAM consistency maintains partial ordering based on the principle that writes executed by the same process are totally ordered.2.2.5. Eventual Consistency
Lastly, there are other consistency models, such as read-your-writes (RYW), and eventual. See Paper Insights - TAO: Facebook's Distributed Data Store for the Social Graph to learn more about them.
2.2.6. Bonus: Strict Serializability
While this discussion focuses on single-item reads and writes, it's worth noting that extending linearizability to multi-item transactions yields strict serializability.
Illustration 4 shows the different consistency models discussed so far.
3. Distributed Log Algorithms
3.1. Consensus
A consensus algorithm addresses the problem of agreeing on a common value among the nodes. Such an algorithm must satisfy the following properties:
- Termination (Liveness): Every non-faulty node eventually decides on a value.
- Integrity (Safety): If all non-faulty nodes initially proposed the same value, then all non-faulty nodes must decide on that value.
- Agreement (Safety): All non-faulty nodes must agree on the same value.
3.2. Atomic Broadcast
Atomic broadcast is a message broadcast in distributed system which ensures that:
- Validity (Liveness): If a node broadcasts a message, all nodes eventually receive it.
- Integrity (Safety): Each node receives a given message at most once.
- Agreement (Safety): If one non-faulty node receives a message, all non-faulty nodes eventually receive it.
- Total Order (Safety): Messages are delivered to all non-faulty nodes in the same total order.
In essence, atomic broadcast guarantees that all non-faulty nodes eventually receive all broadcast messages in a consistent total order. The use of "eventually" is crucial in the definition.
Beyond atomic broadcast, other broadcast protocols exist, each corresponding to a specific ordering guarantee:
- 
FIFO Broadcast: Messages originating from a single node are delivered to all nodes in the same order they were sent out. No ordering guarantees are provided for messages from different senders. 
- 
Causal Broadcast: All messages are delivered in a causally consistent order. 
3.3. Equivalence of Consensus and Atomic Broadcast
A distributed log, built using consensus, can represent the messages in an atomic broadcast, with each log entry corresponding to a broadcast message, thus ensuring total order. Conversely, an algorithm that solves atomic broadcast can also solve consensus. For instance, a distributed log can be implemented via atomic broadcast, where each instruction becomes a message. Therefore, consensus and atomic broadcast are equivalent problems. Algorithms capable of solving consensus can also solve atomic broadcast.
3.4. No Perfect Algorithm Exists! - The FLP Impossibility
Of the many principles governing distributed systems, the FLP (Fischer-Lynch-Paterson) impossibility result is perhaps the most significant. It states that in an asynchronous network where even a single node can crash, it's impossible to guarantee both safety (or correctness) and liveness (or termination) simultaneously.
Given that crash faults are always a possibility in large systems, and real-world networks are inherently asynchronous, this implies that no consensus algorithm can truly be considered universally applicable in practice. Even the most well-known algorithms, like Paxos and Raft, must make compromises and assumptions to function. Almost all practical consensus algorithms prioritize safety over liveness.
4. ZooKeeper
ZooKeeper is a system designed to help other distributed systems solve problems such as:
- Distributed Locking: Analogous to mutexes in parallel computing, distributed locks allow a node to acquire and release a lock on a specific object.
- Leader Election: A group of participants can elect a leader among themselves, with consistent knowledge of who that leader is.
- Group Membership: Participants can maintain a consistent view of the other members in their group.
All of these problems are equivalent to consensus. If leader election could be solved in a true FLP setting, so could consensus. Similarly, a perfect distributed locking algorithm would make consensus trivial, which is why consensus is relatively straightforward in parallel computing.
However, ZooKeeper relies on certain assumptions about the system and doesn't solve these problems perfectly. To achieve availability and performance, ZooKeeper compromises on both safety and liveness, as we'll explore later.
4.1. A Coordination Kernel
ZooKeeper itself doesn't directly solve end-user problems, nor is it a ready-to-use distributed application. Instead, it provides a coordination kernel, a framework upon which higher-level distributed systems can be built.
For example, Apache Kafka is a higher-level system that leverages ZooKeeper for its underlying coordination. Because ZooKeeper isn't perfect, Kafka inherits some of its limitations. However, it functions effectively in practice.
4.2. Z-Nodes
In ZooKeeper, data objects are called z-nodes. They are organized hierarchically, much like a file system, as shown in figure 1. Each z-node contains data and can optionally have child z-nodes. ZooKeeper thus acts as a data store that supports single-data-item (z-node) transactions.
Z-nodes come in two types:
- Regular: These z-nodes are explicitly created and deleted by clients using API calls.
- Ephemeral: These z-nodes are also created by clients but are deleted either explicitly or after a client-specified timeout. Ephemeral nodes are typically associated with client sessions; if the client disconnects, the ephemeral nodes it created are automatically removed.
ZooKeeper maintains the following information for each z-node:
- Type: Whether the z-node is regular or ephemeral.
- Metadata: Timestamps and version numbers.
- Data: The data stored in the z-node (default 1 MB, configurable).
- Children: The z-nodes that are children of this z-node.
- Counter: Used for creating sequential child z-nodes. Sequential child z-nodes share the same parent and are created in a specific sequence.
4.3. Watcher
4.4. Sessions
ZooKeeper manages client connections using sessions, which are governed by timeouts. If a session times out, ZooKeeper assumes the client has closed and takes appropriate actions (e.g., deleting ephemeral nodes).
However, this assumption is flawed in real-world asynchronous environments, as network delays or transient failures can cause timeouts even if the client is still active. This is a key area where ZooKeeper compromises safety.
To mitigate session timeouts, the ZooKeeper client library sends heartbeats every s/3 seconds and switches to a new server if it hasn't received a response within 2s/3 seconds, where s represents the session timeout length.
4.5. Client API
ZooKeeper provides the following z-node APIs:
- create (path, data, flags): Creates a z-node at the specified path with the given data. Flags are used to specify the z-node type (e.g., ephemeral, sequential).
- delete (path, version): Deletes the z-node at the specified path, but only if its current version matches the provided version. Version numbers act like generation numbers, preventing accidental deletions.
- exists (path, watch): Returns true if a z-node exists at the specified path. The optional watch flag enables a watch on the z-node.
- getData (path, watch): Returns the data and metadata of the z-node at the specified path. The optional watch flag enables a watch on the z-node.
- setData (path, data, version): Sets the data of the z-node at the specified path to the given data, but only if its current version matches the provided version. This allows the operation to function as a CAS operation for read-write transactions.
- getChildren (path, watch): Returns the list of child z-nodes for the z-node at the specified path. The optional watch flag enables a watch on the child list.
- sync (path): Waits for all pending updates initiated before this operation to propagate to the server the client is currently connected to.
A key characteristic of these APIs is that they don't return a file handle or descriptor. There are no Open() or Close() operations on z-nodes. All operations are atomic transactions in themselves. No locks are held on z-nodes beyond the duration of the operation. All reads and writes are complete: full reads and full writes.
4.6. Instruction Idempotency
All client operations in ZooKeeper are designed to be idempotent. Idempotency is a highly desirable property in distributed systems. An idempotent operation can be executed multiple times without changing the result beyond the initial application. For example, the following operations are idempotent:
- A := A * 1
- A := A + 0
- A := c (where c is a constant)
The following operation is not idempotent: A := A + 1
ZooKeeper client instructions are idempotent at the server level. For example, the setData() operation includes a version number. After the operation is executed once, subsequent identical executions will have no effect because the version number will have changed.
It's important to note that while the server-side processing of ZooKeeper operations is idempotent, the client API calls themselves are not always idempotent. For instance, using the sequential flag to create sequential nodes results in non-idempotent behavior at the API level. However, the server internally translates these requests into idempotent operations.
4.7. Guarantees
ZooKeeper offers two guarantees:
- 
Linearizability: All writes to z-nodes are linearizable. Because ZooKeeper stores single data items (z-nodes), linearizability is equivalent to strict serializability in this context. For reads, ZooKeeper provides two modes: - Synchronous Reads (or Strong Reads): Offer linearizability but can be slower.
- Relaxed Reads: Do not guarantee linearizability but are generally faster.
 
- 
FIFO Client Order: All requests originating from a single client are executed in the order they were sent by that client. This is relatively straightforward to implement on the client side. 
5. Implementation Details
This section details ZooKeeper's internal workings.
For high availability, ZooKeeper is fully replicated across all its nodes.
At its core, ZooKeeper implements a distributed log, which contains a sequence of instructions to be applied to the z-node data structure. This entire z-node data structure is held in memory and periodically snapshotted to persistent storage. Each instruction in the distributed log corresponds to a modification of a specific z-node. A log entry is first committed to storage before it's applied to the in-memory data structure. A ZooKeeper replica only responds to a client request after all these steps are complete.
Implementing this distributed log requires a consensus algorithm. ZooKeeper uses the ZooKeeper Atomic Broadcast (ZAB) protocol for this purpose.
5.1. ZooKeeper Atomic Broadcast
ZAB is the atomic broadcast algorithm used by ZooKeeper to construct its distributed logs. ZAB, like other atomic broadcast algorithms, relies on certain assumptions about the system.
ZAB operates in two phases:
- 
Leader Election Phase: ZAB elects a leader among the ZooKeeper nodes. This process uses a Paxos-like approach. Each node proposes a value, and a consensus algorithm is run to select the node that proposed the highest value as the leader. This initial consensus, like Paxos itself, is not perfect and can fail to terminate, thus potentially compromising the system's liveness. The result of the initial leader election can be the first entry in the distributed log. 
- 
Broadcast Phase: The elected leader broadcasts messages containing log entries to all other ZooKeeper servers (followers). Because all modifications are routed through the leader, they are processed in a consistent order. A request is considered committed when a majority of followers respond successfully, i.e., when f + 1 commits out of 2f + 1. 
5.2. Handling Partition
If the leader crashes or a majority of replicas become unreachable, a new leader election occurs. The distributed log will then contain an entry reflecting this new leadership. A node can only be elected leader if it has caught up with all log entries that were successfully committed before the view change (the transition to a new leader).
5.3. Fuzzy Snapshots
Because ZooKeeper operations are idempotent, it's possible to create fuzzy snapshots. A fuzzy snapshot is taken without needing to acquire a full lock on the in-memory database.
For example, consider data items A and B, initially with values {A: 1, B: 2} -
- A fuzzy snapshot process begins, recording the current value of A (A: 1).
- A transaction updates A := 4. New state - {A: 4, B: 2}.
- Then, another transaction updates B := 3. New state - {A: 4, B: 3}.
- Finally, the fuzzy snapshot process records the current value of B (B: 3).
The resulting fuzzy snapshot appears as {A: 1, B: 3}. This state never actually existed at any point in time in the database.
However, due to the idempotent nature of the operations, the fuzzy snapshot can be brought up to the correct final state. By applying all log entries that occurred after the snapshot began, the fuzzy snapshot can be transformed to reflect the current state {A: 4, B: 3}.
A more detailed example specific to ZooKeeper is provided in Section 4.3 of the relevant paper.
5.4. Relaxed v/s Linearizable Reads
6. "Wait-Free" v/s Chubby
The paper's title mentions "wait-free". Let's explore what that signifies.
As previously discussed, clients are notified of z-node modifications. These notifications are guaranteed for eventual delivery. "Eventually" means that all replicas will eventually catch up and notify their respective clients. This holds true even if a client switches to a new replica based on the zxid.
This eventual delivery ensures that write operations don't have to wait for every client to be notified. Write requests are committed to the transaction log and the response is immediately returned to the client. Watchers are notified asynchronously and eventually. This asynchronous notification mechanism is what constitutes "wait-free" behavior.
A contrasting approach would require write requests to block until all clients watching the modified z-node have been notified. Otherwise, the transactions would fail. This is precisely how Chubby operates. The authors argue that this synchronous notification is the reason for Chubby's slower performance and its lack of "wait-free" property. Chubby provides a very strong and consistent notification mechanism, but this comes at the cost of scalability, especially for fine-grained locking. Chubby only sends a response back to the client after notifying all watchers (though it does incorporate timeouts, which, while improving responsiveness, introduce correctness issues).
ZooKeeper, on the other hand, replies to the client before notifications are sent, achieving wait-free behavior. This is the key difference between Chubby and ZooKeeper.
7. ZooKeeper & FLP Impossibility
ZooKeeper violates the FLP impossibility result at several points:
- ZAB itself does not guarantee liveness during faults.
- ZooKeeper's lease mechanism, which relies on timeouts, also compromises safety. Timeouts depend on clocks, which are not guaranteed to be synchronized across processes in an asynchronous system. Therefore, there is no true consensus on clock values. Although this compromises safety, it's a widely accepted trade-off in practice.
Timeouts are one of two common ways to implement leases in asynchronous systems (the other being explicit lease relinquishment, often involving some variant of two-phase commit). A timeout can be conceptually viewed as a message received by the lease holder. In ZooKeeper, a timeout acts as a message to the leader to release locks on ephemeral nodes. This reliance on timeouts is where safety is compromised, but, again, this compromise is generally accepted in practice.
8. Primitives using ZooKeeper
8.1. Leader Election
8.2. Distributed Locks
- Each client is assigned a position in a queue and acquires the lock in sequence.
- Each client watches the z-node of the client immediately preceding it in the queue. A crucial detail is the loop back to line 2 in the Lock() function (as referenced in the paper). This handles the scenario where the client ahead of the current client in the queue might have crashed (effectively leaving the queue).
9. Evaluation
ZooKeeper's performance has been significantly improved since its initial release, so the reported numbers may no longer be entirely accurate.
- Read vs. Write Throughput (Figure 5): As expected, read throughput is higher than write throughput. This is because reads are typically relaxed (and thus faster), while writes must go through the atomic broadcast protocol and be logged.
- Factors Affecting Write Throughput: The atomic broadcast process is identified as the main limiting factor for write throughput. However, other factors, such as client communication overhead and ACL checks, also contribute.
- Fault Tolerance: The failure of a single follower does not significantly impact overall throughput. When the leader fails, the system recovers relatively quickly (within ~200ms).






Comments
Post a Comment