Paper Insights #26 - Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS
This work provides a strong foundation for understanding causality, both within distributed systems and more broadly. Its principles underpin systems achieving causal consistency, a powerful form of consistency that ensures high availability. Presented at SOSP 2011, this paper features contributions from prominent distributed systems researchers Wyatt Lloyd and Michael Freedman.
Let's begin with some basic concepts.
Causal Ordering
In 1978, Leslie Lamport published Time, Clocks, and the Ordering of Events in a Distributed System, a seminal paper that significantly impacted distributed system design. This work, alongside Paxos and TLA+, stands as one of Lamport's most influential contributions.
A fundamental challenge in distributed systems is clock synchronization. Perfect synchronization is unattainable, a fact rooted in both computer science and physics. However, the goal isn't perfect synchronization itself, but rather the ability to totally order events. With synchronized clocks, we could definitively determine the temporal order of events across different nodes.
For example, if event a occurs on node 1 and event b on node 2, synchronized clocks would allow us to assign timestamps and directly compare them to establish which event happened first.
The Partial Order
Unfortunately, perfect global clock synchronization (outside of specialized solutions like Google's TrueTime) remains impractical. Consequently, we rely on the concept of partial ordering.
Let ≤ be a binary relation. Then ≤ is a partial order if:
- ∀a: a ≤ a (reflexivity)
- ∀a, b: (a ≤ b ∧ b ≤ a) => a = b (antisymmetry)
- ∀a, b, c: (a ≤ b ∧ b ≤ c) => a ≤ c (transitivity)
A strict partial order (<) is instead irreflexive (a < a never holds), asymmetric, and transitive.
Within a single system, ordering is achievable using a shared clock source and synchronization primitives like Linux futexes. However, establishing order across distributed nodes requires different approaches.
Network messages offer a mechanism for partial ordering across distributed nodes. When node 1 sends a message to node 2, it includes its current timestamp (T). Node 2, upon receipt, knows that node 1's time was at least T. If node 2's local time is less than T, it adjusts its clock to T, ensuring subsequent events have timestamps greater than T, thereby establishing that all events after receipt of the message are ordered after the events before send of the message.
Example
Consider the following event diagram where a, b, c, and d are 4 events happening across 2 nodes:
It can be seen that:
- a < b, c < d (order within nodes)
- a < c, a < d (message causality)
However, the relative order of b with respect to c or d remains undefined; as a result, we only know a partial order.
Happened-Before Relation (Causal Order)
Lamport formalized this partial ordering with the happened-before relation (->) also known as the causal order:
- If events a and b occur within the same node and a precedes b, then a -> b.
- If event a is the sending of a message and event b is its receipt, then a -> b.
- Transitivity: if a -> b and b -> c, then a -> c.
This relation has the properties of a strict partial order:
- Irreflexive (an event doesn't happen before itself)
- Asymmetric (a -> b and b -> a cannot both hold)
- Transitive (as defined above)
Logical Clocks
Logical clocks aim to capture this partial ordering.
A logical clock C is a system of clocks, one clock Ci per node i. C assigns timestamps to events such that if a -> b, then C(a) < C(b). The converse isn't necessarily true.
This condition is satisfied by:
- Within a node, if a -> b, then Ci(a) < Ci(b).
- Across nodes, if a (send) -> b (receive), then Ci(a) < Cj(b).
Counter-based Implementation
- Each node maintains a counter.
- The counter increments for each local event.
- When a node i is sending a message, the current counter Ci is included.
- On receipt, the receiving node j updates its counter to max(Cj, Ci + 1).
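The counter-based steps above can be sketched as a small Python class (the class and method names are illustrative, not from any particular library):

```python
# A minimal sketch of a counter-based Lamport clock.
class LamportClock:
    def __init__(self):
        self.counter = 0

    def local_event(self):
        # Each local event increments the counter.
        self.counter += 1
        return self.counter

    def send(self):
        # A send is a local event; its counter value travels with the message.
        self.counter += 1
        return self.counter

    def receive(self, sender_counter):
        # On receipt, jump past the sender's counter: max(Cj, Ci + 1).
        self.counter = max(self.counter, sender_counter + 1)
        return self.counter
```

With this rule, every event after a receive is timestamped later than every event before the corresponding send.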
Total Ordering from Partial Ordering
Logical clocks provide only a partial ordering. A total ordering can be obtained by breaking ties between concurrent events using node IDs. For example, if a -> b on node 1 and c -> d on node 2, we can order all otherwise-unordered events from node 1 before those from node 2 and arrive at a -> b -> c -> d.
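One common convention is to compare <counter, node ID> pairs lexicographically. A sketch with illustrative event names follows; note that any consistent tie-break yields a valid total order, though different conventions may order the concurrent events differently:

```python
# A total order from Lamport timestamps: compare (counter, node_id)
# pairs lexicographically. Events with the same counter are
# concurrent, and the node ID breaks the tie.
events = {
    "a": (1, 1),  # counter 1 on node 1
    "b": (2, 1),  # counter 2 on node 1
    "c": (1, 2),  # counter 1 on node 2
    "d": (2, 2),  # counter 2 on node 2
}
total_order = sorted(events, key=lambda name: events[name])
# Yields a, c, b, d: the within-node orders a < b and c < d are
# preserved, and ties (a vs c, b vs d) go to the lower node ID.
```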
Strong Clock Condition
The strong clock condition requires:
- If a -> b, then C(a) < C(b).
- If C(a) < C(b), then a -> b.
Logical clocks cannot inherently satisfy this. For example, suppose event a happens on node 1 and event b happens on node 2, with no messages connecting them. Event a may actually occur first in real time yet be assigned a higher timestamp, simply because the two nodes' counters are at different positions.
Such cases are hard for humans to reason about: the computer believes C(b) < C(a), yet a actually happened before b.
Lamport makes use of physical clocks to satisfy the strong clock condition and make the ordering of events easy for humans to reason about. To achieve the strong clock condition using physical clocks, specific constraints are needed:
- Bounded individual clock drift (k): for crystal-controlled clocks, k ≤ 10⁻⁶.
- Bounded drift between any pair of clocks (ε).
- A non-zero minimum network transmission time (μ).
- The receiver's clock upon receiving a message m must read later than the sender's clock at the time of sending, for all i, j, and all real physical times t. This holds whenever:
ε / (1 - k) ≤ μ
When the above constraints are met (and in practice they are), the strong clock condition is achieved by the following rules (t and t' are the real physical times of events):
- When node i sends a message m at time t, it attaches the timestamp Tm = Ci(t).
- Upon receiving m at time t', node j sets Cj(t') = max(Cj(t' - 0), Tm + μ), where μ is the minimum transmission time defined above.
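Under the stated assumptions, the send and receive rules can be sketched as follows (symbols mirror the text; the numeric values are illustrative):

```python
# Sketch of Lamport's physical-clock rules. Tm is the sender's
# timestamp and mu the minimum transmission time from the text.
def on_send(clock_i):
    # Tm = Ci(t): the sender's current reading travels with m.
    return clock_i

def on_receive(clock_j, Tm, mu):
    # Cj(t') = max(Cj(t' - 0), Tm + mu): the message took at least
    # mu to arrive, so the receiver must read later than Tm + mu.
    return max(clock_j, Tm + mu)
```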
Causal Consistency
For a review of general consistency models, please refer to our previous discussion on ZooKeeper. In this article, we'll focus specifically on causal consistency, particularly within the context of single-value key-value stores.
Causal consistency mandates that all operations respect the causal ordering. All clients observe the writes happening in some causal order. In a causally consistent key-value store, this can lead to scenarios where multiple values exist for the same key, requiring eventual resolution.
Let's illustrate this with an example:
Imagine a key X initially holds the value 0. Node 1 performs two consecutive writes: W(X, 1) followed by W(X, 2). Node 2 reads X, observes the value 1 (R(X, 1)), and then writes the value 3 (W(X, 3)).
The causal relationships are as follows:
- W(X, 1) -> R(X, 1) (the read reflects the write)
- W(X, 1) -> W(X, 2) (ordering within Node 1)
- R(X, 1) -> W(X, 3) (ordering within Node 2)
Crucially, the order between W(X, 2) and W(X, 3) is undefined; they are not causally related. Therefore, both X = 2 and X = 3 are valid states within a causally consistent system. Thus, Node 3 may find the value to be 2 (W(X, 3) will be ordered before W(X, 2) from its perspective), and Node 4 may find the value to be 3 (W(X, 2) will be ordered before W(X, 3) from its perspective).
This contrasts with sequential consistency, which demands a single, definitive total order for every observer. In a sequentially consistent system, all reads must return either 2 or 3, depending on the system's chosen ordering (of course, given that the reads are ordered after all the writes).
Eventual consistency is even weaker than causal consistency. See Paper Insights - TAO: Facebook's Distributed Data Store for the Social Graph to learn more.
COPS (Cluster of Order-Preserving Servers)
COPS was developed to address the ALPS principles: Availability, Low-latency, Partition-tolerance, and Scalability. It achieves this by sacrificing strong consistency in favor of causal consistency. This aligns with the CAP theorem, which demonstrates that strong consistency, availability, and partition tolerance cannot be simultaneously guaranteed. Causal consistency (to be precise, it's real-time causal consistency), is the strongest consistency level achievable under availability and partition tolerance.
While systems like Dynamo and TAO employ eventual consistency—the most relaxed model—due to its relative simplicity, COPS takes a more nuanced approach to offer causal consistency.
COPS comes in two flavors: COPS and COPS-GT (GT for "get transactions"). To avoid ambiguity, COPS will henceforth refer to the enhanced COPS-GT version only.
API
COPS provides a straightforward API:
- put(key, val, context): Stores a value associated with a key.
- get(key, context) -> value: Retrieves the value associated with a key.
- get_trans(<keys>, context) -> <values>: Retrieves a consistent view of multiple key-value pairs.
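A hypothetical sketch of the API from a client's point of view. The CopsClient class, its toy in-memory store, and its versioning scheme are illustrative stand-ins, not the paper's actual client library (which manages the context automatically, as described later):

```python
# Toy, single-process stand-in for the COPS client API.
class CopsClient:
    def __init__(self):
        self.store = {}        # stand-in for the local cluster
        self.context = set()   # <key, version> pairs seen this session
        self.clock = 0

    def put(self, key, val):
        # Writes are versioned and recorded in the context.
        self.clock += 1
        self.store[key] = (val, self.clock)
        self.context.add((key, self.clock))

    def get(self, key):
        # Reads record the returned version in the context.
        val, version = self.store[key]
        self.context.add((key, version))
        return val

    def get_trans(self, keys):
        # Toy stand-in: a real implementation must return a causally
        # consistent snapshot across all keys.
        return [self.get(k) for k in keys]
```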
Despite its simple API, reasoning about the returned values within a causally consistent system can be complex, contributing to the relative lack of widespread adoption for causal consistency. In fact, COPS never became more than an academic marvel. The CS industry has been happy with eventual consistency.
Architecture
COPS employs a network of globally distributed clusters, each storing all key-value pairs.
Each cluster functions as a linearizable key-value store. Within a cluster, the keyspace is partitioned across N nodes using consistent hashing.
The cluster's linearizable storage utilizes the FAWN KV-store, with chain replication for each key-value pair. Client write requests are routed to the chain's head, while all read requests, including dependency checks (described below), are directed to the tail.
Each key-value pair is associated with metadata:
- Version number
- Dependencies (explained below) - A list of <key, version> pairs.
The system includes three internal APIs, not directly exposed to users, but utilized by the client library:
- put_after(key, val, [deps]) -> <bool, version>
- get_by_version(key, version) -> <value, version, deps>
- dep_check(key, version) -> bool
Clients interact with the key-value store exclusively through a client library, which manages all communication.
Consistency Model
Causal Order
COPS defines causal ordering (~>), similar to Lamport's happened-before relation, but tailored to its API:
- Execution Thread: Operations a and b within a single thread of execution are ordered as a ~> b if a precedes b.
- Gets From: If a put operation a is read by a get operation b, then a ~> b. This replaces Lamport's message-based causality with a read-write dependency.
- Transitivity: The ~> relation maintains transitivity.
Conflict Handling
Concurrent writes can lead to conflicting values across replicas, as they're permitted to diverge when writes lack causal ordering. In our previous example, W(X, 2) and W(X, 3) resulted in diverging values for X.
Conflicting values are undesirable. COPS resolves these divergences using a handler function. This function takes two conflicting values for a key and returns a single, converged value.
The handler function must be:
- Associative
- Commutative
A simple handler function can be last-writer-wins to impose an arbitrary order, forcing conflict resolution.
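A minimal last-writer-wins handler might look like this (a sketch; the (value, version) pair representation is an assumption for illustration, not the paper's exact types):

```python
# Last-writer-wins conflict handler. Versions are
# (lamport_timestamp, node_id) tuples, as COPS assigns them, so the
# lexicographic comparison is deterministic: every replica picks the
# same winner, making the merge commutative and associative.
def lww_handler(a, b):
    """a and b are (value, version) pairs; the higher version wins."""
    return a if a[1] > b[1] else b

v1 = ("blue", (5, 1))  # written at Lamport time 5 on node 1
v2 = ("red", (5, 2))   # written at Lamport time 5 on node 2
# The node ID breaks the tie: (5, 2) > (5, 1), so "red" wins
# regardless of which replica runs the handler first.
```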
Causal+ Consistency
Causal consistency combined with this convergent conflict handling is called causal+ consistency, a model slightly stronger than plain causal consistency.
Causal+ thus occupies a distinct position within the consistency spectrum: stronger than causal consistency, but weaker than strong consistency.
Versions and Dependencies
Each value associated with a key is assigned a unique version. COPS uses Lamport timestamps (as described previously) to generate these versions, appending the writing node's ID to guarantee uniqueness. Combined with the last-writer-wins tie-breaking mechanism, this yields a global ordering of all writes to a key. A version number is thus represented as a <Lamport timestamp, node number> pair.
Formally, for writes to keys x and y at versions i and j respectively, if xi ~> yj (xi causally precedes yj), then i < j.
Progressing Property
Once a client reads a specific version for a key, subsequent reads will return that version or a later one. This is the progressing property guaranteed by causal+ consistency.
Dependencies
Causal relationships introduce dependencies between key values. If xi ~> yj, then yj depends on xi. This means if a client reads yj, it must also be able to read x at version i or a later version. A single key can have dependencies on multiple versions of other keys.
COPS maintains these dependencies as a dependency graph stored internally.
An example dependency graph: letters are keys and subscripts are version numbers.
Client Context
The client context tracks the client's causal knowledge, analogous to tracking received messages. It contains all previously read and written values within the client session.
- Reads: When a client performs a read, the read version for the key is added to the client context.
- Writes: When a client performs a write, the client provides the server with its context, ensuring the server is aware of the client's prior reads.
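This bookkeeping can be sketched as follows (the class and method names are illustrative); the context doubles as the dependency list attached to the client's next write:

```python
# Sketch of client-context tracking in the client library.
class ClientContext:
    def __init__(self):
        self.seen = {}  # key -> latest version read or written

    def on_read(self, key, version):
        # Reads add the returned <key, version> to the context,
        # keeping the newest version seen per key.
        self.seen[key] = max(self.seen.get(key, version), version)

    def on_write(self, key, version):
        # Writes record the version the server assigned.
        self.seen[key] = version

    def dependencies(self):
        # Sent to the server with the next put so it knows the
        # client's prior reads and writes.
        return sorted(self.seen.items())
```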
Writing to a Key
All write operations in COPS follow a two-phase process:
- Synchronous Local Write: The write is immediately applied to the local cluster.
- Asynchronous Replication: The write is subsequently propagated to all remote clusters.
It's crucial to note that COPS clients perform only one write at a time; parallel put operations are not supported.
Local Write
Internally, a put(key, val, context) call is translated into put_after(key, val, [deps]). This operation returns a boolean success indicator and assigns a version number to the key-value pair. The dependencies are derived from the client's context.
Upon receiving the put_after request, the server:
- Verifies that all dependent writes have been committed.
- Commits the current write.
- Assigns a version number.
- Returns <success, version number> to the client.
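These steps can be sketched as follows. The data structures and the in-process dependency check are illustrative simplifications of the server-side put_after handler, not the paper's implementation:

```python
# Simplified server-side put_after: check deps, commit, version.
committed = {}   # key -> highest committed (lamport, node_id) version
lamport = [0]    # node-local Lamport counter
NODE_ID = 1

def put_after(key, val, deps):
    # 1. Verify all dependencies are already committed locally.
    for dep_key, dep_version in deps:
        if committed.get(dep_key, (0, 0)) < dep_version:
            return (False, None)
    # 2-3. Commit the write and assign a <Lamport, node ID> version.
    lamport[0] += 1
    version = (lamport[0], NODE_ID)
    committed[key] = version
    # 4. Return <success, version number> to the client.
    return (True, version)
```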
Asynchronous Replication
Following the local write, the operation is streamed to remote clusters. The key owners in these clusters perform a similar dependency check using the dep_check(key, version) API to ensure that all dependent key-version pairs are committed before applying the write.
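A sketch of how a remote cluster might gate a replicated write on dep_check. The polling loop and names are illustrative; a real implementation would block on commit notifications rather than poll:

```python
# Simplified dep_check: wait until the named version is committed
# locally, so a replicated write is applied only after everything it
# depends on.
import time

def dep_check(store, key, version, timeout=1.0, poll=0.01):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if store.get(key, (0, 0)) >= version:
            return True       # dependency committed
        time.sleep(poll)      # wait for replication to catch up
    return False
```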
Causal Ordering Guarantee
The write process ensures that all writes are applied in their causal order. This guarantees that if a client observes the effect of write Y, it will also observe the effect of any write X that causally precedes Y (X -> Y).
Reading a Key
A get(key, context) call is translated by the client library into get_by_version(key, LATEST) and is served entirely by the local cluster; single-key reads never block on remote state.
Examples
Example 1
Example 2
Reading Multiple Keys
A sequence of individual gets can observe a causally inconsistent snapshot. For example:
- The read for x retrieves xi.
- Before y is retrieved, two further writes occur: a write of version k to x, causally followed by a write of version l to y (xk ~> yl).
- The read for y then retrieves yl, which depends on xk, a newer version of x than the one already returned.
The get_trans operation avoids this with a two-phase algorithm:
- Phase 1: Concurrent get_by_version(key, LATEST) calls retrieve initial versions. These may be mutually inconsistent, but the returned versions and dependency lists are used to compute a causally correct version (CCV) for any inconsistently read key.
- Phase 2: get_by_version(key, CCV) is invoked for keys whose first-phase version was too old, ensuring a consistent snapshot.
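The two phases can be sketched as follows, with the inconsistent first round simulated by an `observed` map (the store layout and names are illustrative):

```python
# Sketch of two-round get_trans. `store` maps
# key -> {version: (value, deps)}; `observed` simulates the
# (possibly inconsistent) versions returned by the concurrent
# first round.
def get_trans(store, keys, observed):
    # Round 1: look up value and dependencies for each observed version.
    results = {k: (observed[k],) + store[k][observed[k]] for k in keys}
    # A key's causally correct version (CCV) is the newest version of
    # it that any fetched value depends on.
    ccv = dict(observed)
    for k in keys:
        for dep_key, dep_ver in results[k][2]:
            if dep_key in ccv and dep_ver > ccv[dep_key]:
                ccv[dep_key] = dep_ver
    # Round 2: re-fetch by explicit version where round 1 was stale.
    for k in keys:
        if ccv[k] > results[k][0]:
            results[k] = (ccv[k],) + store[k][ccv[k]]
    return {k: results[k][1] for k in keys}
```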
Garbage Collection
The system's garbage collection mechanism must address several key areas to maintain efficiency and consistency.
Old Versions
To support historical reads using the get_by_version API, old versions of data must be retained. To prevent unbounded growth, a transaction timeout (trans_time) of 5 seconds is implemented. Therefore, old versions need to be preserved for a duration of trans_time plus the maximum potential clock drift (max_clock_drift) within the distributed system. This ensures that reads within the timeout window can access the necessary historical data.
Dependencies
Client Metadata
COPS - Conflict Detection
Evaluation
Performance measurements at 52k gets/sec and 20k puts/sec yielded a 99.9th percentile tail latency of ~10ms for both get and put operations.
The authors also observed the impact of varying the put-to-get ratio: a larger share of puts decreased throughput, both because puts are computationally heavier and because they create more dependencies, which in turn reduces get throughput.
Similarly, a higher put frequency reduced throughput, as more dependencies accumulate before the global checkpoint time allows them to be discarded; a lower put frequency significantly reduced the number of dependencies.
Paper Review
This paper, a product of collaboration between Princeton and Carnegie Mellon, presents a challenging read. Even grasping the final product's API requires a solid understanding of causality. Consequently, its impact on industry has been limited.
However, it remains a valuable and intellectually stimulating exercise for academics. Approach it with patience!