
Paper Insights #26 - Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

This work provides a strong foundation for understanding causality, both within distributed systems and more broadly. Its principles underpin systems achieving causal consistency, a powerful form of consistency that ensures high availability. Presented at SOSP 2011, this paper features contributions from prominent distributed systems researchers Wyatt Lloyd and Michael Freedman.

Paper Link

Let's begin with some basic concepts.

Causal Ordering

In 1978, Leslie Lamport published Time, Clocks, and the Ordering of Events in a Distributed System, a seminal paper that significantly impacted distributed system design. This work, alongside Paxos and TLA+, stands as one of Lamport's most influential contributions.

A fundamental challenge in distributed systems is clock synchronization. Perfect synchronization is unattainable, a fact rooted in both computer science and physics. However, the goal isn't perfect synchronization itself, but rather the ability to totally order events. With synchronized clocks, we could definitively determine the temporal order of events across different nodes.

For example, if event a occurs on node 1 and event b on node 2, synchronized clocks would allow us to assign timestamps and directly compare them to establish which event happened first.

The Partial Order

Unfortunately, perfect global clock synchronization (outside of specialized solutions like Google's TrueTime) remains impractical. Consequently, we rely on the concept of partial ordering.

Let ≤ be a binary relation. Then ≤ is a partial order if:

  • ∀a: a ≤ a (reflexivity)
  • ∀a, b: (a ≤ b ∧ b ≤ a) => a = b (antisymmetry)
  • ∀a, b, c: (a ≤ b ∧ b ≤ c) => a ≤ c (transitivity)

A strict partial order (<) does not include the reflexivity condition.

Within a single system, ordering is achievable using a shared clock source and synchronization primitives like Linux futexes. However, establishing order across distributed nodes requires different approaches.

Network messages offer a mechanism for partial ordering across distributed nodes. When node 1 sends a message to node 2, it includes its current timestamp (T). Node 2, upon receipt, knows that node 1's time was at least T. If node 2's local time is less than T, it adjusts its clock to T, ensuring subsequent events have timestamps greater than T. This establishes that every event after the receipt of the message is ordered after every event before its send.

Example

Consider the following event diagram where a, b, c, and d are 4 events happening across 2 nodes:

It can be seen that:

  • a < b, c < d (order within nodes)
  • a < c, a < d (message causality and transitivity)

However, the order of b relative to c or d remains undefined; as a result, we only know a partial order.

Happened-Before Relation (Causal Order)

Lamport formalized this partial ordering with the happened-before relation (->), also known as the causal order:

  1. If events a and b occur within the same node and a precedes b, then a -> b.
  2. If event a is the sending of a message and event b is its receipt, then a -> b.
  3. Transitivity: if a -> b and b -> c, then a -> c.

This relation has the properties of a strict partial order:

  • Irreflexive (an event doesn't happen before itself)
  • Asymmetric (if a -> b, then not b -> a)
  • Transitive (as defined above)

Logical Clocks

Logical clocks aim to capture this partial ordering. 

A logical clock C is a vector of clocks where each element Ci is the clock for node i. C assigns timestamps to events such that if a -> b, then C(a) < C(b). The converse isn't necessarily true.

This condition is satisfied by:

  • Within a node, if a -> b, then Ci(a) < Ci(b).
  • Across nodes, if a (send) -> b (receive), then Ci(a) < Cj(b).

Counter-based Implementation

A simple implementation of logical clocks uses counter variables:
  • Each node maintains a counter.
  • The counter increments for each local event.
  • When node i sends a message, it includes its current counter Ci.
  • On receipt, the receiving node j updates its counter to max(Cj, Ci + 1).
Note that such counter-based logical clocks don't satisfy the strong clock condition (discussed below).
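
As a concrete illustration, here is a minimal Python sketch of such a counter-based clock. The class and method names are mine, not from Lamport's paper or COPS.

```python
class LamportClock:
    """Counter-based logical clock for a single node (illustrative sketch)."""

    def __init__(self):
        self.counter = 0

    def local_event(self):
        # Every local event advances the counter.
        self.counter += 1
        return self.counter

    def send(self):
        # A send is a local event; the returned counter is attached to the message.
        self.counter += 1
        return self.counter

    def receive(self, sender_counter):
        # On receipt, jump past the sender's counter: max(Cj, Ci + 1).
        self.counter = max(self.counter, sender_counter + 1)
        return self.counter


# Usage: a send on node 1 and the corresponding receive on node 2.
n1, n2 = LamportClock(), LamportClock()
a = n1.local_event()      # event a on node 1
t_msg = n1.send()         # the message carries node 1's counter
b = n2.receive(t_msg)     # the receive is timestamped after the send
assert a < t_msg < b
```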

Total Ordering from Partial Ordering

Logical clocks provide a partial ordering. A total ordering can be achieved by breaking ties using node IDs. For example, if a -> b on node 1 and c -> d on node 2, with no messages relating them, we can place all otherwise-unordered events from node 1 before those from node 2 and arrive at a -> b -> c -> d.
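
A hypothetical sketch of this construction: tag each event with a (counter, node ID) pair and compare pairs lexicographically. Any such extension of the partial order is a valid total order; a -> b -> c -> d above is one choice, and the sketch below produces another equally valid one.

```python
# Events tagged with (Lamport counter, node ID). With no messages between the
# nodes, b is unordered with respect to c and d; the lexicographic comparison
# picks one arbitrary but consistent total order.
events = {"a": (1, 1), "b": (2, 1), "c": (1, 2), "d": (2, 2)}
total_order = sorted(events, key=events.get)
print(total_order)  # ['a', 'c', 'b', 'd'] -- one valid extension of the partial order
```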

Strong Clock Condition

The strong clock condition requires:

  • If a -> b, then C(a) < C(b).
  • If C(a) < C(b), then a -> b.

Logical clocks cannot inherently satisfy this. For example, say an event a happened on node 1 and event b happened on node 2. Event a may have actually happened before event b in real time, yet be assigned a higher timestamp because the counters on the two nodes have advanced by different amounts.

However, it would be very hard for humans to reason about such cases where the computer believes that C(b) < C(a), yet somehow a actually happened before b.

Lamport makes use of physical clocks to satisfy the strong clock condition and make it easier for humans to reason about the ordering of events. To achieve the strong clock condition using physical clocks, specific constraints are needed:

  • Bounded individual clock drift (k). For crystal clocks, k <= 10^-6.
  • Bounded drift between clock pairs (ε).
  • Non-zero minimum network transmission time (μ).
  • The receiver's clock upon receiving a message m must be greater than the sender's clock at send time, i.e., for all i, j, and t, where t is the real physical time:
Ci(t + μ) - Cj(t) > 0

This in turn requires the following condition to hold true (omitting derivation for brevity):

ε / (1 - k) <= μ 

When the above constraints are met (and they typically are in practice), the strong clock condition is achieved by the following approach (t and t' are the real physical times of events):

  • Say Tm is the clock time at which node i sent the message m, and μ is the minimum time it takes for the message to reach node j.

Tm = Ci(t)

  • Upon receiving m at time t', node j sets
Cj(t') = max(Cj(t'), Tm + μ)
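
In code, the sender attaches Tm = Ci(t) and the receiver applies the max rule. A minimal sketch follows; the value of μ is purely illustrative.

```python
MU = 0.001  # assumed minimum network transmission time (seconds); illustrative value

def on_send(local_clock):
    # Tm: the sender's clock value at send time, carried on the message.
    return local_clock

def on_receive(local_clock, tm):
    # Cj(t') = max(Cj(t'), Tm + MU): the receiver never reads earlier than
    # Tm + MU, so the receive event is timestamped after the send event.
    return max(local_clock, tm + MU)
```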

I would highly recommend reading Lamport's paper if you'd like to understand the derivation of these formulas in detail.

Causal Consistency

For a review of general consistency models, please refer to our previous discussion on ZooKeeper. In this article, we'll focus specifically on causal consistency, particularly within the context of single-value key-value stores.

Causal consistency mandates that all operations respect the causal ordering: all clients observe writes in an order consistent with causality. In a causally consistent key-value store, this can lead to scenarios where multiple values exist for the same key, requiring eventual resolution.

Let's illustrate this with an example:

Imagine a key X initially holds the value 0. Node 1 performs two consecutive writes: W(X, 1) followed by W(X, 2). Node 2 reads X, observes the value 1 (R(X, 1)), and then writes the value 3 (W(X, 3)).

The causal relationships are as follows:

  • W(X, 1) -> R(X, 1) (the read reflects the write)
  • W(X, 1) -> W(X, 2) (ordering within Node 1)
  • R(X, 1) -> W(X, 3) (ordering within Node 2)

Crucially, the order between W(X, 2) and W(X, 3) is undefined; they are not causally related. Therefore, both X = 2 and X = 3 are valid states within a causally consistent system. Thus, Node 3 may find the value to be 2 (W(X, 3) will be ordered before W(X, 2) from its perspective), and Node 4 may find the value to be 3 (W(X, 2) will be ordered before W(X, 3) from its perspective).

This contrasts with sequential consistency, which demands a single, definitive total order for every observer. In a sequentially consistent system, all reads must return either 2 or 3, depending on the system's chosen ordering (of course, given that the reads are ordered after all the writes).

Eventual consistency is even weaker than causal consistency. See Paper Insights - TAO: Facebook's Distributed Data Store for the Social Graph to learn more.

COPS (Cluster of Order-Preserving Servers)

COPS was developed to address the ALPS principles: Availability, Low-latency, Partition-tolerance, and Scalability. It achieves this by sacrificing strong consistency in favor of causal consistency. This aligns with the CAP theorem, which demonstrates that strong consistency, availability, and partition tolerance cannot be simultaneously guaranteed. Causal consistency (to be precise, it's real-time causal consistency), is the strongest consistency level achievable under availability and partition tolerance.

While systems like Dynamo and TAO employ eventual consistency—the most relaxed model—due to its relative simplicity, COPS takes a more nuanced approach to offer causal consistency.

COPS has 2 different flavors - COPS and COPS-GT. To avoid ambiguity, COPS will henceforth refer to the enhanced COPS-GT version only.

API

COPS provides a straightforward API:

  • put(key, val, context): Stores a value associated with a key.
  • get(key, context) -> value: Retrieves the value associated with a key.
  • get_trans(<keys>, context) -> <values>: Retrieves a consistent view of multiple key-value pairs.

Despite its simple API, reasoning about the returned values in a causally consistent system can be complex, which has contributed to the limited adoption of causal consistency. In fact, COPS never became more than an academic marvel; the industry has largely been content with eventual consistency.

Architecture

COPS employs a network of globally distributed clusters, each storing all key-value pairs.

Each cluster functions as a linearizable key-value store. Within a cluster, the keyspace is split into N partitions and distributed across the cluster's nodes using consistent hashing.

The cluster's linearizable storage utilizes the FAWN KV-store, with chain replication for each key-value pair. Client write requests are routed to the chain's head, while all read requests, including dependency checks (described below), are directed to the tail.

Each key-value pair is associated with metadata:

  1. Version number
  2. Dependencies (explained below) - A list of <key, version> pairs.

The system includes three internal APIs, not directly exposed to users, but utilized by the client library:

  • put_after(key, val, [deps]) -> <bool, version>
  • get_by_version(key, version) -> <value, version, deps>
  • dep_check(key, version) -> bool

Clients interact with the key-value store exclusively through a client library, which manages all communication.

Consistency Model

The authors formally specify the causal+ consistency model in Appendix A, and I would highly recommend reading that section. It is fairly easy to follow.

Causal Order

COPS defines causal ordering (~>), similar to Lamport's happened-before relation, but tailored to its API:

  • Execution Thread: Operations a and b within a single execution thread (a single client's thread of execution) are ordered as a ~> b if a precedes b.
  • Gets From: If a put operation a is read by a get operation b, then a ~> b. This replaces Lamport's message-based causality with a read-write dependency.
  • Transitivity: The ~> relation maintains transitivity.

Conflict Handling

Concurrent writes can lead to conflicting values across replicas, as they're permitted to diverge when writes lack causal ordering. In our previous example, W(X, 2) and W(X, 3) resulted in diverging values for X.

Conflicting values are undesirable. COPS resolves these divergences using a handler function. This function takes two conflicting values for a key and returns a single, converged value.

The handler function must be:

  • Associative
  • Commutative

A simple handler function is last-writer-wins, which imposes an arbitrary but deterministic order on conflicting writes to force convergence.
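
A minimal sketch of such a handler, assuming versions are (Lamport timestamp, node ID) pairs as described later in this article; the function below is illustrative, not COPS's actual implementation.

```python
def last_writer_wins(a, b):
    """Resolve two conflicting (version, value) pairs for the same key.

    version is a (lamport_timestamp, node_id) tuple, so the comparison is
    lexicographic. Versions are unique (they include the node ID), so there
    are no ties; the handler is associative and commutative, and applying it
    in any order over conflicting writes yields the same winner on every
    replica.
    """
    return a if a[0] > b[0] else b

# Two causally unordered writes to the same key:
w2 = ((2, 1), "value written as W(X, 2)")
w3 = ((2, 2), "value written as W(X, 3)")
assert last_writer_wins(w2, w3) == last_writer_wins(w3, w2)
```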

Causal+ Consistency

This consistency model is slightly stronger than causal consistency and is called causal+ consistency.

Causal+ thus occupies a distinct position within the consistency spectrum, offering a stronger form of consistency than plain causal consistency, but weaker than strong consistency.


Versions and Dependencies

Each value associated with a key is assigned a unique version. COPS uses Lamport timestamps (as described previously) to generate these versions, ensuring that versions respect the causal order of writes to a key. When combined with a last-writer-wins tie-breaking mechanism, this yields a global ordering for all writes to a key. A version number is thus represented as <Lamport timestamp, node number>.

Formally, for writes to keys x and y at versions i and j respectively, if xi ~> yj (xi causally precedes yj), then i < j.

Progressing Property

Once a client reads a specific version for a key, subsequent reads will return that version or a later one. This is the progressing property guaranteed by causal+ consistency.

Dependencies

Causal relationships introduce dependencies between key values. If xi ~> yj, then yj depends on xi. This means if a client reads yj, it must also be able to read x at version i or a later version. A single key can have dependencies on multiple versions of other keys.

COPS maintains these dependencies as a dependency graph stored internally.

An example dependency graph.
Letters are keys and subscripts are version numbers.

Client Context

The client context tracks the client's causal knowledge, analogous to tracking received messages. It contains all previously read and written values within the client session.

  • Reads: When a client performs a read, the read version for the key is added to the client context.
  • Writes: When a client performs a write, the client provides the server with its context, ensuring the server is aware of the client's prior reads.
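
A simplified sketch of how a client library might maintain this context. The structure and names are illustrative, and versions are plain integers for brevity; the paper tracks the full set of dependencies rather than just the highest version per key.

```python
class ClientContext:
    """Tracks the versions this client session has read or written (a sketch)."""

    def __init__(self):
        self.deps = {}  # key -> highest version seen for that key

    def record_read(self, key, version):
        # A read adds the returned <key, version> to the context.
        self.deps[key] = max(self.deps.get(key, 0), version)

    def record_write(self, key, version):
        # A committed write also becomes part of the client's causal past.
        self.deps[key] = max(self.deps.get(key, 0), version)

    def dependencies(self):
        # On a put, these <key, version> pairs are sent as the write's deps.
        return list(self.deps.items())

ctx = ClientContext()
ctx.record_read("x", 2)        # the client read x at version 2
ctx.record_write("y", 3)       # the client then wrote y at version 3
print(ctx.dependencies())      # [('x', 2), ('y', 3)] accompany the next put
```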

Writing to a Key

All write operations in COPS follow a two-phase process:

  1. Synchronous Local Write: The write is immediately applied to the local cluster.
  2. Asynchronous Replication: The write is subsequently propagated to all remote clusters.

It's crucial to note that COPS clients perform only one write at a time; parallel put operations are not supported.

Local Write

Internally, a put(key, val, context) call is translated into put_after(key, val, [deps]). This operation returns a boolean success indicator and assigns a version number to the key-value pair. The dependencies are derived from the client's context.

Upon receiving the put_after request, the server:

  • Verifies that all dependent writes have been committed.
  • Commits the current write.
  • Assigns a version number.
  • Returns <success, version number> to the client.
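
A rough sketch of this local put_after handling, assuming a simple in-memory store and a per-node counter standing in for a Lamport clock. This is my reconstruction of the steps above, not the paper's code.

```python
import itertools

class KeyValueNode:
    """Illustrative primary node for a key partition."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.lamport = itertools.count(1)   # simplified; a real node runs a Lamport clock
        self.committed = {}                  # key -> list of (version, value, deps)

    def dep_satisfied(self, key, version):
        # True if <key, version> (or a later version) has been committed here.
        return any(v >= version for v, _, _ in self.committed.get(key, []))

    def put_after(self, key, val, deps):
        # 1. Verify that all dependent writes have been committed.
        if not all(self.dep_satisfied(k, v) for k, v in deps):
            return False, None
        # 2. Commit the write and assign a <Lamport timestamp, node id> version.
        version = (next(self.lamport), self.node_id)
        self.committed.setdefault(key, []).append((version, val, deps))
        # 3. Return <success, version> to the client library.
        return True, version

node = KeyValueNode(node_id=1)
ok, v_x = node.put_after("x", "hello", deps=[])
ok, v_y = node.put_after("y", "world", deps=[("x", v_x)])  # y depends on x
```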

Asynchronous Replication

Following the local write, the operation is streamed to remote clusters. The key owners in these clusters perform a similar dependency check using the dep_check(key, version) API to ensure that all dependent key-version pairs are committed before applying the write.
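
A sketch of what applying a replicated write might look like on the remote side. The function and its dep_check callback are stand-ins for the internal API; a real implementation would block on callbacks or queues rather than poll, but the dependency gate is the essential part.

```python
import time

def apply_replicated_write(store, key, val, version, deps, dep_check):
    """Apply a write streamed from another cluster once its deps are satisfied.

    store:     local dict of key -> {version: value}
    dep_check: callable(key, version) -> bool, True once <key, version>
               has been committed in this cluster.
    """
    # Wait until every dependency has been committed locally, so no local
    # client can observe this write without its causal predecessors.
    while not all(dep_check(dep_key, dep_ver) for dep_key, dep_ver in deps):
        time.sleep(0.01)
    store.setdefault(key, {})[version] = val
```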

Causal Ordering Guarantee

The write process ensures that all writes are applied in their causal order. This guarantees that if a client observes the effect of write Y, it will also observe the effect of any write X that causally precedes Y (X -> Y).

Reading a Key

To retrieve a single key, the get_by_version(key, version) API is used. Setting version to LATEST fetches the most recent value. The API response includes the accessed version and a list of its dependencies.

Upon receipt, the returned key-version pair and its associated dependencies are added to the client context as dependencies.

Examples

Before moving forward, let's run through some examples of how all of this works.

Example 1

Suppose there are three clusters. A write, x2 (key x, version 2), occurs in cluster 1 and is propagated to cluster 3. A client in cluster 3 reads x2 and writes y3. The cluster forwards y3 to cluster 2. However, cluster 2 notices that y3 depends on x2 (i.e., x2 ~> y3) and therefore must wait for cluster 1 to deliver x2 before y3 can be committed. This ensures that no client in cluster 2 sees y3 without seeing x2.

 

Example 2

Let's consider a scenario where clients in clusters 2 and 3 simultaneously write x2 and x1, with no dependencies. Cluster 1 resolves this conflict using last-writer-wins, establishing a deterministic total order. Crucially, the outcome (x1 or x2) doesn't affect causal consistency as long as all clusters converge on the same value.

For example, say x1 is the final resolved value. If x2 arrived first, cluster 1 clients will observe x2 followed by x1. Cluster 2 clients will experience a similar transition from x2 to x1. However, cluster 3 clients will only ever see x1, as x2's value is superseded before they observe it. All clusters converge to x1 in the final state.




Reading Multiple Keys

The get_trans API addresses a critical issue: ensuring consistent multi-key reads in a system with causal dependencies. A naive approach of reading each key individually can lead to inconsistencies.

Consider keys x and y, initially at versions i and j respectively. Suppose a read is issued:
  • The read for x retrieves xi.
  • However, before y is retrieved, two writes occur: a write of version k to x, causally followed by a write of version l to y (xk ~> yl).
  • The read for y retrieves yl.
The result <xi, yl> is inconsistent. While each individual read adheres to causality, the combined snapshot violates it. A consistent read would return a snapshot like <xi, yj>, <xk, yj>, or <xk, yl>.

The get_trans(<keys>, context) API resolves this through a two-phase process:
  • Phase 1: Concurrent get_by_version(key, LATEST) calls retrieve initial versions. These may be inconsistent, but the returned versions and dependencies are used to calculate causally correct versions (CCVs) for any inconsistent reads.
  • Phase 2: get_by_version(key, CCV) is invoked for keys with outdated versions, ensuring a consistent snapshot.
Re-examining the previous example, an initial read of <xi, yl> yields an inconsistency. However, yl includes a dependency on xk. In Phase 2, xk is retrieved, resulting in the consistent read <xk, yl>.

Figure 7 details the get_trans algorithm, particularly the CCV generation process.
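
Here is a condensed sketch of that two-round logic, my paraphrase rather than the paper's code, with an injected get_by_version function and integer versions for brevity. Run on the example above, the first round returns <xi, yl>, yl's dependency on xk raises x's CCV to k, and the second round fetches xk.

```python
def get_trans(keys, get_by_version, LATEST=None):
    """Two-round transactional read (illustrative sketch).

    get_by_version(key, version) -> (value, version, deps), where deps is a
    list of (key, version) pairs and version=LATEST means the newest value.
    """
    # Round 1: fetch the latest version of every key; the results may be
    # mutually inconsistent.
    results = {k: get_by_version(k, LATEST) for k in keys}

    # Compute each key's causally correct version (CCV): the newest version
    # of that key that any returned value depends on.
    ccv = {k: results[k][1] for k in keys}
    for _value, _version, deps in results.values():
        for dep_key, dep_version in deps:
            if dep_key in ccv and dep_version > ccv[dep_key]:
                ccv[dep_key] = dep_version

    # Round 2: re-fetch only the keys whose first-round version is older than
    # their CCV, yielding a causally consistent snapshot.
    for k in keys:
        if ccv[k] > results[k][1]:
            results[k] = get_by_version(k, ccv[k])

    return {k: results[k][0] for k in keys}
```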

Garbage Collection

The system's garbage collection mechanism must address several key areas to maintain efficiency and consistency.

Old Versions

To support historical reads using the get_by_version API, old versions of data must be retained. To prevent unbounded growth, a transaction timeout (trans_time) of 5 seconds is implemented. Therefore, old versions need to be preserved for a duration of trans_time plus the maximum potential clock drift (max_clock_drift) within the distributed system. This ensures that reads within the timeout window can access the necessary historical data.

For example, in the dependency graph below, x1 can be removed only after trans_time has elapsed; before that, some get_trans operation may still need to access it.



Dependencies

Beyond removing outdated versions, the system should also reclaim space occupied by fully replicated dependencies. A dependency, identified by a <key, version> pair, is considered fully replicated when it has been successfully copied to all replicas. After a designated trans_time period, a never-depend flag is applied to it, indicating it's safe for removal.

Client Metadata

Clients can safely remove fully committed dependencies from their local dependency graphs, mirroring the cluster replica cleanup process.

When a client detects a never-depend flag for a <key, version> pair (typically in a response to a get_by_version request), it can garbage collect that dependency and all of its predecessors.

For example, in the dependency graph below, when y1 is marked as never-depend, w1 and x3 are also garbage collected.
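
A small sketch of this cleanup on the client side, assuming the client stores its causal past as a graph mapping each <key, version> to the set of <key, version> pairs it directly depends on. The representation, and the assumption that y1 depends on x3 which depends on w1, are mine; the figure's exact graph isn't shown here.

```python
def garbage_collect(dep_graph, never_depend):
    """Remove a never-depend entry and everything it transitively depends on.

    dep_graph: dict mapping (key, version) -> set of (key, version) it depends on.
    never_depend: the (key, version) pair flagged never-depend by the server.
    """
    to_remove, stack = set(), [never_depend]
    while stack:
        node = stack.pop()
        if node in to_remove:
            continue
        to_remove.add(node)
        stack.extend(dep_graph.get(node, ()))   # walk this entry's predecessors
    for node in to_remove:
        dep_graph.pop(node, None)

# Assumed structure: y1 depends on x3, which depends on w1; z4 is unrelated.
graph = {("y", 1): {("x", 3)}, ("x", 3): {("w", 1)}, ("w", 1): set(), ("z", 4): set()}
garbage_collect(graph, ("y", 1))
print(sorted(graph))  # [('z', 4)] -- y1, x3, and w1 are gone
```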



Additionally, COPS maintains a global checkpoint time, which represents the earliest timestamp for which at least one put_after operation is still pending. COPS nodes provide this checkpoint time to the client library in response to API calls, and clients can leverage it to identify and clean up dependencies.

COPS - Conflict Detection

COPS-CD extends the COPS family, beyond the vanilla and COPS-GT versions, by enabling client-defined conflict resolution through application-specific handlers, replacing the default last-writer-wins approach.

Evaluation 

Performance measurements at 52k gets/sec and 20k puts/sec yielded a 99.9th percentile tail latency of ~10ms for both get and put operations.

The authors observed the impact of varying the put-to-get ratio - increased put operations led to a decrease in throughput due to higher computational demands and a larger number of dependencies, which also negatively impacted get throughput.

Similarly, a higher put frequency resulted in reduced throughput, as the accumulation of dependencies up to the global checkpoint time increased. Conversely, a lower put frequency significantly reduced the number of dependencies.

Paper Review

This paper, a product of collaboration between Princeton and Carnegie Mellon, presents a challenging read. Even grasping the final product's API requires a solid understanding of causality. Consequently, its impact on industry has been limited.

However, it remains a valuable and intellectually stimulating exercise for academics. Approach it with patience!
