
Paper Insights #34 - CRDTs: Consistency without Concurrency Control

Authored in 2009, this noteworthy paper from INRIA (the French National Institute for Research in Computer Science and Automation) was presented at the prestigious IEEE ICDCS. It introduces compelling design concepts for distributed systems.

Paper Link

Let's jump straight into CRDTs.

CRDT

Conflict-free Replicated Data Types (CRDTs) ensure data consistency across multiple replicas without the need for explicit conflict resolution. The acronym itself breaks down as:
  • Conflict-free: Indicating that concurrent updates from different replicas will always converge to the same final state.
  • Replicated: Highlighting that the data is distributed and maintained across several independent nodes.
Imagine a shared data structure that exists on multiple nodes. To manage updates to this data structure consistently, CRDTs typically provide the following core interface:

interface CRDT {
   Apply(T update);
   Merge(CRDT other);
};

Here, we have two essential operations:
  • Apply(update): This operation incorporates a local update to the data. The update is combined with the current local state using a specific operator, and the result becomes the new local state.
  • Merge(other): This operation integrates the state of another CRDT instance (other) into the current local state. This merging also utilizes a specific operator, resulting in a new local state that reflects both previous states.
The fundamental principles behind CRDTs are:
  • All updates originating from clients are initially applied locally.
  • Subsequently, these updates are propagated asynchronously to other replicas through various communication protocols. The crucial guarantee is that once all updates have been delivered and applied across all nodes, the final state of the data structure will be identical everywhere.
To achieve this eventual consistency regardless of the order or timing of updates, the operator used in both Apply and Merge must possess specific mathematical properties:
  • Commutativity: The order in which updates are applied doesn't affect the final outcome; that is, for updates a and b, f(a, b) = f(b, a). This allows nodes to process updates in any sequence.
  • Associativity: When multiple updates are combined, the grouping of operations doesn't change the result; that is, for updates a, b, and c, f(a, f(b, c)) = f(f(a, b), c). This ensures that merging states from different nodes is consistent.
  • Idempotency: Applying the same update multiple times has the same effect as applying it once. This prevents duplicate updates from causing unintended changes in the data's state.
A diverse range of CRDT implementations exists, each tailored for different data structures and update semantics.

Grow-Only Counter (G-Counter)

The Grow-Only Counter, or G-Counter, is a straightforward CRDT that allows only increasing integer values. Its simplicity makes it a foundational example.

Implementation

  • Each replica maintains a map (s) where keys are node identifiers and values are the current counter observed from that node.
  • The Apply() operation increments the counter value associated with the local node:
s[self] = s[self] + 1
  • The Merge() operation updates each node's counter by taking the maximum of the locally known value and the value from another replica:
s[i] = max(s[i], other.s[i]) for all nodes i.
  • The counter's overall value is read as the sum of all per-node entries in the map.

The nature of the merge operation ensures that G-Counters are commutative, associative, and idempotent.
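To make this concrete, below is a minimal Python sketch of a G-Counter. The class and method names are my own, not taken from the paper.

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.s = {}  # node id -> count contributed by that node

    def apply(self):
        # Local increment: bump this replica's own slot.
        self.s[self.node_id] = self.s.get(self.node_id, 0) + 1

    def merge(self, other):
        # Entry-wise max: max is commutative, associative, and idempotent.
        for node, count in other.s.items():
            self.s[node] = max(self.s.get(node, 0), count)

    def value(self):
        # The counter's value is the sum of all per-node counts.
        return sum(self.s.values())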

Positive-Negative Counter (PN-Counter)

To accommodate both increments and decrements, we can extend the G-Counter concept with the Positive-Negative Counter (PN-Counter). Naively extending the Apply() operation to allow decrements would break the max-based Merge(): once per-node values can decrease, taking the maximum can silently discard decrements, so replicas no longer converge correctly.

Implementation

To keep the merge well-behaved, a PN-Counter employs two G-Counters: a positive counter (P) that records increments and a negative counter (N) that records decrements. Each replica increments its own entry in the appropriate G-Counter.

The Merge() operation takes the maximum value for each node's positive counter and the maximum value for each node's negative counter. The overall value of the PN-Counter is the sum of all positive counter values minus the sum of all negative counter values.
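A minimal sketch of a PN-Counter, reusing the GCounter sketch above (again, the names are illustrative rather than the paper's):

class PNCounter:
    def __init__(self, node_id):
        self.p = GCounter(node_id)  # records increments
        self.n = GCounter(node_id)  # records decrements

    def increment(self):
        self.p.apply()

    def decrement(self):
        self.n.apply()

    def merge(self, other):
        self.p.merge(other.p)
        self.n.merge(other.n)

    def value(self):
        # Total increments minus total decrements across all replicas.
        return self.p.value() - self.n.value()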

Grow-Only Unordered Set

The Grow-Only Unordered Set mirrors the behavior of the G-Counter but for sets. It only allows adding elements.

Implementation

  • Each replica maintains a single set.
  • The Apply() operation adds new elements to the local set using a union operation.
  • The Merge() operation combines sets from different replicas using set union.
Due to the properties of set union, Grow-Only Unordered Sets are commutative, associative, and idempotent.
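A grow-only set is little more than a wrapper around set union; a minimal sketch (illustrative naming, not from the paper):

class GSet:
    def __init__(self):
        self.elements = set()

    def apply(self, element):
        self.elements.add(element)

    def merge(self, other):
        # Set union is commutative, associative, and idempotent.
        self.elements |= other.elements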

2-Phase Unordered Set (2P-Set)

Analogous to the PN-Counter, the 2-Phase Unordered Set (2P-Set) introduces the ability to remove elements. It maintains two sets: an added set and a removed set.

Implementation

  • Adding an element places it in the added set.
  • Removing an element adds it to the removed set. Once an element is in the removed set, it cannot be added back.
  • The Merge() operation takes the union of the added sets and the union of the removed sets from the merging replicas. The effective set contains elements present in the merged added set but not in the merged removed set.
A significant drawback of the 2P-Set is that the removed set can grow indefinitely.
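A minimal 2P-Set sketch built from two grow-only sets (illustrative; not the paper's code):

class TwoPhaseSet:
    def __init__(self):
        self.added = set()
        self.removed = set()

    def add(self, element):
        self.added.add(element)

    def remove(self, element):
        # Removal is permanent: once tombstoned, the element can never return.
        self.removed.add(element)

    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed

    def contains(self, element):
        return element in self.added and element not in self.removed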

Last Writer Wins (LWW)

The Last Writer Wins (LWW) strategy, often applied to sets or key-value stores, uses timestamps to resolve conflicts. Each operation is associated with a timestamp.

LWW can be used to build a set that, unlike the 2P-Set, allows removed elements to be re-added. When an element is added or removed, the operation is timestamped. During a merge, if two conflicting operations (add and remove) for the same element are encountered, the operation with the later timestamp wins.

The LWW strategy is particularly crucial in eventually consistent key-value stores like Dynamo. When concurrent writes occur for the same key, replicas eventually converge to the value associated with the latest timestamp. This ensures that a single, consistent version of each key's value is maintained across the distributed system.
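One common realization of this idea is often called an LWW-element-set. Below is a minimal sketch, assuming timestamps are comparable across replicas and ties are broken consistently (e.g., by replica ID):

class LWWSet:
    def __init__(self):
        self.add_ts = {}     # element -> latest add timestamp seen
        self.remove_ts = {}  # element -> latest remove timestamp seen

    def add(self, element, ts):
        self.add_ts[element] = max(self.add_ts.get(element, ts), ts)

    def remove(self, element, ts):
        self.remove_ts[element] = max(self.remove_ts.get(element, ts), ts)

    def merge(self, other):
        for e, ts in other.add_ts.items():
            self.add(e, ts)
        for e, ts in other.remove_ts.items():
            self.remove(e, ts)

    def contains(self, element):
        if element not in self.add_ts:
            return False
        # Present if the latest add is newer than the latest remove.
        return self.add_ts[element] > self.remove_ts.get(element, float("-inf"))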

Sequence CRDTs

While the CRDTs discussed so far (counters and basic sets) illustrate fundamental principles, they often fall short of the requirements of more complex applications. A particularly interesting and non-trivial CRDT is the ordered set, which is the central topic of discussion in this paper.

An ordered set is a data structure where the elements maintain a specific sequence that isn't inherent to the elements themselves. Unlike a sorted set where order is determined by element properties, the order in an ordered set is explicitly defined by client operations. For instance, an ordered set could contain the English alphabet in a non-alphabetical order: {d, a, c, b}. The elements themselves are treated as opaque values.

The key challenge with ordered sets lies in allowing clients to specify the precise location where a new element should be inserted within the existing order. Consider an ordered set {a, b}. A client might request the insertion of element c between a and b, resulting in the new order {a, c, b}.

To achieve this fine-grained control over ordering, ordered sets typically require clients to tag each element with a unique identifier. This identifier space must be unbounded, meaning that for any two identifiers A and B, there exist infinitely many identifiers Z such that A < Z < B. These identifiers serve as the basis for determining the position of elements within the ordered set.

Consequently, each element in the ordered set effectively becomes a pair: <ID, element>, where the ID dictates its position in the sequence.

Let's revisit the example of inserting c between a and b. If a has an identifier of 1 and b has an identifier of 2, then when inserting c, the client can assign it any identifier between 1 and 2, such as 1.5. This approach inherently ensures the CRDT properties:
  • Commutativity: The order of concurrent insertions with distinct identifiers doesn't affect the final ordered set.
  • Associativity: Merging sets in different orders yields the same final ordered set because each element's position is determined by its unique identifier.
  • Idempotency: Adding the same <ID, element> pair multiple times has the same effect as adding it once, due to the set-like nature of the underlying data structure (we don't have duplicates with the same ID).
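A toy sketch of this identifier scheme, using exact fractions as an idealized dense identifier space (purely illustrative; TreeDoc, discussed below, uses tree paths instead):

from fractions import Fraction

# Each element is stored as an (ID, element) pair; position is determined by the ID.
doc = [(Fraction(1), "a"), (Fraction(2), "b")]

def insert_between(doc, left_id, right_id, element):
    new_id = (left_id + right_id) / 2   # always strictly between the two IDs
    doc.append((new_id, element))
    doc.sort(key=lambda pair: pair[0])

insert_between(doc, Fraction(1), Fraction(2), "c")
print([(str(i), e) for i, e in doc])  # [('1', 'a'), ('3/2', 'c'), ('2', 'b')]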

Resolving Identifier Conflicts

Despite the use of unbounded identifiers, conflicts can still arise. For example, two different nodes might concurrently attempt to insert elements c and d between a and b, and both might assign the same identifier, say 1.5, to their respective elements. Several strategies can be employed to resolve such conflicts:
  • Timestamping (Last Writer Wins): Similar to the LWW approach discussed earlier, each insertion operation can be timestamped. In case of an identifier collision, the element associated with the later timestamp takes precedence in the ordering.
  • Disambiguator ID: Each node is assigned a unique identifier. This identifier can be used to break ties.

TreeDoc: Leveraging Trees for Ordered Sets

In our previous examples, we conceptually used real numbers as identifiers to represent the ordering of elements. However, relying on floating-point numbers in computer systems presents challenges due to potential precision limitations. This could lead to a finite identifier space, violating the crucial unbounded property required for seamlessly inserting elements between any two existing ones.

A more robust approach involves using binary strings as identifiers. Binary strings offer a natural way to represent order and can easily guarantee an unbounded identifier space.

Consider two elements with binary identifiers 10 and 11. In the ordering TreeDoc derives from these strings (an infix tree traversal, described next), we can always generate new identifiers that fall between them, such as 101 or 110, and those can themselves be extended further (1011, 1100, and so on). This process can be repeated indefinitely, ensuring an unbounded identifier space.

Tree Representation in TreeDoc

These binary string identifiers can be elegantly organized into a tree structure, which is the core idea behind TreeDoc. Each binary string identifier corresponds to a path from the root of the tree. The order of the identifiers (and thus the elements) is determined by performing an infix traversal of this tree.


In the diagram above, the order of elements is {a, c, b, f, d, g, e}, with IDs `00`, `0`, `01`, `` (the empty path, i.e., the root), `10`, `1`, and `11` respectively.
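Below is a minimal sketch of such a tree and the infix traversal that produces this order (the field names and structure are my own, not the paper's):

class Node:
    def __init__(self, element=None):
        self.element = element  # None means the position holds no element
        self.left = None        # subtree ordered before this node
        self.right = None       # subtree ordered after this node

def infix(node, path=""):
    # Yield (id, element) pairs in document order; the ID is the root-to-node path.
    if node is None:
        return
    yield from infix(node.left, path + "0")
    if node.element is not None:
        yield (path, node.element)
    yield from infix(node.right, path + "1")

# Rebuilding the example above: infix order is a, c, b, f, d, g, e.
root = Node("f")
root.left = Node("c"); root.left.left = Node("a"); root.left.right = Node("b")
root.right = Node("g"); root.right.left = Node("d"); root.right.right = Node("e")
print([e for _, e in infix(root)])  # ['a', 'c', 'b', 'f', 'd', 'g', 'e']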

Insertions

When inserting a new element between two elements with existing identifiers X and Y, the TreeDoc algorithm traverses the tree to find the leftmost available position between the nodes corresponding to X and Y. The new element is inserted at that position.
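Continuing the Node sketch above, one simple placement rule for inserting immediately after a given node is shown below; this is a sketch of the idea, not necessarily the paper's exact algorithm.

def insert_after(prev_node, element):
    # Place the new element immediately after prev_node in infix order.
    if prev_node.right is None:
        prev_node.right = Node(element)
    else:
        # Otherwise, attach it as the leftmost leaf of prev_node's right subtree,
        # i.e., just before prev_node's current successor.
        node = prev_node.right
        while node.left is not None:
            node = node.left
        node.left = Node(element)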

Handling Concurrent Insertions with the Same Identifier

A potential conflict arises when multiple writers concurrently attempt to insert elements and inadvertently assign them the same identifier (resulting in the same position in the tree). To address this, a node in the TreeDoc tree isn't a single element but rather a multi-node. This multi-node can contain multiple elements that share the same tree position (same ID). The order among these co-located elements within the multi-node is then determined using a disambiguator ID, as discussed earlier (e.g., using node identifiers to establish a consistent secondary ordering).


In the diagram above, both d1 and d2 share the same node position. The order of elements is {a, c, b, f, d1, d2, g}.
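A sketch of how a multi-node might be represented (illustrative; the disambiguator is assumed to be a replica identifier):

class MultiNode:
    def __init__(self):
        self.elements = {}  # disambiguator (e.g., replica id) -> element
        self.left = None
        self.right = None

    def ordered_elements(self):
        # Co-located elements are ordered by their disambiguator.
        return [self.elements[d] for d in sorted(self.elements)]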

Deletions in TreeDoc

To delete an element, TreeDoc does not physically remove the corresponding node from the tree; instead, the node associated with the element's identifier is marked as a tombstone. During an infix traversal, tombstoned nodes are skipped, representing the deletion of the element while preserving the tree structure and the identifiers of the other elements.


In the diagram above, the order of elements is {a, b, f, g, e}.
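Continuing the earlier Node sketch, a deletion can be modeled by clearing the node's element, which the infix traversal above already skips (illustrative only):

def delete(node):
    # Tombstone: the position (and hence every other element's ID) is preserved,
    # but the content is gone and the traversal skips it.
    node.element = None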

Rebalancing

Over time, the TreeDoc structure may become skewed or accumulate an excessive number of tombstoned elements. To address this, a rebalancing operation called flattening is performed. This process restructures the tree and, importantly, assigns new identifiers to all elements.


In the diagram above, the identifiers for a, g, and e change from `00` to `0`, `1` to ``, and `11` to `1` respectively.

Because flattening changes every element's identifier, it requires a consistent state across all replicas: the data structure must be locked on all replicas while flattening is in progress. The authors use Paxos Commit to coordinate the flattening transaction. Note that Paxos Commit is a commit protocol built on top of the Paxos consensus algorithm, not a consensus mechanism itself.
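The purely local part of flattening can be sketched as collecting the live elements in order and rebuilding a balanced tree, which implicitly assigns every element a fresh ID; this ignores the cross-replica coordination, which is the hard part. Continuing the earlier sketch:

def flatten(root):
    elements = [e for _, e in infix(root)]
    return build_balanced(elements)

def build_balanced(elements):
    if not elements:
        return None
    mid = len(elements) // 2
    node = Node(elements[mid])
    node.left = build_balanced(elements[:mid])
    node.right = build_balanced(elements[mid + 1:])
    return node

Applied to the example above, flattening the live elements {a, g, e} yields the new IDs `0`, ``, and `1`, matching the diagram.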

Simultaneous Text-Edit

TreeDoc represents an ordered set of elements, making it particularly well-suited for collaborative text editing by multiple users concurrently.

Consider a scenario where the current shared text is abc.
  • Replica 1 performs an insertion of the character e between a and b.
  • Simultaneously, Replica 2 performs an insertion of the character f between a and b.
Upon asynchronous broadcast of these operations, the final state of the shared text across all replicas will be updated to aefbc, as illustrated in the diagram below.


The insertions of e and f represent a concurrent conflict. This conflict is resolved based on disambiguators associated with each replica. Since the disambiguator for Replica 1 is less than the disambiguator for Replica 2, the insertion from Replica 1 (e) is ordered before the insertion from Replica 2 (f).

Q. Can Sequence CRDTs help in Distributed Logs?

No, Sequence CRDTs are not suitable for distributed logs. Distributed logs require that all replicas observe every operation in precisely the same sequential order (a total order). While Sequence CRDTs, such as ordered sets, guarantee eventual convergence to the same final state across replicas, they do not enforce a consistent order of operations.

CRDTs operate within a mathematical structure called a semi-lattice, characterized by a partial order of operations. Each replica can observe any valid linearization of this partial order, leading to eventual consistency. However, different replicas can (and often do) observe these operations in different sequences.

Consider the example in the diagram above. Assume three operations, a, b, and c, are performed on a distributed data structure.
  • Replica 1 observes the updates in the order: {a}, then {a, b}, then {a, b, c}.
  • Replica 2 observes the updates in the order: {b}, then {b, c}, then {a, b, c}.
  • Replica 3 observes the updates in the order: {c}, then {c, a}, then {a, b, c}.
As this example illustrates, although all replicas eventually reach the same final state ({a, b, c}), they process the individual operations (a, b, and c) in different orders. This lack of a globally consistent order of operations makes Sequence CRDTs unsuitable for distributed logs, where the order of events is critical for consensus and maintaining a consistent history.

Large-Scale Implementation of Sequence CRDT

The authors have deployed TreeDoc at a significant scale, accommodating clients globally. A key performance consideration for sequence CRDTs is the flattening operation, which necessitates a commit. To enable efficient commits across geographically dispersed nodes, the authors strategically segmented their implementation into two distinct site types:
  • Nebula Sites: These geographically distributed sites primarily handle the real-time streaming of local updates both back to the centralized core site and to other nebula sites.
  • Core Site: This centralized site aggregates updates from all nebula sites and also disseminates updates back to them. Critically, only nodes within the core site participate in the flattening operation, which requires a commit protocol.
The system operates in discrete time periods called epochs. Each site operates within a specific epoch, and the epoch number is incremented after every flattening operation performed by the core site. This epoch-based system ensures that the ID space is unique for each epoch. Importantly, while the core site is undergoing the flattening process, nebula sites can continue to accept and process local updates.

Communication and update propagation are streamlined based on the current epoch. Any two sites residing in the same epoch can directly exchange updates. This is feasible because within a given epoch, the ID space remains consistent, guaranteeing that all insertions and deletions are correctly applied and remain commutative.

Following a successful flattening operation by the core site, it broadcasts an update signaling the new epoch. This can lead to the following scenarios:
  • A nebula site sends an update from a previous epoch: The core site will reject such updates, as it only accepts updates corresponding to the current epoch. The nebula site will need to undergo its own flattening process and receive new IDs for its pending updates before they can be successfully transmitted to the core site in the new epoch.
  • The core site sends updates from the new epoch to a nebula site: The nebula site must acknowledge the epoch change and prepare to accept updates from the new epoch. This typically involves the nebula site also performing a flattening operation to align its internal state with the new ID space before applying the incoming updates from the core site's new epoch.
Let's walk through an example to understand these concepts (the detailed algorithm can be found in Section 4 of the paper).

Say the core site has the following CRDT: a (`00`), b (`0`), and c (``), which is flattened to a new ID space: a (`0`), b (``), and c (`1`).



Say the nebula sites started with the same CRDT, but two local updates were applied:
  • b (`0`) was deleted.
  • d (`1`) was added.



If the nebula sites send delete `0` and add <`1`, d>, those updates will be rejected by the core site (otherwise it would end up deleting a and failing to add the duplicate ID `1`). Upon detecting this rejection, the nebula sites run flattening themselves, computing the new ID space. After flattening, the insertion of d is assigned a new ID in the new space, and the updates are re-sent in terms of that space. In the example above, the re-sent updates are delete `` and add <`11`, d>.

Paper Review

This paper is quite straightforward to read and understand. It's also likely one of the shortest featured on this blog. Given the widespread implementation of CRDTs across the industry, I highly recommend reading it as it presents an example of a non-trivial CRDT.
