
Paper Insights #16 - Cassandra - A Decentralized Structured Storage System

This research paper, authored by Avinash Lakshman (co-inventor of Amazon Dynamo) and Prashant Malik, originates from Facebook and dates back to 2008.

Paper Link

Cassandra, in its design, appears to be a synthesis of Amazon's Dynamo (2007) and Google's Bigtable (2006). It draws heavily upon the concepts of both systems. Notably, this paper was published during the rise of these influential databases.


Let's begin with the concepts of failure detection and gossip protocols, which are a highlight of this paper.

Failure Detectors

In the realm of computer science, the FLP impossibility result is a foundational concept. It states that in an asynchronous network where even a single node can crash, no deterministic consensus protocol can guarantee both safety and liveness (termination). Crash faults are always a possibility in large systems, and real-world networks are always asynchronous. This makes consensus, state machine replication, and atomic broadcast impossible problems to solve in full generality.

Algorithms like Paxos ensure safety but not termination. Achieving termination necessitates some degree of synchrony, whether partial or full, within the network. Consequently, all practical consensus algorithms in real-world asynchronous networks rely on assumptions about network time bounds. These bounds are crucial for making progress and ensuring termination.

Failure detectors are essential components within these algorithms. They assist in approximately identifying and isolating faulty nodes, enabling the system to continue making progress despite the presence of failures.

Failure detectors can be classified based on their:

  • Degree of Completeness:

    • Strong Completeness: Every faulty process is eventually suspected by every non-faulty process.
    • Weak Completeness: Every faulty process is eventually suspected by at least one non-faulty process.
  • Degree of Accuracy:

    • Strong Accuracy: No non-faulty process is ever falsely suspected.
    • Weak Accuracy: At least one non-faulty process is never falsely suspected by any non-faulty process.
    • Eventually Strong Accuracy: No non-faulty process is suspected after a certain period.
    • Eventually Weak Accuracy: Non-faulty processes may be falsely suspected initially, but eventually, at least one non-faulty process is never suspected by any non-faulty process.

Algorithms

SWIM (Scalable Weakly-consistent Infection-style process group Membership protocol)

SWIM is a prominent failure detection algorithm known for its strong completeness and weak accuracy. However, a detailed discussion of SWIM falls outside the scope of this write-up.

Heartbeat-Based Failure Detectors

These represent the most fundamental class of failure detectors. They operate by monitoring the regular reception of heartbeat messages from other nodes. Heartbeat-based detectors exhibit strong completeness but only achieve eventual weak accuracy. This implies that while they will eventually detect a node failure, they may temporarily suspect healthy nodes as faulty.
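
As a rough, hypothetical sketch (none of these names come from the paper), a heartbeat-based detector can be as simple as suspecting any peer whose last heartbeat is older than a fixed timeout:

  import time

  class HeartbeatFailureDetector:
      """Suspects a peer when no heartbeat has arrived within `timeout` seconds."""

      def __init__(self, timeout=5.0):
          self.timeout = timeout
          self.last_seen = {}  # peer id -> timestamp of the last heartbeat

      def heartbeat(self, peer):
          # Called whenever a heartbeat message is received from `peer`.
          self.last_seen[peer] = time.monotonic()

      def suspected(self, peer):
          # A peer never heard from, or silent for too long, is suspected.
          last = self.last_seen.get(peer)
          return last is None or (time.monotonic() - last) > self.timeout

The fixed timeout is exactly what makes such detectors only eventually weakly accurate: a slow but healthy peer can be suspected until its next heartbeat arrives.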

Ping-Ack Failure Detectors

Similar to heartbeat-based detectors, ping-ack mechanisms rely on the exchange of ping messages and acknowledgments (acks) to monitor node health. They share the same strong completeness and eventual weak accuracy characteristics as heartbeat-based detectors.

Accrual Failure Detectors

Cassandra employs Accrual Failure Detectors, which extend the concept of heartbeat-based mechanisms. Instead of emitting a binary alive/failed verdict, an accrual failure detector outputs a continuously updated suspicion level, and a tunable threshold on that level trades off accuracy (false positives) against detection time. This flexibility enables administrators to fine-tune the detector's behavior to meet specific system requirements.

φ is a suspicion metric for a machine; it grows as the probability of still receiving a heartbeat from that machine shrinks.

φ(t) = -log10(P(t))

φ(t) is the suspicion level for the machine at time t.

P(t) is the probability of receiving a heartbeat from the machine after time t (measured since the last heartbeat).

A higher φ value indicates a stronger suspicion that the machine has failed. For example:
  • φ = 1 implies a 10% probability of receiving a heartbeat.
  • φ = 2 implies a 1% probability of receiving a heartbeat.
  • φ = 3 implies a 0.1% probability of receiving a heartbeat.

and so on.

How is P(t) defined? By maintaining a sliding window of heartbeat inter-arrival times. The samples within this window form an empirical probability distribution. Then:

P(t) = Integral of this distribution from t to ∞.

Depending on its requirements, the application may apply a cutoff on φ to decide whether to treat the node as failed.
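
Below is a minimal sketch of a φ accrual detector, assuming (as the Cassandra section later suggests) that heartbeat inter-arrival times are approximated by an exponential distribution; the class name, window size, and threshold are illustrative choices, not values from the paper:

  import math
  import time
  from collections import deque

  class PhiAccrualFailureDetector:
      def __init__(self, window_size=1000):
          self.intervals = deque(maxlen=window_size)  # sliding window of inter-arrival times
          self.last_heartbeat = None

      def heartbeat(self):
          # Record the interval since the previous heartbeat.
          now = time.monotonic()
          if self.last_heartbeat is not None:
              self.intervals.append(now - self.last_heartbeat)
          self.last_heartbeat = now

      def phi(self):
          # phi(t) = -log10(P(t)), with P(t) = exp(-t / mean) under the
          # exponential assumption, so phi(t) = (t / mean) * log10(e).
          if not self.intervals or self.last_heartbeat is None:
              return 0.0
          mean = sum(self.intervals) / len(self.intervals)
          t = time.monotonic() - self.last_heartbeat
          return (t / mean) * math.log10(math.e)

      def suspected(self, threshold=8.0):
          # The application-chosen cutoff: phi >= 8 means the chance that the
          # heartbeat is merely late is below 10^-8.
          return self.phi() >= threshold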

Gossip Protocols

Gossip protocols constitute a class of broadcast protocols within p2p networks. Other classes include point-to-point (one-to-one) and eager reliable (one-to-all) protocols.

The core principle of gossip protocols involves each node periodically transmitting a message to a randomly selected subset of other nodes. This decentralized approach ensures that, with high probability, a particular message will eventually propagate throughout the entire system. This characteristic finds valuable applications in various domains, including failure detection.

Types of Gossip Protocols

Gossip protocols are typically characterized by two key tunables: the required time for message dissemination and network usage. Based on these parameters, several distinct types of gossip protocols have emerged:

  • Anti-Entropy: In anti-entropy protocols, nodes transmit their entire message set to other nodes. This approach generally results in high bandwidth consumption. To mitigate this, diffing techniques can be employed to avoid redundant data transmission. The term "anti-entropy" originates from the protocol's objective: to minimize the information divergence between nodes by exchanging complete datasets.

  • Rumor-Mongering: Rumor-mongering protocols, also known as dissemination protocols, focus on transmitting only updates or changes to the message set. This approach significantly reduces network usage, enabling more frequent data exchange.

  • Aggregation: Aggregation protocols facilitate the exchange of system-wide aggregates through sampling techniques.

Fundamentally, all gossip protocols share a common approach: they disseminate information by having each node randomly transmit messages to a subset of its peers. This decentralized strategy relies on the probabilistic nature of these random transmissions to ensure that information eventually propagates throughout the entire network.
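
The following sketch shows one such push-style gossip round (the node structure and fan-out value are my own illustrations): every period, a node sends what it knows to a few randomly chosen peers, and each peer keeps the newest version of every entry.

  import random

  FANOUT = 3  # number of peers contacted per gossip round (a tunable)

  class GossipNode:
      def __init__(self, name, peers):
          self.name = name
          self.peers = peers  # list of other GossipNode instances
          self.state = {}     # key -> (value, version)

      def update(self, key, value, version):
          # Keep only the newest version of each entry.
          current = self.state.get(key)
          if current is None or version > current[1]:
              self.state[key] = (value, version)

      def gossip_round(self):
          # Push the local state to a random subset of peers; rumor-mongering
          # variants would push only recently changed entries instead.
          for peer in random.sample(self.peers, min(FANOUT, len(self.peers))):
              for key, (value, version) in self.state.items():
                  peer.update(key, value, version)

Repeated rounds of such random pushes spread an update to all nodes in O(log N) rounds with high probability.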

Cassandra

Apache Cassandra is a wide-column store NoSQL database. These databases organize data into columns, but unlike traditional relational databases, the columns can vary from row to row.

Data Model

Cassandra employs a <row, column> key format, closely resembling Bigtable (except for the time dimension). Notably, Bigtable, Cassandra, and HBase are all fundamentally implemented as key-value stores.

Each row in a Cassandra table is uniquely identified by a string row key, typically 16 to 36 bytes long. Columns are organized into column families, which come in two forms: simple and super. Super column families can be conceptually understood as nested column families.

Regardless of the data model presented—be it tabular or graph-like—all NoSQL databases, including Cassandra, are fundamentally key-value stores.
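
As a loose illustration (not actual Cassandra syntax or API), a single row under this model can be pictured as nested maps, with a super column family adding one extra level of nesting:

  # Illustrative layout of a single row in a hypothetical "Users" table.
  row = {
      "row_key": "user:42",
      "personal": {                  # simple column family: column -> value
          "name": "Alice",
          "email": "alice@example.com",
      },
      "posts": {                     # super column family
          "post:2008-10-01": {       # super column: nested columns
              "title": "Hello",
              "body": "First post",
          },
      },
  }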

API

Cassandra's core API simplifies data interaction to three fundamental operations: 
  • insert(table, rowkey, rowMutation)
  • get(table, rowkey, columnName)
  • delete(table, rowkey, columnName)
This essentially boils down to key-value put and get operations.
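
To make the shape of these calls concrete, here is a toy in-memory stand-in (purely illustrative; it is not the real Cassandra client):

  class InMemoryStore:
      def __init__(self):
          self.tables = {}  # table -> rowkey -> column -> value

      def insert(self, table, rowkey, row_mutation):
          # Apply a mutation (a dict of column -> value) to the row.
          row = self.tables.setdefault(table, {}).setdefault(rowkey, {})
          row.update(row_mutation)

      def get(self, table, rowkey, column_name):
          return self.tables.get(table, {}).get(rowkey, {}).get(column_name)

      def delete(self, table, rowkey, column_name):
          self.tables.get(table, {}).get(rowkey, {}).pop(column_name, None)

  store = InMemoryStore()
  store.insert("Users", "user:42", {"email": "alice@example.com"})
  print(store.get("Users", "user:42", "email"))  # alice@example.com
  store.delete("Users", "user:42", "email")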

Data Structures

Cassandra's storage engine exhibits strong similarities to Bigtable's tablet structure (implemented as an LSM tree); a sketch of this write path follows the list below:
  • Utilizes an in-memory data structure to track updates concurrently with writing to a commit log.
  • Periodically, this in-memory data is flushed to disk, accompanied by summary information and a Bloom filter to accelerate query performance. Although not explicitly mentioned in the paper, it's highly probable that the on-disk data files are stored in the SSTable format. An SSTable is a serialized representation of key-value pairs, where keys are arranged in sorted order alongside their corresponding values.
  • Data files undergo periodic compaction to optimize storage and improve read performance.
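
Here is a rough sketch of that write path under the stated assumptions (SSTable-style sorted runs plus a Bloom filter); it glosses over on-disk formats, index summaries, and compaction:

  import hashlib

  class BloomFilter:
      """Tiny Bloom filter used to skip data files that cannot contain a key."""

      def __init__(self, size_bits=1024, num_hashes=3):
          self.size = size_bits
          self.num_hashes = num_hashes
          self.bits = 0

      def _positions(self, key):
          for i in range(self.num_hashes):
              digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
              yield int(digest, 16) % self.size

      def add(self, key):
          for pos in self._positions(key):
              self.bits |= 1 << pos

      def might_contain(self, key):
          return all(self.bits & (1 << pos) for pos in self._positions(key))

  class SimpleLSM:
      """Memtable + commit log + sorted on-disk runs, in the spirit of an LSM tree."""

      def __init__(self):
          self.commit_log = []  # append-only log for durability
          self.memtable = {}    # in-memory updates
          self.sstables = []    # list of (sorted key/value pairs, Bloom filter)

      def put(self, key, value):
          self.commit_log.append((key, value))  # log first, then memtable
          self.memtable[key] = value

      def flush(self):
          # Dump the memtable as a sorted run plus a Bloom filter.
          run = sorted(self.memtable.items())
          bloom = BloomFilter()
          for key, _ in run:
              bloom.add(key)
          self.sstables.append((run, bloom))
          self.memtable = {}

      def get(self, key):
          if key in self.memtable:
              return self.memtable[key]
          for run, bloom in reversed(self.sstables):  # newest run first
              if bloom.might_contain(key):
                  for k, v in run:
                      if k == key:
                          return v
          return None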

System Architecture

The architecture shares similarities with Dynamo, leveraging a Chord-style consistent hashing scheme for data placement.

It's noteworthy that Bigtable leverages Google File System (GFS) for efficient data storage and retrieval. In contrast, Cassandra does not rely on a distributed file system. This design choice may be attributed to the lack of a GFS equivalent within Facebook's infrastructure at the time of Cassandra's development.

The presence of a distributed file system could have significantly simplified Cassandra's implementation. All replicas of a given partition could have seamlessly relied on a consistent view of the data stored within the distributed file system. However, without a distributed file system, the Cassandra team had to implement a more intricate solution: consistent hashing for data partitioning and replication, coupled with sophisticated membership control mechanisms involving failure detectors and gossip protocols.

Range Assignment

As discussed above, Cassandra makes use of consistent hashing to distribute keys. Each range is assigned to not one, but N nodes, one of which acts as the coordinator for that range. This replication is required to ensure that there is no single point of failure. Note that the same machine can be a node for multiple ranges.


In the diagram, the range between the red and purple dots is owned by the nodes.

It is interesting to note that, at the time, there was a trend toward building truly decentralized systems in which every node plays essentially the same role.

Cassandra employs a quorum mechanism. The replication factor is set to N. A write operation is considered successful only once the coordinator receives successful write acknowledgments from a quorum of the N replicas.
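
A minimal sketch of this placement scheme, assuming MD5 as the ring hash and simple successor-based replication (both illustrative choices): a key is hashed onto the ring and assigned to the first node clockwise from it plus the next N-1 nodes.

  import bisect
  import hashlib

  RING_SIZE = 2 ** 32

  def ring_position(value):
      # Map a node name or row key onto a point on the ring.
      return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

  class ConsistentHashRing:
      def __init__(self, nodes, replication_factor=3):
          self.replication_factor = replication_factor
          self.ring = sorted((ring_position(node), node) for node in nodes)

      def replicas(self, key):
          # The first node clockwise from the key's position, plus the
          # next N - 1 nodes walking around the ring.
          positions = [pos for pos, _ in self.ring]
          start = bisect.bisect(positions, ring_position(key)) % len(self.ring)
          count = min(self.replication_factor, len(self.ring))
          return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

  ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
  print(ring.replicas("user:42"))  # e.g. ['node-c', 'node-d', 'node-a']

In Cassandra, the first of these nodes acts as the coordinator for the range, and replication policies such as rack- or datacenter-aware placement refine which of the remaining nodes are chosen.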

Cassandra and Failure Detection

Within a range, it is crucial for replicas to detect the failure of the coordinator, enabling the election of a new coordinator.

Cassandra utilizes an anti-entropy gossip mechanism in conjunction with an accrual failure detector. In each exchange, a node transmits its entire collection of heartbeat information (received from other nodes) to its peers. Because heartbeats piggyback on these gossip exchanges, the observed inter-arrival intervals tend to be short and variable. Consequently, an exponential distribution typically models the heartbeat inter-arrival times more accurately than a Gaussian distribution.

After a failure is detected, Cassandra elects a leader among its nodes using ZooKeeper, a distributed coordination service.

Paper Review

This paper is remarkably concise, covering a wide range of concepts within a relatively small number of pages. While not overly challenging to read and comprehend, I recommend familiarizing oneself with Dynamo and Bigtable beforehand to gain a deeper understanding of the context and related concepts.
