
Paper Insights #16 - Cassandra - A Decentralized Structured Storage System

This research paper, authored by Avinash Lakshman (co-inventor of Amazon Dynamo) and Prashant Malik, originates from Facebook and dates back to 2008.

Paper Link

Cassandra, in its design, appears to be a synthesis of Amazon's Dynamo (2007) and Google's Bigtable (2006). It draws heavily upon the concepts of both systems. Notably, this paper was published during the rise of these influential databases.


Let's begin with the concepts of failure detection and gossip protocols, which are a highlight of this paper.

Failure Detectors

In the realm of computer science, the FLP impossibility result is a foundational concept. It states that in an asynchronous network where even a single node can crash, no deterministic consensus protocol can guarantee both safety and liveness (termination). Crash faults are always a possibility in large systems, and real-world networks are always asynchronous. This makes consensus, state machine replication, and atomic broadcast impossible problems to solve in full generality.

Algorithms like Paxos ensure safety but not termination. Achieving termination necessitates some degree of synchrony, whether partial or full, within the network. Consequently, all practical consensus algorithms in real-world asynchronous networks rely on assumptions about network time bounds. These bounds are crucial for making progress and ensuring termination.

Failure detectors are essential components within these algorithms. They assist in approximately identifying and isolating faulty nodes, enabling the system to continue making progress despite the presence of failures.

Failure detectors can be classified based on their:

  • Degree of Completeness:

    • Strong Completeness: Every faulty process is eventually suspected by every non-faulty process.
    • Weak Completeness: Every faulty process is eventually suspected by at least one non-faulty process.
  • Degree of Accuracy:

    • Strong Accuracy: No non-faulty process is ever falsely suspected.
    • Weak Accuracy: At least one non-faulty process is never falsely suspected by any non-faulty process.
    • Eventually Strong Accuracy: No non-faulty process is suspected after a certain period.
    • Eventually Weak Accuracy: Non-faulty processes may be falsely suspected initially, but eventually, at least one non-faulty process is never suspected by any non-faulty process.

Algorithms

SWIM (Scalable Weakly-consistent Infection-style process group Membership protocol)

SWIM is a prominent failure detection algorithm known for its strong completeness and weak accuracy. However, a detailed discussion of SWIM falls outside the scope of this write-up.

Heartbeat-Based Failure Detectors

These represent the most fundamental class of failure detectors. They operate by monitoring the regular reception of heartbeat messages from other nodes. Heartbeat-based detectors exhibit strong completeness but only achieve eventual weak accuracy. This implies that while they will eventually detect a node failure, they may temporarily suspect healthy nodes as faulty.
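
As a rough, hypothetical sketch (none of these names come from the paper), a heartbeat-based detector can be as simple as suspecting any peer whose last heartbeat is older than a fixed timeout:

  import time

  class HeartbeatFailureDetector:
      """Suspects a peer when no heartbeat has arrived within `timeout` seconds."""

      def __init__(self, timeout=5.0):
          self.timeout = timeout
          self.last_seen = {}  # peer id -> timestamp of the last heartbeat

      def heartbeat(self, peer):
          # Called whenever a heartbeat message is received from `peer`.
          self.last_seen[peer] = time.monotonic()

      def suspected(self, peer):
          # A peer never heard from, or silent for too long, is suspected.
          last = self.last_seen.get(peer)
          return last is None or (time.monotonic() - last) > self.timeout

The fixed timeout is exactly what makes such detectors only eventually weakly accurate: a slow but healthy peer can be suspected until its next heartbeat arrives.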

Ping-Ack Failure Detectors

Similar to heartbeat-based detectors, ping-ack mechanisms rely on the exchange of ping messages and acknowledgments (acks) to monitor node health. They share the same strong completeness and eventual weak accuracy characteristics as heartbeat-based detectors.

Accrual Failure Detectors

Cassandra employs Accrual Failure Detectors, which extend the concept of heartbeat-based mechanisms. Instead of emitting a binary alive/failed verdict, an accrual failure detector outputs a continuously updated suspicion level, and a tunable threshold on that level trades off accuracy (false positives) against detection time. This flexibility enables administrators to fine-tune the detector's behavior to meet specific system requirements.

φ is a suspicion metric for a machine; it grows as the probability of still receiving a heartbeat from that machine shrinks.

φ(t) = -log10(P(t))

φ(t) is the suspicion level for the machine at time t.

P(t) is the probability of receiving a heartbeat from the machine after time t (measured since the last heartbeat).

A higher φ value indicates a stronger suspicion that the machine has failed. For example:
  • φ = 1 implies a 10% probability of receiving a heartbeat.
  • φ = 2 implies a 1% probability of receiving a heartbeat.
  • φ = 3 implies a 0.1% probability of receiving a heartbeat.

and so on.

How is P(t) defined? By maintaining a sliding window of heartbeat inter-arrival times. The samples within this window form an empirical probability distribution. Then:

P(t) = Integral of this distribution from t to ∞.

Depending on its requirements, the application may apply a cutoff on φ to decide whether to treat the node as failed.
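
Below is a minimal sketch of a φ accrual detector, assuming (as the Cassandra section later suggests) that heartbeat inter-arrival times are approximated by an exponential distribution; the class name, window size, and threshold are illustrative choices, not values from the paper:

  import math
  import time
  from collections import deque

  class PhiAccrualFailureDetector:
      def __init__(self, window_size=1000):
          self.intervals = deque(maxlen=window_size)  # sliding window of inter-arrival times
          self.last_heartbeat = None

      def heartbeat(self):
          # Record the interval since the previous heartbeat.
          now = time.monotonic()
          if self.last_heartbeat is not None:
              self.intervals.append(now - self.last_heartbeat)
          self.last_heartbeat = now

      def phi(self):
          # phi(t) = -log10(P(t)), with P(t) = exp(-t / mean) under the
          # exponential assumption, so phi(t) = (t / mean) * log10(e).
          if not self.intervals or self.last_heartbeat is None:
              return 0.0
          mean = sum(self.intervals) / len(self.intervals)
          t = time.monotonic() - self.last_heartbeat
          return (t / mean) * math.log10(math.e)

      def suspected(self, threshold=8.0):
          # The application-chosen cutoff: phi >= 8 means the chance that the
          # heartbeat is merely late is below 10^-8.
          return self.phi() >= threshold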

Gossip Protocols

Gossip protocols constitute a class of broadcast protocols within p2p networks. Other classes include point-to-point (one-to-one) and eager reliable (one-to-all) protocols.

The core principle of gossip protocols involves each node periodically transmitting a message to a randomly selected subset of other nodes. This decentralized approach ensures that, with high probability, a particular message will eventually propagate throughout the entire system. This characteristic finds valuable applications in various domains, including failure detection.

Types of Gossip Protocols

Gossip protocols are typically characterized by two key tunables: the required time for message dissemination and network usage. Based on these parameters, several distinct types of gossip protocols have emerged:

  • Anti-Entropy: In anti-entropy protocols, nodes transmit their entire message set to other nodes. This approach generally results in high bandwidth consumption. To mitigate this, diffing techniques can be employed to avoid redundant data transmission. The term "anti-entropy" originates from the protocol's objective: to minimize the information divergence between nodes by exchanging complete datasets.

  • Rumor-Mongering: Rumor-mongering protocols, also known as dissemination protocols, focus on transmitting only updates or changes to the message set. This approach significantly reduces network usage, enabling more frequent data exchange.

  • Aggregation: Aggregation protocols facilitate the exchange of system-wide aggregates through sampling techniques.

Fundamentally, all gossip protocols share a common approach: they disseminate information by having each node randomly transmit messages to a subset of its peers. This decentralized strategy relies on the probabilistic nature of these random transmissions to ensure that information eventually propagates throughout the entire network.
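
The following sketch shows one such push-style gossip round (the node structure and fan-out value are my own illustrations): every period, a node sends what it knows to a few randomly chosen peers, and each peer keeps the newest version of every entry.

  import random

  FANOUT = 3  # number of peers contacted per gossip round (a tunable)

  class GossipNode:
      def __init__(self, name, peers):
          self.name = name
          self.peers = peers  # list of other GossipNode instances
          self.state = {}     # key -> (value, version)

      def update(self, key, value, version):
          # Keep only the newest version of each entry.
          current = self.state.get(key)
          if current is None or version > current[1]:
              self.state[key] = (value, version)

      def gossip_round(self):
          # Push the local state to a random subset of peers; rumor-mongering
          # variants would push only recently changed entries instead.
          for peer in random.sample(self.peers, min(FANOUT, len(self.peers))):
              for key, (value, version) in self.state.items():
                  peer.update(key, value, version)

Repeated rounds of such random pushes spread an update to all nodes in O(log N) rounds with high probability.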

Cassandra

Apache Cassandra is a wide-column store NoSQL database. These databases organize data into columns, but unlike traditional relational databases, the columns can vary from row to row.

Data Model

Cassandra employs a <row, column> key format, closely resembling Bigtable (except for the time dimension). Notably, Bigtable, Cassandra, and HBase are all fundamentally implemented as key-value stores.

Each row in a Cassandra table is uniquely identified by a string row key, typically 16 to 36 bytes long. Columns are organized into column families, which come in two forms: simple and super. Super column families can be conceptually understood as nested column families.

Regardless of the data model presented—be it tabular or graph-like—all NoSQL databases, including Cassandra, are fundamentally key-value stores.
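
As a loose illustration (not actual Cassandra syntax or API), a single row under this model can be pictured as nested maps, with a super column family adding one extra level of nesting:

  # Illustrative layout of a single row in a hypothetical "Users" table.
  row = {
      "row_key": "user:42",
      "personal": {                  # simple column family: column -> value
          "name": "Alice",
          "email": "alice@example.com",
      },
      "posts": {                     # super column family
          "post:2008-10-01": {       # super column: nested columns
              "title": "Hello",
              "body": "First post",
          },
      },
  }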

API

Cassandra's core API simplifies data interaction to three fundamental operations: 
  • insert(table, rowkey, rowMutation)
  • get(table, rowkey, columnName)
  • delete(table, rowkey, columnName)
This essentially boils down to key-value put and get operations.
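
To make the shape of these calls concrete, here is a toy in-memory stand-in (purely illustrative; it is not the real Cassandra client):

  class InMemoryStore:
      def __init__(self):
          self.tables = {}  # table -> rowkey -> column -> value

      def insert(self, table, rowkey, row_mutation):
          # Apply a mutation (a dict of column -> value) to the row.
          row = self.tables.setdefault(table, {}).setdefault(rowkey, {})
          row.update(row_mutation)

      def get(self, table, rowkey, column_name):
          return self.tables.get(table, {}).get(rowkey, {}).get(column_name)

      def delete(self, table, rowkey, column_name):
          self.tables.get(table, {}).get(rowkey, {}).pop(column_name, None)

  store = InMemoryStore()
  store.insert("Users", "user:42", {"email": "alice@example.com"})
  print(store.get("Users", "user:42", "email"))  # alice@example.com
  store.delete("Users", "user:42", "email")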

Data Structures

Cassandra's storage engine exhibits strong similarities to Bigtable's tablet structure (implemented as an LSM tree); a sketch of this write path follows the list below:
  • Utilizes an in-memory data structure to track updates concurrently with writing to a commit log.
  • Periodically, this in-memory data is flushed to disk, accompanied by summary information and a Bloom filter to accelerate query performance. Although not explicitly mentioned in the paper, it's highly probable that the on-disk data files are stored in the SSTable format. An SSTable is a serialized representation of key-value pairs, where keys are arranged in sorted order alongside their corresponding values.
  • Data files undergo periodic compaction to optimize storage and improve read performance.
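
Here is a rough sketch of that write path under the stated assumptions (SSTable-style sorted runs plus a Bloom filter); it glosses over on-disk formats, index summaries, and compaction:

  import hashlib

  class BloomFilter:
      """Tiny Bloom filter used to skip data files that cannot contain a key."""

      def __init__(self, size_bits=1024, num_hashes=3):
          self.size = size_bits
          self.num_hashes = num_hashes
          self.bits = 0

      def _positions(self, key):
          for i in range(self.num_hashes):
              digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
              yield int(digest, 16) % self.size

      def add(self, key):
          for pos in self._positions(key):
              self.bits |= 1 << pos

      def might_contain(self, key):
          return all(self.bits & (1 << pos) for pos in self._positions(key))

  class SimpleLSM:
      """Memtable + commit log + sorted on-disk runs, in the spirit of an LSM tree."""

      def __init__(self):
          self.commit_log = []  # append-only log for durability
          self.memtable = {}    # in-memory updates
          self.sstables = []    # list of (sorted key/value pairs, Bloom filter)

      def put(self, key, value):
          self.commit_log.append((key, value))  # log first, then memtable
          self.memtable[key] = value

      def flush(self):
          # Dump the memtable as a sorted run plus a Bloom filter.
          run = sorted(self.memtable.items())
          bloom = BloomFilter()
          for key, _ in run:
              bloom.add(key)
          self.sstables.append((run, bloom))
          self.memtable = {}

      def get(self, key):
          if key in self.memtable:
              return self.memtable[key]
          for run, bloom in reversed(self.sstables):  # newest run first
              if bloom.might_contain(key):
                  for k, v in run:
                      if k == key:
                          return v
          return None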

System Architecture

The architecture shares similarities with Dynamo, leveraging a Chord-style consistent hashing scheme for data placement.

It's noteworthy that Bigtable leverages Google File System (GFS) for efficient data storage and retrieval. In contrast, Cassandra does not rely on a distributed file system. This design choice may be attributed to the lack of a GFS equivalent within Facebook's infrastructure at the time of Cassandra's development.

The presence of a distributed file system could have significantly simplified Cassandra's implementation. All replicas of a given partition could have seamlessly relied on a consistent view of the data stored within the distributed file system. However, without a distributed file system, the Cassandra team had to implement a more intricate solution: consistent hashing for data partitioning and replication, coupled with sophisticated membership control mechanisms involving failure detectors and gossip protocols.

Range Assignment

As discussed above, Cassandra makes use of consistent hashing to distribute keys. Each range is assigned to not one, but N nodes, one of which acts as the coordinator for that range. This replication is required to ensure that there is no single point of failure. Note that the same machine can be a node for multiple ranges.


In the diagram, the range between the red and purple dots is owned by the nodes.

It is interesting to note that, at the time, there was a trend toward building truly decentralized systems in which every node plays essentially the same role.

Cassandra employs a quorum mechanism. The replication factor is set to N. A write operation is considered successful only once the coordinator receives successful write acknowledgments from a quorum of the N replicas.
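
A minimal sketch of this placement scheme, assuming MD5 as the ring hash and simple successor-based replication (both illustrative choices): a key is hashed onto the ring and assigned to the first node clockwise from it plus the next N-1 nodes.

  import bisect
  import hashlib

  RING_SIZE = 2 ** 32

  def ring_position(value):
      # Map a node name or row key onto a point on the ring.
      return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

  class ConsistentHashRing:
      def __init__(self, nodes, replication_factor=3):
          self.replication_factor = replication_factor
          self.ring = sorted((ring_position(node), node) for node in nodes)

      def replicas(self, key):
          # The first node clockwise from the key's position, plus the
          # next N - 1 nodes walking around the ring.
          positions = [pos for pos, _ in self.ring]
          start = bisect.bisect(positions, ring_position(key)) % len(self.ring)
          count = min(self.replication_factor, len(self.ring))
          return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

  ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
  print(ring.replicas("user:42"))  # e.g. ['node-c', 'node-d', 'node-a']

In Cassandra, the first of these nodes acts as the coordinator for the range, and replication policies such as rack- or datacenter-aware placement refine which of the remaining nodes are chosen.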

Cassandra and Failure Detection

Within a range, it is crucial for replicas to detect the failure of the coordinator, enabling the election of a new coordinator.

Cassandra utilizes an anti-entropy gossip mechanism in conjunction with an accrual failure detector. In each exchange, a node transmits its entire collection of heartbeat information (received from other nodes) to its peers. Because heartbeats piggyback on these gossip exchanges, the observed inter-arrival intervals tend to be short and variable. Consequently, an exponential distribution typically models the heartbeat inter-arrival times more accurately than a Gaussian distribution.

After a failure is detected, Cassandra elects a leader among its nodes using ZooKeeper, a distributed coordination service.

Paper Review

This paper is remarkably concise, covering a wide range of concepts within a relatively small number of pages. While not overly challenging to read and comprehend, I recommend familiarizing oneself with Dynamo and Bigtable beforehand to gain a deeper understanding of the context and related concepts.
