
Paper Insights #16 - Cassandra - A Decentralized Structured Storage System

This research paper, authored by Avinash Lakshman (co-inventor of Amazon Dynamo) and Prashant Malik, originates from Facebook and dates back to 2008.

Paper Link

Cassandra, in its design, appears to be a synthesis of Amazon's Dynamo (2007) and Google's Bigtable (2006). It draws heavily upon the concepts of both systems. Notably, this paper was published during the rise of these influential databases. However, Cassandra also introduces novel ideas that warrant further investigation.

Recommended Read: Dynamo: Amazon's Highly Available Key-value Store

Let's begin with some fundamental concepts.

SQL Databases

SQL databases are a category of databases that are inherently consistent, meaning that data integrity is always upheld. For instance, in a banking database, the cumulative balance across all accounts must remain unchanged at all times, regardless of the number of transfer transactions.

To ensure this data consistency (the C in ACID), SQL databases necessitate support for atomic operations (the A in ACID), transaction isolation (the I in ACID), and durable storage (the D in ACID).

SQL databases can be categorized into two types:

  • Real-time, Strongly Consistent: Examples include Spanner, AuroraDB, and non-distributed versions of MySQL. These databases guarantee immediate visibility of transaction results, ensuring strong consistency in real-time.
  • Non-real-time, Consistent: Data warehouses like Mesa, Napa, and Snowflake fall under this category. While maintaining consistency, they may exhibit some data staleness as all transactions might not have been immediately applied.

Both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) applications critically rely on SQL databases; NoSQL databases, which relax the guarantees above, are a poor foundation for such transactional and analytical workloads.

NoSQL Databases

Databases that do not adhere to the SQL semantics discussed above are generally categorized as NoSQL databases. They are sometimes referred to as BASE (Basically Available, Soft state, Eventually consistent) databases, in contrast to ACID databases. They often achieve faster transaction processing by relaxing some of the stringent ACID guarantees.

For example, Bigtable, a prominent NoSQL database, supports atomicity only for modifications within a single row.

NoSQL databases exhibit diverse architectures and can be broadly classified into the following categories:

  • Document Databases: In these databases, keys are typically strings, and the corresponding values are structured documents, often in formats like JSON. Many document databases, including MongoDB and Couchbase, draw inspiration from the design principles of Amazon's Dynamo. While MongoDB has undergone significant evolution, its core concepts remain largely similar to those of Dynamo.
  • Wide-Column Databases: These databases allow for a dynamic number of columns in a table. Prominent examples include Cassandra, HBase, and Bigtable, all of which are heavily influenced by the data structures and design principles of Google's Bigtable.
  • Key-Value Stores: These databases can be viewed as a more generalized form of document stores, offering a simpler key-value data model.
  • Graph Databases: Designed to efficiently store and query highly interconnected data, graph databases like Facebook's Tao and Neo4j excel at representing and analyzing relationships within complex networks.

Cassandra

Key Format

Cassandra employs a <row, column_family> key format, closely resembling Bigtable's. Notably, Bigtable, Cassandra, and HBase are all fundamentally implemented as key-value stores.
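As a rough illustration (not Cassandra's actual API; the store layout and helper functions below are hypothetical), the key format can be pictured as nested lookups from row key to column family to column:

```python
# Illustrative sketch only: modeling a <row, column_family> key format
# as nested dictionaries. Not Cassandra's real data structures or API.
store = {}

def put(row_key, column_family, column, value):
    """Insert a value under a (row, column family, column) path."""
    store.setdefault(row_key, {}).setdefault(column_family, {})[column] = value

def get(row_key, column_family):
    """Fetch all columns of a column family for a given row."""
    return store.get(row_key, {}).get(column_family, {})

put("user:42", "profile", "name", "Ada")
put("user:42", "profile", "email", "ada@example.com")
print(get("user:42", "profile"))  # {'name': 'Ada', 'email': 'ada@example.com'}
```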

Data Structures

Cassandra's data structures exhibit strong similarities to Bigtable's:
  • Utilizes an in-memory data structure to track updates concurrently with writing to a commit log.
  • Periodically, this in-memory data is flushed to disk, accompanied by summary information and a Bloom filter to accelerate query performance. Although not explicitly mentioned in the paper, it's highly probable that the on-disk data files are stored in the SSTable format. An SSTable is a serialized representation of key-value pairs, where keys are arranged in sorted order alongside their corresponding values.
  • Data files undergo periodic compaction to optimize storage and improve read performance.
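The write path above can be sketched in a few lines. This is a toy model, not Cassandra's implementation: a commit-log append for durability, an in-memory memtable, and periodic flushes into sorted runs resembling SSTables (compaction and Bloom filters are omitted):

```python
import bisect

class MiniLSM:
    """Toy sketch of the Bigtable/Cassandra-style write path:
    commit log + in-memory memtable, flushed to sorted SSTable-like runs."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []      # stand-in for the durable commit log
        self.memtable = {}        # in-memory updates
        self.sstables = []        # flushed, sorted key-value runs
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Keys are written out in sorted order, as in an SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()

    def read(self, key):
        # Check the memtable first, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = MiniLSM()
db.write("b", 1)
db.write("a", 2)   # second write triggers a flush of a sorted run
db.write("c", 3)
print(db.read("a"), db.read("c"))  # 2 3
```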

System Architecture

Cassandra's system architecture shares similarities with Dynamo's, leveraging a Chord-style consistent hashing scheme for data placement.

It's noteworthy that Bigtable leverages Google File System (GFS) for efficient data storage and retrieval. In contrast, Cassandra does not rely on a distributed file system. This design choice may be attributed to the lack of a GFS equivalent within Facebook's infrastructure at the time of Cassandra's development.

The presence of a distributed file system could have significantly simplified Cassandra's implementation. All replicas of a given partition could have seamlessly relied on a consistent view of the data stored within the distributed file system. However, without a distributed file system, the Cassandra team had to implement a more intricate solution: consistent hashing for data partitioning and replication, coupled with sophisticated membership control mechanisms involving failure detectors and gossip protocols.

Range Assignment

As discussed above, Cassandra makes use of consistent hashing to distribute keys. Each range is assigned to not one but N nodes, one of which acts as the coordinator for that range. This replication ensures that there is no single point of failure. Note that the same machine can serve as a node for multiple ranges.


In the diagram, the key range between the red and the purple dot is owned by the N nodes that succeed it on the ring.
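The placement scheme can be sketched as follows. This is an illustrative model of consistent hashing, assuming (as in Chord-style schemes) that a key's range is owned by the next N nodes clockwise on the ring; the node names and the hash choice are arbitrary:

```python
import hashlib
from bisect import bisect_right

def h(name):
    """Map a key or node name onto the hash ring (illustrative choice)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    """Sketch of consistent hashing with each range replicated on N nodes."""
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        self.ring = sorted((h(node), node) for node in nodes)

    def replicas_for(self, key):
        # Walk clockwise from the key's position: the first node can act
        # as the coordinator for the range, the next N-1 are replicas.
        start = bisect_right(self.ring, (h(key), ""))
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.n)]

ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
print(ring.replicas_for("some-row-key"))  # three consecutive nodes clockwise
```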

It is interesting to note that, at the time, there was a trend of building truly decentralized systems in which every node is essentially identical.

Cassandra employs a quorum mechanism. With the replication factor set to N, a write operation is considered successful only once the coordinator receives successful write acknowledgments from a quorum of the N replicas.
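As a quick sketch of the standard quorum formulation (the specific W and R values below are illustrative, not from the paper):

```python
# Standard quorum formulation (illustrative parameters): with replication
# factor N, pick write/read quorum sizes W and R such that W + R > N, so
# every read quorum intersects every successful write quorum.
N, W, R = 3, 2, 2
assert W + R > N  # guarantees read/write quorum intersection

def write_succeeds(acks_received):
    """A write commits once at least W of the N replicas acknowledge it."""
    return acks_received >= W

print(write_succeeds(1))  # False: not enough replicas acknowledged
print(write_succeeds(2))  # True: quorum reached
```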

Introduction to Failure Detectors

In the realm of computer science, the FLP impossibility result is a foundational concept. It states that in an asynchronous network where even a single node can crash, no deterministic protocol can guarantee both safety and liveness (termination). Crash faults are always a possibility in large systems, and real-world networks are asynchronous. This makes consensus, state machine replication, and atomic broadcast impossible to solve deterministically without additional assumptions.

Algorithms like Paxos ensure safety but not termination. Achieving termination necessitates some degree of synchrony, whether partial or full, within the network. Consequently, all practical consensus algorithms for real-world asynchronous networks rely on assumptions about network time bounds. These bounds are crucial for making progress and ensuring termination.

Failure detectors are essential components within these algorithms. They assist in approximately identifying and isolating faulty nodes, enabling the system to continue making progress despite the presence of failures.

Failure detectors can be classified based on their:

  • Degree of Completeness:

    • Strong Completeness: Every faulty process is eventually suspected by every non-faulty process.
    • Weak Completeness: Every faulty process is eventually suspected by at least one non-faulty process.
  • Degree of Accuracy:

    • Strong Accuracy: No non-faulty process is ever falsely suspected.
    • Weak Accuracy: At least one non-faulty process is never falsely suspected.
    • Eventually Strong Accuracy: No non-faulty process is suspected after a certain period.
    • Eventually Weak Accuracy: Non-faulty processes may be falsely suspected at first, but eventually at least one non-faulty process is never suspected.

Failure Detector Algorithms

SWIM (Scalable Weakly-consistent Infection-style Process Group Membership Protocol)

SWIM is a prominent failure detection algorithm known for its strong completeness and weak accuracy. However, a detailed discussion of SWIM falls outside the scope of this write-up.

Heartbeat-Based Failure Detectors

These represent the most fundamental class of failure detectors. They operate by monitoring the regular reception of heartbeat messages from other nodes. Heartbeat-based detectors exhibit strong completeness but only achieve eventual weak accuracy. This implies that while they will eventually detect a node failure, they may temporarily suspect healthy nodes as faulty.
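A minimal heartbeat-based detector can be sketched as follows (a toy model; the timeout value is arbitrary). Note how a healthy but slow node would be temporarily suspected, which is exactly the accuracy limitation just described:

```python
import time

class HeartbeatDetector:
    """Toy heartbeat failure detector (illustrative): a node is suspected
    once no heartbeat has been seen for `timeout` seconds. Strongly
    complete (a crashed node stops sending heartbeats and is eventually
    suspected) but only eventually weakly accurate (a slow-but-healthy
    node can be suspected temporarily)."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now=None):
        self.last_seen[node] = now if now is not None else time.time()

    def suspected(self, node, now=None):
        now = now if now is not None else time.time()
        return now - self.last_seen.get(node, float("-inf")) > self.timeout

d = HeartbeatDetector(timeout=5.0)
d.heartbeat("node-a", now=100.0)
print(d.suspected("node-a", now=103.0))  # False: heartbeat was recent
print(d.suspected("node-a", now=106.0))  # True: silent for > 5 seconds
```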

Ping-Ack Failure Detectors

Similar to heartbeat-based detectors, ping-ack mechanisms rely on the exchange of ping messages and acknowledgments (acks) to monitor node health. They share the same strong completeness and eventual weak accuracy characteristics as heartbeat-based detectors.

Accrual Failure Detectors

Cassandra employs Accrual Failure Detectors, which extend the concept of heartbeat-based mechanisms. Accrual Failure Detectors incorporate tunable parameters, allowing for adjustments to the balance between accuracy and completeness. This flexibility enables administrators to fine-tune the detector's behavior to meet specific system requirements.

φ is a suspicion level that reflects the likelihood that a machine has failed; it grows as the probability of receiving a heartbeat shrinks.

φ(t) = -log10(P(t))

φ(t) is the suspicion level for a machine at time t.

P(t) is the probability of still receiving a heartbeat from the machine after a silence of length t.

A higher φ value indicates a higher likelihood of machine failure. For example:
  • φ = 1 implies a 10% probability of receiving a heartbeat.
  • φ = 2 implies a 1% probability of receiving a heartbeat.
  • φ = 3 implies a 0.1% probability of receiving a heartbeat.

and so on.

How is P(t) defined? A sliding window of recent heartbeat inter-arrival intervals is maintained. The samples within this window collectively form a probability distribution of intervals. Then:

P(t) = the integral of this distribution from t to ∞.

Depending on its requirements, the application can apply a cutoff on φ to decide whether to treat a node as failed.
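Putting the pieces together, here is a toy φ accrual detector. It assumes, purely for illustration, that heartbeat intervals follow an exponential distribution with the sliding window's mean μ, in which case P(t) = e^(-t/μ) and therefore φ(t) = t / (μ · ln 10):

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Sketch of a φ accrual failure detector (illustrative parameters).
    Inter-heartbeat intervals are kept in a sliding window; assuming an
    exponential distribution with the window's mean μ, the probability of
    a silence longer than t is P(t) = exp(-t/μ), so
    φ(t) = -log10(P(t)) = t / (μ * ln 10)."""

    def __init__(self, window_size=100):
        self.intervals = deque(maxlen=window_size)  # sliding window
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        mean = sum(self.intervals) / len(self.intervals)  # μ
        t = now - self.last_heartbeat                     # current silence
        return t / (mean * math.log(10))

d = PhiAccrualDetector()
for ts in [0.0, 1.0, 2.0, 3.0]:     # heartbeats arriving every second
    d.heartbeat(ts)
print(round(d.phi(3.0 + 2.303), 2))  # ≈ 1.0: ~10% chance of such a gap
```

An application would then apply a cutoff, e.g. treating the node as failed once φ exceeds some threshold such as 8 (roughly a 10⁻⁸ chance that the silence is benign).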

Gossip Protocols

Gossip protocols constitute a class of broadcast protocols for peer-to-peer (P2P) networks. Other classes include point-to-point (one-to-one) and eager reliable broadcast (one-to-all) protocols.

The core principle of gossip protocols involves each node periodically transmitting a message to a randomly selected subset of other nodes. This decentralized approach ensures that, with high probability, a particular message will eventually propagate throughout the entire system. This characteristic finds valuable applications in various domains, including failure detection.

Types of Gossip Protocols

Gossip protocols are typically characterized by two key tunables: the time required for message dissemination and the network bandwidth consumed. Based on these parameters, several distinct types of gossip protocols have emerged:

  • Anti-Entropy: In anti-entropy protocols, nodes transmit their entire message set to other nodes. This approach generally results in high bandwidth consumption. To mitigate this, diffing techniques can be employed to avoid redundant data transmission. The term "anti-entropy" originates from the protocol's objective: to minimize the information divergence between nodes by exchanging complete datasets.

  • Rumor-Mongering: Rumor-mongering protocols, also known as dissemination protocols, focus on transmitting only updates or changes to the message set. This approach significantly reduces network usage, enabling more frequent data exchange.

  • Aggregation: Aggregation protocols facilitate the exchange of system-wide aggregates through sampling techniques.

Fundamentally, all gossip protocols share a common approach: they disseminate information by having each node randomly transmit messages to a subset of its peers. This decentralized strategy relies on the probabilistic nature of these random transmissions to ensure that information eventually propagates throughout the entire network.
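A tiny simulation illustrates why this works (the fanout and cluster sizes are arbitrary): the number of rounds needed for a rumor to reach every node grows only slowly with the cluster size:

```python
import random

def gossip_rounds(num_nodes, fanout=3, seed=1):
    """Simulate rumor-mongering gossip (illustrative): each round, every
    informed node forwards the message to `fanout` random peers. Returns
    the number of rounds until every node is informed."""
    random.seed(seed)
    informed = {0}  # node 0 starts with the message
    rounds = 0
    while len(informed) < num_nodes:
        newly = set()
        for node in informed:
            newly.update(random.sample(range(num_nodes), fanout))
        informed |= newly
        rounds += 1
    return rounds

# Dissemination time grows roughly logarithmically with cluster size.
print(gossip_rounds(1000))  # typically just a handful of rounds
```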

Cassandra and Failure Detection

Within a range, it is crucial for replicas to detect the failure of the coordinator, enabling the election of a new coordinator.

Cassandra utilizes an anti-entropy gossiping mechanism in conjunction with an accrual failure detector. In each exchange, a node transmits its entire collection of heartbeat information (received from other nodes) to its peers. This comprehensive exchange often results in shorter heartbeat intervals. Consequently, an exponential distribution typically provides a more accurate representation of heartbeat arrival times compared to a Gaussian distribution.

After failure detection, Cassandra elects a leader amongst its nodes using ZooKeeper.

Paper Review

This paper is remarkably concise, covering a wide range of concepts within a relatively small number of pages. While not overly challenging to read and comprehend, I recommend familiarizing oneself with Dynamo and Bigtable beforehand to gain a deeper understanding of the context and related concepts.
