Paper Insights #14 - Bigtable: A Distributed Storage System for Structured Data

The ODSI 2006 conference proved to be a pivotal event for distributed systems, marked by the presentation of Google's highly influential Bigtable paper. Co-authored by luminaries such as Jeffery Dean and Sanjay Ghemawat, with significant contributions from Mike Burrows (the architect of Bigtable and creator of Chubby), this paper has become a landmark in the field. Notably, Mike Burrows's work on Chubby was also presented at ODSI 2006. Adding to the conference's impact, the paper Rethink the Sync was also presented, and it shared the Best Paper award with Bigtable. Clearly, ODSI 2006 was a watershed moment for the advancement of distributed systems.

Paper Link

Let's begin with some basic concepts.

Bloom Filters

Bloom filters are probabilistic data structures used in big data pipelines for efficient set membership testing. They provide:

A definitive NO for non-membership
A probable YES for membership.

A deterministic full hash sets would consume excessive memory. Conversely, bloom filters are memory-efficient by sacrificing the determinism for true positives.

Implementation

A basic implementation uses a bit array. Elements are mapped to array indices via hash functions, setting the corresponding bits to 1 for potential membership.

To minimize false positives, multiple hash functions are employed. If all corresponding bits are 1 after hashing, the element might be a member. If any bit is 0, it's definitely not. The number of hash functions and the size of the bit array affect the false positive rate.

SQL Databases

SQL databases are a category of databases which are inherently consistency. This implies that data integrity is always upheld. For instance, in a banking database, the cumulative balance across all accounts must remain unchanged at any time regardless of the number of transfer transactions.

To ensure this data consistency (the C in ACID), SQL databases necessitate support for atomic operations (the A in ACID), transaction isolation (the I in ACID), and durable storage (the D in ACID).

Classification

SQL databases can be categorized into two types:

Real-time, Strongly Consistent: Examples include Spanner, AuroraDB, and non-distributed versions of MySQL. These databases guarantee immediate visibility of transaction results, ensuring strong consistency in real-time.
Non-real-time, Consistent: Data warehouses like Mesa/Napa, and Snowflake fall under this category. While maintaining consistency, they may exhibit some data staleness as all transactions might not have been immediately applied.

Recommended Reads:

Applications

Both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) applications critically rely on SQL databases. NoSQL databases shouldn't be used for transactional or analytical workloads.

NoSQL Databases

Databases that do not adhere to the SQL semantics discussed above are generally categorized as NoSQL databases. These databases are sometimes referred to as BASE databases, contrasting them with ACID databases. They often prioritize faster request processing by relaxing some of the stringent ACID guarantees.

For example, Bigtable, a prominent NoSQL database, supports atomicity only for modifications within a single row.

Classification

NoSQL databases exhibit diverse architectures and can be broadly classified into the following categories:

Key-Value Stores: These are the generalized NoSQL databases, offering a simple key-value data model. Example - Dynamo.

Document Databases: In these databases, the values are structured documents, often in formats like JSON. Examples include MongoDB and CouchBase.

Wide-Column Databases: These databases allow for a dynamic number of columns in a table structure. Prominent examples include Cassandra, HBase, and Bigtable, all of which are heavily influenced by the data structures and design principles of Google's Bigtable.

Graph Databases: Like Facebook's TAO and Neo4j excel at representing and analyzing relationships within complex networks. They are designed to efficiently store and query highly interconnected data.

Recommended Reads:

Underlying Model

The key-value model serves as the underlying model for virtually all NoSQL databases. All other NoSQL models are, in fact, built upon this foundational key-value structure.

To illustrate, document databases utilize document IDs as keys and the documents themselves as values. Furthermore, even wide-column databases, such as Bigtable, are internally structured as key-value stores.

Transactions

Traditional database systems define a transaction as the atomic application of multiple actions. This ensures that a series of operations on multiple data items are treated as a single, indivisible unit, guaranteeing data consistency. However, key-value stores adopt a different approach, limiting transactions to individual data items. Consequently, multi-item transactions are not supported.

Document databases, like MongoDB, commonly implement single-row transactions. This is practical, as many real-world application transactions primarily involve operations on single records. For instance, a user updating their profile details within a Users table typically modifies only one row.

A single-data item transaction can be fundamentally understood as a read-write operation on a specific data item. This significantly simplifies the implementation compared to multi-row transaction systems. This simplicity is a core advantage and selling point of NoSQL databases, which often prioritize ease of use and performance by eschewing complex concurrency control, locking, and rollback mechanisms.

Conversely, multi-data items transactions, which are essential for SQL databases, necessitate sophisticated database components. These transactions require robust concurrency control to manage concurrent access, locking mechanisms to prevent data conflicts, and rollback support to maintain data integrity in the event of failures.

It's important to note that certain NoSQL databases, such as Dynamo, Cassandra, and TAO, completely forgo transaction support. These systems rely on blind writes, where modifications are applied without any transactional guarantees. That is, a client cannot decide to write a value based on the value it read (as in read-write transactions). This design choice prioritizes high availability and partition tolerance (ALPS principles) at the expense of strong consistency.

Database Engines

Data Structures Behind Databases

All persistent database systems, whether SQL or NoSQL, require durable storage mechanisms. To optimize read and write performance, various data structures are employed.

1. B-Trees

B-trees are a prevalent data structure, particularly in SQL databases. In essence, B-trees are self-balancing tree data structures optimized for disk-based storage, crucial for databases and file systems. They allow for efficient searching, insertion, and deletion of data by maintaining sorted keys within nodes and minimizing disk I/O through their balanced structure.

However, updating objects within a B-tree involves significant random I/O, as traversal and modification require accessing non-sequential disk locations.

Write-Ahead Logs (WAL) are primarily used in SQL databases to buffer transaction writes until commit. It also serve to convert random writes into sequential writes. By buffering writes in the WAL and periodically applying them to the B-tree, random disk access is minimized.

start 100013
write A, 50
write B, 100
commit 100013

B-trees remain the preferred data structure for read-heavy databases like MySQL, PostgreSQL, and Spanner, as they excel at range queries, a common SQL operation. Even popular NoSQL databases like Dynamo, MongoDB (which makes use of WiredTiger engine) and CouchDB also make use of B-trees.

The embedded database SQLite too makes use of B-trees.

2. LSM Trees (Log-Structured Merge Trees)

LSM trees are widely used in NoSQL databases. Their fundamental components are:

In-memory Memtables
On-disk SSTables

A memtable is often implemented as a hash map. Incoming writes are buffered in memory within the memtable. A write-ahead log ensures memtable durability in case of system crashes. When the memtable reaches a size threshold, it is serialized into an SSTable, and the memtable is cleared.

SSTables are file structures that store key-value pairs in sorted order. This sorted organization facilitates efficient indexing and searching. Each SSTable also includes an index to locate keys within the file.

Once materialized, SSTables are immutable. There may be several SSTables. Newer SSTables may contain newer versions of existing keys, effectively overriding older values. Therefore, the effective dataset comprises the current memtable and all SSTables, with newer SSTables taking precedence. Deletes are represented as tombstones.

SSTables are merged through a process called compaction, which produces a new, sorted SSTable. The algorithm to merge two SSTable S₁ and S₂ is similar to merging two hashmaps:

Iterate through S₁ and S₂, adding the key-value pair with the smaller key to S_merged.
If keys are equal, keep newest value.
Append remaining elements from either SSTable.

Compaction can create a leveled structure, with larger SSTables at deeper levels. This tree structure is what is known as an LSM tree.

Unlike B-tree updates, LSM tree writes are sequential as the materialization to SSTable happens in a single shot sequential write. This LSM trees are optimal for write-heavy data storage.

Read performance can be less efficient, as reads may require searching multiple SSTables. Additionally, there is increased disk space consumption due to the storage of older key versions.

To improve read performance in LSM trees, several strategies are employed:

Key-Range Caching: Caching the key ranges of SSTables to eliminate unnecessary searches.
Bloom Filters: Bloom filters efficiently determine whether a key might exist in an SSTable.

LSM trees are well-suited for write-intensive database workloads, as demonstrated by their use in systems like Bigtable, HBase, and Cassandra. Additionally, embedded databases like LevelDB (by Google) and RocksDB (a successor of LevelDB) also make use of LSM trees. Interestingly, all such databases are inspired by Bigtable's design.

Bigtable

Bigtable is a wide-column storage system.

But what exactly is wide-column storage?

Unlike traditional relational databases that organize data into rows with predefined columns, wide-column stores organize data into columns where the columns aren't necessarily fixed or predefined. Instead, they can be dynamically added, allowing for flexible schemas and sparse data representation.

In addition to its wide-column nature, Bigtable is also a distributed system, meaning data is spread across multiple servers for scalability and fault tolerance.

Bigtable is used by Google Analytics, Google Search, and Google Earth.

Data Model

As previously discussed, NoSQL databases, at their core, can be viewed as key-value stores. This fundamental principle extends to Bigtable, whose data model can be represented as:

<row: string, column: string, time: int64> -> string

The key component of this model is multifaceted and requires further explanation. Essentially, Bigtable leverages a key-value store to emulate a tabular data structure. The row and column components serve to represent the table's dimensions. The time field, however, introduces the capability to store time-versioned data, enabling the retrieval of historical values.

It's crucial to understand that the time dimension is flexible and not uniformly applied across all rows and columns. This flexibility distinguishes Bigtable from data warehouses (such as Mesa/Napa) or time-series databases (such as Monarch). A given row and column combination may have a single timestamp, while another may possess millions, resulting in a sparse data structure.

Rows

Row keys are arbitrary strings, up to 64 KB in size. These row keys form the basis for data partitioning, as we will explore later.

Columns

Unlike row keys, column keys adhere to a structured format: family:qualifier. The qualifier component can be an arbitrary string, while the family component is predefined and limited. Column keys are grouped into column families.

Column families serve multiple purposes:

They represent the smallest unit of access control.
They function as a unit of data compression, with data from the same family being compressed together.
They define the scope for garbage collection settings, allowing for the specification of retention policies (e.g., retaining the last N versions or versions within up to a specific time in the past).

Data Organization

Bigtable organizes data in a table according to the following principles:

Data is lexicographically ordered by row keys.
Data is partitioned into tablets, with each tablet containing a range of row keys. These tablets form the basis for data distribution and load balancing.
Within each tablet, columns are organized by column families, facilitating efficient data compression and retrieval.
Within a specific <row, column> combination, data is ordered by descending timestamp, ensuring that the most recent versions are accessed first.

It's important to recognize that these organizational structures are abstractions built upon the underlying key-value store. The core mental model remains that of a key-value store.

Transactions

Bigtable supports transactions at the row level. This capability is made possible by the fact that Bigtable partitions data by row key, and all keys for a single row reside within the same tablet. A tablet is owned by only one server (described below).

Architecture

Having established an understanding of the Bigtable APIs, let's now examine its architectural design.

Please note that the original Bigtable paper, published in 2006, employs the term master to denote a central coordinating server. While this terminology is outdated, I will retain it to maintain consistency with the source material and avoid confusion.

Bigtable's architecture features a single master server and multiple tablet servers. This design paradigm bears a strong resemblance to the Google File System (GFS), reflecting a prevalent architectural preference within Google at the time. It's worth noting that around 2007, the broader industry was exploring decentralized architectures, as exemplified by systems like Dynamo and IPFS.

The master server is responsible for crucial coordination tasks, including assigning tablets to tablet servers, detecting load imbalances, and initiating tablet reassignment. Each tablet server manages a collection of tablets, typically ranging from tens to thousands per server. Client applications communicate directly with tablet servers for read and write operations.

Similar to GFS, Bigtable adopts a per-cluster architecture. Each cluster operates independently, with no inter-cluster communication.

Within each cluster, there are multiple tables. Each table is further subdivided into tablets, which represent contiguous ranges of row keys.

Tablet Exclusive Assignment

A fundamental principle of Bigtable's design is that each tablet must be exclusively owned by a single tablet server. This constraint is essential for ensuring row-level transaction support.

Internally, Bigtable relies on Chubby, a distributed coordination service, to manage and track tablet servers. Chubby's functionality is analogous to ZooKeeper, providing a mechanism to guarantee that, at any given time, only one node can hold ownership of a specific key. While the FLP impossibility theorem highlights the inherent challenges of achieving perfect consensus in distributed systems, Chubby proves effective in practical applications.

Bigtable leverages Chubby to monitor tablet server availability. Upon startup, a tablet server acquires a lock in Chubby. The Bigtable master server periodically scans these locks to discover active tablet servers. Conversely, when a tablet server fails, its corresponding Chubby lock is released. The master then reassigns the tablet to another available tablet server.

For a deeper understanding of Chubby's operation, I highly recommend consulting its white paper.

The Bigtable master server itself acquires a unique lock within Chubby, ensuring that only one master instance is active at any time. This, combined with the master's ability to accurately track active tablet servers through their Chubby locks, guarantees that each tablet is exclusively owned by a single tablet server.

Tablet Location Hierarchy

Bigtable employs a three-level hierarchical structure to locate all tablets within the system:

Root Tablet: This tablet serves as the entry point for the tablet location system. It stores the location of all METADATA tablets within a special METADATA table.
METADATA Tablets: These tablets contain the locations of the actual data tablets. Each entry within a METADATA tablet points to the location of a specific row key range. Given that each metadata entry is approximately 1 KB in size, a 128 MB METADATA tablet can store the locations of a substantial number of user tablets.
User Tablets: These are the actual data tablets that hold the user's table rows.

The location of the root tablet is itself stored within a file in Chubby, providing a reliable and consistent starting point for the tablet location process.

Tablet Internals

In Bigtable, tablets are stored on the GFS, providing a robust foundation for data reliability. This separation of storage and computation allows tablet servers to function primarily as a computing layer. Consequently, if a tablet server fails, it can be readily replaced, and the new server can seamlessly access the data from the point where the previous server left off. This architectural design is known as a shared-disk architecture.

Splitting & Merging

Bigtable employs dynamic tablet splitting and merging to manage data distribution. When a tablet's size exceeds a predefined threshold, it is split into two or more smaller tablets. This involves partitioning the row key range into smaller, contiguous segments. Conversely, when tablets become underutilized and their size diminishes, they may be merged to optimize resource usage.

The process of splitting or merging tablets is designed to maintain data availability. Tablets remain available during these operations.

Let's examine the tablet splitting process in detail:

The tablet server responsible for the large tablet initiates the splitting operation.
The tablet server creates new metadata entries for the resulting smaller tablets.
The tablet server updates the METADATA table to reflect the new tablet locations.
The master server is notified of the split, and updates its information.
The new tablets are served to clients, while the original tablet is deleted.

Structure

Each tablet is implemented as an LSM tree, aligning with Bigtable's optimization for write-heavy workloads. Bigtable's read operations are often performed by batch jobs (MapReduce), which can tolerate higher query latencies.

Bigtable employs multiple compaction cycles for tablet's SSTable to manage data and reclaim storage space:

Minor Compaction: Minor compaction involves merging smaller, recently written SSTables into larger SSTables. This process reduces the number of SSTables that need to be read during queries, improving read performance. It also helps to prevent an excessive accumulation of small SSTables, which could lead to increased disk I/O.

Major Compaction: Major compaction involves merging all SSTables within a tablet into a single, consolidated SSTable. This process eliminates deleted data (tombstones) and older versions of data, reclaiming disk space and improving read efficiency.

Bigtable utilizes two levels of caching to enhance read performance:

Scan Cache: Caches key-value pairs retrieved from SSTables.
Block Cache: Caches raw byte blocks of SSTables from GFS.

Bigtable also leverages Bloom filters to efficiently determine the potential presence of keys within SSTables, reducing unnecessary disk reads.

Locality & Compression

Each locality group (a set of related column families) is stored in a separate SSTable. This design offers several advantages:

Independent reading of locality groups, enhancing throughput.
Fine-grained tuning of locality groups, enabling features like in-memory storage for frequently accessed data.

Within each SSTable, keys are divided into blocks, and each block is compressed using a user-defined compression format. This per-block compression minimizes decompression overhead during single key-value pair reads, as only the relevant block needs to be decompressed. The original Bigtable paper discusses various compression schemes, which are outside the scope of my current expertise!

Commit Log

A single commit log is used for all tablets residing on a tablet server. This approach amortizes the cost of group commits, as multiple transactions across tablets can be written to the log in a single append operation before acknowledgment.

However, the interleaving of tablet logs can pose challenges during tablet server failures. When tablets are reassigned to different servers, each server must read through the previous server's logs to retrieve the entries corresponding to its newly assigned tablets.

To mitigate this, the commit log is externally sorted before being read by the new servers.

Immutability at Play

As Pet Helland emphasized in Immutability Changes Everything, Bigtable's high write throughput relies on the immutability of SSTable files. This design ensures that once an SSTable is written, only new SSTables are created to reflect subsequent writes, effectively employing a copy-on-write approach.

Evaluation

A sign of a good paper is when the performance evaluation calls out both the good and the bad of the design.

Single Tablet Server Performance:

Random Reads: Random read performance is notably slow, achieving approximately 1,200 reads per second. The authors argued that this performance was sufficient, as it saturated the NIC bandwidth of the tablet server at the time. However, with modern NICs offering significantly higher speeds (at least 100x faster), it's crucial to re-evaluate the validity of this claim in current environments.
Writes (Random and Sequential): Both random and sequential writes exhibit strong performance, attributed to Bigtable's LSM tree-based design, which prioritizes write throughput.
Sequential Reads: Sequential reads outperform random reads, as expected.

Scalability:

Increasing the number of tablet servers enhances the system's overall throughput.
However, the scalability is not perfectly linear, due to:

Load imbalances across tablet servers.
Resource contention with other processes for CPU and network resources.

Random reads demonstrate poorer scaling, with only a 100x improvement observed for a 500x scale-up.

This diminished scaling is attributed to the amplification effect of retrieving 64 KB blocks for each random read, which places significant strain on shared network links.

Paper Review

For anyone embarking on a study of NoSQL databases, the Bigtable paper is an essential starting point. Its foundational concepts have profoundly influenced the design of subsequent systems, including Cassandra and HBase. Moreover, Bigtable's use of LSM trees inspired embedded databases like LevelDB, which in turn led to the development of RocksDB.

In terms of design complexity, Bigtable offers a relatively accessible introduction compared to systems like Dynamo. I would highly recommend going through section 9 as well as it has some very interesting lessons in system design.

Pico

Search This Blog

Paper Insights #14 - Bigtable: A Distributed Storage System for Structured Data

Bloom Filters

Implementation

SQL Databases

Classification

Applications

NoSQL Databases

Classification

Underlying Model

Transactions

Database Engines

Data Structures Behind Databases

1. B-Trees

2. LSM Trees (Log-Structured Merge Trees)

Bigtable

Data Model

Rows

Columns

Data Organization

Transactions

Architecture

Tablet Exclusive Assignment

Tablet Location Hierarchy

Tablet Internals

Splitting & Merging

Structure

Locality & Compression

Commit Log

Immutability at Play

Evaluation

Paper Review

Comments

Post a Comment

Popular Posts

Paper Insights #26 - CliqueMap: Productionizing an RMA-Based Distributed Caching System

Paper Insights #27 - Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

Paper Insights #24 - Spanner: Google's Globally-Distributed Database