Before we delve into the paper's details, I would like to introduce some background topics.
Databases, Data Warehouses, and Data Lakes
Let's start by discussing the differences between these frequently used terms in the computer science industry.
Databases
While the term database is broad, it commonly refers to transactional SQL databases (OLTP). These SQL databases prioritize transactional integrity and offer varying levels of isolation. A fundamental principle of SQL databases is that data must always maintain consistency.
In a bank's table, for example, the total balance across all accounts must remain constant regardless of how many transfers occur between them. Similarly, if two transactions, one at time T1 and another at time T2, each add $1 to every account, the expected behavior is:
Before T1: No accounts have $1 added.
After T1: All accounts have $1 added.
After T2: All accounts have $2 added.
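Below is a minimal sketch of this all-or-nothing behavior using SQLite in Python; the table and column names are made up for illustration.

import sqlite3

# Set up a toy accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 200)])

# Transaction T1: add $1 to every account atomically. Other readers see the
# balances either before the UPDATE or after it, never a half-applied state.
with conn:
    conn.execute("UPDATE accounts SET balance = balance + 1")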
Note: There are also NoSQL databases, which often sacrifice some isolation guarantees for improved performance. In this article, we will focus on SQL databases only.
Data Warehouses
Designed for Online Analytical Processing (OLAP), data warehouses prioritize historical data for analysis. While data in a data warehouse must also be consistent, it does not necessarily reflect real-time changes.
Data Lakes
Data lakes are repositories for storing large volumes of raw data in various formats (structured, semi-structured, unstructured). This raw data can then be processed through ETL pipelines (described below) to create data warehouses.
Database Storage Formats
All databases (except in-memory ones) persist data records in files on disk. The way these records are arranged within the files varies, leading to different storage formats.
Record I/O (Row-Oriented)
Record I/O organizes data by storing all column values for a single record (row) contiguously on disk. While record data might be spread across multiple files, an index structure, commonly a B-tree, facilitates efficient record location. The B-tree structure generally ensures fast index updates during writes.
Record I/O is used in write-heavy databases where write optimization is crucial.
Example Implementations - CSV and Apache Avro.
Advantages
- Efficient retrieval of specific records by directly following the index and reading the record.
Disadvantages
- Inefficient for projections, requiring multiple small I/Os to read specific columns.
- Limited compression due to the storage of diverse data types together within a row.
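As a toy illustration of row orientation, here is a sketch using Python's csv module (the file and field names are made up); note how even a single-column projection must scan entire rows:

import csv

# Write rows: all values of a record sit together on disk.
rows = [
    {"id": 1, "name": "alice", "balance": 100},
    {"id": 2, "name": "bob", "balance": 200},
]
with open("accounts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "balance"])
    writer.writeheader()
    writer.writerows(rows)

# Projecting only the "balance" column still forces reading every full row.
with open("accounts.csv", newline="") as f:
    balances = [row["balance"] for row in csv.DictReader(f)]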
Columnar (Column-Oriented)
Introduced in 2005 by Stonebraker et al. in their influential VLDB paper on C-Store, columnar storage arranges data by storing all values for a single column contiguously on disk.
Columnar storage is well-suited for read-heavy databases. It compresses each column individually and relies on header information to define column boundaries and store summary statistics (like key ranges for index columns).
Example Implementations - Apache Parquet.
Advantages
- High compression ratios are achievable due to the homogeneous data types within each column (e.g., booleans as bit vectors), reducing storage and improving I/O.
- Efficient reading of specific columns.
Disadvantages
- Difficult record reads, requiring multiple small I/Os to reconstruct a single row.
- Slower transaction processing.
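Below is a minimal columnar sketch using Apache Parquet via the pyarrow library (the library choice and file name are assumptions of this illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table; Parquet stores and compresses each column contiguously.
table = pa.table({
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "balance": [100, 200, 300],
})
pq.write_table(table, "accounts.parquet")

# A projection reads only the requested column's data from disk.
balances = pq.read_table("accounts.parquet", columns=["balance"])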
Capacitor
Announced by Google for BigQuery in 2016, the Capacitor format enhances the columnar approach. Its key innovation is splitting records into partitions and then storing these partitions in a columnar fashion. It also handles structured data efficiently.
BigQuery leverages Capacitor's ability to dynamically reorganize and re-encode data based on usage, optimizing storage and query performance through techniques like run-length and dictionary encoding.
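As a toy illustration of the two encodings mentioned above (not Capacitor's actual implementation), consider a low-cardinality string column:

from itertools import groupby

# Dictionary encoding: replace repeated strings with small integer codes.
column = ["US", "US", "US", "IN", "IN", "US"]
dictionary = {value: code for code, value in enumerate(dict.fromkeys(column))}
encoded = [dictionary[value] for value in column]  # [0, 0, 0, 1, 1, 0]

# Run-length encoding: store (value, run length) pairs instead of every value.
runs = [(value, len(list(group))) for value, group in groupby(encoded)]  # [(0, 3), (1, 2), (0, 1)]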
Another notable format is the SSTable, where keys are stored in sorted order. See Paper Insights - Bigtable: A Distributed Storage System for Structured Data to learn about SSTable and the corresponding LSM-tree structure.
Database Layers
Most databases are built as layers on top of other databases, though there are many exceptions.
At the foundation are core transaction engines, often provided as libraries, that manage transactions on the underlying data files. Examples include InnoDB and MyISAM for SQL databases, and WiredTiger, BerkeleyDB, and RocksDB for NoSQL databases. Building upon these engines are standalone databases like MySQL and MongoDB.
Above the core engines and standalone databases, a distributed layer can be implemented to spread data across multiple instances. Amazon Aurora, for instance, distributes data across MySQL instances. Finally, distributed query engines handle query processing and can operate across various types of databases. It's worth noting that standalone databases can also incorporate query engines to process queries within their own data.
Examples of exceptions are:
- PostgreSQL - Unlike MySQL, PostgreSQL does not build on pluggable storage engines; its storage and transaction engine is integrated into the database itself.
- Spanner - Unlike Amazon Aurora, Spanner distributes the core engine itself instead of building on top of standalone engines.
ETL Pipelines
These processes are used to extract, transform, and load data into data warehouses. Data sources for ETL pipelines typically include other databases, data lakes, or other data warehouses.
Let's take a couple of examples to understand how ETL is used.
Example 1: Consider an e-commerce platform that processes numerous online orders. To gain insights into sales trends, the business needs an analytical data warehouse that summarizes total revenues on an hourly basis. The raw order data might reside in various sources: transactional SQL databases, periodic batch files from external partners, and even large datasets in diverse formats within a data lake. An ETL pipeline would then be crucial for connecting to these disparate systems, extracting the relevant order information, transforming it (e.g., calculating hourly sums), and loading it into the data warehouse. The pipeline would run every 24 hours and process data for all orders in the last 24 hours.

Example 2: Consider a scenario where a system maintains a detailed analytical table with hourly revenue metrics. To facilitate higher-level analysis and reporting, a daily summary table is required. An ETL pipeline could be designed to run each day, reading the hourly data from the analytical table, aggregating it to a daily level, and then populating the daily summary table.
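A minimal sketch of Example 2 in Python using pandas, assuming the hourly table has been exported to a CSV file with "hour" and "revenue" columns (the file and column names are illustrative):

import pandas as pd

# Extract: read the hourly metrics.
hourly = pd.read_csv("hourly_revenue.csv", parse_dates=["hour"])

# Transform: roll each hour up into its calendar day and sum the revenue.
daily = (
    hourly.assign(day=hourly["hour"].dt.date)
    .groupby("day", as_index=False)["revenue"]
    .sum()
)

# Load: write the daily summary table.
daily.to_csv("daily_revenue.csv", index=False)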
Example 3: Let's consider a company that gathers customer feedback from various channels, such as online surveys, social media posts, and call center transcripts. To understand overall customer sentiment, an ETL process could extract this textual data, transform it by performing sentiment analysis, and then load the aggregated sentiment scores into a reporting database. This allows the company to track customer satisfaction trends over time.
Cloud Data Store
The paper effectively describes the design of a cloud data store. Due to its key-value nature and simple API, it has seen wider adoption than a fully-fledged distributed file system. Popular examples of cloud data stores include Google Cloud Storage, Amazon S3, and Azure Blob Storage.
Design Points
- Key-Value Store with Eventual Consistency: Functions as a key-value store with eventual consistency. Keys resemble file paths (strings) while values can be byte arrays ranging from a few kilobytes to terabytes.
- Data Immutability: In most cloud stores, data is immutable. Appends are possible but generally not optimal. Unlike a file system where appends result in adding a new block to chunk servers and updating metadata, appends in a cloud store can be less efficient.
- Data Partitioning (Buckets): Keys are partitioned into buckets, analogous to directories in a distributed file system.
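A minimal sketch of this key-value interface, using Amazon S3 through the boto3 client (the bucket and key names are made up for illustration):

import boto3

s3 = boto3.client("s3")

# Write: the key looks like a file path; the value is an opaque byte array.
s3.put_object(
    Bucket="example-bucket",
    Key="logs/2020/01/01/events.json",
    Body=b'{"event": "order_placed"}',
)

# Read the object back as bytes.
obj = s3.get_object(Bucket="example-bucket", Key="logs/2020/01/01/events.json")
data = obj["Body"].read()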
Limited Functionality
The key-value store is designed for simplicity. Transaction support, especially across multiple keys, is notably missing.
For example, consider a scenario where key A has the value 10 in the store. Transaction support would be necessary to atomically read the value of A and write back its double (20). Without it, another writer could set A to 30 after the first operation's read but before its write; the final value would then be 20, a result that no serial ordering of the two operations could produce (it should have been 30 or 60).
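The same lost update, sketched with an in-memory dictionary standing in for the store:

# Key A starts at 10.
store = {"A": 10}

value = store["A"]      # Operation 1 reads A = 10 ...
store["A"] = 30         # ... a concurrent writer sets A = 30 ...
store["A"] = value * 2  # ... and operation 1 blindly writes 20, losing the update.

# A serializable execution would have ended with 30 or 60, never 20.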
Performance Characteristics
Typical latencies range from 5 to 10 ms. Achievable throughput typically falls within the 50-100 MB/s range.
Cloud Data Store vs. Distributed File System
Most cloud applications are composed of microservices, each with distinct compute and storage components:
- Storage Layer: Comprises the underlying storage medium (e.g., disks) where data is persisted and accessed for computation.
- Compute Layer: Handles user requests, performs computations, and typically involves a replicated service running continuously.
This architecture is well-suited for low-latency, real-time services, allowing independent scaling of storage and compute resources.
Distributed file systems have gained widespread popularity since the advent of the Google File System (GFS), whose successor at Google is Colossus. A distributed file system offers a more comprehensive file system interface that is ideal for database systems. Colossus and HDFS are well-known examples of distributed file systems.
Distributed file systems have inherent compute overhead: they incorporate a compute layer consisting of chunk servers (handling read/write requests) and metadata servers (managing metadata updates). These services run constantly, consuming compute resources even when idle.
Cloud data stores, on the other hand, leverage serverless architectures like AWS Lambda. Their compute layer is invoked on demand, only when data is read or modified. This pay-per-use model is cost-effective compared to the continuous compute consumption of distributed file systems, databases, or data warehouses.
Delta Lake
In this paper, the authors aim to construct an ACID table on top of a cloud data store. This approach can be extended to any eventually consistent key-value store, provided that the key-value store is not transient (as Memcache is).
It is crucial to remember that this creates an ACID table, not an ACID database. Therefore, transactions spanning multiple tables may still violate ACID properties.
Data Storage
Delta Lake stores records in Parquet files, which are themselves stored in the cloud data store. The data files are organized into partitioned directories, for example:

table/date1/f1.data
            f2.data
            ...
     /date2/f1.data
            f2.data
            ...

Transactions
To handle data modifications, a log-based approach is employed. When a record is updated, the existing Parquet file is not directly altered. Instead:
- A new Parquet file is created to accommodate the changes.
- The original file is soft-deleted, meaning it's marked as inactive but not immediately removed.
Log records maintain a history of additions and deletions; each log entry records the addition or removal of specific Parquet files. To enhance system stability and performance, these log records are periodically snapshotted into checkpoints.
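The log itself lives in the same object store as the data, under a _delta_log directory as described in the paper. A simplified illustration of the resulting layout (file names and contents are illustrative):

table/date1/f1.data ...
     /date2/f1.data ...
     /_delta_log/000000.json                 (log record: adds f1.data, f2.data, ...)
                 000001.json                 (log record: removes a file, adds its replacement)
                 000010.checkpoint.parquet   (periodic checkpoint of the log so far)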
To further optimize performance, data files are partitioned (by date in the layout above). Partitioning limits the number of files that a query or an update needs to touch, improving the efficiency of the aforementioned operations.
Read-Only Transactions
By default, all reads are performed as snapshot reads. Serializable reads are not natively supported. To achieve serializable read behavior, a workaround is necessary: a dummy write operation must be executed.
Performance
Because every write must append the next record to a single log, the system effectively commits only one write transaction at a time, significantly limiting write throughput. This limitation may be acceptable if the table is intended for environments with low transaction volumes.
Query Processing
As noted earlier, Parquet files store data in a columnar format. Each Parquet file includes summary statistics for each column, enabling efficient query processing by allowing the system to quickly identify and skip files that do not contain the relevant data. This approach is particularly effective when data is sorted by the queried column.
However, if data records are not sorted based on the query column, performing range queries on that column can become significantly more expensive.
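A sketch of this file-skipping idea in Python, using pyarrow to read the min/max statistics in each Parquet footer (the file names, column name, and query range are illustrative):

import pyarrow.parquet as pq

def file_may_contain(path, column, lo, hi):
    # Inspect per-row-group statistics stored in the Parquet footer.
    meta = pq.ParquetFile(path).metadata
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for ci in range(row_group.num_columns):
            chunk = row_group.column(ci)
            if chunk.path_in_schema != column:
                continue
            stats = chunk.statistics
            if stats is None or not stats.has_min_max:
                return True  # No statistics: we cannot rule the file out.
            if stats.min <= hi and stats.max >= lo:
                return True  # This row group overlaps the queried range.
    return False

# Only scan files whose statistics overlap the range ts in [100, 200].
files_to_scan = [f for f in ["f1.parquet", "f2.parquet"]
                 if file_may_contain(f, "ts", 100, 200)]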
Use a Secondary Index?
Building and maintaining a secondary index requires cross-table transactions, which are not supported by Delta Lake.
Solution: Z-Ordering
Z-ordering is a technique used to optimize the physical layout of data records within Parquet files. By intelligently sorting data based on multiple attributes, Z-ordering enhances data locality.
Instead of sorting data along a single dimension, Z-ordering considers multiple attributes simultaneously. It aims to group together data points that are spatially close to each other in the multi-dimensional space.
Consider a 2D space with points represented by (X, Y) coordinates.
A simple Z-order might traverse the space in the following order:
(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (1, 4), (2, 3), (2, 4),
(3, 1), (3, 2), (4, 1), (4, 2), (3, 3), (3, 4), (4, 3), (4, 4)
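This particular order is a Morton (bit-interleaving) curve. The sketch below reproduces it by interleaving the bits of the zero-based coordinates, with the X bit placed above the Y bit at each level (an assumption made to match the listing above):

def morton_key(x, y, bits=16):
    # Interleave the bits of x and y; the x bit is the higher bit of each pair.
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i + 1)
        key |= ((y >> i) & 1) << (2 * i)
    return key

points = [(x, y) for x in range(1, 5) for y in range(1, 5)]
# Sort on zero-based coordinates so the curve starts at (1, 1).
points.sort(key=lambda p: morton_key(p[0] - 1, p[1] - 1))
print(points)  # Matches the traversal order listed above.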
Z-ordering significantly reduces the search space when filtering on multiple attributes, leading to faster query execution. By grouping related data together, Z-ordering improves data locality, which can lead to better data compression and faster read/write operations.
Z-ordering can be extended to higher dimensions, making it suitable for complex datasets with multiple attributes.
Alternative Considerations
Alternative approaches to building ACID stores on top of cloud data stores often involve employing a separate transactional store for object metadata, analogous to metadata servers in distributed file systems. However, Delta Lake addresses this challenge by seamlessly integrating metadata management within the object store itself.
Related Systems
The paper introduces several key cloud systems and technologies:
Data Warehousing
- Google Cloud BigQuery: A data warehouse service built on top of Dremel, leveraging Colossus for distributed storage. It offers columnar storage and separates storage from processing for scalability.
- Amazon Redshift: A columnar-oriented data warehouse similar to BigQuery.
- Snowflake: A cloud-agnostic data warehouse with a multi-cluster, shared-data architecture.
- Apache Hive: An open-source data warehouse built on Hadoop, providing a SQL-like query interface.
Columnar Data Formats
- Apache Parquet: A widely used columnar storage format offering efficient compression and encoding schemes.
- Apache Iceberg: An open table format for analytic tables, enabling SQL-based queries across engines like Spark, Flink, Hive, and Impala.
NoSQL Databases
- Apache Kudu: A columnar-oriented data store within the Hadoop ecosystem, optimized for OLAP workloads and utilizing a relational schema.
- Apache HBase: A non-relational, distributed database supporting wide-column storage, inspired by Google Bigtable.
Query Service
- Amazon Redshift Spectrum: Enables querying data directly in Amazon S3 without prior loading into Redshift, decoupling storage from processing and optimizing costs.
- Amazon Athena: An interactive query service that allows users to analyze data directly in Amazon S3 using standard SQL.
Competitor
- Apache Hudi: An open-source framework for managing data lakes, offering features like transactions, upserts, deletes, indexing, clustering, and compaction, positioning it as a direct competitor to Delta Lake.
Paper Review
While the core concept of the paper seems straightforward, the actual implementation likely presented significant engineering challenges. Given the problem statement, I initially anticipated a design where table metadata would be managed by a strongly consistent service.
My primary concern revolves around the use of a single log file. This centralized approach could potentially introduce a single point of contention, limiting concurrency and making optimistic concurrency control difficult or inefficient.
The paper devotes a considerable portion (two pages) to discussing business use cases for Delta Lake, specifically emphasizing the need for millions of partitions. While I acknowledge the potential for such scenarios, I remain somewhat skeptical about the practical frequency of these extreme use cases.