
Paper Insights #13 - Delta Lake: High Performance ACID Table Storage over Cloud Object Stores

At the 2020 VLDB conference, a notable paper was presented by Michael Armbrust (Databricks), with co-authors including CEO Ali Ghodsi and Matei Zaharia.


Before we delve into the paper's details, I would like to introduce some topics to readers.

Cloud Data Store

The paper effectively describes the design of a cloud data store. Due to its key-value nature and simple API, the cloud data store has seen wider adoption than fully-fledged distributed file systems. Popular examples of cloud data stores include Google Cloud Storage, Amazon S3, and Azure Blob Storage.

Design Points

  • Key-Value Store with Eventual Consistency: A cloud data store is a key-value store with eventual consistency. Keys resemble file paths (strings), while values are byte arrays ranging from a few kilobytes to terabytes.
  • Data Immutability: In most cloud stores, data is immutable. Appends are possible but generally not optimal: unlike a file system, where an append simply adds a new block on the chunk servers and updates metadata, an append in a cloud store is less efficient.
  • Data Partitioning (Buckets): Keys are partitioned into buckets, analogous to directories in a distributed file system.

Limited Functionality

The key-value store is designed for simplicity. Transaction support, especially across multiple keys, is notably absent. For example, consider a scenario where key A has the value 10 in the store. Transaction support would be necessary to atomically read the value of A and then write its double (20) back to the store. Without transaction support, another write operation could concurrently modify the value of A to 30 before the first operation completes, leading to an incorrect result (20 instead of 60).
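As a rough illustration, here is a minimal Python sketch of that race, written against a toy in-memory store (the FakeKV class and its get/put methods are stand-ins, not a real cloud API); without a compare-and-swap or transaction primitive, the read-modify-write can silently produce the wrong result:

import threading

# A stand-in for a cloud key-value store: only get and put, no transactions.
class FakeKV:
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

kv = FakeKV()
kv.put("A", 10)

def double_a():
    # Non-atomic read-modify-write: read A, then write back 2 * A.
    value = kv.get("A")
    kv.put("A", value * 2)

def set_a_to_30():
    kv.put("A", 30)

# If set_a_to_30 runs between double_a's read and its write, the final
# value is 20; a serial execution would have produced 60 (or 30).
t1 = threading.Thread(target=double_a)
t2 = threading.Thread(target=set_a_to_30)
t1.start(); t2.start()
t1.join(); t2.join()
print(kv.get("A"))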

Performance Characteristics

Typical latencies range from 5 to 10 ms. Achievable throughput typically falls within the 50-100 MB/s range.

Databases, Data Warehouses, and Data Lakes

Databases

While the term database is broad, it commonly refers to transactional databases (OLTP). These databases, typically SQL-based, prioritize transactional integrity and offer varying levels of isolation. NoSQL databases, on the other hand, often sacrifice some isolation guarantees for improved performance. A fundamental principle of databases is that data must always maintain consistency.

Example of Consistency

In a bank's table, the total balance across all accounts must remain constant regardless of the number of transfers. If two transactions, one at time T1 and another at time T2, each add $1 to every account, the expected behavior is:

Before T1: No accounts have $1 added.
After T1: All accounts have $1 added.
After T2: All accounts have $2 added.

Consistency in databases has time semantics: the effects of T1 and T2 become visible to all readers at well-defined points in time.

Data Warehouses

Designed for Online Analytical Processing (OLAP), data warehouses prioritize historical data for analysis. While data in a data warehouse must be consistent, it does not necessarily reflect real-time changes. In the example above, balances might not immediately reflect the latest transactions, but consistency is maintained. All accounts will reflect the same updated balance at a later point in time.

Data Lakes

Data lakes are repositories for storing large volumes of raw data in various formats (structured, semi-structured, unstructured). This raw data can then be processed through ETL pipelines to create data warehouses.

Data Pipelines (ETL)

These processes are used to extract, transform, and load data into data warehouses. Data sources for ETL pipelines typically include other databases, data lakes, or other data warehouses.

The Goal: Building a Database on top of a Cloud Data Store

In this paper, the authors aim to construct an ACID table on top of a cloud data store. This approach can be extended to any eventually consistent key-value store, provided the store is durable rather than transient (unlike Memcached).

It is crucial to remember that this creates an ACID table, not an ACID database. Therefore, transactions spanning multiple tables may still violate ACID properties.

Cloud Data Store v/s Distributed File System

Distributed file systems have gained widespread popularity since the advent of the Google File System (GFS, whose successor is Colossus). A distributed file system offers a more comprehensive file system interface that is ideal for database systems. Colossus and HDFS are well-known examples of distributed file systems.

A fundamental question arises: Why build upon a cloud data store instead of a distributed file system? To answer this, we must understand the distinction between compute and storage layers in cloud environments.

Cloud Computing Architecture

Most cloud applications are composed of microservices, each with distinct compute and storage components:
  • Storage Layer: Comprises the underlying storage medium (e.g., disks) where data is persisted and accessed for computation.
  • Compute Layer: Handles user requests, performs computations, and typically involves a replicated service running continuously.

This architecture is well-suited for low-latency, real-time services, allowing independent scaling of storage and compute resources.

Distributed file systems have inherent compute overhead: they incorporate a compute layer consisting of chunk servers (handling read/write requests) and metadata servers (managing metadata updates). These services run constantly, consuming compute resources even when idle.

Cloud data stores, on the other hand, are serverless (in the spirit of AWS Lambda): the compute layer is invoked on demand, only when data is read or modified. This pay-per-use model is more cost-effective than the continuous compute consumption of distributed file systems, databases, or data warehouses.

With the goal in mind, let's see how the authors approached this challenge.

Data Format

Tabular data, inherently two-dimensional with rows and columns, is typically serialized into files – continuous sequences of bytes – for storage. Two primary serialization strategies exist:
  • Row-oriented Encoding: Common examples include CSV, PostgreSQL's heap storage, and Apache Avro.
  • Column-oriented Encoding: Popular choices include Parquet, ORC, Arrow, and DuckDB's internal format. Columnar formats often achieve superior compression, leading to improved performance.

Another notable format is the SSTable, where keys are stored in sorted order.

Apache Parquet uses Apache Thrift, a serialization protocol comparable to Google Protocol Buffers, to encode its file metadata. Delta Lake stores records in Parquet files.
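As a quick illustration, here is a minimal sketch using the pyarrow library (one of several Parquet implementations; the file and column names are made up) that writes a small table to Parquet and reads back a single column:

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small two-column table in memory.
table = pa.table({
    "account_id": [1, 2, 3],
    "balance":    [100.0, 250.5, 75.25],
})

# Serialize it to a Parquet file; values are laid out column by column,
# which enables good compression and per-column statistics.
pq.write_table(table, "accounts.parquet")

# Columnar layout makes it cheap to read back only the columns a query needs.
balances = pq.read_table("accounts.parquet", columns=["balance"])
print(balances)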

Transactions

A Delta table is laid out in the object store as a collection of Parquet data files under a common prefix, organized into partitions (for example, by date):

table/date1/f1.data
            f2.data
            ...
     /date2/f1.data
            f2.data
            ...

To handle data modifications, a log-based approach is employed. When a record is updated, the existing Parquet file is not directly altered. Instead:
  • A new Parquet file is created to accommodate the changes.
  • The original file is soft-deleted, meaning it's marked as inactive but not immediately removed.

Log records maintain a history of additions and deletions. Each log entry corresponds to a specific Parquet file. To enhance system stability and performance, these log records are periodically snapshotted into checkpoints.
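A minimal sketch of the log-replay idea (simplified, and not the exact Delta Lake log schema): each commit appends a list of add/remove actions naming Parquet files, and replaying the log in order yields the set of files that make up the current table snapshot.

# Simplified log: each commit is a list of add/remove actions on data files.
log = [
    [{"add": "date1/f1.parquet"}, {"add": "date1/f2.parquet"}],     # commit 0
    [{"add": "date2/f1.parquet"}],                                   # commit 1
    [{"remove": "date1/f1.parquet"}, {"add": "date1/f3.parquet"}],   # commit 2: an update
]

def snapshot(log_records):
    """Replay add/remove actions in order to compute the live data files."""
    live = set()
    for record in log_records:
        for action in record:
            if "add" in action:
                live.add(action["add"])
            if "remove" in action:
                live.discard(action["remove"])
    return live

print(sorted(snapshot(log)))
# ['date1/f2.parquet', 'date1/f3.parquet', 'date2/f1.parquet']

A checkpoint is simply the materialized result of such a replay up to some commit, so that readers do not have to re-process the entire log from the beginning.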

To further optimize performance, data files are partitioned (for example, by date, as in the layout above). Partitioning lets reads and updates touch only the files in the relevant partitions.

Read-Only Transactions

By default, all reads are performed as snapshot reads. Serializable reads are not natively supported. To achieve serializable read behavior, a workaround is necessary: a dummy write operation must be executed.

Performance

Because each commit must atomically create the next log record, only one transaction can commit at a time, which significantly limits write throughput. This limitation may be acceptable if the table is used in environments with low transaction volumes.
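The commit step can be pictured roughly as follows (a sketch; put_if_absent is a hypothetical atomic "create if not exists" primitive, which some object stores lack and must emulate with a separate coordination service):

class ConflictError(Exception):
    pass

def commit(store, table_version, actions):
    """Try to publish `actions` as the next log record.

    Only one writer can create record number table_version + 1; the loser
    must re-read the log, re-validate its changes, and retry.
    """
    next_version = table_version + 1
    log_key = f"table/_log/{next_version:020d}.json"  # illustrative key layout
    if store.put_if_absent(log_key, actions):
        return next_version            # this writer won the race
    raise ConflictError("another writer committed first; retry the transaction")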

Querying

As noted earlier, Parquet files store data in a columnar format. Each Parquet file includes summary statistics for each column, enabling efficient query processing by allowing the system to quickly identify and skip files that do not contain the relevant data. This approach is particularly effective when data is sorted by the queried column.

However, if data records are not sorted based on the query column, performing range queries on that column can become significantly more expensive.
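For example, here is a minimal sketch using pyarrow (file and column names are made up) that inspects the per-column min/max statistics in a Parquet footer and keeps only the files that could possibly satisfy a range predicate:

import pyarrow.parquet as pq

def file_may_contain(path, column, lo, hi):
    """Return False only if statistics prove no row can match lo <= column <= hi."""
    metadata = pq.ParquetFile(path).metadata
    column_index = metadata.schema.names.index(column)
    for rg in range(metadata.num_row_groups):
        stats = metadata.row_group(rg).column(column_index).statistics
        if stats is None or not stats.has_min_max:
            return True   # no statistics: cannot skip safely
        if stats.min <= hi and stats.max >= lo:
            return True   # this row group may overlap the query range
    return False

# Skip-scan: only open the files whose statistics overlap the predicate.
files = ["part-0.parquet", "part-1.parquet"]
candidates = [f for f in files if file_may_contain(f, "balance", 100, 200)]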

Use a Secondary Index?

Building and maintaining a secondary index requires cross-table transactions, which are not supported by Delta Lake.

Solution: Z-Ordering

Z-ordering is a technique used to optimize the physical layout of data records within Parquet files. By intelligently sorting data based on multiple attributes, Z-ordering enhances data locality.

Instead of sorting data along a single dimension, Z-ordering considers multiple attributes simultaneously. It aims to group together data points that are spatially close to each other in the multi-dimensional space.

Consider a 2D space with points represented by (X, Y) coordinates.


A simple Z-order might traverse the space in the following order:

(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (1, 4), (2, 3), (2, 4),
(3, 1), (3, 2), (4, 1), (4, 2), (3, 3), (3, 4), (4, 3), (4, 4)

Z-ordering significantly reduces the search space when filtering on multiple attributes, leading to faster query execution. By grouping related data together, Z-ordering improves data locality, which can lead to better data compression and faster read/write operations.

Z-ordering can be extended to higher dimensions, making it suitable for complex datasets with multiple attributes.
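A common way to realize a Z-order is to interleave the bits of each attribute into a single Morton code and sort records by that code. The sketch below (assuming small non-negative integer keys, and only two dimensions) illustrates the idea:

def morton_code(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order (Morton) value."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return code

# Sorting a 4x4 grid by Morton code traverses it in a Z-shaped pattern similar
# to the example above (the exact order depends on which attribute contributes
# the low-order bits).
points = [(x, y) for x in range(4) for y in range(4)]
points.sort(key=lambda p: morton_code(p[0], p[1]))
print(points)

Records with nearby Morton codes end up in the same Parquet file, so a filter on either X or Y (or both) can skip most files using their min/max statistics.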

Alternatives

Alternative approaches to building ACID stores on top of cloud data stores often involve employing a separate transactional store for object metadata, analogous to metadata servers in distributed file systems. However, Delta Lake addresses this challenge by seamlessly integrating metadata management within the object store itself.

Related Systems

The paper introduces several key cloud systems and technologies:

Data Warehousing

  • Google Cloud BigQuery: A data warehouse service built on top of Dremel, leveraging Colossus for distributed storage. It offers columnar storage and separates storage from processing for scalability.
  • Amazon Redshift: A columnar-oriented data warehouse similar to BigQuery.
  • Snowflake: A cloud-agnostic data warehouse with a multi-cluster shared architecture.
  • Apache Hive: An open-source data warehouse built on Hadoop, providing a SQL-like query interface.

Columnar Data Formats

  • Apache Parquet: A widely used columnar storage format offering efficient compression and encoding schemes.
  • Apache Iceberg: An open table format for analytic tables, enabling SQL-based queries across various engines like Spark, Flink, Hive, and Impala.

NoSQL Databases

  • Apache Kudu: A columnar-oriented data store within the Hadoop ecosystem, optimized for OLAP workloads and utilizing a relational schema.
  • Apache HBase: A non-relational, distributed database supporting wide-column storage, inspired by Google Bigtable.

Query Service

  • Amazon Redshift Spectrum: Enables querying data directly in Amazon S3 without prior loading into Redshift, decoupling storage from processing and optimizing costs.
  • AWS Athena: An interactive query service that allows users to analyze data directly in Amazon S3 using standard SQL.

Competitor

  • Apache Hudi: An open-source framework for managing data lakes, offering features like transactions, upserts, deletes, indexing, clustering, and compaction, positioning it as a direct competitor to Delta Lake.

Final Thoughts

While the core concept of the paper seems straightforward, the actual implementation likely presented significant engineering challenges. Given the problem statement, I initially anticipated a design where table metadata would be managed by a strongly consistent service.

My primary concern revolves around the use of a single log file. This centralized approach could potentially introduce a single point of contention, limiting concurrency and making optimistic concurrency control difficult or inefficient.

The paper devotes a considerable portion (two pages) to discussing business use cases for Delta Lake, specifically emphasizing the need for millions of partitions. While I acknowledge the potential for such scenarios, I remain somewhat skeptical about the practical frequency of these extreme use cases.
