Before we delve into the paper's details, I would like to introduce some background topics.
Databases, Data Warehouses, and Data Lakes
Let's start by discussing the differences between these frequently used terms in the computer science industry.
Databases
While the term database is broad, it commonly refers to transactional SQL databases (OLTP). These SQL databases prioritize transactional integrity and offer varying levels of isolation. A fundamental principle of SQL databases is that data must always maintain consistency.
In a bank's table, for example, the total balance across all accounts must remain constant regardless of how many transfers occur between them. Similarly, if two transactions, one at time T1 and another at time T2, each add $1 to every account, the expected behavior is:
Before T1: No accounts have $1 added.
After T1: All accounts have $1 added.
After T2: All accounts have $2 added.
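Below is a minimal sketch of this all-or-nothing behavior using SQLite in Python; the table and column names are made up for illustration.

import sqlite3

# Set up a toy accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 200)])

# Transaction T1: add $1 to every account atomically. Other readers see the
# balances either before the UPDATE or after it, never a half-applied state.
with conn:
    conn.execute("UPDATE accounts SET balance = balance + 1")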
Note: There are also NoSQL databases, which often sacrifice some isolation guarantees for improved performance. In this article, we will focus on SQL databases only.
Data Warehouses
Designed for Online Analytical Processing (OLAP), data warehouses prioritize historical data for analysis. While data in a data warehouse must also be consistent, it does not necessarily reflect real-time changes.
Data Lakes
Data lakes are repositories for storing large volumes of raw data in various formats (structured, semi-structured, unstructured). This raw data can then be processed through ETL pipelines (described below) to create data warehouses.
Database Storage Formats
All databases (except in-memory ones) persist data records in files on disk. The way these records are arranged within the files varies, leading to different storage formats.
Record I/O (Row-Oriented)
Record I/O organizes data by storing all column values for a single record (row) contiguously on disk. While record data might be spread across multiple files, an index structure, commonly a B-tree, facilitates efficient record location. The B-tree structure generally ensures fast index updates during writes.
Record I/O is used in write-heavy databases where write optimization is crucial.
Example Implementations - CSV and Apache Avro.
Advantages
- Efficient retrieval of specific records by directly following the index and reading the record.
Disadvantages
- Inefficient for projections, requiring multiple small I/Os to read specific columns.
- Limited compression due to the storage of diverse data types together within a row.
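As a toy illustration of row orientation, here is a sketch using Python's csv module (the file and field names are made up); note how even a single-column projection must scan entire rows:

import csv

# Write rows: all values of a record sit together on disk.
rows = [
    {"id": 1, "name": "alice", "balance": 100},
    {"id": 2, "name": "bob", "balance": 200},
]
with open("accounts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "balance"])
    writer.writeheader()
    writer.writerows(rows)

# Projecting only the "balance" column still forces reading every full row.
with open("accounts.csv", newline="") as f:
    balances = [row["balance"] for row in csv.DictReader(f)]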
Columnar (Column-Oriented)
Introduced in 2005 by Stonebraker et al. in their influential VLDB paper on C-Store, columnar storage arranges data by storing all values for a single column contiguously on disk.
Columnar storage is well-suited for read-heavy databases. It compresses each column individually and relies on header information to define column boundaries and store summary statistics (like key ranges for index columns).
Example Implementations - Apache Parquet.
Advantages
- High compression ratios are achievable due to the homogeneous data types within each column (e.g., booleans as bit vectors), reducing storage and improving I/O.
- Efficient reading of specific columns.
Disadvantages
- Difficult record reads, requiring multiple small I/Os to reconstruct a single row.
- Slower transaction processing.
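Below is a minimal columnar sketch using Apache Parquet via the pyarrow library (the library choice and file name are assumptions of this illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table; Parquet stores and compresses each column contiguously.
table = pa.table({
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "balance": [100, 200, 300],
})
pq.write_table(table, "accounts.parquet")

# A projection reads only the requested column's data from disk.
balances = pq.read_table("accounts.parquet", columns=["balance"])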
Capacitor
Announced by Google for BigQuery in 2016, the Capacitor format enhances the columnar approach. Its key innovation is splitting records into partitions and then storing these partitions in a columnar fashion. It also handles structured data efficiently.
BigQuery leverages Capacitor's ability to dynamically reorganize and re-encode data based on usage, optimizing storage and query performance through techniques like run-length and dictionary encoding.
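As a toy illustration of the two encodings mentioned above (not Capacitor's actual implementation), consider a low-cardinality string column:

from itertools import groupby

# Dictionary encoding: replace repeated strings with small integer codes.
column = ["US", "US", "US", "IN", "IN", "US"]
dictionary = {value: code for code, value in enumerate(dict.fromkeys(column))}
encoded = [dictionary[value] for value in column]  # [0, 0, 0, 1, 1, 0]

# Run-length encoding: store (value, run length) pairs instead of every value.
runs = [(value, len(list(group))) for value, group in groupby(encoded)]  # [(0, 3), (1, 2), (0, 1)]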
Another notable format is the SSTable, where keys are stored in sorted order. See Paper Insights - Bigtable: A Distributed Storage System for Structured Data to learn about SSTable and the corresponding LSM-tree structure.
Database Layers
Most databases are built as layers on top of other databases, though there are many exceptions.
At the foundation are core transaction engines, often provided as libraries, that manage transactions on the underlying data files. Examples include InnoDB and MyISAM for SQL databases, and WiredTiger, BerkeleyDB, and RocksDB for NoSQL databases. Building upon these engines are standalone databases like MySQL and MongoDB.
Above the core engines and standalone databases, a distributed layer can be implemented to spread data across multiple instances. Amazon Aurora, for instance, distributes data across MySQL instances. Finally, distributed query engines handle query processing and can operate across various types of databases. It's worth noting that standalone databases can also incorporate query engines to process queries within their own data.
Examples of exceptions are:
- PostgreSQL - Unlike MySQL, PostgreSQL does not build on pluggable storage engines; its storage and transaction engine is integrated into the database itself.
- Spanner - Unlike Amazon Aurora, Spanner distributes the core engine itself instead of building on top of standalone engines.
ETL Pipelines
These processes are used to extract, transform, and load data into data warehouses. Data sources for ETL pipelines typically include other databases, data lakes, or other data warehouses.
Let's take a couple of examples to understand how ETL is used.
Example 1: Consider an e-commerce platform that processes numerous online orders. To gain insights into sales trends, the business needs an analytical data warehouse that summarizes total revenues on an hourly basis. The raw order data might reside in various sources: transactional SQL databases, periodic batch files from external partners, and even large datasets in diverse formats within a data lake. An ETL pipeline would then be crucial for connecting to these disparate systems, extracting the relevant order information, transforming it (e.g., calculating hourly sums), and loading it into the data warehouse. The pipeline would run every 24 hours and process data for all orders in the last 24 hours.

Example 2: Consider a scenario where a system maintains a detailed analytical table with hourly revenue metrics. To facilitate higher-level analysis and reporting, a daily summary table is required. An ETL pipeline could be designed to run each day, reading the hourly data from the analytical table, aggregating it to a daily level, and then populating the daily summary table.
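A minimal sketch of Example 2 in Python using pandas, assuming the hourly table has been exported to a CSV file with "hour" and "revenue" columns (the file and column names are illustrative):

import pandas as pd

# Extract: read the hourly metrics.
hourly = pd.read_csv("hourly_revenue.csv", parse_dates=["hour"])

# Transform: roll each hour up into its calendar day and sum the revenue.
daily = (
    hourly.assign(day=hourly["hour"].dt.date)
    .groupby("day", as_index=False)["revenue"]
    .sum()
)

# Load: write the daily summary table.
daily.to_csv("daily_revenue.csv", index=False)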
Example 3: Let's consider a company that gathers customer feedback from various channels, such as online surveys, social media posts, and call center transcripts. To understand overall customer sentiment, an ETL process could extract this textual data, transform it by performing sentiment analysis, and then load the aggregated sentiment scores into a reporting database. This allows the company to track customer satisfaction trends over time.
Cloud Data Store
The paper effectively describes the design of a cloud data store. Due to its key-value nature and simple API, it has seen wider adoption than a fully-fledged distributed file system. Popular examples of cloud data stores include Google Cloud Storage, Amazon S3, and Azure Blob Storage.
Design Points
- Key-Value Store with Eventual Consistency: Functions as a key-value store with eventual consistency. Keys resemble file paths (strings) while values can be byte arrays ranging from a few kilobytes to terabytes.
- Data Immutability: In most cloud stores, data is immutable. Appends are possible but generally not optimal. Unlike a file system where appends result in adding a new block to chunk servers and updating metadata, appends in a cloud store can be less efficient.
- Data Partitioning (Buckets): Keys are partitioned into buckets, analogous to directories in a distributed file system.
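A minimal sketch of this key-value interface, using Amazon S3 through the boto3 client (the bucket and key names are made up for illustration):

import boto3

s3 = boto3.client("s3")

# Write: the key looks like a file path; the value is an opaque byte array.
s3.put_object(
    Bucket="example-bucket",
    Key="logs/2020/01/01/events.json",
    Body=b'{"event": "order_placed"}',
)

# Read the object back as bytes.
obj = s3.get_object(Bucket="example-bucket", Key="logs/2020/01/01/events.json")
data = obj["Body"].read()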
Limited Functionality
The key-value store is designed for simplicity. Transaction support, especially across multiple keys, is notably missing.
For example, consider a scenario where key A has the value 10 in the store. Transaction support would be necessary to atomically read the value of A and write back its double (20). Without it, another writer could set A to 30 after the first operation's read but before its write; the final value would then be 20, a result that no serial ordering of the two operations could produce (it should have been 30 or 60).
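The same lost update, sketched with an in-memory dictionary standing in for the store:

# Key A starts at 10.
store = {"A": 10}

value = store["A"]      # Operation 1 reads A = 10 ...
store["A"] = 30         # ... a concurrent writer sets A = 30 ...
store["A"] = value * 2  # ... and operation 1 blindly writes 20, losing the update.

# A serializable execution would have ended with 30 or 60, never 20.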
Performance Characteristics
Typical latencies range from 5 to 10 ms. Achievable throughput typically falls within the 50-100 MB/s range.
Cloud Data Store vs. Distributed File System
Most cloud applications are composed of microservices, each with distinct compute and storage components:
- Storage Layer: Comprises the underlying storage medium (e.g., disks) where data is persisted and accessed for computation.
- Compute Layer: Handles user requests, performs computations, and typically involves a replicated service running continuously.
This architecture is well-suited for low-latency, real-time services, allowing independent scaling of storage and compute resources.
Distributed file systems have gained widespread popularity since the advent of the Google File System (GFS), whose successor at Google is Colossus. A distributed file system offers a more comprehensive file system interface that is ideal for database systems. Colossus and HDFS are well-known examples of distributed file systems.
Distributed file systems have inherent compute overhead: they incorporate a compute layer consisting of chunk servers (handling read/write requests) and metadata servers (managing metadata updates). These services run constantly, consuming compute resources even when idle.
Cloud data stores, on the other hand, leverage serverless architectures like AWS Lambda. Their compute layer is invoked on demand, only when data is read or modified. This pay-per-use model is cost-effective compared to the continuous compute consumption of distributed file systems, databases, or data warehouses.
Delta Lake
In this paper, the authors aim to construct an ACID table on top of a cloud data store. This approach can be extended to any eventually consistent key-value store, provided that the key-value store is not transient (as Memcache is).
It is crucial to remember that this creates an ACID table, not an ACID database. Therefore, transactions spanning multiple tables may still violate ACID properties.
Data Storage
Delta Lake stores records in Parquet files, which are themselves stored in the cloud data store. The data files are organized into partitioned directories, for example:

table/date1/f1.data
            f2.data
            ...
     /date2/f1.data
            f2.data
            ...

Transactions
To handle data modifications, a log-based approach is employed. When a record is updated, the existing Parquet file is not directly altered. Instead:
- A new Parquet file is created to accommodate the changes.
- The original file is soft-deleted, meaning it's marked as inactive but not immediately removed.
Log records maintain a history of additions and deletions; each log entry records the addition or removal of specific Parquet files. To enhance system stability and performance, these log records are periodically snapshotted into checkpoints.
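The log itself lives in the same object store as the data, under a _delta_log directory as described in the paper. A simplified illustration of the resulting layout (file names and contents are illustrative):

table/date1/f1.data ...
     /date2/f1.data ...
     /_delta_log/000000.json                 (log record: adds f1.data, f2.data, ...)
                 000001.json                 (log record: removes a file, adds its replacement)
                 000010.checkpoint.parquet   (periodic checkpoint of the log so far)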
To further optimize performance, data files are partitioned (by date in the layout above). Partitioning limits the number of files that a query or an update needs to touch, improving the efficiency of the aforementioned operations.
Read-Only Transactions
By default, all reads are performed as snapshot reads. Serializable reads are not natively supported. To achieve serializable read behavior, a workaround is necessary: a dummy write operation must be executed.
Performance
Because every write must append the next record to a single log, the system effectively commits only one write transaction at a time, significantly limiting write throughput. This limitation may be acceptable if the table is intended for environments with low transaction volumes.
Query Processing
As noted earlier, Parquet files store data in a columnar format. Each Parquet file includes summary statistics for each column, enabling efficient query processing by allowing the system to quickly identify and skip files that do not contain the relevant data. This approach is particularly effective when data is sorted by the queried column.
However, if data records are not sorted based on the query column, performing range queries on that column can become significantly more expensive.
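A sketch of this file-skipping idea in Python, using pyarrow to read the min/max statistics in each Parquet footer (the file names, column name, and query range are illustrative):

import pyarrow.parquet as pq

def file_may_contain(path, column, lo, hi):
    # Inspect per-row-group statistics stored in the Parquet footer.
    meta = pq.ParquetFile(path).metadata
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for ci in range(row_group.num_columns):
            chunk = row_group.column(ci)
            if chunk.path_in_schema != column:
                continue
            stats = chunk.statistics
            if stats is None or not stats.has_min_max:
                return True  # No statistics: we cannot rule the file out.
            if stats.min <= hi and stats.max >= lo:
                return True  # This row group overlaps the queried range.
    return False

# Only scan files whose statistics overlap the range ts in [100, 200].
files_to_scan = [f for f in ["f1.parquet", "f2.parquet"]
                 if file_may_contain(f, "ts", 100, 200)]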
Use a Secondary Index?
Building and maintaining a secondary index requires cross-table transactions, which are not supported by Delta Lake.
Solution: Z-Ordering
Z-ordering is a technique used to optimize the physical layout of data records within Parquet files. By intelligently sorting data based on multiple attributes, Z-ordering enhances data locality.
Instead of sorting data along a single dimension, Z-ordering considers multiple attributes simultaneously. It aims to group together data points that are spatially close to each other in the multi-dimensional space.
Consider a 2D space with points represented by (X, Y) coordinates.
A simple Z-order might traverse the space in the following order:
(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (1, 4), (2, 3), (2, 4),
(3, 1), (3, 2), (4, 1), (4, 2), (3, 3), (3, 4), (4, 3), (4, 4)
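This particular order is a Morton (bit-interleaving) curve. The sketch below reproduces it by interleaving the bits of the zero-based coordinates, with the X bit placed above the Y bit at each level (an assumption made to match the listing above):

def morton_key(x, y, bits=16):
    # Interleave the bits of x and y; the x bit is the higher bit of each pair.
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i + 1)
        key |= ((y >> i) & 1) << (2 * i)
    return key

points = [(x, y) for x in range(1, 5) for y in range(1, 5)]
# Sort on zero-based coordinates so the curve starts at (1, 1).
points.sort(key=lambda p: morton_key(p[0] - 1, p[1] - 1))
print(points)  # Matches the traversal order listed above.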
Z-ordering significantly reduces the search space when filtering on multiple attributes, leading to faster query execution. By grouping related data together, Z-ordering improves data locality, which can lead to better data compression and faster read/write operations.
Z-ordering can be extended to higher dimensions, making it suitable for complex datasets with multiple attributes.
Alternative Considerations
Alternative approaches to building ACID stores on top of cloud data stores often involve employing a separate transactional store for object metadata, analogous to metadata servers in distributed file systems. However, Delta Lake addresses this challenge by seamlessly integrating metadata management within the object store itself.
Related Systems
The paper introduces several key cloud systems and technologies:
Data Warehousing
- Google Cloud BigQuery: A data warehouse service built on top of Dremel, leveraging Colossus for distributed storage. It offers columnar storage and separates storage from processing for scalability.
- Amazon Redshift: A columnar-oriented data warehouse similar to BigQuery.
- Snowflake: A cloud-agnostic data warehouse with a multi-cluster, shared-data architecture.
- Apache Hive: An open-source data warehouse built on Hadoop, providing a SQL-like query interface.
Columnar Data Formats
- Apache Parquet: A widely used columnar storage format offering efficient compression and encoding schemes.
- Apache Iceberg: An open table format for analytic tables, enabling SQL-based queries across engines like Spark, Flink, Hive, and Impala.
NoSQL Databases
- Apache Kudu: A columnar-oriented data store within the Hadoop ecosystem, optimized for OLAP workloads and utilizing a relational schema.
- Apache HBase: A non-relational, distributed database supporting wide-column storage, inspired by Google Bigtable.
Query Service
- Amazon Redshift Spectrum: Enables querying data directly in Amazon S3 without prior loading into Redshift, decoupling storage from processing and optimizing costs.
- Amazon Athena: An interactive query service that allows users to analyze data directly in Amazon S3 using standard SQL.
Competitor
- Apache Hudi: An open-source framework for managing data lakes, offering features like transactions, upserts, deletes, indexing, clustering, and compaction, positioning it as a direct competitor to Delta Lake.
Paper Review
While the core concept of the paper seems straightforward, the actual implementation likely presented significant engineering challenges. Given the problem statement, I initially anticipated a design where table metadata would be managed by a strongly consistent service.
My primary concern revolves around the use of a single log file. This centralized approach could potentially introduce a single point of contention, limiting concurrency and making optimistic concurrency control difficult or inefficient.
The paper devotes a considerable portion (two pages) to discussing business use cases for Delta Lake, specifically emphasizing the need for millions of partitions. While I acknowledge the potential for such scenarios, I remain somewhat skeptical about the practical frequency of these extreme use cases.