Paper Insights #30 - Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google
Napa represents the next generation of planet-scale data warehousing at Google, following Mesa. A key system for analytics workloads, Napa stores enormous datasets for various tenants within Google. The extensive authorship of the paper underscores the collaborative effort behind its creation. This paper was presented at VLDB 2021.
Let's begin with some basic concepts.
Data Warehouse
Data warehouses are SQL-compatible databases designed primarily for analytical data storage. Data warehouses, like SQL databases, offer data consistency. However, they typically aren't updated in real-time. In fact, most data warehouses rely on ETL (Extract, Transform, Load) pipelines to feed data into them.
Consider this example: a data warehouse contains two relational tables:
- HourlyRevenue: This table stores the revenue generated from orders received each hour. ETL pipelines periodically update this table by pulling data from various sources like logs and OLTP systems. These pipelines ensure that duplicate updates are avoided. Importantly, because updates are periodic, the data warehouse might not reflect the very latest hour's revenue in real-time.
- DailyRevenue: Similar to HourlyRevenue, this table stores daily revenue figures. To maintain consistency, these daily numbers are essentially materialized views derived from the HourlyRevenue table. There are two main approaches to materializing these views:
- Real-time Materialization: Any update to the HourlyRevenue table immediately updates the DailyRevenue table, ensuring strong consistency between the two.
- ETL-based Materialization: A separate ETL pipeline aggregates the hourly revenue data for each day and updates the DailyRevenue table.
It's important to note that data warehouses are not designed for transactional workloads. Furthermore, they typically store numeric data suitable for aggregation rather than arbitrary data types like structs. For storing diverse data types, conventional OLTP SQL or NoSQL databases are more appropriate.
Data Model
Let's explore how data is represented and stored within a data warehouse. You can find more details in the Mesa paper by Google.
Data is organized into tables. The tables are not strictly relational; instead, the underlying model leans towards a key-value structure, where both the key and the value can encompass multiple columns. Crucially, value columns must be of an aggregatable type, typically numeric (such as integers or floats). Each value column is associated with an aggregation function F used to merge the values of duplicate keys, and for merging to work regardless of grouping or order, F must be:
- Associative: F(v0, F(v1, v2)) = F(F(v0, v1), v2)
- Commutative: F(v0, v1) = F(v1, v0)
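To make this concrete, here is a minimal Python sketch (illustrative only, not from the paper) showing why SUM qualifies as an aggregation function while a naive AVG would not:

```python
# SUM is the canonical aggregation function F: it is associative and
# commutative, so deltas can be merged in any grouping and any order.
def F(v0, v1):
    return v0 + v1

# Associative: F(v0, F(v1, v2)) == F(F(v0, v1), v2)
assert F(1, F(2, 3)) == F(F(1, 2), 3)

# Commutative: F(v0, v1) == F(v1, v0)
assert F(1, 2) == F(2, 1)

# MAX and MIN satisfy both properties too. A naive AVG does not
# (averaging averages gives wrong results), which is why averages
# are typically stored as separate (sum, count) columns.
```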
Differential Storage Scheme
The core principle behind analytical data storage is often to record data as a sequence of changes, or deltas.
For instance, imagine an ETL pipeline runs and captures the revenue for hour X. Let's say the first run records X -> $100. When the ETL pipeline runs again, it detects new orders (not present in the previous run) that contribute to the revenue for hour X. This second run now reports X -> $150.
Instead of simply overwriting the previous value, the data can be stored as a series of deltas:
X -> +$100, +$50
This method of storage is known as a differential storage scheme.
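A toy sketch of the idea (the key name follows the running example; nothing here is Napa's actual code):

```python
from collections import defaultdict

# Each key maps to an append-only list of deltas; the current value is
# the aggregation (SUM here) of all deltas recorded so far.
store = defaultdict(list)

store["X"].append(100)   # first ETL run:  X -> +$100
store["X"].append(50)    # second ETL run: X -> +$50

def read(key):
    return sum(store[key])  # collapse the deltas on read

assert read("X") == 150
```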
Deltas
A delta is essentially a file that contains keys along with the corresponding change in value for each column.
Each time an ETL pipeline runs to bring data into the data warehouse, it generates a new delta. This newly created delta is then applied to both the base table and any associated materialized views.
The entire analytical database is therefore built upon a sequence of these deltas. These deltas are often organized in a structure similar to a Log-Structured Merge (LSM) tree. This involves a hierarchy of delta files, where higher levels contain progressively larger deltas formed by merging smaller deltas from lower levels. The merging process for each key utilizes the aggregator functions defined earlier. The largest, merged deltas residing at the bottommost level are referred to as base deltas.
A significant advantage of this delta-based approach is the ability to execute a query against any prefix of the delta sequence. This will yield a consistent snapshot of the data as it existed at that specific point in time.
Each table independently maintains its data deltas. Consequently, within the database, every table has its own LSM tree structure. Collectively, these individual LSM trees constitute what can be described as a log-structured merge forest.
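A small sketch of prefix-snapshot reads over a delta sequence (illustrative, assuming SUM aggregation):

```python
# Each delta is a dict of key -> value change, in application order.
deltas = [
    {"X": 100},            # delta 0
    {"X": 50, "Y": 20},    # delta 1
    {"Y": 5},              # delta 2
]

def snapshot_read(key, as_of):
    """Value of `key` as it existed once delta `as_of` was applied."""
    return sum(d.get(key, 0) for d in deltas[: as_of + 1])

assert snapshot_read("X", as_of=0) == 100   # before the second ETL run
assert snapshot_read("X", as_of=2) == 150   # latest consistent state
```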
Physical Format
Delta files are typically organized in a columnar format. Within a delta file, the keys are segmented into blocks. Each of these blocks is then stored column by column, often employing compression techniques to reduce storage space.
Alongside each delta data file, an index file is maintained. This index provides pointers to the different blocks within the corresponding data file. Furthermore, to accelerate query performance, each data file block also stores the actual keys present within it.
The lookup process then involves:
- Performing a binary search on the index file to locate the relevant block.
- Conducting a binary search within the key list of the identified data file block.
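Here is a rough sketch of that two-level lookup (the block contents are made up; real delta files are compressed and columnar):

```python
import bisect

index = ["apple", "mango", "peach"]          # first key of each block
blocks = [
    (["apple", "kiwi"],  [1, 2]),            # (sorted keys, column values)
    (["mango", "melon"], [3, 4]),
    (["peach", "plum"],  [5, 6]),
]

def lookup(key):
    # Step 1: binary search the index file to locate the candidate block.
    b = bisect.bisect_right(index, key) - 1
    if b < 0:
        return None
    keys, values = blocks[b]
    # Step 2: binary search the key list stored inside that block.
    i = bisect.bisect_left(keys, key)
    return values[i] if i < len(keys) and keys[i] == key else None

assert lookup("melon") == 4
assert lookup("banana") is None
```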
Consistency
Applying incremental updates (deltas) to base tables and their associated materialized views can occur asynchronously. Tables might be updated before their corresponding roll-up views, and in sharded data warehouses replicated across clusters, updates can arrive at different shards independently and out of logical order.
Despite this potential for temporary inconsistencies, it's crucial that queries return consistent results. For instance, the total daily revenue calculated from an hourly table must always align with the revenue recorded in the daily summary table, even though the daily table is updated offline periodically.
A straightforward approach to ensure this consistency is to globally order all delta applications across the database. By restricting queries to the point where all relevant deltas have been applied across all shards, tables, and views, we can achieve consistency. This simplification is viable because data warehouses prioritize eventual consistency over real-time updates, which significantly eases their design compared to distributed SQL databases.
At any point in the sequence of delta applications, only the consistent points are queryable. An inconsistent point, where the daily rollup view has not yet received the updates already applied to the corresponding hourly table, cannot be queried.
Napa
Napa, Google's next-generation, large-scale data warehouse, builds upon the foundation of Mesa. Its architecture divides data processing into three distinct stages: ingestion, storage and view maintenance, and query serving.
This decoupling is strategically designed to optimize for a primary objective: achieving low-latency and consistent query performance.
The ingestion phase in Napa prioritizes high throughput, generating data deltas rapidly. However, the more unmerged deltas a query must consult, the more expensive it becomes, so deltas left unprocessed beyond a threshold significantly increase query costs.
Ultimately, Napa aims to strike a balance between three crucial factors:
- Data Freshness: How quickly updates become available for querying.
- Query Latency: The speed at which queries are executed and results are returned.
- Cost: The resources expended to ensure timely data updates.
Achieving all three simultaneously – high freshness, low query latency, and low cost – is inherently challenging. The trade-offs are as follows:
- High Freshness + Low Cost = High Query Latency: Prioritizing immediate data availability with limited resources leads to slower query execution.
- Low Cost + Low Query Latency = Low Freshness: Focusing on efficient querying with minimal expenditure necessitates delaying data updates.
- High Freshness + Low Query Latency = High Cost: Delivering real-time data with fast query responses requires significant resource investment.
Sacrificing data freshness allows for low-latency and cost-effective querying. Traditional data warehouses, including Mesa, exemplify this by batching updates and applying them periodically (e.g., hourly). This consolidation optimizes data for querying; for instance, merging all deltas for a table reduces the need to access multiple delta files during lookups. Consequently, updates are intentionally delayed to enhance query performance and minimize costs.
Conversely, prioritizing high ingestion rates can compromise query performance: when the ETL pipeline operates at a high frequency, queries must process and combine data from numerous deltas, which increases latency.
Queryable Timestamp
Napa makes this consistency point explicit through a per-table queryable timestamp (QT). The QT marks the most recent time up to which all deltas have been fully applied (and sufficiently optimized) across a table and its views. Queries are only allowed to read data up to the QT, so they always observe a consistent, well-compacted state; the gap between the current time and the QT is the table's data freshness.
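A minimal sketch of the idea, assuming each table and view tracks how far its delta sequence has been applied (the names and numbers are illustrative):

```python
# QT is the highest point up to which ingestion, view maintenance, and
# compaction have ALL caught up, across the base table and its views.
applied_frontier = {
    "HourlyRevenue":        17,  # deltas 0..17 fully applied
    "DailyRevenue (view)":  15,  # view maintenance lags slightly behind
}

def queryable_timestamp(frontiers):
    return min(frontiers.values())

qt = queryable_timestamp(applied_frontier)
assert qt == 15  # queries may read deltas 0..15 everywhere, consistently
```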
Architecture
Napa decouples the three stages into independently operating subsystems: an ingestion framework that accepts updates and commits them as deltas, a storage subsystem that incrementally maintains views and compacts deltas in the background, and a query-serving layer. Each subsystem can be scaled and tuned separately, which is what lets clients trade freshness, latency, and cost against one another.
Storage Structure
The underlying storage structure mirrors that of Mesa. Data for each table is organized as a collection of deltas, and these deltas across all tables in a database collectively form an LSM forest.
The primary storage format for these deltas is columnar, where retrieving each column necessitates a separate I/O operation. For smaller deltas, the PAX (Partition Attributes Across) layout is used instead: all columns of a block are stored within the same page, enabling an entire record to be read with a single I/O operation.
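A toy illustration of the layout difference (the schema is invented for the example):

```python
# One block of three records: (key, revenue, orders).
rows = [("a", 1, 10), ("b", 2, 20), ("c", 3, 30)]

# Pure columnar: each column lives in its own region of the file, so
# reassembling a whole record costs one I/O per column.
columnar = {
    "key":     ["a", "b", "c"],
    "revenue": [1, 2, 3],
    "orders":  [10, 20, 30],
}

# PAX: the same mini-columns are packed together inside a single
# block/page, so one I/O fetches every column of the block's records.
pax_page = (columnar["key"], columnar["revenue"], columnar["orders"])
```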
View Maintenance
When a delta is ingested for a base table, Napa must also produce corresponding deltas for each of the table's materialized views so that the views stay consistent with the base. Continuing the revenue example: suppose an ingested HourlyRevenue delta records +$100 for hour 09:00 and +$50 for hour 10:00 of the same day. View maintenance re-aggregates this delta by day, producing a DailyRevenue view delta of +$150 for that day, which is then applied to the view's own delta sequence.
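A sketch of deriving the view delta (continuing the invented revenue schema):

```python
from collections import defaultdict

# An ingested delta for the HourlyRevenue base table.
hourly_delta = {
    ("2021-07-01", "09:00"): 100,   # +$100 for hour 09:00
    ("2021-07-01", "10:00"): 50,    # +$50  for hour 10:00
}

def to_daily_view_delta(base_delta):
    daily = defaultdict(int)
    for (day, _hour), amount in base_delta.items():
        daily[day] += amount        # roll the hourly changes up to the day
    return dict(daily)

assert to_daily_view_delta(hourly_delta) == {"2021-07-01": 150}
```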
Compaction
In the background, Napa compacts many small deltas into fewer, larger ones (ultimately into base deltas). Since a query must open and merge every delta below its snapshot point, reducing the delta count directly reduces query latency; compaction thus spends background resources (cost) to keep queries fast.
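A compact sketch of the merge step, assuming sorted deltas and SUM aggregation (illustrative only):

```python
import heapq
from itertools import groupby

d0 = [("X", 100), ("Y", 20)]   # older delta (sorted by key)
d1 = [("X", 50), ("Z", 7)]     # newer delta (sorted by key)

def compact(*deltas):
    merged = heapq.merge(*deltas)                 # k-way merge by key
    return [(k, sum(v for _, v in group))         # aggregate per key
            for k, group in groupby(merged, key=lambda kv: kv[0])]

assert compact(d0, d1) == [("X", 150), ("Y", 20), ("Z", 7)]
```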
Query Serving
Queries read a consistent prefix of deltas, up to the table's queryable timestamp, and combine matching keys on the fly using the aggregation functions. Because view maintenance and compaction bound the number of deltas below the QT, lookup latency remains low and predictable.
Production Metrics
Paper Review
The paper presents a wealth of information, and I must admit that certain concepts are likely more accessible to database domain experts. As someone less familiar with the intricacies of this field, some sections were not entirely clear. Additionally, the production metrics, while present, felt somewhat less insightful than I had hoped for.
Despite these points, both this paper and its predecessor on Mesa offer a comprehensive introduction to the inner workings of data warehouses. For those seeking to deepen their understanding, I would also recommend exploring The Snowflake Elastic Data Warehouse. After grasping the concepts presented in these two papers, the Snowflake architecture should be relatively straightforward to comprehend.