Paper Insights #32 - Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google
Napa represents the next generation of planet-scale data warehousing at Google, following Mesa. Napa is a key system for analytics workloads that stores enormous datasets for various tenants within Google. The extensive authorship of the paper underscores the collaborative effort behind its creation. This paper was presented at VLDB 2021.
Let's begin with some basic concepts.
Data Warehouse
Data warehouses are SQL-compatible databases designed primarily for analytical data storage. Data warehouses, like SQL databases, offer data consistency. However, they typically aren't updated in real-time. In fact, most data warehouses rely on ETL (Extract, Transform, Load) pipelines to feed data into them.
Consider the following example of a data warehouse that contains two relational tables:
- HourlyRevenue: This table stores the revenue generated from orders received each hour. ETL pipelines periodically update this table by pulling data from various sources like logs and OLTP databases. These pipelines tail the sources in a way that avoids applying duplicate updates. Importantly, because updates are periodic, the data warehouse might not reflect the very latest hour's revenue in real time.
- DailyRevenue: Similar to HourlyRevenue, this table stores daily revenue figures. To maintain consistency, these daily numbers are essentially materialized views derived from the HourlyRevenue table. There are two main approaches to materializing these views:
- Real-time Materialization: Any update to the HourlyRevenue table immediately updates the DailyRevenue table, ensuring strong consistency between the two. This technique is rare as it can slow down the ingestion pipeline.
- ETL-based Materialization: A separate ETL pipeline aggregates the hourly revenue data for each day and updates the DailyRevenue table (see the sketch after this list).
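To make the ETL-based approach concrete, here is a minimal Python sketch. The table layout and the hourly figures are invented for illustration; a real pipeline would pull from logs or an OLTP database and write into the warehouse.

```python
from collections import defaultdict

# Hypothetical hourly fact data: "YYYY-MM-DDTHH" -> revenue for that hour.
hourly_revenue = {
    "2021-08-01T00": 100.0,
    "2021-08-01T01": 150.0,
    "2021-08-02T00": 80.0,
}

def materialize_daily(hourly):
    """ETL-style materialization: aggregate hourly rows into daily rows."""
    daily = defaultdict(float)
    for hour, revenue in hourly.items():
        day = hour.split("T")[0]   # truncate the hour to its day
        daily[day] += revenue
    return dict(daily)

print(materialize_daily(hourly_revenue))
# {'2021-08-01': 250.0, '2021-08-02': 80.0}
```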
Note that data warehouses are typically designed to store numeric data suitable for aggregation, rather than arbitrary data types such as structs. For storing diverse data types, other solutions such as SQL or NoSQL databases are more appropriate.
Data Model
Let's explore how data is represented and stored within a data warehouse. You can find more details in the Mesa paper by Google. Here, I am summarizing a few crucial points.
Data is organized into tables. The tables are not strictly relational; instead, the underlying model leans towards a key-value structure. Both the key and the value can span multiple columns, but crucially, value columns must be of an aggregatable type, typically numeric (such as integers or floats). Each value column is associated with an aggregation function F used to merge updates, and F must be:
- Associative: F(v0, F(v1, v2)) = F(F(v0, v1), v2)
- Commutative: F(v0, v1) = F(v1, v0)
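As a tiny illustration (my own, not from the paper), SUM over a numeric value column satisfies both properties, which is what lets updates be merged in any grouping and order:

```python
# SUM is a typical aggregation function F for a numeric value column.
def F(a, b):
    return a + b

v0, v1, v2 = 100, 50, 25
assert F(v0, F(v1, v2)) == F(F(v0, v1), v2)  # associative
assert F(v0, v1) == F(v1, v0)                # commutative
```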
Differential Storage Scheme
The core principle behind analytical data storage is often to record data as a sequence of changes, or deltas.
For instance, imagine an ETL pipeline runs and captures the revenue for hour X. Let's say the first run records X -> $100. When the ETL pipeline runs again, it detects new orders (not present in the previous run) that contribute to the revenue for hour X. This second run now reports X -> $150.
Instead of simply overwriting the previous value, the data can be stored as a series of deltas:
X -> +$100, +$50
This method of storage is known as a differential storage scheme.
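A minimal sketch of this scheme, assuming SUM is the aggregation function and using made-up numbers: instead of overwriting, each run appends a change, and the current value is obtained by folding the changes with the aggregator.

```python
from functools import reduce

# Changes recorded for hour X by successive ETL runs (illustrative values).
deltas_for_x = [100, 50]

def current_value(changes, F=lambda a, b: a + b):
    """Fold the per-key changes with the aggregation function F."""
    return reduce(F, changes)

print(current_value(deltas_for_x))  # 150
```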
Deltas
A delta is essentially a file that contains keys along with the corresponding change in value for each column.
Each time an ETL pipeline runs to bring data into the data warehouse, it generates a new delta. This newly created delta is then applied to both the base table and any associated materialized views.
The entire analytical database is therefore built upon a sequence of these deltas. These deltas are often organized in a structure similar to a Log-Structured Merge (LSM) tree. This involves a hierarchy of delta files, where higher levels contain progressively larger deltas formed by merging smaller deltas from lower levels. The merging process for each key utilizes the aggregator functions defined earlier. The largest, merged delta residing at the bottommost level is referred to as base delta/table.
A significant advantage of this delta-based approach is the ability to execute a query against any prefix of the delta sequence. This will yield a consistent snapshot of the data as it existed at that specific point in time.
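Here is a rough sketch of why a prefix of the delta sequence yields a consistent snapshot (my own construction, again assuming SUM as the aggregator): each delta file maps keys to changes, and a query simply merges every delta up to a chosen position.

```python
from collections import defaultdict

# A table as an ordered sequence of delta files (illustrative data).
delta_sequence = [
    {"X": 100, "Y": 20},   # delta 1
    {"X": 50},             # delta 2
    {"Y": 5, "Z": 7},      # delta 3
]

def snapshot(deltas, upto, F=lambda a, b: a + b):
    """Merge the first `upto` deltas: the table state as of that point in time."""
    state = defaultdict(int)   # assumes 0 is the identity element for F
    for delta in deltas[:upto]:
        for key, change in delta.items():
            state[key] = F(state[key], change)
    return dict(state)

print(snapshot(delta_sequence, upto=2))  # {'X': 150, 'Y': 20}
print(snapshot(delta_sequence, upto=3))  # {'X': 150, 'Y': 25, 'Z': 7}
```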
Each table independently maintains its data deltas. Consequently, within the database, every table has its own LSM tree structure. Collectively, these individual LSM trees constitute what can be described as a log-structured merge forest.
Physical Format
Delta files are typically organized in a columnar format. Within a delta file, the keys are segmented into blocks. Each of these blocks is then stored column by column, often employing compression techniques to reduce storage space.
Alongside each delta data file, an index file is maintained. This index provides pointers to the different blocks within the corresponding data file. Furthermore, to accelerate query performance, each data file block also stores the actual keys present within it.
The lookup process, sketched below, then involves:
- Performing a binary search on the index file to locate the relevant block.
- Conducting a binary search within the key list of the identified data file block.
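A rough sketch of this two-level lookup with made-up block contents: the index file holds the first key of each block, so one binary search picks the block and a second finds the key inside it.

```python
import bisect

# Illustrative delta file: keys split into sorted blocks.
blocks = [
    ["apple", "banana", "cherry"],
    ["date", "fig", "grape"],
    ["kiwi", "lemon", "mango"],
]
# Index file: the first key of each block.
index = [block[0] for block in blocks]

def lookup(key):
    # Step 1: binary search on the index to locate the candidate block.
    block_id = bisect.bisect_right(index, key) - 1
    if block_id < 0:
        return None
    # Step 2: binary search within the key list of that block.
    block = blocks[block_id]
    pos = bisect.bisect_left(block, key)
    if pos < len(block) and block[pos] == key:
        return (block_id, pos)   # where the key's row lives
    return None

print(lookup("fig"))     # (1, 1)
print(lookup("orange"))  # None -- key not present
```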
Consistency
Applying incremental updates (deltas) to base tables and their associated materialized views can occur asynchronously. Tables might be updated before their corresponding roll-up views. Additionally, in sharded data warehouses replicated across clusters, updates can arrive at different shards independently and out of logical order.
Despite this potential for temporary inconsistencies, it's crucial that queries return consistent results. For instance, the total daily revenue calculated from an hourly table must always align with the revenue recorded in the daily summary table, even though the daily table is updated offline periodically.
A straightforward approach to ensure this consistency is to restrict queries to the point where all relevant deltas have been applied across all shards, tables, and views. This simplification is viable because data warehouses do not need real-time updates, which significantly eases their design compared to distributed SQL databases.
In the example above, the consistent data points are queryable. Conversely, the inconsistent point, lacking the necessary updates for the corresponding daily table view, cannot be queried.
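A crude sketch of this restriction (my own simplification of what the paper later formalizes as the queryable timestamp): if every table, view, and shard tracks the highest delta sequence number it has applied, queries read only up to the minimum of those numbers, so they always see mutually consistent data.

```python
# Highest delta sequence number applied so far, per (cluster, table/view).
# The numbers are illustrative.
applied = {
    ("cluster-a", "HourlyRevenue"): 12,
    ("cluster-a", "DailyRevenue"):  10,   # the view lags behind its base table
    ("cluster-b", "HourlyRevenue"): 11,
    ("cluster-b", "DailyRevenue"):  11,
}

# Queries may only read up to the frontier that every participant has reached.
queryable_point = min(applied.values())
print(queryable_point)  # 10 -- a read at this point sees consistent data everywhere
```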
Napa
Napa, Google's next-generation, large-scale data warehouse, builds upon the foundation of Mesa. Its architecture divides data processing into distinct stages: ingestion, storage and view maintenance (including compaction), and query serving.
This decoupling is strategically designed to optimize for a primary objective: achieving low-latency and consistent query performance.
The ingestion phase in Napa prioritizes high throughput, generating data deltas rapidly. However, leaving deltas unmerged is inefficient and significantly increases query costs.
Ultimately, Napa aims to strike a balance between three crucial factors:
- Data Freshness: How quickly updates become available for querying.
- Query Latency: The speed at which queries are executed and results are returned.
- Cost: The resources expended to ensure timely data updates.
Achieving all three simultaneously – high freshness, low query latency, and low cost – is inherently challenging. The trade-offs are as follows:
- High Freshness + Low Cost = High Query Latency: Prioritizing immediate data availability with limited resources leads to slower query execution.
- Low Cost + Low Query Latency = Low Freshness: Focusing on efficient querying with minimal expenditure necessitates delaying data updates.
- High Freshness + Low Query Latency = High Cost: Delivering real-time data with fast query responses requires significant resource investment.
Sacrificing data freshness allows for low-latency and cost-effective querying. Traditional data warehouses, including Mesa, exemplify this by batching updates and applying them periodically (e.g., hourly). This consolidation optimizes data for querying; for instance, merging all deltas for a table reduces the need to access multiple delta files during lookups. Consequently, updates are intentionally delayed to enhance query performance and minimize costs.
Conversely, prioritizing high ingestion rates can compromise query performance. When the ETL pipeline operates at a high frequency, queries need to process and combine data from numerous deltas, which increases latency.
Queryable Timestamp
Architecture
Storage Structure
The underlying storage structure mirrors that of Mesa. Data for each table is organized as a collection of deltas, and these deltas across all tables in a database collectively form an LSM forest.
The primary storage format for these deltas is columnar, where retrieving each column necessitates a separate I/O operation. For smaller deltas, an optimized format called PAX (Partition Attributes Across) is utilized, enabling all records to be read with a single I/O operation.
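A toy illustration of the difference (my own, where an "I/O" is simply a read of one contiguous region): a purely columnar layout stores each column as a separate region, so reconstructing full records touches one region per column, whereas a PAX-style block packs all columns of a record group into a single region that one read can fetch.

```python
records = [("X", 100, 3), ("Y", 20, 1), ("Z", 7, 2)]  # (key, revenue, orders)

# Columnar layout: one contiguous region per column.
columnar = {
    "key":     [r[0] for r in records],
    "revenue": [r[1] for r in records],
    "orders":  [r[2] for r in records],
}
reads_columnar = len(columnar)  # reading full records touches every column region

# PAX-style block: the same column-by-column data for this record group,
# but packed into one contiguous region.
pax_block = [columnar["key"], columnar["revenue"], columnar["orders"]]
reads_pax = 1  # the whole block is fetched with a single read

print(reads_columnar, reads_pax)  # 3 1
```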
View Maintenance
Data Skew
View Sort Order
- Prefix Key: If the view's key is a prefix of the base table's key (e.g., base key <A, B, C, D>, view key <A, B>), no sorting is necessary.
- Partial Prefix Key: If the view's key is a partial prefix (e.g., base key <A, B, C, D>, view key <A, B, D>), the non-prefix components of the view's key (i.e., D) need to be sorted before reduction. However, this scenario might not introduce significant sorting requirements.
- Non-Prefix Key: If the view's key shares no prefix with the base table's key (e.g., base key <A, B, C, D>, view key <D, C, A>), substantial sorting is required (see the sketch after this list).
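A small sketch of why this matters (keys and values are invented): base rows arrive sorted by the base key, and building a view means projecting onto the view key, sorting if the projected keys are no longer in order, and then reducing adjacent rows.

```python
from itertools import groupby

# Base table rows, already sorted by the base key <A, B, C, D>.
rows = [
    ((1, 1, 1, 1), 10),
    ((1, 1, 2, 5), 20),
    ((1, 2, 1, 3), 30),
    ((2, 1, 1, 5), 40),
]

def materialize_view(rows, projection):
    """Project the base key onto the view key, sort, then reduce with SUM."""
    projected = [(tuple(key[i] for i in projection), value) for key, value in rows]
    # When `projection` keeps a prefix of the base key order, `projected` is
    # already sorted and this step is cheap; otherwise it is the expensive part.
    projected.sort(key=lambda kv: kv[0])
    return {k: sum(v for _, v in group)
            for k, group in groupby(projected, key=lambda kv: kv[0])}

print(materialize_view(rows, projection=(0, 1)))     # prefix key <A, B>: no real sorting needed
print(materialize_view(rows, projection=(3, 2, 0)))  # non-prefix key <D, C, A>: full sort required
```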
Compaction
Query Serving
View-Based Querying
Filter Pushdown
B-tree Indexing for Delta Pruning
Optimized QT Lookup
Distributed Read-Through Cache
Prefetching Cache Layer
- Offline Prefetching: For frequently queried tables, blocks are proactively fetched into the cache as soon as the QT advances.
- Online Prefetching: If a shadow query execution has already fetched relevant data blocks, the main query executor can leverage this cached data. (The paper's description of shadow query execution is light on detail.)
Lazy Merging Across Deltas
Hybrid File Format (Columnar and PAX)
Read Hedging
Production Metrics
- Query latency decreases with an increasing number of materialized views. This is anticipated, as materialized views enable queries to avoid costly operations like joins or sorts on frequently accessed data paths.
- Query latency increases with a greater number of deltas involved in a query. This is attributed to the increased I/O and merging operations required by the delta servers.
- When the view maintenance pipeline experienced issues, query latency remained unaffected, although data freshness was compromised. This stability was due to the use of QT for query execution.
- Napa finds application across various domains, including:
- Internal experimentation and analysis clients, which prioritize query performance and low cost over strict data freshness.
- Another application type that prioritizes lower resource costs, accepting higher query latency as a trade-off.
- External, user-facing applications that prioritize both high data freshness and high query performance, incurring higher costs to achieve this.
Paper Review
The paper presents a wealth of information, and I must admit that certain concepts are likely more accessible to database domain experts. As someone less familiar with the intricacies of this field, some sections were not entirely clear. Additionally, the production metrics, while present, felt somewhat less insightful than I had hoped for.
Despite these points, both this paper and its predecessor on Mesa offer a comprehensive introduction to the inner workings of data warehouses. For those seeking to deepen their understanding, I would also recommend exploring The Snowflake Elastic Data Warehouse. After grasping the concepts presented in these two papers, the Snowflake architecture should be relatively straightforward to comprehend.