Pico

Posts

Paper Insights #32 - Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams

Presented at SIGMOD 2013, this paper from Google details another innovation stemming from Google Ads, a platform known for its planet-scale data processing. Notable authors include Ashish Gupta, a senior engineering leader within Google Ads, and Manpreet Singh, a principal engineer at Google. The year 2013 marked a significant period for stream processing, as Google was concurrently developing MillWheel and Dataflow, foundational technologies that influenced the creation of Apache Flink and Apache Beam.

Must Read:
Paper Insights - Apache Flink™: Stream and Batch Processing in a Single Engine
Paper Insights - The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

Let's begin with what a streaming join is.

Streaming Joins

Query engines enable users to query data from a variety of sources. If the retrieved data is in a relational format, it can be joined based on common keys. This...

Paper Insights #31 - F1 Query: Declarative Querying at Scale

We shift our focus from databases to a query engine. Google presented this paper at VLDB, the premier global database conference, in 2018. The paper has a long author list and is incredibly dense: with so many parts to cover, it gives only a high-level idea of each component.

Recommended Read:
Paper Insights - MapReduce: Simplified Data Processing on Large Clusters

Let's begin with some basic concepts.

Query Engine

SQL (Structured Query Language) is the standard formal language for data extraction and processing. A query engine, then, is the system that executes these SQL queries.

Query Engines vs. SQL Databases

SQL databases manage the storage of data in a relational format and handle transactions on that data. They inherently provide atomicity, concurrency control, durability, and transaction isolation. In contrast, a query engine's role is to extract a view of the data from a SQL database based on the user's formal query. In tra...

Paper Insights #30 - Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google

Napa represents the next generation of planet-scale data warehousing at Google, following Mesa. A key system for analytics workloads, Napa stores enormous datasets for various tenants within Google. The extensive authorship of the paper underscores the collaborative effort behind its creation. This paper was presented at VLDB 2021.

Let's begin with some basic concepts.

Data Warehouse

Data warehouses are SQL-compatible databases designed primarily for analytical data storage. Data warehouses, like SQL databases, offer data consistency. However, they typically aren't updated in real time. In fact, most data warehouses rely on ETL (Extract, Transform, Load) pipelines to feed data into them.

Consider this example: a data warehouse contains two relational tables:

HourlyRevenue: This table stores the revenue generated from orders received each hour. ETL pipelines periodically update this table by pulling data from various sources like logs and OLTP systems. These pip...
