Paper Insights #12 - The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

This influential paper from Google, presented at VLDB 2015, is a landmark in Data Engineering. Authored by Tyler Akidau, a distinguished engineer at Snowflake and former Google employee, it explores groundbreaking concepts. Akidau's work on MillWheel significantly influenced Apache Flink, while his Dataflow Model laid the groundwork for Apache Beam. Notably, Google Cloud Dataflow is a managed runner for Apache Beam pipelines.

Paper Link

For a deeper understanding, I recommend a valuable YouTube playlist that effectively explains the core ideas presented in this paper.

Motivating Example

This paper focuses on streaming systems. For a better understanding of the context, I recommend reviewing my previous post, Apache Flink™: Stream and Batch Processing in a Single Engine, which explains the key differences between batch and stream processing.

In a streaming system, events arrive continuously and are processed in an ongoing manner. The core concept of this paper is to process these events based on their "event time" – the actual time when the event occurred.

Challenges of Event Time Processing

Processing events based on their event time often leads to out-of-order processing. This is because events may not arrive at the processing system in the same order as they were generated.

Example:

Consider a stream of logs generated by end-user devices. To process these logs in batches, we can define a window (e.g., 2 minutes). All events arriving within this window are collected and processed together. However, we have multiple notions of time:

  • Processing Time: The time when a log reaches the processing system.

  • Event Time: The actual time when the log was generated.

For example, if a log was generated at 9:58 AM and reached the system at 9:59 AM, its event time is 9:58 AM and its processing time is 9:59 AM.

Processing events based on processing-time windows is relatively simple. Systems like Spark Streaming operate on processing-time semantics, collecting events into batches based on their arrival time.

Processing events based on event-time windows is complex. With billions of devices generating logs, it's difficult to determine when all events within a specific event time window will arrive at the processing system. For instance, if we want to process all events generated before 10:00 AM, we need to know when all events from that timeframe have been received.
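
To make the distinction concrete, here is a minimal Java sketch (not from the paper; the LogEvent record, its field names, and the timestamps are hypothetical) that assigns a single log entry to a 2-minute window under both notions of time. A log generated at 9:59 AM but arriving at 10:01 AM lands in the 9:58–10:00 event-time window but in the 10:00–10:02 processing-time window.

```java
import java.time.Duration;
import java.time.Instant;

// Minimal sketch (hypothetical names): the same event falls into different
// 2-minute windows depending on whether we window by event time or by
// processing (arrival) time.
public class TimeDomains {

    record LogEvent(String payload, Instant eventTime, Instant processingTime) {}

    // Truncate a timestamp down to the start of its fixed-size window.
    static Instant windowStart(Instant ts, Duration windowSize) {
        long sizeMillis = windowSize.toMillis();
        return Instant.ofEpochMilli((ts.toEpochMilli() / sizeMillis) * sizeMillis);
    }

    public static void main(String[] args) {
        Duration window = Duration.ofMinutes(2);
        // Generated at 9:59, but only received by the system at 10:01.
        LogEvent log = new LogEvent("click",
                Instant.parse("2024-01-01T09:59:00Z"),
                Instant.parse("2024-01-01T10:01:00Z"));

        System.out.println("event-time window starts at:      " + windowStart(log.eventTime(), window));
        System.out.println("processing-time window starts at: " + windowStart(log.processingTime(), window));
    }
}
```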

Watermarks

To address this challenge, the concept of "watermarks" is introduced. A watermark is a control message sent to the system to indicate that all events generated before a certain time are likely to have arrived. For example, a watermark at 10:01 AM may suggest that all events generated before 10:00 AM have probably been received, allowing the system to proceed with processing the batch.

Limitations of Watermarks

Even with watermarks, there's still a possibility of "late" events – events that arrive after their event-time window has already been closed and processed. For instance, a log generated at 9:59 AM might arrive at the system at 10:05 AM. Since the watermark has already advanced past 10:00 AM and the window has been processed, the system is forced to drop this late event.

Apache Flink also makes use of watermarks and, by default, has a similar limitation.

The Dataflow Model

To address the issue of late events, the Dataflow model provides mechanisms to handle them appropriately. This ensures that the impact of late events is accurately reflected in the system's processing, even though they arrive after their designated batch has been processed.

Dataflow

The Dataflow model represents a novel approach to stream processing, offering a unique set of advantages:

  • Event Time Processing: Processes events based on their actual occurrence time, not just their arrival time.
  • Fault Tolerance: Resilient to system failures, ensuring data integrity and continuous processing.
  • Eventual Correctness: A novel idea that allows for controlled trade-offs between processing latency and accuracy.
  • Scalability: Designed to handle massive volumes of data with high throughput.
  • Low Latency: Achieves minimal processing delays for time-sensitive applications.

The core of the Dataflow model lies in its ability to flexibly adjust the balance between latency and correctness.

Event Time vs. Processing Time

Event Time: The time at which the event actually occurred.
Processing Time: The time at which the event is processed by the system.

Ideally, processing time should perfectly align with event time, meaning events are processed immediately upon occurrence. However, in reality, events may arrive at the processing system with a delay.

Challenges of Out-of-Order and Late Events

Delays cause discrepancies between the two time domains. Late events, those arriving significantly after they occurred, complicate the situation further, making it challenging to maintain the correct processing order.

In the graph above, an event appearing late can result in out-of-order processing. Both Spark and Flink ignore such an event as "late", leading to incorrect results.

For most real-world applications, it's common to encounter out-of-order, late events. Effectively handling these situations is critical for ensuring the accuracy and reliability of the stream processing system.

Watermarks

Watermarks are mechanisms (either internal or external to the system) that signal the system's knowledge of event arrival. Watermarks always correspond to event time. They represent specific points in the event time space, indicating that all events occurring before that point are believed to have arrived at the system.

It's important to note that watermarks may not always be accurate reflections of event arrival. Events can still arrive late, leading to out-of-order arrivals. Despite this, the watermark itself must always monotonically increase, ensuring that the system's understanding of event time only moves forward.

The watermark line progresses monotonically as events arrive. Events that arrive after their corresponding watermark is established are considered "late" events.
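
As a rough illustration (this is not how any particular engine implements it), the sketch below tracks a watermark that only moves forward, derived heuristically as the maximum event time seen so far minus a fixed allowed delay, and flags events that arrive behind it as late. The class name, the heuristic, and the one-minute delay are assumptions made for this example.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative watermark tracker (not any engine's actual implementation).
// The watermark is derived as "latest event time seen so far minus a fixed
// allowed delay", and it only ever moves forward.
public class WatermarkTracker {

    private Instant watermark = Instant.EPOCH;   // start far in the past
    private final Duration allowedDelay;

    public WatermarkTracker(Duration allowedDelay) {
        this.allowedDelay = allowedDelay;
    }

    /** Observes an event and returns true if it arrived behind the watermark, i.e. late. */
    public boolean observe(Instant eventTime) {
        boolean late = eventTime.isBefore(watermark);
        Instant candidate = eventTime.minus(allowedDelay);
        if (candidate.isAfter(watermark)) {
            watermark = candidate;               // monotonically increasing
        }
        return late;
    }

    public Instant currentWatermark() {
        return watermark;
    }

    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker(Duration.ofMinutes(1));
        tracker.observe(Instant.parse("2024-01-01T10:05:00Z"));           // watermark -> 10:04
        boolean late = tracker.observe(Instant.parse("2024-01-01T09:59:00Z"));
        System.out.println("watermark: " + tracker.currentWatermark() + ", late event: " + late);
    }
}
```

With a one-minute allowed delay, an event stamped 10:05 advances the watermark to 10:04, and any event stamped before 10:04 that arrives afterwards is reported as late.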

Windows and Window Panes

Windows are defined as segments or regions within the event time space. They can be:

  • Overlapping: Windows can partially overlap with each other.
  • Non-overlapping: Windows have no overlap.
  • Aligned: All windows cover the same event time bounds across all event keys.
  • Unaligned: Window event time bounds may vary across different event keys.

Types of Windows

  • Fixed Windows (Tumbling Windows): Non-overlapping windows with a fixed duration.
  • Sliding Windows: Overlapping windows with a fixed duration, sliding forward at regular intervals.
  • Session Windows: Capture periods of activity for a particular subset of the data (e.g., per key), separated by gaps of inactivity; these windows are unaligned.

The paper provides a detailed algorithm for creating windows from event streams in section 2.2. I will not delve into the specifics here, as the paper itself offers a clear explanation with illustrative examples.
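
For readers who want a feel for it anyway, here is a rough sketch in the spirit of that algorithm (not a transcription of the paper's pseudocode): assign an event timestamp to every sliding window that contains it. With the period equal to the size, it degenerates to fixed (tumbling) windows; session windows need a different, merge-based approach not shown here.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Rough sketch of sliding-window assignment: an event timestamp is assigned
// to every window whose [start, start + size) range contains it.
public class SlidingWindowAssign {

    record Window(Instant start, Instant end) {}

    static List<Window> assignWindows(Instant eventTime, Duration size, Duration period) {
        List<Window> windows = new ArrayList<>();
        long ts = eventTime.toEpochMilli();
        long sizeMs = size.toMillis();
        long periodMs = period.toMillis();
        // Start of the most recent window boundary at or before the timestamp.
        long lastStart = ts - (ts % periodMs);
        for (long start = lastStart; start > ts - sizeMs; start -= periodMs) {
            windows.add(new Window(Instant.ofEpochMilli(start),
                                   Instant.ofEpochMilli(start + sizeMs)));
        }
        return windows;
    }

    public static void main(String[] args) {
        // A 2-minute window sliding every 1 minute: each event lands in 2 windows.
        assignWindows(Instant.parse("2024-01-01T09:59:30Z"),
                      Duration.ofMinutes(2), Duration.ofMinutes(1))
            .forEach(w -> System.out.println(w.start() + " .. " + w.end()));
    }
}
```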

Window Panes and Processing

A window of events in event time space may be processed multiple times due to the arrival of late events. Each processing instance of a window is referred to as a "pane".

Triggering Window Pane Processing (Vertical Axis Bounds)

To initiate the processing of a window pane, specific triggers are required:

  • Processing Time Intervals: Trigger processing at regular intervals (e.g., every minute).
  • Number of Events: Trigger processing once a certain number of events have arrived.

Triggering Window Processing (Horizontal Axis Bounds)

To determine when a window is complete in event time, the event time space is divided into intervals, and watermarks are used to track the progress of event time and signal when a specific point in event time has likely been passed.

The processing machine knows the processing time from its system clock; however, it lacks direct knowledge of event-time progression. Watermarks are therefore crucial for tracking the advancement of event time within the system.

Triggering the Rectangle

By combining triggers along both the vertical (processing time) and horizontal (event time) axes, the system can effectively determine when a rectangular region of the window is ready for processing.

The paper provides great examples in section 2.4, along with API usage, to explain how the windowing mechanism works for different window types.
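
The paper's examples use Google's (pre-Beam) Java API; the snippet below sketches the same ideas with the open-source Apache Beam Java SDK, the closest publicly available equivalent. The method names and the input shape (a keyed PCollection of counts) are Beam-specific assumptions, not the paper's API: the watermark trigger bounds the window along the event-time axis, while the early/late firings bound it along the processing-time axis.

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowedSums {

    // Per-key sums over 2-minute event-time windows, triggered along both axes.
    static PCollection<KV<String, Long>> windowedSums(PCollection<KV<String, Long>> input) {
        return input
            .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(2)))
                // Event-time bound: the main firing happens when the watermark
                // passes the end of the window.
                .triggering(AfterWatermark.pastEndOfWindow()
                    // Processing-time bound: speculative (early) panes every minute.
                    .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(1)))
                    // Late panes: fire again for every element that arrives late.
                    .withLateFirings(AfterPane.elementCountAtLeast(1)))
                // How long to keep window state around so late data can still be handled.
                .withAllowedLateness(Duration.standardMinutes(30))
                // Each pane re-emits the full accumulated result (see the next section).
                .accumulatingFiredPanes())
            .apply(Sum.longsPerKey());
    }
}
```

On a real runner, this would be preceded by a source that assigns event timestamps and reports a watermark, such as a Pub/Sub or Kafka read.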

Incremental Processing

Since the same window may have multiple panes due to late events, the system needs an efficient mechanism to handle these incremental updates. Three common approaches are listed below, with a small sketch after the list showing what a consumer sees under each:

  • Discarding: Once a pane is processed, its contents are discarded; each new pane processes only the events that arrived since the previous firing. This is the most resource-efficient method.
  • Accumulating: All events seen so far for the window are retained, so each new pane reprocesses the full accumulated contents. This approach is suitable when the consumer of the results simply overwrites the previous value with the latest one.
  • Accumulating and Retraction: This two-step process first retracts (reverses the effect of) the previously emitted pane and then emits the result of the new pane, which includes all accumulated events for the window. This method is powerful but requires that the consumers of the processed results can effectively reverse the impact of previous processing, which is not feasible if those results have already been committed as part of a transaction.
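
The following toy sketch (the numbers and class name are made up for illustration) shows what a downstream consumer would see for a per-window sum under each mode, when an on-time pane of three events is followed by a late pane of two more.

```java
import java.util.List;

// Toy illustration: what a consumer sees for a per-window sum when a pane of
// three on-time events (value 1 each) is followed by a pane of two late events.
public class PaneModes {
    public static void main(String[] args) {
        List<Integer> firstPane = List.of(1, 1, 1);   // on-time events
        List<Integer> latePane = List.of(1, 1);       // late events for the same window

        int first = firstPane.stream().mapToInt(Integer::intValue).sum();   // 3
        int late = latePane.stream().mapToInt(Integer::intValue).sum();     // 2

        // Discarding: each pane emits only its own events.
        System.out.println("discarding:   " + first + ", then " + late);            // 3, then 2
        // Accumulating: the second pane re-emits the full window; the consumer overwrites.
        System.out.println("accumulating: " + first + ", then " + (first + late));  // 3, then 5
        // Accumulating & retracting: retract the old pane, then emit the new total.
        System.out.println("retracting:   " + first + ", then -" + first + " and " + (first + late));
    }
}
```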

Related Systems

Google Dataflow

The concepts presented in this paper have been materialized in Google Cloud Dataflow, a powerful stream and batch processing service built upon the foundations of FlumeJava and MillWheel.

  • FlumeJava (2010): An early batch processing system built on the MapReduce framework. It provides a set of APIs, specifically immutable parallel collections, for writing MapReduce programs. FlumeJava acts as a compiler framework, optimizing user-defined MapReduce applications.
  • MillWheel (2013): A low-latency stream processing engine where user-defined logic is represented as a directed graph. Each node in the graph represents a processing step. MillWheel guarantees exactly-once processing semantics.
  • Dataflow (2013): A unified framework that seamlessly integrates batch processing (leveraging MapReduce) and stream processing (utilizing MillWheel). It offers a simplified programming model and high-level abstractions. The Dataflow programming model and SDK were open-sourced as Apache Beam, and the execution engine is available as a managed service on GCP as Cloud Dataflow.

Other Google Stream Processing Technologies

Google also developed Photon, a stream processing engine that evolved from Ubiq. A key distinction of Photon is its multi-homed architecture, where jobs are replicated globally while maintaining exactly-once semantics through PaxosDB-based transactions.


Google Query Engines

Google has a suite of powerful query engines that make use of data processing systems under the hood:

  • Dremel: For interactive query processing. It is the engine behind Google's BigQuery, executing queries directly over columnar storage with a multi-level serving tree rather than MapReduce.
  • F1 Query: Google's most advanced query engine, which runs interactive queries on its own distributed execution engine and offloads large batch queries onto the MapReduce framework.

Key Technologies Outside Google

  • Hadoop: A foundational framework for distributed storage and processing of large datasets.
  • Pig: A high-level language for expressing dataflow programs that execute on Hadoop. (Relationship: Pig to Hadoop :: FlumeJava to MapReduce)
  • Hive: Built on top of Hadoop, enabling SQL-like data querying and analysis.
  • Spark: A fast and general-purpose cluster computing system that extends the MapReduce model.
    • Spark Streaming: An extension of Spark for stream processing, primarily focusing on micro-batching.
  • Apache Storm: A distributed stream processing platform that provides at-least-once processing guarantees.

Paper Review

This paper does not introduce a novel system architecture for data processing. Instead, it leverages existing systems and introduces a new concept for achieving a balance between correctness and latency in data processing.

I initially found the paper's concepts challenging to grasp. The most difficult aspect was comprehending the problem statement. However, once I fully understood the problem, the rest of the paper became much easier to follow. I highly recommend this paper to anyone interested in Big Data.
