
Paper Insights #11 - Apache Flink™: Stream and Batch Processing in a Single Engine

This paper was presented at the IEEE International Conference on Cloud Engineering Workshop in 2015. Workshops, generally considered less prestigious than the main conference itself, provide a forum for more focused discussions and specialized research. The authors of the paper acknowledge that the work presented is not solely their own, demonstrating commendable modesty in recognizing the contributions of others.

Batch and Stream Processing

Data processing involves performing operations on collections of data, often represented as records. These records can reside in various locations, such as database tables or files. Two primary approaches to data processing are:

Batch Processing

Processes data in discrete batches or groups. Batch processing is built from a few primitive operations:

  • Map: Applies a function to each individual record within a batch, producing an output for each input.
  • Reduce: Aggregates the results of the map operation, often by combining values associated with the same key.
  • Shuffle: (Optional) Redistributes data between processing stages, typically between the map and reduce phases, so that all records sharing the same key end up at the same reducer.

Example

Counting word frequencies across multiple documents:
  • Map: Count occurrences of each word within each document.
  • Shuffle: Group all occurrences of the same word from different documents.
  • Reduce: Sum the counts for each word across all documents.
Batch processing is suitable for tasks that can be processed periodically, such as batch updates, nightly reports, and data warehousing.
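
To make the three stages concrete, here is a minimal, self-contained sketch of the word-count example in plain Java (not tied to any particular engine); the input documents are made up, and `groupingBy`/`counting` play the roles of the shuffle and reduce stages:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BatchWordCount {
    public static void main(String[] args) {
        // A small "batch" of documents (hypothetical input).
        List<String> documents = Arrays.asList(
                "the quick brown fox",
                "the lazy dog",
                "the quick dog");

        Map<String, Long> counts = documents.stream()
                // Map: split each document into words.
                .flatMap(doc -> Arrays.stream(doc.split("\\s+")))
                // Shuffle + Reduce: group identical words, then sum occurrences per word.
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        counts.forEach((word, count) -> System.out.println(word + " -> " + count));
    }
}
```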

Stream Processing

Processes data continuously as it arrives, one record at a time. Its key characteristics are:

  • Low latency: Processes data with minimal delay.
  • Real-time analysis: Enables immediate insights and responses to events.
Stream processing is ideal for real-time applications like fraud detection, stock market analysis, and IoT data processing.

Examples of Big Data Processing Systems

Early systems, such as FlumeJava (2010) built upon Google's MapReduce framework, primarily focused on batch processing. Apache Spark, another prominent early player, also initially centered around batch processing capabilities.

Around 2013, the landscape shifted towards stream processing. Spark Streaming emerged, addressing stream processing by dividing the incoming data stream into short intervals and processing each interval as a mini-batch. While effective, this approach retains some characteristics of batch processing. 

Google's MillWheel, on the other hand, pioneered true stream processing by enabling continuous, real-time processing of data with features like stateful operations (e.g., incrementing a counter for each incoming record).

The terms "batch-native" and "stream-native" are often used to categorize systems. Batch-native systems, like early versions of Spark, were initially designed for batch processing and later extended to support stream processing. Conversely, stream-native systems, such as Apache Flink, were inherently designed for stream processing and subsequently adapted to handle batch processing as a special case.

Apache Flink, drawing significant inspiration from MillWheel, exemplifies a stream-native approach. Its core architecture is fundamentally designed for stream processing, with batch processing effectively treated as a specialized form of stream processing.

Data Flow Graph

A data flow graph is a directed acyclic graph (DAG) consisting of stateful operators.

The following example illustrates data flow in a stream processing system by extending the word-frequency count scenario to real-time processing:

  • Counter: This is a stateless operator. It receives a document as input and produces a map of words to their frequencies within that document.

  • Shuffle: Another stateless operator, it routes individual words from the word-frequency maps to the corresponding "Adder" operators based on the word's alphabetical range, ensuring that all occurrences of a given word reach the same Adder.

  • Adder: A stateful operator responsible for maintaining the current count of each word across all processed documents.

    • It receives word-frequency maps from the "Shuffle" operator.
    • Updates its internal state by adding the frequencies of incoming words to the existing counts.
    • Can be sharded (e.g., one Adder for words A-M, another for N-Z) to improve performance and scalability.
    • The current state of the Adders (word counts) can be visualized on an analytics dashboard to provide real-time insights.

This example demonstrates a basic data flow in a stream processing system. However, real-world systems incorporate a much wider range of operators, enabling complex data transformations and analyses.
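
As a rough illustration, the Counter/Shuffle/Adder pipeline above corresponds closely to Flink's classic streaming word count. The sketch below assumes Flink's DataStream API (Flink 1.x; exact signatures may vary across versions), and the socket source and port are hypothetical placeholders:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // "Counter": a stateless flatMap emitting (word, 1) pairs for each document.
        // "Shuffle": keyBy routes all pairs for the same word to the same task.
        // "Adder": a stateful rolling sum maintained per word.
        env.socketTextStream("localhost", 9999)  // hypothetical document source
           .flatMap((String doc, Collector<Tuple2<String, Integer>> out) -> {
               for (String word : doc.split("\\s+")) {
                   out.collect(Tuple2.of(word, 1));
               }
           })
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("Streaming word count");
    }
}
```

Here keyBy plays the role of the Shuffle operator, while the rolling sum keeps per-word state like the Adder and can be sharded across parallel tasks.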

Pipelined Operators

In stream-native processing systems, operators must be designed to support continuous data flow, enabling efficient pipelining of events. Key examples of pipelined operators include:

  • Window:

    • Collects a stream of events within a defined time window (e.g., sliding window).
    • Pipelined because the collection of subsequent windows can proceed concurrently, without waiting for the processing of previous windows to complete.
    • In systems like Flink, batch processing can be emulated by applying a window operator to collect all events within a specific batch interval.
  • Map:

    • Applies a function to each incoming event independently.
    • Highly pipelined as each event can be processed concurrently without relying on other events.
  • Group By (Shuffle):

    • Redistributes events based on a key.
    • Pipelined because events are continuously forwarded to the appropriate downstream operator (e.g., reducer) responsible for the corresponding key.
  • Reduce:

    • Aggregates incoming events.
    • Supports "rolling reduce" for continuous aggregation, requiring an associative function (e.g., sum, count).
    • The output represents the ongoing reduction of all events seen thus far.
  • Interval Join:

    • The only type of join that can be effectively pipelined.
    • Joins events within a specific time interval.
    • Essentially equivalent to applying a window operator followed by a join on the accumulated events within that window.
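
As a sketch of the rolling reduce described above (assuming Flink's DataStream API; the element values are made up), keyBy acts as the shuffle and reduce folds each incoming event into the running aggregate, emitting an updated result immediately:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RollingReduce {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(
                Tuple2.of("clicks", 3), Tuple2.of("views", 10), Tuple2.of("clicks", 2))
           .keyBy(t -> t.f0)
           // Rolling reduce: addition is associative, so each incoming event can be
           // combined with the running aggregate and emitted without waiting (pipelined).
           .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
           .print();

        env.execute("Rolling reduce");
    }
}
```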

Non-pipelined Operators

Certain operators are inherently non-pipelined, meaning they cannot process data continuously and require access to the entire dataset before producing results.

Example: A sort-merge join that requires sorting the entire dataset before performing the join operation is a classic example of a non-pipelined operation.

Data Exchange in Flink

Operators in Flink communicate by exchanging data over streams, which come in two forms:

  • Pipelined Streams: Carry continuous, uninterrupted data flow from non-window operators (e.g., Counter, Shuffle); ideal for stream processing.
  • Blocking Streams: Generated by window operators, requiring the entire window to be collected before processing, typically used in batch processing.
Operators exchange data through serialization and buffering (accumulating data before transmission). Buffers are periodically flushed to the consumer, improving throughput but potentially introducing latency.
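
Flink exposes this throughput/latency trade-off directly through the buffer timeout. A minimal sketch, assuming the documented behavior of StreamExecutionEnvironment.setBufferTimeout:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BufferTimeoutConfig {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Flush partially filled network buffers at most every 5 ms.
        // A timeout of 0 flushes after every record (lowest latency);
        // -1 flushes only when a buffer is full (highest throughput).
        env.setBufferTimeout(5);
    }
}
```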

Control Events in Flink

In addition to data records, control messages can flow through the data flow graph, enabling the triggering of specific conditions or actions within the processing pipeline.

Watermarks

Stream processing involves two notions of time:

  • Processing Time: The time at which an event is processed by the system.
  • Event Time: The time at which the event occurred in the real world.

Windowing operations rely heavily on these time concepts.

  • Processing Time Windows: These windows are defined by the system's processing time. Events arriving within a specific processing time window are grouped together. This approach can be problematic when event arrival times are unpredictable, as processing time may not accurately reflect the true event times.

  • Event Time Windows: These windows are defined by the event times themselves. Events with event times falling within a specific window are grouped together. This approach provides a more accurate representation of real-world event occurrences.

For example, consider a stream of application logs from mobile devices. Due to network latency, these logs may experience delays before reaching the processing system. This discrepancy between the event's actual occurrence time (event time) and its processing time can lead to inaccurate results when using processing time windows.

Watermarks address this challenge. A watermark signals that no further events with an event time earlier than the watermark's timestamp are expected to arrive. This allows the system to proceed with window processing even if some late events might still be in transit.

In other words, processing time advances monotonically with the system clock. In contrast, event time requires explicit mechanisms like watermarks to progress, ensuring timely window closure and subsequent processing.
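
A minimal sketch of event-time processing with bounded-out-of-orderness watermarks, assuming the WatermarkStrategy API introduced around Flink 1.11 (the device events and the 10-second bound are made up):

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical log events: (deviceId, eventTimeMillis).
        DataStream<Tuple2<String, Long>> logs = env.fromElements(
                Tuple2.of("device-1", 1_000L), Tuple2.of("device-2", 4_000L));

        // Bounded-out-of-orderness watermarks: declare that events may arrive up to
        // 10 seconds late; the watermark trails the maximum seen event time by 10s.
        DataStream<Tuple2<String, Long>> withTimestamps = logs.assignTimestampsAndWatermarks(
                WatermarkStrategy
                        .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                        .withTimestampAssigner((event, previousTimestamp) -> event.f1));

        withTimestamps.print();
        env.execute("Event-time watermarks");
    }
}
```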

Limitations of Watermarks

While watermarks effectively advance event time processing, they can be less effective in handling significantly late events (e.g., events arriving long after the watermark has passed).

In our example, what happens if a user's device disconnects from the internet? Logs generated during this period may be delayed significantly, and the watermark may have already advanced past their event times by the time they arrive. This is where Apache Beam is useful: it provides explicit mechanisms for handling such late events.

Recommended Read: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

Examples

  • Spark Streaming: Only works with processing time windows.
  • Flink: Supports both processing time and event time windows, providing greater flexibility.
  • Apache Beam: Offers robust support for handling late events, providing more sophisticated mechanisms for dealing with data arriving out of order.

Checkpoint Barriers

Flink guarantees exactly-once processing semantics, ensuring that each event in the stream affects the system state exactly once. To achieve this, Flink leverages Asynchronous Barrier Snapshotting (ABS).

  • ABS Mechanism: Checkpoint barriers are propagated through the operator pipeline. Upon encountering a barrier, each operator records its current state (e.g., the current count in a counter operator) to persistent storage. This creates a consistent snapshot of the system's state across all operators.

  • Restart and Recovery: In case of a failure, operators can recover from the last successful checkpoint. They resume processing from the point where the checkpoint was taken, replaying events and applying updates to their state accordingly.
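
From user code, enabling ABS amounts to turning on checkpointing. A minimal sketch, with the interval and checkpoint directory chosen purely for illustration (API names roughly as of Flink 1.13+):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Inject a checkpoint barrier into the sources every 60 seconds,
        // taking consistent snapshots with exactly-once state semantics.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Where operator state snapshots are persisted (hypothetical path).
        env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");
    }
}
```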

Limitations of ABS

ABS ensures exactly-once processing within the Flink system itself. However, it cannot prevent duplicate side effects triggered by the processing logic on external systems. For instance, if an operator initiates an external transaction (e.g., booking a flight ticket) as part of its processing, recovering from a checkpoint will result in the transaction being executed again, leading to duplicates.

Extending Exactly-Once Semantics Beyond Flink

Since 2017, Flink has extended its exactly-once guarantees to external systems. This is achieved through a two-phase commit protocol:

  1. Prepare Phase: When the last operator in the pipeline receives a checkpoint barrier, it initiates a two-phase commit with all downstream sinks. Before committing any data, the operator buffers all output records in memory.

  2. Commit Phase: Once all operators have successfully recorded their checkpoints, the last operator commits all buffered output records to the sinks.

Example: Kafka Sink

For a Kafka sink, the last operator starts a Kafka transaction when a checkpoint barrier arrives. Subsequent barriers trigger the completion of the previous transaction, ensuring that each event is written to Kafka exactly once.
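
A minimal sketch of such a sink using the newer KafkaSink builder (assuming the Flink 1.14+ Kafka connector API; broker address and topic are hypothetical). The EXACTLY_ONCE delivery guarantee is what enlists the sink in the two-phase commit described above:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class ExactlyOnceKafkaSink {
    public static KafkaSink<String> build() {
        // EXACTLY_ONCE makes the sink write through Kafka transactions that are
        // pre-committed on checkpoint barriers and committed once the checkpoint completes.
        return KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")          // hypothetical brokers
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("word-counts")                // hypothetical topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("word-count-job")
                .build();
    }
}
```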

Windowing Algorithms in Flink

Flink offers several windowing algorithms for grouping and processing data streams:

Tumbling Windows

  • Divide the data stream into fixed-size, non-overlapping intervals.
  • Suitable for applications requiring regular, discrete time intervals for analysis, such as hourly or daily aggregations.

Sliding Windows

  • Divide the data stream into fixed-size windows that overlap.
  • Enable continuous analysis with overlapping intervals, useful for rolling averages or trend detection.

Session Windows

  • Dynamically sized windows defined by periods of inactivity in the data stream.
  • Ideal for analyzing user sessions or other scenarios where activity patterns are irregular.

Global Windows

  • Unbounded windows that encompass the entire data stream.
  • Typically used for global aggregations or analyses requiring consideration of all available data.
  • Require an external trigger (e.g., a manual command or a scheduled event) to close the window and fire processing, as shown in the sketch below.
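
The four window types above map onto Flink's built-in window assigners. A minimal sketch showing how each would be constructed before being passed to keyedStream.window(...); the sizes and gaps are arbitrary examples:

```java
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowAssigners {
    public static void main(String[] args) {
        // Tumbling: fixed one-hour buckets, no overlap.
        TumblingEventTimeWindows tumbling = TumblingEventTimeWindows.of(Time.hours(1));

        // Sliding: one-hour windows evaluated every ten minutes (overlapping).
        SlidingEventTimeWindows sliding = SlidingEventTimeWindows.of(Time.hours(1), Time.minutes(10));

        // Session: a window closes after 30 minutes of inactivity per key.
        EventTimeSessionWindows session = EventTimeSessionWindows.withGap(Time.minutes(30));

        // Global: one unbounded window; requires an explicit trigger to fire,
        // e.g. keyedStream.window(global).trigger(CountTrigger.of(1000)).
        GlobalWindows global = GlobalWindows.create();
    }
}
```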

Bulk Synchronous Parallel (BSP) vs. Stale Synchronous Parallel (SSP)

BSP requires all workers to synchronize at the end of each iteration, leading to potential bottlenecks due to barrier synchronization.

SSP allows workers to proceed without waiting for stragglers, within a bounded staleness threshold, improving speed but potentially compromising consistency.

SSP is suitable for algorithms like stochastic gradient descent where some degree of staleness is acceptable.

Differences between Apache Flink, Apache Spark, and Apache Storm

Apache Flink: Designed primarily for stream processing with strong support for stateful computations and event-time semantics. Ensures data is processed exactly once, even in case of failures.

Apache Spark: Excels at batch processing tasks, with stream processing capabilities added later. Known for its fast and general-purpose nature, supporting a wide range of data processing workloads.

Apache Storm: Focused solely on real-time stream processing with high throughput and low latency. Guarantees that each message will be processed at least once, but may be processed multiple times in case of failures.
