
Paper Insights #10 - MapReduce: Simplified Data Processing on Large Clusters

We're now diving into a foundational paper in system design, a work that has sparked countless discussions. Authored by Google's Jeff Dean and Sanjay Ghemawat, it was presented at OSDI in 2004. Notably, this is one of the easiest papers we'll explore in this series.

Paper Link

Let's begin with some basic concepts.

Data Processing

Data processing involves performing operations on collections of data, often represented as records. These records can reside in various locations, such as database tables or files. Two primary approaches to data processing are:

Batch Processing

Processes data in discrete batches or groups. Batch processing is suitable for tasks that can be processed periodically, such as batch updates, nightly reports, and data warehousing. The focus of this article will be on batch processing.

    Stream Processing

    Processes data continuously as it arrives, one record at a time. Stream processing is ideal for real-time applications like fraud detection, stock market analysis, and IoT data processing.

    To learn about stream processing, refer:

    The MapReduce Programming Model

    MapReduce is a programming model designed for processing massive datasets across distributed computing environments. It operates through three core phases:

    • Map: Transforms each input record into a key-value pair. This operation is stateless, meaning it relies solely on the input record for its output.
    • Shuffle: Redistributes and groups the key-value pairs based on their keys. This step prepares the data for the reduce phase, ensuring that all values associated with the same key are processed together.
    • Reduce: Aggregates the values associated with each key, producing a final output.
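
To make the shapes of these functions concrete, here is a rough Python sketch of the model (illustrative only; the paper's actual interface is a C++ library):

from typing import Iterable, Iterator, Tuple

def map_fn(record: str) -> Iterator[Tuple[str, int]]:
    # Map: stateless; derives its key-value output solely from the input record.
    ...

def reduce_fn(key: str, values: Iterable[int]) -> int:
    # Reduce: aggregates all values grouped under one key by the shuffle.
    ...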

    Shuffle

The shuffle phase is a critical part of MapReduce, responsible for redistributing data so that all values associated with the same key are processed by the same reducer.

Partitioning Function - A partitioning function determines which reducer will process each key-value pair. It takes a key and outputs a partition number. Crucially, it must ensure that all records with the same key are assigned to the same partition.

    Reducer Properties

A key requirement for reducers is that the operation they perform must have the following properties:
    • Associativity - Ensures that the grouping of elements doesn't change the outcome. 
    reduce(reduce(a, b), c) = reduce(a, reduce(b, c))
    • Commutativity - Ensures that the order of elements doesn't change the outcome.
    reduce(a, b) = reduce(b, a)

    In essence, these properties guarantee that the order in which elements are aggregated is irrelevant.
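
As a minimal sketch (the function name is mine, not the paper's), summation is a reduce operation with both properties, while subtraction has neither and therefore is not a valid reduce operation under this model:

def reduce_pair(a: int, b: int) -> int:
    # A summing reducer: both associative and commutative.
    return a + b

# Associativity: grouping does not change the outcome.
assert reduce_pair(reduce_pair(1, 2), 3) == reduce_pair(1, reduce_pair(2, 3))

# Commutativity: order does not change the outcome.
assert reduce_pair(4, 5) == reduce_pair(5, 4)

# Counter-example: subtraction is neither associative nor commutative.
assert (1 - 2) - 3 != 1 - (2 - 3)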

    Illustrative Example: Word Frequency Counting

    Consider the task of counting the frequency of each word in the sentence: "the quick brown fox jumps over the lazy dog".

    1. Map Phase

    The map function processes each word, emitting a key-value pair where the word is the key and 1 is the value:

    • (the, 1)
    • (quick, 1)
    • (brown, 1)
    • (fox, 1)
    • (jumps, 1)
    • (over, 1)
    • (the, 1)
    • (lazy, 1)
    • (dog, 1)

    2. Shuffle Phase

    The shuffle phase groups the key-value pairs by key, creating a list of values for each unique key:

    • (brown, [1])
    • (dog, [1])
    • (fox, [1])
    • (jumps, [1])
    • (lazy, [1])
    • (over, [1])
    • (quick, [1])
    • (the, [1, 1])

    3. Reduce Phase

    The reduce function aggregates the values for each key, summing them to produce the final word counts:

    • (brown, 1)
    • (dog, 1)
    • (fox, 1)
    • (jumps, 1)
    • (lazy, 1)
    • (over, 1)
    • (quick, 1)
    • (the, 2)
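
Putting the three phases together, here is a minimal single-machine sketch of this word-count example in Python (illustrative only, not the paper's C++ library):

from collections import defaultdict

def map_fn(line):
    # Map: emit (word, 1) for every word in the input record.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: sum all counts emitted for this word.
    return sum(values)

def word_count(lines):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce: aggregate each group into a final count.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

print(word_count(["the quick brown fox jumps over the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 1}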

    Scalability and Applications

    This simple example demonstrates the core concept. MapReduce's power lies in its ability to scale to handle massive datasets. When dealing with billions or trillions of words, the input is distributed across multiple machines. The map, shuffle, and reduce phases are executed in parallel, enabling efficient processing.

    MapReduce and SQL

    MapReduce is fundamental to many large-scale data processing systems, including those used for executing SQL queries. Data analytics often involves processing millions of rows, which aligns perfectly with MapReduce's map-shuffle-reduce paradigm. Platforms like BigQuery, Dremel, and F1 utilize MapReduce pipelines to execute complex SQL queries efficiently.

    Distributed Execution of MapReduce

    While a MapReduce pipeline can run on a single machine, its true power lies in distributed execution across multiple machines, known as workers. This parallel processing significantly accelerates computation.

    Components

    • Workers: These machines execute the map and reduce functions.
    • Driver Program: This program orchestrates the entire MapReduce job, scheduling tasks and managing workers.

    Execution Flow

Distributed execution of MapReduce roughly follows these stages (a sketch of the flow follows the list):
    • Input Partitioning: The driver program initially divides the input dataset into smaller, manageable partitions. These partitions are distributed among the map workers. Ideally, each input partition should fit within a worker's memory.
    • Map Phase: Each map worker processes its assigned input partition, generating key-value pairs. The number of output records from the map phase directly corresponds to the number of input records processed.
• Shuffle and Partitioning: The key-value pairs are shuffled and partitioned. This step groups values with the same key and distributes them to the appropriate reduce workers. Typically, partitioning is integrated into the map and reduce phases themselves.
    • Reduce Phase: Reduce workers scan the map output partitions, retrieving all key-value pairs relevant to their assigned key space. They then perform the reduce operation, aggregating values for each key.
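
Here is a rough, single-process sketch of that flow, with the parallelism simulated sequentially and all names illustrative. Intermediate pairs are routed to reducers by a partitioning function, and each reduce worker aggregates only its own key space:

from collections import defaultdict

R = 4  # number of reduce partitions (workers)

def partition(key):
    # Assign every key to one of the R reduce partitions.
    return hash(key) % R

def run_job(input_partitions, map_fn, reduce_fn):
    # Map phase: each map "worker" processes one input partition and
    # buckets its output by reduce partition (a partial shuffle).
    map_outputs = []
    for part in input_partitions:
        buckets = defaultdict(list)
        for record in part:
            for key, value in map_fn(record):
                buckets[partition(key)].append((key, value))
        map_outputs.append(buckets)

    # Shuffle + reduce phase: each reduce "worker" r collects bucket r
    # from every mapper, groups by key, and reduces each group.
    results = {}
    for r in range(R):
        groups = defaultdict(list)
        for buckets in map_outputs:
            for key, value in buckets.get(r, []):
                groups[key].append(value)
        for key, values in groups.items():
            results[key] = reduce_fn(key, values)
    return results

Driving run_job with the word-count map_fn and reduce_fn from earlier reproduces the same counts, but the map and reduce loops are now independent units of work that can run on separate machines.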

    Statelessness

    The key to distributed MapReduce execution is ensuring map and reduce functions are stateless. Stateless functions always produce the same output for the same input and are therefore deterministic. 

    This is important because it lets the system safely re-run tasks without changing the final outcome, which is essential for handling failures and keeping the system consistent.

    Underlying File System

    A crucial aspect of distributed MapReduce is the underlying file system. It facilitates data storage, retrieval, and distribution across the worker nodes. We will explore how Google's MapReduce leverages its file system to achieve this.

    Google's MapReduce

    Google's MapReduce system is designed for large-scale data processing across a cluster of machines. Here's a breakdown of its key components and execution flow:

    1. Input Splitting

    Input files are divided into M partitions, typically ranging from 16MB to 64MB.

    2. Master (Driver)

    A central master coordinates all MapReduce tasks, acting as the orchestrator and scheduler for workers in a MapReduce cluster.

    3. Map Phase

• Map workers are preferably run on machines where the input data already resides.
    • Map workers read their assigned input splits.
    • Map output is buffered in memory.
    • Periodically, the buffered output is written to local disk, partitioned based on the keys (partial shuffles).
    • The locations of these partitioned files are communicated to the master.

    4. Shuffle Phase

• Each reduce worker is assigned a specific key space.
    • Reduce workers make RPCs to read the required data from the map workers' local disks.
• Because the partitioning done by each mapper is local to its own output, every reduce worker performs a global scan (a total shuffle) across all map workers to fetch the keys it has been assigned.

    5. Reduce Phase

    • Reduce workers perform the reduction operation.
    • The final output is written to one or more output files.

    Example

    Consider the SQL query: 

    SELECT COUNT(*) FROM PROFESSORS GROUP BY DEPARTMENT; 

    The GROUP BY clause necessitates shuffling to group all professor records by their respective departments before counting (reduction).

    The input table is divided into input partitions, each assigned to a mapper. Each mapper processes its assigned partition, extracting rows and grouping them by department, which becomes the key. These grouped records, representing partial shuffles, are then written to the mapper's local disk. 

    Each reducer, responsible for specific departments (keys), retrieves the corresponding sub-partitions from the local disks of all mappers and aggregates the count.
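
As a hedged sketch (the row format and names are assumptions, not from the paper), the map and reduce functions for this query could look like:

def map_fn(row):
    # row is assumed to be a dict-like professor record.
    yield (row["department"], 1)

def reduce_fn(department, counts):
    # COUNT(*) per department.
    return sum(counts)

rows = [{"name": "A", "department": "CS"},
        {"name": "B", "department": "CS"},
        {"name": "C", "department": "Math"}]
# Feeding [rows] through the run_job sketch above yields {"CS": 2, "Math": 1}.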


    Complexity Analysis

Consider M map tasks, each producing R sub-partitions, one for each reducer. These sub-partitions represent partial shuffles.

Consequently, each mapper performs one I/O operation to read its input partition and one to write its R sub-partitions to local disk. Each reducer performs M I/O operations to read its complete partition (one sub-partition from each mapper) and one to write its output, resulting in a total I/O cost of M + M + (M x R) + R = (2 + R) x M + R.
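
For instance, with hypothetical values of M = 1,000 map tasks and R = 100 reducers:

M, R = 1_000, 100              # hypothetical task counts
total_io = (2 + R) * M + R     # 2*1000 + 1000*100 + 100
print(total_io)                # 102100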

    The master node maintains metadata for M map tasks and R reduce tasks, leading to a memory complexity of O(M + R).

    Key Features

    Master as Central Coordinator

The master program maintains the state of all map and reduce tasks and facilitates file location sharing. Its state is periodically snapshotted to facilitate recovery.

    Output Storage

Map output (intermediate output) is stored on local disks, while reduce output (the final output) is stored on the Google File System.

    Partitioning Function

    The default partitioning function is hash(key) % R, where R is the number of reducers. Users can also define custom partitioning functions.
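
A minimal sketch of both the default and a custom partitioner (the hostname example is mentioned in the paper; Python's built-in hash stands in for whatever hash function Google used):

from urllib.parse import urlparse

R = 5  # number of reduce tasks

def default_partition(key: str) -> int:
    # Default: hash(key) % R spreads keys roughly uniformly across reducers.
    return hash(key) % R

def hostname_partition(url_key: str) -> int:
    # Custom partitioner: route all URLs from the same host to the
    # same output partition.
    return hash(urlparse(url_key).hostname) % R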

Combiner Function (Partial Reduction)

MapReduce also supports an optional combiner function. Because each map task processes only a subset of the input, the same key can appear many times in a single mapper's output. The combiner partially reduces these values on the map worker itself before the output is written to disk, shrinking the data that reducers must later fetch and aggregate into the final result. Since the reduce operation is associative and commutative, this partial aggregation does not change the outcome.
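
For word counting, a combiner can partially sum counts on the map worker before anything is written out; a minimal sketch (names are mine):

from collections import defaultdict

def combine(pairs):
    # Partially reduce the map output on the map worker itself:
    # ("the", 1), ("the", 1) becomes ("the", 2) before it hits disk.
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

print(combine([("the", 1), ("quick", 1), ("the", 1)]))
# [('the', 2), ('quick', 1)]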

    Locality Optimization

    Google's MapReduce attempts to schedule map tasks on machines that hold the input data locally, minimizing network traffic.

    Stragglers

The system addresses "stragglers" (slow workers) by running backup tasks. Examples of stragglers include:
• Machines with bad disks.
• Machines with software bugs, such as OS errors.
Backup tasks increase the overall computational resources used by the operation by no more than a few percent.

    Fault Tolerance

    Since map output is on local disk, machine failures can lead to data loss. The master program re-executes failed map tasks. Map and reduce tasks are meant to be stateless, enabling safe retries without affecting correctness.

    When a map task finishes, it sends a completion notification to the master. The master ensures atomicity by marking the task as completed and ignoring any subsequent completion reports for that task, including those from backup tasks. For reduce tasks, completion involves atomically renaming the output file on GFS.
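
On a local POSIX file system, the same commit pattern can be sketched as writing to a temporary file and then atomically renaming it into place (an analogy to the GFS rename, not Google's actual code):

import os
import tempfile

def commit_reduce_output(final_path, data):
    # Write to a temporary file in the same directory first...
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    # ...then atomically rename it into place. A retried task simply
    # overwrites the file with identical (deterministic) output.
    os.replace(tmp_path, final_path)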

    Side-Effects

    While we've primarily discussed stateless and deterministic map/reduce tasks, they can also produce side-effects, such as modifying external system states. To ensure system reliability, especially in the face of failures that trigger task retries, application developers must guarantee that these side-effects are idempotent.

    Bonus: Shuffle Implementations

    There are different implementations of shuffle architecture for MapReduce.

    Pull-Based Shuffle

    Mappers write their output to local disk. Reducers pull their required partitions from the mappers' local disks. Mappers perform M I/O operations, and reducers perform MxR I/O operations.

    This is exactly what is used in Google's MapReduce.

    This is inefficient due to numerous small I/O operations, impacting disk throughput. Reliability and scalability are also concerns.

    Pre-Shuffle Merge

Mappers on the same machine merge their outputs before writing to disk. If F mappers are bundled per machine, the mappers perform M/F disk I/O operations and the reducers perform (M x R)/F disk I/O operations, reducing the total I/O count.

    This is used in Riffle by Facebook.
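
Plugging hypothetical numbers into the formulas above shows the savings from bundling F mappers per machine:

M, R, F = 1_000, 100, 10       # hypothetical map tasks, reducers, mappers per machine

pull_based = M + M * R         # mapper writes + reducer reads
pre_shuffle = M // F + (M * R) // F

print(pull_based)              # 101000
print(pre_shuffle)             # 10100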

    Push-Based Shuffle

    Introduces a dedicated shuffle service between mappers and reducers. Mappers push their sub-partitions to the shuffle service, in addition to writing to local disk. Reducers retrieve their partitions from the shuffle service. This decouples mappers and reducers, improving fault tolerance and scalability.

    This is used in Magnet by LinkedIn.

    The shuffle service stores the shuffled partitions on disk to prevent memory overload. Reducers can perform a single large I/O operation to read their assigned partition from the shuffle service.


Note that a mapper's push to the shuffle service is best-effort. If the push fails, reducers can fall back to pulling from the mapper's local disk.

    Paper Review

This is one of the easiest papers in distributed systems; both the concepts and the system itself are simple to grasp. The paper is also highly influential, as it became the backbone of FlumeJava and several other batch processing systems.
