
Paper Insights #10 - MapReduce: Simplified Data Processing on Large Clusters

We're now diving into a foundational paper in system design, a work that has sparked countless discussions. Authored by Google's Jeff Dean and Sanjay Ghemawat, it was presented at OSDI in 2004. Notably, this is one of the easiest papers we'll explore in this series.

Paper Link

Let's begin with some basic concepts.

Data Processing

Data processing involves performing operations on collections of data, often represented as records. These records can reside in various locations, such as database tables or files. Two primary approaches to data processing are:

Batch Processing

Processes data in discrete batches or groups. Batch processing is suitable for tasks that can be processed periodically, such as batch updates, nightly reports, and data warehousing. The focus of this article will be on batch processing.

    Stream Processing

    Processes data continuously as it arrives, one record at a time. Stream processing is ideal for real-time applications like fraud detection, stock market analysis, and IoT data processing.

    To learn about stream processing, refer:

    The MapReduce Programming Model

    MapReduce is a programming model designed for processing massive datasets across distributed computing environments. It operates through three core phases:

    • Map: Transforms each input record into a key-value pair. This operation is stateless, meaning it relies solely on the input record for its output.
    • Shuffle: Redistributes and groups the key-value pairs based on their keys. This step prepares the data for the reduce phase, ensuring that all values associated with the same key are processed together.
    • Reduce: Aggregates the values associated with each key, producing a final output.
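
To make the shapes of these functions concrete, here is a rough Python sketch of the model (illustrative only; the paper's actual interface is a C++ library):

from typing import Iterable, Iterator, Tuple

def map_fn(record: str) -> Iterator[Tuple[str, int]]:
    # Map: stateless; derives its key-value output solely from the input record.
    ...

def reduce_fn(key: str, values: Iterable[int]) -> int:
    # Reduce: aggregates all values grouped under one key by the shuffle.
    ...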

    Shuffle

The shuffle phase is a critical part of MapReduce, responsible for redistributing data so that all values associated with the same key are processed by the same reducer.

Partitioning Function - A partitioning function determines which reducer will process each key-value pair. It takes a key and outputs a partition number. Crucially, it must ensure that all records with the same key are assigned to the same partition.

    Reducer Properties

A key requirement for reducers is that the operation they perform must have the following properties:
    • Associativity - Ensures that the grouping of elements doesn't change the outcome. 
    reduce(reduce(a, b), c) = reduce(a, reduce(b, c))
    • Commutativity - Ensures that the order of elements doesn't change the outcome.
    reduce(a, b) = reduce(b, a)

    In essence, these properties guarantee that the order in which elements are aggregated is irrelevant.
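
As a minimal sketch (the function name is mine, not the paper's), summation is a reduce operation with both properties, while subtraction has neither and therefore is not a valid reduce operation under this model:

def reduce_pair(a: int, b: int) -> int:
    # A summing reducer: both associative and commutative.
    return a + b

# Associativity: grouping does not change the outcome.
assert reduce_pair(reduce_pair(1, 2), 3) == reduce_pair(1, reduce_pair(2, 3))

# Commutativity: order does not change the outcome.
assert reduce_pair(4, 5) == reduce_pair(5, 4)

# Counter-example: subtraction is neither associative nor commutative.
assert (1 - 2) - 3 != 1 - (2 - 3)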

    Illustrative Example: Word Frequency Counting

    Consider the task of counting the frequency of each word in the sentence: "the quick brown fox jumps over the lazy dog".

    1. Map Phase

    The map function processes each word, emitting a key-value pair where the word is the key and 1 is the value:

    • (the, 1)
    • (quick, 1)
    • (brown, 1)
    • (fox, 1)
    • (jumps, 1)
    • (over, 1)
    • (the, 1)
    • (lazy, 1)
    • (dog, 1)

    2. Shuffle Phase

    The shuffle phase groups the key-value pairs by key, creating a list of values for each unique key:

    • (brown, [1])
    • (dog, [1])
    • (fox, [1])
    • (jumps, [1])
    • (lazy, [1])
    • (over, [1])
    • (quick, [1])
    • (the, [1, 1])

    3. Reduce Phase

    The reduce function aggregates the values for each key, summing them to produce the final word counts:

    • (brown, 1)
    • (dog, 1)
    • (fox, 1)
    • (jumps, 1)
    • (lazy, 1)
    • (over, 1)
    • (quick, 1)
    • (the, 2)
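
Putting the three phases together, here is a minimal single-machine sketch of this word-count example in Python (illustrative only, not the paper's C++ library):

from collections import defaultdict

def map_fn(line):
    # Map: emit (word, 1) for every word in the input record.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: sum all counts emitted for this word.
    return sum(values)

def word_count(lines):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce: aggregate each group into a final count.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

print(word_count(["the quick brown fox jumps over the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 1}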

    Scalability and Applications

    This simple example demonstrates the core concept. MapReduce's power lies in its ability to scale to handle massive datasets. When dealing with billions or trillions of words, the input is distributed across multiple machines. The map, shuffle, and reduce phases are executed in parallel, enabling efficient processing.

    MapReduce and SQL

    MapReduce is fundamental to many large-scale data processing systems, including those used for executing SQL queries. Data analytics often involves processing millions of rows, which aligns perfectly with MapReduce's map-shuffle-reduce paradigm. Platforms like BigQuery, Dremel, and F1 utilize MapReduce pipelines to execute complex SQL queries efficiently.

    Distributed Execution of MapReduce

    While a MapReduce pipeline can run on a single machine, its true power lies in distributed execution across multiple machines, known as workers. This parallel processing significantly accelerates computation.

    Components

    • Workers: These machines execute the map and reduce functions.
    • Driver Program: This program orchestrates the entire MapReduce job, scheduling tasks and managing workers.

    Execution Flow

Distributed execution of MapReduce roughly follows these stages (a sketch of the flow follows the list):
    • Input Partitioning: The driver program initially divides the input dataset into smaller, manageable partitions. These partitions are distributed among the map workers. Ideally, each input partition should fit within a worker's memory.
    • Map Phase: Each map worker processes its assigned input partition, generating key-value pairs. The number of output records from the map phase directly corresponds to the number of input records processed.
• Shuffle and Partitioning: The key-value pairs are shuffled and partitioned. This step groups values with the same key and distributes them to the appropriate reduce workers. Typically, partitioning is integrated into the map and reduce phases themselves.
    • Reduce Phase: Reduce workers scan the map output partitions, retrieving all key-value pairs relevant to their assigned key space. They then perform the reduce operation, aggregating values for each key.
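
Here is a rough, single-process sketch of that flow, with the parallelism simulated sequentially and all names illustrative. Intermediate pairs are routed to reducers by a partitioning function, and each reduce worker aggregates only its own key space:

from collections import defaultdict

R = 4  # number of reduce partitions (workers)

def partition(key):
    # Assign every key to one of the R reduce partitions.
    return hash(key) % R

def run_job(input_partitions, map_fn, reduce_fn):
    # Map phase: each map "worker" processes one input partition and
    # buckets its output by reduce partition (a partial shuffle).
    map_outputs = []
    for part in input_partitions:
        buckets = defaultdict(list)
        for record in part:
            for key, value in map_fn(record):
                buckets[partition(key)].append((key, value))
        map_outputs.append(buckets)

    # Shuffle + reduce phase: each reduce "worker" r collects bucket r
    # from every mapper, groups by key, and reduces each group.
    results = {}
    for r in range(R):
        groups = defaultdict(list)
        for buckets in map_outputs:
            for key, value in buckets.get(r, []):
                groups[key].append(value)
        for key, values in groups.items():
            results[key] = reduce_fn(key, values)
    return results

Driving run_job with the word-count map_fn and reduce_fn from earlier reproduces the same counts, but the map and reduce loops are now independent units of work that can run on separate machines.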

    Statelessness

    The key to distributed MapReduce execution is ensuring map and reduce functions are stateless. Stateless functions always produce the same output for the same input and are therefore deterministic. 

    This is important because it lets the system safely re-run tasks without changing the final outcome, which is essential for handling failures and keeping the system consistent.

    Underlying File System

    A crucial aspect of distributed MapReduce is the underlying file system. It facilitates data storage, retrieval, and distribution across the worker nodes. We will explore how Google's MapReduce leverages its file system to achieve this.

    Google's MapReduce

    Google's MapReduce system is designed for large-scale data processing across a cluster of machines. Here's a breakdown of its key components and execution flow:

    1. Input Splitting

    Input files are divided into M partitions, typically ranging from 16MB to 64MB.

    2. Master (Driver)

    A central master coordinates all MapReduce tasks, acting as the orchestrator and scheduler for workers in a MapReduce cluster.

    3. Map Phase

• Map workers are preferably run on machines where the input data already resides.
    • Map workers read their assigned input splits.
    • Map output is buffered in memory.
    • Periodically, the buffered output is written to local disk, partitioned based on the keys (partial shuffles).
    • The locations of these partitioned files are communicated to the master.

    4. Shuffle Phase

• Each reduce worker is assigned a specific key space.
    • Reduce workers make RPCs to read the required data from the map workers' local disks.
• Because the partitioning done by each mapper is local to its own output, every reduce worker performs a global scan (a total shuffle) across all map workers to fetch the keys it has been assigned.

    5. Reduce Phase

    • Reduce workers perform the reduction operation.
    • The final output is written to one or more output files.

    Example

    Consider the SQL query: 

    SELECT COUNT(*) FROM PROFESSORS GROUP BY DEPARTMENT; 

    The GROUP BY clause necessitates shuffling to group all professor records by their respective departments before counting (reduction).

    The input table is divided into input partitions, each assigned to a mapper. Each mapper processes its assigned partition, extracting rows and grouping them by department, which becomes the key. These grouped records, representing partial shuffles, are then written to the mapper's local disk. 

    Each reducer, responsible for specific departments (keys), retrieves the corresponding sub-partitions from the local disks of all mappers and aggregates the count.
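
As a hedged sketch (the row format and names are assumptions, not from the paper), the map and reduce functions for this query could look like:

def map_fn(row):
    # row is assumed to be a dict-like professor record.
    yield (row["department"], 1)

def reduce_fn(department, counts):
    # COUNT(*) per department.
    return sum(counts)

rows = [{"name": "A", "department": "CS"},
        {"name": "B", "department": "CS"},
        {"name": "C", "department": "Math"}]
# Feeding [rows] through the run_job sketch above yields {"CS": 2, "Math": 1}.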


    Complexity Analysis

Consider M map tasks, each producing R sub-partitions, one for each reducer. These sub-partitions represent partial shuffles.

Consequently, each mapper performs one I/O operation to read its input partition and one to write its R sub-partitions to local disk. Each reducer performs M I/O operations to read its complete partition (one sub-partition from each mapper) and one to write its output, resulting in a total I/O cost of M + M + (M x R) + R = (2 + R) x M + R.
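
For instance, with hypothetical values of M = 1,000 map tasks and R = 100 reducers:

M, R = 1_000, 100              # hypothetical task counts
total_io = (2 + R) * M + R     # 2*1000 + 1000*100 + 100
print(total_io)                # 102100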

    The master node maintains metadata for M map tasks and R reduce tasks, leading to a memory complexity of O(M + R).

    Key Features

    Master as Central Coordinator

The master program maintains the state of all map and reduce tasks and facilitates file location sharing. Its state is periodically snapshotted to facilitate recovery.

    Output Storage

Map output (intermediate output) is stored on local disks, while reduce output (the final output) is stored on the Google File System.

    Partitioning Function

    The default partitioning function is hash(key) % R, where R is the number of reducers. Users can also define custom partitioning functions.
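
A minimal sketch of both the default and a custom partitioner (the hostname example is mentioned in the paper; Python's built-in hash stands in for whatever hash function Google used):

from urllib.parse import urlparse

R = 5  # number of reduce tasks

def default_partition(key: str) -> int:
    # Default: hash(key) % R spreads keys roughly uniformly across reducers.
    return hash(key) % R

def hostname_partition(url_key: str) -> int:
    # Custom partitioner: route all URLs from the same host to the
    # same output partition.
    return hash(urlparse(url_key).hostname) % R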

Combiner Function (Partial Reduction)

MapReduce also supports an optional combiner function. Because each map task processes only a subset of the input, the same key can appear many times in a single mapper's output. The combiner partially reduces these values on the map worker itself before the output is written to disk, shrinking the data that reducers must later fetch and aggregate into the final result. Since the reduce operation is associative and commutative, this partial aggregation does not change the outcome.
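
For word counting, a combiner can partially sum counts on the map worker before anything is written out; a minimal sketch (names are mine):

from collections import defaultdict

def combine(pairs):
    # Partially reduce the map output on the map worker itself:
    # ("the", 1), ("the", 1) becomes ("the", 2) before it hits disk.
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

print(combine([("the", 1), ("quick", 1), ("the", 1)]))
# [('the', 2), ('quick', 1)]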

    Locality Optimization

    Google's MapReduce attempts to schedule map tasks on machines that hold the input data locally, minimizing network traffic.

    Stragglers

The system addresses "stragglers" (slow workers) by running backup tasks. Examples of stragglers include:
• Machines with bad disks.
• Machines with software bugs, such as OS errors.
Backup tasks increase the overall computational resources used by the operation by no more than a few percent.

    Fault Tolerance

    Since map output is on local disk, machine failures can lead to data loss. The master program re-executes failed map tasks. Map and reduce tasks are meant to be stateless, enabling safe retries without affecting correctness.

    When a map task finishes, it sends a completion notification to the master. The master ensures atomicity by marking the task as completed and ignoring any subsequent completion reports for that task, including those from backup tasks. For reduce tasks, completion involves atomically renaming the output file on GFS.
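
On a local POSIX file system, the same commit pattern can be sketched as writing to a temporary file and then atomically renaming it into place (an analogy to the GFS rename, not Google's actual code):

import os
import tempfile

def commit_reduce_output(final_path, data):
    # Write to a temporary file in the same directory first...
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    # ...then atomically rename it into place. A retried task simply
    # overwrites the file with identical (deterministic) output.
    os.replace(tmp_path, final_path)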

    Side-Effects

    While we've primarily discussed stateless and deterministic map/reduce tasks, they can also produce side-effects, such as modifying external system states. To ensure system reliability, especially in the face of failures that trigger task retries, application developers must guarantee that these side-effects are idempotent.

    Bonus: Shuffle Implementations

    There are different implementations of shuffle architecture for MapReduce.

    Pull-Based Shuffle

    Mappers write their output to local disk. Reducers pull their required partitions from the mappers' local disks. Mappers perform M I/O operations, and reducers perform MxR I/O operations.

    This is exactly what is used in Google's MapReduce.

    This is inefficient due to numerous small I/O operations, impacting disk throughput. Reliability and scalability are also concerns.

    Pre-Shuffle Merge

Mappers on the same machine merge their outputs before writing to disk. If F mappers are bundled per machine, the mappers perform M/F disk I/O operations and the reducers perform (M x R)/F disk I/O operations, reducing the total I/O count.

    This is used in Riffle by Facebook.
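
Plugging hypothetical numbers into the formulas above shows the savings from bundling F mappers per machine:

M, R, F = 1_000, 100, 10       # hypothetical map tasks, reducers, mappers per machine

pull_based = M + M * R         # mapper writes + reducer reads
pre_shuffle = M // F + (M * R) // F

print(pull_based)              # 101000
print(pre_shuffle)             # 10100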

    Push-Based Shuffle

    Introduces a dedicated shuffle service between mappers and reducers. Mappers push their sub-partitions to the shuffle service, in addition to writing to local disk. Reducers retrieve their partitions from the shuffle service. This decouples mappers and reducers, improving fault tolerance and scalability.

    This is used in Magnet by LinkedIn.

    The shuffle service stores the shuffled partitions on disk to prevent memory overload. Reducers can perform a single large I/O operation to read their assigned partition from the shuffle service.


Note that a mapper's push to the shuffle service is best-effort. If the push fails, reducers can fall back to pulling from the mapper's local disk.

    Paper Review

This is one of the easiest papers in distributed systems; both the concepts and the system itself are simple to grasp. The paper is also highly influential, as it became the backbone of FlumeJava and several other batch processing systems.
