We're now diving into a foundational paper in system design, a work that has sparked countless discussions. Authored by Google's Jeff Dean and Sanjay Ghemawat, it was presented at OSDI in 2004. Notably, this is one of the easiest papers we'll explore in this series.
Let's begin with some basic concepts.
Data Processing
Data processing involves performing operations on collections of data, often represented as records. These records can reside in various locations, such as database tables or files. Two primary approaches to data processing are:
Batch Processing
Processes data in discrete batches or groups. Batch processing is suitable for tasks that can be processed periodically, such as batch updates, nightly reports, and data warehousing. The focus of this article will be on batch processing.
Stream Processing
Processes data continuously as it arrives, one record at a time. Stream processing is ideal for real-time applications like fraud detection, stock market analysis, and IoT data processing.
The MapReduce Programming Model
MapReduce is a programming model designed for processing massive datasets across distributed computing environments. It operates through three core phases:
- Map: Transforms each input record into zero or more key-value pairs. This operation is stateless, meaning it relies solely on the input record for its output.
- Shuffle: Redistributes and groups the key-value pairs based on their keys. This step prepares the data for the reduce phase, ensuring that all values associated with the same key are processed together.
- Reduce: Aggregates the values associated with each key, producing a final output.
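To make the model concrete, here is a minimal, word-count-flavored Python sketch of the two user-supplied functions. The names map_fn and reduce_fn are placeholders, not part of any real API (the paper's actual interface is in C++):

from typing import Iterable, Iterator, Tuple

# User-supplied map function: stateless, it looks only at one input
# record and emits key-value pairs derived from it.
def map_fn(record: str) -> Iterator[Tuple[str, int]]:
    for word in record.split():
        yield (word, 1)

# User-supplied reduce function: receives one key together with all of
# the values the shuffle grouped under that key, and aggregates them.
def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    return (key, sum(values))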
Reducer Properties
- Associativity - The grouping of operations doesn't change the outcome: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c).
- Commutativity - The order of operands doesn't change the outcome: a ⊕ b = b ⊕ a.
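As a quick sketch (not from the paper): addition satisfies both properties, which is why summed counts can be combined in any grouping and order, whereas subtraction satisfies neither and would not be a safe reducer:

a, b, c = 2, 3, 5

# Addition is associative and commutative, so partial sums can be
# combined in any grouping and any order.
assert (a + b) + c == a + (b + c)
assert a + b == b + a

# Subtraction is neither, so it cannot be used as a reducer.
assert (a - b) - c != a - (b - c)
assert a - b != b - a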
Illustrative Example: Word Frequency Counting
Consider the task of counting the frequency of each word in the sentence: "the quick brown fox jumps over the lazy dog".
1. Map Phase
The map function processes each word, emitting a key-value pair where the word is the key and 1 is the value:
- (the, 1)
- (quick, 1)
- (brown, 1)
- (fox, 1)
- (jumps, 1)
- (over, 1)
- (the, 1)
- (lazy, 1)
- (dog, 1)
2. Shuffle Phase
The shuffle phase groups the key-value pairs by key, creating a list of values for each unique key:
- (brown, [1])
- (dog, [1])
- (fox, [1])
- (jumps, [1])
- (lazy, [1])
- (over, [1])
- (quick, [1])
- (the, [1, 1])
3. Reduce Phase
The reduce function aggregates the values for each key, summing them to produce the final word counts:
- (brown, 1)
- (dog, 1)
- (fox, 1)
- (jumps, 1)
- (lazy, 1)
- (over, 1)
- (quick, 1)
- (the, 2)
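Putting the three phases together, here is a minimal single-machine Python sketch of this word-count example (an illustration only, not Google's implementation):

from collections import defaultdict

sentence = "the quick brown fox jumps over the lazy dog"

# Map phase: emit (word, 1) for every word in the input.
mapped = [(word, 1) for word in sentence.split()]

# Shuffle phase: group all emitted values by their key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: sum the grouped values for each key.
counts = {key: sum(values) for key, values in grouped.items()}

print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, ...}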
Scalability and Applications
This simple example demonstrates the core concept. MapReduce's power lies in its ability to scale to handle massive datasets. When dealing with billions or trillions of words, the input is distributed across multiple machines. The map, shuffle, and reduce phases are executed in parallel, enabling efficient processing.
MapReduce and SQL
MapReduce is fundamental to many large-scale data processing systems, including those used for executing SQL queries. Data analytics often involves processing millions of rows, which aligns perfectly with MapReduce's map-shuffle-reduce paradigm. Platforms like BigQuery, Dremel, and F1 utilize MapReduce pipelines to execute complex SQL queries efficiently.
Distributed Execution of MapReduce
While a MapReduce pipeline can run on a single machine, its true power lies in distributed execution across multiple machines, known as workers. This parallel processing significantly accelerates computation.
Components
- Workers: These machines execute the map and reduce functions.
- Driver Program: This program orchestrates the entire MapReduce job, scheduling tasks and managing workers.
Execution Flow
- Input Partitioning: The driver program initially divides the input dataset into smaller, manageable partitions. These partitions are distributed among the map workers. Ideally, each input partition should fit within a worker's memory.
- Map Phase: Each map worker processes its assigned input partition, generating intermediate key-value pairs; the volume of map output scales with the amount of input processed.
- Shuffle and Partitioning: The key-value pairs are shuffled and partitioned. This step groups values with the same key and distributes them to the appropriate reduce workers. Typically, partitioning is integrated into the map and reduce phases themselves.
- Reduce Phase: Reduce workers scan the map output partitions, retrieving all key-value pairs relevant to their assigned key space. They then perform the reduce operation, aggregating values for each key.
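A toy sketch of this flow, with made-up names (map_task, partition, R) and everything running in one process: the driver splits the input, runs a map task per split, and routes each intermediate key to one of R reduce tasks with a deterministic hash:

from collections import defaultdict
from hashlib import md5

R = 3  # number of reduce tasks (chosen arbitrarily for illustration)

def partition(key: str, r: int = R) -> int:
    # Deterministic hash so every map task sends a given key
    # to the same reduce task.
    return int(md5(key.encode()).hexdigest(), 16) % r

def map_task(records):
    # Stateless map: emit (word, 1) pairs, bucketed by reduce task.
    buckets = defaultdict(list)
    for record in records:
        for word in record.split():
            buckets[partition(word)].append((word, 1))
    return buckets

# Driver: split the input into partitions and run a map task on each.
input_partitions = [["the quick brown fox"], ["jumps over the lazy dog"]]
map_outputs = [map_task(p) for p in input_partitions]

# Reduce tasks: each one pulls its bucket from every map output,
# groups by key, and sums the values.
for r in range(R):
    grouped = defaultdict(int)
    for out in map_outputs:
        for key, value in out.get(r, []):
            grouped[key] += value
    print(f"reduce task {r}: {dict(grouped)}")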
Statelessness
Both the map and reduce functions are stateless: their output depends only on their input. This is what lets the driver schedule tasks on any worker, run them in any order, and simply re-execute a task if the worker running it fails.
Underlying File System
A crucial aspect of distributed MapReduce is the underlying file system. It facilitates data storage, retrieval, and distribution across the worker nodes. We will explore how Google's MapReduce leverages its file system to achieve this.
Google's MapReduce
Google's MapReduce system is designed for large-scale data processing across a cluster of machines. Here's a breakdown of its key components and execution flow:
1. Input Splitting
Input files are divided into M partitions, typically ranging from 16 MB to 64 MB.
2. Master (Driver)
A central master coordinates all MapReduce tasks, acting as the orchestrator and scheduler for workers in a MapReduce cluster.
3. Map Phase
- Map workers are preferably run on machines where their input data already resides.
- Map workers read their assigned input splits.
- Map output is buffered in memory.
- Periodically, the buffered output is written to local disk, partitioned based on the keys (partial shuffles).
- The locations of these partitioned files are communicated to the master.
4. Shuffle Phase
- Reduce workers are each assigned a specific portion of the key space.
- Reduce workers make RPCs to read the required data from the map workers' local disks.
- Because each mapper's partitioning is local to its own output, every reducer must fetch its assigned keys from all of the mappers (a total shuffle).
5. Reduce Phase
- Reduce workers perform the reduction operation.
- The final output is written to one or more output files.
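Below is a rough sketch of this flow. A local directory stands in for each map worker's disk, and a function call stands in for the reducer's RPC; all paths and names (run_map_worker, run_reduce_worker, hash_partition) are made up for illustration:

import json
import os
import tempfile
from collections import defaultdict

R = 2  # number of reduce tasks (illustrative)

def hash_partition(key: str, r: int = R) -> int:
    # Toy deterministic partitioner: the same key always lands in the
    # same sub-partition, regardless of which map worker emitted it.
    return sum(key.encode()) % r

def run_map_worker(worker_dir: str, records) -> dict:
    # Buffer map output in memory, then spill it into R partition files
    # on the worker's "local disk" (a directory standing in for it), and
    # return the file locations that would be reported to the master.
    buckets = defaultdict(list)
    for record in records:
        for word in record.split():
            buckets[hash_partition(word)].append((word, 1))
    os.makedirs(worker_dir, exist_ok=True)
    locations = {}
    for r, pairs in buckets.items():
        path = os.path.join(worker_dir, f"part-{r}.json")
        with open(path, "w") as f:
            json.dump(pairs, f)
        locations[r] = path
    return locations

def run_reduce_worker(r: int, all_locations) -> dict:
    # Stand-in for the RPC reads: fetch sub-partition r from every map
    # worker's local disk, group by key, and reduce (sum).
    grouped = defaultdict(int)
    for locations in all_locations:
        if r in locations:
            with open(locations[r]) as f:
                for key, value in json.load(f):
                    grouped[key] += value
    return dict(grouped)

# "Master": assign one input split per map worker, collect the reported
# file locations, then hand the full list to every reduce worker.
base = tempfile.mkdtemp()
splits = [["the quick brown fox"], ["jumps over the lazy dog"]]
all_locations = [run_map_worker(os.path.join(base, f"mapper-{i}"), s)
                 for i, s in enumerate(splits)]
for r in range(R):
    print(r, run_reduce_worker(r, all_locations))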
Example
SELECT COUNT(*) FROM PROFESSORS GROUP BY DEPARTMENT;
The GROUP BY clause necessitates shuffling to group all professor records by their respective departments before counting (reduction).
The input table is divided into input partitions, each assigned to a mapper. Each mapper processes its assigned partition, extracting rows and grouping them by department, which becomes the key. These grouped records, representing partial shuffles, are then written to the mapper's local disk.
Each reducer, responsible for specific departments (keys), retrieves the corresponding sub-partitions from the local disks of all mappers and aggregates the count.
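In code, the query might be sketched roughly as follows (the table is a made-up in-memory list of rows):

from collections import defaultdict

# Hypothetical PROFESSORS rows: (name, department)
professors = [
    ("Ada", "CS"), ("Grace", "CS"), ("Emmy", "Math"), ("Alan", "CS"),
]

# Map: emit (department, 1) for each row.
mapped = [(dept, 1) for _, dept in professors]

# Shuffle + Reduce: group by department and count.
counts = defaultdict(int)
for dept, one in mapped:
    counts[dept] += one

print(dict(counts))  # {'CS': 3, 'Math': 1}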
Complexity Analysis
Consider M map tasks, each producing R sub-partitions, one for each reduce task. These sub-partitions represent partial shuffles.
Consequently, each mapper performs one I/O operation to read its input and one more to write its R sub-partitions to local disk. Each reducer requires M I/O operations to read its complete partition and one to write its output, giving a total I/O cost of M + M + (M × R) + R = (2 + R) × M + R.
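Plugging hypothetical numbers into that formula (M and R chosen arbitrarily for illustration):

M, R = 1000, 50           # map tasks and reduce tasks (example values)

map_reads     = M         # each mapper reads its input split
map_writes    = M         # one spill of its R sub-partitions per mapper, as counted above
reduce_reads  = M * R     # each reducer reads from every mapper
reduce_writes = R         # each reducer writes one output

total = map_reads + map_writes + reduce_reads + reduce_writes
assert total == (2 + R) * M + R
print(total)  # 52050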
The master node maintains metadata for M map tasks and R reduce tasks, leading to a memory complexity of O(M + R).
Key Features
Master as Central Coordinator
A single master tracks the state (idle, in-progress, completed) of every map and reduce task along with the locations of intermediate data, and relays this information to workers as needed.
Output Storage
Final reduce output is written to the global file system (GFS), while intermediate map output lives only on the map workers' local disks.
Partitioning Function
Intermediate keys are assigned to reduce tasks by a partitioning function, hash(key) mod R by default; users can supply their own (for example, to keep all URLs from the same host in one partition).
Combiner Function (Partial Shuffling)
When the reduce function is associative and commutative, an optional combiner can pre-aggregate values on the map side, shrinking the data that must cross the network.
Locality Optimization
The master tries to schedule each map task on, or close to, a machine that already holds a replica of that task's input split, conserving network bandwidth.
Stragglers
- Machines with bad disks.
- Machines with software bugs, such as OS errors.
The paper's mitigation is to schedule backup (speculative) executions of the remaining in-progress tasks as the job nears completion and take whichever copy finishes first.
Fault Tolerance
The master pings workers periodically; tasks on a failed worker are re-executed elsewhere (including completed map tasks, whose output lived on the lost local disk). This is safe because map and reduce are deterministic and stateless.
Side-Effects
Tasks that produce auxiliary output should write to temporary files and atomically rename them on completion, so partial output from failed or duplicate executions is never observed.
Bonus: Shuffle Implementations
Pull-Based Shuffle
Mappers write their output to local disk, and reducers pull their required partitions from the mappers' local disks. The mappers perform M I/O operations and the reducers perform M × R I/O operations. This is inefficient: the many small reads hurt disk throughput, and reliability and scalability suffer as well.
Pre-Shuffle Merge
Mappers on the same machine merge their outputs before writing to disk, which reduces the number of I/O operations: if F mappers are bundled together, the mappers perform M/F disk I/Os and the reducers perform (M × R)/F disk I/Os.
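Plugging example numbers into those counts (M mappers, R reducers, F mappers bundled per machine; the values are arbitrary):

M, R, F = 1000, 50, 10   # example values only

# Pull-based shuffle: one write per mapper, one read per (mapper, reducer) pair.
pull = {"mapper_writes": M, "reducer_reads": M * R}

# Pre-shuffle merge: bundling F mappers per machine divides both counts by F.
merged = {"mapper_writes": M // F, "reducer_reads": (M * R) // F}

print(pull)    # {'mapper_writes': 1000, 'reducer_reads': 50000}
print(merged)  # {'mapper_writes': 100, 'reducer_reads': 5000}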
Push-Based Shuffle
Introduces a dedicated shuffle service between the mappers and reducers. Mappers push their sub-partitions to the shuffle service in addition to writing them to local disk, and reducers retrieve their partitions from the service. This decouples mappers and reducers, improving fault tolerance and scalability.
Paper Review
This is one of the more approachable papers in distributed systems: both the concepts and the system itself are easy to grasp. It is also highly influential, serving as the backbone of FlumeJava and several other batch processing systems.