Real World Architecture · MapReduce

Key Question

How does data flow from Map to Reduce, and what happens during the Shuffle?

Deep Dive

A MapReduce job has five phases: Split, Map, Shuffle, Sort, Reduce. The Shuffle is the most complex and the most network-intensive.

Full Execution Flow

Input Files (on GFS)
     │
     ▼
┌─────────────────────────────────────────┐
│ Split into M = 3 partitions (each ~64MB)│
└─────────────────────────────────────────┘
     │          │          │
     ▼          ▼          ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Map 1   │ │ Map 2   │ │ Map 3   │      ← M Map tasks
│ (local) │ │ (local) │ │ (local) │         on machines with data
└────┬────┘ └────┬────┘ └────┬────┘
     │           │           │
  Local disk  Local disk  Local disk        ← Intermediate data stored
  ┌─┬─┬─┐      ┌─┬─┬─┐     ┌─┬─┬─┐           on local disk (not GFS)
  │P│P│P│      │P│P│P│     │P│P│P│
  │0│1│2│      │0│1│2│     │0│1│2│           ← R partitions per map
  └┬┴┬┴┬┘      └┬┴┬┴┬┘     └┬┴┬┴┬┘
   │ │ │         │ │ │       │ │ │
   │ │ └─────────┼─┼─────────┼─┘ │
   │ └───────────┼─┼─────────┼───┘            ← SHUFFLE: all-to-all
   └─────────────┼─┼─────────┘                   (each reducer fetches
                 │ │                              from EVERY map node)
                 ▼ ▼
           ┌──────────┐
           │ Reduce 1 │                         ← SORT on (key, value)
           │ (merged) │
           └────┬─────┘
                ▼
           Output part-00001
                 │
                 ▼
           GFS (final output)

Before the Shuffle: Map Output Partitioning

When a Map task finishes processing its input split, it doesn’t just dump all key-value pairs into one pile. It partitions them into R partitions (R = number of reduce tasks) using a partitioning function:

partition = hash(key) mod R

Each partition is written to the local disk of the Map task’s machine. So if you have M = 1000 Map tasks and R = 100 Reduce tasks, there are 100,000 partition files on local disks across the cluster (1000 maps × 100 partitions each).

The Shuffle (Step 3): All-to-All Data Transfer

The Shuffle is the phase where intermediate data moves from Map machines to Reduce machines. It’s called “all-to-all” because every reducer must fetch data from every mapper.

Here’s what the Shuffle involves for a single reducer (say, Reducer 7):

Reducer 7 needs partition 7 from every Map task.

Map 1    →  fetch /map1/part-00007    (over network)
Map 2    →  fetch /map2/part-00007    (over network)
Map 3    →  fetch /map3/part-00007    (over network)
...
Map 1000 →  fetch /map1000/part-00007 (over network)

Total data per reducer ≈ (input size × intermediate ratio) / R

The Shuffle starts as soon as any Map task finishes (it doesn’t wait for all Maps). Each Reducer continuously fetches its partitions from completed Map tasks.

Sort (Step 4): Grouping by Key

Once a Reducer has fetched all its partition data, it must sort by key. This groups all values for the same key together:

Before sort:
  (dog, 1), (brown, 1), (fox, 1), (brown, 1), (dog, 1)

After sort:
  (brown, [1, 1])
  (dog,   [1, 1])
  (fox,   [1])

The sort is external (uses disk) if the data is too large to fit in memory.

Why the Shuffle Is the Bottleneck

The Shuffle is the most expensive phase for three reasons:

Network saturation. All intermediate data crosses the network exactly once between the Map and Reduce phases. For a 1 TB input, the intermediate data could be 500 GB to 2 TB (depending on the computation). All of this data is transferred over the cluster network.
Many-to-many communication. Every mapper communicates with every reducer. With 1000 mappers and 100 reducers, that’s 100,000 network connections.
Straggler sensitivity. If one Map task is slow, it delays ALL reducers (they cannot produce final output until they have data from every Map task).

Optimizing Shuffle performance is the single most important tuning step for MapReduce jobs. Compression, combiner functions, and reducing intermediate data size are the primary techniques.

Check Your Understanding

Why is intermediate data written to local disk instead of GFS? What would change if it were written to GFS?
Can a Reduce task start processing before all Map tasks finish? What constraint governs this?
With M = 1000 Map tasks and R = 10 Reduce tasks, how many partition files exist on disk after all Maps complete? Where are they?

The “So What?”

The Shuffle is often the performance bottleneck in MapReduce jobs. Understanding it is crucial for tuning: compress intermediate data, use Combiners to reduce data before the Shuffle, and ensure even key distribution (no hot keys). Spark optimized this by keeping data in memory and using DAG-based shuffle, but the all-to-all data movement is unavoidable in the map-and-aggregate pattern.

✏️ Exercises

MapReduce: Exercises

Map output loss. A Map task completes successfully and stores its output on local disk. Then the worker machine loses power (crashes). The master detects the failure and marks the Map task as “idle” for re-execution. Why can’t the master just use the completed output? Where is the output stored?
Reduce task dependency. The Map phase has 100 tasks. The Reduce phase has 10 tasks. Can Reduce task 3 begin processing its data before Map task 57 has finished? Explain the constraint precisely.
Straggler math. You have 1000 Map tasks. Each task takes 10 seconds normally. There is a 1% chance per task that it runs on a “bad” machine and takes 100 seconds. What is the expected job completion time WITHOUT backup tasks? What happens WITH backup tasks (assume the backup has a 99% chance of finishing in 10 seconds on a non-bad machine)?
Google’s evolution. Google eventually moved away from MapReduce toward Flume (and later Apache Beam/Dataflow). What limitations of MapReduce drove this migration? What did Flume/Dataflow do differently?

👁️ View Solutions

MapReduce: Solutions

1. Map output loss

Answer. The Map task’s output is stored on the local disk of the worker machine. It is NOT stored in GFS. The reason: intermediate data is temporary (it will be consumed by Reduce tasks and then deleted), so writing it to GFS would be unnecessarily expensive (3x replication, network I/O). When the worker machine crashes, its local disk becomes inaccessible, so the output is lost. The master marks the task as “idle” and re-executes it on a different worker. The Map function is deterministic, so re-execution produces identical output.

2. Reduce task dependency

Answer. The precise constraint: for Reduce task r to produce its final output, it must have received partition r from ALL Map tasks (1 through 100). This is because each Map task may have emitted key-value pairs that hash to partition r. However, the Reducer does NOT need to wait for all Map tasks to finish before it starts fetching — it can fetch partition r from each Map task as soon as that Map task finishes. The Shuffle runs concurrently with the tail of the Map phase. But the Reducer cannot apply the Reduce function to produce final output until partition r is complete (i.e., all Map tasks have provided their share of partition r).

3. Straggler math

Answer.

Without backup tasks: The probability that a specific task is a straggler is 1% (0.01). With 1000 tasks, the number of stragglers follows a binomial distribution. The probability of at least one straggler is 1 - P(zero stragglers) = 1 - (0.99)^1000 ≈ 1 - 0.000043 ≈ 0.999957. With at least one straggler, the completion time is at least 100 seconds (the straggler time). So the expected completion time ≈ 100 seconds (technically slightly less if there’s a tiny chance of zero stragglers, but effectively 100s).

With backup tasks: When the master detects the remaining in-progress tasks (near the end of the job), it schedules backup copies. If the original task is slow because of a bad machine, the backup on a good machine finishes in 10 seconds. The backup finishes first, and the master reports completion at ~11 seconds (10s for the fast tasks + a small delay for the backup to be scheduled and complete). The expected completion time drops from ~100s to ~11s — a ~9x improvement. This matches the paper’s reported results.

4. Google’s evolution

Answer. Google moved away from MapReduce because of several limitations:

Batch-only. MapReduce processes data in fixed batches. It cannot handle streaming or real-time data. Flume introduced pipelines that process data incrementally.
Disk-based intermediate data. MapReduce writes Map outputs to disk before the Shuffle. Flume/Dataflow keeps data in memory across stages (pipelined execution), dramatically reducing latency.
Limited programming model. MapReduce is a two-stage model (Map then Reduce). Many real-world pipelines need more stages (Map → Reduce → Map → Reduce) or branching. Flume/Dataflow uses a DAG (Directed Acyclic Graph) of transformations, not a fixed two-stage structure.
No join optimization. MapReduce handles joins poorly (they must be implemented manually as secondary sort or multi-stage jobs). Flume/Dataflow has native join operations with optimizer-calculated strategies.
No incremental processing. Every MapReduce job processes the entire input. Dataflow supports windowing and incremental processing, so repeating a computation on new data doesn’t re-process the entire dataset.
Resource inefficiency. The fixed slot model (fixed number of Map and Reduce slots) wastes resources. Dataflow uses dynamic resource allocation.

Apache Beam (open-source successor to Flume) unifies batch and streaming under a single programming model, which was the key innovation Google needed beyond MapReduce.