Real World Architecture · MapReduce

Key Question

What happens when one of the 1000 Map tasks fails halfway through?

Deep Dive

MapReduce runs on commodity machines that fail regularly. Disk failures, network partitions, OOM kills, and operator errors are expected. The system must handle failures without losing data or leaving the job in an inconsistent state.

The Master’s Watch

A single master coordinates the MapReduce job. The master:

Assigns Map tasks to workers (data-local workers get priority)
Assigns Reduce tasks after Map phase completes
Pings workers periodically (heartbeat every few seconds)
Tracks task state: idle, in-progress, completed, or failed
Maintains the identity and location of every intermediate file

Task State Machine:

          start     assigned     worker reports
  idle ────────► in-progress ──────────────► completed
                    │  ▲
                    │  │ heartbeat timeout
                    │  └──► idle (re-execute)
                    │
                    └──► failed (error reported)

  failed tasks are retried up to a limit,
  then the entire job fails.

Worker Failure: The Classic Case

A Map worker fails (machine crashes, process dies, network lost):

Timeline:

T=0  Master assigns Map task 42 to Worker W
T=5  W finishes Map 42, reports completion
T=7  Master assigns Map task 55 to Worker W
T=10 Master's heartbeat to W times out
     ┌─ Master marks W as dead
     ├─ All of W's in-progress tasks → idle
     ├─ All of W's completed tasks → idle  ← CRITICAL
     └─ Master re-assigns these tasks to other workers

T=12 Master assigns Map task 55 to Worker X (was idle)
T=13 Master assigns Map task 42 to Worker Y (was "completed"!)
T=18 Worker X finishes Map 55
T=20 Worker Y finishes Map 42 (re-execution)

Why re-execute completed Map tasks? Because Map outputs are stored on the local disk of the worker machine. When the worker goes down, its local disk is inaccessible. Even though the Map task had completed, its output is gone. Reduce tasks that were in the process of fetching data from that worker will retry when the task is re-executed.

This is why intermediate data is NOT stored in GFS: writing to GFS would be expensive (3x replication, network overhead), and the data is temporary — it will be consumed by Reduce tasks and then discarded. The trade-off is that completed Maps on dead workers must be re-executed.

Reduce Worker Failure

A Reduce worker failing is less costly:

Reduce worker R7 fails:
1. Master detects R7 is dead
2. R7's in-progress tasks are set to idle
3. R7's Reduce tasks are re-assigned to other workers
4. No re-execution of Map tasks needed — the intermediate data is still
   on the Map workers' disks (those Map workers are still alive)
5. The new Reduce worker re-fetches partition data from all Map workers

The Purity Assumption

The entire fault tolerance model rests on one critical assumption: Map and Reduce functions are pure (deterministic, side-effect free). If you run the same Map function on the same input twice, you get exactly the same output. This means re-execution is always safe.

Consider what happens if Map is NOT pure:

Non-deterministic Map. If Map calls random() or reads the system clock, re-execution could produce different intermediate key-value pairs. The Reduce output would be inconsistent across runs.
Side effects in Map. If Map writes to an external database, re-execution would duplicate the writes. The database would get double-counted data.
Non-idempotent Map. If Map increments a shared counter, re-execution would double-count.

MapReduce enforces purity by design. The framework doesn’t check it; your job just breaks subtly if you violate it. (Debugging a non-deterministic MapReduce job is a nightmare I hope you never experience.)

Handling Slow Workers (Speculative Execution)

The master also handles workers that aren’t failing but are suspiciously slow (these are called stragglers — covered in detail in the next lesson). The master can speculatively execute a backup copy of a slow task, keeping whichever finishes first.

Limits of Fault Tolerance

MapReduce’s fault tolerance has limits:

Master failure. If the master fails, the entire job fails. There’s no master HA in the original MapReduce.
Determinism requirement. Non-deterministic functions produce irreproducible results after re-execution.
Input data loss. If the underlying GFS chunk is corrupted and unrecoverable, Map tasks cannot be re-executed — the input is gone.

Check Your Understanding

If a Map task fails, why are its outputs lost even if they were “completed”?
Can a Reduce function start before ALL Map tasks finish? What mechanism governs this?
A Map task reads the current timestamp and uses it as part of its output key. The worker fails mid-task. The master re-executes the task on a different worker. What’s the problem?

The “So What?”

MapReduce’s fault tolerance depends on the Map function being deterministic. This purity constraint is not theoretical — it’s a fundamental requirement passed down to every framework that uses the MapReduce model. When you write Spark transformations, they inherit this same constraint: transformations must be deterministic for lineage-based recovery to work correctly.

✏️ Exercises

MapReduce: Exercises

Map output loss. A Map task completes successfully and stores its output on local disk. Then the worker machine loses power (crashes). The master detects the failure and marks the Map task as “idle” for re-execution. Why can’t the master just use the completed output? Where is the output stored?
Reduce task dependency. The Map phase has 100 tasks. The Reduce phase has 10 tasks. Can Reduce task 3 begin processing its data before Map task 57 has finished? Explain the constraint precisely.
Straggler math. You have 1000 Map tasks. Each task takes 10 seconds normally. There is a 1% chance per task that it runs on a “bad” machine and takes 100 seconds. What is the expected job completion time WITHOUT backup tasks? What happens WITH backup tasks (assume the backup has a 99% chance of finishing in 10 seconds on a non-bad machine)?
Google’s evolution. Google eventually moved away from MapReduce toward Flume (and later Apache Beam/Dataflow). What limitations of MapReduce drove this migration? What did Flume/Dataflow do differently?

👁️ View Solutions

MapReduce: Solutions

1. Map output loss

Answer. The Map task’s output is stored on the local disk of the worker machine. It is NOT stored in GFS. The reason: intermediate data is temporary (it will be consumed by Reduce tasks and then deleted), so writing it to GFS would be unnecessarily expensive (3x replication, network I/O). When the worker machine crashes, its local disk becomes inaccessible, so the output is lost. The master marks the task as “idle” and re-executes it on a different worker. The Map function is deterministic, so re-execution produces identical output.

2. Reduce task dependency

Answer. The precise constraint: for Reduce task r to produce its final output, it must have received partition r from ALL Map tasks (1 through 100). This is because each Map task may have emitted key-value pairs that hash to partition r. However, the Reducer does NOT need to wait for all Map tasks to finish before it starts fetching — it can fetch partition r from each Map task as soon as that Map task finishes. The Shuffle runs concurrently with the tail of the Map phase. But the Reducer cannot apply the Reduce function to produce final output until partition r is complete (i.e., all Map tasks have provided their share of partition r).

3. Straggler math

Answer.

Without backup tasks: The probability that a specific task is a straggler is 1% (0.01). With 1000 tasks, the number of stragglers follows a binomial distribution. The probability of at least one straggler is 1 - P(zero stragglers) = 1 - (0.99)^1000 ≈ 1 - 0.000043 ≈ 0.999957. With at least one straggler, the completion time is at least 100 seconds (the straggler time). So the expected completion time ≈ 100 seconds (technically slightly less if there’s a tiny chance of zero stragglers, but effectively 100s).

With backup tasks: When the master detects the remaining in-progress tasks (near the end of the job), it schedules backup copies. If the original task is slow because of a bad machine, the backup on a good machine finishes in 10 seconds. The backup finishes first, and the master reports completion at ~11 seconds (10s for the fast tasks + a small delay for the backup to be scheduled and complete). The expected completion time drops from ~100s to ~11s — a ~9x improvement. This matches the paper’s reported results.

4. Google’s evolution

Answer. Google moved away from MapReduce because of several limitations:

Batch-only. MapReduce processes data in fixed batches. It cannot handle streaming or real-time data. Flume introduced pipelines that process data incrementally.
Disk-based intermediate data. MapReduce writes Map outputs to disk before the Shuffle. Flume/Dataflow keeps data in memory across stages (pipelined execution), dramatically reducing latency.
Limited programming model. MapReduce is a two-stage model (Map then Reduce). Many real-world pipelines need more stages (Map → Reduce → Map → Reduce) or branching. Flume/Dataflow uses a DAG (Directed Acyclic Graph) of transformations, not a fixed two-stage structure.
No join optimization. MapReduce handles joins poorly (they must be implemented manually as secondary sort or multi-stage jobs). Flume/Dataflow has native join operations with optimizer-calculated strategies.
No incremental processing. Every MapReduce job processes the entire input. Dataflow supports windowing and incremental processing, so repeating a computation on new data doesn’t re-process the entire dataset.
Resource inefficiency. The fixed slot model (fixed number of Map and Reduce slots) wastes resources. Dataflow uses dynamic resource allocation.

Apache Beam (open-source successor to Flume) unifies batch and streaming under a single programming model, which was the key innovation Google needed beyond MapReduce.