Real World Architecture · MapReduce

Key Question

Why does one slow machine (a straggler) ruin the performance of the entire MapReduce job?

Deep Dive

Imagine you have 1000 Map tasks running on 1000 machines. 999 finish in 10 seconds. One machine takes 60 seconds. Your job’s completion time is 60 seconds — not 10 seconds. The job is as slow as the slowest task.

This is the straggler problem, and it gets worse at scale because the probability of at least one slow machine grows with the cluster size.

What Causes Stragglers?

Causes of stragglers in production:

┌────────────────────────────────────────────┐
│ Cause                    │ Likelihood       │
├──────────────────────────┼──────────────────┤
│ Disk thrashing           │ Common (shared   │
│ (other jobs on machine)  │  cluster)        │
├──────────────────────────┼──────────────────┤
│ CPU contention           │ Common (shared   │
│ (hyperthread, colocation)│  cluster)        │
├──────────────────────────┼──────────────────┤
│ Memory pressure (OOM)    │ Common           │
├──────────────────────────┼──────────────────┤
│ Network bandwidth        │ Fairly common    │
│ (other jobs shuffle-ing) │                  │
├──────────────────────────┼──────────────────┤
│ Bad disk sector          │ Rare but         │
│ (reads slow down)        │ devastating      │
├──────────────────────────┼──────────────────┤
│ Data skew (one split     │ Application-     │
│ has more data)           │ dependent        │
└──────────────────────────┴──────────────────┘

The key insight: in a large cluster, at least one machine is always in a bad state. You can guarantee that during any MapReduce job, some machine is experiencing disk contention, network congestion, or CPU throttling.

The Backup Task Solution

The MapReduce paper introduces backup tasks (also called speculative execution):

Timeline with NO backup task:

Task  1: ████████████████████▌                          (10s)
Task  2: ████████████████████                            (10s)
...
Task 99: ████████████████████                            (10s)
Task100: ████████████████████████████████████████████████ (60s) ★ straggler

Job complete at: 60 seconds

Timeline WITH backup task (threshold = 80% complete):

Task  1: ████████████████████▌                          (10s)  → done
Task  2: ████████████████████                            (10s)  → done
...
Task 99: ████████████████████                            (10s)  → done
Task100: ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░  (10s)  → 83% complete
                                                              ═ Master activates backup
Task100b:                                ████████████████████   (10s)  → backup finishes first
  (on a free machine)

Job complete at: ~11 seconds

Here’s how it works:

The master monitors task progress.
When a job is near completion (e.g., 80-90% of tasks finished), the master identifies in-progress tasks — tasks that are running but not yet finished.
For each in-progress task, the master schedules a backup copy on a different machine (one that’s currently idle).
Both copies run in parallel on different machines.
Whichever finishes first wins. The other copy is killed.
The master reports the job as complete when ALL tasks (original or backup) have finished.

Why 44% Improvement?

The MapReduce paper reports that backup tasks reduce job completion time by 44%. Let’s see why it’s so effective:

Without backup tasks: If 1% of tasks are stragglers and take 10× longer, the job slowdown is 10×. With 1000 tasks, the probability of at least one straggler approaches 1. The job’s completion time is the max of all task times, which is dominated by the straggler.

With backup tasks: The backup is scheduled on an idle machine. The backup almost always finishes faster than the original straggler (because the backup machine has no contention). The effective task time becomes the minimum of two independent runs. Since stragglers are caused by environmental noise (not deterministic slowness), the backup has a very high probability of being fast.

Straggler Mitigation in Practice

Modern systems use variations of backup tasks:

Hadoop MapReduce: Speculative execution (enabled by default, can be tuned).
Spark: Speculative execution for tasks (spark.speculation).
Google Cloud Dataflow: “Hot key” handling and dynamic work rebalancing.
Others: Some systems use “slow start” (delay scheduling of the last few tasks), or “hedged requests” (issue the same request to multiple servers).

The Tail at Scale

The straggler problem is a manifestation of a more general phenomenon: the tail latency problem. In a system with many parallel operations, the tail (99th percentile) latency of a single component determines the overall latency. As systems scale, tail latency gets worse, not better.

This is why high-scale systems invest in hedged requests (issue to multiple replicas, take the first response), backup tasks, and cross-datacenter replication — all to mitigate the tail.

Check Your Understanding

Why does the probability of at least one straggler approach 1 as the number of Map tasks increases?
If the original straggler finishes its task at the exact same moment as the backup, both produce valid output. How does the master handle this?
Backup tasks waste resources (two copies of work). Why is this an acceptable trade-off?

The “So What?”

Straggler mitigation is essential at scale — the slowest machine determines your job completion time. The backup task technique is a concrete, pragmatic solution to the “tail at scale” problem. When you configure speculative execution in Spark or Hadoop, you’re using the exact same technique described in the 2004 MapReduce paper. Modern systems still struggle with stragglers — Spark’s speculative execution is less effective than MapReduce’s because of pipelined execution.

✏️ Exercises

MapReduce: Exercises

Map output loss. A Map task completes successfully and stores its output on local disk. Then the worker machine loses power (crashes). The master detects the failure and marks the Map task as “idle” for re-execution. Why can’t the master just use the completed output? Where is the output stored?
Reduce task dependency. The Map phase has 100 tasks. The Reduce phase has 10 tasks. Can Reduce task 3 begin processing its data before Map task 57 has finished? Explain the constraint precisely.
Straggler math. You have 1000 Map tasks. Each task takes 10 seconds normally. There is a 1% chance per task that it runs on a “bad” machine and takes 100 seconds. What is the expected job completion time WITHOUT backup tasks? What happens WITH backup tasks (assume the backup has a 99% chance of finishing in 10 seconds on a non-bad machine)?
Google’s evolution. Google eventually moved away from MapReduce toward Flume (and later Apache Beam/Dataflow). What limitations of MapReduce drove this migration? What did Flume/Dataflow do differently?

👁️ View Solutions

MapReduce: Solutions

1. Map output loss

Answer. The Map task’s output is stored on the local disk of the worker machine. It is NOT stored in GFS. The reason: intermediate data is temporary (it will be consumed by Reduce tasks and then deleted), so writing it to GFS would be unnecessarily expensive (3x replication, network I/O). When the worker machine crashes, its local disk becomes inaccessible, so the output is lost. The master marks the task as “idle” and re-executes it on a different worker. The Map function is deterministic, so re-execution produces identical output.

2. Reduce task dependency

Answer. The precise constraint: for Reduce task r to produce its final output, it must have received partition r from ALL Map tasks (1 through 100). This is because each Map task may have emitted key-value pairs that hash to partition r. However, the Reducer does NOT need to wait for all Map tasks to finish before it starts fetching — it can fetch partition r from each Map task as soon as that Map task finishes. The Shuffle runs concurrently with the tail of the Map phase. But the Reducer cannot apply the Reduce function to produce final output until partition r is complete (i.e., all Map tasks have provided their share of partition r).

3. Straggler math

Answer.

Without backup tasks: The probability that a specific task is a straggler is 1% (0.01). With 1000 tasks, the number of stragglers follows a binomial distribution. The probability of at least one straggler is 1 - P(zero stragglers) = 1 - (0.99)^1000 ≈ 1 - 0.000043 ≈ 0.999957. With at least one straggler, the completion time is at least 100 seconds (the straggler time). So the expected completion time ≈ 100 seconds (technically slightly less if there’s a tiny chance of zero stragglers, but effectively 100s).

With backup tasks: When the master detects the remaining in-progress tasks (near the end of the job), it schedules backup copies. If the original task is slow because of a bad machine, the backup on a good machine finishes in 10 seconds. The backup finishes first, and the master reports completion at ~11 seconds (10s for the fast tasks + a small delay for the backup to be scheduled and complete). The expected completion time drops from ~100s to ~11s — a ~9x improvement. This matches the paper’s reported results.

4. Google’s evolution

Answer. Google moved away from MapReduce because of several limitations:

Batch-only. MapReduce processes data in fixed batches. It cannot handle streaming or real-time data. Flume introduced pipelines that process data incrementally.
Disk-based intermediate data. MapReduce writes Map outputs to disk before the Shuffle. Flume/Dataflow keeps data in memory across stages (pipelined execution), dramatically reducing latency.
Limited programming model. MapReduce is a two-stage model (Map then Reduce). Many real-world pipelines need more stages (Map → Reduce → Map → Reduce) or branching. Flume/Dataflow uses a DAG (Directed Acyclic Graph) of transformations, not a fixed two-stage structure.
No join optimization. MapReduce handles joins poorly (they must be implemented manually as secondary sort or multi-stage jobs). Flume/Dataflow has native join operations with optimizer-calculated strategies.
No incremental processing. Every MapReduce job processes the entire input. Dataflow supports windowing and incremental processing, so repeating a computation on new data doesn’t re-process the entire dataset.
Resource inefficiency. The fixed slot model (fixed number of Map and Reduce slots) wastes resources. Dataflow uses dynamic resource allocation.

Apache Beam (open-source successor to Flume) unifies batch and streaming under a single programming model, which was the key innovation Google needed beyond MapReduce.