Key Question
Why does one slow machine (a straggler) ruin the performance of the entire MapReduce job?
Deep Dive
Imagine you have 1000 Map tasks running on 1000 machines. 999 finish in 10 seconds. One machine takes 60 seconds. Your jobβs completion time is 60 seconds β not 10 seconds. The job is as slow as the slowest task.
This is the straggler problem, and it gets worse at scale because the probability of at least one slow machine grows with the cluster size.
What Causes Stragglers?
Causes of stragglers in production:
ββββββββββββββββββββββββββββββββββββββββββββββ
β Cause β Likelihood β
ββββββββββββββββββββββββββββΌβββββββββββββββββββ€
β Disk thrashing β Common (shared β
β (other jobs on machine) β cluster) β
ββββββββββββββββββββββββββββΌβββββββββββββββββββ€
β CPU contention β Common (shared β
β (hyperthread, colocation)β cluster) β
ββββββββββββββββββββββββββββΌβββββββββββββββββββ€
β Memory pressure (OOM) β Common β
ββββββββββββββββββββββββββββΌβββββββββββββββββββ€
β Network bandwidth β Fairly common β
β (other jobs shuffle-ing) β β
ββββββββββββββββββββββββββββΌβββββββββββββββββββ€
β Bad disk sector β Rare but β
β (reads slow down) β devastating β
ββββββββββββββββββββββββββββΌβββββββββββββββββββ€
β Data skew (one split β Application- β
β has more data) β dependent β
ββββββββββββββββββββββββββββ΄βββββββββββββββββββ
The key insight: in a large cluster, at least one machine is always in a bad state. You can guarantee that during any MapReduce job, some machine is experiencing disk contention, network congestion, or CPU throttling.
The Backup Task Solution
The MapReduce paper introduces backup tasks (also called speculative execution):
Timeline with NO backup task:
Task 1: βββββββββββββββββββββ (10s)
Task 2: ββββββββββββββββββββ (10s)
...
Task 99: ββββββββββββββββββββ (10s)
Task100: ββββββββββββββββββββββββββββββββββββββββββββββββ (60s) β
straggler
Job complete at: 60 seconds
Timeline WITH backup task (threshold = 80% complete):
Task 1: βββββββββββββββββββββ (10s) β done
Task 2: ββββββββββββββββββββ (10s) β done
...
Task 99: ββββββββββββββββββββ (10s) β done
Task100: βββββββββββββββββββββββββββββββββββββββββββββ (10s) β 83% complete
β Master activates backup
Task100b: ββββββββββββββββββββ (10s) β backup finishes first
(on a free machine)
Job complete at: ~11 seconds
Hereβs how it works:
- The master monitors task progress.
- When a job is near completion (e.g., 80-90% of tasks finished), the master identifies in-progress tasks β tasks that are running but not yet finished.
- For each in-progress task, the master schedules a backup copy on a different machine (one thatβs currently idle).
- Both copies run in parallel on different machines.
- Whichever finishes first wins. The other copy is killed.
- The master reports the job as complete when ALL tasks (original or backup) have finished.
Why 44% Improvement?
The MapReduce paper reports that backup tasks reduce job completion time by 44%. Letβs see why itβs so effective:
Without backup tasks: If 1% of tasks are stragglers and take 10Γ longer, the job slowdown is 10Γ. With 1000 tasks, the probability of at least one straggler approaches 1. The jobβs completion time is the max of all task times, which is dominated by the straggler.
With backup tasks: The backup is scheduled on an idle machine. The backup almost always finishes faster than the original straggler (because the backup machine has no contention). The effective task time becomes the minimum of two independent runs. Since stragglers are caused by environmental noise (not deterministic slowness), the backup has a very high probability of being fast.
Straggler Mitigation in Practice
Modern systems use variations of backup tasks:
- Hadoop MapReduce: Speculative execution (enabled by default, can be tuned).
- Spark: Speculative execution for tasks (spark.speculation).
- Google Cloud Dataflow: βHot keyβ handling and dynamic work rebalancing.
- Others: Some systems use βslow startβ (delay scheduling of the last few tasks), or βhedged requestsβ (issue the same request to multiple servers).
The Tail at Scale
The straggler problem is a manifestation of a more general phenomenon: the tail latency problem. In a system with many parallel operations, the tail (99th percentile) latency of a single component determines the overall latency. As systems scale, tail latency gets worse, not better.
This is why high-scale systems invest in hedged requests (issue to multiple replicas, take the first response), backup tasks, and cross-datacenter replication β all to mitigate the tail.
Check Your Understanding
-
Why does the probability of at least one straggler approach 1 as the number of Map tasks increases?
-
If the original straggler finishes its task at the exact same moment as the backup, both produce valid output. How does the master handle this?
-
Backup tasks waste resources (two copies of work). Why is this an acceptable trade-off?
The βSo What?β
Straggler mitigation is essential at scale β the slowest machine determines your job completion time. The backup task technique is a concrete, pragmatic solution to the βtail at scaleβ problem. When you configure speculative execution in Spark or Hadoop, youβre using the exact same technique described in the 2004 MapReduce paper. Modern systems still struggle with stragglers β Sparkβs speculative execution is less effective than MapReduceβs because of pipelined execution.
βοΈ Exercises
MapReduce: Exercises
-
Map output loss. A Map task completes successfully and stores its output on local disk. Then the worker machine loses power (crashes). The master detects the failure and marks the Map task as βidleβ for re-execution. Why canβt the master just use the completed output? Where is the output stored?
-
Reduce task dependency. The Map phase has 100 tasks. The Reduce phase has 10 tasks. Can Reduce task 3 begin processing its data before Map task 57 has finished? Explain the constraint precisely.
-
Straggler math. You have 1000 Map tasks. Each task takes 10 seconds normally. There is a 1% chance per task that it runs on a βbadβ machine and takes 100 seconds. What is the expected job completion time WITHOUT backup tasks? What happens WITH backup tasks (assume the backup has a 99% chance of finishing in 10 seconds on a non-bad machine)?
-
Googleβs evolution. Google eventually moved away from MapReduce toward Flume (and later Apache Beam/Dataflow). What limitations of MapReduce drove this migration? What did Flume/Dataflow do differently?
ποΈ View Solutions
MapReduce: Solutions
1. Map output loss
Answer. The Map taskβs output is stored on the local disk of the worker machine. It is NOT stored in GFS. The reason: intermediate data is temporary (it will be consumed by Reduce tasks and then deleted), so writing it to GFS would be unnecessarily expensive (3x replication, network I/O). When the worker machine crashes, its local disk becomes inaccessible, so the output is lost. The master marks the task as βidleβ and re-executes it on a different worker. The Map function is deterministic, so re-execution produces identical output.
2. Reduce task dependency
Answer. The precise constraint: for Reduce task r to produce its final output, it must have received partition r from ALL Map tasks (1 through 100). This is because each Map task may have emitted key-value pairs that hash to partition r. However, the Reducer does NOT need to wait for all Map tasks to finish before it starts fetching β it can fetch partition r from each Map task as soon as that Map task finishes. The Shuffle runs concurrently with the tail of the Map phase. But the Reducer cannot apply the Reduce function to produce final output until partition r is complete (i.e., all Map tasks have provided their share of partition r).
3. Straggler math
Answer.
Without backup tasks: The probability that a specific task is a straggler is 1% (0.01). With 1000 tasks, the number of stragglers follows a binomial distribution. The probability of at least one straggler is 1 - P(zero stragglers) = 1 - (0.99)^1000 β 1 - 0.000043 β 0.999957. With at least one straggler, the completion time is at least 100 seconds (the straggler time). So the expected completion time β 100 seconds (technically slightly less if thereβs a tiny chance of zero stragglers, but effectively 100s).
With backup tasks: When the master detects the remaining in-progress tasks (near the end of the job), it schedules backup copies. If the original task is slow because of a bad machine, the backup on a good machine finishes in 10 seconds. The backup finishes first, and the master reports completion at ~11 seconds (10s for the fast tasks + a small delay for the backup to be scheduled and complete). The expected completion time drops from ~100s to ~11s β a ~9x improvement. This matches the paperβs reported results.
4. Googleβs evolution
Answer. Google moved away from MapReduce because of several limitations:
-
Batch-only. MapReduce processes data in fixed batches. It cannot handle streaming or real-time data. Flume introduced pipelines that process data incrementally.
-
Disk-based intermediate data. MapReduce writes Map outputs to disk before the Shuffle. Flume/Dataflow keeps data in memory across stages (pipelined execution), dramatically reducing latency.
-
Limited programming model. MapReduce is a two-stage model (Map then Reduce). Many real-world pipelines need more stages (Map β Reduce β Map β Reduce) or branching. Flume/Dataflow uses a DAG (Directed Acyclic Graph) of transformations, not a fixed two-stage structure.
-
No join optimization. MapReduce handles joins poorly (they must be implemented manually as secondary sort or multi-stage jobs). Flume/Dataflow has native join operations with optimizer-calculated strategies.
-
No incremental processing. Every MapReduce job processes the entire input. Dataflow supports windowing and incremental processing, so repeating a computation on new data doesnβt re-process the entire dataset.
-
Resource inefficiency. The fixed slot model (fixed number of Map and Reduce slots) wastes resources. Dataflow uses dynamic resource allocation.
Apache Beam (open-source successor to Flume) unifies batch and streaming under a single programming model, which was the key innovation Google needed beyond MapReduce.