Real World Architecture · MapReduce

Key Question

How does MapReduce let you process terabytes of data without writing any distributed code?

Deep Dive

MapReduce is built on a simple insight borrowed from functional programming: many data processing tasks can be expressed as a Map (apply a function to each element) followed by a Reduce (aggregate results by key). The MapReduce runtime handles everything else — distribution, parallelism, fault tolerance, and network communication.

The Two Functions

You write exactly two functions:

Map(k1, v1) → list(k2, v2)

  Input: a key-value pair
  Output: zero or more intermediate key-value pairs
  Side effects: NONE (purity is critical)

Reduce(k2, list(v2)) → list(v2)

  Input: an intermediate key and all values for that key
  Output: zero or more final values
  Side effects: NONE

That’s it. You never write: spawn_thread(), send_message(), accept_connection(), or checkpoint_state(). The framework invisibly parallelizes your Map and Reduce across thousands of machines.

Word Count: The “Hello World” of MapReduce

Input file: "the quick brown fox jumps over the lazy dog the quick brown fox"

Split 1: "the quick brown fox jumps"
Split 2: "over the lazy dog the quick brown fox"

Map (applied to each split in parallel):

  Split 1 Map output:
    (the, 1), (quick, 1), (brown, 1), (fox, 1), (jumps, 1)

  Split 2 Map output:
    (over, 1), (the, 1), (lazy, 1), (dog, 1), (the, 1), (quick, 1), (brown, 1), (fox, 1)

Shuffle + Sort (group all values by key):

  brown:  [1, 1]
  dog:    [1]
  fox:    [1, 1]
  jumps:  [1]
  lazy:   [1]
  over:   [1]
  quick:  [1, 1]
  the:    [1, 1, 1]

Reduce (applied to each key in parallel):

  brown:  sum(1, 1)    → [2]
  dog:    sum(1)       → [1]
  fox:    sum(1, 1)    → [2]
  jumps:  sum(1)       → [1]
  lazy:   sum(1)       → [1]
  over:   sum(1)       → [1]
  quick:  sum(1, 1)    → [2]
  the:    sum(1, 1, 1) → [3]

Output: [(brown, 2), (dog, 1), (fox, 2), (jumps, 1), (lazy, 1), (over, 1), (quick, 2), (the, 3)]

The Map function emits (word, 1) for each word it sees. Reduce sums the 1s for each word. The result is a word count — without ever writing a single line of networking code.

What the Framework Provides

The MapReduce runtime gives you:

Input splitting. M splits of input (aligned with GFS chunk boundaries, typically 64 MB).
Task scheduling. M Map tasks and R Reduce tasks scheduled across the cluster.
Parallelism. Each Map task runs on the machine holding a replica of its input split (data locality).
Shuffle + Sort. Intermediate data is automatically partitioned (by hash of key) and transferred to reducers.
Fault tolerance. Failed tasks are automatically re-executed.
Output. R output files (one per reducer), typically written back to GFS.

The MapReduce Abstraction Is Not Universal

MapReduce works great for: batch processing, text analysis, log parsing, inverted indices, sorting, and graph processing (iterative). It works poorly for: real-time queries, incremental updates, complex multi-step workflows, and tasks requiring shared state.

The beauty is the abstraction cost: you pay a performance penalty (serialization, all-to-all shuffle, disk-based intermediate storage) in exchange for not having to write any distributed systems code.

Check Your Understanding

Write the Map and Reduce functions for a “URL frequency counter” — given input (doc_id, text), count how many times each URL appears.
Why must the Map function be “pure” (no side effects)? What goes wrong if Map writes to a file?
MapReduce processes M map tasks and R reduce tasks. How many output files does a MapReduce job produce?

The “So What?”

MapReduce abstracts away distribution — you write sequential code, the framework parallelizes it. This abstraction made it possible for thousands of Google engineers who knew nothing about distributed systems to process petabytes of data daily. It’s the reason the modern “big data” ecosystem exists. When you write a Spark map() and reduceByKey(), you’re using the MapReduce abstraction, even if the runtime is different.

✏️ Exercises

MapReduce: Exercises

Map output loss. A Map task completes successfully and stores its output on local disk. Then the worker machine loses power (crashes). The master detects the failure and marks the Map task as “idle” for re-execution. Why can’t the master just use the completed output? Where is the output stored?
Reduce task dependency. The Map phase has 100 tasks. The Reduce phase has 10 tasks. Can Reduce task 3 begin processing its data before Map task 57 has finished? Explain the constraint precisely.
Straggler math. You have 1000 Map tasks. Each task takes 10 seconds normally. There is a 1% chance per task that it runs on a “bad” machine and takes 100 seconds. What is the expected job completion time WITHOUT backup tasks? What happens WITH backup tasks (assume the backup has a 99% chance of finishing in 10 seconds on a non-bad machine)?
Google’s evolution. Google eventually moved away from MapReduce toward Flume (and later Apache Beam/Dataflow). What limitations of MapReduce drove this migration? What did Flume/Dataflow do differently?

👁️ View Solutions

MapReduce: Solutions

1. Map output loss

Answer. The Map task’s output is stored on the local disk of the worker machine. It is NOT stored in GFS. The reason: intermediate data is temporary (it will be consumed by Reduce tasks and then deleted), so writing it to GFS would be unnecessarily expensive (3x replication, network I/O). When the worker machine crashes, its local disk becomes inaccessible, so the output is lost. The master marks the task as “idle” and re-executes it on a different worker. The Map function is deterministic, so re-execution produces identical output.

2. Reduce task dependency

Answer. The precise constraint: for Reduce task r to produce its final output, it must have received partition r from ALL Map tasks (1 through 100). This is because each Map task may have emitted key-value pairs that hash to partition r. However, the Reducer does NOT need to wait for all Map tasks to finish before it starts fetching — it can fetch partition r from each Map task as soon as that Map task finishes. The Shuffle runs concurrently with the tail of the Map phase. But the Reducer cannot apply the Reduce function to produce final output until partition r is complete (i.e., all Map tasks have provided their share of partition r).

3. Straggler math

Answer.

Without backup tasks: The probability that a specific task is a straggler is 1% (0.01). With 1000 tasks, the number of stragglers follows a binomial distribution. The probability of at least one straggler is 1 - P(zero stragglers) = 1 - (0.99)^1000 ≈ 1 - 0.000043 ≈ 0.999957. With at least one straggler, the completion time is at least 100 seconds (the straggler time). So the expected completion time ≈ 100 seconds (technically slightly less if there’s a tiny chance of zero stragglers, but effectively 100s).

With backup tasks: When the master detects the remaining in-progress tasks (near the end of the job), it schedules backup copies. If the original task is slow because of a bad machine, the backup on a good machine finishes in 10 seconds. The backup finishes first, and the master reports completion at ~11 seconds (10s for the fast tasks + a small delay for the backup to be scheduled and complete). The expected completion time drops from ~100s to ~11s — a ~9x improvement. This matches the paper’s reported results.

4. Google’s evolution

Answer. Google moved away from MapReduce because of several limitations:

Batch-only. MapReduce processes data in fixed batches. It cannot handle streaming or real-time data. Flume introduced pipelines that process data incrementally.
Disk-based intermediate data. MapReduce writes Map outputs to disk before the Shuffle. Flume/Dataflow keeps data in memory across stages (pipelined execution), dramatically reducing latency.
Limited programming model. MapReduce is a two-stage model (Map then Reduce). Many real-world pipelines need more stages (Map → Reduce → Map → Reduce) or branching. Flume/Dataflow uses a DAG (Directed Acyclic Graph) of transformations, not a fixed two-stage structure.
No join optimization. MapReduce handles joins poorly (they must be implemented manually as secondary sort or multi-stage jobs). Flume/Dataflow has native join operations with optimizer-calculated strategies.
No incremental processing. Every MapReduce job processes the entire input. Dataflow supports windowing and incremental processing, so repeating a computation on new data doesn’t re-process the entire dataset.
Resource inefficiency. The fixed slot model (fixed number of Map and Reduce slots) wastes resources. Dataflow uses dynamic resource allocation.

Apache Beam (open-source successor to Flume) unifies batch and streaming under a single programming model, which was the key innovation Google needed beyond MapReduce.