Real World Architecture · GFS

Key Question

How does GFS handle petabytes of data across thousands of machines with a SINGLE master that’s a bottleneck?

Deep Dive

The Google File System has three components: one Master, many Chunkservers, and many Clients. The Master owns all metadata — the file namespace, file-to-chunk-id mappings, and chunk-id-to-location mappings. Chunkservers store actual file data as 64 MB chunks on local Linux filesystems. Clients are applications (e.g., MapReduce workers) that read and write files through the GFS API.

The critical architectural insight: data never flows through the Master. The Master handles metadata operations (lookups, leases, re-replication decisions) but never touches file payloads. This separation of the metadata path from the data path is what makes a single master viable at all.

Here’s the read path:

Client                    Master                  Chunkserver
  |                         |                         |
  |-- "Where is byte X?" -->|                         |
  |                         |-- chunk handle + locs   |
  |<--- chunk handle, -------|                         |
  |     replica IPs         |                         |
  |                                                   |
  |--- Read(chunk handle, byte range) ---------------->|
  |<-- chunk data ------------------------------------|
  |                         |                         |

Step by step: (1) Client sends filename + byte offset to Master. (2) Master computes which chunk (file offset / 64MB), and returns the chunk handle and the IP addresses of all replicas (typically 3). (3) Client caches this metadata locally (with a timeout). (4) Client reads directly from the nearest chunkserver (e.g., the one on the same rack). The Master is never in the data path.

Why does this work at Google’s scale? The Master only handles ~1 metadata operation per read/write, while the data transfer can be gigabytes. Metadata operations are small (a few hundred bytes). The Master can handle tens of thousands of metadata ops/second. In practice, a single GFS master served thousands of clients across hundreds of terabytes.

The 64 MB chunk size is deliberate. It reduces the namespace size (fewer chunks = fewer metadata entries), lets clients cache chunk locations for repeated reads, and enables large sequential reads to stream efficiently from a single chunkserver.

This separation of concerns — Master orchestrates, Chunkservers execute — is a pattern you’ll see again in HDFS (NameNode + DataNode), Ceph (MON + OSD), and many distributed databases.

Check Your Understanding

Why does the data path bypass the master? What problem does this solve?
A client wants to read 128 MB of a file. How many master requests does it make, and how many chunkserver requests?
What happens to the cache entry for a chunk location when a chunkserver fails?

The “So What?”

GFS’s separation of metadata path from data path is a foundational pattern used by HDFS, Ceph, and many distributed storage systems. When you benchmark HDFS and wonder why NameNode needs so much RAM, you’re seeing the legacy of this design choice.

✏️ Exercises

GFS: Exercises

Data pipeline design. In GFS, the client pushes data to all replicas before sending the write command to the primary. Why does the design separate data transfer from write ordering? What problem does this solve?
Lease failure behavior. A chunkserver P holds the lease for chunk C at a 50 MB offset in file F. P crashes after serving 50 writes but before its lease expires (lease was granted 40 seconds ago, 60-second lease). Describe exactly what happens next, including:
- How does the master learn P is down?
- When can a new lease be granted?
- Are the 50 writes that P ordered lost?
GFS vs. Consensus. GFS uses replication for durability (3x chunk copies) but does NOT use a consensus protocol like Paxos or Raft for coordinating writes. Instead it uses a lease-based primary. If Google had used Paxos for each chunk write, what would be the costs? Why did they choose leases?
Shadow master promotion. You are the SRE on call. The primary GFS master’s disk failed (hardware fault). A shadow master lagging by 200 operations exists. What do you do? Walk through the steps and describe what data could be lost.

👁️ View Solutions

GFS: Solutions

1. Data pipeline design

Answer. The separation of data transfer from write ordering decouples the slowest part of the operation (moving gigabytes of data over the network) from the part that must be serialized (assigning sequence numbers to writes). If data transfer and ordering were coupled, the primary would be a bottleneck: it would have to receive the data before it could assign the order, blocking all other writes to the same chunk. By pushing data to all replicas first, the data is already local by the time the primary serializes the write. The primary’s only job is ordering, which is fast (microseconds). Additionally, the pipeline topology (chain forwarding) minimizes the client’s upload bandwidth: the client sends the data once to the nearest replica, and the chain carries it to the others.

2. Lease failure behavior

Answer. Here’s the sequence:

The master detects P is down when it misses P’s heartbeat (typically 3 successive missed heartbeats, ~15-20 seconds).
The master checks the lease for chunk C. Since 40 of 60 seconds have elapsed, ~20 seconds remain.
The master waits for the lease to expire naturally. It cannot grant a new lease while the old one is still active — P could come back and cause conflicts.
After ~20 seconds, the lease expires. The master grants a new lease to S1 (one of the secondaries).
The 50 writes that P ordered are applied to the secondaries (they were forwarded by P before the crash). Assuming at least one secondary has all 50 writes, the data is not lost. However, if the client did not receive an acknowledgment from P before the crash, the client does not know whether the writes completed. The client will retry, and the new primary (S1) must apply the retried writes. This could cause duplicate appends (GFS relies on record-level idempotency for atomic record appends, but for regular writes, duplicate writes may occur and must be handled by the application).
Any writes that were “in flight” (data had been pushed but the write command hadn’t reached P) will be retried by the client with the new primary.

3. GFS vs. Consensus

Answer. Using Paxos for each chunk write would have these costs:

Latency. Each Paxos write requires 2-3 RTTs (prepare-promise, propose-accept, possibly commit). GFS lease writes require 0 extra RTTs beyond the data pipeline. At Google’s scale (millions of writes/second), this would be devastating for latency.
Throughput. Paxos requires a leader and a quorum. Every write would need a majority of replicas to agree. GFS’s lease lets the primary act as a sole decider — no quorum needed for each write.
Complexity. Paxos is notoriously hard to implement correctly (Google learned this with Chubby). Leases are simple: grant, wait for timeout, re-grant.
Scalability. Each chunk would need its own Paxos group. With millions of chunks, the state machine overhead would be enormous.

Google chose leases because GFS’s workload is dominated by large sequential writes (data processing pipelines), not small random writes. The lease-based approach gives them the ordering they need at a fraction of the cost of consensus.

4. Shadow master promotion

Answer. Steps as the SRE:

Stop the shadow master. Prevent it from reading further to preserve the state at the lagged point.
Check the operation log. The primary master’s disk failed, but hopefully the operation log was on a replicated storage or was backed up to a different disk. If the op log is lost entirely, some metadata is lost forever — specifically the operations between the last checkpoint and the crash.
Determine the lag. The shadow master is 200 operations behind. Those 200 operations may include: new file creations, chunk lease grants, chunk location changes, and re-replication metadata.
Promote the shadow master. Start it in primary mode. It will serve the stale metadata (missing the last 200 operations). Any files created in those 200 operations will be invisible. Any chunk leases granted in those 200 ops will be missing, but chunkservers will hear from the new master.
Reconcile. The new master contacts all chunkservers and learns current chunk locations (chunkservers report what they have). This reconciles any missing location metadata.
Accept data loss. Files created in the last ~200ms (approximately) are lost. The operation log gap means those namespace entries are gone. Files that already existed are fine — their chunk data is still on chunkservers.
Root cause and prevent. The disk failure should be investigated. Future improvements: replicate the operation log to a secondary disk or use RAID on the master.