Real World Architecture · GFS

Key Question

With a single master, how does GFS handle reliability?

Deep Dive

GFS has a single master. If the master goes down, the entire filesystem becomes read-only at best, and unavailable at worst. This is a single point of failure by design. The paper acknowledges this trade-off: a simple master is much easier to implement than a distributed metadata service, and in practice the master is rarely the bottleneck or the first thing to fail.

Chunk Reliability (Not Master Reliability)

The GFS team prioritized chunk reliability over master reliability because chunk data is far larger and more likely to be lost:

File: /user/data/hugefile.bin
  Chunk 0 ──── Chunkserver A (rack 1)
           ├── Chunkserver B (rack 2)
           └── Chunkserver C (rack 3)

  Chunk 1 ──── Chunkserver D (rack 1)
           ├── Chunkserver E (rack 2)
           └── Chunkserver F (rack 3)

Each chunk is replicated 3x on different racks. This means: (1) If a rack loses power, one replica survives. (2) If a chunkserver fails, the master re-replicates its chunks to another server. (3) Checksums on each chunkserver detect bit rot. (4) No single chunk is ever lost as long as one replica survives.

Chunkserver Failure + Re-replication

When a chunkserver goes down (e.g., disk failure):

1. Master detects missing heartbeat from chunkserver CS-X
2. Master learns: CS-X held replicas for 2,000 chunks
3. Master checks replication levels:
     - Chunk A: 3→2 replicas (under-replicated)
     - Chunk B: 3→2 replicas
4. Master picks new targets (based on disk usage, rack diversity)
5. Master instructs an existing replica → replicate chunk A to CS-Y
6. Re-replication happens in background, prioritized:
     - Chunks with 1 replica (critical) first
     - Chunks with 2 replicas next
7. Master updates metadata: chunk A is now on CS-Y

Re-replication is throttled to avoid overwhelming the network.

Shadow Master: A Read-Only Hot Standby

The shadow master is a secondary process that mirrors the master’s state:

                    Master (primary)
                   /                \
            Operation Log     Checkpoint
                  |                |
            Shadow Master    Shadow Master
            (reads op log)   (reads op log)
                  |                |
              Read-only        Read-only
              queries          queries

The shadow master: (1) Reads the operation log (the append-only log of all metadata changes). (2) Applies mutations in order. (3) Serves read-only metadata queries, reducing load on the primary master. (4) Does NOT participate in mutations — only the primary master grants leases and re-replicates.

Master Failure

If the primary master crashes:

1. Master process dies (e.g., OOM, hardware fault)
2. GFS becomes read-only for metadata ops
3. Existing chunk leases continue until expiry
4. An operator detects the failure (via monitoring)
5. Operator starts the master binary with the operation log + checkpoint
6. Master replays the op log to reconstruct in-memory state
7. Master is back online. Clients start sending requests.
8. Stale chunk locations are cleaned up lazily

If the master’s disk is gone, an operator can promote a shadow master. But there’s a recentcy window — the shadow master’s state might lag by a few operations (tens to hundreds of milliseconds). Some metadata might be lost.

GFS’s Reliability Trade-offs

Component	Replication Strategy	Availability
Chunk data	3x cross-rack	99.9999%+
Master metadata	Op log + checkpoint (single machine)	Minutes of downtime/year
Client access	Retry with cached metadata	Continuous (if master up)

Google accepted this trade-off because: (1) Master failures were rare. (2) Monitoring detected failures quickly. (3) Manual recovery was acceptable for their use case.

Later systems (HDFS NameNode HA, Ceph) addressed this with hot standbys and automatic failover, but GFS’s simplicity-first approach was the right call at the time.

Check Your Understanding

If a chunkserver goes down, why does the master not immediately re-replicate every chunk it held?
A shadow master is serving reads. The primary master crashes. Can the shadow master immediately take over as primary? Why or why not?
GFS uses checksums on chunkservers but not checksums on the master’s metadata. Why is this asymmetry acceptable?

The “So What?”

GFS trades perfect availability for simplicity. The single master is a known SPOF, but the system design minimizes the impact by: (1) making the master’s state small enough to recover quickly, (2) re-replicating chunks automatically, and (3) accepting brief downtime for metadata operations. Modern systems like HDFS with NameNode HA removed this trade-off, but only after proving that hot standby for metadata could be built reliably.

✏️ Exercises

GFS: Exercises

Data pipeline design. In GFS, the client pushes data to all replicas before sending the write command to the primary. Why does the design separate data transfer from write ordering? What problem does this solve?
Lease failure behavior. A chunkserver P holds the lease for chunk C at a 50 MB offset in file F. P crashes after serving 50 writes but before its lease expires (lease was granted 40 seconds ago, 60-second lease). Describe exactly what happens next, including:
- How does the master learn P is down?
- When can a new lease be granted?
- Are the 50 writes that P ordered lost?
GFS vs. Consensus. GFS uses replication for durability (3x chunk copies) but does NOT use a consensus protocol like Paxos or Raft for coordinating writes. Instead it uses a lease-based primary. If Google had used Paxos for each chunk write, what would be the costs? Why did they choose leases?
Shadow master promotion. You are the SRE on call. The primary GFS master’s disk failed (hardware fault). A shadow master lagging by 200 operations exists. What do you do? Walk through the steps and describe what data could be lost.

👁️ View Solutions

GFS: Solutions

1. Data pipeline design

Answer. The separation of data transfer from write ordering decouples the slowest part of the operation (moving gigabytes of data over the network) from the part that must be serialized (assigning sequence numbers to writes). If data transfer and ordering were coupled, the primary would be a bottleneck: it would have to receive the data before it could assign the order, blocking all other writes to the same chunk. By pushing data to all replicas first, the data is already local by the time the primary serializes the write. The primary’s only job is ordering, which is fast (microseconds). Additionally, the pipeline topology (chain forwarding) minimizes the client’s upload bandwidth: the client sends the data once to the nearest replica, and the chain carries it to the others.

2. Lease failure behavior

Answer. Here’s the sequence:

The master detects P is down when it misses P’s heartbeat (typically 3 successive missed heartbeats, ~15-20 seconds).
The master checks the lease for chunk C. Since 40 of 60 seconds have elapsed, ~20 seconds remain.
The master waits for the lease to expire naturally. It cannot grant a new lease while the old one is still active — P could come back and cause conflicts.
After ~20 seconds, the lease expires. The master grants a new lease to S1 (one of the secondaries).
The 50 writes that P ordered are applied to the secondaries (they were forwarded by P before the crash). Assuming at least one secondary has all 50 writes, the data is not lost. However, if the client did not receive an acknowledgment from P before the crash, the client does not know whether the writes completed. The client will retry, and the new primary (S1) must apply the retried writes. This could cause duplicate appends (GFS relies on record-level idempotency for atomic record appends, but for regular writes, duplicate writes may occur and must be handled by the application).
Any writes that were “in flight” (data had been pushed but the write command hadn’t reached P) will be retried by the client with the new primary.

3. GFS vs. Consensus

Answer. Using Paxos for each chunk write would have these costs:

Latency. Each Paxos write requires 2-3 RTTs (prepare-promise, propose-accept, possibly commit). GFS lease writes require 0 extra RTTs beyond the data pipeline. At Google’s scale (millions of writes/second), this would be devastating for latency.
Throughput. Paxos requires a leader and a quorum. Every write would need a majority of replicas to agree. GFS’s lease lets the primary act as a sole decider — no quorum needed for each write.
Complexity. Paxos is notoriously hard to implement correctly (Google learned this with Chubby). Leases are simple: grant, wait for timeout, re-grant.
Scalability. Each chunk would need its own Paxos group. With millions of chunks, the state machine overhead would be enormous.

Google chose leases because GFS’s workload is dominated by large sequential writes (data processing pipelines), not small random writes. The lease-based approach gives them the ordering they need at a fraction of the cost of consensus.

4. Shadow master promotion

Answer. Steps as the SRE:

Stop the shadow master. Prevent it from reading further to preserve the state at the lagged point.
Check the operation log. The primary master’s disk failed, but hopefully the operation log was on a replicated storage or was backed up to a different disk. If the op log is lost entirely, some metadata is lost forever — specifically the operations between the last checkpoint and the crash.
Determine the lag. The shadow master is 200 operations behind. Those 200 operations may include: new file creations, chunk lease grants, chunk location changes, and re-replication metadata.
Promote the shadow master. Start it in primary mode. It will serve the stale metadata (missing the last 200 operations). Any files created in those 200 operations will be invisible. Any chunk leases granted in those 200 ops will be missing, but chunkservers will hear from the new master.
Reconcile. The new master contacts all chunkservers and learns current chunk locations (chunkservers report what they have). This reconciles any missing location metadata.
Accept data loss. Files created in the last ~200ms (approximately) are lost. The operation log gap means those namespace entries are gone. Files that already existed are fine — their chunk data is still on chunkservers.
Root cause and prevent. The disk failure should be investigated. Future improvements: replicate the operation log to a secondary disk or use RAID on the master.