Key Question
How does GFS maintain consistent ordering of mutations without a heavy distributed lock?
Deep Dive
Every mutation to a GFS chunk must be applied in the same order by every replica holding that chunk. Without consistent ordering, chunkservers would diverge and data would become corrupted. GFS solves this with leases — a time-bounded grant of authority from the Master to one chunkserver (the primary) for a specific chunk.
Lease Mechanics
Timeline for a 60-second chunk lease:
Master: |--LEASE GRANTED to P--|-----60 seconds------|--LEASE EXPIRED--|
Primary: |--can order mutations--| |
^ ^
Primary crash Master assigns
here new lease to S1
The Master grants a lease for chunk C to a primary chunkserver P. The lease has a 60-second timeout. During the lease period, P is the sole authority for mutation ordering on chunk C. Any client that wants to write to chunk C must go through P. P assigns a serial number to each mutation, and all secondaries apply mutations in that serial order.
If the lease expires and no one renews it, the Master grants a new lease — potentially to a different chunkserver.
Lease Renewal
The primary requests lease renewal from the Master periodically (when it’s actively serving writes). The Master typically grants the renewal. But if the Master wants to “revoke” the lease (e.g., for rebalancing), it lets the lease expire naturally rather than renewing — no explicit revocation message needed.
Primary Failure
If the primary crashes, here’s what happens:
1. Primary P holds lease for chunk C
2. P crashes
3. Client tries to write, gets no response from P
4. Client asks Master: "Who has the lease for C?"
5. Master checks: lease is still "active" (hasn't expired yet)
6. Master waits for the lease to expire (~20s remaining)
7. Lease expires. Master grants new lease to S1 (a secondary)
8. Client retries with new primary
The Master never preemptively revokes a lease during its term. It waits for expiration. This is a deliberate design choice: there’s no need for complex two-phase revocation protocols. Time-based expiration is simple and robust.
Why Not Just Use Paxos?
Each chunk mutation would need a Paxos round to agree on ordering. With thousands of concurrent mutations across millions of chunks, that’s an enormous overhead. Leases are far cheaper: one Master-granted lease per chunk per 60 seconds, vs. one Paxos round per mutation.
Comparison:
Paxos per write: Each write: [Propose] [Accept] [Commit] — 3 RTTs
× 10,000 writes/s = 30,000 RTTs/s of consensus
Lease per chunk: One lease grant (1 RTT) every 60 seconds
= 0.016 RTTs/s per chunk
All writes use the lease for ordering (local)
The lease gives the primary the authority to act as a temporary dictator for that chunk. As long as the lease hasn’t expired, all nodes trust the primary’s ordering. This is much cheaper than running a consensus protocol for every individual write.
Leases vs. Epochs
GFS leases are conceptually similar to Raft’s terms (leader elections create a term, the leader holds authority until the next election). Both use time-bounded authority. The key difference: Raft’s leader authority covers all log entries, while GFS leases are per-chunk. A GFS Master might grant a million separate leases across the cluster.
Check Your Understanding
-
Why does the Master wait for the lease to expire instead of actively revoking it when the primary is suspected dead?
-
A chunkserver holds leases for 10,000 chunks. How many messages per second does it need to exchange with the Master to maintain those leases?
-
What prevents a primary with an expired lease from continuing to order writes?
The “So What?”
Leases are a cheap alternative to consensus for ordering — they’re used in GFS, DynamoDB, and even Raft (leaders hold a lease on their term). When you need ordering in a distributed system, ask: “Can I use a lease instead of Paxos?” The answer is often yes, and it will be orders of magnitude cheaper.
✏️ Exercises
GFS: Exercises
-
Data pipeline design. In GFS, the client pushes data to all replicas before sending the write command to the primary. Why does the design separate data transfer from write ordering? What problem does this solve?
-
Lease failure behavior. A chunkserver P holds the lease for chunk C at a 50 MB offset in file F. P crashes after serving 50 writes but before its lease expires (lease was granted 40 seconds ago, 60-second lease). Describe exactly what happens next, including:
- How does the master learn P is down?
- When can a new lease be granted?
- Are the 50 writes that P ordered lost?
-
GFS vs. Consensus. GFS uses replication for durability (3x chunk copies) but does NOT use a consensus protocol like Paxos or Raft for coordinating writes. Instead it uses a lease-based primary. If Google had used Paxos for each chunk write, what would be the costs? Why did they choose leases?
-
Shadow master promotion. You are the SRE on call. The primary GFS master’s disk failed (hardware fault). A shadow master lagging by 200 operations exists. What do you do? Walk through the steps and describe what data could be lost.
👁️ View Solutions
GFS: Solutions
1. Data pipeline design
Answer. The separation of data transfer from write ordering decouples the slowest part of the operation (moving gigabytes of data over the network) from the part that must be serialized (assigning sequence numbers to writes). If data transfer and ordering were coupled, the primary would be a bottleneck: it would have to receive the data before it could assign the order, blocking all other writes to the same chunk. By pushing data to all replicas first, the data is already local by the time the primary serializes the write. The primary’s only job is ordering, which is fast (microseconds). Additionally, the pipeline topology (chain forwarding) minimizes the client’s upload bandwidth: the client sends the data once to the nearest replica, and the chain carries it to the others.
2. Lease failure behavior
Answer. Here’s the sequence:
- The master detects P is down when it misses P’s heartbeat (typically 3 successive missed heartbeats, ~15-20 seconds).
- The master checks the lease for chunk C. Since 40 of 60 seconds have elapsed, ~20 seconds remain.
- The master waits for the lease to expire naturally. It cannot grant a new lease while the old one is still active — P could come back and cause conflicts.
- After ~20 seconds, the lease expires. The master grants a new lease to S1 (one of the secondaries).
- The 50 writes that P ordered are applied to the secondaries (they were forwarded by P before the crash). Assuming at least one secondary has all 50 writes, the data is not lost. However, if the client did not receive an acknowledgment from P before the crash, the client does not know whether the writes completed. The client will retry, and the new primary (S1) must apply the retried writes. This could cause duplicate appends (GFS relies on record-level idempotency for atomic record appends, but for regular writes, duplicate writes may occur and must be handled by the application).
- Any writes that were “in flight” (data had been pushed but the write command hadn’t reached P) will be retried by the client with the new primary.
3. GFS vs. Consensus
Answer. Using Paxos for each chunk write would have these costs:
- Latency. Each Paxos write requires 2-3 RTTs (prepare-promise, propose-accept, possibly commit). GFS lease writes require 0 extra RTTs beyond the data pipeline. At Google’s scale (millions of writes/second), this would be devastating for latency.
- Throughput. Paxos requires a leader and a quorum. Every write would need a majority of replicas to agree. GFS’s lease lets the primary act as a sole decider — no quorum needed for each write.
- Complexity. Paxos is notoriously hard to implement correctly (Google learned this with Chubby). Leases are simple: grant, wait for timeout, re-grant.
- Scalability. Each chunk would need its own Paxos group. With millions of chunks, the state machine overhead would be enormous.
Google chose leases because GFS’s workload is dominated by large sequential writes (data processing pipelines), not small random writes. The lease-based approach gives them the ordering they need at a fraction of the cost of consensus.
4. Shadow master promotion
Answer. Steps as the SRE:
- Stop the shadow master. Prevent it from reading further to preserve the state at the lagged point.
- Check the operation log. The primary master’s disk failed, but hopefully the operation log was on a replicated storage or was backed up to a different disk. If the op log is lost entirely, some metadata is lost forever — specifically the operations between the last checkpoint and the crash.
- Determine the lag. The shadow master is 200 operations behind. Those 200 operations may include: new file creations, chunk lease grants, chunk location changes, and re-replication metadata.
- Promote the shadow master. Start it in primary mode. It will serve the stale metadata (missing the last 200 operations). Any files created in those 200 operations will be invisible. Any chunk leases granted in those 200 ops will be missing, but chunkservers will hear from the new master.
- Reconcile. The new master contacts all chunkservers and learns current chunk locations (chunkservers report what they have). This reconciles any missing location metadata.
- Accept data loss. Files created in the last ~200ms (approximately) are lost. The operation log gap means those namespace entries are gone. Files that already existed are fine — their chunk data is still on chunkservers.
- Root cause and prevent. The disk failure should be investigated. Future improvements: replicate the operation log to a secondary disk or use RAID on the master.