Real World Architecture · GFS

Key Question

How does a GFS client actually read and write data step by step?

Deep Dive

Let’s walk through a full write in GFS. The file is split into 64 MB chunks. Each chunk is replicated across three chunkservers (usually on different racks). One replica is designated the primary (holds the lease). The others are secondaries.

Write Flow

Client         Master         Primary (P)     Secondary (S1)  Secondary (S2)
  |              |               |                |               |
  |1. Request --->               |                |               |
  |   lease      |-- locate ---->|                |               |
  |              |               |                |               |
  |2. Primary ---|               |                |               |
  |   + 2ndaries |               |                |               |
  |              |               |                |               |
  |3. Push data ---------------------------------->|               |
  |   to buffer  |               |--------------->|               |
  |              |               |------------------------------->|
  |              |               |                |               |
  |4. Write cmd ---------------->|                |               |
  |              |               |                |               |
  |              |  5. Assign    |                |               |
  |              |  serial#      |                |               |
  |              |  6. Write --->|                |               |
  |              |  7. Forward -->|--------------->|               |
  |              |               |------------------------------->|
  |              |               |                |               |
  |              |  8. Reply <---|<--------------- |<--------------|
  |<-- 9. OK ---|               |                |               |

Here’s what happens step by step:

Step 1: Client contacts Master. The client asks the Master: “Who holds the lease for chunk C of file F?” If no lease exists, the Master grants one (typically to the chunkserver with the least load).

Step 2: Master responds. The Master returns the identity (primary + secondaries) to the client. The client caches this for future writes.

Step 3: Client pushes data to ALL replicas. The client sends the data to the nearest replica. Each replica forwards it to the next “closest” replica using a pipeline topology (not a tree, not all-to-all). Data flows along a chain: Client → Chunkserver A → Chunkserver B → Chunkserver C. Each replica stores the data in a buffer (not yet applied).

Step 4: Client sends write command to primary. After all data has been pushed, the client tells the primary: “apply the write at offset X.”

Step 5: Primary assigns a serial number. The primary serializes all concurrent writes from multiple clients by assigning a monotonically increasing serial number.

Step 6-7: Primary applies locally, forwards. The primary writes to its own chunk file, then forwards the serialized write to all secondaries.

Step 8-9: Secondaries reply, primary replies to client. Each secondary replies “done” to the primary. Once all secondaries have confirmed, the primary replies “ok” to the client.

Why the Data-Then-Control Pattern?

Data is pushed in Step 3 before the serialization in Step 5. This decouples data movement (slow, involves network transfer) from ordering (fast, involves only control messages). The data is already local to each replica by the time the write command arrives. This means the serialization bottleneck (the primary) is never blocked waiting for data to arrive.

If a write fails (e.g., one secondary is down), the primary reports the error to the client. The client may retry. The write will have been applied to a subset of replicas (inconsistent). GFS handles this at the application level via atomic record appends or by having the client simply retry until all replicas are consistent.

Read Flow

Reads are simpler: (1) Client asks Master for chunk locations. (2) Client reads from the nearest replica (by network distance). (3) If the read fails (checksum error), client tries the next replica.

Check Your Understanding

Why does the client push data to all replicas before sending the write command to the primary?
What happens if a secondary fails to acknowledge the write?
The Chain/Data pipeline topology: why does GFS use a chain (A→B→C) instead of client sending to all three simultaneously?

The “So What?”

Understanding GFS’s write pipeline — data push then control, chain replication within the data path, lease-based serialization — helps you understand why Hadoop and Spark are designed the way they are. GFS’s write protocol directly influenced HDFS’s pipeline write and block replication.

✏️ Exercises

GFS: Exercises

Data pipeline design. In GFS, the client pushes data to all replicas before sending the write command to the primary. Why does the design separate data transfer from write ordering? What problem does this solve?
Lease failure behavior. A chunkserver P holds the lease for chunk C at a 50 MB offset in file F. P crashes after serving 50 writes but before its lease expires (lease was granted 40 seconds ago, 60-second lease). Describe exactly what happens next, including:
- How does the master learn P is down?
- When can a new lease be granted?
- Are the 50 writes that P ordered lost?
GFS vs. Consensus. GFS uses replication for durability (3x chunk copies) but does NOT use a consensus protocol like Paxos or Raft for coordinating writes. Instead it uses a lease-based primary. If Google had used Paxos for each chunk write, what would be the costs? Why did they choose leases?
Shadow master promotion. You are the SRE on call. The primary GFS master’s disk failed (hardware fault). A shadow master lagging by 200 operations exists. What do you do? Walk through the steps and describe what data could be lost.

👁️ View Solutions

GFS: Solutions

1. Data pipeline design

Answer. The separation of data transfer from write ordering decouples the slowest part of the operation (moving gigabytes of data over the network) from the part that must be serialized (assigning sequence numbers to writes). If data transfer and ordering were coupled, the primary would be a bottleneck: it would have to receive the data before it could assign the order, blocking all other writes to the same chunk. By pushing data to all replicas first, the data is already local by the time the primary serializes the write. The primary’s only job is ordering, which is fast (microseconds). Additionally, the pipeline topology (chain forwarding) minimizes the client’s upload bandwidth: the client sends the data once to the nearest replica, and the chain carries it to the others.

2. Lease failure behavior

Answer. Here’s the sequence:

The master detects P is down when it misses P’s heartbeat (typically 3 successive missed heartbeats, ~15-20 seconds).
The master checks the lease for chunk C. Since 40 of 60 seconds have elapsed, ~20 seconds remain.
The master waits for the lease to expire naturally. It cannot grant a new lease while the old one is still active — P could come back and cause conflicts.
After ~20 seconds, the lease expires. The master grants a new lease to S1 (one of the secondaries).
The 50 writes that P ordered are applied to the secondaries (they were forwarded by P before the crash). Assuming at least one secondary has all 50 writes, the data is not lost. However, if the client did not receive an acknowledgment from P before the crash, the client does not know whether the writes completed. The client will retry, and the new primary (S1) must apply the retried writes. This could cause duplicate appends (GFS relies on record-level idempotency for atomic record appends, but for regular writes, duplicate writes may occur and must be handled by the application).
Any writes that were “in flight” (data had been pushed but the write command hadn’t reached P) will be retried by the client with the new primary.

3. GFS vs. Consensus

Answer. Using Paxos for each chunk write would have these costs:

Latency. Each Paxos write requires 2-3 RTTs (prepare-promise, propose-accept, possibly commit). GFS lease writes require 0 extra RTTs beyond the data pipeline. At Google’s scale (millions of writes/second), this would be devastating for latency.
Throughput. Paxos requires a leader and a quorum. Every write would need a majority of replicas to agree. GFS’s lease lets the primary act as a sole decider — no quorum needed for each write.
Complexity. Paxos is notoriously hard to implement correctly (Google learned this with Chubby). Leases are simple: grant, wait for timeout, re-grant.
Scalability. Each chunk would need its own Paxos group. With millions of chunks, the state machine overhead would be enormous.

Google chose leases because GFS’s workload is dominated by large sequential writes (data processing pipelines), not small random writes. The lease-based approach gives them the ordering they need at a fraction of the cost of consensus.

4. Shadow master promotion

Answer. Steps as the SRE:

Stop the shadow master. Prevent it from reading further to preserve the state at the lagged point.
Check the operation log. The primary master’s disk failed, but hopefully the operation log was on a replicated storage or was backed up to a different disk. If the op log is lost entirely, some metadata is lost forever — specifically the operations between the last checkpoint and the crash.
Determine the lag. The shadow master is 200 operations behind. Those 200 operations may include: new file creations, chunk lease grants, chunk location changes, and re-replication metadata.
Promote the shadow master. Start it in primary mode. It will serve the stale metadata (missing the last 200 operations). Any files created in those 200 operations will be invisible. Any chunk leases granted in those 200 ops will be missing, but chunkservers will hear from the new master.
Reconcile. The new master contacts all chunkservers and learns current chunk locations (chunkservers report what they have). This reconciles any missing location metadata.
Accept data loss. Files created in the last ~200ms (approximately) are lost. The operation log gap means those namespace entries are gone. Files that already existed are fine — their chunk data is still on chunkservers.
Root cause and prevent. The disk failure should be investigated. Future improvements: replicate the operation log to a secondary disk or use RAID on the master.