Key Question
How does Spanner combine sharding with Paxos replication for global scale?
Deep Dive
Spanner’s architecture is a two-layer stack: sharding (for scale) on top of Paxos replication (for fault tolerance). The key insight is that these layers share a common unit of management called a directory.
A directory is a contiguous set of key-value pairs with a common prefix — think of it as a “row family” or “partition.” Directories are small (~100s of MB) and are Spanner’s fundamental unit of both data movement and replication.
Here’s the full hierarchy from top to bottom:
Table
└── Directory (sibling keys grouped, ~100s MB)
└── Tablet (one or more directories, stored as B-tree-like LSM)
└── Paxos Group (replicates the tablet across datacenters)
└── Server (physical machine hosting one or more groups)
Each Paxos group owns one tablet and replicates it across 2–5 datacenters. Writes go through the Paxos leader (elected via Paxos). The leader assigns timestamps and coordinates cross-group transactions.
Directories are movable. If a server is overloaded, Spanner moves directories between servers transparently. If a datacenter fails, directories in that DC’s replica groups are re-replicated elsewhere. If access patterns change, hot directories are split or merged.
Before rebalancing: After rebalancing:
Server A Server A
├── Paxos Group 1 ├── Paxos Group 1
│ └── Dir [a-c] │ └── Dir [a-b]
├── Paxos Group 2 ├── Paxos Group 2
│ └── Dir [d-f] │ └── Dir [e-f]
│ └── Paxos Group 3 (new)
Server B └── Dir [c-d] ← moved here
├── Paxos Group 3
│ └── Dir [g-i]
Directories also enable data-locality grouping. Related data placed in the same directory ends up in the same Paxos group. This means transactions touching only that directory are single-Paxos-group operations — no 2PC overhead. The parent-child relationships in a schema (e.g., User and their Orders) are interleaved into the same directory by design.
Reads exploit this hierarchy for scalability:
- Strong reads: go to the Paxos leader (latest state, consistent).
- Stale reads: served by any replica, no Paxos round needed — just read at a timestamp
tthat the replica has already applied up to. This reads scale linearly with the number of replicas. - Snapshot reads: read at a timestamp in the past, no Paxos needed.
This architecture inspired CockroachDB (uses a similar range→Raft→node model with HLC instead of TrueTime) and YugabyteDB (uses DocDB tablets with Raft). Neither can provide the same external consistency guarantees without hardware clock assistance, but both borrow Spanner’s hierarchy.
Check Your Understanding
- What advantage does the “directory” abstraction provide over a simpler table→shard mapping?
- A read that goes to a follower replica — can it be externally consistent?
- Why does Spanner group related data into the same directory via interleaving?
The “So What?”
The directory hierarchy is Spanner’s scalability secret: it unifies sharding and replication under one abstraction, enables zero-downtime rebalancing, and exploits Paxos to serve reads from any replica. Later databases copied this layering but couldn’t replicate TrueTime — proving the hierarchy is necessary but not sufficient for global external consistency.
✏️ Exercises
Spanner: Exercises
Exercise 1: Commit Wait Math
TrueTime’s uncertainty ε is 7ms. A transaction’s prepare phase finishes at real time T. The Paxos leader calls TT.now() and gets the interval [T+2ms, T+9ms].
(a) What commit timestamp s does Spanner assign?
(b) At what real time does commit wait end (i.e., TT.now().earliest > s)?
(c) What was the total commit wait duration?
Exercise 2: Externally Consistent Reads
Can a Spanner read that does not involve a Paxos round (a follower read at snapshot timestamp) still be externally consistent? Explain why or why not, referencing TrueTime’s role.
Exercise 3: Read Scalability
Spanner writes go through a single Paxos leader per tablet group. This sounds like a bottleneck. How does Spanner achieve read scalability despite this apparent limitation? Name two mechanisms.
👁️ View Solutions
Spanner: Solutions
Exercise 1: Commit Wait Math
(a) The commit timestamp is TT.now().latest + 1 = (T + 9ms) + 1ms = T + 10ms.
(b) Commit wait ends when TT.now().earliest > T + 10ms. Since TT.now() always returns an interval of width ε (7ms), earliest > s when real time is at least s + 1ms = T + 11ms. At that point, the earliest possible clock reading is (T + 11ms) - 7ms = T + 4ms — wait, that’s not right.
Let’s think more carefully. earliest is the lower bound of TrueTime’s interval. At real time r, TrueTime returns [r - ε, r + ε] (using the best estimate). So earliest = r - ε. We need earliest > s: r - ε > T + 10 → r > T + 10 + ε = T + 17ms.
So commit wait ends at approximately T + 17ms.
(c) Commit wait started right after s was assigned at T + 9ms (when TT.now() returned [T+2, T+9]). It ends at T + 17ms. Total commit wait = 8ms (approx 1.14ε).
Note: the wait is roughly ε, not 2ε, because the commit timestamp s already incorporates the first ε (it uses latest). The second ε is the actual wait.
Exercise 2: Externally Consistent Reads
Yes, a follower read can still be externally consistent — if the read timestamp satisfies the external consistency condition.
The key: external consistency requires that if transaction T1 finishes before read R starts in real time, then R must see T1’s writes. This is guaranteed as long as R’s read timestamp t_read ≥ t_commit(T1).
Spanner assigns follower reads a timestamp t_read = TT.now().earliest. Since t_read is guaranteed to be ≤ the true time at the start of the read, and since any prior committed transaction has a commit timestamp ≤ the true time at its commit, the ordering holds.
However, a stale follower read (e.g., reading at a fixed past timestamp without consulting TrueTime) could violate external consistency. Externally consistent follower reads require the coordinator to set the read timestamp using TrueTime, even if the actual data is read from a follower.
Exercise 3: Read Scalability
Two mechanisms:
1. Follower reads (stale reads). Reads that tolerate small staleness (typically ≤ 10s) can be served by any Paxos follower, bypassing the leader entirely. Each follower replica independently maintains data up to its applied timestamp. Since most Spanner workloads are read-heavy, adding more replicas directly scales read throughput — no leader bottleneck.
2. Snapshot reads / time-bounded reads. Reads at a timestamp sufficiently in the past require no coordination. The replica simply returns the data at that timestamp from its local LSM storage. This is effectively free, since each replica already has the data.
These two mechanisms let Spanner serve read throughput proportional to the total number of replicas, not just the number of leaders. Writes remain bottlenecked on a single leader per Paxos group, but for read-heavy workloads (the common case), this architecture scales near-linearly.