Key Question
What’s the “tax” Spanner pays for using TrueTime?
Deep Dive
Commit wait is the unavoidable latency that Spanner pays for its global consistency guarantee. After a transaction’s writes are prepared on all Paxos replicas, Spanner cannot return success immediately — it must wait until it is absolutely certain that no future transaction could start with an overlapping TrueTime interval.
The mechanics: once a transaction reaches the commit phase, Spanner picks a commit timestamp s = TT.now().latest + 1. Then it enters the commit wait window: it waits until TT.now().earliest > s. At that point, any clock in any datacenter reads a time strictly greater than s.
Here’s the timeline:
Real Time: |----[prepare]----[commit ts s]----[commit wait]----[reply to client]---->
^ ^ ^
| | |
prepare done s = TT.now().latest + 1 TT.now().earliest > s
(commit complete)
Uncertainty: <--- ε ----><--- ε ---->
|<--- commit wait --->|
The commit wait is approximately 2 × ε, where ε is the TrueTime uncertainty bound. In Google’s production, ε is typically 1–7ms. So commit wait adds roughly 2–14ms of extra latency on every write transaction.
Why 2 × ε specifically?
- Phase 1 (ε): From when the Paxos leader assigned the prepare timestamp to when it computes
TT.now().latest. The commit timestampsis already at the end of that uncertainty window. - Phase 2 (ε): The wait from
suntilTT.now().earliest > s. Since the current time is approximatelyTT.now().earliest + ε, and we needearliest > s, this takes roughly anotherε.
Take a concrete ε = 4ms:
1. Client sends write → Paxos prepares (0ms)
2. Prepare acknowledged → compute s = TT.now().latest + 1 (~4ms)
- TT.now() returned [T+1, T+5], so s = T+6
3. Commit wait: loop until TT.now().earliest > T+6
- At T+8, TT.now() might be [T+8, T+12]. earliest = T+8 > T+6 ✓
4. Total commit wait: from T+4 to T+8 = 4ms = ~1ε
5. Total write latency: ~8ms
For read-only transactions, Spanner avoids commit wait entirely (it can assign a read timestamp using TT.now().earliest and read at any replica up to that timestamp). For writes, the tax is unavoidable.
Google paper reports commit wait adds about 10ms for cross-continent transactions. In practice, many Spanner workloads are read-heavy, so the overall impact is manageable. The sync replication latency (Paxos) often dominates for truly global writes anyway.
Check Your Understanding
- If TrueTime’s ε = 7ms, what’s the minimum commit wait? The maximum?
- Could Spanner avoid commit wait for single-datacenter transactions where clocks are tightly synchronized?
- Why can’t Spanner just return success immediately after the prepare phase?
The “So What?”
Commit wait is Spanner’s central tradeoff: ~10ms of extra latency per write in exchange for global strong consistency. No other system offers this combination. Understanding commit wait tells you why Spanner excels for read-heavy, globally distributed workloads (where the tax is small) but struggles for write-heavy, latency-sensitive apps (where every write pays 2ε). It’s a deliberate engineering choice, not a flaw.
✏️ Exercises
Spanner: Exercises
Exercise 1: Commit Wait Math
TrueTime’s uncertainty ε is 7ms. A transaction’s prepare phase finishes at real time T. The Paxos leader calls TT.now() and gets the interval [T+2ms, T+9ms].
(a) What commit timestamp s does Spanner assign?
(b) At what real time does commit wait end (i.e., TT.now().earliest > s)?
(c) What was the total commit wait duration?
Exercise 2: Externally Consistent Reads
Can a Spanner read that does not involve a Paxos round (a follower read at snapshot timestamp) still be externally consistent? Explain why or why not, referencing TrueTime’s role.
Exercise 3: Read Scalability
Spanner writes go through a single Paxos leader per tablet group. This sounds like a bottleneck. How does Spanner achieve read scalability despite this apparent limitation? Name two mechanisms.
👁️ View Solutions
Spanner: Solutions
Exercise 1: Commit Wait Math
(a) The commit timestamp is TT.now().latest + 1 = (T + 9ms) + 1ms = T + 10ms.
(b) Commit wait ends when TT.now().earliest > T + 10ms. Since TT.now() always returns an interval of width ε (7ms), earliest > s when real time is at least s + 1ms = T + 11ms. At that point, the earliest possible clock reading is (T + 11ms) - 7ms = T + 4ms — wait, that’s not right.
Let’s think more carefully. earliest is the lower bound of TrueTime’s interval. At real time r, TrueTime returns [r - ε, r + ε] (using the best estimate). So earliest = r - ε. We need earliest > s: r - ε > T + 10 → r > T + 10 + ε = T + 17ms.
So commit wait ends at approximately T + 17ms.
(c) Commit wait started right after s was assigned at T + 9ms (when TT.now() returned [T+2, T+9]). It ends at T + 17ms. Total commit wait = 8ms (approx 1.14ε).
Note: the wait is roughly ε, not 2ε, because the commit timestamp s already incorporates the first ε (it uses latest). The second ε is the actual wait.
Exercise 2: Externally Consistent Reads
Yes, a follower read can still be externally consistent — if the read timestamp satisfies the external consistency condition.
The key: external consistency requires that if transaction T1 finishes before read R starts in real time, then R must see T1’s writes. This is guaranteed as long as R’s read timestamp t_read ≥ t_commit(T1).
Spanner assigns follower reads a timestamp t_read = TT.now().earliest. Since t_read is guaranteed to be ≤ the true time at the start of the read, and since any prior committed transaction has a commit timestamp ≤ the true time at its commit, the ordering holds.
However, a stale follower read (e.g., reading at a fixed past timestamp without consulting TrueTime) could violate external consistency. Externally consistent follower reads require the coordinator to set the read timestamp using TrueTime, even if the actual data is read from a follower.
Exercise 3: Read Scalability
Two mechanisms:
1. Follower reads (stale reads). Reads that tolerate small staleness (typically ≤ 10s) can be served by any Paxos follower, bypassing the leader entirely. Each follower replica independently maintains data up to its applied timestamp. Since most Spanner workloads are read-heavy, adding more replicas directly scales read throughput — no leader bottleneck.
2. Snapshot reads / time-bounded reads. Reads at a timestamp sufficiently in the past require no coordination. The replica simply returns the data at that timestamp from its local LSM storage. This is effectively free, since each replica already has the data.
These two mechanisms let Spanner serve read throughput proportional to the total number of replicas, not just the number of leaders. Writes remain bottlenecked on a single leader per Paxos group, but for read-heavy workloads (the common case), this architecture scales near-linearly.