Key Question
How does TrueTime enable external consistency (linearizability across global datacenters)?
Deep Dive
External consistency is Spanner’s superpower: if transaction A commits before transaction B starts in real time, then A’s commit timestamp is smaller than B’s and A’s effects are visible to B. This is the strongest consistency guarantee a distributed database can offer: it is what you would expect from a single-machine database, but Spanner delivers it across datacenters on different continents.
The formal name is strict serializability: linearizability + transactions. Every transaction appears to execute at a single point in time (atomic), and that point respects real-time order. Before Spanner, no geo-replicated database had achieved this at scale.
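Spanner achieves this by using TrueTime timestamps as commit timestamps, with two rules. First, a transaction T commits at a timestamp s, where s > TT.now().latest, evaluated when the transaction reaches the commit phase. Second (commit wait), the leader holds the commit, neither acknowledging it nor making its writes visible, until TT.now().earliest > s, that is, until s is guaranteed to be in the past. The first rule places s above the true time at which it was chosen; the second ensures that T is never reported as finished before true time has passed s. Together they make the timestamp order mirror real-time order.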
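The rule can be sketched in a few lines of Python. This is a toy model, not Spanner’s implementation: SimTrueTime, half_width_ms, skew_ms, and the manually advanced true_ms clock are hypothetical names chosen for illustration, and only the TT.now() interval and the TT.after() check mirror the API discussed in this section. The sketch assumes the commit is acknowledged only after a commit wait, i.e. once TT.after(s) holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: int  # ms
    latest: int    # ms


class SimTrueTime:
    """Toy TrueTime. True time is an integer (ms) that the caller advances."""

    def __init__(self, half_width_ms: int, skew_ms: int = 0, true_ms: int = 0):
        # Assumption: the local clock is off by skew_ms, never more than the bound.
        assert abs(skew_ms) <= half_width_ms
        self.half_width_ms = half_width_ms
        self.skew_ms = skew_ms
        self.true_ms = true_ms

    def now(self) -> TTInterval:
        local = self.true_ms + self.skew_ms
        return TTInterval(local - self.half_width_ms, local + self.half_width_ms)

    def after(self, t: int) -> bool:
        """True once t has definitely passed."""
        return self.now().earliest > t

    def sleep(self, ms: int) -> None:
        self.true_ms += ms


def commit(tt: SimTrueTime):
    """Pick a commit timestamp, then commit-wait. Returns (s, ms waited)."""
    s = tt.now().latest + 1      # strictly above the latest possible "now"
    waited = 0
    while not tt.after(s):       # commit wait: hold the ack until s is in the past
        tt.sleep(1)
        waited += 1
    return s, waited


tt = SimTrueTime(half_width_ms=4, skew_ms=-2, true_ms=1000)
s, waited = commit(tt)
print(s, waited, tt.true_ms)
```

When commit returns, the simulated true time is already past s, which is the property the rest of the argument relies on.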
Let’s walk through an example:
Client A (Tokyo)            Spanner                 Client B (London)
      |                        |                          |
      |--- commit request ---->|                          |
      |                        |                          |
      |            TT.now() = [09:59:59.998,              |
      |                        10:00:00.007]              |
      |                        |                          |
      |            commit ts = 10:00:00.008               |
      |            (using TT.now().latest + 1 ms)         |
      |                        |                          |
      |            commit wait until                      |
      |            TT.now().earliest > 10:00:00.008       |
      |                        |                          |
      |<--- committed at ts ---|                          |
      |                        |                          |
      |                        |<------- start tx --------|
      |                        |                          |
      |            TT.now() = [10:00:00.009,              |
      |                        10:00:00.015]              |
      |                        |                          |
      |            safe read at ts >= 10:00:00.008        |
      |            sees Client A's write ✓                |
      |                        |                          |
Client A’s TrueTime interval is [09:59:59.998, 10:00:00.007], so the commit timestamp is 10:00:00.008 (latest + 1 ms). Spanner acknowledges the commit only after commit wait, once TT.now().earliest has passed 10:00:00.008. When Client B starts its transaction afterwards in London, its TT.now() interval, [10:00:00.009, 10:00:00.015], lies entirely past A’s commit timestamp. The ordering holds: B’s timestamp is chosen at or above its own TT.now().latest, so A’s timestamp < B’s timestamp, and B’s read sees A’s write.
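The proof is a short chain of inequalities. Suppose T1 finishes (its commit is acknowledged) before T2 starts in real time. Let s1 and s2 be their commit timestamps, t1_end the real time at which T1’s commit is acknowledged, and t2_start the real time at which T2 begins. (1) Commit wait: T1 is acknowledged only once TT.now().earliest > s1, and earliest never exceeds real time, so s1 < t1_end. (2) By assumption, t1_end < t2_start. (3) T2 picks s2 > TT.now().latest_2 at some real time no earlier than t2_start, and latest is never below real time, so t2_start < s2. Chaining the three gives s1 < t1_end < t2_start < s2, so s1 < s2: timestamps preserve real-time order. Choosing s above latest is not enough on its own; without commit wait, step (1) fails, and a fast clock on T1’s server could produce s1 ≥ s2.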
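The ordering claim can also be checked by brute force in a toy model. The sketch below is illustrative only: EPS (taken here as the half-width of the interval), the integer millisecond clock, and the helper names are assumptions, not Spanner internals. It enumerates every pair of clock skews within the bound and compares a commit path that waits until TT.now().earliest > s against one that acknowledges immediately.

```python
from itertools import product

EPS = 4  # assumed half-width of the TrueTime interval, in ms


def tt_now(true_ms, skew):
    """Interval reported by a server whose clock is off by `skew` (|skew| <= EPS)."""
    local = true_ms + skew
    return local - EPS, local + EPS


def commit_with_wait(start_true, skew):
    """Return (commit timestamp, true time at which the commit is acknowledged)."""
    s = tt_now(start_true, skew)[1] + 1
    t = start_true
    while tt_now(t, skew)[0] <= s:   # commit wait: until earliest > s
        t += 1
    return s, t


def commit_without_wait(start_true, skew):
    """Same timestamp rule, but acknowledged immediately (for contrast)."""
    return tt_now(start_true, skew)[1] + 1, start_true


def violations(first_commit):
    """Count cases where T2 starts after T1 is acknowledged yet s1 >= s2."""
    bad = 0
    skews = range(-EPS, EPS + 1)
    for skew1, skew2, gap in product(skews, skews, range(1, 4)):
        s1, done1 = first_commit(1000, skew1)
        s2 = tt_now(done1 + gap, skew2)[1] + 1   # T2 picks its timestamp at start
        if s1 >= s2:
            bad += 1
    return bad


with_wait = violations(commit_with_wait)
without_wait = violations(commit_without_wait)
print(with_wait, without_wait)
```

In this model the waiting variant never inverts the order, while the non-waiting variant does whenever T1’s clock runs sufficiently ahead of T2’s.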
Check Your Understanding
- If transaction T1 starts at TT.after() and T2 starts after T1 commits, can T1’s timestamp ever be larger than T2’s?
- What happens to external consistency if the TrueTime error bound ε is violated (e.g., a clock jumps more than ε)?
- Why does Spanner use TT.now().latest + 1 instead of TT.now().earliest for commit timestamps?
The “So What?”
External consistency is the reason enterprise users trust Spanner with financial data across global deployments. Before Spanner, you had to choose between strong consistency (single region) and global scale (eventual consistency). TrueTime eliminates that tradeoff — you get both. This was a breakthrough that reshaped how the industry thinks about distributed databases.
Spanner: Exercises
Exercise 1: Commit Wait Math
TrueTime’s uncertainty ε, taken here as the full width of the interval returned by TT.now(), is 7ms. A transaction’s prepare phase finishes at real time T. The Paxos leader then calls TT.now() and gets the interval [T+2ms, T+9ms].
(a) What commit timestamp s does Spanner assign?
(b) At what real time does commit wait end (i.e., TT.now().earliest > s)?
(c) What was the total commit wait duration?
Exercise 2: Externally Consistent Reads
Can a Spanner read that does not involve a Paxos round (a follower read at snapshot timestamp) still be externally consistent? Explain why or why not, referencing TrueTime’s role.
Exercise 3: Read Scalability
Spanner writes go through a single Paxos leader per tablet group. This sounds like a bottleneck. How does Spanner achieve read scalability despite this apparent limitation? Name two mechanisms.
Spanner: Solutions
Exercise 1: Commit Wait Math
(a) The commit timestamp is TT.now().latest + 1 = (T + 9ms) + 1ms = T + 10ms.
(b) Commit wait ends as soon as TT.now().earliest > s = T + 10ms. The interval is ε = 7ms wide, and the true time of the TT.now() call lies somewhere inside [T+2ms, T+9ms]. Assuming the clock’s offset and the interval width stay fixed during the wait, earliest advances in step with real time, so it must climb from T + 2ms to just past T + 10ms. In the worst case the call happened at true time T + 9ms (the clock is running a full ε behind, so earliest = r - ε at real time r), and we need r - ε > T + 10ms, i.e. r > T + 17ms.
So commit wait ends at approximately T + 17ms in the worst case, and earlier if the call actually happened closer to T + 2ms.
(c) In that worst case the wait starts at T + 9ms, when s is assigned, and ends at T + 17ms, so the total commit wait is 8ms (about 1.14ε). The duration does not depend on where in the interval the true time fell: earliest always has to advance by s - earliest = (T + 10ms) - (T + 2ms) = 8ms.
Note: the wait is roughly one interval width because s sits just above latest, and earliest, which trails latest by the interval width, has to climb past it. With ε defined as the full interval width, as in this exercise, that is about ε. The Spanner paper defines ε as half the interval width, and in that notation the same wait is about 2ε.
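The arithmetic in this solution can be reproduced with a small helper. This is a sketch under stated assumptions: times are offsets from T in milliseconds, the interval width and clock offset stay fixed during the wait, and commit_wait with its result keys are hypothetical names.

```python
def commit_wait(earliest_ms, latest_ms):
    """Commit-wait arithmetic for one TT.now() reading [earliest_ms, latest_ms]."""
    s = latest_ms + 1            # commit timestamp: latest + 1 ms
    # earliest must climb from earliest_ms to just past s, so the wait is
    # (just over) s - earliest_ms, wherever true time sat inside the interval.
    wait = s - earliest_ms
    return {
        "s": s,
        "wait_ms": wait,
        # If the call really happened at true time == earliest_ms:
        "end_if_called_at_earliest": earliest_ms + wait,
        # If the call really happened at true time == latest_ms (worst case):
        "end_if_called_at_latest": latest_ms + wait,
    }


result = commit_wait(earliest_ms=2, latest_ms=9)   # the interval [T+2ms, T+9ms]
print(result)
```

The worst-case end time matches the T + 17ms figure, and the 8ms duration is the same in both cases.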
Exercise 2: Externally Consistent Reads
Yes, a follower read can still be externally consistent — if the read timestamp satisfies the external consistency condition.
The key: external consistency requires that if transaction T1 finishes before read R starts in real time, then R must see T1’s writes. This is guaranteed as long as R’s read timestamp t_read ≥ t_commit(T1).
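Spanner assigns such a read the timestamp t_read = TT.now().latest. Because latest is never below the true time at the start of the read, and because commit wait guarantees that any transaction already acknowledged has a commit timestamp below the true time of its acknowledgement, t_read ≥ t_commit(T1) for every T1 that finished before R started. The follower must also be caught up: it can answer only once it has applied every write with a timestamp ≤ t_read, and otherwise it waits.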
However, a stale follower read (e.g., reading at a fixed past timestamp without consulting TrueTime) could violate external consistency. Externally consistent follower reads require the coordinator to set the read timestamp using TrueTime, even if the actual data is read from a follower.
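Which end of the TrueTime interval makes a safe read timestamp can be sanity-checked by enumeration. The sketch below is a toy model, not Spanner code: EPS (assumed to be the half-width of the interval), the integer millisecond clock, and the function names are hypothetical. A writer commits with commit wait, a reader on a differently skewed server starts afterwards, and the check counts how often the chosen read timestamp falls below the writer’s commit timestamp.

```python
from itertools import product

EPS = 4  # assumed half-width of the TrueTime interval, in ms


def tt_now(true_ms, skew):
    """(earliest, latest) reported by a server whose clock is off by `skew`."""
    local = true_ms + skew
    return local - EPS, local + EPS


def missed_reads(pick):
    """Count reads that start after an acknowledged commit but miss its write.

    `pick` maps a reader's (earliest, latest) interval to a read timestamp.
    """
    missed = 0
    skews = range(-EPS, EPS + 1)
    for skew_w, skew_r, gap in product(skews, skews, range(1, 4)):
        s1 = tt_now(1000, skew_w)[1] + 1          # writer's commit timestamp
        done = 1000
        while tt_now(done, skew_w)[0] <= s1:      # writer's commit wait
            done += 1
        t_read = pick(tt_now(done + gap, skew_r)) # reader starts after the ack
        if t_read < s1:                           # snapshot is older than the write
            missed += 1
    return missed


missed_with_latest = missed_reads(lambda interval: interval[1])
missed_with_earliest = missed_reads(lambda interval: interval[0])
print(missed_with_latest, missed_with_earliest)
```

In this model a read timestamp taken from latest never misses an acknowledged write, whereas one taken from earliest can when the reader’s clock trails the writer’s.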
Exercise 3: Read Scalability
Two mechanisms:
1. Follower reads (stale reads). Reads that tolerate small staleness (typically ≤ 10s) can be served by any Paxos follower, bypassing the leader entirely. Each follower replica independently maintains data up to its applied timestamp. Since most Spanner workloads are read-heavy, adding more replicas directly scales read throughput — no leader bottleneck.
2. Snapshot reads / time-bounded reads. Reads at a timestamp sufficiently in the past require no coordination. The replica simply returns the data at that timestamp from its local LSM storage. This is effectively free, since each replica already has the data.
These two mechanisms let Spanner serve read throughput proportional to the total number of replicas, not just the number of leaders. Writes remain bottlenecked on a single leader per Paxos group, but for read-heavy workloads (the common case), this architecture scales near-linearly.
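The replica-side half of both mechanisms can be sketched as a multi-version store that answers a read locally whenever it has applied everything up to the requested timestamp. This is a minimal illustration with hypothetical names (Replica, applied_ts, apply, read); it raises where a real replica would more likely wait or redirect, and it says nothing about how Spanner actually lays out its storage.

```python
class Replica:
    """Toy follower: keeps every version of each key, tagged by commit timestamp."""

    def __init__(self):
        self.versions = {}    # key -> list of (commit_ts, value), oldest first
        self.applied_ts = 0   # every write with commit_ts <= applied_ts is present

    def apply(self, commit_ts, key, value):
        # Assumption: replicated writes arrive in commit-timestamp order.
        assert commit_ts > self.applied_ts
        self.versions.setdefault(key, []).append((commit_ts, value))
        self.applied_ts = commit_ts

    def read(self, key, t_read):
        """Return the newest value with commit_ts <= t_read, without the leader."""
        if t_read > self.applied_ts:
            # Not caught up to t_read yet, so a local answer could miss writes.
            raise RuntimeError("replica has not applied writes up to t_read")
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= t_read:
                return value
        return None  # key did not exist at t_read


replica = Replica()
replica.apply(10, "balance", 100)
replica.apply(20, "balance", 70)
print(replica.read("balance", 15), replica.read("balance", 20))
```

Because each replica can evaluate this check on its own, adding replicas adds read capacity without involving the leader.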