Distributed & Decentralized Systems Curriculum
Real World Architecture · Spanner

Key Question

How does Spanner know the “real” time with bounded uncertainty using GPS and atomic clocks?

Deep Dive

Every distributed database struggles with time. Wall clocks drift, and synchronizing them over a network suffers from variable latency. The Network Time Protocol (NTP) can typically keep a clock within about 1-100ms of UTC, but that is a statistical expectation, not a guarantee. A global database that promises “externally consistent” transactions needs something stronger.

Spanner’s answer: TrueTime, a time service backed by GPS receivers and atomic clocks.

The TrueTime API

TrueTime's API centers on a single call:

TT.now() → [earliest, latest]

This is NOT a single timestamp. It’s an interval: “the true UTC time is somewhere between earliest and latest.”

Example:
  TT.now() returns [10:00:00.000, 10:00:00.007]
  
  This means: "I am confident that the current UTC time
  is between 10:00:00.000 and 10:00:00.007."
  
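  The uncertainty ε = (latest - earliest) = 7ms.

Note: this lesson uses ε for the full width of the interval. The Spanner paper instead defines ε as half the width, so it would describe this same interval as ε = 3.5ms.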

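The key insight: TrueTime doesn't try to tell you the precise time. It tells you an interval guaranteed to contain the real time, plus tools for acting on that interval. The Spanner paper defines two: TT.after(t) is true only if t has definitely passed (t < TT.now().earliest), and TT.before(t) is true only if t has definitely not arrived (t > TT.now().latest).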
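To make the API concrete, here is a minimal Python sketch of a simulated TrueTime, including the paper's TT.after(t) and TT.before(t) helpers. It is not Google's implementation: the `SimulatedTrueTime` class, the `TTInterval` type, the fixed-width interval, and placing the local clock reading at the interval's midpoint are all assumptions for illustration. Following this lesson's convention, `epsilon_ms` is the full interval width.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TTInterval:
    """An interval [earliest, latest], in milliseconds, assumed to contain the true time."""
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Hypothetical stand-in for TrueTime: a local clock plus a fixed uncertainty.

    `epsilon_ms` is the full interval width (this lesson's convention), and the
    local clock reading is assumed to sit in the middle of the interval.
    """

    def __init__(self, epsilon_ms: float,
                 clock_ms: Callable[[], float] = lambda: time.time() * 1000):
        self.epsilon_ms = epsilon_ms
        self.clock_ms = clock_ms  # injectable so examples and tests are deterministic

    def now(self) -> TTInterval:
        t = self.clock_ms()
        half = self.epsilon_ms / 2
        return TTInterval(earliest=t - half, latest=t + half)

    def after(self, t: float) -> bool:
        """True only if t has definitely passed."""
        return t < self.now().earliest

    def before(self, t: float) -> bool:
        """True only if t has definitely not arrived yet."""
        return t > self.now().latest


# Usage: freeze the local clock at 100,000 ms with ε = 7 ms.
tt = SimulatedTrueTime(epsilon_ms=7.0, clock_ms=lambda: 100_000.0)
print(tt.now())              # TTInterval(earliest=99996.5, latest=100003.5)
print(tt.after(99_990.0))    # True: 99,990 ms is definitely in the past
print(tt.after(100_000.0))   # False: cannot be sure 100,000 ms has passed
```

Injecting the clock keeps the sketch testable; a real deployment would derive the width from the synchronization state rather than a constant.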

How TrueTime Works

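Each Spanner datacenter runs a set of time masters, and every Spanner machine runs a timeslave daemon that synchronizes its local clock against them. Most masters have GPS receivers; the rest, called Armageddon masters, have atomic clocks:

      GPS satellites
             │ [GPS signal]
    ┌────────▼─────────┐    ┌──────────────────┐
    │ GPS time master  │    │ Armageddon master│
    │ (GPS receiver)   │    │ (atomic clock)   │
    └────────┬─────────┘    └────────┬─────────┘
             └───────────┬───────────┘
                         │ [each daemon polls several masters]
          ┌──────────────┼──────────────┐
          │              │              │
     ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
     │Spanner  │    │Spanner  │    │Spanner  │
     │Server 1 │    │Server 2 │    │Server 3 │
     │timeslave│    │timeslave│    │timeslave│
     │ daemon  │    │ daemon  │    │ daemon  │
     └─────────┘    └─────────┘    └─────────┘
  • GPS masters take time from GPS satellites, which provides sub-millisecond accuracy to UTC. GPS can still fail through antenna or receiver faults, local radio interference, spoofing, or outages of the GPS system itself.
  • Armageddon masters use atomic clocks, which need no external signal and fail in ways uncorrelated with GPS, but drift slowly due to frequency error. Ordinary Spanner servers do not have atomic clocks.
  • Each master cross-checks the rate at which its time reference advances against its own local clock, and evicts itself if the two diverge substantially.
  • Each timeslave daemon polls several masters (GPS and Armageddon, in its own and in other datacenters), applies a variant of Marzullo's algorithm to detect and reject “liars” whose reports disagree with the rest, and synchronizes the local clock to the remaining masters.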
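The "reject the liars" step can be illustrated with the classic form of Marzullo's algorithm: given intervals reported by several sources, find the smallest interval consistent with the largest number of them. This Python sketch is the textbook algorithm, not Spanner's variant (whose details are not spelled out here); the helper `split_liars` and the sample readings are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]


def marzullo(intervals: List[Interval]) -> Tuple[int, Interval]:
    """Return (count, (lo, hi)): the smallest interval agreed on by the most sources."""
    edges: List[Tuple[float, int]] = []
    for lo, hi in intervals:
        edges.append((lo, -1))  # an interval opens here
        edges.append((hi, +1))  # an interval closes here
    # Sort by position; at ties, openings (-1) come before closings (+1),
    # so intervals that merely touch still count as overlapping.
    edges.sort()

    best, count = 0, 0
    best_lo = best_hi = 0.0
    for i, (offset, kind) in enumerate(edges):
        count -= kind
        if count > best:
            best = count
            best_lo = offset
            best_hi = edges[i + 1][0]  # an opening is always followed by another edge
    return best, (best_lo, best_hi)


def split_liars(readings: Dict[str, Interval]) -> Tuple[Interval, List[str]]:
    """Hypothetical helper: find the agreed interval and the sources that disagree with it."""
    _, (lo, hi) = marzullo(list(readings.values()))
    liars = [name for name, (a, b) in readings.items() if not (a <= lo and hi <= b)]
    return (lo, hi), liars


# Usage: four masters report intervals (ms); "master-d" is far off.
readings = {
    "master-a": (10.0, 12.0),
    "master-b": (11.0, 13.0),
    "master-c": (10.5, 11.5),
    "master-d": (20.0, 21.0),
}
agreed, liars = split_liars(readings)
print(agreed, liars)  # (11.0, 11.5) ['master-d']
```

Three of the four sources agree on [11.0, 11.5], so the fourth is treated as a liar and ignored when the local clock is synchronized.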

Why the Interval?

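The interval is crucial for correctness. TrueTime's guarantee is that the interval returned by TT.now() contains the absolute time at which TT.now() was invoked. That makes ordering decisions safe. For any two events e1 and e2:

  TT.now() at e1: [e1_earliest, e1_latest]
  TT.now() at e2: [e2_earliest, e2_latest]

  If e1_latest < e2_earliest, then e1 definitely happened before e2.
  If the two intervals overlap, TrueTime cannot tell which happened first.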

This gives Spanner a reliable “happens-before” relationship from physical time. The clocks never have to agree exactly; Spanner only needs a trustworthy bound on how far each one can be off.
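The ordering rule can be written as a three-way comparison. In this Python sketch, `definitely_before` is a hypothetical helper name and times are in milliseconds; it returns None when the intervals overlap, because TrueTime then cannot order the events.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


def definitely_before(e1: TTInterval, e2: TTInterval) -> Optional[bool]:
    """Order two events by the TrueTime intervals sampled at each.

    Returns True if e1 definitely happened first, False if e2 definitely did,
    and None if the intervals overlap and the order is unknown.
    """
    if e1.latest < e2.earliest:
        return True
    if e2.latest < e1.earliest:
        return False
    return None


# Usage (times in ms):
a = TTInterval(0.0, 7.0)
b = TTInterval(8.0, 15.0)
c = TTInterval(5.0, 12.0)
print(definitely_before(a, b))  # True
print(definitely_before(b, a))  # False
print(definitely_before(a, c))  # None
```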

Typical Uncertainty

In practice, TrueTime’s uncertainty ε varies:

Condition                       Typical ε    Notes
GPS-locked time master          1-2ms        Best case, clear sky
Atomic clock only (GPS lost)    4-10ms       Grows with atomic-clock drift until GPS returns
After long GPS blackout         10-100ms+    Uncertainty keeps growing; commit wait lengthens and Spanner slows down
Cross-datacenter                2-7ms        Network + clock uncertainty combined

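The interval grows and shrinks dynamically. Each machine's timeslave daemon polls the time masters periodically (every 30 seconds in the Spanner paper). Between polls it widens ε by assuming a conservative worst-case drift for the local clock (200μs per second in the paper), and each successful poll shrinks ε again, producing a sawtooth. If GPS is lost, the atomic-clock (Armageddon) masters' own advertised uncertainty grows with their worst-case drift, so ε creeps upward until GPS returns.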
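The sawtooth can be modeled as a base uncertainty plus worst-case drift accumulated since the last successful poll. This Python sketch uses the 30-second poll interval and 200μs/s drift rate reported in the Spanner paper, plus a 1ms base for the communication delay the paper cites; treating that base as constant, assuming every poll succeeds, and the name `epsilon_ms` are assumptions. Note that it models the paper's ε (half the interval width), not the full-width ε used elsewhere in this lesson.

```python
POLL_INTERVAL_S = 30.0   # timeslave poll interval reported in the Spanner paper
DRIFT_MS_PER_S = 0.2     # 200 µs/s worst-case drift applied between polls (paper)
BASE_MS = 1.0            # assumption: constant communication-delay component


def epsilon_ms(seconds_since_start: float) -> float:
    """Hypothetical sawtooth model of the paper's ε, assuming every poll succeeds."""
    since_last_poll = seconds_since_start % POLL_INTERVAL_S
    return BASE_MS + DRIFT_MS_PER_S * since_last_poll


for t in (0.0, 15.0, 29.0, 30.0, 45.0):
    print(f"t = {t:4.0f} s   epsilon ~ {epsilon_ms(t):.1f} ms")
```

In this model ε resets to 1ms after each poll and climbs to just under 7ms before the next one, matching the roughly 1-7ms sawtooth the paper reports. Failed polls, which would let ε keep growing, are not modeled.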

Why GPS + Atomic Clocks?

“NTP can give us sub-millisecond accuracy most of the time. Why the expensive hardware?”

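NTP's problem: it gives you a best estimate, not a bound you can build correctness on. NTP computes a clock's offset assuming network delay is the same in both directions, so a delay spike in one direction during a sync (which will happen, in a large network) can leave your clock off by as much as 100ms without the application knowing. NTP tells you “about what time it is,” not “the time is between X and Y with certainty.”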
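The failure mode is easy to reproduce with NTP's standard on-wire calculation (the offset and delay formulas from RFC 5905). This Python sketch feeds it one made-up exchange in which the two clocks actually agree but the request travels more slowly than the reply; the function name is hypothetical.

```python
def ntp_offset_and_delay(t1: float, t2: float, t3: float, t4: float):
    """Standard NTP offset/delay calculation.

    t1: client send time (client clock)
    t2: server receive time (server clock)
    t3: server send time (server clock)
    t4: client receive time (client clock)
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2  # estimated server-minus-client offset
    delay = (t4 - t1) - (t3 - t2)          # round-trip network delay
    return offset, delay


# The clocks truly agree (true offset = 0 ms), but the request takes 90 ms
# to arrive and the reply only 10 ms.
t1 = 0.0         # client sends
t2 = t1 + 90.0   # server receives after 90 ms
t3 = t2          # server replies immediately
t4 = t3 + 10.0   # client receives after 10 ms
offset, delay = ntp_offset_and_delay(t1, t2, t3, t4)
print(f"estimated offset = {offset} ms, round-trip delay = {delay} ms")
```

The client would conclude its clock is 40ms behind and adjust it, even though it was already correct. The error equals half the delay asymmetry. From this exchange alone, NTP can only say the error is at most half the round-trip delay (50ms here), and only if neither clock misbehaves.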

TrueTime's guarantee is a bound, not an estimate. GPS and atomic clocks, combined with conservative worst-case drift assumptions, give Spanner an explicit upper bound on clock uncertainty that it can act on. This bound is what enables the external consistency guarantees described in the next lesson.

The So What?

TrueTime is Spanner's secret weapon: instead of relying on logical clocks (Lamport clocks, vector clocks) or NTP's best-effort synchronization, it bounds clock uncertainty with physics (GPS and atomic clocks). This costs dedicated hardware (GPS receivers and atomic clocks in every datacenter), but it is what lets Spanner provide externally consistent transactions across global datacenters. CockroachDB takes a software-only approach built on hybrid logical clocks and a configured maximum clock offset; it avoids the hardware cost but offers a weaker guarantee (serializable transactions, without Spanner's full external consistency).



Spanner: Exercises

Exercise 1: Commit Wait Math

TrueTime’s uncertainty ε is 7ms. A transaction’s prepare phase finishes at real time T. The Paxos leader calls TT.now() and gets the interval [T+2ms, T+9ms].

(a) What commit timestamp s does Spanner assign? (b) At what real time does commit wait end (i.e., TT.now().earliest > s)? (c) What was the total commit wait duration?

Exercise 2: Externally Consistent Reads

Can a Spanner read that does not involve a Paxos round (a follower read at snapshot timestamp) still be externally consistent? Explain why or why not, referencing TrueTime’s role.

Exercise 3: Read Scalability

Spanner writes go through a single Paxos leader per Paxos group. This sounds like a bottleneck. How does Spanner achieve read scalability despite this apparent limitation? Name two mechanisms.


Spanner: Solutions

Exercise 1: Commit Wait Math

(a) Spanner must pick s no smaller than TT.now().latest; taking the next millisecond keeps s strictly greater: s = TT.now().latest + 1ms = (T + 9ms) + 1ms = T + 10ms.

(b) Commit wait ends once TT.now().earliest > s = T + 10ms. With ε fixed at 7ms, the interval keeps its width and moves forward with real time, so earliest must climb from T + 2ms to just past T + 10ms, which takes 8ms from the TT.now() call. The real time at which this happens depends on where the true time sat inside [T + 2ms, T + 9ms]. In the worst case it sat at the upper edge, T + 9ms, so at real time r the interval is [r - ε, r] and earliest = r - ε. Then:

  r - ε > T + 10ms  ⇒  r > T + 10ms + ε = T + 17ms

So, in the worst case, commit wait ends just after T + 17ms.

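(c) Commit wait starts as soon as s is assigned. If the true time at the TT.now() call was T + 9ms (the top of the interval), the wait runs from T + 9ms to T + 17ms, for a total of 8ms (about 1.14ε). The duration is 8ms wherever the true time sat, because earliest must always climb from T + 2ms to past T + 10ms.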

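Note: the wait lasts about one full interval width. s sits just above the interval's upper edge, and commit wait lasts until the lower edge passes s, so earliest must advance by the whole width ε (plus the extra 1ms used to make s strictly greater). In the Spanner paper's notation, where ε is half the interval width, the same wait is described as roughly 2ε.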
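The arithmetic in (a) through (c) can be checked with a small Python function. It assumes the interval keeps a constant width and moves with real time, and that s is latest + 1ms; `commit_wait` and its parameter names are hypothetical, and all times are milliseconds relative to T.

```python
def commit_wait(earliest: int, latest: int, true_now: int):
    """Commit-wait arithmetic in ms relative to T (hypothetical helper).

    earliest, latest: the TT.now() interval seen when s is chosen.
    true_now: assumed real time of that TT.now() call; it must lie in the interval.
    Returns (s, end, waited); TT.now().earliest > s for any real time after `end`.
    """
    if not earliest <= true_now <= latest:
        raise ValueError("the TrueTime interval must contain the true time")
    s = latest + 1               # choose s strictly above TT.now().latest
    lag = true_now - earliest    # how far earliest trails the true time
    end = s + lag                # earliest reaches s at this real time
    return s, end, end - true_now


print(commit_wait(earliest=2, latest=9, true_now=9))  # (10, 17, 8): worst case
print(commit_wait(earliest=2, latest=9, true_now=2))  # (10, 10, 8): best case
```

Either way the wait lasts s - earliest = 8ms; only the real time at which it ends depends on where the true time sat in the interval.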

Exercise 2: Externally Consistent Reads

Yes, a follower read can still be externally consistent, provided its read timestamp is chosen with TrueTime and the follower is sufficiently up to date before it answers.

The key: external consistency requires that if transaction T1 finishes before read R starts in real time, then R must see T1’s writes. This is guaranteed as long as R’s read timestamp t_read ≥ t_commit(T1).

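For a strong read, Spanner sets t_read = TT.now().latest, sampled after the read starts. Commit wait guarantees that T1's commit timestamp is below the true time at which T1 finished, and T1 finished before R started, so t_commit(T1) < (true time at R's start) ≤ TT.now().latest = t_read. Choosing TT.now().earliest instead would not be safe: earliest can trail the true time by up to ε and may fall below t_commit(T1). Once TrueTime has fixed the timestamp, the follower serves the read only when its safe time (the timestamp up to which it has applied every write) reaches t_read, waiting to catch up if necessary.

However, a stale read (one at a timestamp deliberately chosen in the past, such as “10 seconds ago”) can miss transactions that committed in the meantime, so it is not externally consistent. It still returns a consistent snapshot, just not necessarily the latest one.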
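A Python sketch of the read-timestamp choice and the follower's safe-time check. The `Replica` class, `strong_read_timestamp`, and all numbers are hypothetical; the two rules encoded (strong reads use TT.now().latest, and a replica may serve a read at t only if t is at or below its safe time) follow the Spanner paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Replica:
    """Hypothetical follower replica."""
    name: str
    safe_time: float  # every write with a timestamp <= safe_time has been applied

    def can_serve(self, t_read: float) -> bool:
        # A replica may serve a read at t only if t <= its safe time.
        return t_read <= self.safe_time


def strong_read_timestamp(tt_now: Tuple[float, float]) -> float:
    """A strong (externally consistent) read uses TT.now().latest."""
    _earliest, latest = tt_now
    return latest


# Times in ms. T1 committed with s = 10 and its commit wait ended at real time 10.2.
# Read R starts on another server at real time 10.5, where TT.now() = [3.5, 10.5].
t_commit_t1 = 10.0
tt_now_at_r = (3.5, 10.5)

print(tt_now_at_r[0] >= t_commit_t1)  # False: a read at `earliest` could miss T1
t_read = strong_read_timestamp(tt_now_at_r)
print(t_read >= t_commit_t1)          # True: a read at `latest` must see T1

follower = Replica("follower-1", safe_time=8.0)
print(follower.can_serve(t_read))     # False: the follower must catch up first
follower.safe_time = 12.0             # e.g. after applying more replicated writes
print(follower.can_serve(t_read))     # True
```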

Exercise 3: Read Scalability

Two mechanisms:

1. Follower reads (stale reads). Any replica, not just the leader, can serve a read at timestamp t once its safe time (the timestamp up to which it has applied all writes) reaches t. Reads that tolerate bounded staleness (for example, data a few seconds old) can usually be served by a nearby follower without waiting, bypassing the leader entirely. For read-heavy workloads, adding replicas therefore scales read throughput without adding load on the leader.

2. Snapshot reads / time-bounded reads. Spanner keeps multiple timestamped versions of the data, so a read at a chosen timestamp (a snapshot read or a read-only transaction) takes no locks and does not block incoming writes. A replica whose safe time covers that timestamp answers from its local copy with no coordination, which makes these reads cheap.

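These two mechanisms let Spanner's read throughput grow with the total number of replicas rather than the number of leaders. Writes to a given Paxos group still go through that group's single leader, but data is split across many Paxos groups whose leaders are spread over many servers, so aggregate write capacity grows with the number of groups. For read-heavy workloads (the common case), the design can scale close to linearly with replica count.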
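The following Python sketch shows why snapshot reads need no coordination: each key keeps timestamped versions, and a read at timestamp t returns the newest version at or below t, provided the replica's safe time covers t. The `MultiVersionStore` class and its methods are hypothetical; this is a textbook multi-version read, not Spanner's storage format, and it simplifies safe-time tracking.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class MultiVersionStore:
    """Hypothetical replica-local store keeping every committed version of each key."""

    def __init__(self) -> None:
        self.versions: Dict[str, List[Tuple[float, str]]] = {}  # key -> sorted [(commit_ts, value)]
        self.safe_time = 0.0  # reads at timestamps <= safe_time can be served locally

    def apply(self, key: str, commit_ts: float, value: str) -> None:
        """Apply a committed write received through replication."""
        bisect.insort(self.versions.setdefault(key, []), (commit_ts, value))
        # Simplification: real safe-time tracking also accounts for in-flight transactions.
        self.safe_time = max(self.safe_time, commit_ts)

    def snapshot_read(self, key: str, ts: float) -> Optional[str]:
        """Return the newest value committed at or before ts. Takes no locks."""
        if ts > self.safe_time:
            raise RuntimeError("replica has not caught up to ts; wait or try another replica")
        history = self.versions.get(key, [])
        timestamps = [commit_ts for commit_ts, _ in history]
        i = bisect.bisect_right(timestamps, ts)  # number of versions with commit_ts <= ts
        return history[i - 1][1] if i > 0 else None


# Usage: two versions of one key, read at different snapshot timestamps (ms).
store = MultiVersionStore()
store.apply("balance", 10.0, "100")
store.apply("balance", 20.0, "150")
print(store.snapshot_read("balance", 15.0))  # 100
print(store.snapshot_read("balance", 20.0))  # 150
print(store.snapshot_read("balance", 5.0))   # None
```

Because `snapshot_read` consults only local versions and the local safe time, any sufficiently caught-up replica can answer it, which is the property both mechanisms rely on.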