Distributed & Decentralized Systems Curriculum
Time Causality · Physical Clock Drift

Key Question

How does the internet synchronize billions of devices to within milliseconds of each other?

Deep Dive

The Network Time Protocol (NTP) is the backbone of time synchronization on the internet. It has been in continuous operation since before the web existed: the first specification was published in 1985 (RFC 958), and the current version (NTPv4, RFC 5905) dates to 2010. NTP is both a protocol (how machines exchange time information) and a hierarchical system of servers organized into “stratum” levels, which indicate how far each server is from a reference clock and largely determine its accuracy.

The hierarchy looks like this:

Stratum 0:    [Atomic Clock] [GPS Receiver] [WWVB Radio]
                  |               |              |
Stratum 1:   ┌────┴───────────────┴──────────────┴───┐
             │        Primary Time Servers           │
             │   (Directly synced to Stratum 0)      │
             └────────────┬──────────────────────────┘
                          |
Stratum 2:   ┌────────────┴──────────────────────────┐
             │        Secondary Servers              │
             │   (Synced to Stratum 1)               │
             └────────────┬──────────────────────────┘
                          |
Stratum 3:   ┌────────────┴──────────────────────────┐
             │        Clients & Lower Servers        │
             │   (Synced to Stratum 2)               │
             └───────────────────────────────────────┘

Stratum 0 devices are the ultimate time sources. They do not participate in the network directly. These are atomic clocks (cesium or rubidium standards; the best laboratory cesium clocks are expensive, room-sized, and accurate to within one second in millions of years), GPS receivers (which decode the precise timing signals from GPS satellites, each of which carries multiple atomic clocks), and radio receivers that decode national time broadcasts like WWVB (US) or DCF77 (Germany). These are the physical clock sources that define “accurate time.”

Stratum 1 servers are connected directly to a Stratum 0 device. They poll the atomic clock or GPS receiver every few seconds and serve time to lower-stratum machines. Stratum 1 servers are typically maintained by national laboratories, universities, and large technology companies. The NTP Pool Project provides a global, DNS-based, load-balanced pool of public Stratum 1 and 2 servers: pool.ntp.org resolves to different servers depending on the client’s geographic location.

Stratum 2 servers synchronize with one or more Stratum 1 servers. They are the “workhorses” of the NTP hierarchy: most organizations run Stratum 2 servers as their internal time source. A Stratum 2 server typically polls multiple Stratum 1 servers and uses selection algorithms (including a variant of Marzullo’s algorithm, named after its inventor Keith Marzullo) to detect and discard faulty or inaccurate upstream servers.
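The idea behind Marzullo’s algorithm can be sketched in a few lines of Python. Each source reports a time estimate with an error bound, which gives an interval; the algorithm finds the range where the most intervals overlap, so a source that disagrees with the majority is ignored. This is a minimal sketch of the textbook algorithm, not NTP’s actual implementation, and the function name and sample intervals are made up for illustration.

```python
def marzullo(intervals):
    """Return (count, lo, hi): the range covered by the most source intervals.

    intervals: list of (lo, hi) tuples, one per time source (hypothetical data).
    """
    # Build edge events: -1 marks the start of an interval, +1 marks its end.
    # Sorting by (value, kind) processes starts before ends at equal values.
    events = []
    for lo, hi in intervals:
        events.append((lo, -1))
        events.append((hi, +1))
    events.sort()

    best, count = 0, 0
    best_lo = best_hi = None
    for i, (value, kind) in enumerate(events):
        count -= kind  # a start (-1) raises the overlap count, an end lowers it
        if count > best:
            best = count
            best_lo = value
            best_hi = events[i + 1][0]  # the overlap lasts until the next edge
    return best, best_lo, best_hi


# Three sources overlap between 11 and 12; the fourth is an outlier.
sources = [(8, 12), (11, 13), (10, 12), (40, 42)]
result = marzullo(sources)
print(result)
```

The outlier at (40, 42) never overlaps the other three, so it has no influence on the chosen range.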

Stratum 3 and below follow the same pattern: each stratum syncs from the one above. There is a hard limit of Stratum 15. Stratum 16 is considered “unsynchronized.” The protocol assigns a maximum of 15 strata because each level of synchronization adds inaccuracy.

Accuracy degradation. Each stratum adds about 10-100 microseconds of uncertainty under ideal conditions (low-latency, low-jitter links). In practice, typical accuracy by stratum is:

  • Stratum 1 client: within 1-10 microseconds (direct connect to reference)
  • Stratum 2 client: within 100 microseconds to 1 millisecond
  • Stratum 3-4 client: within 1-10 milliseconds
  • Stratum 5+ on congested links: 10-100 milliseconds or worse

This degradation means you cannot guarantee sub-millisecond accuracy across continents. The theoretical limit is the speed of light: a one-way trip between New York and Sydney takes about 80 ms in fiber, so a round trip takes at least 160 ms. No amount of protocol optimization can beat physics.
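The arithmetic behind the speed-of-light limit can be checked directly. The sketch below computes both the one-way and the round-trip propagation delay, assuming a great-circle distance of roughly 16,000 km between New York and Sydney and light traveling at about two-thirds of its vacuum speed in fiber. Both figures are round numbers chosen for illustration; real cable routes are longer.

```python
C_VACUUM_KM_S = 299_792.458      # speed of light in vacuum, km/s
FIBER_FACTOR = 2 / 3             # assumption: light in fiber travels at ~2/3 c
DISTANCE_KM = 16_000             # assumption: rough New York-Sydney distance


def one_way_delay_ms(distance_km, factor=FIBER_FACTOR):
    """Minimum one-way propagation delay over fiber, in milliseconds."""
    return distance_km / (C_VACUUM_KM_S * factor) * 1000


one_way = one_way_delay_ms(DISTANCE_KM)
print(f"one-way: {one_way:.0f} ms, round trip: {2 * one_way:.0f} ms")
```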

Slew vs. Step. When a machine’s clock is wrong by a small amount (less than 128 ms by default in ntpd), NTP adjusts the clock gradually, a technique called “slewing.” The clock’s rate is changed slightly (ntpd limits the slew rate to 500 ppm) so that it slowly converges to the correct time without any discontinuity. If the error is larger, NTP may “step” the clock: set it forward or backward instantly. Stepping is dangerous for applications that assume time never moves backward, like databases that assign transaction IDs based on timestamps. If time jumps backward by 500 ms, a transaction that just completed at timestamp T may be followed by a transaction at timestamp T-500, breaking monotonicity. Modern NTP implementations (like ntpd and chronyd) prefer slewing and only step if the error is above a configurable threshold (128 ms by default in ntpd; chronyd setups commonly use 1 second).
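The slew-or-step decision can be sketched as a small function. This is an illustrative model only, not the logic of ntpd or chronyd: the function name, the 128 ms default threshold, and the 500 ppm maximum slew rate are assumptions chosen for the example.

```python
def plan_correction(offset_s, step_threshold_s=0.128, max_slew_ppm=500.0):
    """Decide how to correct a clock that is offset_s seconds away from true time.

    Returns ("slew", seconds_needed) or ("step", 0.0).
    """
    if abs(offset_s) > step_threshold_s:
        # Error too large to absorb gradually: jump the clock in one go.
        return ("step", 0.0)
    # Slewing at N ppm removes N microseconds of error per second of real time.
    seconds_needed = abs(offset_s) / (max_slew_ppm * 1e-6)
    return ("slew", seconds_needed)


print(plan_correction(0.050))   # small error: slew
print(plan_correction(-0.500))  # large error: step
```

Under these assumed parameters, a 50 ms error is absorbed over roughly 100 seconds, while a 500 ms error is corrected with a single jump.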

NTP uses a more sophisticated model than Cristian’s algorithm. Each exchange records four timestamps, which lets the client separate network delay from server processing time. NTP cannot measure path asymmetry directly, so it maintains a history of samples and uses statistical techniques to estimate the true offset, preferring low-delay samples and rejecting those with high jitter.
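NTP’s per-exchange calculation uses four timestamps: T1 (client sends), T2 (server receives), T3 (server replies), and T4 (client receives). The sketch below applies the standard offset and delay formulas from RFC 5905 and then keeps the sample with the smallest delay. That last step is a simplified stand-in for NTP’s clock filter, and the sample values are invented for illustration.

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Return (offset, delay) for one NTP exchange; all times in the same unit."""
    offset = ((t2 - t1) + (t3 - t4)) / 2   # estimated server clock minus client clock
    delay = (t4 - t1) - (t3 - t2)          # round trip minus server processing time
    return offset, delay


# Hypothetical exchanges in milliseconds: (T1, T2, T3, T4).
exchanges = [
    (0, 1030, 1032, 62),
    (1000, 2100, 2101, 1131),   # slow and asymmetric exchange
    (2000, 3021, 3022, 2043),
]
samples = [ntp_offset_delay(*e) for e in exchanges]
best_offset, best_delay = min(samples, key=lambda s: s[1])
print(samples)
print(best_offset, best_delay)
```

The slow exchange produces an offset estimate that disagrees with the other two, which is why low-delay samples are preferred.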

Check Your Understanding

  1. A company running a trading application needs sub-millisecond clock accuracy. What is the minimum stratum level they should use? What hardware would they need?
  2. Why does accuracy degrade as the stratum level increases?
  3. What is the difference between slewing and stepping, and when would a database administrator be concerned about stepping?
  4. Why can’t NTP guarantee better than 100 ms accuracy between servers on different continents?

The “So What?”

NTP is the silent dependency of nearly every internet service. Database replication, authentication token validation, cache expiration, log correlation, and distributed consensus all assume clocks are within some bound, and NTP is what keeps them there. Understanding the hierarchy helps you diagnose accuracy problems: if your servers have NTP errors of 50 ms, that may be acceptable. If they have errors of 5 seconds, your authentication system will reject valid tokens and your database will confuse the order of transactions. Know what stratum your servers use and how much error is acceptable for your application.


✏️ Exercises

Topic 2: Physical Clock Drift - Exercises

Exercise 1: Drift Accumulation

A server room experiences a cooling failure. The temperature rises by 15 degrees Celsius, causing the quartz crystals in the servers’ clocks to drift at 20 ppm instead of their normal 5 ppm. The cooling outage lasts 30 days.

How much clock error accumulates during this period? Express your answer in seconds.


Exercise 2: Cristian’s Algorithm Error Bound

A client uses Cristian’s algorithm to synchronize with a time server. The client makes three requests and gets these results:

Request   T1 (client send)   T_server (server time)   T2 (client receive)
1         0                  10,000,000               120
2         500                10,000,100               540
3         1000               10,000,200               1040

All times are in milliseconds.

(a) Calculate the estimated server time for each request.
(b) Which request gives the most accurate estimate? Why?
(c) If the actual one-way delay from client to server is 50 ms and from server to client is 10 ms for the best request, what is the true server time at T2, and what is the error of Cristian’s estimate?


Exercise 3: Berkeley Algorithm Arithmetic

A cluster of 5 machines runs the Berkeley Algorithm. The master polls the slaves and receives these time offsets (in seconds from the master’s perspective at the moment of polling):

Machine    Offset (seconds)
Master     +1.00
Slave 1    +0.50
Slave 2    -0.75
Slave 3    +0.25
Slave 4    +5.00

(a) Identify any outliers. What threshold would you use?
(b) Compute the fault-tolerant average.
(c) Compute the adjustment for each machine.
(d) What would happen to the cluster’s time if the Berkeley Algorithm were NOT used and each machine relied on its own quartz crystal?


Exercise 4: NTP Slewing vs. Stepping

A database server generates timestamps for every write. Its clock is wrong by +800 ms (it is 800 ms ahead of UTC). The NTP daemon has two options:

Option A: Step the clock backward by 800 ms immediately.
Option B: Slew the clock at a rate of 50 ppm (50 microseconds per second) until it reaches the correct time.

(a) How long does Option B take to correct the 800 ms error? Show the calculation.
(b) During Option A, describe the exact problem that can occur with database timestamps.
(c) Which option would you choose for a system that assigns monotonically increasing transaction IDs? Why?


Topic 2: Physical Clock Drift - Solutions

Solution 1

Drift rate: D = 20 ppm = 20 × 10⁻⁶

Elapsed time: 30 days = 30 × 24 × 60 × 60 = 2,592,000 seconds

Error = D × elapsed_time = 20 × 10⁻⁶ × 2,592,000 = 20 × 2.592 = 51.84 seconds

Each server’s clock is off by about 52 seconds after 30 days. This is still inside the 5-minute skew that Kerberos typically tolerates, but it is already large enough to misorder events across servers and make log timestamps confusing, and it keeps growing for as long as the drift goes uncorrected.
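The same calculation as a short script, which also shows the normal 5 ppm case for comparison. The function name is made up for this sketch, and it assumes the drift rate stays constant for the whole period.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def drift_error_s(drift_ppm, days):
    """Accumulated clock error in seconds for a constant drift rate."""
    return drift_ppm * 1e-6 * days * SECONDS_PER_DAY


for ppm in (5, 20):
    print(f"{ppm} ppm over 30 days: {drift_error_s(ppm, 30):.2f} s")
```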


Solution 2

(a)

Request 1: RTT = 120 - 0 = 120 ms. Estimated = 10,000,000 + 120/2 = 10,000,060
Request 2: RTT = 540 - 500 = 40 ms. Estimated = 10,000,100 + 40/2 = 10,000,120
Request 3: RTT = 1040 - 1000 = 40 ms. Estimated = 10,000,200 + 40/2 = 10,000,220

(b) Requests 2 and 3 both have RTT = 40 ms, tying for the smallest RTT. A smaller RTT suggests less network congestion and makes the symmetric-delay assumption more plausible, so the two are equally “best” by RTT. Request 3’s estimate (10,000,220) is higher than Request 2’s (10,000,120) simply because the server’s clock advanced between the two requests. The error bound of the estimate (±RTT/2 = ±20 ms) is the same for both, compared with ±60 ms for Request 1.

To choose between them, one could compare the variances of previous RTT measurements or simply use the first one with the minimum RTT.
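Parts (a) and (b) can be checked with a few lines of Python. The function name is made up for this sketch; it applies Cristian’s symmetric-delay assumption and reports RTT/2 as the worst-case error bound, using the figures from the exercise table.

```python
def cristian_estimate(t1, t_server, t2):
    """Return (estimated server time at T2, error bound), assuming symmetric delays."""
    rtt = t2 - t1
    return t_server + rtt / 2, rtt / 2


# (T1, T_server, T2) in milliseconds, from the exercise table.
requests = [(0, 10_000_000, 120), (500, 10_000_100, 540), (1000, 10_000_200, 1040)]
for n, (t1, ts, t2) in enumerate(requests, start=1):
    estimate, bound = cristian_estimate(t1, ts, t2)
    print(f"Request {n}: estimate {estimate:.0f} +/- {bound:.0f} ms")
```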

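(c) For Request 2: RTT = 40 ms, and the server’s reply takes 10 ms to reach the client.

True server time at T2 = T_server + inbound_delay = 10,000,100 + 10 ms = 10,000,110

Cristian’s estimate = T_server + RTT/2 = 10,000,100 + 20 ms = 10,000,120

Error = 10,000,120 - 10,000,110 = 10 ms

The estimate is 10 ms ahead of the true time. When the round trip consists only of the two one-way delays, this error equals (outbound_delay - inbound_delay) / 2. That shortcut cannot be applied to the figures as stated, because a 50 ms outbound delay plus a 10 ms inbound delay would give a 60 ms round trip rather than the measured 40 ms; the 10 ms result relies only on the measured RTT and the inbound delay.

The general formula: maximum error = RTT/2 - min_one_way = 20 - 10 = 10 ms.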


Solution 3

(a) Outlier detection: The offsets are +1.00, +0.50, -0.75, +0.25, and +5.00. Slave 4’s offset of +5.00 seconds is far outside the range of the other four (which span -0.75 to +1.00). A simple threshold would be: discard any value more than, say, 3 seconds from the median. Slave 4 is clearly faulty; maybe its battery has died.

(b) Fault-tolerant average (excluding Slave 4):

Average = (+1.00 + 0.50 + (-0.75) + 0.25) / 4 = 1.00 / 4 = +0.25 seconds

(c) Adjustments:

Master: +1.00 → +0.25, adjust by -0.75 seconds (slew backward)
Slave 1: +0.50 → +0.25, adjust by -0.25 seconds (slew backward)
Slave 2: -0.75 → +0.25, adjust by +1.00 seconds (slew forward)
Slave 3: +0.25 → +0.25, adjust by 0 seconds (already at average)
Slave 4: excluded, not adjusted (but should be diagnosed)
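Parts (a) through (c) can be reproduced with a short script. This is a minimal sketch of the averaging step only, not a full Berkeley implementation: the function name is made up, and the 3-second distance from the median is the illustrative outlier threshold suggested in part (a).

```python
from statistics import median


def berkeley_adjustments(offsets, max_deviation=3.0):
    """Return (average, adjustments), ignoring offsets far from the median.

    offsets: dict of machine name -> offset in seconds.
    max_deviation: hypothetical outlier threshold in seconds.
    """
    mid = median(offsets.values())
    trusted = {m: o for m, o in offsets.items() if abs(o - mid) <= max_deviation}
    average = sum(trusted.values()) / len(trusted)
    # Each trusted machine moves by (average - its own offset).
    return average, {m: average - o for m, o in trusted.items()}


offsets = {"Master": 1.00, "Slave 1": 0.50, "Slave 2": -0.75,
           "Slave 3": 0.25, "Slave 4": 5.00}
average, adjustments = berkeley_adjustments(offsets)
print(average)
print(adjustments)
```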

(d) Without the Berkeley Algorithm, each machine’s clock would drift independently. After 30 days:

Master at 5 ppm: off by 5 × 10⁻⁶ × 2,592,000 = 13.0 seconds
Slave 1 at 5 ppm: off by 13.0 seconds (in the same direction, perhaps)
Slave 2 at 5 ppm: off by 13.0 seconds

But if one machine has a crystal with a different drift rate (say 20 ppm due to manufacturing variance), it could be off by 52 seconds, producing a 39-second disagreement between machines. Database timestamps would disagree, logs would be impossible to correlate, and any ordering based on timestamps would be wrong.


Solution 4

(a) Option B (slew at 50 ppm):

50 ppm = 50 × 10⁻⁶ = 0.00005 seconds per second = 0.05 ms per second

To correct 800 ms:

time = 800 ms / (0.05 ms/s) = 16,000 seconds = 4 hours, 26 minutes, 40 seconds

(b) During Option A (step backward by 800ms):

At 10:00:00.000 (server time), the server writes transaction T1 with timestamp 10:00:00.000. NTP steps the clock back to 09:59:59.200 (correct UTC time). The next transaction T2 gets timestamp 09:59:59.200. Now T2 appears to “happen before” T1, even though it came after. Any system that relies on timestamp monotonicity (write-ahead logs, replication slots, MVCC visibility) may fail or corrupt data because the “before” and “after” ordering is reversed.
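The failure mode can be reproduced with a toy model. Everything in the sketch below is hypothetical: `FakeClock` stands in for a wall clock that NTP steps backward, and `MonotonicStamper` shows one common defensive pattern (never hand out a timestamp lower than the previous one). It is not something the database in this exercise is known to implement.

```python
class FakeClock:
    """Hypothetical wall clock, in milliseconds, that can be stepped."""

    def __init__(self, now_ms):
        self.now_ms = now_ms

    def step(self, delta_ms):
        self.now_ms += delta_ms


class MonotonicStamper:
    """Hands out timestamps that never decrease, even if the clock does."""

    def __init__(self, clock):
        self.clock = clock
        self.last_ms = None

    def next(self):
        now = self.clock.now_ms
        if self.last_ms is not None and now <= self.last_ms:
            now = self.last_ms + 1   # clock went backward: bump past the last value
        self.last_ms = now
        return now


clock = FakeClock(36_000_000)        # 10:00:00.000 as ms since midnight
stamper = MonotonicStamper(clock)
raw_t1, safe_t1 = clock.now_ms, stamper.next()
clock.step(-800)                     # NTP steps the clock back by 800 ms
raw_t2, safe_t2 = clock.now_ms, stamper.next()
print(raw_t2 < raw_t1, safe_t2 > safe_t1)
```

The raw timestamps go backward across the step, while the guarded ones keep increasing at the cost of briefly running ahead of the wall clock.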

(c) For a system with monotonically increasing transaction IDs, Option B (slew) is strongly preferred. The roughly 4.4-hour convergence time is acceptable for most applications, and slewing prevents any discontinuity in timestamps. Option A risks data corruption or system crashes. If the error were extremely large (say, hours), the database might require a maintenance window for a stepped correction, but for 800 ms, slewing is safe.