Distributed & Decentralized Systems Curriculum
Time Causality · Physical Clock Drift

Key Question

How do computers synchronize their clocks when no machine has an accurate time source?

Deep Dive

Cristian’s algorithm assumes the time server has an accurate clock (GPS or atomic clock). What if no machine in the system has a reliable external time reference? This is the problem the Berkeley Algorithm solves. Developed at UC Berkeley in the late 1980s, it runs on a local network where no machine has a UTC source, but all machines need to agree on a common time.

The algorithm uses a master node that polls the slave nodes for their local times, computes an average (discarding outliers), and tells each slave how much to adjust. No slave talks directly to any other slave — all communication goes through the master.

Step-by-step with a 5-machine example:

Initial state (times shown as offset from the "true" unknown time):
  Master:   +0.50 seconds (half a second ahead)
  Slave A:  +0.10 seconds
  Slave B:  -0.30 seconds (behind)
  Slave C:  +0.40 seconds
  Slave D:  +2.00 seconds (clearly faulty)

Step 1: Master polls all slaves

The master sends a request to each slave. Each slave responds with its current local time. The master records the response times, adjusting for estimated network delay (similar to Cristian’s algorithm).

  Master receives:
  Slave A:  +0.10  (at local time +0.51, adjusted to +0.10)
  Slave B:  -0.30  (at local time +0.52, adjusted to -0.30)
  Slave C:  +0.40  (at local time +0.49, adjusted to +0.40)
  Slave D:  +2.00  (at local time +0.53, adjusted to +2.00)

Step 2: Compute fault-tolerant average

The master computes the average, but first excludes outliers. Slave D’s time of +2.00 seconds is far outside the range of the others — something is wrong (perhaps a dying CMOS battery). The master discards it.

  Fault-tolerant average = (+0.50 + 0.10 + (-0.30) + 0.40) / 4
                        = 0.70 / 4
                        = +0.175 seconds

Step 3: Compute adjustments

Each slave is told how much to adjust its clock, relative to the fault-tolerant average. The adjustment is the difference between the slave’s time and the average, but the master also adjusts itself (the difference between its own time and the average).

  Master:  +0.50 → +0.175,  adjustment = -0.325 (set clock back by 325ms)
  Slave A: +0.10 → +0.175,  adjustment = +0.075 (set clock forward by 75ms)
  Slave B: -0.30 → +0.175,  adjustment = +0.475 (set clock forward by 475ms)
  Slave C: +0.40 → +0.175,  adjustment = -0.225 (set clock back by 225ms)

Step 4: Apply adjustments

The master sends each slave its adjustment value. Each slave adjusts its clock. Since the adjustment is typically small (less than a second), the slaves can slew the clock — change the rate of time progression gradually rather than jumping abruptly — to avoid problems with timestamps that were generated between the adjustment command and the adjustment execution.

Handling master failure. The Berkeley Algorithm makes the master a single point of failure. If the master crashes, no synchronization happens. In practice, a backup master can be elected if the primary fails, using a leader election protocol. The slaves detect the master’s failure through timeout and promote one of themselves (or a designated backup) to master.

The algorithm converges the entire cluster to a common time, even though no single machine knows the “true” time. The cluster drifts together, which is often good enough for internal coordination — timestamps will agree across machines, even if they disagree with the outside world.

Check Your Understanding

  1. Why does the Berkeley Algorithm discard outlier times before computing the average?
  2. If there are 3 machines with times +0.2s, -0.1s, and +0.8s, what is the fault-tolerant average? (Assume the outlier is discarded.)
  3. What happens if the master crashes during the synchronization round?

The “So What?”

The Berkeley Algorithm is used in clusters that lack external time sources. If you have ever deployed a database cluster without GPS or NTP access (on an isolated network, for instance), the machines must still agree on time for transaction ordering, cache coherency, and log timestamps. The Berkeley approach — master-driven, fault-tolerant averaging — is a practical fallback when UTC is unavailable, and the same “discard outliers” principle appears in modern distributed consensus protocols like Raft and Paxos.


✏️ Exercises

Topic 2: Physical Clock Drift — Exercises

Exercise 1: Drift Accumulation

A server room experiences a cooling failure. The temperature rises by 15 degrees Celsius, causing the quartz crystals in the servers’ clocks to drift at 20 ppm instead of their normal 5 ppm. The cooling outage lasts 30 days.

How much clock error accumulates during this period? Express your answer in seconds.


Exercise 2: Cristian’s Algorithm Error Bound

A client uses Cristian’s algorithm to synchronize with a time server. The client makes three requests and gets these results:

RequestT1 (client send)T_server (server time)T2 (client receive)
1010,000,000120
250010,000,100540
3100010,000,2001040

(a) Calculate the estimated server time for each request. (b) Which request gives the most accurate estimate? Why? (c) If the actual one-way delay from client to server is 50ms and from server to client is 10ms for the best request, what is the true server time at T2, and what is the error of Cristian’s estimate?


Exercise 3: Berkeley Algorithm Arithmetic

A cluster of 5 machines runs the Berkeley Algorithm. The master polls the slaves and receives these time offsets (in seconds from the master’s perspective at the moment of polling):

MachineOffset (seconds)
Master+1.00
Slave 1+0.50
Slave 2-0.75
Slave 3+0.25
Slave 4+5.00

(a) Identify any outliers. What threshold would you use? (b) Compute the fault-tolerant average. (c) Compute the adjustment for each machine. (d) What would happen to the cluster’s time if the Berkeley Algorithm were NOT used and each machine relied on its own quartz crystal?


Exercise 4: NTP Slewing vs. Stepping

A database server generates timestamps for every write. Its clock is wrong by +800 ms (it is 800 ms ahead of UTC). The NTP daemon has two options:

Option A: Step the clock backward by 800 ms immediately. Option B: Slew the clock at a rate of 50 ppm (50 microseconds per second) until it reaches the correct time.

(a) How long does Option B take to correct the 800 ms error? Show the calculation. (b) During Option A, describe the exact problem that can occur with database timestamps. (c) Which option would you choose for a system that assigns monotonically increasing transaction IDs? Why?

👁️ View Solutions

Topic 2: Physical Clock Drift — Solutions

Solution 1

Drift rate: D = 20 ppm = 20 × 10⁻⁶

Elapsed time: 30 days = 30 × 24 × 60 × 60 = 2,592,000 seconds

Error = D × 10⁻⁶ × elapsed_time = 20 × 10⁻⁶ × 2,592,000 = 20 × 2.592 = 51.84 seconds

Each server’s clock is off by about 52 seconds after 30 days. This is enough to cause authentication failures (Kerberos tickets typically have a 5-minute tolerance), SSL certificate validation failures, and confusing log timestamps.


Solution 2

(a)

Request 1: RTT = 120 - 0 = 120ms. Estimated = 10,000,000 + 120/2 = 10,000,060 Request 2: RTT = 540 - 500 = 40ms. Estimated = 10,000,100 + 40/2 = 10,000,120 Request 3: RTT = 1040 - 1000 = 40ms. Estimated = 10,000,200 + 40/2 = 10,000,220

(b) Request 2 and 3 both have RTT = 40ms — this ties for the smallest RTT. A smaller RTT suggests less network congestion and makes the symmetric delay assumption more plausible. Both Request 2 and 3 are equally “best” by RTT. However, if we note that Request 3’s estimated time is 10,000,220 with the same RTT, while Request 2’s is 10,000,120, this is simply because the server’s clock advanced between the two requests. The accuracy of the estimate (error from true time) is the same for both since the RTT is the same.

To choose between them, one could compare the variances of previous RTT measurements or simply use the first one with the minimum RTT.

(c) For Request 2: RTT = 40ms, but the true asymmetry is 50ms outbound, 10ms inbound.

True server time at T2 = T_server + inbound_delay = 10,000,100 + 10ms = 10,000,110

Cristian’s estimate = 10,000,120

Error = 10,000,120 - 10,000,110 = 10ms

The estimate is 10ms ahead of the true time. The error is: |inbound_delay - outbound_delay| / 2 = |10 - 50| / 2 = 20ms

But the actual error is 10ms (because of additional processing time). The general formula: maximum error = RTT/2 - min_one_way = 20 - 10 = 10ms.


Solution 3

(a) Outlier detection: The offsets are +1.00, +0.50, -0.75, +0.25, and +5.00. Slave 4’s offset of +5.00 seconds is far outside the range of the other four (which span -0.75 to +1.00). A simple threshold would be: discard any value more than, say, 3 seconds from the median. Slave 4 is clearly faulty — maybe its battery has died.

(b) Fault-tolerant average (excluding Slave 4):

Average = (+1.00 + 0.50 + (-0.75) + 0.25) / 4 = 1.00 / 4 = +0.25 seconds

(c) Adjustments:

Master: +1.00 → +0.25, adjust by -0.75 seconds (slew backward) Slave 1: +0.50 → +0.25, adjust by -0.25 seconds (slew backward) Slave 2: -0.75 → +0.25, adjust by +1.00 seconds (slew forward) Slave 3: +0.25 → +0.25, adjust by 0 seconds (already at average) Slave 4: excluded, not adjusted (but should be diagnosed)

(d) Without the Berkeley Algorithm, each machine’s clock would drift independently. After 30 days:

Master at 5 ppm: off by 5 × 10⁻⁶ × 2,592,000 = 13.0 seconds Slave 1 at 5 ppm: off by 13.0 seconds (in the same direction, perhaps) Slave 2 at 5 ppm: off by 13.0 seconds But if one machine has a crystal with a different drift rate (say 20 ppm due to manufacturing variance), it could be off by 52 seconds — producing a 39-second disagreement between machines. Database timestamps would disagree, logs would be impossible to correlate, and any ordering based on timestamps would be wrong.


Solution 4

(a) Option B (slew at 50 ppm):

50 ppm = 50 × 10⁻⁶ = 0.00005 seconds per second = 0.05 ms per second

To correct 800 ms:

time = 800 ms / (0.05 ms/s) = 16,000 seconds = 4 hours, 26 minutes, 40 seconds

(b) During Option A (step backward by 800ms):

At 10:00:00.000 (server time), the server writes transaction T1 with timestamp 10:00:00.000. NTP steps the clock back to 09:59:59.200 (correct UTC time). The next transaction T2 gets timestamp 09:59:59.200. Now T2 appears to “happen before” T1, even though it came after. Any system that relies on timestamp monotonicity (write-ahead logs, replication slots, MVCC visibility) may fail or corrupt data because the “before” and “after” ordering is reversed.

(c) For a system with monotonically increasing transaction IDs, Option B (slew) is strongly preferred. The 4.5-hour convergence time is acceptable for most applications — it prevents any discontinuity in timestamps. Option A risks data corruption or system crashes. If the error were extremely large (say, hours), the database might require a maintenance window for a stepped correction, but for 800ms, slewing is safe.