Distributed Transactions Coordination · Failure Detectors

Key Question

How do distributed systems detect that a remote node has crashed?

Deep Dive

A process in a distributed system can’t just “check” if a remote node is alive — there’s no shared memory, no shared bus, no hardware interrupt. The only way to know is communication: if a node stops responding to messages, is it dead or just slow?

This is the failure detection problem, and it’s fundamentally unsolvable in a purely asynchronous system. This isn’t a technology limitation — it’s a mathematical one (the Fischer-Lynch-Paterson impossibility result).

Timeline: Failure Detection at Node A

Node A                     Node B
  |                          |
  |-------- heartbeat ------>|  (T+0)
  |                          |
  |                          |  (B stops responding here)
  |                          |
  |-------- heartbeat ------>|  (T+1 — no response)
  |                          |
  |-------- heartbeat ------>|  (T+2 — no response)
  |                          |
  |    *** TIMEOUT ***       |  (T+3)
  |   Node B = SUSPECT       |
  |                          |

The fundamental problem: Node A sets a timeout of 3 seconds. If B hasn’t responded in 3s, A suspects B is dead. But what if:

B is processing a garbage collection pause (2.5s is OK in Java)
The network had a 10-second routing convergence
B is under heavy load and its heartbeat thread is starved

A timeout that’s too short causes false positives: you evict a healthy node from the cluster, triggering unnecessary leader elections, data rebalancing, or replication. A timeout that’s too long means the system stays unresponsive while waiting to detect a real failure.

This is why failure detectors are called unreliable — they can always be wrong. The insight from Chandra and Toueg (1996) was to treat failure detection not as a binary fact but as a suspicion level that applications can adapt to.

False Positives vs. Detection Latency

        Aggressive           Moderate           Conservative
        (short timeout)      (balanced)         (long timeout)
             |                    |                    |
  False +    High                 Medium               Low
  Detection  Fast                  Moderate            Slow
  Tradeoff:  Risk of chaos        Sweet spot          Risk of hanging

Modern systems expose the timeout as a tunable parameter and use gossip-based or accrual detectors to get the best of both worlds.

Check Your Understanding

Why can’t a distributed system distinguish a crashed node from a slow one in an asynchronous network?
What happens when a failure detector timeout is set too short? What happens when it’s set too long?
If Node A suspects Node B is dead but B is actually alive, what’s the worst that could happen?

The “So What?”

Every distributed system — from Cassandra to etcd to Kubernetes — relies on failure detectors. Understanding their fundamental unreliability explains why distributed systems have complex failure-handling logic, why you see things like “leader election timeouts” and “gossip protocols,” and why there’s always a tradeoff between availability and correctness when nodes fail.

✏️ Exercises

Failure Detectors: Exercises

Exercise 1: The Impossibility of Perfect Detection

Explain why a “Perfect” failure detector (one that never falsely suspects a correct node and always detects all crashes) cannot exist in a purely asynchronous distributed system. Use the FLP result and the timeout dilemma in your answer.

Exercise 2: Computing Phi Values

In the Φ-accrual failure detector, the random variable “inter-arrival time of heartbeats” follows a normal distribution with mean μ = 500ms and standard deviation σ = 50ms.

Given that a heartbeat was received at time T, and the current time is T + 750ms:

a) Approximately how many standard deviations is 750ms from the mean? b) What φ value (to one decimal place) corresponds to this gap?

Hint: φ = -log₂( P(gap > observed) )
P(gap > μ + 5σ) ≈ 0.00000029
P(gap > μ + 4σ) ≈ 0.0000317
P(gap > μ + 3σ) ≈ 0.00135
P(gap > μ + 2σ) ≈ 0.0228

Exercise 3: SWIM in Action

Suppose you have a 100-node cluster using SWIM. Node A arbitrarily selects Node 42 for its PING. Node 42 doesn’t respond to the direct PING.

a) What should Node A do next before declaring Node 42 suspect? b) Node A selects k=3 random nodes (Nodes 15, 60, and 88) to perform indirect pings. Nodes 15 and 60 report success reaching Node 42, but Node 88 does not. What should Node A conclude? c) After Node A declares Node 42 suspect, how does this information spread through the cluster? d) If Node 42 receives the gossip that it has been marked suspect, what action can it take?

👁️ View Solutions

Failure Detectors: Solutions

Solution 1: The Impossibility of Perfect Detection

A Perfect failure detector (P) requires two properties:

Strong completeness: Every crashed node is eventually suspected by every correct node.
Strong accuracy: No correct node is ever suspected by any correct node.

In a purely asynchronous system, message delays are unbounded — a message can take arbitrarily long to arrive. This means:

If Node A hasn’t heard from Node B, it cannot distinguish between “B crashed” (should suspect) and “B’s heartbeat is still in transit but delayed” (should NOT suspect).
Setting any finite timeout risks false positives (network delay exceeds timeout) or misses (B crashes but timeout hasn’t expired).
The FLP impossibility result proves that in an async system, you cannot deterministically reach consensus with even one faulty node. Since perfect failure detection would enable consensus (e.g., detect the faulty node and exclude it), perfect detection is also impossible.

The only way to achieve Perfect P is to introduce synchrony assumptions — bounds on message delay, processing time, or clock drift — which is why real systems use timeouts and accept unreliability.

Solution 2: Computing Phi Values

Given μ = 500ms, σ = 50ms, observed gap = 750ms.

a) The gap (750ms) is (750 - 500) / 50 = 5 standard deviations from the mean.

b) P(gap > μ + 5σ) ≈ 0.00000029

φ = -log₂(0.00000029) = -log₂(2.9 × 10⁻⁷) = -(-21.7) ≈ 21.7

This is an astronomically high suspicion level — far beyond any practical threshold. Even φ = 8 is considered “almost certainly dead.”

So: φ ≈ 21.7. The detector is extremely confident the node has crashed, because a 750ms gap is a 5-sigma event under the normal heartbeat distribution.

Solution 3: SWIM in Action

a) Before declaring Node 42 suspect, Node A should perform indirect probing: select k random nodes and ask each to PING Node 42. This avoids false positives caused by network congestion or packet loss on the direct path between A and 42.

b) Node A should conclude Node 42 is alive. Since at least one indirect ping succeeded (two out of three), the failure is likely on the direct A→42 path, not with Node 42 itself. Node A should NOT suspect Node 42.

c) Node A piggybacks the suspicion message (“Node 42 → SUSPECT”) onto its next gossip round. It tells a small subset of nodes (e.g., 2-3). Each of those nodes tells 2-3 more, and so on. This infection-style dissemination means the suspicion spreads exponentially: after O(log n) rounds, all n nodes know about the suspected failure.

d) Node 42 (if actually alive) receives the gossip and immediately sends a live acknowledgment — a message to the cluster saying “I’m alive, the suspicion is false.” Nodes that receive this ack mark Node 42 as Alive again. This is the self-healing aspect of SWIM’s suspicion mechanism.