Distributed Transactions Coordination · Failure Detectors

Key Question

What are the formal classes of failure detectors, and how do they relate to consensus solvability?

Deep Dive

Chandra and Toueg (1996) formalized failure detectors by defining two properties: completeness (do you detect all actual failures?) and accuracy (can you be wrong?).

Classification Matrix

              |  Strong Accuracy      |  Weak Accuracy
              |  (never suspect a     |  (at least one correct
              |   correct node)       |   node never suspected)
--------------+-----------------------+-------------------------------
Strong        |  Perfect P            |  Strong S
Completeness  |  Detects all crashes, |  Detects all crashes,
(never misses |  no false suspicions  |  may falsely suspect, but
a crash)      |                       |  at least one is never wrong
--------------+-----------------------+-------------------------------
Weak          |  Eventually Perfect   |  Eventually Strong
Completeness  |  ◇P                   |  ◇S
(may miss new |  Eventually no        |  Eventually at least one
crashes for   |  false suspicions     |  correct node is
a while)      |                       |  never suspected

Perfect P — detects every crash, never falsely suspects any correct node. Sounds ideal, but it’s impossible in a purely asynchronous system. Why? Because to never falsely suspect, you need to wait “long enough” to be sure — but in an async system, “long enough” doesn’t exist since messages have no upper bound on delay.

Eventually Perfect ◇P — there exists some unknown time after which no false suspicions occur. Before that time, it can falsely suspect all it wants. This is implementable: use a timeout that grows over time. Eventually, the timeout will be longer than any network delay, and false suspicions stop.

Strong S — detects all crashes and guarantees that at least one correct node is never suspected. So you might suspect 9 out of 10 correct nodes, but one is always trusted.

Eventually Strong ◇S — eventually, at least one correct node is never suspected.

The big result: ◇S and ◇P are sufficient to solve consensus in an asynchronous system with a majority of correct nodes. This is the Chandra-Toueg consensus algorithm, which uses ◇S to elect a leader and then runs a rotating coordinator protocol.

Why ◇P enables consensus

Phase 1: Chaos          Phase 2: Stability
(falsely suspecting)     (◇P has stabilized)
     |                         |
  A suspects B            A trusts B (correct)
  B suspects C            B trusts C (correct)  
  C suspects A            C trusts A (correct)
     |                         |
  NO CONSENSUS            CONSENSUS POSSIBLE

The practical takeaway: you don’t need a perfect detector. You just need one that eventually stops making mistakes. That’s why real systems like Paxos and Raft work — they use failure detectors that are aggressive early (fast detection, some false positives) but stabilize over time.

Check Your Understanding

What’s the difference between Strong S and Perfect P failure detectors?
Why is Perfect P impossible in an asynchronous system?
Which failure detector class is the minimum needed to solve consensus?

The “So What?”

The Chandra-Toueg classification tells us two things: (1) perfect failure detection is impossible, so stop trying to build it, and (2) consensus only needs a detector that’s eventually accurate. This justifies why real systems use aggressive timeouts with retries and elections — you can afford false positives as long as the system eventually converges.

✏️ Exercises

Failure Detectors: Exercises

Exercise 1: The Impossibility of Perfect Detection

Explain why a “Perfect” failure detector (one that never falsely suspects a correct node and always detects all crashes) cannot exist in a purely asynchronous distributed system. Use the FLP result and the timeout dilemma in your answer.

Exercise 2: Computing Phi Values

In the Φ-accrual failure detector, the random variable “inter-arrival time of heartbeats” follows a normal distribution with mean μ = 500ms and standard deviation σ = 50ms.

Given that a heartbeat was received at time T, and the current time is T + 750ms:

a) Approximately how many standard deviations is 750ms from the mean? b) What φ value (to one decimal place) corresponds to this gap?

Hint: φ = -log₂( P(gap > observed) )
P(gap > μ + 5σ) ≈ 0.00000029
P(gap > μ + 4σ) ≈ 0.0000317
P(gap > μ + 3σ) ≈ 0.00135
P(gap > μ + 2σ) ≈ 0.0228

Exercise 3: SWIM in Action

Suppose you have a 100-node cluster using SWIM. Node A arbitrarily selects Node 42 for its PING. Node 42 doesn’t respond to the direct PING.

a) What should Node A do next before declaring Node 42 suspect? b) Node A selects k=3 random nodes (Nodes 15, 60, and 88) to perform indirect pings. Nodes 15 and 60 report success reaching Node 42, but Node 88 does not. What should Node A conclude? c) After Node A declares Node 42 suspect, how does this information spread through the cluster? d) If Node 42 receives the gossip that it has been marked suspect, what action can it take?

👁️ View Solutions

Failure Detectors: Solutions

Solution 1: The Impossibility of Perfect Detection

A Perfect failure detector (P) requires two properties:

Strong completeness: Every crashed node is eventually suspected by every correct node.
Strong accuracy: No correct node is ever suspected by any correct node.

In a purely asynchronous system, message delays are unbounded — a message can take arbitrarily long to arrive. This means:

If Node A hasn’t heard from Node B, it cannot distinguish between “B crashed” (should suspect) and “B’s heartbeat is still in transit but delayed” (should NOT suspect).
Setting any finite timeout risks false positives (network delay exceeds timeout) or misses (B crashes but timeout hasn’t expired).
The FLP impossibility result proves that in an async system, you cannot deterministically reach consensus with even one faulty node. Since perfect failure detection would enable consensus (e.g., detect the faulty node and exclude it), perfect detection is also impossible.

The only way to achieve Perfect P is to introduce synchrony assumptions — bounds on message delay, processing time, or clock drift — which is why real systems use timeouts and accept unreliability.

Solution 2: Computing Phi Values

Given μ = 500ms, σ = 50ms, observed gap = 750ms.

a) The gap (750ms) is (750 - 500) / 50 = 5 standard deviations from the mean.

b) P(gap > μ + 5σ) ≈ 0.00000029

φ = -log₂(0.00000029) = -log₂(2.9 × 10⁻⁷) = -(-21.7) ≈ 21.7

This is an astronomically high suspicion level — far beyond any practical threshold. Even φ = 8 is considered “almost certainly dead.”

So: φ ≈ 21.7. The detector is extremely confident the node has crashed, because a 750ms gap is a 5-sigma event under the normal heartbeat distribution.

Solution 3: SWIM in Action

a) Before declaring Node 42 suspect, Node A should perform indirect probing: select k random nodes and ask each to PING Node 42. This avoids false positives caused by network congestion or packet loss on the direct path between A and 42.

b) Node A should conclude Node 42 is alive. Since at least one indirect ping succeeded (two out of three), the failure is likely on the direct A→42 path, not with Node 42 itself. Node A should NOT suspect Node 42.

c) Node A piggybacks the suspicion message (“Node 42 → SUSPECT”) onto its next gossip round. It tells a small subset of nodes (e.g., 2-3). Each of those nodes tells 2-3 more, and so on. This infection-style dissemination means the suspicion spreads exponentially: after O(log n) rounds, all n nodes know about the suspected failure.

d) Node 42 (if actually alive) receives the gossip and immediately sends a live acknowledgment — a message to the cluster saying “I’m alive, the suspicion is false.” Nodes that receive this ack mark Node 42 as Alive again. This is the self-healing aspect of SWIM’s suspicion mechanism.