Distributed Transactions Coordination · Failure Detectors

Key Question

How do modern failure detectors like Cassandra’s Phi Accrual work differently from simple timeouts?

Deep Dive

Simple timeout-based detectors have a binary output: “alive” or “dead.” But the world isn’t binary — a node that missed one heartbeat might be 99% alive or 0.1% alive. Accrual failure detectors replace the binary decision with a continuous suspicion level and let the application decide the threshold.

Cassandra uses the Φ-accrual failure detector (Hayashibara et al., 2004). Here’s how it works:

Step 1: Track heartbeat inter-arrival times

Heartbeat #      Arrival Time    Gap (ms)
    1               T+100          -
    2               T+500          400
    3               T+950          450
    4               T+1350         400
    5               T+1800         450
    6               T+2200         400

Step 2: Model the distribution (e.g., Normal: μ=420ms, σ=28ms)

Step 3: Compute φ when a heartbeat is overdue

  Now = T+2800 (600ms since last heartbeat)
  φ = -log₂( P( gap > 600ms | μ=420, σ=28 ) )
    = -log₂( ~0.0000001 )
    = ~23.3

  Interpretation: φ=23 means "extremely unlikely Node B is alive"

The formula: φ = -log₂(1 - CDF(time_since_last_heartbeat)), where CDF is the cumulative distribution function of the observed heartbeat gaps.

Phi Value Interpretation

  φ        Probability Node Is Alive    Meaning
  ----     -------------------------    -------
  φ=0      50%                          No information
  φ=1      10%                          Mild suspicion
  φ=2      1%                           Probable failure
  φ=3      0.1%                         Strong suspicion
  φ=5      0.001%                       Very confident
  φ=8      0.003%                       Almost certainly dead
  φ=16     ~0.0015%                     Dead beyond doubt

Visual: Phi Curve Over Time

Suspicion Level (φ)
  ^
  |                        φ curve
  |                     /
  |                  /
  |               /
  |            /
  |         /
  |      /
  |   /
  | /
  +---------------------------------> Time since last heartbeat
  |<-- normal gap range -->|<-- suspicion grows -->

The key advantage: different components in the same system can use different φ thresholds. The gossip layer might use φ=1 (fast detection, tolerates false positives) while the read coordinator uses φ=8 (conservative, avoids unnecessary repairs). No need to reconfigure timeouts — just change the threshold.

Cassandra also adapts the heartbeat interval dynamically: if a node has been stable for hours, the φ value adjusts to expect consistent heartbeats, making it even more sensitive to anomalies.

Check Your Understanding

What φ value corresponds to “99.99% sure the node is dead”?
Why does Φ-accrual model the distribution of heartbeat gaps rather than using a fixed timeout?
How could two different applications in the same system use different φ thresholds?

The “So What?”

Accrual failure detectors decouple detection from interpretation. The detector just provides a suspicion level; the application decides what to do with it. This is a cleaner design than fixed timeouts and is why Cassandra can handle variable network conditions without manual tuning — the detector learns what “normal” looks like for each node.

✏️ Exercises

Failure Detectors: Exercises

Exercise 1: The Impossibility of Perfect Detection

Explain why a “Perfect” failure detector (one that never falsely suspects a correct node and always detects all crashes) cannot exist in a purely asynchronous distributed system. Use the FLP result and the timeout dilemma in your answer.

Exercise 2: Computing Phi Values

In the Φ-accrual failure detector, the random variable “inter-arrival time of heartbeats” follows a normal distribution with mean μ = 500ms and standard deviation σ = 50ms.

Given that a heartbeat was received at time T, and the current time is T + 750ms:

a) Approximately how many standard deviations is 750ms from the mean? b) What φ value (to one decimal place) corresponds to this gap?

Hint: φ = -log₂( P(gap > observed) )
P(gap > μ + 5σ) ≈ 0.00000029
P(gap > μ + 4σ) ≈ 0.0000317
P(gap > μ + 3σ) ≈ 0.00135
P(gap > μ + 2σ) ≈ 0.0228

Exercise 3: SWIM in Action

Suppose you have a 100-node cluster using SWIM. Node A arbitrarily selects Node 42 for its PING. Node 42 doesn’t respond to the direct PING.

a) What should Node A do next before declaring Node 42 suspect? b) Node A selects k=3 random nodes (Nodes 15, 60, and 88) to perform indirect pings. Nodes 15 and 60 report success reaching Node 42, but Node 88 does not. What should Node A conclude? c) After Node A declares Node 42 suspect, how does this information spread through the cluster? d) If Node 42 receives the gossip that it has been marked suspect, what action can it take?

👁️ View Solutions

Failure Detectors: Solutions

Solution 1: The Impossibility of Perfect Detection

A Perfect failure detector (P) requires two properties:

Strong completeness: Every crashed node is eventually suspected by every correct node.
Strong accuracy: No correct node is ever suspected by any correct node.

In a purely asynchronous system, message delays are unbounded — a message can take arbitrarily long to arrive. This means:

If Node A hasn’t heard from Node B, it cannot distinguish between “B crashed” (should suspect) and “B’s heartbeat is still in transit but delayed” (should NOT suspect).
Setting any finite timeout risks false positives (network delay exceeds timeout) or misses (B crashes but timeout hasn’t expired).
The FLP impossibility result proves that in an async system, you cannot deterministically reach consensus with even one faulty node. Since perfect failure detection would enable consensus (e.g., detect the faulty node and exclude it), perfect detection is also impossible.

The only way to achieve Perfect P is to introduce synchrony assumptions — bounds on message delay, processing time, or clock drift — which is why real systems use timeouts and accept unreliability.

Solution 2: Computing Phi Values

Given μ = 500ms, σ = 50ms, observed gap = 750ms.

a) The gap (750ms) is (750 - 500) / 50 = 5 standard deviations from the mean.

b) P(gap > μ + 5σ) ≈ 0.00000029

φ = -log₂(0.00000029) = -log₂(2.9 × 10⁻⁷) = -(-21.7) ≈ 21.7

This is an astronomically high suspicion level — far beyond any practical threshold. Even φ = 8 is considered “almost certainly dead.”

So: φ ≈ 21.7. The detector is extremely confident the node has crashed, because a 750ms gap is a 5-sigma event under the normal heartbeat distribution.

Solution 3: SWIM in Action

a) Before declaring Node 42 suspect, Node A should perform indirect probing: select k random nodes and ask each to PING Node 42. This avoids false positives caused by network congestion or packet loss on the direct path between A and 42.

b) Node A should conclude Node 42 is alive. Since at least one indirect ping succeeded (two out of three), the failure is likely on the direct A→42 path, not with Node 42 itself. Node A should NOT suspect Node 42.

c) Node A piggybacks the suspicion message (“Node 42 → SUSPECT”) onto its next gossip round. It tells a small subset of nodes (e.g., 2-3). Each of those nodes tells 2-3 more, and so on. This infection-style dissemination means the suspicion spreads exponentially: after O(log n) rounds, all n nodes know about the suspected failure.

d) Node 42 (if actually alive) receives the gossip and immediately sends a live acknowledgment — a message to the cluster saying “I’m alive, the suspicion is false.” Nodes that receive this ack mark Node 42 as Alive again. This is the self-healing aspect of SWIM’s suspicion mechanism.