Key Question
How do modern failure detectors like Cassandra’s Phi Accrual work differently from simple timeouts?
Deep Dive
Simple timeout-based detectors have a binary output: “alive” or “dead.” But the world isn’t binary — a node that missed one heartbeat might be 99% alive or 0.1% alive. Accrual failure detectors replace the binary decision with a continuous suspicion level and let the application decide the threshold.
Cassandra uses the Φ-accrual failure detector (Hayashibara et al., 2004). Here’s how it works:
Step 1: Track heartbeat inter-arrival times
Heartbeat # Arrival Time Gap (ms)
1 T+100 -
2 T+500 400
3 T+950 450
4 T+1350 400
5 T+1800 450
6 T+2200 400
Step 2: Model the distribution (e.g., Normal: μ=420ms, σ=28ms)
Step 3: Compute φ when a heartbeat is overdue
Now = T+2800 (600ms since last heartbeat)
φ = -log₂( P( gap > 600ms | μ=420, σ=28 ) )
= -log₂( ~0.0000001 )
= ~23.3
Interpretation: φ=23 means "extremely unlikely Node B is alive"
The formula: φ = -log₂(1 - CDF(time_since_last_heartbeat)), where CDF is the cumulative distribution function of the observed heartbeat gaps.
Phi Value Interpretation
φ Probability Node Is Alive Meaning
---- ------------------------- -------
φ=0 50% No information
φ=1 10% Mild suspicion
φ=2 1% Probable failure
φ=3 0.1% Strong suspicion
φ=5 0.001% Very confident
φ=8 0.003% Almost certainly dead
φ=16 ~0.0015% Dead beyond doubt
Visual: Phi Curve Over Time
Suspicion Level (φ)
^
| φ curve
| /
| /
| /
| /
| /
| /
| /
| /
+---------------------------------> Time since last heartbeat
|<-- normal gap range -->|<-- suspicion grows -->
The key advantage: different components in the same system can use different φ thresholds. The gossip layer might use φ=1 (fast detection, tolerates false positives) while the read coordinator uses φ=8 (conservative, avoids unnecessary repairs). No need to reconfigure timeouts — just change the threshold.
Cassandra also adapts the heartbeat interval dynamically: if a node has been stable for hours, the φ value adjusts to expect consistent heartbeats, making it even more sensitive to anomalies.
Check Your Understanding
- What φ value corresponds to “99.99% sure the node is dead”?
- Why does Φ-accrual model the distribution of heartbeat gaps rather than using a fixed timeout?
- How could two different applications in the same system use different φ thresholds?
The “So What?”
Accrual failure detectors decouple detection from interpretation. The detector just provides a suspicion level; the application decides what to do with it. This is a cleaner design than fixed timeouts and is why Cassandra can handle variable network conditions without manual tuning — the detector learns what “normal” looks like for each node.
✏️ Exercises
Failure Detectors: Exercises
Exercise 1: The Impossibility of Perfect Detection
Explain why a “Perfect” failure detector (one that never falsely suspects a correct node and always detects all crashes) cannot exist in a purely asynchronous distributed system. Use the FLP result and the timeout dilemma in your answer.
Exercise 2: Computing Phi Values
In the Φ-accrual failure detector, the random variable “inter-arrival time of heartbeats” follows a normal distribution with mean μ = 500ms and standard deviation σ = 50ms.
Given that a heartbeat was received at time T, and the current time is T + 750ms:
a) Approximately how many standard deviations is 750ms from the mean? b) What φ value (to one decimal place) corresponds to this gap?
- Hint: φ = -log₂( P(gap > observed) )
- P(gap > μ + 5σ) ≈ 0.00000029
- P(gap > μ + 4σ) ≈ 0.0000317
- P(gap > μ + 3σ) ≈ 0.00135
- P(gap > μ + 2σ) ≈ 0.0228
Exercise 3: SWIM in Action
Suppose you have a 100-node cluster using SWIM. Node A arbitrarily selects Node 42 for its PING. Node 42 doesn’t respond to the direct PING.
a) What should Node A do next before declaring Node 42 suspect? b) Node A selects k=3 random nodes (Nodes 15, 60, and 88) to perform indirect pings. Nodes 15 and 60 report success reaching Node 42, but Node 88 does not. What should Node A conclude? c) After Node A declares Node 42 suspect, how does this information spread through the cluster? d) If Node 42 receives the gossip that it has been marked suspect, what action can it take?
👁️ View Solutions
Failure Detectors: Solutions
Solution 1: The Impossibility of Perfect Detection
A Perfect failure detector (P) requires two properties:
- Strong completeness: Every crashed node is eventually suspected by every correct node.
- Strong accuracy: No correct node is ever suspected by any correct node.
In a purely asynchronous system, message delays are unbounded — a message can take arbitrarily long to arrive. This means:
- If Node A hasn’t heard from Node B, it cannot distinguish between “B crashed” (should suspect) and “B’s heartbeat is still in transit but delayed” (should NOT suspect).
- Setting any finite timeout risks false positives (network delay exceeds timeout) or misses (B crashes but timeout hasn’t expired).
- The FLP impossibility result proves that in an async system, you cannot deterministically reach consensus with even one faulty node. Since perfect failure detection would enable consensus (e.g., detect the faulty node and exclude it), perfect detection is also impossible.
The only way to achieve Perfect P is to introduce synchrony assumptions — bounds on message delay, processing time, or clock drift — which is why real systems use timeouts and accept unreliability.
Solution 2: Computing Phi Values
Given μ = 500ms, σ = 50ms, observed gap = 750ms.
a) The gap (750ms) is (750 - 500) / 50 = 5 standard deviations from the mean.
b) P(gap > μ + 5σ) ≈ 0.00000029
φ = -log₂(0.00000029) = -log₂(2.9 × 10⁻⁷) = -(-21.7) ≈ 21.7
This is an astronomically high suspicion level — far beyond any practical threshold. Even φ = 8 is considered “almost certainly dead.”
So: φ ≈ 21.7. The detector is extremely confident the node has crashed, because a 750ms gap is a 5-sigma event under the normal heartbeat distribution.
Solution 3: SWIM in Action
a) Before declaring Node 42 suspect, Node A should perform indirect probing: select k random nodes and ask each to PING Node 42. This avoids false positives caused by network congestion or packet loss on the direct path between A and 42.
b) Node A should conclude Node 42 is alive. Since at least one indirect ping succeeded (two out of three), the failure is likely on the direct A→42 path, not with Node 42 itself. Node A should NOT suspect Node 42.
c) Node A piggybacks the suspicion message (“Node 42 → SUSPECT”) onto its next gossip round. It tells a small subset of nodes (e.g., 2-3). Each of those nodes tells 2-3 more, and so on. This infection-style dissemination means the suspicion spreads exponentially: after O(log n) rounds, all n nodes know about the suspected failure.
d) Node 42 (if actually alive) receives the gossip and immediately sends a live acknowledgment — a message to the cluster saying “I’m alive, the suspicion is false.” Nodes that receive this ack mark Node 42 as Alive again. This is the self-healing aspect of SWIM’s suspicion mechanism.