Consensus Fault Tolerance · FLP Impossibility

Key Question

What do “safety” and “liveness” mean, and how do they relate to consensus?

Deep Dive

Safety means “nothing bad happens.” In consensus, safety requires:

Agreement: No two correct nodes decide different values.
Validity: Any decided value was proposed by some node.

Safety is a reachability property — if violated, you can point to a specific moment where it broke. It’s like “don’t cross the double yellow line.” You can check if anyone ever crossed it.

Liveness means “something good eventually happens.” In consensus:

Termination: Every correct node eventually decides a value.

Liveness is a convergence property — you cannot point to a moment where it’s violated because it might still happen in the future. It’s like “you will eventually reach your destination.” At any point during the trip, it’s not yet violated — you might still arrive.

The tension between them is fundamental:

Safety (must be preserved at all costs):
  Node A: |--decide(1)--|          |--decide(1)--| ✓
  Node B: |----decide(1)----|      |--decide(1)--| ✓

  Node A: |--decide(1)--|          |--decide(0)--| ✗ SAFETY VIOLATED!
  Node B: |----decide(0)----|      |----decide(0)--|

FLP’s core finding: in an asynchronous system with even one possible crash, you cannot guarantee liveness while preserving safety. Any algorithm that always terminates (liveness) can be forced into a contradiction (safety violation) by scheduling delays.

A consensus algorithm can be:

Safe but not live: Nodes wait forever for a response that never comes. The system blocks. No decision is ever made. Agreement is preserved (no disagreement) but nothing ever happens.
Live but not safe: Nodes decide quickly, but due to incomplete information, different nodes decide different values. Progress happens, but the result is garbage.

Real systems choose safety over liveness — it’s better to stop making progress than to silently disagree.

Check Your Understanding

If a consensus algorithm blocks forever because a network partition never heals, is safety violated, liveness violated, or both?
Can you tell at any finite point in time whether a liveness property has been violated? Why or why not?
Why do production systems prefer violating liveness over violating safety?

The “So What?”

The FLP impossibility is specifically about liveness (guaranteed termination), not safety. We can always keep the system safe — just not guaranteed to always make progress. When your Raft cluster refuses to elect a leader during a network partition, that’s liveness being sacrificed to protect safety.

✏️ Exercises

FLP Impossibility: Exercises

Exercise 1: Synchrony and FLP

If a system can guarantee message delivery within 100ms, does FLP apply? Explain why or why not.

Exercise 2: Valency

What is the difference between a univalent configuration and a bivalent configuration? Give a concrete example of each in a 3-node system trying to agree on 0 or 1.

Exercise 3: Timeouts and the Async Model

How does using a timeout violate the pure asynchronous model? If a timeout fires, what can you conclude — and what can’t you conclude?

👁️ View Solutions

FLP Impossibility: Solutions

Exercise 1

If a system guarantees message delivery within 100ms, FLP does not apply as originally stated because the system is synchronous (bounded message delay), not asynchronous. FLP assumes no upper bound on message delay. With a 100ms bound, you can use timeouts safely — if you don’t hear back in 100ms, the node must be crashed.

However, in practice, the 100ms bound might not hold during garbage collection pauses, network congestion, or overload. So while the theoretical FLP doesn’t apply to this system, the practical concern remains — you must be certain the bound always holds.

Exercise 2

Univalent: A configuration where all possible executions lead to the same decision value.

Example of 1-valent: All three nodes have exchanged messages indicating they’ve accepted proposal value 1. No matter what messages are delivered next, the system will decide 1.

Bivalent: A configuration where both 0 and 1 are still possible outcomes depending on future message deliveries.

Example of bivalent: Two nodes proposed 0, one node proposed 1, and no majority has formed. If the first two nodes communicate first, they’ll decide 0. If the third node makes a counter-proposal to one of the 0-proposers before that node hears from the other 0-proposer, the system might shift toward 1. Both outcomes are still possible.

Exercise 3

A timeout assumes an upper bound on message delivery time. If a message hasn’t arrived within the timeout period, you conclude the recipient is faulty. But in the pure asynchronous model, the message might arrive in the next microsecond — there’s no way to distinguish “slow” from “crashed.”

When a timeout fires, you can conclude: “I haven’t received a response yet.” You cannot conclude: “The node has crashed.” This is why timeouts are called “unreliable failure detectors” — they can suspect correct processes and fail to suspect crashed ones. The pure async model forbids even unreliable failure detection because it assumes no timing information is available at all.