Consensus Fault Tolerance · FLP Impossibility

Key Question

If FLP proves consensus is impossible, how do systems like Raft and Paxos work?

Deep Dive

FLP proves impossibility under three specific assumptions: (1) pure asynchrony, (2) deterministic algorithms, (3) crash failures. Real systems bypass FLP by relaxing at least one of these assumptions.

Approach 1: Partial synchrony (used by Paxos, Raft). Assume the system is asynchronous initially but becomes synchronous after some unknown time GST (Global Stabilization Time). This is the Dwork, Lynch, Stockmeyer model. During asynchronous periods, consensus may not make progress — but once the network stabilizes, it completes. This matches reality: networks are usually fast, with occasional hiccups.

Normal:  |---<10ms---|---<10ms---|---<10ms---|---<10ms---| 
                          NETWORK PARTITION
Async:   |---<10ms---|---~~>30s~~---|---<10ms---|---<10ms---| 
                                               ^
                                          Consensus completes
                                          once synchrony returns

Raft uses timeouts. When a follower doesn’t hear from the leader within its timeout, it starts an election. This timeout is effectively saying: “If I haven’t heard anything in 150ms, the network is probably synchronous enough to try an election.” FLP doesn’t apply because timeouts are a relaxation of the pure async model — Raft can make progress when the network is fast enough.

Approach 2: Failure detectors. Add an unreliable failure detector — a module that suspects processes have crashed. The failure detector can be wrong (suspect a correct process), but it must eventually suspect all crashed processes (strong completeness). With a failure detector, consensus becomes solvable (the Chandra-Toueg result). In practice, timeouts are a simple failure detector.

Approach 3: Randomization (Ben-Or, 1983). Use coin flips to break bivalent states. In Ben-Or’s algorithm, if nodes can’t agree, they flip a random coin and try again. With probability 1, they eventually agree. This randomizes the liveness guarantee — instead of “always terminates,” you get “terminates with probability 1.” Some blockchain consensus protocols use this approach.

Ben-Or round:
  Node A: proposes 0, sees conflict --> flips coin --> got 1 --> propose 1
  Node B: proposes 1, sees conflict --> flips coin --> got 1 --> propose 1
  Node C: proposes 0, sees conflict --> flips coin --> got 0 --> propose 0

  (A and B align on 1, C on 0 — repeat until all match)

Each of these approaches relaxes a specific FLP assumption:

Partial synchrony: relaxes pure asynchrony
Failure detectors: adds timing information (effectively partial synchrony)
Randomization: relaxes determinism

Check Your Understanding

What FLP assumption does using a timeout violate?
In the partial synchrony model, does the algorithm know when GST has occurred?
How does randomization guarantee eventual consensus despite FLP?

The “So What?”

The FLP impossibility is why consensus algorithms rely on timeouts and cannot guarantee progress during severe network issues. When your Raft cluster splits into partitions, both sides continue running but neither can commit new entries — that’s FLP in action. The system is safe (no two nodes decide different things) but not live (no progress).

✏️ Exercises

FLP Impossibility: Exercises

Exercise 1: Synchrony and FLP

If a system can guarantee message delivery within 100ms, does FLP apply? Explain why or why not.

Exercise 2: Valency

What is the difference between a univalent configuration and a bivalent configuration? Give a concrete example of each in a 3-node system trying to agree on 0 or 1.

Exercise 3: Timeouts and the Async Model

How does using a timeout violate the pure asynchronous model? If a timeout fires, what can you conclude — and what can’t you conclude?

👁️ View Solutions

FLP Impossibility: Solutions

Exercise 1

If a system guarantees message delivery within 100ms, FLP does not apply as originally stated because the system is synchronous (bounded message delay), not asynchronous. FLP assumes no upper bound on message delay. With a 100ms bound, you can use timeouts safely — if you don’t hear back in 100ms, the node must be crashed.

However, in practice, the 100ms bound might not hold during garbage collection pauses, network congestion, or overload. So while the theoretical FLP doesn’t apply to this system, the practical concern remains — you must be certain the bound always holds.

Exercise 2

Univalent: A configuration where all possible executions lead to the same decision value.

Example of 1-valent: All three nodes have exchanged messages indicating they’ve accepted proposal value 1. No matter what messages are delivered next, the system will decide 1.

Bivalent: A configuration where both 0 and 1 are still possible outcomes depending on future message deliveries.

Example of bivalent: Two nodes proposed 0, one node proposed 1, and no majority has formed. If the first two nodes communicate first, they’ll decide 0. If the third node makes a counter-proposal to one of the 0-proposers before that node hears from the other 0-proposer, the system might shift toward 1. Both outcomes are still possible.

Exercise 3

A timeout assumes an upper bound on message delivery time. If a message hasn’t arrived within the timeout period, you conclude the recipient is faulty. But in the pure asynchronous model, the message might arrive in the next microsecond — there’s no way to distinguish “slow” from “crashed.”

When a timeout fires, you can conclude: “I haven’t received a response yet.” You cannot conclude: “The node has crashed.” This is why timeouts are called “unreliable failure detectors” — they can suspect correct processes and fail to suspect crashed ones. The pure async model forbids even unreliable failure detection because it assumes no timing information is available at all.