Consensus Fault Tolerance · FLP Impossibility

Key Question

What do we mean by an “asynchronous” distributed system, and why does it make consensus hard?

Deep Dive

Distributed systems textbooks define three models for timing. Synchronous: messages arrive within a known bound, processes execute steps at a known speed, and there’s a global clock. Asynchronous: none of these guarantees exist. Messages can take arbitrarily long, processes can pause arbitrarily, and there is no global clock. Partially synchronous: the system is asynchronous initially but becomes synchronous after some unknown time (the most realistic model).

In the asynchronous model, you cannot tell the difference between a crashed node and a slow node. Consider this timeline:

Node A:  |--run--|--run--|--send--|~~~~~wait~~~~~|--timeout--|  "B crashed!"
Node B:  |--run--|--send--|~~~~~GARBAGE COLLECTION PAUSE (45s)~~~~~~|--recv--|

Node A sends a message to B and waits. Node B pauses for 45 seconds due to a garbage collection cycle. From A’s perspective, B might be crashed — or it might be slow. In a synchronous model, you’d know for certain. In the asynchronous model, you cannot distinguish these cases.

Why does this matter? Consensus requires nodes to make decisions. If a node must wait for a response from every other node, one slow node blocks the entire system. If it uses timeouts, it risks making a decision without hearing from a node that might disagree — violating safety.

The asynchronous model is not a theoretical toy — it describes real systems. Network congestion can delay packets by seconds. VM scheduling can pause a process for hundreds of milliseconds. A garbage collector can stop-the-world for multiple seconds. Modern production systems operate in an asynchronous world.

Check Your Understanding

What is the key difference between a synchronous and asynchronous model in terms of message delivery?
If you run a consensus algorithm across machines in the same datacenter connected by a 10Gbps switch, is the system “synchronous” in the theoretical sense? Why or why not?
Why does the inability to distinguish a crashed node from a slow node make consensus harder?

The “So What?”

Almost all real distributed systems are asynchronous in practice. When you build consensus systems, you are designing for a world where you can never be certain whether a node has failed or is just taking a nap. This is why timeouts are a heuristic, not a guarantee.

✏️ Exercises

FLP Impossibility: Exercises

Exercise 1: Synchrony and FLP

If a system can guarantee message delivery within 100ms, does FLP apply? Explain why or why not.

Exercise 2: Valency

What is the difference between a univalent configuration and a bivalent configuration? Give a concrete example of each in a 3-node system trying to agree on 0 or 1.

Exercise 3: Timeouts and the Async Model

How does using a timeout violate the pure asynchronous model? If a timeout fires, what can you conclude — and what can’t you conclude?

👁️ View Solutions

FLP Impossibility: Solutions

Exercise 1

If a system guarantees message delivery within 100ms, FLP does not apply as originally stated because the system is synchronous (bounded message delay), not asynchronous. FLP assumes no upper bound on message delay. With a 100ms bound, you can use timeouts safely — if you don’t hear back in 100ms, the node must be crashed.

However, in practice, the 100ms bound might not hold during garbage collection pauses, network congestion, or overload. So while the theoretical FLP doesn’t apply to this system, the practical concern remains — you must be certain the bound always holds.

Exercise 2

Univalent: A configuration where all possible executions lead to the same decision value.

Example of 1-valent: All three nodes have exchanged messages indicating they’ve accepted proposal value 1. No matter what messages are delivered next, the system will decide 1.

Bivalent: A configuration where both 0 and 1 are still possible outcomes depending on future message deliveries.

Example of bivalent: Two nodes proposed 0, one node proposed 1, and no majority has formed. If the first two nodes communicate first, they’ll decide 0. If the third node makes a counter-proposal to one of the 0-proposers before that node hears from the other 0-proposer, the system might shift toward 1. Both outcomes are still possible.

Exercise 3

A timeout assumes an upper bound on message delivery time. If a message hasn’t arrived within the timeout period, you conclude the recipient is faulty. But in the pure asynchronous model, the message might arrive in the next microsecond — there’s no way to distinguish “slow” from “crashed.”

When a timeout fires, you can conclude: “I haven’t received a response yet.” You cannot conclude: “The node has crashed.” This is why timeouts are called “unreliable failure detectors” — they can suspect correct processes and fail to suspect crashed ones. The pure async model forbids even unreliable failure detection because it assumes no timing information is available at all.