Consensus Fault Tolerance · Byzantine Fault Tolerance

Key Question

What happens when nodes don’t just crash, but actively lie or behave maliciously?

Deep Dive

In 1982, Lamport, Shostak, and Pease posed a problem: N generals surround a city, each commanding a portion of the army. They must agree on a plan — either attack or retreat. Generals communicate only via messengers. Some generals are traitors who will actively try to prevent the loyal generals from reaching agreement.

This is different from a crash failure. A crashed node stops. A Byzantine node actively misbehaves: it can send different messages to different recipients, lie about what it received, collude with other traitors, or send messages designed to confuse the protocol.

Scenario: 3 generals, 1 traitor

General A (loyal): "Let's attack!" 
  sends to B -> "attack"
  sends to C -> "attack"

General B (traitor): receives "attack" from A
  sends to C -> "retreat" (LIES)
  sends to A -> "retreat" (LIES)

General C (loyal): receives "attack" from A, "retreat" from B
  What does C think? Mixed messages!
  C can't tell who is the traitor.

General A: receives "retreat" from B
  A thinks: "B says retreat. C hasn't reported yet (message delayed)."
  A doesn't know B is the traitor either!

Crash vs Byzantine failure:

Crash failure:
  Node A: sends "attack" → Node B: ✓ received
  Node A: sends "attack" → Node C: ✓ received
  Node B: [CRASH] → stops sending anything
  
  Result: Everyone knows B is down. Easy to handle.

Byzantine failure:
  Node A: sends "attack" → Node B ✓
  Node A: sends "attack" → Node C:  ✓
  Node B (traitor): tells A "C says retreat," tells C "A says retreat"
  
  Result: A thinks C is the traitor, C thinks A is the traitor.
  The traitor (B) has successfully divided the loyal generals.

Why this matters for distributed systems: In permissionless systems (blockchains), anyone can join as a validator. A malicious validator can run modified software designed to subvert consensus. Even in permissioned settings, a compromised machine (rootkit, supply chain attack) becomes Byzantine. BFT protects against the worst case: active attackers, not just random crashes.

The Byzantine Generals Problem is considered solved when:

All loyal generals agree on the same plan.
The plan they agree on was proposed by a loyal general.

These correspond to consensus’s Agreement and Validity properties — applied in a world where nodes can lie.

Check Your Understanding

In the 3-generals example with 1 traitor, is it possible for the loyal generals to reach agreement? Why or why not?
What additional capabilities does a Byzantine faulty node have compared to a crash-faulty node?
Could a node with a hardware bug (bit flips in memory) be considered Byzantine?

The “So What?”

BFT is the threat model for blockchains, aerospace, and any system where nodes might be actively compromised. Satellite constellations (where radiation can corrupt memory) and financial systems (where fraud is the adversary) must handle Byzantine faults. Understanding BFT is essential for designing systems where trust is distributed and adversaries are active.

✏️ Exercises

Byzantine Fault Tolerance: Exercises

Exercise 1: 3f+1 Bound

How many nodes are needed to tolerate 3 Byzantine faulty nodes? Show the math.

Exercise 2: PBFT Phases

In PBFT, why is the Prepare phase needed if we already have Pre-prepare? What problem would occur with only Pre-prepare and Commit?

Exercise 3: Proof-of-Work as BFT

How does Proof-of-Work act as a probabilistic BFT mechanism? What happens if an attacker controls 40% of the total hashrate?

👁️ View Solutions

Byzantine Fault Tolerance: Solutions

Exercise 1

N ≥ 3f + 1 = 3(3) + 1 = 10 nodes.

For f = 3, you need at least 10 nodes. With 10 nodes, you can tolerate up to 3 Byzantine faults. With only 9 nodes (3f, but 3f+1 is 10), you’d have 3 correct nodes per “group” — any 3 faulty nodes could group together and confuse the protocol since honest nodes need 2f+1 replies but can’t distinguish liars from honest ones in the worst case.

Compare to crash faults: 2f+1 = 7 nodes needed for f=3 crash tolerance. BFT requires 3 more nodes.

Exercise 2

Without the Prepare phase, a single faulty primary could send different Pre-prepare messages to different replicas, and those replicas would commit different values in the Commit phase — violating agreement.

The Prepare phase ensures within-view consistency: replicas share their Pre-prepare receipts with each other. When a replica collects 2f matching Prepare messages (including its own Pre-prepare), it knows that at least f+1 correct replicas received the same Pre-prepare. Even if the primary is Byzantine, the correct replicas have agreed on the ordering.

The Commit phase then ensures cross-view consistency: that the decision survives view changes.

Without Prepare, a Byzantine primary could send Pre-prepare(“attack”) to half the replicas and Pre-prepare(“retreat”) to the other half. The Commit phase alone can’t detect this because replicas don’t cross-check what they received from the primary.

Exercise 3

Proof-of-Work acts as probabilistic BFT because:

Identity is expensive: Each block requires computational work. An adversary must expend compute power proportional to their hashrate to produce blocks.
Chain selection rule: The longest (most work) chain is considered canonical. To reverse a transaction, the adversary must outpace the honest chain — requiring >50% of total hashrate.
Probabilistic finality: Each confirmation block reduces the probability that an adversary can rewrite history.

With 40% hashrate, an attacker has a significant but not dominant position. The probability of a successful double-spend attack decreases exponentially with confirmations:

After 1 block: ~40% chance (if attacker has pre-mined)
After 3 blocks: ~10% chance
After 6 blocks: ~0.8% chance
After 10 blocks: ~0.02% chance

The attacker can temporarily overtake the honest chain, but the honest chain grows faster on average. The system recommends waiting for enough confirmations (typically 6 for Bitcoin) to make reversal economically infeasible.

This contrasts with classical BFT where finality is instant and deterministic — once 2f+1 nodes commit, the decision is final regardless of adversarial hashrate.