Consensus Fault Tolerance · Paxos

Key Question

How do we optimize Paxos to agree on a sequence of values (a log)?

Deep Dive

Single-instance Paxos requires two rounds of communication per decision. For a replicated log with millions of entries, that’s 2 million rounds — far too expensive. Multi-Paxos optimizes by electing a stable leader who can skip Phase 1 for most decisions.

The insight: Phase 1 (Prepare/Promise) establishes that no old proposal can interfere. Once a leader is stable, no new leader will appear, so Phase 1 is unnecessary for subsequent proposals. The leader reuses the same proposal number or increments it slightly.

Multi-Paxos flow:

Phase 1 (one-time, expensive):
  Leader L sends Prepare(100) to all acceptors.
  Gets majority promises. Leader holds the "leadership slot."

Phase 2 (repeated, cheap):
  Entry 1: Leader sends Accept(101, "SET X=3"), gets majority → decided
  Entry 2: Leader sends Accept(102, "SET Y=5"), gets majority → decided
  Entry 3: Leader sends Accept(103, "SET Z=7"), gets majority → decided
  ...

Leader lease: The leader maintains a “lease” — a period during which it’s the only active proposer. Acceptors promise to reject Pre/Accept from other proposers with numbers below a threshold. If the leader crashes, a new leader must run Phase 1 again before proposing.

Log as sequence of Paxos instances:

Log position:     [1]       [2]       [3]       [4]       [5]       [6]
                 +---+     +---+     +---+     +---+     +---+     +---+
Paxos instance:  | 1 | ... | 2 | ... | 3 | ... | 4 | ... | 5 | ... | 6 | ...
                 +---+     +---+     +---+     +---+     +---+     +---+
Decided value:  "X=3"    "Y=5"    "Z=7"    "W=1"    "X=9"    "Q=0"

Each log entry is an independent Paxos instance with its own proposal numbers and acceptances. But with Multi-Paxos, they all share the same leader and skip Phase 1.

Leader election cost: When a new leader takes over, it runs Phase 1 for all log positions simultaneously. It sends Prepare to all acceptors, which respond with their entire log of accepted values for all positions. The new leader learns all previously decided values and fills any gaps.

Practical considerations:

The leader number (ballot number) increases with each leader change.
Acceptors track the highest ballot they’ve promised, across all log positions.
A new leader might not know which entries are committed — it must conservatively assume entries from the previous leader might be committed.

Check Your Understanding

How many rounds of Phase 1/Phase 2 are needed for 1000 log entries with a stable leader?
When must Phase 1 be repeated in Multi-Paxos?
What information must a new leader collect during its Phase 1 to safely resume operation?
What prevents two proposers from “livelocking” each other in basic Paxos?

The “So What?”

Multi-Paxos is the basis for most production consensus systems, including Google’s Chubby lock service and Spanner. The optimization from 2 rounds per entry to 1 round per entry (amortized) is what makes consensus practical for high-throughput systems. Amazon’s DynamoDB, Google’s Bigtable, and many other systems rely on this pattern.

✏️ Exercises

Paxos: Exercises

Exercise 1: Majority Decision

A proposer sends Prepare(5) to 3 acceptors and gets promises from 2 of them. What happens next? What if the proposer only gets 1 promise?

Exercise 2: Prior Acceptance

What does an acceptor include in its Promise if it already accepted value X under proposal number 3, and now receives Prepare(7)?

Exercise 3: Multi-Paxos

In Multi-Paxos, when must Phase 1 be repeated? Describe the exact scenario.

Exercise 4: Livelock

What prevents two proposers from “livelocking” each other — where P1 sends Prepare(5), P2 sends Prepare(6), P1 retries with Prepare(7), P2 retries with Prepare(8), ad infinitum?

👁️ View Solutions

Paxos: Solutions

Exercise 1

With 2 promises out of 3, the proposer has a majority and can proceed to Phase 2. It sends Accept(5, value) to all acceptors, where value is determined by the highest-numbered prior acceptance (if any) from the promises.

With only 1 promise out of 3, the proposer does NOT have a majority and cannot proceed to Phase 2. It must wait, pick a higher proposal number, and retry Phase 1.

Exercise 2

The acceptor includes its accepted state in the Promise. Since it accepted value X under proposal 3, the response is: Promise(7, accepted_proposal=3, accepted_value=X)

This tells the proposer: “I promise not to accept anything below 7. By the way, I previously accepted X under proposal 3. Use this value if no higher-numbered acceptance exists.”

Exercise 3

Phase 1 must be repeated when:

The current leader crashes or becomes unreachable (detected via timeout).
A new leader must be elected to take over.

The new leader sends Prepare with a higher ballot number to all acceptors. Acceptors respond with all their accepted values (for all log positions), allowing the new leader to fill in gaps and continue where the old leader left off.

Phase 1 is NOT needed between individual log entries while the leader is stable — that’s the core Multi-Paxos optimization.

Exercise 4

Two practical mechanisms prevent livelock:

Backoff: In practice, proposers use randomized backoff when their proposals fail. After a failed attempt, a proposer waits a random delay before retrying, giving the other proposer time to complete.
Leader election: Real systems elect a single leader. Only the leader acts as proposer for most operations. Other proposers defer to the leader. This eliminates concurrent proposals in normal operation.

Without these mechanisms, basic Paxos can livelock — two aggressive proposers can keep outbidding each other forever. This is why practical Paxos implementations always have leader election.