Consensus Fault Tolerance · Raft

Key Question

What prevents Raft from electing a leader with an incomplete log, which could overwrite committed entries?

Deep Dive

Raft’s most important safety property: Leader Completeness. Once a log entry is committed, it appears in the logs of all future leaders. This is guaranteed by the election restriction.

Election restriction: A candidate must have a log that is at least as up-to-date as a majority of the cluster to win an election.

How log freshness is compared:

Compare the term of the last log entry. Higher term = more up-to-date.
If terms are equal, compare the index of the last log entry. Higher index = more up-to-date.

This comparison means a candidate with fewer total entries but newer entries can beat a candidate with more entries but from older terms.

Concrete scenario:

Node A (Candidate): 
  log: [1:"a"] [1:"b"] [2:"c"] [2:"d"]   last_term=2, last_index=4

Node B (Voter):
  log: [1:"a"] [1:"b"] [1:"e"] [1:"f"] [1:"g"] [3:"h"]   last_term=3, last_index=6

Node C (Voter):
  log: [1:"a"] [1:"b"] [2:"c"]   last_term=2, last_index=3

Node D (Voter):
  log: [1:"a"] [2:"c"]   last_term=2, last_index=2

In this scenario:

Node A (Candidate, term 2, index 4) sends RequestVote to B, C, D.
Node B compares: B’s last term = 3 > A’s last term = 2. B rejects (A is less up-to-date).
Node C compares: C’s last term = 2 == A’s last term = 2, C’s last index = 3 < A’s last index = 4. C accepts.
Node D compares: D’s last term = 2 == A’s last term = 2, D’s last index = 2 < A’s last index = 4. D accepts.
A has 2 votes (self + C + D = so far 3). Wait, A voted for itself too so A+C+D = 3. That’s a majority of 3 out of 4 (A counts itself + C + D). So A could win.

Wait, let me reconsider. In a 4-node cluster (A, B, C, D), majority = 3. A votes for itself (1), C votes yes (2), D votes yes (3). B votes no. A wins with 3/4.

But wait: if A wins with last_term=2, and there was a committed entry in term 3 at index 6 — could that happen? No, because an entry at index 6 in term 3 would need to be committed by a term 3 leader, and that leader would have replicated it to a majority. But A’s log doesn’t have it. The vote from C and D (both have term 2 as their last term) confirms they haven’t seen term 3 entries either. But B has! However, B is only 1 node — not a majority.

The subtlety: A can win because a different majority (C, D) hasn’t seen the term 3 entries. But commit requires a majority in the term that created the entry. If the term 3 entry at index 6 was committed, it must be on a majority of nodes. But here only B has it — so it WASN’T committed. It was uncommitted. So A’s election is safe — it won’t overwrite committed data.

Why this works: An entry committed in term T is stored on a majority of nodes. A candidate must get votes from a majority. By the pigeonhole principle, any majority of voters intersects with the majority that stored the committed entry. The intersection node(s) will have the committed entry and will reject the candidate if it’s less up-to-date than them. But the up-to-date comparison specifically compares last entry, not the committed entry. This is why Raft is subtle: the comparison ensures the candidate has all committed entries because the intersection node’s last entry is at least as recent as the committed entry, which forces the candidate’s log to contain it (via the Log Matching Property).

Check Your Understanding

A candidate’s last log entry is (term 4, index 5). A voter’s last entry is (term 4, index 7). Should the voter grant the vote?
Why can’t a leader with a less complete log accidentally overwrite a committed entry?
What happens to uncommitted log entries when a leader steps down?

The “So What?”

The election restriction guarantees that committed entries survive leader failures. This is the foundation of Raft’s safety. When a new leader takes over, you can be confident that every command a client received a confirmation for is still in the log. Without this guarantee, distributed storage would be impossible.

✏️ Exercises

Raft: Exercises

Exercise 1: Quorum

A Raft cluster has 5 nodes. Two followers crash. Can the remaining 3 nodes continue to operate? Can they commit new entries? What if 3 nodes crash — what happens to the remaining 2?

Exercise 2: Election and Log Freshness

Candidate A has log [(term 1), (term 1), (term 2)]. Candidate B has log [(term 1), (term 1)]. Both are running for election. Who should win based on Raft’s election restriction, and why?

Exercise 3: Uncommitted Entries After Leader Crash

A leader commits entry 5 and has entries 6 and 7 uncommitted when it crashes. A new leader is elected. What happens to entries 6 and 7?

Exercise 4: Randomized Timeouts

Why are election timeouts randomized in Raft? What problem would occur if all nodes used the same 200ms timeout?

👁️ View Solutions

Raft: Solutions

Exercise 1

With 5 nodes, majority = 3. If 2 followers crash, 3 nodes remain — that’s a majority. The cluster can elect a leader, commit entries, and continue operating normally.

If 3 nodes crash, only 2 remain. 2 is not a majority of 5. The remaining 2 nodes cannot elect a leader (no candidate can get 3 votes). They also cannot commit new entries. However, if they were previously part of a cluster with a leader, the leader (if among the 2) can continue sending heartbeats but cannot commit anything. The cluster is effectively read-only until connectivity to other nodes is restored.

This is why Raft clusters are typically deployed with odd numbers of nodes (3, 5, 7) — to maximize fault tolerance.

Exercise 2

Candidate A should win. The election restriction compares last log entries:

Candidate A: last entry is (term 2, index 3)
Candidate B: last entry is (term 1, index 2)

Comparison: term 2 > term 1. A’s log is more up-to-date. Voters will vote for A over B.

This is true even though B’s log (index 2) is shorter than A’s (index 3). The term comparison takes priority. A’s log has entries from a newer term, which might be needed for correctness.

Exercise 3

Entries 6 and 7 become uncommitted and will be overwritten by the new leader. The new leader starts with its own log (which doesn’t include entries 6 and 7, since they weren’t on a majority of nodes). The new leader will:

Accept new client requests.
Fill any gaps in its log through AppendEntries.
Eventually, entries 6 and 7’s slots will be filled with new commands from the new leader’s term.

This is safe because entries 6 and 7 were never committed — no client received confirmation for them. If a client sent those commands and didn’t get a response, it will retry with the new leader.

Exercise 4

If all nodes used the same 200ms timeout, they would all time out simultaneously and all become candidates at the same time. This would cause an election where every node votes for itself — resulting in a guaranteed split vote with no majority. The election fails, all timers reset and immediately fire again (since they’re still synchronized), creating an infinite loop of failed elections.

Randomization (150-300ms) ensures that one node almost always times out first, giving it the chance to start an election and win before others become candidates. The expected wait for the first timeout in a 5-node cluster is ~30ms — far shorter than any single timeout value.