Consensus Fault Tolerance · Raft

Key Question

Once a leader is elected, how does it keep followers synchronized?

Deep Dive

Raft’s log is the authoritative record of commands. Each log entry has an index (position) and a term number. The leader ensures all followers have identical logs.

The AppendEntries flow:

Client sends "SET X=3" to Leader.

Step 1: Leader appends entry to its local log.
  Leader's log: [1:noop] [2:"SET X=3"]   (indices 1, 2)

Step 2: Leader sends AppendEntries RPC to all followers.
  RPC includes: leader_term=5, prev_log_index=1, prev_log_term=1,
                entries=[{index=2, term=5, cmd="SET X=3"}],
                leader_commit=0

Step 3: Each follower checks consistency.
  Follower A's log: [1:noop]
  Does entry at index 1 have term 1? YES → Append succeeds.
  Follower A appends: [1:noop] [2:"SET X=3"]
  Sends success to leader.

  Follower B's log: [1:"WRONG"]
  Does entry at index 1 have term 1? NO → Append fails.
  Follower B sends failure. Leader will decrement and retry.

Step 4: Leader receives majority of successful Appends.
  (3 out of 5 nodes successfully appended).
  Entry 2 is now "committed."
  Leader applies to state machine: X = 3.

Step 5: Leader notifies followers in next heartbeat.
  leader_commit = 2. Followers apply entry 2 to their state machines.

Log state diagram across nodes:

Leader (term 5):
  index:    1        2        3        4
  term:     1        1        5        5        
  cmd:    [noop]  ["SET X=3"] ["SET Y=5"] ["SET Z=7"]
                        ^
                  committed (replicated to majority)

Follower A (up-to-date):
  index:    1        2        3        4
  term:     1        1        5        5        
  cmd:    [noop]  ["SET X=3"] ["SET Y=5"] ["SET Z=7"]

Follower B (slightly behind):
  index:    1        2        3
  term:     1        1        5
  cmd:    [noop]  ["SET X=3"] ["SET Y=5"]

Follower C (far behind / just joined):
  index:    1
  term:     1
  cmd:    [noop]

Consistency check: Raft guarantees the Log Matching Property: if two logs have an entry at the same index with the same term, then all entries up to that index are identical. This is enforced by the prev_log_index and prev_log_term check in AppendEntries.

Commitment rule: An entry is committed only when the leader knows that a majority of the cluster has replicated it. Crucially, entries from a previous term are only committed by counting replicas in the current term — this prevents stale leaders from committing uncommitted entries.

Check Your Understanding

A follower’s log entry at index 5 has a different term than the leader’s. What does the leader do?
How does a leader know when an entry is committed?
What happens to uncommitted log entries if the leader crashes?

The “So What?”

Log replication is Raft’s core mechanism — it ensures all correct nodes execute the same commands in the same order, which is the definition of state machine replication. The AppendEntries consistency check is Raft’s safety guarantee: it’s a linearizability check that prevents divergent logs.

✏️ Exercises

Raft: Exercises

Exercise 1: Quorum

A Raft cluster has 5 nodes. Two followers crash. Can the remaining 3 nodes continue to operate? Can they commit new entries? What if 3 nodes crash — what happens to the remaining 2?

Exercise 2: Election and Log Freshness

Candidate A has log [(term 1), (term 1), (term 2)]. Candidate B has log [(term 1), (term 1)]. Both are running for election. Who should win based on Raft’s election restriction, and why?

Exercise 3: Uncommitted Entries After Leader Crash

A leader commits entry 5 and has entries 6 and 7 uncommitted when it crashes. A new leader is elected. What happens to entries 6 and 7?

Exercise 4: Randomized Timeouts

Why are election timeouts randomized in Raft? What problem would occur if all nodes used the same 200ms timeout?

👁️ View Solutions

Raft: Solutions

Exercise 1

With 5 nodes, majority = 3. If 2 followers crash, 3 nodes remain — that’s a majority. The cluster can elect a leader, commit entries, and continue operating normally.

If 3 nodes crash, only 2 remain. 2 is not a majority of 5. The remaining 2 nodes cannot elect a leader (no candidate can get 3 votes). They also cannot commit new entries. However, if they were previously part of a cluster with a leader, the leader (if among the 2) can continue sending heartbeats but cannot commit anything. The cluster is effectively read-only until connectivity to other nodes is restored.

This is why Raft clusters are typically deployed with odd numbers of nodes (3, 5, 7) — to maximize fault tolerance.

Exercise 2

Candidate A should win. The election restriction compares last log entries:

Candidate A: last entry is (term 2, index 3)
Candidate B: last entry is (term 1, index 2)

Comparison: term 2 > term 1. A’s log is more up-to-date. Voters will vote for A over B.

This is true even though B’s log (index 2) is shorter than A’s (index 3). The term comparison takes priority. A’s log has entries from a newer term, which might be needed for correctness.

Exercise 3

Entries 6 and 7 become uncommitted and will be overwritten by the new leader. The new leader starts with its own log (which doesn’t include entries 6 and 7, since they weren’t on a majority of nodes). The new leader will:

Accept new client requests.
Fill any gaps in its log through AppendEntries.
Eventually, entries 6 and 7’s slots will be filled with new commands from the new leader’s term.

This is safe because entries 6 and 7 were never committed — no client received confirmation for them. If a client sent those commands and didn’t get a response, it will retry with the new leader.

Exercise 4

If all nodes used the same 200ms timeout, they would all time out simultaneously and all become candidates at the same time. This would cause an election where every node votes for itself — resulting in a guaranteed split vote with no majority. The election fails, all timers reset and immediately fire again (since they’re still synchronized), creating an infinite loop of failed elections.

Randomization (150-300ms) ensures that one node almost always times out first, giving it the chance to start an election and win before others become candidates. The expected wait for the first timeout in a 5-node cluster is ~30ms — far shorter than any single timeout value.

🎮 Interactive Raft

Scroll to load interactive visualization…