Key Question
How does Raft choose a single leader to manage the cluster?
Deep Dive
Raft divides time into terms, and each term has at most one leader. Nodes are in one of three states:
- Follower: Passive. Expects periodic heartbeats from the leader. If a follower hears nothing from a leader or candidate for the length of its election timeout (chosen at random, typically from 150-300ms), it becomes a Candidate.
- Candidate: Initiates an election. Increments its term, votes for itself, and asks peers to vote. If it receives votes from a majority of the cluster, it becomes the leader.
- Leader: Handles all client requests, sends heartbeats to followers, and manages log replication.
                  Starts up
                      |
                 +----v----+
    +----------->| FOLLOWER|<-----------------+
    |            +----+----+                  |
    |                 |                       |
    |         Election timeout                |
    |          (no heartbeat)                 | discovers higher term
    |                 |                       | or loses contact
    |            +----v----+    wins    +-----+----+
    |            |CANDIDATE|----------->|  LEADER  |
    |            +----+----+  election  +----------+
    |                 |
    |          loses election
    |           or discovers
    |            higher term
    |                 |
    +-----------------+
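The role changes described above can be sketched as a small lookup table. This is an illustrative sketch, not a full Raft node: the names `Role`, `Event`, `TRANSITIONS`, and `next_role` are hypothetical, and the event set is a simplification chosen for this example.

```python
from enum import Enum, auto


class Role(Enum):
    FOLLOWER = auto()
    CANDIDATE = auto()
    LEADER = auto()


class Event(Enum):
    ELECTION_TIMEOUT = auto()   # no heartbeat within the election timeout
    WON_ELECTION = auto()       # received votes from a majority
    LOST_ELECTION = auto()      # another node established itself as leader
    HIGHER_TERM_SEEN = auto()   # received a message carrying a larger term


# (current role, event) -> next role. Pairs not listed leave the role unchanged.
TRANSITIONS = {
    (Role.FOLLOWER, Event.ELECTION_TIMEOUT): Role.CANDIDATE,
    (Role.CANDIDATE, Event.ELECTION_TIMEOUT): Role.CANDIDATE,  # split vote: retry
    (Role.CANDIDATE, Event.WON_ELECTION): Role.LEADER,
    (Role.CANDIDATE, Event.LOST_ELECTION): Role.FOLLOWER,
    (Role.CANDIDATE, Event.HIGHER_TERM_SEEN): Role.FOLLOWER,
    (Role.LEADER, Event.HIGHER_TERM_SEEN): Role.FOLLOWER,
}


def next_role(role: Role, event: Event) -> Role:
    """Return the role a node holds after `event` occurs."""
    return TRANSITIONS.get((role, event), role)
```

For example, `next_role(Role.FOLLOWER, Event.ELECTION_TIMEOUT)` returns `Role.CANDIDATE`, while a leader ignores its own election timer and stays `Role.LEADER`.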
Election process step by step (5-node cluster):
Time T0: Leader C crashes. Each remaining follower has a randomly chosen election timeout:
  Node A timeout = 180ms    Node D timeout = 270ms
  Node B timeout = 220ms    Node E timeout = 150ms
Time T0+150ms: Node E's timer fires first.
  Node E becomes Candidate.
  Increments its term (for example, from 1 to 2).
  Votes for itself: 1 vote.
  Sends RequestVote to A, B, C, D (C is down and does not reply).
Time T0+150ms to T0+180ms:
  Nodes A, B, D receive RequestVote from E.
  They check that they have not already voted in this term and that E's log is at least as up-to-date as theirs (yes, same log).
  They vote for E and reset their election timers. E ends up with 4 votes (itself + 3 others). Majority of 5 = 3, so E wins as soon as its third vote arrives.
Time T0+180ms: Node E is now Leader.
  Sends heartbeat (AppendEntries with no entries) to all nodes.
  Other nodes' election timers reset. System stable.
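The vote counting in this walkthrough can be sketched in a few lines. This is a simplified model under stated assumptions: `Node`, `handle_request_vote`, and `run_election` are hypothetical names, all logs are assumed identical (so the log check is omitted), messages are never lost, and only the nodes passed in are alive.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    name: str
    current_term: int = 1
    voted_for: Optional[str] = None

    def handle_request_vote(self, candidate: str, term: int) -> bool:
        """Grant at most one vote per term (log check omitted: logs are identical)."""
        if term < self.current_term:
            return False                    # stale candidate
        if term > self.current_term:
            self.current_term = term        # newer term: forget any earlier vote
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def run_election(timeouts_ms: Dict[str, int], cluster_size: int) -> Tuple[str, int, bool]:
    """Return (candidate, votes received, whether it won) for the first timer to fire."""
    nodes = {name: Node(name) for name in timeouts_ms}
    candidate = min(timeouts_ms, key=timeouts_ms.get)    # earliest timer fires first
    new_term = nodes[candidate].current_term + 1
    # The candidate's own vote is counted through the same handler.
    votes = sum(n.handle_request_vote(candidate, new_term) for n in nodes.values())
    majority = cluster_size // 2 + 1                     # majority of the FULL cluster
    return candidate, votes, votes >= majority


# Four live nodes in a 5-node cluster.
print(run_election({"A": 180, "B": 220, "D": 270, "E": 150}, cluster_size=5))
# ('E', 4, True)
```

Note that the majority is computed from the full cluster size, not from the number of nodes that happen to be reachable.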
Randomization prevents most split votes: each node picks its election timeout at random from 150-300ms, so usually one node times out, requests votes, and wins before any other timer fires. If a split vote does occur (no candidate gets a majority), each candidate picks a fresh random timeout and starts a new election in the next term when it expires; the candidate that times out first usually wins.
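To get a feel for how often randomized timers still collide, the following sketch estimates how frequently the two earliest timers fire within a short window of each other. It assumes timeouts are drawn uniformly from 150-300ms and that all timers start at the same instant; `new_election_timeout_ms` and `close_timeout_rate` are hypothetical helper names, and the window is a rough stand-in for the time a candidate needs to collect votes.

```python
import random

ELECTION_TIMEOUT_MIN_MS = 150.0
ELECTION_TIMEOUT_MAX_MS = 300.0


def new_election_timeout_ms(rng: random.Random) -> float:
    """Draw a fresh timeout; a node does this every time it resets its timer."""
    return rng.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)


def close_timeout_rate(nodes: int, window_ms: float, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which the two earliest timers fire less than window_ms apart."""
    rng = random.Random(seed)
    close = 0
    for _ in range(trials):
        timers = sorted(new_election_timeout_ms(rng) for _ in range(nodes))
        if timers[1] - timers[0] < window_ms:
            close += 1
    return close / trials


# Example: 4 live nodes, 10ms window (an assumed, pessimistic vote-collection time).
rate = close_timeout_rate(nodes=4, window_ms=10.0, trials=20_000)
print(f"two earliest timers within 10ms in {rate:.1%} of trials")
```

A close pair of timers does not guarantee a split vote, since one candidate may still reach a majority first, so the printed rate is an upper bound on the split-vote risk under these assumptions.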
What happens during a partition:
Before partition:  [A]---[B]---[C]---[D]---[E]      Leader = C
Partition:         [A]---[B]       [C]---[D]---[E]
                   \_______/       \_____________/
                  Side 1 (A,B)     Side 2 (C,D,E)
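Side 2 (majority, 3 nodes): C is still leader. It can reach a majority (C, D, E), so it continues to commit entries normally.
Side 1 (minority, 2 nodes): No heartbeats arrive, so A or B times out and starts an election.
  Say A becomes a candidate in term 2. It can collect at most 2 votes (its own and B's), short of the 3 needed, so it never becomes leader and nothing is committed on this side.
  A's elections keep timing out, and each retry increments its term (3, 4, ...).
When the partition heals, C sees the higher term in a message from Side 1 and steps down, which triggers a new election. A node wins only if its log is at least as up-to-date as the logs of a majority, so entries committed on Side 2 are preserved. (Many implementations add a pre-vote step to avoid this disruption.)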
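The key point in a partition is that a candidate needs votes from a majority of the full cluster, not of the nodes it can currently reach. The sketch below models repeated election attempts on one side of a partition; `attempt_elections` is a hypothetical helper, and it assumes the best case in which every reachable node grants its vote.

```python
from typing import Tuple


def attempt_elections(reachable: int, cluster_size: int,
                      start_term: int, attempts: int) -> Tuple[bool, int]:
    """Return (became_leader, term reached) after up to `attempts` candidacies.

    `reachable` counts the candidate itself plus the peers it can talk to.
    """
    majority = cluster_size // 2 + 1
    term = start_term
    for _ in range(attempts):
        term += 1                  # every new candidacy starts a new term
        votes = reachable          # best case: all reachable nodes vote yes
        if votes >= majority:
            return True, term
    return False, term


print(attempt_elections(reachable=2, cluster_size=5, start_term=1, attempts=3))
# (False, 4)  minority side: never wins, but its term keeps growing
print(attempt_elections(reachable=3, cluster_size=5, start_term=1, attempts=3))
# (True, 2)   majority side: wins on the first attempt
```

The growing term on the minority side is what forces the existing leader to step down once the partition heals, unless the implementation uses a pre-vote phase.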
Check Your Understanding
- In a Raft cluster of 5 nodes, 2 followers crash. Can the remaining 3 nodes elect a leader?
- What happens if two candidates get exactly 2 votes each in a 5-node cluster? (Split vote)
- Why are election timeouts randomized rather than fixed?
The “So What?”
Raft’s leader election is designed to be fast and understandable: randomized timeouts make split votes rare and resolve them quickly without extra coordination. The simplicity of “one leader, one term” is a large part of Raft’s appeal compared with Paxos. When operating a Raft cluster, the election timeout is one of the most important tuning knobs: too short a timeout triggers unnecessary elections (for example, when a heartbeat is merely delayed), while too long a timeout prolongs unavailability after a leader failure.
Raft: Exercises
Exercise 1: Quorum
A Raft cluster has 5 nodes. Two followers crash. Can the remaining 3 nodes continue to operate? Can they commit new entries? What if 3 nodes crash — what happens to the remaining 2?
Exercise 2: Election and Log Freshness
Candidate A has log [(term 1), (term 1), (term 2)]. Candidate B has log [(term 1), (term 1)]. Both are running for election. Who should win based on Raft’s election restriction, and why?
Exercise 3: Uncommitted Entries After Leader Crash
A leader commits entry 5 and has entries 6 and 7 uncommitted when it crashes. A new leader is elected. What happens to entries 6 and 7?
Exercise 4: Randomized Timeouts
Why are election timeouts randomized in Raft? What problem would occur if all nodes used the same 200ms timeout?
Raft: Solutions
Exercise 1
With 5 nodes, majority = 3. If 2 followers crash, 3 nodes remain — that’s a majority. The cluster can elect a leader, commit entries, and continue operating normally.
If 3 nodes crash, only 2 remain, and 2 is not a majority of 5. The remaining 2 nodes cannot elect a leader (no candidate can get 3 votes), and they cannot commit new entries. If the old leader is among the 2 survivors, it can keep sending heartbeats and appending entries to its own log, but none of them can be committed. The cluster is unavailable for writes, and for linearizable reads, until at least one crashed node recovers; at best the survivors can serve possibly stale reads.
This is why Raft clusters are typically deployed with odd numbers of nodes (3, 5, 7): adding one node to an odd-sized cluster raises the majority size without raising the number of failures tolerated. A 4-node cluster tolerates 1 failure, the same as a 3-node cluster.
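The arithmetic behind cluster sizing is short enough to write down. In this sketch, `majority` and `tolerated_failures` are hypothetical helper names; the formulas follow directly from the majority rule.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that is more than half the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can fail while a majority is still available."""
    return cluster_size - majority(cluster_size)


for n in range(3, 8):
    print(f"{n} nodes: majority = {majority(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 3 nodes: majority = 2, tolerates 1 failure(s)
# 4 nodes: majority = 3, tolerates 1 failure(s)
# 5 nodes: majority = 3, tolerates 2 failure(s)
# 6 nodes: majority = 4, tolerates 2 failure(s)
# 7 nodes: majority = 4, tolerates 3 failure(s)
```

Going from 3 to 4 nodes, or from 5 to 6, adds cost and a larger quorum without tolerating any additional failure.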
Exercise 2
Candidate A should win. The election restriction compares last log entries:
- Candidate A: last entry is (term 2, index 3)
- Candidate B: last entry is (term 1, index 2)
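Comparison: term 2 > term 1, so A’s log is more up-to-date. A voter grants its vote only if the candidate’s log is at least as up-to-date as the voter’s own, so A will refuse to vote for B, as will any other node that holds the term 2 entry, while B (and any node with a log like B’s) can vote for A.
The term comparison takes priority over length: log length is only a tiebreaker when the last terms are equal. Here A’s log is also longer (index 3 vs. index 2), but A would still be more up-to-date if B had more entries, all from term 1. A’s log holds an entry from a newer term, which may already be committed and must not be lost.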
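The comparison a voter performs can be sketched as follows. `LastEntry` and `at_least_as_up_to_date` are hypothetical names for this example; the rule itself is the one stated above: compare the terms of the last entries first, and fall back to the index only when the terms are equal.

```python
from typing import NamedTuple


class LastEntry(NamedTuple):
    term: int
    index: int


def at_least_as_up_to_date(candidate: LastEntry, voter: LastEntry) -> bool:
    """True if a voter with last entry `voter` may vote for `candidate`."""
    if candidate.term != voter.term:
        return candidate.term > voter.term    # later last term wins outright
    return candidate.index >= voter.index     # same term: longer (or equal) log wins


a = LastEntry(term=2, index=3)
b = LastEntry(term=1, index=2)

print(at_least_as_up_to_date(a, b))   # True:  B may vote for A
print(at_least_as_up_to_date(b, a))   # False: A will not vote for B
print(at_least_as_up_to_date(LastEntry(1, 5), a))   # False: longer log, but older last term
```

The third call shows why length alone is not enough: a log with five term 1 entries still loses to a log whose last entry is from term 2.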
Exercise 3
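Entries 6 and 7 were never committed, so Raft makes no promise about them. What happens depends on which node becomes the new leader:
- If the new leader’s log contains entries 6 and 7 (the old leader had replicated them to it before crashing), they survive. The new leader replicates them to the other followers, and they become committed once an entry from the new leader’s own term is committed after them.
- If the new leader’s log does not contain them, it writes new commands into slots 6 and 7. Any follower that still holds the old entries 6 and 7 deletes them when the AppendEntries consistency check finds the conflict, and copies the leader’s entries instead.
A leader never overwrites or deletes entries in its own log; it only forces followers’ logs to match it.
This is safe because no client received confirmation for entries 6 and 7. A client that sent those commands and got no response will retry with the new leader, so commands should be idempotent or carry a unique ID in case the original entry survived.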
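How a follower discards stale uncommitted entries can be sketched with a simplified AppendEntries handler. This is an illustration under assumptions, not a complete implementation: `Entry` and `append_entries` are hypothetical names, a log is a plain list of `(term, command)` pairs with 1-based indices, and commit-index and term handling are left out.

```python
from typing import List, Tuple

Entry = Tuple[int, str]   # (term, command)


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Apply a leader's AppendEntries to a follower's log; return False on mismatch."""
    # Consistency check: the follower must already hold the preceding entry.
    if prev_index > len(log):
        return False
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False
    for offset, entry in enumerate(entries):
        index = prev_index + 1 + offset           # 1-based position of this entry
        if index <= len(log):
            if log[index - 1][0] == entry[0]:
                continue                          # same term at same index: already stored
            del log[index - 1:]                   # conflict: drop it and all that follow
        log.append(entry)
    return True


# Follower still holds the old leader's uncommitted entries 6 and 7 (term 1).
follower = [(1, f"cmd{i}") for i in range(1, 8)]
# New leader (term 2) never had them and sends its own entry for slot 6.
ok = append_entries(follower, prev_index=5, prev_term=1, entries=[(2, "new6")])
print(ok, len(follower), follower[-1])   # True 6 (2, 'new6')
```

If the new leader had held the old entries, it would send entries with matching terms, the `continue` branch would be taken, and the follower’s copies of entries 6 and 7 would be kept.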
Exercise 4
If all nodes used the same 200ms timeout, followers whose timers were reset by the same heartbeat would time out at (nearly) the same moment and all become candidates together. Each would vote for itself and refuse the others, so no one would reach a majority. The election fails, every candidate waits another 200ms, and they all time out together again, potentially repeating failed elections indefinitely.
Randomization (150-300ms) makes it likely that one node times out noticeably earlier than the rest, giving it the chance to request votes and win before the others become candidates. With 4 surviving nodes in a 5-node cluster drawing timeouts uniformly from 150-300ms, the first timer is expected to fire at about 180ms, and the expected gap before the second timer fires is about 30ms, typically well above a network round trip in a local cluster.
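The timing figures for randomized timeouts follow from a standard result: for n independent draws from a uniform range of width W, the expected distance from the lower bound to the smallest draw, and from the smallest to the second-smallest draw, are both W / (n + 1). The sketch below applies it, assuming uniform timeouts and timers that all start at the same moment; the function names are hypothetical.

```python
def expected_first_timeout_ms(nodes: int, low_ms: float = 150.0,
                              high_ms: float = 300.0) -> float:
    """Mean of the smallest of `nodes` independent uniform draws from [low_ms, high_ms]."""
    return low_ms + (high_ms - low_ms) / (nodes + 1)


def expected_gap_to_second_ms(nodes: int, low_ms: float = 150.0,
                              high_ms: float = 300.0) -> float:
    """Mean gap between the smallest and second-smallest draw (needs nodes >= 2)."""
    return (high_ms - low_ms) / (nodes + 1)


# Four surviving nodes after the leader of a 5-node cluster fails.
print(expected_first_timeout_ms(4))    # 180.0
print(expected_gap_to_second_ms(4))    # 30.0
```

With a fixed 200ms timeout the gap would be zero every time, which is exactly the failure mode randomization is meant to avoid.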