Consensus Fault Tolerance · Raft

Key Question

How does Raft prevent the log from growing infinitely long?

Deep Dive

A Raft log grows with every command. After millions of operations, the log consumes enormous memory and disk space. Worse, when a new or restarted node joins the cluster, it must replay the entire log — which could take hours. Snapshots solve this.

What is a snapshot?

The leader periodically freezes its state machine, serializes the entire state (e.g., all key-value pairs), saves it to stable storage, and discards all log entries up to that point.

Before snapshot:
  Log: [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] ... [1000000]
       |--- entries 1-1000000 ----|
                        |
                  State machine: X=3, Y=7, Z=42, ...

After snapshot at index 1000000:
  Log:                               [1000001] [1000002] ...
       ^----- entries discarded -----^
  
  Snapshot: last_included_index=1000000, last_included_term=42
            State: { X=3, Y=7, Z=42, ... }

InstallSnapshot RPC: When a follower is far behind (or just joined), the leader sends the snapshot via InstallSnapshot. The follower:

Receives snapshot chunks.
Installs the snapshot (replaces its state machine).
Sets its log to start at last_included_index + 1.
Discards old log entries.

Leader                          Slow follower
  |                                  |
  |--- AppendEntries (rejected,     |
  |    prev_log_index out of date)  |
  |--- AppendEntries (rejected)     |
  |--- InstallSnapshot(chunk 1) --->|
  |--- InstallSnapshot(chunk 2) --->|
  |--- InstallSnapshot(done) ------>|
  |                                  |  installs snapshot
  |--- AppendEntries (now works!) ->|  log matches

Snapshot frequency trade-off:

Snapshot too often (every 100 entries):
  + Low memory usage, fast recovery for new nodes
  - High I/O overhead from frequent snapshots
  - Short-lived state changes cause wasted writes

Snapshot too rarely (every 1M entries):
  + Low overhead during normal operation
  - High memory usage
  - Very slow recovery for new nodes
  - Expensive if snapshot fails (must redo)

Production Raft implementations typically snapshot based on log size (e.g., every N megabytes) rather than entry count.

Safety of snapshots: Snapshots are safe because the state machine reflects only committed entries. Uncommitted entries are NOT included — they would make the snapshot unreliable. After installing a snapshot, the follower may have missed some committed entries it hasn’t applied yet, but the snapshot state is at least as current as the follower’s own state machine.

Check Your Understanding

What information must a snapshot include besides the state machine data?
What happens if a snapshot fails partway through (e.g., leader crashes during InstallSnapshot)?
Why can’t uncommitted entries be included in a snapshot?

The “So What?”

Log compaction is what makes Raft practical for long-running systems. Production Raft implementations (etcd, Consul, MongoDB) snapshot frequently — typically every few thousand entries or when the log exceeds a size threshold. Without snapshots, a node that’s been down for a day would need to replay millions of operations, which is impractical. The snapshot is also essential for Raft-based databases to support point-in-time recovery.

✏️ Exercises

Raft: Exercises

Exercise 1: Quorum

A Raft cluster has 5 nodes. Two followers crash. Can the remaining 3 nodes continue to operate? Can they commit new entries? What if 3 nodes crash — what happens to the remaining 2?

Exercise 2: Election and Log Freshness

Candidate A has log [(term 1), (term 1), (term 2)]. Candidate B has log [(term 1), (term 1)]. Both are running for election. Who should win based on Raft’s election restriction, and why?

Exercise 3: Uncommitted Entries After Leader Crash

A leader commits entry 5 and has entries 6 and 7 uncommitted when it crashes. A new leader is elected. What happens to entries 6 and 7?

Exercise 4: Randomized Timeouts

Why are election timeouts randomized in Raft? What problem would occur if all nodes used the same 200ms timeout?

👁️ View Solutions

Raft: Solutions

Exercise 1

With 5 nodes, majority = 3. If 2 followers crash, 3 nodes remain — that’s a majority. The cluster can elect a leader, commit entries, and continue operating normally.

If 3 nodes crash, only 2 remain. 2 is not a majority of 5. The remaining 2 nodes cannot elect a leader (no candidate can get 3 votes). They also cannot commit new entries. However, if they were previously part of a cluster with a leader, the leader (if among the 2) can continue sending heartbeats but cannot commit anything. The cluster is effectively read-only until connectivity to other nodes is restored.

This is why Raft clusters are typically deployed with odd numbers of nodes (3, 5, 7) — to maximize fault tolerance.

Exercise 2

Candidate A should win. The election restriction compares last log entries:

Candidate A: last entry is (term 2, index 3)
Candidate B: last entry is (term 1, index 2)

Comparison: term 2 > term 1. A’s log is more up-to-date. Voters will vote for A over B.

This is true even though B’s log (index 2) is shorter than A’s (index 3). The term comparison takes priority. A’s log has entries from a newer term, which might be needed for correctness.

Exercise 3

Entries 6 and 7 become uncommitted and will be overwritten by the new leader. The new leader starts with its own log (which doesn’t include entries 6 and 7, since they weren’t on a majority of nodes). The new leader will:

Accept new client requests.
Fill any gaps in its log through AppendEntries.
Eventually, entries 6 and 7’s slots will be filled with new commands from the new leader’s term.

This is safe because entries 6 and 7 were never committed — no client received confirmation for them. If a client sent those commands and didn’t get a response, it will retry with the new leader.

Exercise 4

If all nodes used the same 200ms timeout, they would all time out simultaneously and all become candidates at the same time. This would cause an election where every node votes for itself — resulting in a guaranteed split vote with no majority. The election fails, all timers reset and immediately fire again (since they’re still synchronized), creating an infinite loop of failed elections.

Randomization (150-300ms) ensures that one node almost always times out first, giving it the chance to start an election and win before others become candidates. The expected wait for the first timeout in a 5-node cluster is ~30ms — far shorter than any single timeout value.