Key Question
How can you capture a consistent global state of a distributed system without stopping all processes?
Deep Dive
A “global snapshot” of a distributed system is the combination of (a) each process’s local state and (b) every message currently in transit across channels. The naive way: pause every process, copy their state, drain the channels. This is a global checkpoint — and it works, but it stops all work.
We want a snapshot taken while processes keep running. The challenge: if you take each process’s snapshot at a different “real” time, you can get a bogus result.
Inconsistent snapshot:
P1: sends message M to P2
|
| P1's snapshot taken here: state shows "M has been sent"
v
~~~~~~~~ network ~~~~~~~~
|
| P2's snapshot taken here: state does NOT show "M received"
v
P2: hasn't received M yet
Result: M appears to have been created out of thin air!
This violates physical causality. If a message receipt is included in the snapshot, the corresponding send must also be included. Conversely, if a send is included but the receipt is excluded, the message “appears” from nowhere.
The formal condition: a consistent cut — for every message receipt recorded in the snapshot, the corresponding message send is also recorded. Equivalently: if event e happened before event f (in Lamport’s happens-before relation), and f is in the snapshot, then e must also be in the snapshot.
Check Your Understanding
- What two things make up a global snapshot?
- What is an inconsistent snapshot?
- What is the formal definition of a consistent cut?
The “So What?”
Without consistent snapshots, distributed debugging is useless (you’d see impossible states), distributed checkpointing causes corruption on restore, and stream processors like Flink cannot guarantee exactly-once semantics. The Chandy-Lamport algorithm (next lesson) solves this without stopping the system.
✏️ Exercises
Exercises: Chandy-Lamport Snapshots
-
Marker Reception: In the Chandy-Lamport algorithm, process P receives a marker message on channel C. P has NOT yet recorded its local state. What does P do? Be specific.
-
In-Transit Messages: Process P has already recorded its local state. It now receives a marker on channel C. Why must P record all application messages received on C between P’s state recording and this marker? What would go wrong if it didn’t?
-
Consistent Snapshot vs. Checkpoint: A single-machine system takes a “checkpoint” by pausing all threads, writing their stacks and heap to disk, and resuming. A distributed system uses Chandy-Lamport. What is the fundamental difference in approach, and why is Chandy-Lamport’s approach necessary in the distributed setting?
👁️ View Solutions
Solutions: Chandy-Lamport Snapshots
Exercise 1
P does the following:
- Records its current local state (all variables, data structures, etc.).
- Marks channel C as “recorded” with no in-transit messages (the channel the marker arrived on is considered finished).
- Sends a
MARKERmessage on every outgoing channel. - From now on, P records all application messages received on channels it has already snapped (other than C) until a marker arrives on that channel.
Exercise 2
These messages were sent after the sender took its snapshot (because the marker arrived after P’s snapshot, and markers are sent after the sender’s snapshot). But they were received before the marker on C (since they arrived before it). Without recording them, they would be “lost” — included in neither P’s state (recorded too early) nor the sender’s state (sent too late). This would violate the consistent cut property: a message was sent but is missing from the snapshot, making the snapshot inconsistent.
Exercise 3
Single-machine checkpoint: Pauses all threads globally, then captures state. Nothing moves while you capture — this is a “stop-the-world” approach. It assumes shared memory and global visibility.
Distributed snapshot (Chandy-Lamport): Processes continue running while the snapshot is taken. Markers are used to delineate the cut without global coordination. Each process records its state at a different “logical” time (when it first receives a marker), not the same wall-clock time.
Why Chandy-Lamport is necessary: In a distributed system, there is no global clock, no shared memory, and you cannot pause all processes atomically (the pause message itself would take unpredictable time to arrive). Stop-the-world would require synchronizing across an asynchronous network, which is impossible. Chandy-Lamport’s marker-based approach works without any of these assumptions.