Key Question
What can you do with a consistent global snapshot once you have it?
Deep Dive
A consistent snapshot is a freeze-frame of the entire distributed system. Three major applications:
1. Distributed Debugging You want to check a global invariant — e.g., “total money in the banking system is conserved.” An inconsistent snapshot could show $100 “lost” because P1’s snapshot included “sent $100” but P2’s snapshot didn’t include “received $100.” A consistent snapshot guarantees the invariant holds if it held before any transfers.
Consistent snapshot: P1={A: $50}, P2={B: $150}, in-transit={}
Total: $200 ✓
Inconsistent snapshot: P1={A: $50}, P2={B: $50}, in-transit={sent $100}
Total: $100 ✗ (money "lost")
2. Checkpointing for Fault Recovery Periodically capture consistent snapshots. If a node fails, restart all processes from the last snapshot and replay logged messages. The consistency guarantee means no message is replayed that was already delivered, and no message is missed.
3. Distributed Garbage Collection A snapshot of object references across nodes reveals which objects are reachable. Objects not referenced in the snapshot (directly or transitively) can be garbage collected. This is used in distributed actor frameworks (e.g., Orleans, Akka).
Example: In a banking system, a consistent snapshot can be used to verify that all transfers balance out. The snapshot shows account balances on each shard plus in-transit transfers. If the total matches the expected sum, the system is in a correct global state.
Check Your Understanding
- Why does an inconsistent snapshot make it look like money was “lost” in a banking system?
- How does consistent checkpointing help with fault recovery?
- What distributed system property does garbage collection rely on?
The “So What?”
Distributed debugging without consistent snapshots is like debugging a single-threaded program with a debugger that shows different threads at different points in time — chaos. Snapshots give operators and developers a reliable window into the global state, making distributed systems debuggable.
✏️ Exercises
Exercises: Chandy-Lamport Snapshots
-
Marker Reception: In the Chandy-Lamport algorithm, process P receives a marker message on channel C. P has NOT yet recorded its local state. What does P do? Be specific.
-
In-Transit Messages: Process P has already recorded its local state. It now receives a marker on channel C. Why must P record all application messages received on C between P’s state recording and this marker? What would go wrong if it didn’t?
-
Consistent Snapshot vs. Checkpoint: A single-machine system takes a “checkpoint” by pausing all threads, writing their stacks and heap to disk, and resuming. A distributed system uses Chandy-Lamport. What is the fundamental difference in approach, and why is Chandy-Lamport’s approach necessary in the distributed setting?
👁️ View Solutions
Solutions: Chandy-Lamport Snapshots
Exercise 1
P does the following:
- Records its current local state (all variables, data structures, etc.).
- Marks channel C as “recorded” with no in-transit messages (the channel the marker arrived on is considered finished).
- Sends a
MARKERmessage on every outgoing channel. - From now on, P records all application messages received on channels it has already snapped (other than C) until a marker arrives on that channel.
Exercise 2
These messages were sent after the sender took its snapshot (because the marker arrived after P’s snapshot, and markers are sent after the sender’s snapshot). But they were received before the marker on C (since they arrived before it). Without recording them, they would be “lost” — included in neither P’s state (recorded too early) nor the sender’s state (sent too late). This would violate the consistent cut property: a message was sent but is missing from the snapshot, making the snapshot inconsistent.
Exercise 3
Single-machine checkpoint: Pauses all threads globally, then captures state. Nothing moves while you capture — this is a “stop-the-world” approach. It assumes shared memory and global visibility.
Distributed snapshot (Chandy-Lamport): Processes continue running while the snapshot is taken. Markers are used to delineate the cut without global coordination. Each process records its state at a different “logical” time (when it first receives a marker), not the same wall-clock time.
Why Chandy-Lamport is necessary: In a distributed system, there is no global clock, no shared memory, and you cannot pause all processes atomically (the pause message itself would take unpredictable time to arrive). Stop-the-world would require synchronizing across an asynchronous network, which is impossible. Chandy-Lamport’s marker-based approach works without any of these assumptions.