Distributed & Decentralized Systems Curriculum
Distributed Transactions Coordination · Chandy Lamport Snapshots

Key Question

How do Chandy-Lamport use marker messages to capture a consistent snapshot?

Deep Dive

The Chandy-Lamport algorithm uses special marker messages that flow along normal communication channels. Markers are separate from application messages and serve as “snapshot now” signals.

Algorithm rules:

  1. Initiator: Any process can start a snapshot. It records its own local state and sends a MARKER message on every outgoing channel.

  2. On receiving a MARKER on channel C:

    • If this process has NOT recorded its state yet: it records its local state now, marks channel C as “empty” (no messages to record for C), and sends MARKER on all its outgoing channels.
    • If this process HAS already recorded its state: it records all messages received on channel C after it recorded its state and before receiving this marker. These are “in-transit” messages.

Two-process example:

P1 ---[channel A]---> P2
P1 <---[channel B]--- P2

Step 1: P1 initiates. Records state S1. Sends MARKER on A.
Step 2: P2 receives MARKER on A. Records state S2. Sends MARKER on B.
         P2 also records messages on A received before the marker (none here).
Step 3: P1 receives MARKER on B. P1 has already snapped, so it records all
         messages received on B since its snapshot → these are in-transit.

Why this works: The marker message acts as a “cut” in the message stream. By the time a marker reaches a process, that process knows the sender has already taken its snapshot. Any messages that cross the cut after the sender’s snapshot but before the receiver’s snapshot must be recorded as in-transit — and the marker messages on each channel ensure this happens correctly.

Check Your Understanding

  1. What triggers a process to record its local state?
  2. What does a process do if it receives a marker after already recording its state?
  3. In the two-process example, which messages end up as “in-transit”?

The “So What?”

Chandy-Lamport is the foundation for almost every distributed snapshot protocol in production: Apache Flink checkpoints, Apache Kafka MirrorMaker, Microsoft’s Orleans virtual actors, and Google’s Percolator. The marker concept appears wherever you need consistent state without pausing.


✏️ Exercises

Exercises: Chandy-Lamport Snapshots

  1. Marker Reception: In the Chandy-Lamport algorithm, process P receives a marker message on channel C. P has NOT yet recorded its local state. What does P do? Be specific.

  2. In-Transit Messages: Process P has already recorded its local state. It now receives a marker on channel C. Why must P record all application messages received on C between P’s state recording and this marker? What would go wrong if it didn’t?

  3. Consistent Snapshot vs. Checkpoint: A single-machine system takes a “checkpoint” by pausing all threads, writing their stacks and heap to disk, and resuming. A distributed system uses Chandy-Lamport. What is the fundamental difference in approach, and why is Chandy-Lamport’s approach necessary in the distributed setting?

👁️ View Solutions

Solutions: Chandy-Lamport Snapshots

Exercise 1

P does the following:

  1. Records its current local state (all variables, data structures, etc.).
  2. Marks channel C as “recorded” with no in-transit messages (the channel the marker arrived on is considered finished).
  3. Sends a MARKER message on every outgoing channel.
  4. From now on, P records all application messages received on channels it has already snapped (other than C) until a marker arrives on that channel.

Exercise 2

These messages were sent after the sender took its snapshot (because the marker arrived after P’s snapshot, and markers are sent after the sender’s snapshot). But they were received before the marker on C (since they arrived before it). Without recording them, they would be “lost” — included in neither P’s state (recorded too early) nor the sender’s state (sent too late). This would violate the consistent cut property: a message was sent but is missing from the snapshot, making the snapshot inconsistent.

Exercise 3

Single-machine checkpoint: Pauses all threads globally, then captures state. Nothing moves while you capture — this is a “stop-the-world” approach. It assumes shared memory and global visibility.

Distributed snapshot (Chandy-Lamport): Processes continue running while the snapshot is taken. Markers are used to delineate the cut without global coordination. Each process records its state at a different “logical” time (when it first receives a marker), not the same wall-clock time.

Why Chandy-Lamport is necessary: In a distributed system, there is no global clock, no shared memory, and you cannot pause all processes atomically (the pause message itself would take unpredictable time to arrive). Stop-the-world would require synchronizing across an asynchronous network, which is impossible. Chandy-Lamport’s marker-based approach works without any of these assumptions.