Real World Architecture · ZooKeeper

Key Question

How does ZooKeeper’s ZAB protocol differ from Paxos and Raft?

Deep Dive

ZAB = ZooKeeper Atomic Broadcast. It is not a consensus algorithm like Paxos or Raft. Consensus decides on one value. ZAB guarantees total order broadcast — all messages are delivered to all nodes in the same order, and the same leader drives all writes.

                        ┌──────────────┐
                        │   Leader     │
                        │  (epoch 42)  │
                        └──────┬───────┘
                 ┌──────────────┼──────────────┐
                 │              │              │
            ┌────▼───┐    ┌────▼───┐    ┌────▼───┐
            │Follower│    │Follower│    │Follower│
            │   A     │    │   B     │    │   C     │
            └────────┘    └────────┘    └────────┘

Key differences from Raft:

Property	Raft	ZAB
Purpose	Consensus (one value at a time)	Atomic broadcast (ordered stream)
Leader term	Term number	Epoch number
Discovery	Leader pushes heartbeats	Peers discover via “peer” protocol
Ordering	Log index per entry	Primary-order: two-phase broadcast
Recovery	Leader election → log repl.	Leader election → epoch sync → catch-up

The “peer” discovery — ZAB followers don’t just wait for leader heartbeats. They actively probe peers to find the current leader and epoch. This makes recovery more robust at the cost of a slightly longer discovery phase.

Primary-order guarantee: The leader broadcasts a PROPOSAL to all followers. When a quorum acknowledges, the leader sends a COMMIT. Followers deliver transactions in the exact order the leader sent them. This gives ZooKeeper its linearizable writes — once a write is committed, every subsequent read sees it.

Leader                    Follower
  │                          │
  ├── PROPOSE(txn) ────────►│
  │                          │
  │◄───── ACK ──────────────┤
  │                          │
  ├── COMMIT(txn) ─────────►│  (delivered when quorum ACKs)
  │                          │

Epoch numbers serve the same role as Raft terms, but ZAB uses them differently during recovery. After a leader election, the new leader establishes a unique epoch and forces all followers to synchronize to its transaction log. Any uncommitted proposals from the previous epoch are discarded.

Check Your Understanding

Why does ZAB guarantee atomic broadcast instead of just single-value consensus?
What problem does the “epoch number” solve during leader recovery?
What happens to uncommitted proposals from a previous leader’s epoch?

The “So What?”

ZAB is not “Raft, but different” — it solves a harder problem. ZooKeeper needs every transaction delivered in order to every node so that the ZNode tree stays consistent across the cluster. If ZooKeeper used vanilla Paxos or Raft, it would need to wrap an additional broadcast layer on top. ZAB bakes ordering into the protocol itself.

✏️ Exercises

ZooKeeper: Exercises

Exercise 1

A ZooKeeper lock is held by Client A, which creates an ephemeral ZNode /lock/lock_0000005. Client B is watching, waiting for the lock. Client A’s machine suddenly loses power. Walk through exactly what happens — which ZNodes get deleted, how does Client B learn about it, and what guarantee does ZooKeeper provide that the lock is released?

Exercise 2

A developer argues: “Watches should be persistent — I don’t want to re-register them after every notification. It’s just extra code.” Explain why ZooKeeper uses one-shot watches instead of persistent ones. What failure scenarios does one-shot semantics protect against?

Exercise 3

A team decides to store user profiles (name, email, avatar URL, preferences JSON — about 400KB per profile) in ZooKeeper instead of a database. Why is this a bad idea? Reference ZooKeeper’s design constraints, ZAB protocol behavior, and use cases.

👁️ View Solutions

ZooKeeper: Exercise Solutions

Exercise 1 — Solution

Client A’s machine loses power → ZooKeeper detects the session timeout (no heartbeats).
ZooKeeper’s session management automatically deletes all ephemeral ZNodes owned by Client A’s session, including /lock/lock_0000005.
When /lock/lock_0000005 is deleted, Client B (which had set a watch on that node) receives a watch notification.
Client B calls getChildren("/lock") to list remaining lock contenders. If its ZNode now has the smallest sequence number, it acquires the lock.
Guarantee: ZooKeeper provides no false-positive lock retention — the ephemeral node cannot survive the session. The session timeout bounds the worst-case lock release delay. Network partitions may delay detection, but the lock will be released once the session expires, bounded by the configured session timeout.

Exercise 2 — Solution

One-shot watches protect against the stale-watch problem:

Scenario: A client sets a persistent watch on /config. The config changes rapidly 10 times. If the client’s notification handler is slow or blocked on garbage collection, old notifications queue up. When the client finally processes them, it acts on stale data — or worse, acts on every intermediate change instead of the latest state.
One-shot fix: The client gets one notification, then must re-register. By the time it calls get("/config", watch=true), it atomically reads the current state. It never acts on old cached data.
Another failure: A crash between notification and processing. With persistent watches, the crash loses watch state on the server but the client doesn’t know. The client restarts thinking it has active watches — it doesn’t. With one-shot semantics, the client must re-register everything on startup anyway.

Exercise 3 — Solution

Four reasons this fails:

Size limit: ZooKeeper’s hard limit is 1MB per ZNode, and 1KB is recommended. A 400KB profile saturates the ZNode, and ZAB broadcasts every write to every follower. Profile updates would consume enormous network bandwidth in the ZooKeeper ensemble just to replicate one user’s avatar URL.
Read throughput: ZooKeeper is optimized for reads (not writes), but every read goes through the leader for linearizability. With thousands of user profiles being read constantly, the leader becomes a bottleneck — and ZooKeeper is not designed for high-throughput key-value storage.
Write amplification: ZAB broadcasts every write to a quorum (at least 2 out of 3 nodes). Each write is flushed to disk. For coordination metadata (a few bytes), this is fine. For 400KB blobs updated every time a user changes their email, the disk I/O and network cost destroy performance.
Wrong tool: ZooKeeper solves coordination (leader election, service discovery, configuration). Databases solve storage (querying, indexing, replication, backups). Using ZooKeeper as a database is like using a car engine as a paperweight — it works, but you’re paying for expensive capabilities you don’t need and missing the ones you do.