Real World Architecture · ZooKeeper

Key Question

How do you build distributed locks and leader election using only ZooKeeper primitives?

Deep Dive

Leader Election Recipe

The algorithm is elegant and tiny:

All candidate clients create an ephemeral sequential ZNode under /election. Client A creates /election/candidate_0000001, Client B creates /election/candidate_0000002.
Each client calls getChildren("/election") to list all candidates.
The client with the smallest sequence number is the leader.
Every other client sets a watch on the ZNode just before theirs. Client B watches /election/candidate_0000001. Client C watches /election/candidate_0000002.
When the leader crashes, its ephemeral ZNode vanishes. The next client gets the watch notification.
Repeat from step 2.

/election
/election/candidate_0000001   ← leader (Client A)
/election/candidate_0000002   ← watches 0000001 (Client B)
/election/candidate_0000003   ← watches 0000002 (Client C)

Why watch the previous node instead of the leader? To avoid a herd effect — if all clients watched the leader, a leader crash would notify every client simultaneously, flooding ZooKeeper. Chaining watches means only one client gets notified per crash.

Distributed Lock Recipe (Write Lock)

Same idea, different goal:

Client creates an ephemeral sequential ZNode under /lock.
Client calls getChildren("/lock").
If its ZNode has the smallest sequence number, it holds the lock.
Otherwise, it watches the ZNode with the next smaller sequence number.
When that ZNode is deleted (lock released), the client gets notified and retries.

/lock
/lock/lock_0000001   ← holds lock (Client X)
/lock/lock_0000002   ← watches 0000001 (Client Y, waiting)
/lock/lock_0000003   ← watches 0000002 (Client Z, waiting)

Read/Write Lock: Read locks check only for preceding write locks. Write locks check for all preceding locks. Apache Curator (the standard ZooKeeper recipe library) implements this pattern.

ZooKeeper’s lock is not as fast as Redis’s Redlock for short critical sections, but it makes no timing assumptions — it’s safe under network partitions. This is why Kafka, Solr, and HBase rely on ZooKeeper for leader election.

Check Your Understanding

In the leader election recipe, why do non-leader clients watch the previous node instead of the leader node?
What guarantees that only one client holds the lock at a time?
How does ZooKeeper’s lock differ from a database-based lock?

The “So What?”

ZooKeeper’s create/get/set/delete are deliberately minimal. The Recipes pattern proves that from those four operations, you can build leader election (who’s in charge?), distributed locks (who gets the resource?), and group membership (who’s alive?). Apache Curator wraps these recipes into a library used by Kafka, Solr, HBase, and more — all running on exactly this logic.

✏️ Exercises

ZooKeeper: Exercises

Exercise 1

A ZooKeeper lock is held by Client A, which creates an ephemeral ZNode /lock/lock_0000005. Client B is watching, waiting for the lock. Client A’s machine suddenly loses power. Walk through exactly what happens — which ZNodes get deleted, how does Client B learn about it, and what guarantee does ZooKeeper provide that the lock is released?

Exercise 2

A developer argues: “Watches should be persistent — I don’t want to re-register them after every notification. It’s just extra code.” Explain why ZooKeeper uses one-shot watches instead of persistent ones. What failure scenarios does one-shot semantics protect against?

Exercise 3

A team decides to store user profiles (name, email, avatar URL, preferences JSON — about 400KB per profile) in ZooKeeper instead of a database. Why is this a bad idea? Reference ZooKeeper’s design constraints, ZAB protocol behavior, and use cases.

👁️ View Solutions

ZooKeeper: Exercise Solutions

Exercise 1 — Solution

Client A’s machine loses power → ZooKeeper detects the session timeout (no heartbeats).
ZooKeeper’s session management automatically deletes all ephemeral ZNodes owned by Client A’s session, including /lock/lock_0000005.
When /lock/lock_0000005 is deleted, Client B (which had set a watch on that node) receives a watch notification.
Client B calls getChildren("/lock") to list remaining lock contenders. If its ZNode now has the smallest sequence number, it acquires the lock.
Guarantee: ZooKeeper provides no false-positive lock retention — the ephemeral node cannot survive the session. The session timeout bounds the worst-case lock release delay. Network partitions may delay detection, but the lock will be released once the session expires, bounded by the configured session timeout.

Exercise 2 — Solution

One-shot watches protect against the stale-watch problem:

Scenario: A client sets a persistent watch on /config. The config changes rapidly 10 times. If the client’s notification handler is slow or blocked on garbage collection, old notifications queue up. When the client finally processes them, it acts on stale data — or worse, acts on every intermediate change instead of the latest state.
One-shot fix: The client gets one notification, then must re-register. By the time it calls get("/config", watch=true), it atomically reads the current state. It never acts on old cached data.
Another failure: A crash between notification and processing. With persistent watches, the crash loses watch state on the server but the client doesn’t know. The client restarts thinking it has active watches — it doesn’t. With one-shot semantics, the client must re-register everything on startup anyway.

Exercise 3 — Solution

Four reasons this fails:

Size limit: ZooKeeper’s hard limit is 1MB per ZNode, and 1KB is recommended. A 400KB profile saturates the ZNode, and ZAB broadcasts every write to every follower. Profile updates would consume enormous network bandwidth in the ZooKeeper ensemble just to replicate one user’s avatar URL.
Read throughput: ZooKeeper is optimized for reads (not writes), but every read goes through the leader for linearizability. With thousands of user profiles being read constantly, the leader becomes a bottleneck — and ZooKeeper is not designed for high-throughput key-value storage.
Write amplification: ZAB broadcasts every write to a quorum (at least 2 out of 3 nodes). Each write is flushed to disk. For coordination metadata (a few bytes), this is fine. For 400KB blobs updated every time a user changes their email, the disk I/O and network cost destroy performance.
Wrong tool: ZooKeeper solves coordination (leader election, service discovery, configuration). Databases solve storage (querying, indexing, replication, backups). Using ZooKeeper as a database is like using a car engine as a paperweight — it works, but you’re paying for expensive capabilities you don’t need and missing the ones you do.