Distributed & Decentralized Systems Curriculum
Real World Architecture · ZooKeeper

Key Question

How does ZooKeeper notify clients of changes without polling?

Deep Dive

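A watch is a one-time trigger. A client sets a watch on a ZNode as part of a read, for example get("/path", watch=true). When that ZNode changes in a way the watch covers, ZooKeeper sends the client a notification: a data watch (set by get or exists) fires when the node’s data is updated or the node is deleted, and a child watch (set by getChildren) fires when a child is added or removed. One notification, then the watch is gone.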

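The notification is delivered BEFORE the client can observe the changed data, and it carries only the path and the event type, not the new value. To see the new data the client must call get() again, which also re-registers the watch. Because the read and the watch registration happen in a single call, no change can slip in between reading and watching. Several changes made before the client re-registers may collapse into one notification, but the follow-up get() always returns the latest state.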

Timeline:

Client A                     ZooKeeper                   Client B
  │                              │                          │
  ├── get(/config, watch) ──────►│                          │
  │                              │                          │
  │                              │◄──── set(/config) ───────┤
  │                              │                          │
  │◄──── watch notification ─────┤                          │
  │                              │                          │
  ├── get(/config, watch) ──────►│  (re-register watch)     │
  │◄──── new config data ────────┤                          │
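
The timeline can be sketched with a small in-memory model. This is a toy stand-in, not the ZooKeeper client API: the ToyZooKeeper class and the helper functions are hypothetical names chosen for illustration, and everything runs in one process so there is no network or session handling.

```python
from typing import Callable, Dict, List, Optional

Watcher = Callable[[str], None]


class ToyZooKeeper:
    """In-memory stand-in for a ZooKeeper server (hypothetical, not the real API)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._watches: Dict[str, List[Watcher]] = {}

    def get(self, path: str, watch: Optional[Watcher] = None) -> bytes:
        # Registering the watch and reading the data happen in one call,
        # so no change can land between the two.
        if watch is not None:
            self._watches.setdefault(path, []).append(watch)
        return self._data[path]

    def set(self, path: str, value: bytes) -> None:
        self._data[path] = value
        # One-shot: the watchers are removed before they are notified.
        for watcher in self._watches.pop(path, []):
            watcher(path)


def run_demo() -> List[bytes]:
    """Client A re-arms the watch inside its handler and sees every update."""
    zk = ToyZooKeeper()
    zk.set("/config", b"v1")
    seen: List[bytes] = []

    def on_change(path: str) -> None:
        # The notification carries no data: re-read and re-register together.
        seen.append(zk.get(path, watch=on_change))

    seen.append(zk.get("/config", watch=on_change))  # initial read + watch
    zk.set("/config", b"v2")  # Client B writes; the watch fires and is re-armed
    zk.set("/config", b"v3")  # fires again only because the handler re-armed it
    return seen


def notifications_without_rearm() -> int:
    """A handler that never re-registers is told about the first change only."""
    zk = ToyZooKeeper()
    zk.set("/config", b"v1")
    fired: List[str] = []
    zk.get("/config", watch=fired.append)
    zk.set("/config", b"v2")
    zk.set("/config", b"v3")  # no watch is left, so nobody is notified
    return len(fired)
```

In this sketch run_demo() returns the three values in order, while notifications_without_rearm() counts a single notification for two writes, which is the one-time-trigger behavior described above.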

Service discovery in action:

  1. Worker starts, creates ephemeral ZNode /workers/worker_42
  2. Master sets a watch on /workers (children watch)
  3. Worker crashes → ephemeral ZNode auto-deletes
  4. ZooKeeper notifies Master: /workers children changed
  5. Master re-registers the watch, lists remaining workers

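Since the watch is one-shot, the Master must re-register every time. This is deliberate: it prevents stale watches from piling up on the server and forces the client to re-read the current state after every notification instead of reacting to a backlog of events. (ZooKeeper 3.6.0 and later also offer opt-in persistent watches through addWatch; the one-shot model described here remains the behavior of watches set through get, exists, and getChildren.)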

Before crash:                    After crash:
  /workers                         /workers
  /workers/worker_41               /workers/worker_41
  /workers/worker_42  ← crash ──→  (deleted)
              ↑
         Master watches /workers
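
The service-discovery steps can be sketched the same way. The model below is a simplified assumption, not ZooKeeper itself: ToyEnsemble, expire_session, and the integer session ids are hypothetical, and session expiry is triggered by hand rather than by missed heartbeats.

```python
from typing import Callable, Dict, List, Optional

ChildWatcher = Callable[[str], None]


class ToyEnsemble:
    """Toy model of ephemeral ZNodes and one-shot child watches (not the real API)."""

    def __init__(self) -> None:
        self._owner: Dict[str, int] = {}  # ZNode path -> owning session id
        self._child_watches: Dict[str, List[ChildWatcher]] = {}

    def create_ephemeral(self, path: str, session_id: int) -> None:
        self._owner[path] = session_id
        self._fire(path.rsplit("/", 1)[0])

    def get_children(self, parent: str, watch: Optional[ChildWatcher] = None) -> List[str]:
        # Listing the children and setting the child watch are one call.
        if watch is not None:
            self._child_watches.setdefault(parent, []).append(watch)
        prefix = parent + "/"
        return sorted(
            p[len(prefix):]
            for p in self._owner
            if p.startswith(prefix) and "/" not in p[len(prefix):]
        )

    def expire_session(self, session_id: int) -> None:
        # Every ephemeral ZNode owned by the expired session is deleted.
        doomed = [p for p, s in self._owner.items() if s == session_id]
        for path in doomed:
            del self._owner[path]
            self._fire(path.rsplit("/", 1)[0])

    def _fire(self, parent: str) -> None:
        # One-shot: watchers are removed before being called.
        for watcher in self._child_watches.pop(parent, []):
            watcher(parent)


def run_discovery() -> List[List[str]]:
    zk = ToyEnsemble()
    zk.create_ephemeral("/workers/worker_41", session_id=1)
    zk.create_ephemeral("/workers/worker_42", session_id=2)
    views: List[List[str]] = []

    def on_children_changed(parent: str) -> None:
        # Step 5: re-register the watch and list the remaining workers.
        views.append(zk.get_children(parent, watch=on_children_changed))

    views.append(zk.get_children("/workers", watch=on_children_changed))  # step 2
    zk.expire_session(2)  # steps 3 and 4: worker_42's session is gone
    return views
```

Here run_discovery() records two views of /workers: both workers before the crash, and only worker_41 afterwards.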

Check Your Understanding

  1. Why are watches one-shot instead of persistent?
  2. If a client sets a watch, gets notified, and re-registers, could it miss a change that happened between the notification and re-registration?
  3. What happens to watches on an ephemeral ZNode when the client session that owns the node crashes?

The “So What?”

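Without watches, clients would need to poll, flooding ZooKeeper with get() calls every few seconds. Watches turn this into a push model: clients sit idle until something interesting happens. This is how systems like Apache Kafka (in its ZooKeeper-based mode) learn of a broker failure as soon as the broker’s session expires and its ephemeral ZNode is deleted, without wasting CPU on busy-waiting. The notification itself is near-immediate; for a crashed broker, the detection delay is dominated by the configured session timeout.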
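
A rough back-of-the-envelope comparison shows the difference in load. The client count, polling interval, and change rate below are made-up illustrative numbers, and the two helper functions are hypothetical; the watch estimate assumes each client does one initial read and then, per change, receives one notification and issues one re-registering read.

```python
def polling_requests(clients: int, poll_interval_s: float, window_s: float) -> int:
    """Reads issued when every client polls at a fixed interval."""
    return int(clients * (window_s / poll_interval_s))


def watch_messages(clients: int, changes: int) -> int:
    """Messages with one-shot watches: one initial read per client, then
    one notification plus one re-registering read per client per change."""
    return clients + clients * changes * 2


# Assumed workload: 500 clients, one hour, config changes 3 times.
polled = polling_requests(clients=500, poll_interval_s=2, window_s=3600)
pushed = watch_messages(clients=500, changes=3)
```

Under these assumptions polling costs 900,000 reads in the hour, while the watch-based approach exchanges 3,500 messages, almost all of them at the moments when something actually changed.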


ZooKeeper: Exercises

Exercise 1

A ZooKeeper lock is held by Client A, which creates an ephemeral ZNode /lock/lock_0000005. Client B is watching, waiting for the lock. Client A’s machine suddenly loses power. Walk through exactly what happens: which ZNodes get deleted, how does Client B learn about it, and what guarantee does ZooKeeper provide that the lock is released?


Exercise 2

A developer argues: “Watches should be persistent. I don’t want to re-register them after every notification. It’s just extra code.” Explain why ZooKeeper uses one-shot watches instead of persistent ones. What failure scenarios does one-shot semantics protect against?


Exercise 3

A team decides to store user profiles (name, email, avatar URL, preferences JSON; about 400KB per profile) in ZooKeeper instead of a database. Why is this a bad idea? Reference ZooKeeper’s design constraints, ZAB protocol behavior, and use cases.


ZooKeeper: Exercise Solutions

Exercise 1: Solution

  1. Client A’s machine loses power → ZooKeeper detects the session timeout (no heartbeats).
  2. ZooKeeper’s session management automatically deletes all ephemeral ZNodes owned by Client A’s session, including /lock/lock_0000005.
  3. When /lock/lock_0000005 is deleted, Client B (which had set a watch on that node) receives a watch notification.
  4. Client B calls getChildren("/lock") to list remaining lock contenders. If its ZNode now has the smallest sequence number, it acquires the lock.
  5. Guarantee: an ephemeral ZNode cannot outlive the session that created it, so a dead client cannot keep holding the lock. Network partitions may delay detection, but the lock is released once the session expires, so the configured session timeout bounds the worst-case release delay.
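
The walkthrough above can be sketched as a small simulation. It is a toy under stated assumptions, not the real lock recipe code: ToyLockService, its heartbeat bookkeeping, the 10-second timeout, and the explicit tick(now) clock are all hypothetical, and sequence numbers start at 5 only to reproduce the ZNode name from the exercise.

```python
from typing import Dict, List, Optional, Tuple


class ToyLockService:
    """Toy model of an ephemeral, sequential lock under /lock (not the real API)."""

    def __init__(self, session_timeout_s: float, first_seq: int = 0) -> None:
        self.session_timeout_s = session_timeout_s
        self._next_seq = first_seq
        self._nodes: Dict[str, str] = {}  # ZNode name -> owning client
        self._last_heartbeat: Dict[str, float] = {}

    def heartbeat(self, client: str, now: float) -> None:
        self._last_heartbeat[client] = now

    def enqueue(self, client: str, now: float) -> str:
        self.heartbeat(client, now)
        name = f"lock_{self._next_seq:07d}"
        self._next_seq += 1
        self._nodes[name] = client
        return name

    def holder(self) -> Optional[str]:
        # The contender with the smallest sequence number holds the lock.
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]

    def tick(self, now: float) -> List[str]:
        """Expire silent sessions and delete their ephemeral ZNodes."""
        expired = [
            c for c, t in self._last_heartbeat.items()
            if now - t > self.session_timeout_s
        ]
        deleted: List[str] = []
        for client in expired:
            del self._last_heartbeat[client]
            for name in [n for n, c in self._nodes.items() if c == client]:
                del self._nodes[name]
                deleted.append(name)
        return sorted(deleted)


def run_lock_demo() -> Tuple[Tuple[List[str], Optional[str]], Tuple[List[str], Optional[str]]]:
    svc = ToyLockService(session_timeout_s=10.0, first_seq=5)
    svc.enqueue("A", now=0.0)  # creates lock_0000005; A holds the lock
    svc.enqueue("B", now=0.0)  # creates lock_0000006; B waits
    # A loses power right after t=0 and never heartbeats again; B keeps going.
    svc.heartbeat("B", now=8.0)
    before = (svc.tick(now=9.0), svc.holder())   # A is silent but not yet expired
    after = (svc.tick(now=11.0), svc.holder())   # A's session timed out
    return before, after
```

In this run the lock stays with A at t=9 even though A is already dead, and passes to B at t=11 once the timeout has elapsed, which illustrates why the session timeout is the upper bound on how long a crashed holder can block others.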

Exercise 2: Solution

One-shot watches protect against the stale-watch problem:

  • Scenario: A client sets a persistent watch on /config. The config changes rapidly 10 times. If the client’s notification handler is slow or blocked on garbage collection, old notifications queue up. When the client finally processes them, it acts on stale data or, worse, acts on every intermediate change instead of the latest state.
  • One-shot fix: The client gets one notification, then must re-register. Its call to get("/config", watch=true) reads the current state and sets the next watch in one atomic step, so it acts on the latest data rather than on a backlog of old events.
  • Another failure: The client crashes between notification and processing. Its server-side watch state is lost with the session, but a client written around persistent watches may restart assuming its watches are still active when they are not. With one-shot semantics, the client must re-register everything on startup anyway.
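
The first scenario can be made concrete by counting how many notifications wait for a stalled handler. The function below is a hypothetical model of the two designs compared in this solution, not ZooKeeper code: it assumes the handler is blocked for the whole burst of updates and that a persistent watch queues one event per change.

```python
def queued_notifications(change_count: int, one_shot: bool) -> int:
    """Count notifications queued for a handler that is stalled while
    `change_count` updates land on the watched ZNode."""
    armed = True
    queued = 0
    for _ in range(change_count):
        if armed:
            queued += 1
            if one_shot:
                armed = False  # disarmed until the client re-registers
    return queued


persistent_backlog = queued_notifications(10, one_shot=False)
one_shot_backlog = queued_notifications(10, one_shot=True)
```

With 10 rapid changes, the persistent model leaves 10 events for the handler to work through, while the one-shot model leaves one, after which a single get() returns the latest state.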

Exercise 3: Solution

Four reasons this fails:

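  1. Size limit: ZooKeeper rejects ZNodes larger than about 1MB by default (the jute.maxbuffer setting), and roughly 1KB or less is the recommended size. A 400KB profile is hundreds of times larger than that recommendation, and because a write replaces the whole ZNode, ZAB replicates the full 400KB to the followers on every update. Profile updates would consume enormous network bandwidth in the ZooKeeper ensemble just to replicate a change to one user’s avatar URL.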

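  2. Read path and memory: ZooKeeper is optimized for reads (not writes) because each server answers reads from its own in-memory copy of the data tree. Reads do not go through the leader, and they are sequentially consistent rather than linearizable unless the client calls sync() first. That design assumes a small data set: every server must hold every profile in memory and in its snapshots, so thousands of 400KB profiles inflate heap usage, snapshot size, and recovery time. ZooKeeper is not designed for high-throughput key-value storage.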

  3. Write amplification: The leader sends every write to its followers and commits it once a quorum (at least 2 out of 3 nodes) has acknowledged it. Each server flushes the write to its transaction log on disk before acknowledging. For coordination metadata (a few bytes), this is fine. For 400KB blobs rewritten every time a user changes their email, the disk I/O and network cost destroy performance.

  4. Wrong tool: ZooKeeper solves coordination (leader election, service discovery, configuration). Databases solve storage (querying, indexing, replication, backups). Using ZooKeeper as a database is like using a car engine as a paperweight: it works, but you’re paying for expensive capabilities you don’t need and missing the ones you do.
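
To put rough numbers on the size and write-amplification points, the sketch below estimates replication traffic and per-server memory. The update rate (50 writes per second), the profile count (100,000), and both helper functions are assumptions made for illustration; protocol overhead, disk flushes, and snapshots are ignored.

```python
def replication_bytes_per_second(payload_bytes: int, writes_per_second: int, followers: int) -> int:
    """Bytes the leader sends to its followers each second (payload only)."""
    return payload_bytes * writes_per_second * followers


def resident_bytes_per_server(znode_count: int, payload_bytes: int) -> int:
    """Payload bytes every server keeps in memory, since each holds the full tree."""
    return znode_count * payload_bytes


PROFILE_BYTES = 400 * 1024  # one 400KB profile
METADATA_BYTES = 100        # assumed size of a typical coordination ZNode

# 3-server ensemble: one leader, two followers; 50 updates per second (assumed).
profile_traffic = replication_bytes_per_second(PROFILE_BYTES, 50, followers=2)
metadata_traffic = replication_bytes_per_second(METADATA_BYTES, 50, followers=2)

# 100,000 stored profiles (assumed).
profile_memory = resident_bytes_per_server(100_000, PROFILE_BYTES)
```

Under these assumptions the profile workload pushes about 41 MB per second from the leader, against 10 KB per second for coordination-sized data, and each server would need roughly 41 GB of memory for the payloads alone.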