Distributed & Decentralized Systems Curriculum
Real World Architecture · ZooKeeper

Key Question

What’s the ZooKeeper data model and why does it look like a filesystem?

Deep Dive

ZooKeeper stores data in ZNodes (ZooKeeper data nodes), arranged in a hierarchical tree that looks much like a filesystem. The root is /, and every path like /app/config/db_url names a single ZNode. Unlike a filesystem, there is no file/directory split: every node, not just the leaves, can hold data.

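ZNodes come in two lifetimes, persistent and ephemeral, and either one can also be sequential:

  1. Persistent: lives until explicitly deleted. Great for configuration data that should survive restarts.
  2. Ephemeral: deleted automatically when the client session that created it ends, whether the client closes it cleanly or crashes and the session times out. Perfect for liveness: if the client dies, the ZNode vanishes. Ephemeral ZNodes cannot have children.
  3. Sequential: a flag set at creation time. ZooKeeper appends a monotonically increasing, zero-padded 10-digit counter to the name, e.g. /election/lock_0000000001, /election/lock_0000000002. It works with both lifetimes but is usually combined with ephemeral.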
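
As a rough illustration of how the two lifetimes and the sequential flag interact, here is a toy in-memory model in Python. It is not the ZooKeeper client API: ToyTree, ToyZNode, create, and close_session are names invented for this sketch, and starting the counter at 0 is an assumption of the toy.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyZNode:
    data: bytes = b""
    owner_session: Optional[int] = None  # set only for ephemeral nodes


@dataclass
class ToyTree:
    """In-memory stand-in for a ZooKeeper namespace (illustration only)."""
    nodes: Dict[str, ToyZNode] = field(default_factory=lambda: {"/": ToyZNode()})
    counters: Dict[str, int] = field(default_factory=dict)  # one counter per parent

    def create(self, path, data=b"", session=None, ephemeral=False, sequential=False):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        if self.nodes[parent].owner_session is not None:
            raise ValueError("ephemeral nodes cannot have children")
        if ephemeral and session is None:
            raise ValueError("an ephemeral node needs an owning session")
        if sequential:
            n = self.counters.get(parent, 0)
            self.counters[parent] = n + 1
            path = f"{path}{n:010d}"  # zero-padded 10-digit suffix
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = ToyZNode(data, session if ephemeral else None)
        return path

    def close_session(self, session):
        """Delete every ephemeral node owned by this session."""
        for p in [p for p, n in self.nodes.items() if n.owner_session == session]:
            del self.nodes[p]


tree = ToyTree()
tree.create("/app")
tree.create("/app/config", b"db settings")       # persistent
tree.create("/app/workers")                      # persistent parent
w0 = tree.create("/app/workers/worker_", session=1, ephemeral=True, sequential=True)
w1 = tree.create("/app/workers/worker_", session=2, ephemeral=True, sequential=True)
tree.close_session(1)  # worker 1 "dies": only its ephemeral node disappears
```

After close_session(1), the persistent nodes and the second worker's node remain, which is the behavior that makes ephemeral nodes usable as liveness markers.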

Each ZNode stores a small amount of data: the hard limit is about 1MB by default (the jute.maxbuffer setting), and 1KB is the recommended size. ZooKeeper is not a database; it’s a coordination service. You store metadata, not blobs.

                    / (root)
                    |
            +-------+--------+
            |                 |
         /app              /zookeeper
      /app/config       /zookeeper/quota
      /app/config/db_url
      /app/workers
      /app/workers/worker_0000000001   (ephemeral, sequential)
      /app/workers/worker_0000000002   (ephemeral, sequential)

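To solve complex coordination problems, you need only a handful of operations: create, delete, exists, getData, setData, and getChildren, plus watches that notify you when a node changes. That’s it. There are no lock, queue, or election calls in the API; those are recipes built on top of these primitives.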
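
To make "built on top of these primitives" concrete, here is a minimal sketch of the decision at the heart of the usual lock recipe: each contender creates an ephemeral-sequential child, the lowest sequence number holds the lock, and everyone else watches the node just ahead of its own. The function name lock_decision and the example node names are hypothetical. In a real client, the children list would come from getChildren.

```python
def lock_decision(children, mine):
    """Return (holds_lock, node_to_watch) for one contender.

    children: names of all nodes under the lock parent, in any order
    mine:     the name of the node this contender created
    """
    # Order by the numeric sequence suffix that ZooKeeper appended.
    ordered = sorted(children, key=lambda name: int(name.rsplit("_", 1)[1]))
    idx = ordered.index(mine)
    if idx == 0:
        return True, None            # lowest sequence number: lock is ours
    return False, ordered[idx - 1]   # otherwise watch our immediate predecessor


contenders = ["lock_0000000002", "lock_0000000001", "lock_0000000003"]
first = lock_decision(contenders, "lock_0000000001")
second = lock_decision(contenders, "lock_0000000002")
```

Watching only the predecessor, rather than the whole parent, means a release wakes one waiter instead of all of them.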

Check Your Understanding

  1. What happens to an ephemeral ZNode when the client that created it crashes?
  2. Why is 1KB the recommended max data size per ZNode?
  3. What’s the difference between a persistent-sequential and ephemeral-sequential ZNode?

The “So What?”

ZooKeeper’s filesystem-like model is not an accident. It is about the simplest abstraction that lets you model membership, configuration, and locks using a handful of primitives: create, read, write, delete, list children, and watch. Many coordination problems reduce to a path in a tree.



ZooKeeper: Exercises

Exercise 1

A ZooKeeper lock is held by Client A, which creates an ephemeral ZNode /lock/lock_0000005. Client B is watching, waiting for the lock. Client A’s machine suddenly loses power. Walk through exactly what happens — which ZNodes get deleted, how does Client B learn about it, and what guarantee does ZooKeeper provide that the lock is released?


Exercise 2

A developer argues: “Watches should be persistent — I don’t want to re-register them after every notification. It’s just extra code.” Explain why ZooKeeper uses one-shot watches instead of persistent ones. What failure scenarios does one-shot semantics protect against?


Exercise 3

A team decides to store user profiles (name, email, avatar URL, preferences JSON — about 400KB per profile) in ZooKeeper instead of a database. Why is this a bad idea? Reference ZooKeeper’s design constraints, ZAB protocol behavior, and use cases.


ZooKeeper: Exercise Solutions

Exercise 1 — Solution

  1. Client A’s machine loses power, so its heartbeats stop. Once the session timeout elapses without one, ZooKeeper expires Client A’s session.
  2. ZooKeeper’s session management automatically deletes all ephemeral ZNodes owned by Client A’s session, including /lock/lock_0000005.
  3. When /lock/lock_0000005 is deleted, Client B (which had set a watch on that node) receives a watch notification.
  4. Client B calls getChildren("/lock") to list remaining lock contenders. If its ZNode now has the smallest sequence number, it acquires the lock.
  5. Guarantee: an ephemeral node cannot outlive its session, so a dead client can never hold the lock forever. The worst-case release delay is bounded by the configured session timeout: a crash or network partition is noticed only after heartbeats have been missing for that long, and the lock is released when the session expires.
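
The steps above can be replayed in a small simulation. This is a toy model, not ZooKeeper code: ToyEnsemble and its methods are invented for the sketch, and Client B’s node name /lock/lock_0000006 is an assumption, since the exercise names only Client A’s node.

```python
class ToyEnsemble:
    """Tracks ephemeral nodes and one-shot delete watches (illustration only)."""

    def __init__(self):
        self.ephemerals = {}  # path -> owning session id
        self.watches = {}     # path -> callbacks to fire once on deletion

    def create_ephemeral(self, path, session):
        self.ephemerals[path] = session

    def watch_delete(self, path, callback):
        self.watches.setdefault(path, []).append(callback)

    def children(self, parent):
        prefix = parent.rstrip("/") + "/"
        return sorted(p for p in self.ephemerals if p.startswith(prefix))

    def expire_session(self, session):
        # Steps 1-2: the session expires and all of its ephemeral nodes go away.
        doomed = [p for p, s in self.ephemerals.items() if s == session]
        for path in doomed:
            del self.ephemerals[path]
        # Step 3: watchers of the deleted nodes are notified, once each.
        for path in doomed:
            for callback in self.watches.pop(path, []):
                callback(path)


zk = ToyEnsemble()
zk.create_ephemeral("/lock/lock_0000005", session="A")  # A holds the lock
zk.create_ephemeral("/lock/lock_0000006", session="B")  # B waits (assumed name)

events = []

def on_deleted(path):
    events.append(f"deleted {path}")
    # Step 4: B lists the contenders and checks whether it is now first.
    if zk.children("/lock")[0] == "/lock/lock_0000006":
        events.append("B acquires lock")

zk.watch_delete("/lock/lock_0000005", on_deleted)
zk.expire_session("A")  # stands in for the session timeout after power loss
```

In the toy, expiry is triggered by hand. In a real ensemble, it happens only after the session timeout has elapsed, which is the delay bound from step 5.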

Exercise 2 — Solution

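One-shot watches are ZooKeeper’s default (persistent watches exist only as an opt-in through addWatch, added in 3.6). They protect against the stale-watch problem: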

  • Scenario: A client sets a persistent watch on /config. The config changes rapidly 10 times. If the client’s notification handler is slow or blocked on garbage collection, ten notifications queue up. When the client finally works through them, it reacts to every intermediate change, each of which is already stale, when only the latest state matters.
  • One-shot fix: The client gets one notification, then must re-register. Its get("/config", watch=true) call reads the current state and re-arms the watch in one atomic step, so a burst of changes collapses into a single notification followed by a read of the latest data. A watch event carries only the path and the event type, never the data, so the client has to re-read and cannot act on an old cached value.
  • Another failure: A crash between notification and processing. With persistent watches, the crash loses watch state on the server but the client doesn’t know. The client restarts thinking it has active watches — it doesn’t. With one-shot semantics, the client must re-register everything on startup anyway.
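
The collapsing effect can be seen in a toy model of a single watched node. ToyWatchedNode is invented for this sketch and only mimics the one-shot behavior described above. It is not the ZooKeeper client API.

```python
class ToyWatchedNode:
    """A single node with one-shot data watches (illustration only)."""

    def __init__(self, data):
        self.data = data
        self.version = 0
        self._watchers = []

    def get(self, watcher=None):
        """Read the data and, optionally, register a one-shot watcher in one step."""
        if watcher is not None:
            self._watchers.append(watcher)
        return self.data, self.version

    def set(self, data):
        self.data = data
        self.version += 1
        # Each registered watcher fires once and is then forgotten.
        fired, self._watchers = self._watchers, []
        for watcher in fired:
            watcher()


node = ToyWatchedNode("v0")
notifications = []
node.get(watcher=lambda: notifications.append("changed"))

# Ten rapid updates while the client is busy (for example, paused in GC).
for i in range(1, 11):
    node.set(f"v{i}")

# The client wakes up, handles its single notification, re-reads, and re-arms.
data, version = node.get(watcher=lambda: notifications.append("changed"))
```

Only the first update produces a notification. The re-read then returns the value of the tenth update, so the intermediate states are never processed.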

Exercise 3 — Solution

Four reasons this fails:

  1. Size limit: ZooKeeper’s hard limit is 1MB per ZNode, and 1KB is recommended. A 400KB profile fits under the hard limit but is roughly 400 times the recommended size. A write replaces the ZNode’s entire data, and ZAB sends that full payload to every follower, so changing just one user’s avatar URL pushes the whole 400KB through the ensemble.

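  2. Memory footprint and reads: ZooKeeper is optimized for reads, but not for large datasets. Reads do not go through the leader. Each server answers them locally from its own in-memory copy of the tree, which is why they are fast and why a read can be slightly stale unless the client calls sync first. The cost is that every server must hold the entire dataset in memory and snapshot it to disk. Thousands of 400KB profiles inflate the heap and slow down snapshots and follower recovery. ZooKeeper is not designed for high-throughput key-value storage.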

  3. Write amplification: Every write goes through the leader, which proposes it to the followers and commits once a quorum (for example, 2 of 3 servers) has logged it to disk. For coordination metadata (a few bytes), this is fine. For 400KB blobs rewritten every time a user changes their email, the disk I/O and network cost destroy performance.

  4. Wrong tool: ZooKeeper solves coordination (leader election, service discovery, configuration). Databases solve storage (querying, indexing, replication, backups). Using ZooKeeper as a database is like using a car engine as a paperweight — it works, but you’re paying for expensive capabilities you don’t need and missing the ones you do.
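
A back-of-the-envelope calculation puts numbers on the size and replication points. The 3-server ensemble matches the solution above. The figure of 100,000 users is a made-up assumption for illustration, and the helpers count payload bytes only, ignoring protocol and storage overhead.

```python
KB = 1024


def replication_bytes(payload_bytes, ensemble_size):
    """Payload bytes the leader sends to its followers for one write."""
    return payload_bytes * (ensemble_size - 1)


def resident_bytes(payload_bytes, znode_count):
    """Payload bytes every server keeps in memory for this many ZNodes."""
    return payload_bytes * znode_count


profile = 400 * KB   # size from the exercise
metadata = 1 * KB    # recommended ZNode size
users = 100_000      # hypothetical user count, not from the exercise

per_write_profile = replication_bytes(profile, ensemble_size=3)
per_write_metadata = replication_bytes(metadata, ensemble_size=3)
in_memory_gib = resident_bytes(profile, users) / 2**30

print(f"one profile write:  {per_write_profile} bytes to followers")
print(f"one metadata write: {per_write_metadata} bytes to followers")
print(f"profiles resident on every server: {in_memory_gib:.1f} GiB")
```

Under these assumptions, each profile update costs 400 times the network traffic of a typical metadata write, and the dataset alone would need tens of GiB of memory on every server.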