Real World Architecture · Dynamo

Key Question

How does Dynamo distribute data across nodes without rehashing everything when nodes join or leave?

Deep Dive

Traditional hash-based partitioning (key % N) is simple but fragile: when N changes (a node joins or leaves), almost every key is remapped to a different node. For a system with petabytes of data, moving every key is catastrophic.

Consistent hashing solves this. Both keys and nodes are hashed onto a fixed circular space (0 to 2^160 - 1 for SHA-1). Each key is stored on the first node encountered when moving clockwise from the key’s hash. When a node joins, it only “steals” keys from its immediate neighbor. When a node leaves, its keys are taken over by its neighbor.

The Ring

                    Node C
               ┌───────────┐
               │  SHA-1    │
               │  space    │
               │           │
    Node D     │    ●      │     Node B
       ●       │   / \     │        ●
        \      │  /   \    │       /
         \     │ /     \   │      /
          ──────●───────●───────
              Node E     Node A

  Key K1 (hash = 0x4A...) ──► stored on Node A
  Key K2 (hash = 0x8F...) ──► stored on Node B
  Key K3 (hash = 0xC2...) ──► stored on Node C
  Key K4 (hash = 0x1D...) ──► stored on Node E

  (Going clockwise from key hash to first node)

Lookup is simple: hash the key, walk clockwise until you find a node. Returns the node.

Node Join

When a new node joins, it inserts itself at some position on the ring:

Before: Keys between Node D and Node A go to Node A.

    Node D (hash=0x10)       Node A (hash=0x50)
       ●────────────────────────●
                        ^
                        Keys here go to A

After: Node X joins at hash=0x30.

    Node D (0x10)    Node X (0x30)    Node A (0x50)
       ●─────────────────●───────────────●

     Keys (0x10 to 0x30) → Node X (NEW)
     Keys (0x30 to 0x50) → Node A (unchanged)

Node X only takes over keys in the range (D, X] — i.e., keys that previously belonged to Node A. No other nodes are affected. In a cluster of 1000 nodes, a single node join moves only ~0.1% of keys. Compare this to key % N where ≈ (N-1)/N of keys would move (99.9%).

Node Leave

When Node X leaves (failure or decommission):

    Node D (0x10)    Node X (0x30)    Node A (0x50)
       ●─────────────────●───────────────●

  X leaves → keys (D, X] are reassigned to Node A
  Only Node A is affected. Other nodes are unchanged.

Virtual Nodes

Physical machines have different capacities. And with few nodes on the ring, key distribution can be unbalanced (some nodes get many keys, others get few).

Dynamo’s solution: virtual nodes. Each physical node maps to multiple positions on the ring (tokens):

Physical node P1 → virtual nodes V1, V4, V7, V10...
Physical node P2 → virtual nodes V2, V5, V8, V11...
Physical node P3 → virtual nodes V3, V6, V9, V12...

Ring:
  V1 (P1) → V2 (P2) → V3 (P3) → V4 (P1) → V5 (P2) → V6 (P3) → ...

Key K → first virtual node clockwise → physical node

With 100+ virtual nodes per physical node, the law of large numbers ensures balanced key distribution even if a few virtual nodes get unlucky. When a physical node joins, its virtual nodes are spread across the ring, taking a proportional share of keys from each neighbor.

Lookup Efficiency

Each node maintains a “finger table” (or routing table) mapping segments of the ring to nodes. Lookups are O(log N) in the number of virtual nodes, using binary search on the sorted ring positions.

Dynamo vs. Cassandra

Cassandra inherited consistent hashing from Dynamo but calls it “partitioning.” Both use the ring with virtual nodes. The key difference: Cassandra allows operators to assign tokens manually, while Dynamo handles it automatically.

Check Your Understanding

A ring has 100 nodes. One node joins. What fraction of keys are remapped (on average)?
What happens if the hash function produces very unbalanced key distribution even with virtual nodes? Can consistent hashing still cause hot spots?
Why are virtual nodes important? What’s the worst-case load imbalance without them?

The “So What?”

Consistent hashing is the foundation of DynamoDB, Cassandra, Riak, and many other NoSQL systems. When you create a DynamoDB table and choose your partition key, you’re relying on consistent hashing to distribute data across partitions. The ring + virtual node pattern is also used in load balancers (e.g., consistent hashing for HTTP caching) and CDNs.

✏️ Exercises

Dynamo: Exercises

Quorum consistency. N = 3, R = 1, W = 3. Under what conditions can a read return stale data? What happens to the latency of writes and reads in this configuration?
Hinted handoff durability. In a sloppy quorum scenario, node A is down. The coordinator writes key K to B, C, D (D holds hinted data for A). Then D crashes before handing off to A. B and C remain healthy. Node A recovers. How many copies of key K exist? Is any data lost?
Merkle tree depth. A Dynamo node stores 16 million keys. It builds a Merkle tree of depth 16 (65,536 leaves). What is the average number of keys per bucket? If the root hashes differ between two nodes, how many hash comparisons are needed in the worst case to find the differing bucket?
Dynamo vs. GFS. GFS uses a single master with lease-based replication. Dynamo uses a fully distributed ring with quorums. Why did Amazon choose the Dynamo architecture instead of a GFS-like master-based replication? What constraints drove this decision?

👁️ View Solutions

Dynamo: Solutions

1. Quorum consistency

Answer. N = 3, R = 1, W = 3. With R = 1, the read quorum is a single node. With W = 3, the write quorum is all three nodes (A, B, C). Since R + W = 4 > N = 3, any read quorum overlaps with any write quorum. The read is guaranteed to see the latest write because the single node that responds (R=1) was part of the write quorum of 3 (W=3). So the read always returns the latest data.

Write latency: the coordinator must wait for all 3 replicas to acknowledge (W=3). Write latency = max(A, B, C) — the slowest replica. Writes are slower.

Read latency: the coordinator returns as soon as any 1 replica responds (R=1). Read latency = min(A, B, C). Reads are fast. However, the data could be stale if the coordinator doesn’t pick the right replica (but with W=3, all replicas are up-to-date, so this is not an issue).

2. Hinted handoff durability

Answer. Timeline:

A is down. Write goes to B, C, D (D holds hinted copy for A).
D crashes. D’s local store (including the hinted data for A) is lost.
B and C have the data. A is back up but doesn’t have the latest write.

Number of copies: B and C have the data (2 copies). The hinted copy on D is lost. So only 2 copies remain. If B or C also fails, the data could be lost permanently.

Is data lost? Yes, the latest write is partially lost — specifically, the copy intended for A (which was hinted to D) is gone when D crashed. However, B and C still have the data, so the data is not completely lost. The system has N=3 but only 2 copies survive. Durability is temporarily reduced. When anti-entropy runs (or hinted handoff from another node), it will detect that A is missing the latest version and repair it.

This is the trade-off: sloppy quorums improve availability (writes always succeed) at the cost of temporarily reduced durability during cascading failures.

3. Merkle tree depth

Answer.

Average keys per bucket: 16,000,000 / 65,536 ≈ 244 keys per bucket.

Worst-case hash comparisons to find differing bucket: at each level of the tree, we compare 2 hashes (the two children of the differing node). Depth 16 means 16 levels. At each level, we compare 2 hashes. So 16 × 2 = 32 hash comparisons in the worst case. This is O(log L) where L = 65,536 leaves.

After finding the bucket, we transfer all keys in that bucket (≈244 keys) to identify the exact differing key-value pairs.

Contrast with a full comparison: without Merkle trees, we’d need to transfer or compare all 16 million keys (each key-value pair sent over the network or compared locally). The Merkle tree reduces this to 32 hashes + 244 keys — a 99.998% reduction in data transfer.

4. Dynamo vs. GFS

Answer. Amazon chose the Dynamo architecture for these reasons:

Availability requirements. Amazon’s shopping cart must always accept writes. If a customer adds an item to their cart, it must succeed. GFS’s single master creates an availability bottleneck. Dynamo achieves 99.999% write availability through its fully distributed, masterless design.
No single point of failure. Dynamo has no master. GFS has a single master (SPOF). For Amazon’s SLA-driven services, any SPOF is unacceptable. The Dynamo ring means any node can fail without affecting availability — the system degrades gracefully.
Operational simplicity. There’s no master to elect, promote, or failover. Nodes join and leave the ring independently. This makes operations (deployments, scaling) much simpler than managing a GFS-like master.
Workload characteristics. GFS is optimized for large files and sequential access (search index, MapReduce). Dynamo is optimized for small objects and random access (shopping cart, session state). Dynamo’s quorum system is better suited to key-value workloads than GFS’s chunk-based approach.
Consistency model flexibility. Dynamo’s tunable consistency (N, R, W) lets each service pick its trade-off. GFS always provides strong consistency for single-chunk operations. Not every Amazon service needs strong consistency.
No cross-rack file system dependencies. Dynamo works directly on local disks. GFS depends on a shared file system abstraction. Dynamo’s simpler local-storage model is more portable and easier to manage at Amazon’s scale.

In short: GFS optimizes for throughput (large sequential operations). Dynamo optimizes for availability (small random operations). They were designed for different workloads.

🎮 Interactive Consistent Hashing

Scroll to load interactive visualization…