Real World Architecture · Dynamo

Key Question

How does Dynamo efficiently detect differences between replicas without comparing every key?

Deep Dive

Replicas diverge over time. Hinted handoff might miss some writes. Network partitions cause inconsistencies. Clock skew produces conflicting versions. Dynamo needs an anti-entropy mechanism to find and repair these differences.

The naive approach: node A sends every key-value pair to node B, and B compares them. For a node storing 1 billion keys, that’s terabytes of data transferred. This doesn’t scale.

The efficient solution: Merkle trees (also called hash trees).

What’s a Merkle Tree?

A Merkle tree is a binary tree where:

Each leaf is the hash of a key-value pair
Each internal node is the hash of its two children
The root is the hash of the entire tree

Merkle Tree (depth = 3, 8 leaves):

            Root = H(H01 || H23)
          /                    \
     H01 = H(H0 || H1)     H23 = H(H2 || H3)
       /        \            /        \
    H0        H1          H2          H3
   /  \      /  \        /  \        /  \
 KV0  KV1  KV2  KV3    KV4  KV5    KV6  KV7

  H0 = H(KV0)      H1 = H(KV1)      ...
  H01 = H( H(H0, H1) )              ...
  Root = H( H(H01, H23) )

If any leaf changes (e.g., KV3 is updated), then H3 changes, which means H23 changes, which means Root changes. The change propagates from leaf to root.

How Dynamo Uses Merkle Trees

Each Dynamo node builds a Merkle tree for the key range it hosts. The tree is built by:

Partitioning the key range into buckets (e.g., 256 buckets)
Hashing all keys in a bucket → leaf hash
Building the tree bottom-up

┌─────────────────────────────────────────────────────────┐
│                     Node A                               │
│                                                          │
│  Key range: [0x0000 - 0xFFFF]                            │
│  ┌─────────────────────────────────────────────────┐    │
│  │ Merkle Tree                                      │    │
│  │  Root: 0x4F2A...                                 │    │
│  │    ├── H0_1: 0xA1B2...                           │    │
│  │    │    ├── Bucket 0: 0x3C4D... (keys A-C)       │    │
│  │    │    └── Bucket 1: 0x8E9F... (keys D-G)       │    │
│  │    └── H2_3: 0x5263...                           │    │
│  │         ├── Bucket 2: 0x7148... (keys H-L)       │    │
│  │         └── Bucket 3: 0x0B1C... (keys M-P)       │    │
│  └─────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘

Anti-Entropy Protocol: Step by Step

Node A and Node B want to compare their key ranges:

Step 1: A sends its Merkle tree root hash to B.

  A ──Root_A──► B

Step 2: B compares Root_A with Root_B.

  Case 1: Root_A == Root_B → trees are identical → no difference → DONE.
  Case 2: Root_A != Root_B → trees differ → need to find WHERE.

Step 3: B asks A for children of the root.

  A ──H0, H1──► B

Step 4: B compares H0_A vs H0_B, H1_A vs H1_B.

  Suppose H0 matches but H1 differs.
  The difference is in the H1 subtree.

Step 5: B asks A for children of H1.

  A ──H2, H3──► B

Step 6: B compares H2_A vs H2_B, H3_A vs H3_B.

  Suppose H3 differs. H3 corresponds to Bucket 3.

Step 7: B asks A for all keys in Bucket 3.

  A ──[keys KV6, KV7]──► B

Step 8: B identifies exactly which keys differ
        and initiates repair.

Total data transferred: Root (32 bytes) + 2 hashes (64 bytes) + 2 hashes (64 bytes) + Bucket 3 keys (N bytes). This is O(log L) where L is the number of leaves, compared to O(L) for a full comparison.

Concrete Example

Node A has 1 million keys. Node B has 1 million keys. They differ on exactly one key.

Full comparison: Transfer 1M key-value pairs (hundreds of MB or GB). Read all 1M entries from each node’s local storage.
Merkle tree: Transfer ~4KB of hashes in log steps. Read only the affected bucket’s keys from local storage.

The difference is 5+ orders of magnitude.

Practical Considerations

Rebuilding cost. Building the Merkle tree requires reading all keys and computing hashes. This is CPU-intensive but occurs infrequently (periodic anti-entropy runs every few hours).
Incremental updates. When a key is written, the node doesn’t immediately rebuild the entire tree. It lazily recomputes the affected branches.
Depth selection. The tree depth determines the trade-off: deeper trees = more granular diff (smaller buckets) but more hashes to transfer during comparison.
Key range changes. When virtual nodes move (node joins/leaves), Merkle trees must be rebuilt.

Why Not Just Use Version Vectors?

Dynamo also uses version vectors (vector clocks) for reconciling concurrent writes during reads. Merkle trees serve a different purpose: detecting deep inconsistencies between replicas that version vectors alone can’t fix. If a replica misses many writes (e.g., after a long partition), version vectors show the total order but not which keys are missing. Merkle trees pinpoint the exact missing keys efficiently.

Check Your Understanding

Node A and Node B have Merkle trees of depth 10 (1024 leaves). Their root hashes differ. How many hash comparisons are needed in the worst case to find the exact differing leaf bucket?
Why doesn’t Dynamo use Merkle trees for every anti-entropy check? What’s the cost of building a Merkle tree?
Two nodes have identical data but build their Merkle trees with different hash functions or bucket boundaries. Can the anti-entropy protocol still work?

The “So What?”

Merkle trees make anti-entropy practical at scale — without them, Dynamo would need to compare every key with every replica. This same technique is used by Cassandra, Riak, Amazon DynamoDB, Bitcoin (for transaction verification), and Git (for commit graph comparison). Whenever you have two large datasets that might differ, think: “Can I use a Merkle tree to find the difference efficiently?”

✏️ Exercises

Dynamo: Exercises

Quorum consistency. N = 3, R = 1, W = 3. Under what conditions can a read return stale data? What happens to the latency of writes and reads in this configuration?
Hinted handoff durability. In a sloppy quorum scenario, node A is down. The coordinator writes key K to B, C, D (D holds hinted data for A). Then D crashes before handing off to A. B and C remain healthy. Node A recovers. How many copies of key K exist? Is any data lost?
Merkle tree depth. A Dynamo node stores 16 million keys. It builds a Merkle tree of depth 16 (65,536 leaves). What is the average number of keys per bucket? If the root hashes differ between two nodes, how many hash comparisons are needed in the worst case to find the differing bucket?
Dynamo vs. GFS. GFS uses a single master with lease-based replication. Dynamo uses a fully distributed ring with quorums. Why did Amazon choose the Dynamo architecture instead of a GFS-like master-based replication? What constraints drove this decision?

👁️ View Solutions

Dynamo: Solutions

1. Quorum consistency

Answer. N = 3, R = 1, W = 3. With R = 1, the read quorum is a single node. With W = 3, the write quorum is all three nodes (A, B, C). Since R + W = 4 > N = 3, any read quorum overlaps with any write quorum. The read is guaranteed to see the latest write because the single node that responds (R=1) was part of the write quorum of 3 (W=3). So the read always returns the latest data.

Write latency: the coordinator must wait for all 3 replicas to acknowledge (W=3). Write latency = max(A, B, C) — the slowest replica. Writes are slower.

Read latency: the coordinator returns as soon as any 1 replica responds (R=1). Read latency = min(A, B, C). Reads are fast. However, the data could be stale if the coordinator doesn’t pick the right replica (but with W=3, all replicas are up-to-date, so this is not an issue).

2. Hinted handoff durability

Answer. Timeline:

A is down. Write goes to B, C, D (D holds hinted copy for A).
D crashes. D’s local store (including the hinted data for A) is lost.
B and C have the data. A is back up but doesn’t have the latest write.

Number of copies: B and C have the data (2 copies). The hinted copy on D is lost. So only 2 copies remain. If B or C also fails, the data could be lost permanently.

Is data lost? Yes, the latest write is partially lost — specifically, the copy intended for A (which was hinted to D) is gone when D crashed. However, B and C still have the data, so the data is not completely lost. The system has N=3 but only 2 copies survive. Durability is temporarily reduced. When anti-entropy runs (or hinted handoff from another node), it will detect that A is missing the latest version and repair it.

This is the trade-off: sloppy quorums improve availability (writes always succeed) at the cost of temporarily reduced durability during cascading failures.

3. Merkle tree depth

Answer.

Average keys per bucket: 16,000,000 / 65,536 ≈ 244 keys per bucket.

Worst-case hash comparisons to find differing bucket: at each level of the tree, we compare 2 hashes (the two children of the differing node). Depth 16 means 16 levels. At each level, we compare 2 hashes. So 16 × 2 = 32 hash comparisons in the worst case. This is O(log L) where L = 65,536 leaves.

After finding the bucket, we transfer all keys in that bucket (≈244 keys) to identify the exact differing key-value pairs.

Contrast with a full comparison: without Merkle trees, we’d need to transfer or compare all 16 million keys (each key-value pair sent over the network or compared locally). The Merkle tree reduces this to 32 hashes + 244 keys — a 99.998% reduction in data transfer.

4. Dynamo vs. GFS

Answer. Amazon chose the Dynamo architecture for these reasons:

Availability requirements. Amazon’s shopping cart must always accept writes. If a customer adds an item to their cart, it must succeed. GFS’s single master creates an availability bottleneck. Dynamo achieves 99.999% write availability through its fully distributed, masterless design.
No single point of failure. Dynamo has no master. GFS has a single master (SPOF). For Amazon’s SLA-driven services, any SPOF is unacceptable. The Dynamo ring means any node can fail without affecting availability — the system degrades gracefully.
Operational simplicity. There’s no master to elect, promote, or failover. Nodes join and leave the ring independently. This makes operations (deployments, scaling) much simpler than managing a GFS-like master.
Workload characteristics. GFS is optimized for large files and sequential access (search index, MapReduce). Dynamo is optimized for small objects and random access (shopping cart, session state). Dynamo’s quorum system is better suited to key-value workloads than GFS’s chunk-based approach.
Consistency model flexibility. Dynamo’s tunable consistency (N, R, W) lets each service pick its trade-off. GFS always provides strong consistency for single-chunk operations. Not every Amazon service needs strong consistency.
No cross-rack file system dependencies. Dynamo works directly on local disks. GFS depends on a shared file system abstraction. Dynamo’s simpler local-storage model is more portable and easier to manage at Amazon’s scale.

In short: GFS optimizes for throughput (large sequential operations). Dynamo optimizes for availability (small random operations). They were designed for different workloads.