Key Question
How does Dynamo efficiently detect differences between replicas without comparing every key?
Deep Dive
Replicas diverge over time. Hinted handoff might miss some writes. Network partitions cause inconsistencies. Clock skew produces conflicting versions. Dynamo needs an anti-entropy mechanism to find and repair these differences.
The naive approach: node A sends every key-value pair to node B, and B compares them. For a node storing 1 billion keys, thatβs terabytes of data transferred. This doesnβt scale.
The efficient solution: Merkle trees (also called hash trees).
Whatβs a Merkle Tree?
A Merkle tree is a binary tree where:
- Each leaf is the hash of a key-value pair
- Each internal node is the hash of its two children
- The root is the hash of the entire tree
Merkle Tree (depth = 3, 8 leaves):
Root = H(H01 || H23)
/ \
H01 = H(H0 || H1) H23 = H(H2 || H3)
/ \ / \
H0 H1 H2 H3
/ \ / \ / \ / \
KV0 KV1 KV2 KV3 KV4 KV5 KV6 KV7
H0 = H(KV0) H1 = H(KV1) ...
H01 = H( H(H0, H1) ) ...
Root = H( H(H01, H23) )
If any leaf changes (e.g., KV3 is updated), then H3 changes, which means H23 changes, which means Root changes. The change propagates from leaf to root.
How Dynamo Uses Merkle Trees
Each Dynamo node builds a Merkle tree for the key range it hosts. The tree is built by:
- Partitioning the key range into buckets (e.g., 256 buckets)
- Hashing all keys in a bucket β leaf hash
- Building the tree bottom-up
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Node A β
β β
β Key range: [0x0000 - 0xFFFF] β
β βββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Merkle Tree β β
β β Root: 0x4F2A... β β
β β βββ H0_1: 0xA1B2... β β
β β β βββ Bucket 0: 0x3C4D... (keys A-C) β β
β β β βββ Bucket 1: 0x8E9F... (keys D-G) β β
β β βββ H2_3: 0x5263... β β
β β βββ Bucket 2: 0x7148... (keys H-L) β β
β β βββ Bucket 3: 0x0B1C... (keys M-P) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Anti-Entropy Protocol: Step by Step
Node A and Node B want to compare their key ranges:
Step 1: A sends its Merkle tree root hash to B.
A ββRoot_AβββΊ B
Step 2: B compares Root_A with Root_B.
Case 1: Root_A == Root_B β trees are identical β no difference β DONE.
Case 2: Root_A != Root_B β trees differ β need to find WHERE.
Step 3: B asks A for children of the root.
A ββH0, H1βββΊ B
Step 4: B compares H0_A vs H0_B, H1_A vs H1_B.
Suppose H0 matches but H1 differs.
The difference is in the H1 subtree.
Step 5: B asks A for children of H1.
A ββH2, H3βββΊ B
Step 6: B compares H2_A vs H2_B, H3_A vs H3_B.
Suppose H3 differs. H3 corresponds to Bucket 3.
Step 7: B asks A for all keys in Bucket 3.
A ββ[keys KV6, KV7]βββΊ B
Step 8: B identifies exactly which keys differ
and initiates repair.
Total data transferred: Root (32 bytes) + 2 hashes (64 bytes) + 2 hashes (64 bytes) + Bucket 3 keys (N bytes). This is O(log L) where L is the number of leaves, compared to O(L) for a full comparison.
Concrete Example
Node A has 1 million keys. Node B has 1 million keys. They differ on exactly one key.
- Full comparison: Transfer 1M key-value pairs (hundreds of MB or GB). Read all 1M entries from each nodeβs local storage.
- Merkle tree: Transfer ~4KB of hashes in log steps. Read only the affected bucketβs keys from local storage.
The difference is 5+ orders of magnitude.
Practical Considerations
- Rebuilding cost. Building the Merkle tree requires reading all keys and computing hashes. This is CPU-intensive but occurs infrequently (periodic anti-entropy runs every few hours).
- Incremental updates. When a key is written, the node doesnβt immediately rebuild the entire tree. It lazily recomputes the affected branches.
- Depth selection. The tree depth determines the trade-off: deeper trees = more granular diff (smaller buckets) but more hashes to transfer during comparison.
- Key range changes. When virtual nodes move (node joins/leaves), Merkle trees must be rebuilt.
Why Not Just Use Version Vectors?
Dynamo also uses version vectors (vector clocks) for reconciling concurrent writes during reads. Merkle trees serve a different purpose: detecting deep inconsistencies between replicas that version vectors alone canβt fix. If a replica misses many writes (e.g., after a long partition), version vectors show the total order but not which keys are missing. Merkle trees pinpoint the exact missing keys efficiently.
Check Your Understanding
-
Node A and Node B have Merkle trees of depth 10 (1024 leaves). Their root hashes differ. How many hash comparisons are needed in the worst case to find the exact differing leaf bucket?
-
Why doesnβt Dynamo use Merkle trees for every anti-entropy check? Whatβs the cost of building a Merkle tree?
-
Two nodes have identical data but build their Merkle trees with different hash functions or bucket boundaries. Can the anti-entropy protocol still work?
The βSo What?β
Merkle trees make anti-entropy practical at scale β without them, Dynamo would need to compare every key with every replica. This same technique is used by Cassandra, Riak, Amazon DynamoDB, Bitcoin (for transaction verification), and Git (for commit graph comparison). Whenever you have two large datasets that might differ, think: βCan I use a Merkle tree to find the difference efficiently?β
βοΈ Exercises
Dynamo: Exercises
-
Quorum consistency. N = 3, R = 1, W = 3. Under what conditions can a read return stale data? What happens to the latency of writes and reads in this configuration?
-
Hinted handoff durability. In a sloppy quorum scenario, node A is down. The coordinator writes key K to B, C, D (D holds hinted data for A). Then D crashes before handing off to A. B and C remain healthy. Node A recovers. How many copies of key K exist? Is any data lost?
-
Merkle tree depth. A Dynamo node stores 16 million keys. It builds a Merkle tree of depth 16 (65,536 leaves). What is the average number of keys per bucket? If the root hashes differ between two nodes, how many hash comparisons are needed in the worst case to find the differing bucket?
-
Dynamo vs. GFS. GFS uses a single master with lease-based replication. Dynamo uses a fully distributed ring with quorums. Why did Amazon choose the Dynamo architecture instead of a GFS-like master-based replication? What constraints drove this decision?
ποΈ View Solutions
Dynamo: Solutions
1. Quorum consistency
Answer. N = 3, R = 1, W = 3. With R = 1, the read quorum is a single node. With W = 3, the write quorum is all three nodes (A, B, C). Since R + W = 4 > N = 3, any read quorum overlaps with any write quorum. The read is guaranteed to see the latest write because the single node that responds (R=1) was part of the write quorum of 3 (W=3). So the read always returns the latest data.
Write latency: the coordinator must wait for all 3 replicas to acknowledge (W=3). Write latency = max(A, B, C) β the slowest replica. Writes are slower.
Read latency: the coordinator returns as soon as any 1 replica responds (R=1). Read latency = min(A, B, C). Reads are fast. However, the data could be stale if the coordinator doesnβt pick the right replica (but with W=3, all replicas are up-to-date, so this is not an issue).
2. Hinted handoff durability
Answer. Timeline:
- A is down. Write goes to B, C, D (D holds hinted copy for A).
- D crashes. Dβs local store (including the hinted data for A) is lost.
- B and C have the data. A is back up but doesnβt have the latest write.
Number of copies: B and C have the data (2 copies). The hinted copy on D is lost. So only 2 copies remain. If B or C also fails, the data could be lost permanently.
Is data lost? Yes, the latest write is partially lost β specifically, the copy intended for A (which was hinted to D) is gone when D crashed. However, B and C still have the data, so the data is not completely lost. The system has N=3 but only 2 copies survive. Durability is temporarily reduced. When anti-entropy runs (or hinted handoff from another node), it will detect that A is missing the latest version and repair it.
This is the trade-off: sloppy quorums improve availability (writes always succeed) at the cost of temporarily reduced durability during cascading failures.
3. Merkle tree depth
Answer.
Average keys per bucket: 16,000,000 / 65,536 β 244 keys per bucket.
Worst-case hash comparisons to find differing bucket: at each level of the tree, we compare 2 hashes (the two children of the differing node). Depth 16 means 16 levels. At each level, we compare 2 hashes. So 16 Γ 2 = 32 hash comparisons in the worst case. This is O(log L) where L = 65,536 leaves.
After finding the bucket, we transfer all keys in that bucket (β244 keys) to identify the exact differing key-value pairs.
Contrast with a full comparison: without Merkle trees, weβd need to transfer or compare all 16 million keys (each key-value pair sent over the network or compared locally). The Merkle tree reduces this to 32 hashes + 244 keys β a 99.998% reduction in data transfer.
4. Dynamo vs. GFS
Answer. Amazon chose the Dynamo architecture for these reasons:
-
Availability requirements. Amazonβs shopping cart must always accept writes. If a customer adds an item to their cart, it must succeed. GFSβs single master creates an availability bottleneck. Dynamo achieves 99.999% write availability through its fully distributed, masterless design.
-
No single point of failure. Dynamo has no master. GFS has a single master (SPOF). For Amazonβs SLA-driven services, any SPOF is unacceptable. The Dynamo ring means any node can fail without affecting availability β the system degrades gracefully.
-
Operational simplicity. Thereβs no master to elect, promote, or failover. Nodes join and leave the ring independently. This makes operations (deployments, scaling) much simpler than managing a GFS-like master.
-
Workload characteristics. GFS is optimized for large files and sequential access (search index, MapReduce). Dynamo is optimized for small objects and random access (shopping cart, session state). Dynamoβs quorum system is better suited to key-value workloads than GFSβs chunk-based approach.
-
Consistency model flexibility. Dynamoβs tunable consistency (N, R, W) lets each service pick its trade-off. GFS always provides strong consistency for single-chunk operations. Not every Amazon service needs strong consistency.
-
No cross-rack file system dependencies. Dynamo works directly on local disks. GFS depends on a shared file system abstraction. Dynamoβs simpler local-storage model is more portable and easier to manage at Amazonβs scale.
In short: GFS optimizes for throughput (large sequential operations). Dynamo optimizes for availability (small random operations). They were designed for different workloads.