Real World Architecture · Dynamo

Key Question

How does Dynamo stay available for writes even when the target replica is down?

Deep Dive

Strict quorums have a problem: if a node that should hold a replica is down, a write to that key cannot make quorum. This means key K would become unavailable for writes during any failure. In a 1000-node cluster, you’re almost guaranteed that at least one node is down at any moment, which would make some keys consistently unavailable.

Dynamo solves this with sloppy quorums and hinted handoff.

The Problem: Node A is Down

Normal state:  Key K is stored on replicas A, B, C

Ring segment:
  ... ●───●───●───●───●───●───●───●───...
       A   B   C   D   E   F   G   H

  Key K → first 3 nodes clockwise: A, B, C

Now A is down (machine failure, network partition):

If we use strict quorums:
   Write to K → try A, B, C → only B and C respond
   With N=3, W=2: Still OK, we have {B, C}. But what if N=3, W=3?
   Write fails because A is down.

   Worse: what if B is also overloaded and slow?
   With N=3, R=2, W=2: we might get only C.
   Write fails because we can't reach 2 nodes.

Sloppy Quorums: Accept Any N Healthy Nodes

With sloppy quorums, the coordinator accepts the write from any N healthy nodes in the ring, not necessarily the first N nodes in the preference list:

Sloppy quorum in action:

  Key K's preference list: A, B, C (A is down)

  Coordinator sends to A, B, C, D (goes further than N=3)
  B and C respond (healthy). D also responds.
  Coordinator gets 3 responses {B, C, D}.
  R=2 or W=2 satisfied.

  But D is NOT in K's preference list!
  D stores the data "on behalf of" A.

Ring view during sloppy quorum:

  Normal:   A  B  C  (preference list for K = [A, B, C])
  A is down:
  Sloppy:      B  C  D  (D now holds data "for" A)
  Ring: ... ●─●─●─●─●─●─●─●─●─...
            A  B  C  D  E  ...
               ▲  ▲  ▲
               │  │  └── D accepts write as "hinted replica for A"
               │  └──── C accepts write as normal replica
               └─────── B accepts write as normal replica

Hinted Handoff: the “Promissory Note”

When D accepts the write for A, D stores:

Data record for key K:
  value = "xyz"
  version = 42
  hint = { intended_replica: A, timestamp: T }

This hint tells D: “this data really belongs to A. I’m just holding it temporarily.” D stores the data in a separate local store (not interfering with data A would normally hold).

The Handoff: When A Comes Back

Timeline:

T=0  A crashes
T=1  Coordinator routes writes to B, C, D (D holds hinted data for A)
T=5  A comes back online, rejoins ring
T=6  D checks membership changes, sees A is back
T=7  D reads all hinted records destined for A
T=8  D sends data + hints to A
       ┌───────────────────────────┐
       │  D → A: "Here's key K,    │
       │         version 42,       │
       │         and all other keys│
       │         I held for you"   │
       └───────────────────────────┘
T=9  A acknowledges. D deletes the hinted records.
T=10 Ring is back to normal: A, B, C hold K.

Hinted handoff ensures that the “wrong” replica D only holds the data temporarily. Once A recovers, the system self-heals.

Durability Concern: What if D Also Crashes?

If D crashes before handing off, the hinted data is lost. This is a durability (not availability) trade-off. Dynamo’s durability guarantee is slightly weaker during failures:

Scenario: A is down. D holds hinted data for A.
   Now D crashes (before handing off to A).
   The hinted data on D is lost.

Durability during this window: only B and C have the data (2 replicas).

This is acceptable for Amazon’s use cases (shopping carts, sessions) where data can be reconstructed or the loss is bounded.

Why “Sloppy Quorums” Instead of “Always Write to N Nodes”?

Sloppy quorums prioritize write availability over replica location correctness. The system accepts a write even if it can’t write to the ideal replicas. This means writes succeed even during extended node failures.

Amazon measured this as critical: their shopping cart service maintained 99.999% write availability using sloppy quorums, compared to ~99.9% with strict quorums.

Check Your Understanding

In sloppy quorums, node A is down and you write to B, C, D. Node A comes back up. D hands off data to A. What guarantees does the system provide about reads that happened during this period?
What happens to hinted handoff data if the coordinator (the node receiving the initial write) also crashes before performing the handoff?
Why does Dynamo choose sloppy quorums over simply replicating to fewer nodes (e.g., N=2 instead of N=3)?

The “So What?”

Hinted handoff is why Dynamo can achieve 99.999% write availability despite regular machine failures. When you use DynamoDB and a write succeeds even during a server outage, you’re benefiting from this design. The technique is widely used: Cassandra uses hinted handoff, and many modern databases use a variant for maintaining availability during failures.

✏️ Exercises

Dynamo: Exercises

Quorum consistency. N = 3, R = 1, W = 3. Under what conditions can a read return stale data? What happens to the latency of writes and reads in this configuration?
Hinted handoff durability. In a sloppy quorum scenario, node A is down. The coordinator writes key K to B, C, D (D holds hinted data for A). Then D crashes before handing off to A. B and C remain healthy. Node A recovers. How many copies of key K exist? Is any data lost?
Merkle tree depth. A Dynamo node stores 16 million keys. It builds a Merkle tree of depth 16 (65,536 leaves). What is the average number of keys per bucket? If the root hashes differ between two nodes, how many hash comparisons are needed in the worst case to find the differing bucket?
Dynamo vs. GFS. GFS uses a single master with lease-based replication. Dynamo uses a fully distributed ring with quorums. Why did Amazon choose the Dynamo architecture instead of a GFS-like master-based replication? What constraints drove this decision?

👁️ View Solutions

Dynamo: Solutions

1. Quorum consistency

Answer. N = 3, R = 1, W = 3. With R = 1, the read quorum is a single node. With W = 3, the write quorum is all three nodes (A, B, C). Since R + W = 4 > N = 3, any read quorum overlaps with any write quorum. The read is guaranteed to see the latest write because the single node that responds (R=1) was part of the write quorum of 3 (W=3). So the read always returns the latest data.

Write latency: the coordinator must wait for all 3 replicas to acknowledge (W=3). Write latency = max(A, B, C) — the slowest replica. Writes are slower.

Read latency: the coordinator returns as soon as any 1 replica responds (R=1). Read latency = min(A, B, C). Reads are fast. However, the data could be stale if the coordinator doesn’t pick the right replica (but with W=3, all replicas are up-to-date, so this is not an issue).

2. Hinted handoff durability

Answer. Timeline:

A is down. Write goes to B, C, D (D holds hinted copy for A).
D crashes. D’s local store (including the hinted data for A) is lost.
B and C have the data. A is back up but doesn’t have the latest write.

Number of copies: B and C have the data (2 copies). The hinted copy on D is lost. So only 2 copies remain. If B or C also fails, the data could be lost permanently.

Is data lost? Yes, the latest write is partially lost — specifically, the copy intended for A (which was hinted to D) is gone when D crashed. However, B and C still have the data, so the data is not completely lost. The system has N=3 but only 2 copies survive. Durability is temporarily reduced. When anti-entropy runs (or hinted handoff from another node), it will detect that A is missing the latest version and repair it.

This is the trade-off: sloppy quorums improve availability (writes always succeed) at the cost of temporarily reduced durability during cascading failures.

3. Merkle tree depth

Answer.

Average keys per bucket: 16,000,000 / 65,536 ≈ 244 keys per bucket.

Worst-case hash comparisons to find differing bucket: at each level of the tree, we compare 2 hashes (the two children of the differing node). Depth 16 means 16 levels. At each level, we compare 2 hashes. So 16 × 2 = 32 hash comparisons in the worst case. This is O(log L) where L = 65,536 leaves.

After finding the bucket, we transfer all keys in that bucket (≈244 keys) to identify the exact differing key-value pairs.

Contrast with a full comparison: without Merkle trees, we’d need to transfer or compare all 16 million keys (each key-value pair sent over the network or compared locally). The Merkle tree reduces this to 32 hashes + 244 keys — a 99.998% reduction in data transfer.

4. Dynamo vs. GFS

Answer. Amazon chose the Dynamo architecture for these reasons:

Availability requirements. Amazon’s shopping cart must always accept writes. If a customer adds an item to their cart, it must succeed. GFS’s single master creates an availability bottleneck. Dynamo achieves 99.999% write availability through its fully distributed, masterless design.
No single point of failure. Dynamo has no master. GFS has a single master (SPOF). For Amazon’s SLA-driven services, any SPOF is unacceptable. The Dynamo ring means any node can fail without affecting availability — the system degrades gracefully.
Operational simplicity. There’s no master to elect, promote, or failover. Nodes join and leave the ring independently. This makes operations (deployments, scaling) much simpler than managing a GFS-like master.
Workload characteristics. GFS is optimized for large files and sequential access (search index, MapReduce). Dynamo is optimized for small objects and random access (shopping cart, session state). Dynamo’s quorum system is better suited to key-value workloads than GFS’s chunk-based approach.
Consistency model flexibility. Dynamo’s tunable consistency (N, R, W) lets each service pick its trade-off. GFS always provides strong consistency for single-chunk operations. Not every Amazon service needs strong consistency.
No cross-rack file system dependencies. Dynamo works directly on local disks. GFS depends on a shared file system abstraction. Dynamo’s simpler local-storage model is more portable and easier to manage at Amazon’s scale.

In short: GFS optimizes for throughput (large sequential operations). Dynamo optimizes for availability (small random operations). They were designed for different workloads.