Distributed & Decentralized Systems Curriculum
Real World Architecture ยท Dynamo

Key Question

How does Dynamo balance read and write performance using configurable quorums?

Deep Dive

Dynamo gives you three knobs: N (replication factor), R (minimum read responses), and W (minimum write responses). These three knobs let you dial in your exact position on the consistency-availability-latency trade-off spectrum.

The Three Knobs

ParameterMeaningDefaultRange
NNumber of replicas for each key31 - โˆž (practical max ~5)
RMinimum nodes for a successful read21 - N
WMinimum nodes for a successful write21 - N

Read: coordinator sends read request to all N replicas. Waits for R responses. Write: coordinator sends write to all N replicas. Waits for W acknowledgments.

Quorum Intersection (R + W > N)

The magic happens when R + W > N. This means any read quorum overlaps with any write quorum on at least one node. That node will have the latest write, guaranteeing strong consistency.

Example: N = 3, R = 2, W = 2

Ring holds 3 replicas of key K: [A, B, C]

Write: coordinator sends to A, B, C
       Waits for 2 acknowledgments (W=2)
       Suppose A and B acknowledge. Last write = "Version 2"

Read:  coordinator sends to A, B, C
       Waits for 2 responses (R=2)
       Suppose B and C respond.
       B has Version 2 (latest). C has Version 1 (stale).
       R=2 means B's response is included.
       Coordinator returns Version 2 (latest).

       Why? Because B is in BOTH the write quorum {A, B} and
       the read quorum {B, C}. The intersection is non-empty.

If R + W <= N, overlap is NOT guaranteed:

N=3, R=1, W=1:
  Write quorum could be {A}. Read quorum could be {B}.
  No overlap โ‡’ read could get stale data.

R + W = N + 1: Strong Consistency

With R + W = N + 1 (e.g., N=3, R=2, W=2), you get strong consistency (linearizability): every read returns the latest write. This is the Dynamo default.

R + W <= N: Eventual Consistency

With R + W <= N (e.g., N=3, R=1, W=1), you get eventual consistency: reads may return stale data, but eventually all replicas will converge. The trade-off is lower latency and higher availability.

Tuning for Different Workloads

Workload Type              N    R    W    Consistency      Latency
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Read-heavy (catalog)       3    1    3    Weak consistent    Fast reads
Write-heavy (logging)      3    3    1    Weak consistent    Fast writes
Balanced (default)         3    2    2    Strong consistent  Balanced
High durability (finance)  5    3    3    Strong consistent  Higher latency

Read-heavy: R=1 means the coordinator returns the first response it gets. Very fast reads, but potentially stale. W=3 means all replicas must acknowledge every write โ€” writes are slower.

Write-heavy: W=1 means the coordinator returns after one acknowledgment. Very fast writes. But now R=3 means all replicas must respond for reads โ€” reads are slower.

Practical Implications

Dynamo (and DynamoDB) let you configure these per-operation or per-table. In Amazonโ€™s production environment:

  • Shopping cart service: N=3, R=2, W=2 (strong consistency for cart operations).
  • Session state service: N=3, R=1, W=2 (fast reads, reliable writes; stale sessions are tolerable).
  • Product catalog: N=3, R=1, W=1 (eventual consistency; propagation delay acceptable).

What W and R Really Cost

Each replica you wait for adds latency equal to the slowest of those replicas. With W=3, you wait for all three replicas โ€” latency = max(A, B, C). With W=1, latency = min(A, B, C) โ€” essentially the fastest replica. At scale, the difference between โ€œfastest replicaโ€ and โ€œthird fastest replicaโ€ can be tens of milliseconds due to network variance and load.

Check Your Understanding

  1. N = 3, R = 1, W = 3. A write quorum is {A, B, C}. A read quorum is {A}. Can the read return stale data? Why or why not?

  2. N = 5, R = 3, W = 3. A write is acknowledged by A, B, C. Later, a read gets responses from D, E, A. Is the read guaranteed to see the latest write?

  3. What happens if R + W > N but a node in the quorum intersection has crashed and lost data? Can the read still be consistent?

The โ€œSo What?โ€

Tuning (N, R, W) is how DynamoDB users control the cost/latency/consistency trade-off for each table. When you set โ€œstrongly consistent readsโ€ in DynamoDB, the system uses R=1 (for eventually consistent) vs. dynamically determining the right quorum for strong consistency. Understanding quorums lets you make informed trade-offs between performance and correctness in any quorum-based system.


โœ๏ธ Exercises

Dynamo: Exercises

  1. Quorum consistency. N = 3, R = 1, W = 3. Under what conditions can a read return stale data? What happens to the latency of writes and reads in this configuration?

  2. Hinted handoff durability. In a sloppy quorum scenario, node A is down. The coordinator writes key K to B, C, D (D holds hinted data for A). Then D crashes before handing off to A. B and C remain healthy. Node A recovers. How many copies of key K exist? Is any data lost?

  3. Merkle tree depth. A Dynamo node stores 16 million keys. It builds a Merkle tree of depth 16 (65,536 leaves). What is the average number of keys per bucket? If the root hashes differ between two nodes, how many hash comparisons are needed in the worst case to find the differing bucket?

  4. Dynamo vs. GFS. GFS uses a single master with lease-based replication. Dynamo uses a fully distributed ring with quorums. Why did Amazon choose the Dynamo architecture instead of a GFS-like master-based replication? What constraints drove this decision?

๐Ÿ‘๏ธ View Solutions

Dynamo: Solutions

1. Quorum consistency

Answer. N = 3, R = 1, W = 3. With R = 1, the read quorum is a single node. With W = 3, the write quorum is all three nodes (A, B, C). Since R + W = 4 > N = 3, any read quorum overlaps with any write quorum. The read is guaranteed to see the latest write because the single node that responds (R=1) was part of the write quorum of 3 (W=3). So the read always returns the latest data.

Write latency: the coordinator must wait for all 3 replicas to acknowledge (W=3). Write latency = max(A, B, C) โ€” the slowest replica. Writes are slower.

Read latency: the coordinator returns as soon as any 1 replica responds (R=1). Read latency = min(A, B, C). Reads are fast. However, the data could be stale if the coordinator doesnโ€™t pick the right replica (but with W=3, all replicas are up-to-date, so this is not an issue).

2. Hinted handoff durability

Answer. Timeline:

  • A is down. Write goes to B, C, D (D holds hinted copy for A).
  • D crashes. Dโ€™s local store (including the hinted data for A) is lost.
  • B and C have the data. A is back up but doesnโ€™t have the latest write.

Number of copies: B and C have the data (2 copies). The hinted copy on D is lost. So only 2 copies remain. If B or C also fails, the data could be lost permanently.

Is data lost? Yes, the latest write is partially lost โ€” specifically, the copy intended for A (which was hinted to D) is gone when D crashed. However, B and C still have the data, so the data is not completely lost. The system has N=3 but only 2 copies survive. Durability is temporarily reduced. When anti-entropy runs (or hinted handoff from another node), it will detect that A is missing the latest version and repair it.

This is the trade-off: sloppy quorums improve availability (writes always succeed) at the cost of temporarily reduced durability during cascading failures.

3. Merkle tree depth

Answer.

Average keys per bucket: 16,000,000 / 65,536 โ‰ˆ 244 keys per bucket.

Worst-case hash comparisons to find differing bucket: at each level of the tree, we compare 2 hashes (the two children of the differing node). Depth 16 means 16 levels. At each level, we compare 2 hashes. So 16 ร— 2 = 32 hash comparisons in the worst case. This is O(log L) where L = 65,536 leaves.

After finding the bucket, we transfer all keys in that bucket (โ‰ˆ244 keys) to identify the exact differing key-value pairs.

Contrast with a full comparison: without Merkle trees, weโ€™d need to transfer or compare all 16 million keys (each key-value pair sent over the network or compared locally). The Merkle tree reduces this to 32 hashes + 244 keys โ€” a 99.998% reduction in data transfer.

4. Dynamo vs. GFS

Answer. Amazon chose the Dynamo architecture for these reasons:

  1. Availability requirements. Amazonโ€™s shopping cart must always accept writes. If a customer adds an item to their cart, it must succeed. GFSโ€™s single master creates an availability bottleneck. Dynamo achieves 99.999% write availability through its fully distributed, masterless design.

  2. No single point of failure. Dynamo has no master. GFS has a single master (SPOF). For Amazonโ€™s SLA-driven services, any SPOF is unacceptable. The Dynamo ring means any node can fail without affecting availability โ€” the system degrades gracefully.

  3. Operational simplicity. Thereโ€™s no master to elect, promote, or failover. Nodes join and leave the ring independently. This makes operations (deployments, scaling) much simpler than managing a GFS-like master.

  4. Workload characteristics. GFS is optimized for large files and sequential access (search index, MapReduce). Dynamo is optimized for small objects and random access (shopping cart, session state). Dynamoโ€™s quorum system is better suited to key-value workloads than GFSโ€™s chunk-based approach.

  5. Consistency model flexibility. Dynamoโ€™s tunable consistency (N, R, W) lets each service pick its trade-off. GFS always provides strong consistency for single-chunk operations. Not every Amazon service needs strong consistency.

  6. No cross-rack file system dependencies. Dynamo works directly on local disks. GFS depends on a shared file system abstraction. Dynamoโ€™s simpler local-storage model is more portable and easier to manage at Amazonโ€™s scale.

In short: GFS optimizes for throughput (large sequential operations). Dynamo optimizes for availability (small random operations). They were designed for different workloads.