Reflection Real Systems · Apache Cassandra

Key Question

Dynamo (the paper) is theory. Cassandra (the product) is practice. Where do they diverge?

Deep Dive

Cassandra is at least as famous for its deviations from the Dynamo paper as for its adherence to it. These deviations reveal the gap between academic theory and production reality.

Deviation 1: LWW Conflict Resolution (No Vector Clocks)

The Dynamo paper describes vector clocks with sibling creation: if two concurrent writes happen, both values are stored as siblings, and the application resolves them on read. Cassandra rejected this. It uses Last-Write-Wins (LWW) with a client-supplied timestamp.

// Dynamo-style: store siblings, let app resolve
// Riak: vector clock → sibling creation
// Cassandra: timestamp → oldest value silently dropped
node.data.set(key, { value, timestamp: clientTimestamp })

Why? Because sibling creation is confusing for users. “Why does my read return two values?” was a constant support question for Riak. Cassandra chose simplicity and determinism over data integrity. The cost is silently lost concurrent writes.

Deviation 2: Hinted Handoff is NOT Durability

The Dynamo paper presents hinted handoff as a core mechanism for achieving “always writable” availability. In production, hinted handoff is fragile:

Hints are stored locally on the coordinator — if the coordinator crashes, hints are lost.
Hints are best-effort. They are not replicated.
Hints have a configurable TTL (default 3 hours). If a node is down longer, hints are discarded.

This means Cassandra’s “always writable” guarantee is weaker than it sounds. It works for short outages (minutes to hours) but not for multi-day failures.

Deviation 3: Compaction is the Operational Bottleneck

The Dynamo paper does not discuss compaction because it assumes an append-only log with periodic merging. In production Cassandra, compaction is the single biggest operational challenge:

Size-tiered compaction: Multiple SSTables of similar size are merged. Requires 50% free disk space.
Leveled compaction: Each key exists in exactly one SSTable per level. Lower disk overhead but higher I/O.
TWCS (Time Window Compaction): For time-series data. Strategically merged by time windows.

Misconfigured compaction is the leading cause of Cassandra production incidents.

Deviation 4: Gossip Is Slow

The SWIM protocol (studied in Module 4) converges in O(log N) rounds. Cassandra’s gossip is tuned for stability over speed:

Each node gossips with 1 random peer per second.
A new node typically takes 30-60 seconds to be fully visible to all nodes.
Failure detection via phi accrual takes 10-30 seconds (configurable).

This is deliberate: fast gossip creates network chatter. Cassandra chooses bandwidth efficiency over convergence speed.

Key Takeaways

Cassandra chose LWW over vector clocks for simplicity, sacrificing concurrent write safety.
Hinted handoff is best-effort, not durable. It protects against short outages.
Compaction is the hidden operational cost — more Cassandra outages are caused by compaction than by consensus failures.
Cassandra is a pragmatic evolution of Dynamo, not a literal implementation.

Full Source

View or download the complete implementation: cassandra.ts

Exercises

Cassandra uses client-supplied timestamps for LWW. What happens if a client’s clock is wrong (e.g., set to the year 2020)?
Why does Cassandra use CL=QUORUM for writes by default instead of CL=ALL?
Cassandra’s gossip interval is 1 second per peer. What happens to convergence time in a 500-node cluster? What trade-off is Cassandra making?

👁️ View Solutions

Any writes from that client will have a lower timestamp than real writes. If a write with timestamp 2020 reaches a node after a write with timestamp 2025, the 2025 write wins — the 2020 write is silently dropped as “stale.” A client clock skew of even a few seconds can cause data loss. This is why Cassandra operators synchronize client clocks with NTP.
CL=ALL means every write requires all RF replicas to be up. In a 3-node cluster, a single node failure blocks all writes. CL=QUORUM survives floor(RF/2) failures. Cassandra prioritizes availability over consistency at the default level.
With 1 peer per second per node, a 500-node cluster takes roughly O(log 500) ≈ 9 rounds for full gossip convergence. That’s 9 seconds — fast enough for membership changes but too slow for rapid failure detection. The trade-off is bandwidth: 500 nodes × 1 KB gossip messages = 500 KB/second total.

✏️ Exercises

Apache Cassandra — Exercises

Exercise 1

You have a 6-node Cassandra cluster with RF=3. Node A is the coordinator for a write to key K with CL=QUORUM. The replicas are Nodes B, C, D. Node C is down. How many nodes must acknowledge the write for it to succeed? Which nodes?

Exercise 2

Describe the sequence of events when a Cassandra read at CL=QUORUM discovers that one of the three replicas has a stale value. What happens to the stale replica? What happens to the client?

Exercise 3

Cassandra’s hinted handoff stores hints on the coordinator. If the coordinator fails before delivering the hints, what happens? How does this differ from Riak’s approach?

Exercise 4

A developer configures R+W ≤ RF (e.g., W=ONE, R=ONE with RF=3). They expect “eventual consistency.” Describe a scenario where a read returns a value that was explicitly overwritten 10 seconds ago.

👁️ View Solutions

QUORUM = floor(3/2)+1 = 2 nodes. The write succeeds if 2 of the 3 replicas (B, C, D) acknowledge. Since C is down, the coordinator sends to B and D. If both are up: ok: true. The coordinator also stores a hint for C, to be delivered when C recovers.
(a) The coordinator sends read requests to all RF replicas. (b) Two replicas return value=v2 @ ts=100. One returns value=v1 @ ts=50. (c) The coordinator sees ts=100 > ts=50 and returns v2 to the client. (d) In the background, the coordinator sends a write to the stale replica: SET key=v2 @ ts=100. (e) Subsequent reads from any replica return v2. The client sees the latest value immediately. The repair is asynchronous from the client’s perspective.
The hints are lost. Cassandra’s hinted handoff is stored locally. If the coordinator crashes before delivering them, the data is gone. Riak’s approach is similar — both rely on the coordinator surviving. Neither provides durable hinted handoff. This is why read repair is essential: it catches the inconsistency on the next read.
Scenario: A user updates their profile (SET name="Alice" → written to Node A only with CL=ONE). Before replication propagates, another user reads from Node B (which still has name="Alice (old)"). The read succeeds at CL=ONE (only Node B needs to respond) and returns the stale value. Even though the write was 10 seconds ago, gossip may not have propagated yet. This is “eventual” — there is no time guarantee.