Key Question
How does Cassandra combine Dynamo’s distribution model with Bigtable’s data model to create a database that never goes down?
Deep Dive
Cassandra is the most direct production descendant of Amazon’s Dynamo paper. It takes Dynamo’s core ideas (consistent hashing, quorums, gossip, hinted handoff, read repair) and adds a Bigtable-inspired column-family data model. The result is a database designed to survive entire data center failures while maintaining high write throughput.
The Token Ring with Virtual Nodes
Like Dynamo, Cassandra uses consistent hashing on a ring. But Cassandra adds virtual nodes (vnodes): each physical node owns multiple random tokens on the ring.
Node A: tokens [T1, T5, T9]
Node B: tokens [T2, T6, T10]
Node C: tokens [T3, T7, T11]
Vnodes solve a critical operational problem: when a node fails, its load is spread across many nodes (not just the immediate neighbor). This prevents a single node from being overwhelmed during recovery.
Tunable Consistency
Cassandra’s most distinctive feature is tunable consistency. Every read and write operation specifies a consistency level:
type ConsistencyLevel = 'ONE' | 'QUORUM' | 'ALL'
// CL=ONE: Fast. Only 1 replica must ack.
// CL=QUORUM: Required = floor(RF/2) + 1. Safe with R+W > RF.
// CL=ALL: Slowest. All replicas must ack.
The magic is in the combination: if R + W > RF, you get strong consistency. If R + W <= RF, you get eventual consistency. The same cluster can serve both strongly consistent reads (CL=ALL) and eventually consistent reads (CL=ONE) simultaneously.
Gossip-Based Failure Detection
Cassandra uses the Phi Accrual Failure Detector (studied in Module 5). Each node gossips its heartbeats, and the suspicion level (φ) is computed from the probability distribution of past heartbeat intervals. If φ exceeds a threshold (default 8), the node is marked down.
Hinted Handoff
When a write target is down, the coordinating node stores the write as a “hint” and replays it when the node recovers. This is why Cassandra can achieve its famous “always writable” guarantee.
Key Takeaways
- Cassandra is Dynamo + Bigtable: Dynamo’s ring/gossip/quorum + Bigtable’s column-family model.
- Tunable consistency lets the same cluster serve both strongly consistent and eventually consistent requests.
- Vnodes spread recovery load across many nodes instead of overloading the immediate successor.
- Cassandra is PA/EL (with tunable ELC) — it favors availability during partitions and lets the operator choose latency vs. consistency during normal operation.
Full Source
View or download the complete implementation: cassandra.ts
Exercises
- In a cluster with RF=5, what consistency level guarantees strong consistency? What about with RF=3?
- What problem do vnodes solve that single-token-per-node does not?
- Why is Cassandra described as PA/EL rather than just PA? What does the “EL” part mean operationally?
👁️ View Solutions
- RF=5:
R + W > 5. Standard choices are W=QUORUM (3), R=QUORUM (3) → 3+3=6≥5. Or W=ALL (5), R=ONE (1) → 5+1=6≥5. RF=3: W=QUORUM (2), R=QUORUM (2) → 2+2=4≥3.R + W > RFis the formula for strong consistency. - Without vnodes, when Node A fails, Node B (the next node clockwise) inherits ALL of A’s data. This can overload B and cause cascading failures. With vnodes, the load is spread across multiple successor nodes. Vnodes also make adding nodes more efficient — the new node takes a fraction of tokens from every existing node.
- “EL” means “Else (normal operation) — choose between Latency and Consistency.” Operationally, this means you set per-request consistency levels: CL=ONE for low-latency reads, CL=QUORUM for consistent reads. The same cluster handles both.
✏️ Exercises
Apache Cassandra — Exercises
Exercise 1
You have a 6-node Cassandra cluster with RF=3. Node A is the coordinator for a write to key K with CL=QUORUM. The replicas are Nodes B, C, D. Node C is down. How many nodes must acknowledge the write for it to succeed? Which nodes?
Exercise 2
Describe the sequence of events when a Cassandra read at CL=QUORUM discovers that one of the three replicas has a stale value. What happens to the stale replica? What happens to the client?
Exercise 3
Cassandra’s hinted handoff stores hints on the coordinator. If the coordinator fails before delivering the hints, what happens? How does this differ from Riak’s approach?
Exercise 4
A developer configures R+W ≤ RF (e.g., W=ONE, R=ONE with RF=3). They expect “eventual consistency.” Describe a scenario where a read returns a value that was explicitly overwritten 10 seconds ago.
👁️ View Solutions
-
QUORUM =
floor(3/2)+1 = 2nodes. The write succeeds if 2 of the 3 replicas (B, C, D) acknowledge. Since C is down, the coordinator sends to B and D. If both are up:ok: true. The coordinator also stores a hint for C, to be delivered when C recovers. -
(a) The coordinator sends read requests to all RF replicas. (b) Two replicas return
value=v2 @ ts=100. One returnsvalue=v1 @ ts=50. (c) The coordinator seests=100 > ts=50and returnsv2to the client. (d) In the background, the coordinator sends a write to the stale replica:SET key=v2 @ ts=100. (e) Subsequent reads from any replica returnv2. The client sees the latest value immediately. The repair is asynchronous from the client’s perspective. -
The hints are lost. Cassandra’s hinted handoff is stored locally. If the coordinator crashes before delivering them, the data is gone. Riak’s approach is similar — both rely on the coordinator surviving. Neither provides durable hinted handoff. This is why read repair is essential: it catches the inconsistency on the next read.
-
Scenario: A user updates their profile (
SET name="Alice"→ written to Node A only with CL=ONE). Before replication propagates, another user reads from Node B (which still hasname="Alice (old)"). The read succeeds at CL=ONE (only Node B needs to respond) and returns the stale value. Even though the write was 10 seconds ago, gossip may not have propagated yet. This is “eventual” — there is no time guarantee.