Key Question
How does Cassandra calculate which nodes own a key, and how does tunable consistency affect write behavior?
Deep Dive
The cassandra.ts file implements a minimal Cassandra-style cluster. Let me walk through the three core algorithms.
Finding the Responsible Node
When a write arrives, Cassandra computes the token (hash) of the partition key and finds the nearest token on the ring:
findResponsible(key: string): CassandraNode {
const keyHash = hashToken(key)
let closest = this.nodes[0]
let minDist = -1n
for (const node of this.nodes) {
for (const token of node.tokens) {
const dist = token >= keyHash
? token - keyHash
: MAX_HASH - keyHash + token // wrap around the ring
if (minDist === -1n || dist < minDist) {
minDist = dist
closest = node
}
}
}
return closest
}
This is a linear scan — acceptable in our simulation, but Cassandra uses a more efficient approach (binary search on a sorted token list) in production. The key insight is the ring wrap: the distance between the last token and the first token is MAX - keyHash + firstToken.
Next-N Replicas
For replication (RF=3), Cassandra picks the coordinator node and the next 2 distinct nodes clockwise:
getReplicas(key: string, count: number): CassandraNode[] {
const sorted = [...allNodes].sort((a, b) => {
// Sort by proximity to keyHash on the ring
})
return sorted.slice(0, count)
}
This is the “leaderless” replication model from Dynamo. Any node can coordinate a write. There is no single master.
The Quorum Formula
The write operation enforces the requested consistency level:
write(key: string, value: string, cl: ConsistencyLevel): WriteResult {
const replicas = this.getReplicas(key, 3) // RF = 3
const required = cl === 'QUORUM' ? Math.floor(replicas.length / 2) + 1
: cl === 'ALL' ? replicas.length
: 1
let writtenTo = 0
for (const replica of replicas) {
if (replica.state === 'UP') {
replica.data.set(key, { value, timestamp: tick() })
writtenTo++
} else {
// Hinted handoff: store for later retry
replica.hintedHandoff.set(key, { ... })
}
}
return { ok: writtenTo >= required, writtenTo, required }
}
- CL=ALL: If any replica is down, the write fails. Strongest guarantee, lowest availability.
- CL=QUORUM: Requires
floor(N/2)+1confirmations. Survivesfloor(N/2)failures. - CL=ONE: Requires 1 confirmation. Fastest. Weakest guarantee.
Read Repair
During a READ at QUORUM, Cassandra fetches data from all replicas, compares timestamps, and updates stale replicas in the background:
const latest = results.reduce((a, b) =>
a.timestamp > b.timestamp ? a : b
)
for (const result of results) {
if (result.timestamp < latest.timestamp) {
result.node.data.set(key, { value: latest.value, timestamp: latest.timestamp })
}
}
This is why eventual consistency converges: every read potentially repairs stale data.
Key Takeaways
- Ring distance determines which node owns a key. Virtual nodes spread ownership across multiple physical nodes.
- Quorum formula
R + W > RFensures strong consistency. Cassandra makes this tunable per request. - Read repair fixes stale replicas on every read.
- Hinted handoff queues writes for unavailable nodes and delivers them on recovery.
Full Source
View or download the complete implementation: cassandra.ts
Exercises
- Run the simulation. Change RF from 3 to 5 and add 2 more nodes. Does CL=QUORUM still guarantee strong consistency?
- What happens if you write at CL=ALL when one replica is down?
- In the simulation, why does N2 have stale data for
user:100before the read repair triggers?
👁️ View Solutions
- With RF=5, QUORUM =
floor(5/2)+1 = 3.R(QUORUM) + W(QUORUM) = 3 + 3 = 6 > 5. Yes, still strongly consistent. But with RF=5, ALL requires 5/5 nodes, which reduces write availability. - The write fails (
ok: false). This is the trade-off: CL=ALL guarantees the strongest consistency but any single node failure blocks writes. This is why CL=ALL is rarely used in production. - In the simulation, we intentionally set N2’s value to
old_alicewith an earlier timestamp to demonstrate read repair. In production, this happens when a replica misses a write due to network issues or temporary downtime.
✏️ Exercises
Apache Cassandra — Exercises
Exercise 1
You have a 6-node Cassandra cluster with RF=3. Node A is the coordinator for a write to key K with CL=QUORUM. The replicas are Nodes B, C, D. Node C is down. How many nodes must acknowledge the write for it to succeed? Which nodes?
Exercise 2
Describe the sequence of events when a Cassandra read at CL=QUORUM discovers that one of the three replicas has a stale value. What happens to the stale replica? What happens to the client?
Exercise 3
Cassandra’s hinted handoff stores hints on the coordinator. If the coordinator fails before delivering the hints, what happens? How does this differ from Riak’s approach?
Exercise 4
A developer configures R+W ≤ RF (e.g., W=ONE, R=ONE with RF=3). They expect “eventual consistency.” Describe a scenario where a read returns a value that was explicitly overwritten 10 seconds ago.
👁️ View Solutions
-
QUORUM =
floor(3/2)+1 = 2nodes. The write succeeds if 2 of the 3 replicas (B, C, D) acknowledge. Since C is down, the coordinator sends to B and D. If both are up:ok: true. The coordinator also stores a hint for C, to be delivered when C recovers. -
(a) The coordinator sends read requests to all RF replicas. (b) Two replicas return
value=v2 @ ts=100. One returnsvalue=v1 @ ts=50. (c) The coordinator seests=100 > ts=50and returnsv2to the client. (d) In the background, the coordinator sends a write to the stale replica:SET key=v2 @ ts=100. (e) Subsequent reads from any replica returnv2. The client sees the latest value immediately. The repair is asynchronous from the client’s perspective. -
The hints are lost. Cassandra’s hinted handoff is stored locally. If the coordinator crashes before delivering them, the data is gone. Riak’s approach is similar — both rely on the coordinator surviving. Neither provides durable hinted handoff. This is why read repair is essential: it catches the inconsistency on the next read.
-
Scenario: A user updates their profile (
SET name="Alice"→ written to Node A only with CL=ONE). Before replication propagates, another user reads from Node B (which still hasname="Alice (old)"). The read succeeds at CL=ONE (only Node B needs to respond) and returns the stale value. Even though the write was 10 seconds ago, gossip may not have propagated yet. This is “eventual” — there is no time guarantee.