Reflection Real Systems · Apache Cassandra

Key Question

How does Cassandra calculate which nodes own a key, and how does tunable consistency affect write behavior?

Deep Dive

The cassandra.ts file implements a minimal Cassandra-style cluster. Let me walk through the three core algorithms.

Finding the Responsible Node

When a write arrives, Cassandra computes the token (hash) of the partition key and finds the nearest token on the ring:

findResponsible(key: string): CassandraNode {
  const keyHash = hashToken(key)
  let closest = this.nodes[0]
  let minDist = -1n

  for (const node of this.nodes) {
    for (const token of node.tokens) {
      const dist = token >= keyHash
        ? token - keyHash
        : MAX_HASH - keyHash + token  // wrap around the ring
      if (minDist === -1n || dist < minDist) {
        minDist = dist
        closest = node
      }
    }
  }
  return closest
}

This is a linear scan — acceptable in our simulation, but Cassandra uses a more efficient approach (binary search on a sorted token list) in production. The key insight is the ring wrap: the distance between the last token and the first token is MAX - keyHash + firstToken.

Next-N Replicas

For replication (RF=3), Cassandra picks the coordinator node and the next 2 distinct nodes clockwise:

getReplicas(key: string, count: number): CassandraNode[] {
  const sorted = [...allNodes].sort((a, b) => {
    // Sort by proximity to keyHash on the ring
  })
  return sorted.slice(0, count)
}

This is the “leaderless” replication model from Dynamo. Any node can coordinate a write. There is no single master.

The Quorum Formula

The write operation enforces the requested consistency level:

write(key: string, value: string, cl: ConsistencyLevel): WriteResult {
  const replicas = this.getReplicas(key, 3) // RF = 3
  const required = cl === 'QUORUM' ? Math.floor(replicas.length / 2) + 1
    : cl === 'ALL' ? replicas.length
    : 1

  let writtenTo = 0
  for (const replica of replicas) {
    if (replica.state === 'UP') {
      replica.data.set(key, { value, timestamp: tick() })
      writtenTo++
    } else {
      // Hinted handoff: store for later retry
      replica.hintedHandoff.set(key, { ... })
    }
  }
  return { ok: writtenTo >= required, writtenTo, required }
}

CL=ALL: If any replica is down, the write fails. Strongest guarantee, lowest availability.
CL=QUORUM: Requires floor(N/2)+1 confirmations. Survives floor(N/2) failures.
CL=ONE: Requires 1 confirmation. Fastest. Weakest guarantee.

Read Repair

During a READ at QUORUM, Cassandra fetches data from all replicas, compares timestamps, and updates stale replicas in the background:

const latest = results.reduce((a, b) =>
  a.timestamp > b.timestamp ? a : b
)
for (const result of results) {
  if (result.timestamp < latest.timestamp) {
    result.node.data.set(key, { value: latest.value, timestamp: latest.timestamp })
  }
}

This is why eventual consistency converges: every read potentially repairs stale data.

Key Takeaways

Ring distance determines which node owns a key. Virtual nodes spread ownership across multiple physical nodes.
Quorum formula R + W > RF ensures strong consistency. Cassandra makes this tunable per request.
Read repair fixes stale replicas on every read.
Hinted handoff queues writes for unavailable nodes and delivers them on recovery.

Full Source

View or download the complete implementation: cassandra.ts

Exercises

Run the simulation. Change RF from 3 to 5 and add 2 more nodes. Does CL=QUORUM still guarantee strong consistency?
What happens if you write at CL=ALL when one replica is down?
In the simulation, why does N2 have stale data for user:100 before the read repair triggers?

👁️ View Solutions

With RF=5, QUORUM = floor(5/2)+1 = 3. R(QUORUM) + W(QUORUM) = 3 + 3 = 6 > 5. Yes, still strongly consistent. But with RF=5, ALL requires 5/5 nodes, which reduces write availability.
The write fails (ok: false). This is the trade-off: CL=ALL guarantees the strongest consistency but any single node failure blocks writes. This is why CL=ALL is rarely used in production.
In the simulation, we intentionally set N2’s value to old_alice with an earlier timestamp to demonstrate read repair. In production, this happens when a replica misses a write due to network issues or temporary downtime.

✏️ Exercises

Apache Cassandra — Exercises

Exercise 1

You have a 6-node Cassandra cluster with RF=3. Node A is the coordinator for a write to key K with CL=QUORUM. The replicas are Nodes B, C, D. Node C is down. How many nodes must acknowledge the write for it to succeed? Which nodes?

Exercise 2

Describe the sequence of events when a Cassandra read at CL=QUORUM discovers that one of the three replicas has a stale value. What happens to the stale replica? What happens to the client?

Exercise 3

Cassandra’s hinted handoff stores hints on the coordinator. If the coordinator fails before delivering the hints, what happens? How does this differ from Riak’s approach?

Exercise 4

A developer configures R+W ≤ RF (e.g., W=ONE, R=ONE with RF=3). They expect “eventual consistency.” Describe a scenario where a read returns a value that was explicitly overwritten 10 seconds ago.

👁️ View Solutions

QUORUM = floor(3/2)+1 = 2 nodes. The write succeeds if 2 of the 3 replicas (B, C, D) acknowledge. Since C is down, the coordinator sends to B and D. If both are up: ok: true. The coordinator also stores a hint for C, to be delivered when C recovers.
(a) The coordinator sends read requests to all RF replicas. (b) Two replicas return value=v2 @ ts=100. One returns value=v1 @ ts=50. (c) The coordinator sees ts=100 > ts=50 and returns v2 to the client. (d) In the background, the coordinator sends a write to the stale replica: SET key=v2 @ ts=100. (e) Subsequent reads from any replica return v2. The client sees the latest value immediately. The repair is asynchronous from the client’s perspective.
The hints are lost. Cassandra’s hinted handoff is stored locally. If the coordinator crashes before delivering them, the data is gone. Riak’s approach is similar — both rely on the coordinator surviving. Neither provides durable hinted handoff. This is why read repair is essential: it catches the inconsistency on the next read.
Scenario: A user updates their profile (SET name="Alice" → written to Node A only with CL=ONE). Before replication propagates, another user reads from Node B (which still has name="Alice (old)"). The read succeeds at CL=ONE (only Node B needs to respond) and returns the stale value. Even though the write was 10 seconds ago, gossip may not have propagated yet. This is “eventual” — there is no time guarantee.