Key Question
How does Riak detect conflicting writes using vector clocks, and how does consistent hashing determine which nodes store which keys?
Deep Dive
The riak.ts file implements Riak’s core mechanisms. Let me walk through the key parts.
Consistent Hashing with PrefList
Riak uses a consistent hash ring. Given a key, it hashes the key and finds the responsible vnode. Then it replicates to the N successor vnodes (the “preference list” or prefList):
findPrefList(key: string, n: number = 3): VNode[] {
const hash = this.hashString(key)
const startIndex = Number(hash % BigInt(this.vnodes.length))
const result: VNode[] = []
for (let i = 0; i < n; i++) {
const idx = (startIndex + i) % this.vnodes.length
result.push(this.vnodes[idx])
}
return result
}
This is the Dynamo “write to N nodes” pattern. In a real Riak cluster, N is the n_val parameter. With n_val=3, each key is stored on exactly 3 vnodes.
Notice the hash function: it’s a simple polynomial hash. Riak’s real hash is SHA-1 (160-bit), and the ring has 2^64 partitions. Our simplified version works for demonstration.
Vector Clock Mechanics
When a write comes in, the coordinator:
- Collects vector clocks from all replicas.
- Merges them into a single clock.
- Increments its own entry.
put(key: string, value: string, clientNodeId: string): void {
let latestClock = new VectorClock()
// Step 1: Collect clocks from replicas
for (const vnode of prefList) {
const existing = vnode.data.get(key)
if (existing) {
const cmp = existing.vectorClock.compare(latestClock)
if (cmp === 0 || cmp === -1) {
// Merge clocks
const merged = mergeClocks(existing.vectorClock, latestClock)
latestClock = merged
}
}
}
// Step 2: Increment our counter
latestClock.increment(clientNodeId)
// Step 3: Write
const obj: RiakObject = { key, value, vectorClock: latestClock, lastWriteTime: Date.now() }
for (const vnode of prefList) {
if (vnode is alive) vnode.data.set(key, obj)
else hintedHandoff(key, obj)
}
}
Conflict Detection
The vector clock compare() method is the heart of the system:
compare(other: VectorClock): number {
let thisHasGreater = false
let otherHasGreater = false
for (const node of allNodes) {
const a = this.counters.get(node) || 0
const b = other.counters.get(node) || 0
if (a > b) thisHasGreater = true
if (b > a) otherHasGreater = true
}
if (!thisHasGreater && !otherHasGreater) return 2 // equal
if (thisHasGreater && !otherHasGreater) return 1 // descendant
if (!thisHasGreater && otherHasGreater) return -1 // ancestor
return 0 // concurrent = CONFLICT!
}
Two clocks are concurrent when each has entries the other doesn’t — meaning the writes happened without causal ordering. Riak returns both values (siblings) and asks the client to resolve.
Last-Write-Wins Mode (LWW)
Riak can also operate in LWW mode, where timestamps resolve conflicts:
getLWW(key: string): string {
let latestValue = '<not found>'
let latestTimestamp = 0
for (const vnode of prefList) {
const obj = vnode.data.get(key)
if (obj && obj.lastWriteTime > latestTimestamp) {
latestValue = obj.value
}
}
return latestValue
}
LWW is simpler but loses data silently. Riak’s CRDT buckets offer a third path: converge automatically without losing any operations.
Key Takeaways
- Consistent hashing distributes keys via a ring of vnodes with N-way replication.
- Vector clocks compare causality: concurrent clocks = conflict = siblings.
- LWW mode prioritizes availability but silently drops writes.
- Read repair catches stale replicas on every read.
Full Source
View or download the complete implementation: riak.ts
Exercises
- Run the simulation. What happens when N2 fails and a write is issued? Where does the write go?
- Modify the
findPrefListfunction to skip unavailable vnodes. What changes? - Why does Riak check ALL replicas during a read (in our simulation) but only wait for
Rin real Riak?
👁️ View Solutions
- When N2 fails, the coordinator cannot write to vnodes hosted on N2. It uses hinted handoff: the write is stored on the next available vnode (in our simulation, the first vnode not on N2). When N2 recovers, the hint is delivered to N2’s vnodes. The data is temporarily on a different physical node, but still replicated to N nodes total.
- Skipping unavailable vnodes means the write may go to fewer than N replicas. Riak’s write consistency is
Wreplicas must acknowledge. If fewer thanWare available, the write fails. Hinted handoff skips this check — it stores the hint and counts it as a successful replica (for durability semantics). This is the Dynamo “sloppy quorum” design. - Real Riak uses
Ras the read consistency level (how many replicas must respond). Our simplified implementation reads all replicas. This is equivalent toR=ALL. In production,R=QUORUM(N/2+1) is typical. Reading all replicas gives the strongest consistency (all siblings detected) at the cost of latency.
✏️ Exercises
Riak KV — Exercises
Exercise 1
A Riak bucket is configured with n_val=3 and r=2, w=2. Three physical nodes host the data. One node is down. Can you still serve reads? Can you still serve writes? Why?
Exercise 2
You have microservices A, B, and C, all writing to the same key cart:user42. A writes at T=1, B writes concurrently at T=2 (without seeing A’s write), and C writes at T=3 (without seeing either). How many siblings does a subsequent read return? What does each sibling’s vector clock look like?
Exercise 3
How does hinted handoff interact with the w consistency level in Riak? If w=2 and only 1 node is available, does the write succeed?
Exercise 4
Explain why LWW mode in Riak can silently drop data. Give a concrete example involving a counter (not a user profile).
👁️ View Solutions
-
Yes to both.
n_val=3means the prefList has 3 vnodes. With one node down (hosting ~1/3 of the vnodes), 2 vnodes are available.r=2andw=2can both be satisfied by 2 available replicas. However, if the down node has 2 of the 3 vnodes in the prefList (unlikely with random distribution but possible), the quorum wouldn’t be met. This is why vnodes are spread across physical nodes — to minimize this scenario. -
Three siblings. A’s clock:
[A:1]. B’s clock:[B:1]. C’s clock:[C:1]. Each vector clock is a single-entry clock with the writing service’s counter. None are causally related (each has entries the others don’t). The read must return all three siblings. The client must resolve them — likely by merging the cart contents from all three. -
In Riak, hinted handoff counts toward the
wacknowledgment. If the coordinator writes to 1 available replica and stores 1 hint (to be delivered later), it has satisfiedw=2. The write succeeds even though only 1 node physically stores the data. This is “sloppy quorum” — from the Dynamo paper. The durability, however, is weak: the hint is only on the coordinator and could be lost. -
Counter example: two region servers both increment a visitor counter:
- Server US:
counter = 1000@ T=100 - Server EU:
counter = 500@ T=101 - LWW picks T=101 →
counter = 500 - The US counter’s 1000 increments are LOST.
With a CRDT counter (PN Counter), the result would be
1500(correct — both increments are preserved). This is the canonical case for Riak CRDTs over LWW. - Server US: