Distributed & Decentralized Systems Curriculum
Reflection Real Systems · Riak KV

Key Question

How does Riak detect conflicting writes using vector clocks, and how does consistent hashing determine which nodes store which keys?

Deep Dive

The riak.ts file implements Riak’s core mechanisms. Let me walk through the key parts.

Consistent Hashing with PrefList

Riak uses a consistent hash ring. Given a key, it hashes the key and finds the responsible vnode. Then it replicates to the N successor vnodes (the “preference list” or prefList):

findPrefList(key: string, n: number = 3): VNode[] {
  const hash = this.hashString(key)
  const startIndex = Number(hash % BigInt(this.vnodes.length))
  const result: VNode[] = []

  for (let i = 0; i < n; i++) {
    const idx = (startIndex + i) % this.vnodes.length
    result.push(this.vnodes[idx])
  }

  return result
}

This is the Dynamo “write to N nodes” pattern. In a real Riak cluster, N is the n_val parameter. With n_val=3, each key is stored on exactly 3 vnodes.

Notice the hash function: it’s a simple polynomial hash. Riak’s real hash is SHA-1 (160-bit), and the ring has 2^64 partitions. Our simplified version works for demonstration.

Vector Clock Mechanics

When a write comes in, the coordinator:

  1. Collects vector clocks from all replicas.
  2. Merges them into a single clock.
  3. Increments its own entry.
put(key: string, value: string, clientNodeId: string): void {
  let latestClock = new VectorClock()

  // Step 1: Collect clocks from replicas
  for (const vnode of prefList) {
    const existing = vnode.data.get(key)
    if (existing) {
      const cmp = existing.vectorClock.compare(latestClock)
      if (cmp === 0 || cmp === -1) {
        // Merge clocks
        const merged = mergeClocks(existing.vectorClock, latestClock)
        latestClock = merged
      }
    }
  }

  // Step 2: Increment our counter
  latestClock.increment(clientNodeId)

  // Step 3: Write
  const obj: RiakObject = { key, value, vectorClock: latestClock, lastWriteTime: Date.now() }
  for (const vnode of prefList) {
    if (vnode is alive) vnode.data.set(key, obj)
    else hintedHandoff(key, obj)
  }
}

Conflict Detection

The vector clock compare() method is the heart of the system:

compare(other: VectorClock): number {
  let thisHasGreater = false
  let otherHasGreater = false

  for (const node of allNodes) {
    const a = this.counters.get(node) || 0
    const b = other.counters.get(node) || 0
    if (a > b) thisHasGreater = true
    if (b > a) otherHasGreater = true
  }

  if (!thisHasGreater && !otherHasGreater) return 2   // equal
  if (thisHasGreater && !otherHasGreater) return 1    // descendant
  if (!thisHasGreater && otherHasGreater) return -1   // ancestor
  return 0  // concurrent = CONFLICT!
}

Two clocks are concurrent when each has entries the other doesn’t — meaning the writes happened without causal ordering. Riak returns both values (siblings) and asks the client to resolve.

Last-Write-Wins Mode (LWW)

Riak can also operate in LWW mode, where timestamps resolve conflicts:

getLWW(key: string): string {
  let latestValue = '<not found>'
  let latestTimestamp = 0

  for (const vnode of prefList) {
    const obj = vnode.data.get(key)
    if (obj && obj.lastWriteTime > latestTimestamp) {
      latestValue = obj.value
    }
  }

  return latestValue
}

LWW is simpler but loses data silently. Riak’s CRDT buckets offer a third path: converge automatically without losing any operations.

Key Takeaways

  • Consistent hashing distributes keys via a ring of vnodes with N-way replication.
  • Vector clocks compare causality: concurrent clocks = conflict = siblings.
  • LWW mode prioritizes availability but silently drops writes.
  • Read repair catches stale replicas on every read.

Full Source

View or download the complete implementation: riak.ts

Exercises

  1. Run the simulation. What happens when N2 fails and a write is issued? Where does the write go?
  2. Modify the findPrefList function to skip unavailable vnodes. What changes?
  3. Why does Riak check ALL replicas during a read (in our simulation) but only wait for R in real Riak?

👁️ View Solutions

  1. When N2 fails, the coordinator cannot write to vnodes hosted on N2. It uses hinted handoff: the write is stored on the next available vnode (in our simulation, the first vnode not on N2). When N2 recovers, the hint is delivered to N2’s vnodes. The data is temporarily on a different physical node, but still replicated to N nodes total.
  2. Skipping unavailable vnodes means the write may go to fewer than N replicas. Riak’s write consistency is W replicas must acknowledge. If fewer than W are available, the write fails. Hinted handoff skips this check — it stores the hint and counts it as a successful replica (for durability semantics). This is the Dynamo “sloppy quorum” design.
  3. Real Riak uses R as the read consistency level (how many replicas must respond). Our simplified implementation reads all replicas. This is equivalent to R=ALL. In production, R=QUORUM (N/2+1) is typical. Reading all replicas gives the strongest consistency (all siblings detected) at the cost of latency.

✏️ Exercises

Riak KV — Exercises

Exercise 1

A Riak bucket is configured with n_val=3 and r=2, w=2. Three physical nodes host the data. One node is down. Can you still serve reads? Can you still serve writes? Why?

Exercise 2

You have microservices A, B, and C, all writing to the same key cart:user42. A writes at T=1, B writes concurrently at T=2 (without seeing A’s write), and C writes at T=3 (without seeing either). How many siblings does a subsequent read return? What does each sibling’s vector clock look like?

Exercise 3

How does hinted handoff interact with the w consistency level in Riak? If w=2 and only 1 node is available, does the write succeed?

Exercise 4

Explain why LWW mode in Riak can silently drop data. Give a concrete example involving a counter (not a user profile).


👁️ View Solutions

  1. Yes to both. n_val=3 means the prefList has 3 vnodes. With one node down (hosting ~1/3 of the vnodes), 2 vnodes are available. r=2 and w=2 can both be satisfied by 2 available replicas. However, if the down node has 2 of the 3 vnodes in the prefList (unlikely with random distribution but possible), the quorum wouldn’t be met. This is why vnodes are spread across physical nodes — to minimize this scenario.

  2. Three siblings. A’s clock: [A:1]. B’s clock: [B:1]. C’s clock: [C:1]. Each vector clock is a single-entry clock with the writing service’s counter. None are causally related (each has entries the others don’t). The read must return all three siblings. The client must resolve them — likely by merging the cart contents from all three.

  3. In Riak, hinted handoff counts toward the w acknowledgment. If the coordinator writes to 1 available replica and stores 1 hint (to be delivered later), it has satisfied w=2. The write succeeds even though only 1 node physically stores the data. This is “sloppy quorum” — from the Dynamo paper. The durability, however, is weak: the hint is only on the coordinator and could be lost.

  4. Counter example: two region servers both increment a visitor counter:

    • Server US: counter = 1000 @ T=100
    • Server EU: counter = 500 @ T=101
    • LWW picks T=101 → counter = 500
    • The US counter’s 1000 increments are LOST.

    With a CRDT counter (PN Counter), the result would be 1500 (correct — both increments are preserved). This is the canonical case for Riak CRDTs over LWW.