Distributed & Decentralized Systems Curriculum
Reflection Real Systems · Riak KV

Key Question

How does Riak KV combine Amazon Dynamo’s architecture with Erlang’s concurrency model to create a truly decentralized key-value store?

Deep Dive

Riak KV is the most faithful open-source implementation of Amazon’s Dynamo architecture (see Module 2). While Cassandra also comes from the Dynamo lineage, Riak stays closer to the original design in several key ways.

Virtual Nodes (Vnodes)

Riak partitions the hash ring into a fixed number of vnodes (default 2^64, or about 10^19 partitions). Physical servers each claim a range of these vnodes. This creates a consistent hashing layer that:

  1. Balances load automatically: a server with more capacity can claim more vnodes.
  2. Minimizes redistribution on joins/leaves: each server’s departure affects only its nearby neighbors.
  3. Handles heterogeneity elegantly: a 64-core server can claim 5x more vnodes than an 8-core server.
// Each physical node hosts multiple vnodes
const vnodes: VNode[] = []
for (let i = 0; i < 12; i++) {
  const physicalNode = nodes[i % nodes.length]
  vnodes.push(new VNode(`vnode${i}`, physicalNode))
}

Vector Clocks

Riak’s causal consistency mechanism. Every write carries a vector clock — a list of (node, counter) pairs. When a read returns multiple objects with concurrent vector clocks, the application must resolve the conflict.

class VectorClock {
  counters: Map<string, number> = new Map()

  increment(nodeId: string): void {
    this.counters.set(nodeId, (this.counters.get(nodeId) || 0) + 1)
  }
}

Riak has an important advantage over Amazon’s original Dynamo: it ships with built-in conflict resolvers. Turn on “last-write-wins” (LWW) and the system resolves using timestamps. But the real power is CRDT-backed buckets — Riak 2.0 added convergent data types so conflicts self-resolve at the storage layer.

Read Repair & Hinted Handoff

Riak’s design (like Cassandra’s) treats unavailable nodes as temporary. Writes that can’t reach their target get hinted handoff (stored on the coordinator). Reads check all replicas and repair stale ones:

// Read repair: update stale replicas
for (const obj of results) {
  if (obj.lastWriteTime < latest.lastWriteTime) {
    const staleVNode = prefList.find(v => v.data.get(key) === obj)
    if (staleVNode) {
      staleVNode.data.set(key, latest)
    }
  }
}

AP by Default

Riak is an AP system (available, partition-tolerant — but not consistent). During a network partition, both sides accept writes. When the partition heals, vector clock conflicts appear and must be resolved.

Key Takeaways

  • Riak stays closest to the original Dynamo design (consistent hashing, vector clocks, hinted handoff).
  • Virtual nodes provide fine-grained load balancing across heterogeneous hardware.
  • Vector clocks detect concurrent writes and surface conflicts to the application.
  • Riak is AP by nature — it accepts writes on both sides of a partition.

Full Source

View or download the complete implementation: riak.ts

Exercises

  1. Riak uses vnodes instead of simple node-based rings (like Redis). What problem does this solve?
  2. What is the difference between hinted handoff and read repair? Which provides stronger durability guarantees?
  3. Why does Riak surface conflicts to the application rather than silently resolving them (like Cassandra’s LWW)?

👁️ View Solutions

  1. Vnodes solve three problems: (a) Hot spotting — a popular key only affects one vnode, not an entire physical node. (b) Uneven capacity — assign more vnodes to beefier servers. (c) Redistribution cost — adding/removing a node only shuffles its vnodes (small fraction of total data), not a full rehash.
  2. Hinted handoff ensures durability during temporary failures (a write is stored on an alternate node). Read repair ensures eventual consistency (stale data is corrected on read). Hinted handoff is pro-active (write survives even if target is down). Read repair is reactive. Together they provide a belt-and-suspenders approach: hinted handoff for durability, read repair for convergence.
  3. Because “last write wins” loses data. If two clients concurrently update Alice’s record, LWW silently drops one update. Riak’s philosophy: the application has domain knowledge to make the right merge decision. Riak provides CRDTs as a middle ground — the operations are designed to converge automatically without LWW data loss.

✏️ Exercises

Riak KV — Exercises

Exercise 1

A Riak bucket is configured with n_val=3 and r=2, w=2. Three physical nodes host the data. One node is down. Can you still serve reads? Can you still serve writes? Why?

Exercise 2

You have microservices A, B, and C, all writing to the same key cart:user42. A writes at T=1, B writes concurrently at T=2 (without seeing A’s write), and C writes at T=3 (without seeing either). How many siblings does a subsequent read return? What does each sibling’s vector clock look like?

Exercise 3

How does hinted handoff interact with the w consistency level in Riak? If w=2 and only 1 node is available, does the write succeed?

Exercise 4

Explain why LWW mode in Riak can silently drop data. Give a concrete example involving a counter (not a user profile).


👁️ View Solutions

  1. Yes to both. n_val=3 means the prefList has 3 vnodes. With one node down (hosting ~1/3 of the vnodes), 2 vnodes are available. r=2 and w=2 can both be satisfied by 2 available replicas. However, if the down node has 2 of the 3 vnodes in the prefList (unlikely with random distribution but possible), the quorum wouldn’t be met. This is why vnodes are spread across physical nodes — to minimize this scenario.

  2. Three siblings. A’s clock: [A:1]. B’s clock: [B:1]. C’s clock: [C:1]. Each vector clock is a single-entry clock with the writing service’s counter. None are causally related (each has entries the others don’t). The read must return all three siblings. The client must resolve them — likely by merging the cart contents from all three.

  3. In Riak, hinted handoff counts toward the w acknowledgment. If the coordinator writes to 1 available replica and stores 1 hint (to be delivered later), it has satisfied w=2. The write succeeds even though only 1 node physically stores the data. This is “sloppy quorum” — from the Dynamo paper. The durability, however, is weak: the hint is only on the coordinator and could be lost.

  4. Counter example: two region servers both increment a visitor counter:

    • Server US: counter = 1000 @ T=100
    • Server EU: counter = 500 @ T=101
    • LWW picks T=101 → counter = 500
    • The US counter’s 1000 increments are LOST.

    With a CRDT counter (PN Counter), the result would be 1500 (correct — both increments are preserved). This is the canonical case for Riak CRDTs over LWW.