Reflection Real Systems · Riak KV

Key Question

Riak is the most pure Dynamo implementation, yet it never achieved Cassandra’s market share. What trade-offs explain this?

Deep Dive

Riak KV is a faithful implementation of the Dynamo paper. But faithfulness has costs.

Trade-off 1: Application-Level Conflict Resolution

Riak’s vector clock conflicts require the application to resolve them. This is philosophically correct — the application has domain knowledge — but it’s operationally painful.

// With Riak:
// Client reads, gets 2 siblings
// Client must decide: merge values, pick one, or prompt the user
// This logic must be in EVERY application that reads Riak

// With Cassandra:
// LWW: latest timestamp wins
// The application never sees conflicts
// But the last write may NOT be the semantically correct one

Real-world impact: Developers using Riak must write conflict resolution code. For simple cases (LWW is fine), this is unnecessary complexity. For complex cases (shopping cart merge), Riak’s approach is better.

Trade-off 2: The Vector Clock Bloat Problem

Vector clocks grow linearly with the number of writers. In a system with thousands of clients (or microservices) writing to the same key, the vector clock becomes enormous:

// A vector clock for a popular key after 10 minutes:
// [client-138:4, client-294:7, client-401:2, client-552:1, ...]
// Hundreds of entries!

// Riak's solution: "clock pruning" — remove old entries
// But this loses causal information → false positives (phantom siblings)

Riak 2.0 added “vnode-based vector clocks” where only the vnode (not the client) appears in the clock. This bounds the clock size to the number of physical nodes.

Trade-off 3: No Query Layer

Riak is a pure key-value store. You can fetch a key, or you can run map-reduce jobs (via Riak Pipe). But you cannot:

-- Not possible in Riak (without building secondary indexes manually)
SELECT * FROM users WHERE age > 30;
-- Riak search (Yokozuna) exists but is add-on, not native
-- Contrast with Cassandra CQL:
SELECT * FROM users WHERE age > 30 ALLOW FILTERING;

Cassandra’s CQL gives it a huge adoption advantage — developers write familiar SQL-like queries. Riak’s pure KV model requires you to design everything around key access patterns.

Trade-off 4: Erlang Niche

Riak is written in Erlang (running on BEAM). This gives it:

Hot code swapping: upgrade without downtime.
Supervisor trees: fault isolation between components.
Excellent concurrency: thousands of lightweight processes.

But it also means Riak is harder to contribute to (Erlang is niche), harder to debug (Erlang stack traces are notoriously cryptic), and harder to hire for.

The Dynamo Legacy

Despite Riak’s limited market share, its design directly influenced:

Cassandra’s LWW approach — simplified to avoid vector clocks.
Amazon DynamoDB — AWS’s managed version, adding tunable consistency.
Redis Cluster — which added consistent hashing but chose single-primary (not Dynamo).
Voldemort, Dynomite — other Dynamo clones.

Riak proved that Dynamo’s design works in production. Cassandra proved that Dynamo’s design needs a query interface and simpler conflict resolution for mass adoption.

Key Takeaways

Application-level conflict resolution is philosophically correct but operationally burdensome.
Vector clock bloat is a real problem that requires pruning strategies.
No query language limits developer adoption compared to Cassandra’s CQL.
Riak’s Erlang foundation provides excellent fault isolation at the cost of mainstream ecosystem support.

Full Source

View or download the complete implementation: riak.ts

Exercises

You’re building a shopping cart service. Would you use Riak’s vector clock sibling resolution or Cassandra’s LWW? Why?
What problem does “clock pruning” solve? What new problem does it introduce?
Why did Riak add CRDT-based buckets in version 2.0? What trade-off do CRDTs eliminate?

👁️ View Solutions

Shopping cart is the canonical example for Riak’s sibling resolution. A shopping cart on Node A adds “socks” (vector clock: [A:1]). Concurrently, a cart on Node B adds “shirt” (vector clock: [B:1]). With Cassandra’s LWW, one item is lost. With Riak, the application sees both siblings and must merge them: { socks, shirt }. Riak’s approach is correct here; LWW would lose data.
Clock pruning removes old (node, counter) entries from the vector clock, bounding its size. Without pruning, clocks would grow indefinitely. The new problem: phantom siblings — falsely detecting conflicts because the pruned clock no longer shows the causal relationship between writes.
CRDT-based buckets let the system resolve conflicts automatically without LWW data loss. For example, a counter CRDT merges all increments without losing any. This eliminates the trade-off between “lose data” (LWW) and “make the application resolve” (vector clock siblings). CRDTs give automatic correct resolution for specific data types.

✏️ Exercises

Riak KV — Exercises

Exercise 1

A Riak bucket is configured with n_val=3 and r=2, w=2. Three physical nodes host the data. One node is down. Can you still serve reads? Can you still serve writes? Why?

Exercise 2

You have microservices A, B, and C, all writing to the same key cart:user42. A writes at T=1, B writes concurrently at T=2 (without seeing A’s write), and C writes at T=3 (without seeing either). How many siblings does a subsequent read return? What does each sibling’s vector clock look like?

Exercise 3

How does hinted handoff interact with the w consistency level in Riak? If w=2 and only 1 node is available, does the write succeed?

Exercise 4

Explain why LWW mode in Riak can silently drop data. Give a concrete example involving a counter (not a user profile).

👁️ View Solutions

Yes to both. n_val=3 means the prefList has 3 vnodes. With one node down (hosting ~1/3 of the vnodes), 2 vnodes are available. r=2 and w=2 can both be satisfied by 2 available replicas. However, if the down node has 2 of the 3 vnodes in the prefList (unlikely with random distribution but possible), the quorum wouldn’t be met. This is why vnodes are spread across physical nodes — to minimize this scenario.
Three siblings. A’s clock: [A:1]. B’s clock: [B:1]. C’s clock: [C:1]. Each vector clock is a single-entry clock with the writing service’s counter. None are causally related (each has entries the others don’t). The read must return all three siblings. The client must resolve them — likely by merging the cart contents from all three.
In Riak, hinted handoff counts toward the w acknowledgment. If the coordinator writes to 1 available replica and stores 1 hint (to be delivered later), it has satisfied w=2. The write succeeds even though only 1 node physically stores the data. This is “sloppy quorum” — from the Dynamo paper. The durability, however, is weak: the hint is only on the coordinator and could be lost.
Counter example: two region servers both increment a visitor counter:
- Server US: counter = 1000 @ T=100
- Server EU: counter = 500 @ T=101
- LWW picks T=101 → counter = 500
- The US counter’s 1000 increments are LOST.
With a CRDT counter (PN Counter), the result would be 1500 (correct — both increments are preserved). This is the canonical case for Riak CRDTs over LWW.