Key Question
When two nodes accept writes that conflict, how does the system decide which one wins?
Deep Dive
In an eventually consistent system, any node can accept writes at any time. When two clients write to different replicas concurrently, or when the same item is modified on both sides of a partition, conflicts arise. The system must resolve these conflicts to converge to a single value.
There are three main approaches, each with trade-offs:
1. Last Writer Wins (LWW)
LWW uses a timestamp (physical or logical) to determine which write is “latest.” The write with the highest timestamp wins.
Replica R1: write X = "A" at T=100
Replica R2: write X = "B" at T=200
Resolution: X = "B" (T=200 > T=100)
Pros:
- Simple to implement
- Always converges to one value
- No client involvement needed
- Bounded storage (one version per object)
Cons:
- Can lose data (B overwrote A even if A’s user didn’t know about B)
- Clock skew can cause issues (if R1’s clock is fast, T=100 might be “later” than T=200 in real time)
- Breaks causality (the ordering might not reflect what actually happened)
Dynamo’s Shopping Cart Example:
Client 1 (phone app): adds "milk" to cart
Client 2 (web app): adds "eggs" to cart
With LWW (timestamps):
Phone write: cart = {milk}, timestamp=10:00:05
Web write: cart = {eggs}, timestamp=10:00:06
Resolution: cart = {eggs} ← milk is LOST!
The user has lost an item from their cart. LWW is destructive when concurrent operations update different fields/elements.
2. Multi-Value Registers (Siblings / Vector Clocks)
Instead of keeping one value, the system keeps ALL conflicting values (called “siblings”) and lets the application resolve the conflict on the next read.
Dynamo uses vector clocks to track causality:
Version: ([R1, 1], [R2, 0]) = {milk, eggs}
Each node maintains a version counter per object. The vector clock ([R1, 1], [R2, 1]) says: both R1 and R2 have modified this object. If two versions are not causally related (neither is an ancestor of the other), they are siblings.
Client 1: writes cart = {milk}
→ Version: ([R1, 1])
Client 2: writes cart = {eggs} (concurrent, doesn't know about milk)
→ Version: ([R2, 1])
These are concurrent: neither dominates the other.
Resolution: keep both siblings: {milk} and {eggs}
On the next read, the application receives both values:
Read returns: [{milk}, {eggs}]
Application merges: {milk, eggs}
Write merged value: {milk, eggs} with version ([R1,1], [R2,1])
No data is lost. The application gets to decide the merge.
3. CRDTs (Conflict-free Replicated Data Types)
CRDTs are data structures designed so that concurrent updates always merge deterministically without conflict. The mathematics guarantee convergence without coordination.
Types of CRDTs:
- G-Counter (Grow-only Counter): Each node has its own counter. The total is the sum of all node counters. Concurrent increments commute via addition.
Node A: +1 → A=1, B=0 → total=1
Node B: +1 → A=0, B=1 → total=1
Merge: A=1, B=1 → total=2
Both nodes see total=2 after merge. Correct!
-
PN-Counter (Positive-Negative Counter): Two G-Counters: one for increments, one for decrements.
-
G-Set (Grow-only Set): Add elements only. Removals not supported. Union operation is always commutative.
-
OR-Set (Observed-Remove Set): Supports both add and remove. Each element has a unique tag. Add creates a tag. Remove adds a tombstone for that tag. Concurrent add and remove: add wins (because remove only removes observed tags).
CRDTs are elegant because they resolve conflicts by design — no timestamps, no vector clocks, no application logic needed. However, not all data structures have a CRDT version, and CRDTs can be memory-intensive (tombstones, per-node counters).
Check Your Understanding
- Under LWW, can clock skew ever cause the “wrong” write to win? How?
- Why might you prefer vector clock siblings over LWW for a collaborative document editing application?
- What is a “sibling” in Dynamo’s conflict resolution model?
- How does an OR-Set ensure that a concurrent “add” and “remove” of the same element doesn’t lose the element?
The “So What?”
Choosing LWW vs siblings vs CRDTs is one of the most consequential decisions when designing an eventually consistent system. LWW is simple and fast but loses data. Siblings preserve data integrity but require application-level merge logic. CRDTs are the holy grail — automatic convergence — but are memory-intensive and don’t cover every use case. This is a fundamental trade-off between simplicity and data integrity, and there’s no universally correct answer.
✏️ Exercises
Eventual Consistency — Exercises
Exercise 1
Under the BASE model, is it acceptable for a system to return stale data indefinitely (i.e., never converge)? Explain your answer using the definition of “eventual consistency.”
Exercise 2
You’re building a commenting system on an eventually consistent database. You want to ensure a user always sees their own comment immediately after posting, but you cannot use sticky sessions (the load balancer doesn’t support it). How would you implement read-your-writes?
Exercise 3
In a gossip protocol with 100 nodes, where each round involves each node contacting one random peer (push only), how many rounds are needed for 99% of nodes to receive an update? Assume the update starts at a single node.
Exercise 4
You’re building a distributed shopping cart that must never lose items (even under concurrent writes from different devices). Would you choose LWW or vector clock siblings? Explain your reasoning. Under what conditions would LWW be the better choice?
👁️ View Solutions
Eventual Consistency — Solutions
Solution 1
No, it is not acceptable. The “E” in BASE stands for “Eventual consistency,” which has a specific formal meaning: there exists a time T such that for all t ≥ T, all reads return the same value. The word “eventual” means convergence is guaranteed (though the time bound is unspecified). If the system never converges, it violates even the weak guarantees of BASE.
A system that never converges would be “inconsistent” in the worst sense — different clients would permanently see different values. This is not eventual consistency; it’s just inconsistency. Systems like Dynamo and Cassandra guarantee eventual convergence through anti-entropy, gossip, read repair, and conflict resolution mechanisms. They do not guarantee “when” but they guarantee “that.”
The practical answer: your system must eventually converge. If it doesn’t, you have a bug in your anti-entropy or conflict resolution logic.
Solution 2
You can implement read-your-writes without sticky sessions using client-provided version tokens:
-
Write phase: When the user posts a comment, the server accepts the write and returns a version token (e.g., a timestamp or a write ID).
POST /comment {"text": "hello"} → Response 201: {"comment_id": 42, "version_token": "T1001"} -
Read phase: The client includes the version token in subsequent GET requests:
GET /comments?after_token=T1001 -
Server side: The coordinating node checks its replica. If the replica’s version is behind T1001, it either:
- Forwards the read to a replica that has seen the write (coordination).
- Blocks the read until the replica catches up via anti-entropy or a direct read-from-primary.
- Returns a “try again” response (least desirable — but the client can retry).
Alternative approach: Client-side compensation. The client sends the POST, gets the response, and optimistically adds the comment to the local UI state. Even if the subsequent GET doesn’t return it, the UI shows it. This is a UX trick, not true RYW, but it works surprisingly well for commenting systems.
Solution 3
With push-only gossip, the infection spreads roughly exponentially. The number of infected nodes after k rounds starting from 1 infected node follows:
- Round 0: 1 node infected
- Round 1: 2 nodes (1 contacts 1)
- Round k: 2^k nodes infected (before hitting N)
For 99% of 100 nodes (99 nodes infected):
We need: 2^k ≥ 99 k ≥ log₂(99) ≈ 6.62
So 7 rounds are needed.
This is the theoretical ideal (random peer selection with no collisions). In practice, collisions happen (two nodes contact the same peer), so it might take 8-10 rounds. The formula works well when N is large relative to the number of infected nodes.
For the last 1% (node that hasn’t been infected yet): the problem is that late-stage propagation slows down because most peers already have the data, so the chance of contacting an uninfected node decreases. This is the “diminishing returns” of gossip — the first 50% spreads fast, the last 10% takes proportionally longer.
Solution 4
For a shopping cart that must never lose items: choose vector clock siblings.
Here’s why:
- With LWW, concurrent writes from different devices can cause item loss. If the phone adds “milk” and the web browser adds “eggs” at nearly the same time, only one survives.
- Vector clocks preserve both writes as siblings, and the merge logic produces the union: {milk, eggs}.
- No items are lost.
When LWW would be better:
- When the data is not additive (e.g., updating a profile’s “status message” — you want the latest one, not both).
- When application-level merge logic is too complex or error-prone.
- When storage is a concern (siblings can accumulate).
- When you’re willing to trade data integrity for simplicity.
Most production shopping cart systems actually use a hybrid: they use vector clocks for cart items (additive) but LWW for cart-level metadata like “last modified date.”