Key Question
How do replicas in an eventually consistent system find and fix their differences?
Deep Dive
In an eventually consistent system with multiple replicas, writes are accepted at any replica. Over time, replicas diverge. To converge, replicas need a way to exchange data and resolve differences. This is called anti-entropy — the background process that keeps replicas in sync.
There are three main gossip-based approaches:
1. Push Gossip
A node periodically picks a random peer and pushes its data to that peer.
Node A: "Hey Node B, here's everything I know."
Node B: "Thanks, I'll merge anything new into my state."
- Pros: Fast propagation. After O(log N) rounds, all nodes are infected.
- Cons: The sender doesn’t learn anything new. Wasted bandwidth if B already has the data.
2. Pull Gossip
A node periodically picks a random peer and requests new data from it.
Node A: "Hey Node B, what do you know that I don't?"
Node B: "Here's my data. Merge as needed."
Node A: (merges)
- Pros: The requester always learns something new. More efficient than push for many scenarios.
- Cons: Slightly slower propagation than push in the initial rounds.
3. Push-Pull Gossip (best of both)
A node picks a random peer, pushes its data, AND pulls the peer’s data in a single exchange.
Round 1: Node A sends its data to Node B.
Node B merges A's data, then sends ITS data to A.
Node A merges B's data.
Result: After one exchange, both nodes have each other's information.
Push-pull is the most common deployment in practice (used by Cassandra, Riak, and others).
Infection Spread Model
Gossip propagation follows the epidemic model. With N nodes, each round each node contacts one random peer:
| Rounds | Nodes infected (push, starting from 1) |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 8 |
| k | 2^k (until ~N) |
For N=100 nodes, after ⌈log₂(100)⌉ = 7 rounds, all nodes are infected. Each round takes roughly one gossip interval (e.g., 1 second), so convergence takes ~7 seconds.
Merkle Trees for Efficient Comparison
Comparing full data sets between two nodes is O(n) — expensive when each node has terabytes of data. Merkle trees reduce this to O(log n).
A Merkle tree is a hash tree:
- Leaf nodes = hashes of individual data items (or ranges).
- Internal nodes = hash of child hashes.
- Root = hash of entire tree.
Root hash: H(H(1-4) + H(5-8))
/ \
H(1-4): H(h1+h2+h3+h4) H(5-8): H(h5+h6+h7+h8)
/ | \ / | \
h1=H(v1) h2=H(v2) h3=H(v3) h4=H(v4) h5=H(v5) h6=H(v6) ...
When two replicas compare trees:
- Compare root hashes. If equal: trees are identical. Done in O(1).
- If different: recurse into children. Compare at next level.
- Continue until leaf level, identifying exactly which ranges differ.
In the worst case (everything differs), you compare all leaves — O(n). But in practice, most ranges are identical, so you only traverse the differing branches.
Read Repair
Anti-entropy runs on a timer (e.g., every second). But there’s a faster convergence mechanism: read repair.
When a client reads data, the coordinator sends the request to multiple replicas. If responses differ, the coordinator returns the most recent to the client AND writes the latest value back to the stale replicas:
Client: "read X" → coordinator
Coordinator → R1: response X=42 (version 5)
Coordinator → R2: response X=10 (version 2)
Coordinator → R3: response X=42 (version 5)
Coordinator: returns X=42 to client
Coordinator: writes X=42 (version 5) → R2 (repair!)
R2 is now up-to-date without waiting for gossip.
Read repair makes convergence happen FAST for data that is actively being read. Cold data (rarely accessed) relies on slower background anti-entropy.
Check Your Understanding
- In a gossip protocol with 1000 nodes, roughly how many rounds until 99% of nodes have received an update?
- Why does push-pull gossip converge faster than pure push or pure pull?
- A Merkle tree comparison between two replicas finds differing root hashes but identical left-subtree hashes. What does this tell you?
- Under what conditions does read repair fail to converge the system?
The “So What?”
Anti-entropy is the engine that makes “eventually” eventually happen. Without it, replicas would drift apart permanently. Understanding gossip + Merkle trees helps you reason about convergence time: how long after a write will all nodes have the latest data? If your application has a “consistency SLA” (e.g., “all nodes converged within 30 seconds”), you need to tune your gossip interval and Merkle tree resolution accordingly. Read repair adds a nice boost for hot data, but cold data relies entirely on background anti-entropy.
✏️ Exercises
Eventual Consistency — Exercises
Exercise 1
Under the BASE model, is it acceptable for a system to return stale data indefinitely (i.e., never converge)? Explain your answer using the definition of “eventual consistency.”
Exercise 2
You’re building a commenting system on an eventually consistent database. You want to ensure a user always sees their own comment immediately after posting, but you cannot use sticky sessions (the load balancer doesn’t support it). How would you implement read-your-writes?
Exercise 3
In a gossip protocol with 100 nodes, where each round involves each node contacting one random peer (push only), how many rounds are needed for 99% of nodes to receive an update? Assume the update starts at a single node.
Exercise 4
You’re building a distributed shopping cart that must never lose items (even under concurrent writes from different devices). Would you choose LWW or vector clock siblings? Explain your reasoning. Under what conditions would LWW be the better choice?
👁️ View Solutions
Eventual Consistency — Solutions
Solution 1
No, it is not acceptable. The “E” in BASE stands for “Eventual consistency,” which has a specific formal meaning: there exists a time T such that for all t ≥ T, all reads return the same value. The word “eventual” means convergence is guaranteed (though the time bound is unspecified). If the system never converges, it violates even the weak guarantees of BASE.
A system that never converges would be “inconsistent” in the worst sense — different clients would permanently see different values. This is not eventual consistency; it’s just inconsistency. Systems like Dynamo and Cassandra guarantee eventual convergence through anti-entropy, gossip, read repair, and conflict resolution mechanisms. They do not guarantee “when” but they guarantee “that.”
The practical answer: your system must eventually converge. If it doesn’t, you have a bug in your anti-entropy or conflict resolution logic.
Solution 2
You can implement read-your-writes without sticky sessions using client-provided version tokens:
-
Write phase: When the user posts a comment, the server accepts the write and returns a version token (e.g., a timestamp or a write ID).
POST /comment {"text": "hello"} → Response 201: {"comment_id": 42, "version_token": "T1001"} -
Read phase: The client includes the version token in subsequent GET requests:
GET /comments?after_token=T1001 -
Server side: The coordinating node checks its replica. If the replica’s version is behind T1001, it either:
- Forwards the read to a replica that has seen the write (coordination).
- Blocks the read until the replica catches up via anti-entropy or a direct read-from-primary.
- Returns a “try again” response (least desirable — but the client can retry).
Alternative approach: Client-side compensation. The client sends the POST, gets the response, and optimistically adds the comment to the local UI state. Even if the subsequent GET doesn’t return it, the UI shows it. This is a UX trick, not true RYW, but it works surprisingly well for commenting systems.
Solution 3
With push-only gossip, the infection spreads roughly exponentially. The number of infected nodes after k rounds starting from 1 infected node follows:
- Round 0: 1 node infected
- Round 1: 2 nodes (1 contacts 1)
- Round k: 2^k nodes infected (before hitting N)
For 99% of 100 nodes (99 nodes infected):
We need: 2^k ≥ 99 k ≥ log₂(99) ≈ 6.62
So 7 rounds are needed.
This is the theoretical ideal (random peer selection with no collisions). In practice, collisions happen (two nodes contact the same peer), so it might take 8-10 rounds. The formula works well when N is large relative to the number of infected nodes.
For the last 1% (node that hasn’t been infected yet): the problem is that late-stage propagation slows down because most peers already have the data, so the chance of contacting an uninfected node decreases. This is the “diminishing returns” of gossip — the first 50% spreads fast, the last 10% takes proportionally longer.
Solution 4
For a shopping cart that must never lose items: choose vector clock siblings.
Here’s why:
- With LWW, concurrent writes from different devices can cause item loss. If the phone adds “milk” and the web browser adds “eggs” at nearly the same time, only one survives.
- Vector clocks preserve both writes as siblings, and the merge logic produces the union: {milk, eggs}.
- No items are lost.
When LWW would be better:
- When the data is not additive (e.g., updating a profile’s “status message” — you want the latest one, not both).
- When application-level merge logic is too complex or error-prone.
- When storage is a concern (siblings can accumulate).
- When you’re willing to trade data integrity for simplicity.
Most production shopping cart systems actually use a hybrid: they use vector clocks for cart items (additive) but LWW for cart-level metadata like “last modified date.”