Reflection Real Systems · MongoDB

Key Question

MongoDB is often criticized by distributed systems purists. Where does it deviate from theoretically “correct” design, and why?

Deep Dive

MongoDB has been the subject of intense debate in the distributed systems community. Understanding the criticisms — and MongoDB’s responses — reveals important lessons about engineering trade-offs.

Deviation 1: Rollback After Failover

The most famous MongoDB criticism: when a new primary is elected, writes that were acknowledged by the old primary (but not replicated to secondaries) are rolled back. The client received an acknowledgment, but the write is gone.

// Scenario:
// Primary M1 acknowledges a write (opTime 5).
// M1 crashes before replicating opTime 5.
// M2 becomes primary with opTime 4.
// The write at opTime 5 is ROLLED BACK.

MongoDB stores rolled-back writes in a rollback/ directory so an operator can manually recover them. This is the trade-off for asynchronous replication with single-primary writes.

Compare with Cassandra: a write at CL=ONE can also be lost, but Cassandra is transparent about it. MongoDB was historically less transparent, leading to unpleasant surprises.

Deviation 2: Single-Primary Bottleneck

In MongoDB, all writes go through the primary. This means write throughput is limited by a single node’s capacity. Cassandra and Riak can scale writes linearly with nodes because any node can coordinate any write.

MongoDB’s counterargument: most applications are read-heavy. The single-primary bottleneck matters less in practice, and the simplicity of the single-writer model eliminates distributed concurrency problems.

Deviation 3: Sharding is an Add-On

MongoDB’s sharding (horizontal partitioning) was added after the replica set architecture was built. It uses a mongos router layer that maps collections to shards via a range-based partitioning scheme:

mongos router → config servers (metadata)
              → shard 1 (replica set)
              → shard 2 (replica set)
              → shard 3 (replica set)

This is more complex than Cassandra’s native ring — every operation requires an extra hop to the config servers (or a cached routing table). And unlike Cassandra, adding a shard requires data migration (just like Redis).

Deviation 4: Read Preference Confusion

MongoDB offers five read preferences: primary, primaryPreferred, secondary, secondaryPreferred, nearest. Each has different consistency properties. The default is primary (strong consistency), but many applications use secondaryPreferred for lower latency, inadvertently accepting stale reads.

This is MongoDB’s version of Cassandra’s tunable consistency, but expressed as connection-level configuration rather than per-request levels.

Key Takeaways

Rollback is MongoDB’s dirty secret: acknowledged writes can disappear during failover.
Single-primary write path limits throughput but eliminates conflict resolution.
Sharding was retrofitted and is more complex than native ring-based distribution.
MongoDB’s consistency model is: strongly consistent by default, eventually consistent when you relax read preferences.

Full Source

View or download the complete implementation: mongodb.ts

Exercises

In MongoDB, a client receives acknowledgment for a write (w: “majority”). Does this guarantee the write survives a primary failure?
Why did MongoDB add “causal consistency” in version 3.6? What problem did it solve?
Compare MongoDB’s single-primary model with Cassandra’s leaderless model. Which is better for a global chat application?

👁️ View Solutions

Yes — w: "majority" means the write was replicated to a majority of voting members. Even if the primary fails, the new primary (which has the latest data from the majority) will have the write. w: "majority" provides the same durability as Raft’s commit index. Only w: 1 writes can be rolled back.
Before 3.6, reading from a secondary could miss your own writes (read-your-writes consistency was broken). Causal consistency using afterClusterTime ensures a secondary doesn’t respond to a read until it has caught up to the client’s last write. This made secondaries safe to use without risking stale reads.
MongoDB’s single-primary is simpler per write but requires distributing reads across replicas for scalability. Cassandra’s leaderless model is better for a chat app because: (a) writes can go to any nearby node (lower latency), (b) there is no single point of failure, (c) write throughput scales linearly. However, MongoDB provides stronger consistency guarantees per operation.

✏️ Exercises

MongoDB — Exercises

Exercise 1

A MongoDB replica set has 5 members: M1 (priority 5), M2 (priority 3), M3 (priority 3), M4 (arbiter), M5 (priority 1). M1 fails. Who becomes the new primary? What happens when M1 recovers?

Exercise 2

Write a write at w: "majority" is acknowledged by the primary. After the primary fails, can the new primary roll back this write? Explain why or why not.

Exercise 3

You read from a secondary with readPreference: "secondaryPreferred" and no afterClusterTime. The secondary’s oplog is 2 seconds behind the primary. You just wrote data 500ms ago. Will your read see the write? What if you add afterClusterTime?

Exercise 4

Describe one scenario where MongoDB’s single-primary write path is a weakness. Describe one scenario where it is a strength.

👁️ View Solutions

M2 and M3 have equal priority (3). MongoDB uses additional tiebreakers: lastOpTime (oplog recency), then _id (lexicographic). Assuming both are equally caught up, one wins. M4 (arbiter) votes but cannot become primary. When M1 recovers (priority 5 > 3), M1 triggers a new election and takes over. This is “preemptive step-down” — it’s behavior unique to MongoDB’s priority-based system.
A w: "majority" write cannot be rolled back. It was written to the primary AND replicated to a majority of voting members. Even if the primary fails, the new primary (which must be elected by a majority) will have the latest data from the majority, which includes this write. Rollback only happens for w: 1 writes that hadn’t been replicated at the time of failure.
Without afterClusterTime: the secondary may return stale data because it hasn’t applied the oplog entry yet. You wrote 500ms ago, but the secondary is 2s behind — you don’t see your own write. With afterClusterTime: the server blocks until the oplog advances past your opTime. The read blocks for ~1500ms, then returns your write. This is the trade-off: >1s of latency for read-your-writes consistency.
Weakness: High-throughput write system (e.g., time-series ingestion at 100k writes/sec). The single primary becomes a bottleneck and cannot scale. Cassandra or Riak would be better because any node accepts writes. Strength: Banking transaction system where write order and consistency are critical. Single-primary means no write conflicts, no vector clocks, no last-writer-wins ambiguity. The write path is simple and verifiable.