Key Question
MongoDB powers some of the largest applications in the world (including many at enterprise scale). What patterns emerge from real-world production experience?
Deep Dive
Pitfall 1: Neglecting the Election Storm
When a primary fails, all members detect the failure and trigger elections. In early versions (pre-3.2), this caused an “election storm” — every secondary saw the heartbeat timeout and simultaneously called for election. The fix was to stagger the “step-down” and “call-for-election” phases:
// Modern MongoDB: step-down is instantaneous, election has backoff
// But if multiple members have equal priority, they can still conflict
Lesson: Even with priority-based elections, you can get temporary election storms. MongoDB’s fix was to add a random backoff phase, essentially incorporating Raft’s approach alongside its own priority system.
Pattern 1: Oplog Window Monitoring
The single most common production issue is a secondary falling outside the oplog window. This requires a full resync (re-cloning the entire data set from the primary — potentially terabytes). Operators monitor:
// In production, you check:
// rs.printReplicationInfo() → "oplog size" and "time elapsed"
// Both should leave room for planned maintenance windows
Rule of thumb: Set oplog size to at least 24 hours of write operations. For write-heavy workloads, this could be 50GB+.
Pattern 2: Causal Consistency Chains
In production systems with microservices, causal consistency becomes critical:
Service A writes user profile → causal chain
Service B reads user profile → must see A's write
Service C writes based on B → must see B's view
Without afterClusterTime, each service hop could drop the client to a stale secondary, producing non-causal behavior. With causal consistency, the operation time is passed through the chain:
// A writes
const { opTime } = primary.write('users', profile)
// A sends opTime to B in the response
// B uses opTime to ensure it sees A's write
const { result } = secondary.read('users', query, opTime)
Pitfall 2: Write Scaling Assumptions
Teams migrating from SQL often assume MongoDB scales writes infinitely because “it’s NoSQL.” In reality:
- All writes go to one primary (the bottleneck).
- Sharding requires careful key selection (hot shards are common).
- Document-level locking (WiredTiger) prevents contention, but the single primary still saturates CPU or disk I/O.
MongoDB hits a write ceiling at ~10k–50k ops/sec per primary depending on hardware. Beyond that, you must shard — or switch to Cassandra/DynamoDB.
Pitfall 3: Transaction Overuse
MongoDB 4.0 added multi-document ACID transactions. This is powerful but easy to overuse:
// Overuse: one transaction per user request
session.startTransaction()
await collection1.insertOne(doc1, { session })
await collection2.insertOne(doc2, { session })
await session.commitTransaction()
// Each transaction requires 2-phase commit across the primary
// Throughput: < 100 txn/sec
Lesson: Multi-document transactions have similar throughput limits to MySQL or PostgreSQL. Use document embedding and atomic single-document operations where possible.
Key Takeaways
- The oplog window is the most critical operational metric to monitor.
- Causal consistency chains are essential in microservice architectures.
- MongoDB’s write ceiling is real — plan for sharding or use a different system for write-heavy workloads.
- Transactions are expensive — prefer single-document operations.
Full Source
View or download the complete implementation: mongodb.ts
Exercises
- You’re designing a social media platform. Would you use MongoDB’s replica set or Cassandra for the “likes” counter system? Why?
- A secondary’s
lastOpTimeis lagging behind the primary’s by 30 minutes. What happens when the primary fails? What is the risk? - Design a causal consistency chain for: User creates a post → notification is sent → notification is read. How does MongoDB ensure notification read sees the post creation?
👁️ View Solutions
- For a “likes” counter, neither is ideal. Both MongoDB and Cassandra have write skew issues with counters. MongoDB’s
$incwith single-document atomicity works well for per-document counters but doesn’t scale to thousands of concurrent increments per second. Cassandra’s distributed counter is better for high-throughput counters but has known issues with under-counts during node failures. Consider Redis for real-time counters and periodically flush to MongoDB/Cassandra for persistence. - If the primary fails, the secondary cannot become primary immediately — it is too far behind. An election is held, but the secondary may be ineligible (MongoDB checks
lastOpTimeagainst the optimal commit point). The replica set becomes read-only until the lagging secondary catches up or is removed. The risk: data loss equal to the 30-minute lag. - The chain: (a) Post creation returns
opTime1. (b) The notification service receivesopTime1in the request context. (c) When the notification service reads the post (to include content in the notification), it passesafterClusterTime: opTime1. (d) The server ensures its oplog has advanced to at leastopTime1before returning. (e) When the notification is delivered and the user reads it, the sameafterClusterTimechain ensures visibility. This guarantees that a user never sees a notification for a post they can’t see.
✏️ Exercises
MongoDB — Exercises
Exercise 1
A MongoDB replica set has 5 members: M1 (priority 5), M2 (priority 3), M3 (priority 3), M4 (arbiter), M5 (priority 1). M1 fails. Who becomes the new primary? What happens when M1 recovers?
Exercise 2
Write a write at w: "majority" is acknowledged by the primary. After the primary fails, can the new primary roll back this write? Explain why or why not.
Exercise 3
You read from a secondary with readPreference: "secondaryPreferred" and no afterClusterTime. The secondary’s oplog is 2 seconds behind the primary. You just wrote data 500ms ago. Will your read see the write? What if you add afterClusterTime?
Exercise 4
Describe one scenario where MongoDB’s single-primary write path is a weakness. Describe one scenario where it is a strength.
👁️ View Solutions
-
M2 and M3 have equal priority (3). MongoDB uses additional tiebreakers:
lastOpTime(oplog recency), then_id(lexicographic). Assuming both are equally caught up, one wins. M4 (arbiter) votes but cannot become primary. When M1 recovers (priority 5 > 3), M1 triggers a new election and takes over. This is “preemptive step-down” — it’s behavior unique to MongoDB’s priority-based system. -
A
w: "majority"write cannot be rolled back. It was written to the primary AND replicated to a majority of voting members. Even if the primary fails, the new primary (which must be elected by a majority) will have the latest data from the majority, which includes this write. Rollback only happens forw: 1writes that hadn’t been replicated at the time of failure. -
Without
afterClusterTime: the secondary may return stale data because it hasn’t applied the oplog entry yet. You wrote 500ms ago, but the secondary is 2s behind — you don’t see your own write. WithafterClusterTime: the server blocks until the oplog advances past your opTime. The read blocks for ~1500ms, then returns your write. This is the trade-off: >1s of latency for read-your-writes consistency. -
Weakness: High-throughput write system (e.g., time-series ingestion at 100k writes/sec). The single primary becomes a bottleneck and cannot scale. Cassandra or Riak would be better because any node accepts writes. Strength: Banking transaction system where write order and consistency are critical. Single-primary means no write conflicts, no vector clocks, no last-writer-wins ambiguity. The write path is simple and verifiable.