Distributed & Decentralized Systems Curriculum
Reflection Real Systems · MongoDB

Key Question

MongoDB powers some of the largest applications in the world (including many at enterprise scale). What patterns emerge from real-world production experience?

Deep Dive

Pitfall 1: Neglecting the Election Storm

When a primary fails, all members detect the failure and trigger elections. In early versions (pre-3.2), this caused an “election storm” — every secondary saw the heartbeat timeout and simultaneously called for election. The fix was to stagger the “step-down” and “call-for-election” phases:

// Modern MongoDB: step-down is instantaneous, election has backoff
// But if multiple members have equal priority, they can still conflict

Lesson: Even with priority-based elections, you can get temporary election storms. MongoDB’s fix was to add a random backoff phase, essentially incorporating Raft’s approach alongside its own priority system.

Pattern 1: Oplog Window Monitoring

The single most common production issue is a secondary falling outside the oplog window. This requires a full resync (re-cloning the entire data set from the primary — potentially terabytes). Operators monitor:

// In production, you check:
//   rs.printReplicationInfo() → "oplog size" and "time elapsed"
// Both should leave room for planned maintenance windows

Rule of thumb: Set oplog size to at least 24 hours of write operations. For write-heavy workloads, this could be 50GB+.

Pattern 2: Causal Consistency Chains

In production systems with microservices, causal consistency becomes critical:

Service A writes user profile  → causal chain
Service B reads user profile   → must see A's write
Service C writes based on B    → must see B's view

Without afterClusterTime, each service hop could drop the client to a stale secondary, producing non-causal behavior. With causal consistency, the operation time is passed through the chain:

// A writes
const { opTime } = primary.write('users', profile)
// A sends opTime to B in the response
// B uses opTime to ensure it sees A's write
const { result } = secondary.read('users', query, opTime)

Pitfall 2: Write Scaling Assumptions

Teams migrating from SQL often assume MongoDB scales writes infinitely because “it’s NoSQL.” In reality:

  1. All writes go to one primary (the bottleneck).
  2. Sharding requires careful key selection (hot shards are common).
  3. Document-level locking (WiredTiger) prevents contention, but the single primary still saturates CPU or disk I/O.

MongoDB hits a write ceiling at ~10k–50k ops/sec per primary depending on hardware. Beyond that, you must shard — or switch to Cassandra/DynamoDB.

Pitfall 3: Transaction Overuse

MongoDB 4.0 added multi-document ACID transactions. This is powerful but easy to overuse:

// Overuse: one transaction per user request
session.startTransaction()
await collection1.insertOne(doc1, { session })
await collection2.insertOne(doc2, { session })
await session.commitTransaction()
// Each transaction requires 2-phase commit across the primary
// Throughput: < 100 txn/sec

Lesson: Multi-document transactions have similar throughput limits to MySQL or PostgreSQL. Use document embedding and atomic single-document operations where possible.

Key Takeaways

  • The oplog window is the most critical operational metric to monitor.
  • Causal consistency chains are essential in microservice architectures.
  • MongoDB’s write ceiling is real — plan for sharding or use a different system for write-heavy workloads.
  • Transactions are expensive — prefer single-document operations.

Full Source

View or download the complete implementation: mongodb.ts

Exercises

  1. You’re designing a social media platform. Would you use MongoDB’s replica set or Cassandra for the “likes” counter system? Why?
  2. A secondary’s lastOpTime is lagging behind the primary’s by 30 minutes. What happens when the primary fails? What is the risk?
  3. Design a causal consistency chain for: User creates a post → notification is sent → notification is read. How does MongoDB ensure notification read sees the post creation?

👁️ View Solutions

  1. For a “likes” counter, neither is ideal. Both MongoDB and Cassandra have write skew issues with counters. MongoDB’s $inc with single-document atomicity works well for per-document counters but doesn’t scale to thousands of concurrent increments per second. Cassandra’s distributed counter is better for high-throughput counters but has known issues with under-counts during node failures. Consider Redis for real-time counters and periodically flush to MongoDB/Cassandra for persistence.
  2. If the primary fails, the secondary cannot become primary immediately — it is too far behind. An election is held, but the secondary may be ineligible (MongoDB checks lastOpTime against the optimal commit point). The replica set becomes read-only until the lagging secondary catches up or is removed. The risk: data loss equal to the 30-minute lag.
  3. The chain: (a) Post creation returns opTime1. (b) The notification service receives opTime1 in the request context. (c) When the notification service reads the post (to include content in the notification), it passes afterClusterTime: opTime1. (d) The server ensures its oplog has advanced to at least opTime1 before returning. (e) When the notification is delivered and the user reads it, the same afterClusterTime chain ensures visibility. This guarantees that a user never sees a notification for a post they can’t see.

✏️ Exercises

MongoDB — Exercises

Exercise 1

A MongoDB replica set has 5 members: M1 (priority 5), M2 (priority 3), M3 (priority 3), M4 (arbiter), M5 (priority 1). M1 fails. Who becomes the new primary? What happens when M1 recovers?

Exercise 2

Write a write at w: "majority" is acknowledged by the primary. After the primary fails, can the new primary roll back this write? Explain why or why not.

Exercise 3

You read from a secondary with readPreference: "secondaryPreferred" and no afterClusterTime. The secondary’s oplog is 2 seconds behind the primary. You just wrote data 500ms ago. Will your read see the write? What if you add afterClusterTime?

Exercise 4

Describe one scenario where MongoDB’s single-primary write path is a weakness. Describe one scenario where it is a strength.


👁️ View Solutions

  1. M2 and M3 have equal priority (3). MongoDB uses additional tiebreakers: lastOpTime (oplog recency), then _id (lexicographic). Assuming both are equally caught up, one wins. M4 (arbiter) votes but cannot become primary. When M1 recovers (priority 5 > 3), M1 triggers a new election and takes over. This is “preemptive step-down” — it’s behavior unique to MongoDB’s priority-based system.

  2. A w: "majority" write cannot be rolled back. It was written to the primary AND replicated to a majority of voting members. Even if the primary fails, the new primary (which must be elected by a majority) will have the latest data from the majority, which includes this write. Rollback only happens for w: 1 writes that hadn’t been replicated at the time of failure.

  3. Without afterClusterTime: the secondary may return stale data because it hasn’t applied the oplog entry yet. You wrote 500ms ago, but the secondary is 2s behind — you don’t see your own write. With afterClusterTime: the server blocks until the oplog advances past your opTime. The read blocks for ~1500ms, then returns your write. This is the trade-off: >1s of latency for read-your-writes consistency.

  4. Weakness: High-throughput write system (e.g., time-series ingestion at 100k writes/sec). The single primary becomes a bottleneck and cannot scale. Cassandra or Riak would be better because any node accepts writes. Strength: Banking transaction system where write order and consistency are critical. Single-primary means no write conflicts, no vector clocks, no last-writer-wins ambiguity. The write path is simple and verifiable.