Distributed & Decentralized Systems Curriculum
Reflection Real Systems · MongoDB

Key Question

How does MongoDB combine a familiar document model with Raft-inspired replication to become the most popular non-relational database?

Deep Dive

MongoDB is the most widely deployed non-relational database in production. Its architecture is a study in pragmatic engineering: it takes familiar concepts from relational databases (primary-secondary replication, write-ahead logs) and adapts them for distributed operation.

Replica Sets

A MongoDB replica set is a group of mongod processes that maintain the same data set. It has exactly one primary (at any given time) and zero or more secondaries. This is fundamentally different from Dynamo-style leaderless replication (Cassandra, Riak) — MongoDB uses a single-writer model.

Client → Primary (writes) → Oplog → Secondary 1 (replicates)
                                  → Secondary 2 (replicates)
                                  → Arbiter (votes only, no data)

The single-writer model is simpler: there are no write conflicts to resolve (no vector clocks, no LWW ambiguity). Every write goes through the primary, and the order is totally ordered via the oplog.

Priority-Based Elections

Unlike Raft’s randomized election timeouts (studied in Module 3), MongoDB elections are priority-based. Each member has a priority (1–1000). The member with the highest priority that can communicate with a majority initiates the election. This means:

  • You can ensure specific nodes are primary (e.g., prefer the node in us-east-1 over us-west-2).
  • Failover is deterministic, not random.
  • The primary automatically steps down if a higher-priority member comes online.
// MongoDB: higher priority wins
const sorted = candidates.sort((a, b) => b.priority - a.priority)
const winner = sorted[0]

// Raft: random timeout, first to finish wins
// (no priority concept)

The Oplog

MongoDB’s replication is driven by the oplog (operations log), a capped collection stored on each replica set member. When the primary executes a write, it records an oplog entry. Secondaries tail the primary’s oplog and replay operations asynchronously.

The oplog is MongoDB’s equivalent of a write-ahead log, but it serves double duty: it drives replication AND enables point-in-time recovery.

Write Concern

MongoDB’s write concern is equivalent to Cassandra’s consistency levels but from a single-primary perspective:

  • w: 1 — acknowledged by primary only (fastest, weakest)
  • w: "majority" — acknowledged by majority of voting members
  • w: "all" — acknowledged by every data-bearing member

Key Takeaways

  • MongoDB uses a single-primary, leader-based replication model (like Raft).
  • Priority-based elections allow operators to control which node becomes primary.
  • The oplog is the backbone of both replication and point-in-time recovery.
  • MongoDB is PC/EC in PACELC terms — it prefers consistency during partitions.

Full Source

View or download the complete implementation: mongodb.ts

Exercises

  1. Contrast MongoDB’s priority-based election with Raft’s randomized timeout approach. When would each be preferable?
  2. A MongoDB replica set has 3 members with priorities 5, 5, and 1. The primary (priority 5) fails. What happens?
  3. What is the purpose of an arbiter in a MongoDB replica set?

👁️ View Solutions

  1. Priority-based elections are preferable when you want specific nodes to be primary (e.g., preferring a node in the primary data center). Raft’s randomized approach is preferable when all nodes are equal. MongoDB’s approach gives operators control; Raft’s approach is simpler and avoids the “flapping” problem (primary keeps switching as higher-priority nodes come and go).
  2. The two remaining members (priority 5 and priority 1) hold an election. The priority 5 node wins because it has higher priority. If the priorities were equal (5 and 5), the election would use other factors (oplog recency, uptime). MongoDB adds a “tiebreaker” to prevent split votes.
  3. An arbiter is a voting member that does not store data. It exists only to participate in elections and provide a majority vote. In a 2-node replica set, an arbiter prevents a “split-brain” scenario while avoiding the cost of a full third data-bearing node.

✏️ Exercises

MongoDB — Exercises

Exercise 1

A MongoDB replica set has 5 members: M1 (priority 5), M2 (priority 3), M3 (priority 3), M4 (arbiter), M5 (priority 1). M1 fails. Who becomes the new primary? What happens when M1 recovers?

Exercise 2

Write a write at w: "majority" is acknowledged by the primary. After the primary fails, can the new primary roll back this write? Explain why or why not.

Exercise 3

You read from a secondary with readPreference: "secondaryPreferred" and no afterClusterTime. The secondary’s oplog is 2 seconds behind the primary. You just wrote data 500ms ago. Will your read see the write? What if you add afterClusterTime?

Exercise 4

Describe one scenario where MongoDB’s single-primary write path is a weakness. Describe one scenario where it is a strength.


👁️ View Solutions

  1. M2 and M3 have equal priority (3). MongoDB uses additional tiebreakers: lastOpTime (oplog recency), then _id (lexicographic). Assuming both are equally caught up, one wins. M4 (arbiter) votes but cannot become primary. When M1 recovers (priority 5 > 3), M1 triggers a new election and takes over. This is “preemptive step-down” — it’s behavior unique to MongoDB’s priority-based system.

  2. A w: "majority" write cannot be rolled back. It was written to the primary AND replicated to a majority of voting members. Even if the primary fails, the new primary (which must be elected by a majority) will have the latest data from the majority, which includes this write. Rollback only happens for w: 1 writes that hadn’t been replicated at the time of failure.

  3. Without afterClusterTime: the secondary may return stale data because it hasn’t applied the oplog entry yet. You wrote 500ms ago, but the secondary is 2s behind — you don’t see your own write. With afterClusterTime: the server blocks until the oplog advances past your opTime. The read blocks for ~1500ms, then returns your write. This is the trade-off: >1s of latency for read-your-writes consistency.

  4. Weakness: High-throughput write system (e.g., time-series ingestion at 100k writes/sec). The single primary becomes a bottleneck and cannot scale. Cassandra or Riak would be better because any node accepts writes. Strength: Banking transaction system where write order and consistency are critical. Single-primary means no write conflicts, no vector clocks, no last-writer-wins ambiguity. The write path is simple and verifiable.