Key Question
How does a MongoDB replica set elect a primary and replicate writes via the oplog?
Deep Dive
The mongodb.ts file implements a simplified MongoDB replica set. Let me walk through the three core mechanisms.
Priority-Based Election
MongoDB’s election differs from textbook Raft in one important way. In Raft, any sufficiently up-to-date server can become leader, and randomized election timeouts decide which one does. In MongoDB, member priority shapes the outcome:
holdElection() {
  const candidates = this.members.filter(m => m.isAlive && m.state !== 'arbiter')
  if (candidates.length === 0) return // no electable member
  // Sort by priority (highest first); break ties by id
  const sorted = [...candidates].sort((a, b) => {
    if (b.priority !== a.priority) return b.priority - a.priority
    return a.id.localeCompare(b.id)
  })
  let winner = sorted[0]
  this.currentTerm++
  // But if the winner is behind on data, skip to the next candidate
  const maxOpTime = Math.max(...this.members.map(m => Number(m.lastOpTime)))
  if (Number(winner.lastOpTime) < maxOpTime - 1 && sorted[1]) {
    winner = sorted[1]
  }
  // Promote whichever candidate was selected
  winner.state = 'primary'
  this.currentPrimary = winner
}
The priority check makes MongoDB’s failover predictable: you know which node will become primary. But it also adds a “preemption” behavior that Raft avoids — if a higher-priority node rejoins, it triggers a new election and takes over.
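The preemption behavior can be sketched in isolation. The following is a minimal, self-contained model, not the code from mongodb.ts: the `SketchMember` shape and the `electPrimary` helper are hypothetical names, operation times are plain numbers rather than `bigint`, and only live members are considered when computing the newest operation time.

```typescript
// Hypothetical member shape for this sketch (not taken from mongodb.ts).
interface SketchMember {
  id: string
  priority: number
  lastOpTime: number
  isAlive: boolean
}

// Pick the highest-priority live member that lags the most up-to-date
// live member by at most `maxLag` operations. Ties are broken by id.
function electPrimary(members: SketchMember[], maxLag = 1): SketchMember | undefined {
  const live = members.filter(m => m.isAlive)
  if (live.length === 0) return undefined
  const newest = Math.max(...live.map(m => m.lastOpTime))
  return live
    .filter(m => newest - m.lastOpTime <= maxLag)
    .sort((a, b) => b.priority - a.priority || a.id.localeCompare(b.id))[0]
}

const sketchMembers: SketchMember[] = [
  { id: 'M1', priority: 3, lastOpTime: 10, isAlive: true },
  { id: 'M2', priority: 2, lastOpTime: 10, isAlive: true },
  { id: 'M3', priority: 1, lastOpTime: 10, isAlive: true },
]

const initial = electPrimary(sketchMembers)       // M1: highest priority
sketchMembers[0].isAlive = false
const afterFailure = electPrimary(sketchMembers)  // M2: best remaining candidate
sketchMembers[1].lastOpTime = 15                  // M2 accepts writes while M1 is down
sketchMembers[2].lastOpTime = 15                  // M3 replicates them
sketchMembers[0].isAlive = true
const whileLagging = electPrimary(sketchMembers)  // still M2: M1 is 5 operations behind
sketchMembers[0].lastOpTime = 15                  // M1 catches up
const afterCatchUp = electPrimary(sketchMembers)  // M1 preempts M2
```

The sequence at the bottom shows why a recovering high-priority node does not take over immediately: it is excluded until its lag is within the allowed bound.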
The Oplog
Every write creates an oplog entry. The entry includes a monotonically increasing timestamp, the operation type, and the document:
interface OplogEntry {
  ts: bigint           // Lamport-style timestamp
  op: 'i' | 'u' | 'd'  // insert, update, delete
  ns: string           // namespace
  o: Record<string, unknown>
}
The primary applies the write AND records it in the oplog atomically:
write(ns: string, doc: Record<string, unknown>): { ok: boolean, opTime?: bigint } {
  // Without a primary there is nowhere to apply the write
  if (!this.currentPrimary) return { ok: false }
  const ts = nextOpTime()
  const oplogEntry: OplogEntry = { ts, op: 'i', ns, o: doc }
  // Primary applies immediately
  this.currentPrimary.lastOpTime = ts
  this.currentPrimary.oplog.push(oplogEntry)
  return { ok: true, opTime: ts }
}
Secondaries “tail” the primary’s oplog by pulling entries with timestamps higher than their lastOpTime:
syncSecondaries() {
  for (const member of this.members) {
    if (member.state === 'secondary' && member.isAlive) {
      const newEntries = this.currentPrimary.oplog.filter(
        e => e.ts > member.lastOpTime
      )
      for (const entry of newEntries) {
        member.oplog.push({ ...entry })
        member.lastOpTime = entry.ts
      }
    }
  }
}
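Copying entries is only half of replication: a secondary also has to apply each entry to its own data. The sketch below is a hypothetical, self-contained illustration of that step and is not taken from mongodb.ts. It assumes every document carries an `_id` field, uses plain `number` timestamps instead of `bigint` for brevity, and treats an update as a field-level merge (creating the document if it is missing).

```typescript
// Simplified document and oplog entry for this sketch (numeric ts, mandatory _id).
type Doc = { _id: string; [field: string]: unknown }

interface SketchOplogEntry {
  ts: number
  op: 'i' | 'u' | 'd'
  ns: string
  o: Doc
}

// One Map of documents per namespace, keyed by _id.
type Store = Map<string, Map<string, Doc>>

// Apply entries newer than `lastApplied` in timestamp order and return the new
// high-water mark. Entries at or below the mark are skipped, so re-delivering
// the same batch is harmless.
function applyOplog(store: Store, entries: SketchOplogEntry[], lastApplied: number): number {
  const pending = entries.filter(e => e.ts > lastApplied).sort((a, b) => a.ts - b.ts)
  for (const entry of pending) {
    let collection = store.get(entry.ns)
    if (!collection) {
      collection = new Map<string, Doc>()
      store.set(entry.ns, collection)
    }
    const id = entry.o._id
    if (entry.op === 'i') {
      collection.set(id, { ...entry.o })
    } else if (entry.op === 'u') {
      const existing = collection.get(id)
      const merged: Doc = existing ? { ...existing, ...entry.o } : { ...entry.o }
      collection.set(id, merged)
    } else {
      collection.delete(id)
    }
    lastApplied = entry.ts
  }
  return lastApplied
}

const store: Store = new Map()
const batch: SketchOplogEntry[] = [
  { ts: 1, op: 'i', ns: 'app.users', o: { _id: 'u1', name: 'Ada' } },
  { ts: 2, op: 'u', ns: 'app.users', o: { _id: 'u1', city: 'London' } },
  { ts: 3, op: 'i', ns: 'app.users', o: { _id: 'u2', name: 'Bob' } },
  { ts: 4, op: 'd', ns: 'app.users', o: { _id: 'u2' } },
]
let mark = applyOplog(store, batch, 0)
mark = applyOplog(store, batch, mark) // replaying the same batch changes nothing
```

Tracking a high-water mark is the same idea as `lastOpTime` in `syncSecondaries()`: it is what lets a secondary resume after an interruption without applying an entry twice.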
Causal Consistency
MongoDB 3.6+ supports causal consistency using afterClusterTime. A client reads with a specified operation time; the server ensures that its oplog has advanced to at least that time before responding:
read(ns: string, query: Record<string, unknown>, afterClusterTime?: bigint) {
  if (afterClusterTime) {
    while (this.getReadableMember().operationTime < afterClusterTime) {
      this.syncSecondaries() // Wait for oplog to catch up
    }
  }
  // Now safe to read: we'll see the write
}
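This is how MongoDB can provide “read your writes” even when reading from a secondary: the client passes the operation time of its last write as afterClusterTime, and the secondary waits until it has applied that write before answering. In a real deployment the guarantee also depends on using majority read and write concerns.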
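On the client side, read-your-writes depends on remembering the operation time of the last write and sending it with later reads. The following sketch models that bookkeeping with hypothetical `ReplicaNode` and `CausalSession` classes; none of these names come from mongodb.ts or from a MongoDB driver, operation times are plain numbers, and the "wait" is replaced by an immediate forced sync.

```typescript
// Hypothetical replica node: a key-value store plus the time of the
// last operation it has applied.
class ReplicaNode {
  operationTime = 0
  data = new Map<string, string>()
}

// Copy the full state from one node to another (stand-in for oplog tailing).
function replicate(from: ReplicaNode, to: ReplicaNode): void {
  to.data = new Map(from.data)
  to.operationTime = from.operationTime
}

// Hypothetical session that tracks the highest operation time it has seen.
class CausalSession {
  private primary: ReplicaNode
  private secondary: ReplicaNode
  private lastSeen = 0

  constructor(primary: ReplicaNode, secondary: ReplicaNode) {
    this.primary = primary
    this.secondary = secondary
  }

  write(key: string, value: string): void {
    this.primary.operationTime += 1
    this.primary.data.set(key, value)
    this.lastSeen = this.primary.operationTime // remember our own write
  }

  // Plain secondary read: returns whatever the secondary currently holds.
  readPlain(key: string): string | undefined {
    return this.secondary.data.get(key)
  }

  // Causal read: the secondary must reach `lastSeen` before answering.
  // A real server would block; this sketch forces a sync instead.
  readCausal(key: string): string | undefined {
    if (this.secondary.operationTime < this.lastSeen) {
      replicate(this.primary, this.secondary)
    }
    return this.secondary.data.get(key)
  }
}

const primaryNode = new ReplicaNode()
const secondaryNode = new ReplicaNode()
const session = new CausalSession(primaryNode, secondaryNode)
session.write('greeting', 'hello')
const plainResult = session.readPlain('greeting')   // undefined: not replicated yet
const causalResult = session.readCausal('greeting') // 'hello'
```

The design point is that the session, not the server, carries the causal token: a different client with no recorded operation time would still be free to read stale data from the same secondary.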
Key Takeaways
- Priority determines election outcomes (not randomness like Raft).
- The oplog is a capped collection that drives replication and recovery.
- Causal consistency (afterClusterTime) bridges the gap between eventual and strong consistency.
Full Source
View or download the complete implementation: mongodb.ts
Exercises
- Run the simulation. What happens when M1 (priority 3) fails and later recovers?
- Modify the simulation to add an arbiter member that has state: 'arbiter'. Does the election behavior change?
- What happens to the oplog when it reaches its capped collection size limit?
Solutions
- When M1 fails, M2 (priority 2) becomes primary. When M1 recovers, it rejoins behind M2, because M2 accepted writes while M1 was down, so it first has to catch up from M2’s oplog. Once it is caught up, its higher priority (3 > 2) lets it win a new election and take over from M2. This preemptive takeover is characteristic of priority-based elections.
- An arbiter participates in elections (votes) but does not store data. It helps reach majority without adding a full replica. In a 2-member + arbiter setup, the arbiter ensures one node gets majority (2/3 votes) without needing a third data-bearing node.
- The oldest oplog entries are overwritten. If a secondary falls further behind than the oplog window, it can no longer catch up from the oplog and must do a full resync. This is why MongoDB operators monitor replSetGetStatus to ensure secondaries stay within the oplog window.
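The oplog-window rule in the last solution can be stated as a small check. This sketch assumes a hypothetical `CappedOplog` class that keeps only the newest `capacity` timestamps (a real oplog is capped by size rather than by entry count) and uses numeric timestamps instead of `bigint`.

```typescript
// Hypothetical capped oplog that stores timestamps only.
class CappedOplog {
  private entries: number[] = []
  private capacity: number
  // Timestamp of the most recently overwritten entry, if any.
  private lastEvicted = -Infinity

  constructor(capacity: number) {
    this.capacity = capacity
  }

  append(ts: number): void {
    this.entries.push(ts)
    if (this.entries.length > this.capacity) {
      // Oldest entry falls off the front, as in a capped collection.
      this.lastEvicted = this.entries.shift() as number
    }
  }

  // A secondary can resume tailing only if every entry newer than its
  // lastOpTime is still retained, i.e. nothing it needs was evicted.
  canResumeFrom(lastOpTime: number): boolean {
    return lastOpTime >= this.lastEvicted
  }
}

const cappedOplog = new CappedOplog(3)
for (let ts = 1; ts <= 5; ts++) cappedOplog.append(ts) // retains 3, 4, 5
const withinWindow = cappedOplog.canResumeFrom(2) // true: next needed entry is 3
const needsResync = !cappedOplog.canResumeFrom(1) // true: entry 2 was overwritten
```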
MongoDB — Exercises
Exercise 1
A MongoDB replica set has 5 members: M1 (priority 5), M2 (priority 3), M3 (priority 3), M4 (arbiter), M5 (priority 1). M1 fails. Who becomes the new primary? What happens when M1 recovers?
Exercise 2
A write at w: "majority" is acknowledged by the primary. After the primary fails, can the new primary roll back this write? Explain why or why not.
Exercise 3
You read from a secondary with readPreference: "secondaryPreferred" and no afterClusterTime. The secondary’s oplog is 2 seconds behind the primary. You just wrote data 500ms ago. Will your read see the write? What if you add afterClusterTime?
Exercise 4
Describe one scenario where MongoDB’s single-primary write path is a weakness. Describe one scenario where it is a strength.
Solutions
- M2 and M3 have equal priority (3), so priority alone does not decide the election. The simulation’s holdElection() breaks the tie by id, so M2 wins as long as it is not lagging on lastOpTime; in a real deployment the outcome also depends on which up-to-date member stands for election first. M4 (arbiter) votes but cannot become primary. When M1 recovers and has caught up, its higher priority (5 > 3) lets it trigger a new election and take over. This preemptive takeover is a consequence of priority-based elections.
- An acknowledged w: "majority" write cannot be rolled back under normal operation. It was written to the primary and replicated to a majority of voting members. A new primary must win votes from a majority, and any two majorities share at least one member, so the winner must be at least as up to date as a member that holds this write. Rollback affects only writes that had not reached a majority at the time of failure, such as w: 1 writes.
- Without afterClusterTime: the secondary may return stale data because it hasn’t applied the oplog entry yet. You wrote 500ms ago, but the secondary is 2s behind, so you don’t see your own write. With afterClusterTime: the server blocks until its oplog reaches your operation time. Assuming the lag stays at 2s, the read blocks for about 1500ms and then returns your write. The trade-off is more than a second of added latency in exchange for read-your-writes consistency.
- Weakness: a high-throughput write system (e.g., time-series ingestion at 100k writes/sec). Within a single replica set, every write goes through the one primary, which becomes the bottleneck. Cassandra or Riak may fit better because any node accepts writes. Strength: a banking transaction system where write order and consistency are critical. A single primary means no write conflicts, no vector clocks, and no last-writer-wins ambiguity. The write path is simple and verifiable.
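The reasoning in solution 2 rests on counting acknowledgements. As a rough illustration, the sketch below uses a hypothetical `Voter` shape and an `isMajorityCommitted` helper (neither appears in mongodb.ts), numeric timestamps, and the assumption that arbiters count toward the size of the voting set but can never acknowledge a write because they hold no data.

```typescript
// Hypothetical voting member for this sketch.
interface Voter {
  id: string
  isArbiter: boolean
  lastOpTime: number
}

// A write at `opTime` is majority-committed once more than half of all
// voting members are data-bearing nodes that have applied it.
function isMajorityCommitted(opTime: number, voters: Voter[]): boolean {
  const needed = Math.floor(voters.length / 2) + 1
  const acks = voters.filter(v => !v.isArbiter && v.lastOpTime >= opTime).length
  return acks >= needed
}

// The five-member set from Exercise 1; the write under test has opTime 7.
const voters: Voter[] = [
  { id: 'M1', isArbiter: false, lastOpTime: 7 }, // primary has applied it
  { id: 'M2', isArbiter: false, lastOpTime: 7 }, // first secondary to replicate
  { id: 'M3', isArbiter: false, lastOpTime: 6 },
  { id: 'M4', isArbiter: true, lastOpTime: 0 },  // arbiter: votes, stores nothing
  { id: 'M5', isArbiter: false, lastOpTime: 6 },
]

const beforeThirdAck = isMajorityCommitted(7, voters) // false: 2 of the 3 needed
voters[2].lastOpTime = 7                              // M3 replicates the write
const afterThirdAck = isMajorityCommitted(7, voters)  // true: 3 of 5 voters hold it
```

Once three of the five voters hold the write, any majority that elects a new primary must include at least one of them, which is why the write survives failover.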