Reflection Real Systems · MongoDB

Key Question

How does a MongoDB replica set elect a primary and replicate writes via the oplog?

Deep Dive

The mongodb.ts file implements a simplified MongoDB replica set. Let me walk through the three core mechanisms.

Priority-Based Election

MongoDB’s election is fundamentally different from Raft’s. In Raft, every server has an equal chance of becoming leader (randomized timeout). In MongoDB, priority determines the outcome:

holdElection() {
  const candidates = this.members.filter(m => m.isAlive && m.state !== 'arbiter')

  // Sort by priority (highest first)
  const sorted = [...candidates].sort((a, b) => {
    if (b.priority !== a.priority) return b.priority - a.priority
    return a.id.localeCompare(b.id)
  })

  const winner = sorted[0]
  this.currentTerm++

  // But if the winner is behind on data, skip to the next candidate
  const maxOpTime = Math.max(...this.members.map(m => Number(m.lastOpTime)))
  if (Number(winner.lastOpTime) < maxOpTime - 1) {
    const next = sorted[1]
    if (next) {
      next.state = 'primary'
      this.currentPrimary = next
    }
  }
}

The priority check makes MongoDB’s failover predictable: you know which node will become primary. But it also adds a “preemption” behavior that Raft avoids — if a higher-priority node rejoins, it triggers a new election and takes over.

The Oplog

Every write creates an oplog entry. The entry includes a monotonically increasing timestamp, the operation type, and the document:

interface OplogEntry {
  ts: bigint         // Lamport-style timestamp
  op: 'i' | 'u' | 'd' // insert, update, delete
  ns: string         // namespace
  o: Record<string, unknown>
}

The primary applies the write AND records it in the oplog atomically:

write(ns: string, doc: Record<string, unknown>): { ok: boolean, opTime?: bigint } {
  const ts = nextOpTime()
  const oplogEntry: OplogEntry = { ts, op: 'i', ns, o: doc }

  // Primary applies immediately
  this.currentPrimary.lastOpTime = ts
  this.currentPrimary.oplog.push(oplogEntry)

  return { ok: true, opTime: ts }
}

Secondaries “tail” the primary’s oplog by pulling entries with timestamps higher than their lastOpTime:

syncSecondaries() {
  for (const member of this.members) {
    if (member.state === 'secondary' && member.isAlive) {
      const newEntries = this.currentPrimary.oplog.filter(
        e => e.ts > member.lastOpTime
      )
      for (const entry of newEntries) {
        member.oplog.push({ ...entry })
        member.lastOpTime = entry.ts
      }
    }
  }
}

Causal Consistency

MongoDB 3.6+ supports causal consistency using afterClusterTime. A client reads with a specified operation time; the server ensures that its oplog has advanced to at least that time before responding:

read(ns: string, query: Record<string, unknown>, afterClusterTime?: bigint) {
  if (afterClusterTime) {
    while (this.getReadableMember().operationTime < afterClusterTime) {
      this.syncSecondaries()  // Wait for oplog to catch up
    }
  }
  // Now safe to read — we'll see the write
}

This is how MongoDB guarantees “read your writes” even when reading from a secondary.

Key Takeaways

Priority determines election outcomes (not randomness like Raft).
The oplog is a capped collection that drives replication and recovery.
Causal consistency (afterClusterTime) bridges the gap between eventual and strong consistency.

Full Source

View or download the complete implementation: mongodb.ts

Exercises

Run the simulation. What happens when M1 (priority 3) fails and later recovers?
Modify the simulation to add an arbiter member that has state: 'arbiter'. Does the election behavior change?
What happens to the oplog when it reaches its capped collection size limit?

👁️ View Solutions

When M1 fails, M2 (priority 2) becomes primary. When M1 recovers, it has higher priority (3 > 2) and has the most recent opTime (it was the original primary). M1 triggers a new election and takes over from M2. This is “preemptive failover” — unique to priority-based systems.
An arbiter participates in elections (votes) but does not store data. It helps reach majority without adding a full replica. In a 2-member + arbiter setup, the arbiter ensures one node gets majority (2/3 votes) without needing a third data-bearing node.
The oldest oplog entries are overwritten. If a secondary is too far behind (more than the oplog window), it must do a full resync from the primary. This is why MongoDB operators monitor replSetGetStatus to ensure secondaries stay within the oplog window.

✏️ Exercises

MongoDB — Exercises

Exercise 1

A MongoDB replica set has 5 members: M1 (priority 5), M2 (priority 3), M3 (priority 3), M4 (arbiter), M5 (priority 1). M1 fails. Who becomes the new primary? What happens when M1 recovers?

Exercise 2

Write a write at w: "majority" is acknowledged by the primary. After the primary fails, can the new primary roll back this write? Explain why or why not.

Exercise 3

You read from a secondary with readPreference: "secondaryPreferred" and no afterClusterTime. The secondary’s oplog is 2 seconds behind the primary. You just wrote data 500ms ago. Will your read see the write? What if you add afterClusterTime?

Exercise 4

Describe one scenario where MongoDB’s single-primary write path is a weakness. Describe one scenario where it is a strength.

👁️ View Solutions

M2 and M3 have equal priority (3). MongoDB uses additional tiebreakers: lastOpTime (oplog recency), then _id (lexicographic). Assuming both are equally caught up, one wins. M4 (arbiter) votes but cannot become primary. When M1 recovers (priority 5 > 3), M1 triggers a new election and takes over. This is “preemptive step-down” — it’s behavior unique to MongoDB’s priority-based system.
A w: "majority" write cannot be rolled back. It was written to the primary AND replicated to a majority of voting members. Even if the primary fails, the new primary (which must be elected by a majority) will have the latest data from the majority, which includes this write. Rollback only happens for w: 1 writes that hadn’t been replicated at the time of failure.
Without afterClusterTime: the secondary may return stale data because it hasn’t applied the oplog entry yet. You wrote 500ms ago, but the secondary is 2s behind — you don’t see your own write. With afterClusterTime: the server blocks until the oplog advances past your opTime. The read blocks for ~1500ms, then returns your write. This is the trade-off: >1s of latency for read-your-writes consistency.
Weakness: High-throughput write system (e.g., time-series ingestion at 100k writes/sec). The single primary becomes a bottleneck and cannot scale. Cassandra or Riak would be better because any node accepts writes. Strength: Banking transaction system where write order and consistency are critical. Single-primary means no write conflicts, no vector clocks, no last-writer-wins ambiguity. The write path is simple and verifiable.