Consistency Trade-offs · PACELC

Key Question

How do real databases like Dynamo, Cassandra, Spanner, and MongoDB fit into PACELC?

Deep Dive

Let’s map major distributed databases to their PACELC classifications and understand WHY each system lands where it does.

Dynamo: PA/EL

Amazon’s Dynamo (the original, not DynamoDB) is the archetypal PA/EL system.

PA: During a partition, Dynamo prefers availability. Writes are accepted on any reachable node. This was the explicit design goal: “always writable” for Amazon’s shopping cart.
EL: During normal operation, Dynamo prefers low latency. The coordinator sends each write to the key’s preference list (the N nodes responsible for it) and waits for only W acknowledgments; a read waits for only R responses. With small R and W, no request waits on every replica.

Why? Amazon’s core requirement: the shopping cart must always accept items. Losing a sale because the network is slow is worse than temporarily showing an incorrect cart.

Dynamo architecture (simplified):
  Request → Coordinator
            ├── Write to preference list (N nodes, wait for W acks)
            ├── Read from R nodes (small R = fast)
            └── Vector clocks for conflict detection
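
To make the conflict-detection step concrete, here is a minimal sketch of vector clock comparison. The dictionary representation and the function names (`increment`, `descends`, `compare`) are assumptions chosen for illustration, not Dynamo’s actual code.

```python
from typing import Dict

# Hypothetical representation: node id -> number of writes coordinated by that node.
VectorClock = Dict[str, int]


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock with this node's counter bumped by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: VectorClock, b: VectorClock) -> bool:
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify two versions as equal, ordered, or concurrent."""
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "a supersedes b"
    if b_ge_a:
        return "b supersedes a"
    # Neither version contains the other: both are kept as siblings
    # and the application (e.g. the shopping cart) must reconcile them.
    return "concurrent"
```

Two writes that both start from the same version but are coordinated by different nodes come out as "concurrent", which is the case the client has to merge.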

Cassandra: PA/EL (Tunable)

Cassandra follows Dynamo’s design but adds tunable consistency levels.

PA: During a partition, each side continues to operate. Writes succeed on reachable nodes.
EL: By default, reads use ONE (lowest latency). Writes are also fast (written to commit log and memtable, acknowledged before full replication).

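BUT: Cassandra can be configured to behave much more like PC/EC by choosing read and write levels whose replica sets must overlap (R + W > RF), for example QUORUM for both reads and writes. Mixing ONE reads with QUORUM writes at RF=3 does not overlap (1 + 2 = 3), so such reads can still be stale. The level is set per request, so one cluster can serve both kinds of traffic.

Cassandra consistency spectrum (latencies are illustrative):
  CL=ONE:    Latency ~ms,  stale possible (PA/EL behavior)
  CL=QUORUM: Latency ~5-15ms, strong when used for reads and writes (PC/EC-like behavior)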
  CL=ALL:    Latency ~10-50ms, strongest (decreased availability)
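
Whether a pair of consistency levels can return stale data comes down to whether the read and write replica sets are forced to overlap. The sketch below checks that arithmetic for the three levels listed above; the helper names are hypothetical and it ignores datacenter-aware levels such as LOCAL_QUORUM.

```python
def replicas_required(level: str, rf: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return rf // 2 + 1
    if level == "ALL":
        return rf
    raise ValueError(f"unsupported level in this sketch: {level}")


def read_sees_latest_write(read_level: str, write_level: str, rf: int) -> bool:
    """True if every read replica set must intersect every write replica set."""
    r = replicas_required(read_level, rf)
    w = replicas_required(write_level, rf)
    return r + w > rf
```

With a replication factor of 3, QUORUM/QUORUM and ONE/ALL overlap, while ONE/ONE and ONE/QUORUM do not. Overlap is what lets a read observe the latest acknowledged write; conflicting concurrent writes are still resolved by timestamp.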

Spanner: PC/EC

Google Spanner chooses consistency in both regimes.

PC: During a partition, Spanner relies on Paxos. A minority partition cannot elect or keep a leader, so it becomes unavailable for writes, and strong reads there block or time out. Consistency is preserved.
EC: During normal operation, Spanner uses TrueTime (GPS + atomic clocks) to assign globally meaningful timestamps. Reads and writes are externally consistent by default (stale reads are opt-in), and commits incur coordination latency.

Why? Spanner is designed for globally consistent data (e.g., Google Ads, Google Play). Inconsistent financial data is unacceptable.

Spanner's TrueTime:
  ┌─────────────────┐
  │ Each datacenter │──── GPS receivers + atomic clocks
  │ has a time      │──── TT.now() → [earliest, latest] ≈ [t - ε, t + ε]
  │ daemon          │
  └─────────────────┘
  
  Commit wait: block until TT.after(commit_timestamp) is true
  Ensures linearizability across continents
  Cost: commit wait of roughly 2ε per read-write transaction (ε was typically 1-7ms in the original paper)
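
The commit-wait rule can be illustrated with a toy simulation. This is a sketch under stated assumptions: `SimulatedTrueTime` is a hypothetical stand-in whose clock only moves when `advance` is called, time is in whole milliseconds, and the commit timestamp is taken as the upper bound of the current interval.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TTInterval:
    earliest: int  # milliseconds
    latest: int    # milliseconds


class SimulatedTrueTime:
    """Toy clock with a fixed uncertainty bound epsilon (not Google's API)."""

    def __init__(self, epsilon_ms: int) -> None:
        self.epsilon_ms = epsilon_ms
        self._true_ms = 0  # the "real" time, unknown to callers

    def advance(self, ms: int) -> None:
        self._true_ms += ms

    def now(self) -> TTInterval:
        return TTInterval(self._true_ms - self.epsilon_ms,
                          self._true_ms + self.epsilon_ms)

    def after(self, t_ms: int) -> bool:
        """True only once t_ms has definitely passed on every clock."""
        return self.now().earliest > t_ms


def commit(tt: SimulatedTrueTime) -> Tuple[int, int]:
    """Choose a commit timestamp, then wait until it is definitely in the past.

    Returns (commit_timestamp_ms, waited_ms).
    """
    commit_ts = tt.now().latest
    waited = 0
    while not tt.after(commit_ts):
        tt.advance(1)  # simulate sleeping for one millisecond
        waited += 1
    return commit_ts, waited
```

The wait works out to about twice the uncertainty bound, which is why shrinking ε with better clocks directly reduces commit latency.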

MongoDB (with replica sets): PC/EC

MongoDB uses a primary-secondary replication model with Raft-like leader election.

PC: During a partition, the majority side elects a new primary if the current primary is isolated. A primary stranded in the minority steps down to secondary, so that side can still serve secondary reads but rejects writes.
EC: In normal operation, writes go to the primary and are replicated to secondaries. Reads go to the primary by default. Reads from secondaries are possible but risk staleness. For the strongest guarantees, pair majority write concern with a “majority” or “linearizable” read concern; weaker settings can observe writes that are later rolled back.
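
The majority rule behind the PC behavior can be sketched in a few lines. The function names are hypothetical, and the sketch assumes every member is a voting member.

```python
from typing import List


def can_elect_primary(reachable_voters: int, total_voters: int) -> bool:
    """A side of a partition can elect a primary only with a strict majority."""
    return reachable_voters > total_voters // 2


def writable_sides(partition_sizes: List[int]) -> List[bool]:
    """For each side of a partition, report whether it can accept writes."""
    total = sum(partition_sizes)
    return [can_elect_primary(size, total) for size in partition_sizes]
```

For a five-member replica set split 3/2, only the three-member side can accept writes. For a four-member set split 2/2, neither side can, which is one reason odd member counts are preferred.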

MongoDB’s default read preference is primary (consistent). You can relax it to primaryPreferred, secondary, or nearest (lower latency, but possible staleness) — moving toward PA/EL for reads.

Riak: PA/EL

Riak is another Dynamo-inspired system.

PA: Available during partitions.
EL: Low-latency operation by default.

Riak uniquely emphasizes CRDTs (Conflict-free Replicated Data Types) to minimize conflicts during partition healing.
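
To show why CRDTs make partition healing painless, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. It is an illustration of the idea, not Riak’s implementation; the class and method names are assumptions.

```python
from typing import Dict, Optional


class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    def __init__(self, node_id: str, counts: Optional[Dict[str, int]] = None) -> None:
        self.node_id = node_id
        self.counts: Dict[str, int] = dict(counts or {})

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Take the per-node maximum. Merging is commutative, associative,
        and idempotent, so replicas converge in any merge order."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

Two replicas that accepted increments independently during a partition reach the same total once they exchange state, with no conflict for the application to resolve.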

Comparison Table

System	PACELC	CAP	Consistency Level	Partition Behavior
Dynamo	PA/EL	AP	Eventual	Accepts all writes
Cassandra	PA/EL	AP	Tunable (ONE→ALL)	Accepts writes (tunable)
Riak	PA/EL	AP	Eventual + CRDTs	Accepts all writes
DynamoDB	PA/EL	AP	Eventual / Strong	Accepts writes on reachable
Spanner	PC/EC	CP	Strict serializable	Minority unavailable
ZooKeeper	PC/EC	CP	Linearizable writes	Minority unavailable
etcd	PC/EC	CP	Linearizable	Minority unavailable
MongoDB	PC/EC	CP	Strong (primary)	Minority read-only

What PACELC Doesn’t Capture

PACELC is a high-level classification. It doesn’t capture:

Tunability: Cassandra can move between PA/EL and PC/EC based on request-level settings.
Cross-DC behavior: DynamoDB’s global tables behave differently than single-region tables.
Read vs. Write asymmetry: A system might be PC for writes (strong) but EL for reads (fast).

Despite these limitations, PACELC is strictly more informative than CAP for understanding real systems.

Check Your Understanding

Why is Spanner classified as PC/EC rather than PC/EL?
Can a DynamoDB Global Table be PA/EL in normal operation? Why?
How does Cassandra’s tunable consistency challenge the fixed PACELC classification?
Why is ZooKeeper’s PC/EC necessary for its role as a coordination service?

The “So What?”

Classifying a system with PACELC tells you, at a high level, how it will behave in good times and bad. If you’re choosing between Spanner (PC/EC) and Cassandra (PA/EL), you’re not just choosing partition behavior. You’re also choosing: “Do I pay the latency tax during normal operation for strong consistency?” If your application needs strong consistency always, you pay the latency cost (Spanner, ZooKeeper). If you can tolerate inconsistency, you get low latency (Cassandra, DynamoDB). PACELC makes this trade-off explicit.

PACELC — Exercises

Exercise 1

Classify a single-node PostgreSQL database (no replication) using PACELC. Explain your reasoning for each part of the acronym.

Exercise 2

What PACELC classification does a system built on the Raft consensus algorithm provide? Explain by answering both the “P” and “E” parts of the acronym.

Exercise 3

Is a PC/EL system theoretically possible? Explain why or why not. What practical challenges would such a system face?

Exercise 4

You’re building a real-time bidding system for online ads. Bids must be processed in under 50ms or they’re rejected. During a partition, you can afford to lose some bid data (within reason) but cannot afford to drop incoming requests. Which PACELC class do you choose? Which database(s) would you consider? Justify your answer.

PACELC — Solutions

Solution 1

Single-node PostgreSQL: N/A (PACELC doesn’t meaningfully apply)

PACELC is designed for distributed (multi-node) systems. A single-node database:

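P: There are no replicas, so there is nothing for a network partition to separate. Clients can still lose their connection to the node, but that is a plain outage, not a partition between replicas. The “P” condition never triggers.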
E (Else): There’s only one copy of the data, so there’s no need to coordinate with other nodes. Reads and writes are trivially consistent and low latency.

Some might argue it’s trivially PC/EC: the single node is “consistent” (one copy = always consistent) and “consistent” in normal operation (same reasoning). But this misses the point — PACELC describes how multi-node systems behave when they need to copy data between nodes. A single-node system doesn’t face these trade-offs.

If forced to classify: it’s a degenerate case. It is both “consistent” (one copy) and “low latency” (no coordination), but it lacks fault tolerance — if the node dies, everything dies. PACELC doesn’t capture this availability/durability concern.

Solution 2

Raft provides PC/EC.

P — Partition (During a partition):

Raft maintains a leader. If the leader is in the majority partition, it continues to operate. The minority partition cannot elect a leader (needs majority), so it cannot accept writes. The minority becomes unavailable for writes.

Raft sacrifices Availability during a partition → PC (Consistency is chosen).

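If the leader is in the MINORITY partition (isolated from the majority), it can no longer commit anything: entries it appends cannot reach a majority, so clients never receive an acknowledgment for them. The majority elects a new leader in a higher term. When the partition heals, the old leader steps down and its uncommitted entries are overwritten; no acknowledged write is lost. From the perspective of the system as a whole: the majority partition remains consistent, and the minority cannot complete writes. This is CP behavior.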

E — Else (Normal operation):

Raft is designed for strong consistency. All reads and writes go through the leader. Write requests are replicated to a majority before being acknowledged. Reads can also go through the leader (guaranteed consistent) or use read-index/lease mechanisms (consistent with some optimization).

Raft provides EC (strong consistency) during normal operation, at the cost of some latency: each write needs a round trip from the leader to a majority of followers (leader → followers → leader) before it is acknowledged, on top of the client-to-leader hop.
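
The “replicated to a majority before being acknowledged” rule can be sketched as a commit-index calculation. This is a simplified illustration: the function name is hypothetical, and real Raft additionally requires the entry to belong to the leader’s current term, which is omitted here.

```python
from typing import List


def commit_index(match_index: List[int]) -> int:
    """Highest log index stored on a majority of the cluster.

    match_index holds, for every node including the leader itself,
    the last log index known to be stored on that node.
    """
    ordered = sorted(match_index, reverse=True)
    majority = len(match_index) // 2 + 1
    # The majority-th largest value is present on at least `majority` nodes.
    return ordered[majority - 1]
```

In a five-node cluster where the nodes hold entries up to indexes 7, 7, 5, 3, and 2, the leader can acknowledge everything up to index 5; entries 6 and 7 must wait for one more follower.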

Raft provides PC/EC. This is the same classification as etcd and Consul (both built on Raft) and ZooKeeper (built on Zab, a similar leader-based protocol).

Solution 3

PC/EL is theoretically possible, although it looks contradictory at first. Here’s why:

PC means: during a partition, the system chooses Consistency over Availability. It refuses operations rather than let replicas diverge beyond its normal guarantees.
EL means: in normal operation (no partition), the system chooses Latency over Consistency. It uses fast operations, such as asynchronous replication and reads from a nearby replica, that might return stale data.

The apparent contradiction:

If you’re willing to accept stale data in normal operation (EL), why would you give up availability for consistency during a partition (PC)? The resolution is that the “C” in PC is relative to the system’s baseline guarantee, not to linearizability. A PC/EL system never promises more than its baseline, but it refuses to drop BELOW that baseline when the network breaks. It reduces availability instead.

A concrete example:

Yahoo’s PNUTS is the PC/EL example given in Abadi’s original PACELC paper.
In normal operation, each record has a master replica. Writes go to the master and are replicated asynchronously, and reads may be served from a nearby, possibly stale replica (EL).
During a partition, writes to a record whose master is unreachable are rejected instead of being accepted elsewhere, so every replica still sees that record’s updates in the same order (PC).

Practical challenges:

Weak payoff: clients must already tolerate stale reads, so the unavailability during a partition buys only the baseline guarantee, not strong consistency.
Uneven availability: whether a write succeeds depends on which side of the partition holds the record’s master, which is hard to explain to users.
Failover: reassigning a master while the network is partitioned is itself a coordination problem and risks two masters for the same record.

PC/EL is rare, but it is not a contradiction in terms. It fits workloads that want fast local reads but must never see conflicting writes. Systems that need strong consistency all the time, like Spanner, go further and accept the latency cost in normal operation as well (PC/EC).

Solution 4

Recommended class: PA/EL

Reasoning:

The constraints are:

Latency SLA < 50ms → The system must be fast. PC/EC systems like Spanner can add tens of milliseconds of coordination per write, eating most or all of the budget.
Cannot drop requests during partition → The system must be Available.
Can afford to lose some bid data → Inconsistency is tolerable (within reason).

Combining these: PA (must be available during partition) + EL (must be fast in normal operation = eventual consistency).

Database choices:

Cassandra (CL=ONE for both reads and writes):
- Writes are fast (commit log + memtable, acknowledged immediately).
- Reads are fast (read from one replica).
- During partition, continues accepting writes.
- Downside: bids might be lost in conflicts (LWW resolution).
- Mitigation: use client-side timestamps and accept some data loss.
Redis Cluster (async replication):
- Extremely low latency (in-memory).
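- During partition, the majority side keeps serving; a master stranded on the minority side stops accepting writes after the node timeout, so availability is weaker than Cassandra’s.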
- Downside: data loss is possible (async replication).
- Acceptable for bidding (losing a bid = lost revenue, but better than dropping all bids).
Custom in-memory cache + async write to durable store:
- Keep bid state in a distributed cache (Redis/Memcached).
- Async writes to Cassandra/PostgreSQL for durability.
- Trade-off: bid data is ephemeral; losing recent bids is OK.
- This is a common real-world RTB architecture.
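
The “bids might be lost in conflicts (LWW resolution)” downside listed for Cassandra can be seen in a minimal last-write-wins sketch. The `Versioned` type and `lww_merge` function are hypothetical, and breaking timestamp ties by comparing values is an assumption made here only to keep the merge deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_us: int  # client-supplied timestamp in microseconds


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep the version with the higher timestamp and silently drop the other."""
    return max(a, b, key=lambda v: (v.timestamp_us, v.value))
```

If two bidders update the same key during a partition, whichever write carries the lower timestamp disappears on merge, even if it arrived later in real time because of clock skew. That is the data loss this design accepts in exchange for speed and availability.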

Why NOT PC/EC:

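Spanner: commit coordination can consume most of the 50ms budget, leaving little room for the bidding logic itself.
ZooKeeper/etcd: built for small coordination data, not high-volume bid traffic; every write goes through a single leader.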
MongoDB (primary reads): a single primary becomes a bottleneck and latency risk.

Best answer: PA/EL, implemented with Cassandra (CL=ONE) or Redis Cluster. Accept some data loss as a trade-off for speed and availability.