Consistency Trade offs · PACELC

Key Question

Even when the network is perfect, why must we still choose between latency and consistency?

Deep Dive

The “ELC” part of PACELC (Else — Latency vs Consistency) addresses the trade-off that dominates day-to-day system operation: how much are you willing to slow down requests to get stronger consistency guarantees?

Why Coordination Costs Latency

Maintaining strong consistency requires coordination between replicas. Coordination means:

Sending messages to multiple nodes.
Waiting for responses (quorum).
Blocking the response to the client until the outcome is known.

Each network round trip adds latency. The more replicas you involve, the slower the operation.

Read from ONE replica:
Client → Replica → Client
  1 round trip, ~1-5ms total
  Risk: data might be stale

Read from QUORUM (3 of 5 replicas):
Client → Coordinator → Replica A, B, C (in parallel)
                       Replica A responds
                       Replica B responds ← slowest of the quorum
                       Coordinator → Client
  2 round trips, latency = slowest quorum member ~5-15ms
  Guarantee: at least quorum saw latest write

Read from ALL replicas:
Client → Coordinator → Replica A, B, C, D, E (in parallel)
                       All 5 respond
                       Coordinator → Client
  2 round trips, latency = slowest of ALL replicas ~5-50ms
  Strongest guarantee: absolutely the latest value

The Latency-Consistency Curve

Latency
  ^
  |                      *  (ALL)
  |                    *
  |                  *
  |               *       (QUORUM)
  |            *
  |         *             (ONE)
  |      *
  |   *
  +───────────────────────────────────> Consistency Level
        Weak                    Strong

Each jump in consistency level adds latency. The marginal cost increases non-linearly because you’re waiting for the slowest node in your quorum.

Cassandra’s Configurable Consistency

Cassandra exposes the ELC trade-off directly to the application developer through configurable consistency levels:

Level	Behavior	Latency	Guarantee
ONE	Read/write to one replica	Lowest	Weak — may read stale
QUORUM	Read/write to majority	Medium	Strong — intersection with write quorum
ALL	Read/write to all replicas	Highest	Strongest — all replicas up-to-date
LOCAL_QUORUM	Quorum in local datacenter	Medium+	Good for multi-DC (avoids cross-DC latency)

A developer chooses per-request:

-- Strong consistency for a payment
SELECT * FROM payments WHERE id=123 CONSISTENCY QUORUM;

-- Weak consistency for a user profile (stale is fine)
SELECT * FROM profiles WHERE id=456 CONSISTENCY ONE;

DynamoDB: Two Read Modes

DynamoDB offers two read consistency modes:

Eventually consistent reads (~5ms): Read from one replica. Might return stale data. Cheaper (consumes half the read capacity units).
Strongly consistent reads (~10ms): Read from quorum. Always returns the latest data. Twice the cost.

The choice affects both latency AND cost:

Eventually consistent:  5ms, 1 RCU per 4KB
Strongly consistent:   10ms, 1 RCU per 4KB (effectively 2x cost)

Why Not Always Use Strong Consistency?

If strong consistency is only 2-10ms slower, why not always use it? Three reasons:

Tail latency: Strong consistency latency is the SLOWEST node. At scale, the 99th percentile of “slowest node” can be 10x the median. This means occasional requests become very slow.
Availability: Strong consistency requires more replicas to be alive. If one replica is down, QUORUM might still work, but ALL would fail. Strong consistency reduces availability.
Geographic distribution: Cross-datacenter coordination adds 50-200ms. For globally distributed apps, strong consistency is often infeasible for user-facing operations.

The Core Insight

The ELC trade-off reveals that consistency is NOT binary. It’s a continuum:

Eventual Read ──→ Read-Your-Writes ──→ Monotonic Read ──→ Causal ──→ Linearizable
   2-5ms              2-6ms               3-8ms            5-15ms      10-200ms

Each step “up” the consistency ladder costs latency. The art of system design is choosing the RIGHT consistency level for each operation, not the strongest one for all operations.

Check Your Understanding

Why does reading from ALL replicas take longer than reading from ONE?
In Cassandra, what consistency level would you choose for a “write new user” vs “read user’s age for display”? Explain.
What happens to the ELC trade-off in a multi-datacenter deployment?
Is it possible to have a system with PC/EL classification? Why or why not?

The “So What?”

The ELC trade-off is the daily reality of database configuration. Every time you set a consistency level, choose a read mode, or configure a quorum size, you’re making an ELC decision. This is what PACELC captures that CAP doesn’t: most of your time is spent optimizing the normal case (low latency vs consistency), not planning for partitions. Cassandra’s tunable consistency and DynamoDB’s two read modes are explicit manifestations of the ELC trade-off.

✏️ Exercises

PACELC — Exercises

Exercise 1

Classify a single-node PostgreSQL database (no replication) using PACELC. Explain your reasoning for each part of the acronym.

Exercise 2

What PACELC classification does a system built on the Raft consensus algorithm provide? Explain by answering both the “P” and “E” parts of the acronym.

Exercise 3

Is a PC/EL system theoretically possible? Explain why or why not. What practical challenges would such a system face?

Exercise 4

You’re building a real-time bidding system for online ads. Bids must be processed in under 50ms or they’re rejected. During a partition, you can afford to lose some bid data (within reason) but cannot afford to drop incoming requests. Which PACELC class do you choose? Which database(s) would you consider? Justify your answer.

👁️ View Solutions

PACELC — Solutions

Solution 1

Single-node PostgreSQL: N/A (PACELC doesn’t meaningfully apply)

PACELC is designed for distributed (multi-node) systems. A single-node database:

P: There are NO replicas, so network partitions are impossible. The “P” condition never triggers.
E (Else): There’s only one copy of the data, so there’s no need to coordinate with other nodes. Reads and writes are trivially consistent and low latency.

Some might argue it’s trivially PC/EC: the single node is “consistent” (one copy = always consistent) and “consistent” in normal operation (same reasoning). But this misses the point — PACELC describes how multi-node systems behave when they need to copy data between nodes. A single-node system doesn’t face these trade-offs.

If forced to classify: it’s a degenerate case. It is both “consistent” (one copy) and “low latency” (no coordination), but it lacks fault tolerance — if the node dies, everything dies. PACELC doesn’t capture this availability/durability concern.

Solution 2

Raft provides PC/EC.

P — Partition (During a partition):

Raft maintains a leader. If the leader is in the majority partition, it continues to operate. The minority partition cannot elect a leader (needs majority), so it cannot accept writes. The minority becomes unavailable for writes.

Raft sacrifices Availability during a partition → PC (Consistency is chosen).

If the leader is in the MINORITY partition (isolated from the majority), the leader steps down. The majority elects a new leader. Writes to the old leader are lost. But from the perspective of the system as a whole: the majority partition remains consistent, and the minority rejects writes. This is CP behavior.

E — Else (Normal operation):

Raft is designed for strong consistency. All reads and writes go through the leader. Write requests are replicated to a majority before being acknowledged. Reads can also go through the leader (guaranteed consistent) or use read-index/lease mechanisms (consistent with some optimization).

Raft provides EC (strong consistency) during normal operation, at the cost of some latency (two round trips for writes: leader → followers → leader acknowledgment).

Raft provides PC/EC. This is exactly what systems like etcd, ZooKeeper (ZAB, which is Raft-like), and Consul provide.

Solution 3

PC/EL is theoretically impossible. Here’s why:

PC means: during a partition, the system chooses Consistency over Availability. To be consistent, nodes must coordinate (e.g., via quorum, leader-based replication).
EL means: in normal operation (no partition), the system chooses Latency over Consistency — it uses fast, uncoordinated operations that might return stale data.

The contradiction:

If you’re willing to accept stale data in normal operation (EL), why would you enforce strong consistency during a partition (PC)? The partition is precisely when maintaining consistency is HARDEST (network is broken). If you don’t enforce consistency when the network works, you certainly can’t enforce it when the network is broken.

More formally:

EL implies that in normal operation, you use techniques like reading from one replica (no quorum coordination).
But for PC, during a partition, you need quorum-based coordination to maintain consistency.
The system would need to somehow “switch” from non-coordinated mode to coordinated mode during a partition. But a partition is the worst time to switch to a coordination-heavy mode — you can’t coordinate because the network is broken!

Practical attempt at PC/EL:

You might try: “Use fast uncoordinated reads normally, but during a partition, block all reads.” That’s not PC/EL — blocking reads is not EL (it’s not low latency).

You might try: “Pre-configure all nodes with a consistent snapshot, so reads during a partition are fast AND consistent.” But this requires coordination to PREPARE the snapshot — you’ve just moved the coordination to a different time.

PC/EL is a contradiction in terms. Consistency requires coordination, and coordination adds latency. You can minimize the latency through optimization, but you cannot eliminate it. Systems like Spanner are PC/EC — they accept the latency cost for consistency.

Solution 4

Recommended class: PA/EL

Reasoning:

The constraints are:

Latency SLA < 50ms → The system must be fast. PC/EC systems like Spanner add 10-50ms for coordination, eating the entire budget.
Cannot drop requests during partition → The system must be Available.
Can afford to lose some bid data → Inconsistency is tolerable (within reason).

Combining these: PA (must be available during partition) + EL (must be fast in normal operation = eventual consistency).

Database choices:

Cassandra (CL=ONE for both reads and writes):
- Writes are fast (commit log + memtable, acknowledged immediately).
- Reads are fast (read from one replica).
- During partition, continues accepting writes.
- Downside: bids might be lost in conflicts (LWW resolution).
- Mitigation: use client-side timestamps and accept some data loss.
Redis Cluster (async replication):
- Extremely low latency (in-memory).
- During partition, each side continues.
- Downside: data loss is possible (async replication).
- Acceptable for bidding (losing a bid = lost revenue, but better than dropping all bids).
Custom in-memory cache + async write to durable store:
- Keep bid state in a distributed cache (Redis/Memcached).
- Async writes to Cassandra/PostgreSQL for durability.
- Trade-off: bid data is ephemeral; losing recent bids is OK.
- This is a common real-world RTB architecture.

Why NOT PC/EC:

Spanner: 10-50ms latency is too high for the 50ms SLA.
ZooKeeper/etcd: too slow (all writes through leader).
MongoDB (primary reads): a single primary becomes a bottleneck and latency risk.

Best answer: PA/EL, implemented with Cassandra (CL=ONE) or Redis Cluster. Accept some data loss as a trade-off for speed and availability.