Key Question
Which real-world systems choose CP, which choose AP, and why?
Deep Dive
Every distributed system must choose what breaks during a partition: consistency (CP) or availability (AP). Letโs examine real systems on both sides.
CP Systems (Sacrifice Availability during partition)
CP systems choose to stop serving writes (or reads) when they canโt guarantee consistency. They prefer correctness over uptime.
HBase:
- Architecture: HBase uses HDFS for storage and ZooKeeper for coordination. It has a single active master and per-region servers.
- During partition: If a region server is partitioned away from ZooKeeper, it stops serving data. The master might reassign its regions.
- User experience: Queries to the partitioned region fail or timeout. The system is unavailable for that data.
- Why CP? HBase is designed for analytic workloads (e.g., supporting a search index or recommendation system) where reading inconsistent data would produce wrong results.
Before partition: During partition:
โโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโ
โ HMaster โ โ HMaster โ โ (partition) โ
โ + ZK โ โ + ZK โ โ โ
โโโโโโโโโโโโโโโค โโโโโโโโโโโโโโโค โโโโโโโโโโโโโโโค
โ RegionServerโ โ RegionServerโ โ RegionServerโ
โ (active) โ โ (active) โ โ (stopped!) โ
โโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโ
All reads/writes Denied service
go here for data here
ZooKeeper:
- Architecture: ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast) with a leader and followers.
- During partition: If the leader is isolated from the majority, the leader steps down. The majority elects a new leader. The isolated minority receives no updates and stops serving.
- Why CP? ZooKeeper is a coordination service for distributed locks, configuration, and leader election. You cannot serve stale lock information โ that could cause split-brain (two processes both thinking they hold the lock).
AP Systems (Sacrifice Consistency during partition)
AP systems continue to serve all requests during a partition, even if some replicas return stale data.
Cassandra:
- Architecture: Peer-to-peer with no single leader. Any node can accept writes.
- During partition: Each side of the partition continues accepting reads and writes. Data diverges.
- User experience: All requests succeed. Clients on different sides of the partition see different data temporarily.
- Why AP? Cassandra was designed for always-available systems (e.g., time-series data, IoT sensor data) where collecting data is more important than ensuring every read is perfectly up-to-date.
During partition:
โโโโโโโโโโโ // โโโโโโโโโโโ
โCassandraโ~~~~~~~~โCassandraโ
โ Node A โ โ Node B โ
โ Accepts โ โ Accepts โ
โ writes โ โ writes โ
โ X=1 โ โ X=2 โ
โโโโโโโโโโโ โโโโโโโโโโโ
โ โ
Client A: Client B:
writes X=1 writes X=2
[both succeed!]
DynamoDB (eventual consistency mode):
- Architecture: Multi-master with conflict resolution.
- During partition: Both sides accept reads and writes.
- Why AP? Amazonโs core use case was shopping carts โ dropping requests risks losing sales. Serving slightly stale data (e.g., โitem appears unavailable for 1 secondโ) is better than throwing errors.
System Classification Diagram
Consistency (C)
โ
CP โ CA (not possible in
HBase โ distributed systems)
ZooKeeper โ
etcd โ
Redis โ
(cluster) โ
โ
โโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโ Availability (A)
โ
โ AP
โ Cassandra
โ DynamoDB EC
โ Riak
โ CouchDB
โ
PC/EC PA/EL
(PACELC) (PACELC)
Use Cases that Justify Each Choice
| System Type | Good for | Bad for |
|---|---|---|
| CP | Financial transactions, locks, leader election, metadata | High-availability web apps, real-time analytics |
| AP | Social media, shopping carts, IoT, content delivery | Systems requiring strict ordering |
The choice should be driven by: what is more costly for your application โ serving stale data or serving no data?
Check Your Understanding
- Is HBase โalwaysโ unavailable during a partition, or only for the partitioned data?
- Why does ZooKeeper need to be CP rather than AP?
- Can Cassandra be configured to behave like a CP system? If so, how?
- During a partition in DynamoDB (eventual consistency mode), what happens to writes on the minority side?
The โSo What?โ
When youโre evaluating a database, one of the first questions should be: โWhat does this system do when the network breaks?โ If the answer is โstops servingโ (CP), you need retry logic and a tolerance for downtime. If the answer is โkeeps serving with stale dataโ (AP), you need conflict resolution and a tolerance for temporary inconsistency. Thereโs no right answer โ only what fits your application.
โ๏ธ Exercises
CAP Theorem โ Exercises
Exercise 1
A traditional relational database (e.g., PostgreSQL) with a single primary and two asynchronous replicas experiences a network partition that isolates the primary from the replicas. Is this system CP or CA during the partition? Explain what happens.
Exercise 2
A system claims to provide โeventual consistency.โ Can it also claim to provide CAP Consistency (linearizability)? Explain the relationship between eventual consistency and CAPโs C.
Exercise 3
During a network partition, an AP system accepts writes on both sides of the partition. When the partition heals, those writes might conflict. Describe two approaches the system can use to resolve these conflicts and converge.
Exercise 4
A single-machine database with no replication experiences a hard drive failure that prevents any reads or writes. Does CAP apply to this scenario? Why or why not? What about a single-machine database that IS replicated (e.g., a single PostgreSQL instance that streams WAL to a standby)?
๐๏ธ View Solutions
CAP Theorem โ Solutions
Solution 1
This system is neither cleanly CP nor AP โ itโs a CP system that becomes unavailable during a partition.
Hereโs what happens:
- The primary is isolated from the replicas.
- Writes to the primary succeed (no C โ replicas donโt have them).
- Reads from the primary return the latest data (A for primary-connected clients).
- Reads from the replicas return stale data (C is violated โ replicas are behind the primary).
- Read-only clients connected to replicas see stale data silently.
PostgreSQLโs asynchronous replication does NOT detect or handle partitions explicitly. So during a partition:
- The system is available (writes succeed on primary, reads on both sides) โ โ A
- But C is violated: replicas serve stale data โ โ C
- During the partition, it behaves like AP (inconsistently)
However, if you consider the system as designed (it wasnโt designed for partition tolerance), then when the partition heals, the replicas eventually catch up. But during the partition, it is NOT consistent.
If the replicas were SYNCHRONOUS (waiting for at least one replica to acknowledge), the system would be CP: the primary would block writes during the partition (since the replica canโt acknowledge), and reads from the primary would be consistent but the primary would be effectively unavailable for writes.
Solution 2
No. Eventual consistency and CAP Consistency (linearizability) are different things.
- CAP Consistency (linearizability) provides a real-time guarantee: once a write completes, every subsequent read must return that value.
- Eventual consistency guarantees convergence only when writes stop. It does NOT guarantee that a read after a write returns the latest value.
You CANNOT claim CAP Consistency AND eventual consistency for the same operation at the same time. Hereโs why:
Client A writes X=1 (completed)
Client B reads X (starts 1ms after A's write completed)
- Under CAP-C: B must see X=1
- Under EC: B might see X=0 (it will eventually converge, maybe after 100ms)
However, a system can provide both guarantees for DIFFERENT operations:
- Strongly consistent (CAP-C) writes to the catalog table.
- Eventually consistent reads to the user data table.
Or provide tunable consistency:
- Cassandra: read at QUORUM โ CAP-C, read at ONE โ eventual consistency.
The key insight: eventual consistency is a WEAKER guarantee than CAP-C. They are on a spectrum, not two sides of a coin.
Solution 3
Approach 1: Last Writer Wins (LWW)
Each write is timestamped. When the partition heals and writes are compared, the write with the latest timestamp wins.
Side A: write X=1 at T=10:00:05
Side B: write X=2 at T=10:00:06
After healing: X=2 wins (T=10:00:06 > 10:00:05)
Simple and always converges. But Side Aโs write (X=1) is silently lost.
Approach 2: Vector Clocks with Sibling Resolution
Each node maintains a version counter. After healing, conflicting versions are detected via vector clocks:
Side A: write X=1 โ version ([A,1])
Side B: write X=2 โ version ([B,1])
Neither dominates the other (A doesn't know about B's write, vice versa).
Resolution: both values are kept as siblings: {1, 2}
On next read, the application merges them:
- Shopping cart: union of items {milk} + {eggs} = {milk, eggs}
- Profile name: ask the user which one to keep
- Counter: sum (if appropriate) or max (if appropriate)
Both approaches are used in production. LWW for simplicity (most systems), siblings for data integrity (Dynamoโs shopping cart).
Solution 4
Part 1: Single-machine database with hard drive failure
CAP does NOT apply to this scenario in the meaningful sense:
- There is NO network (single machine).
- Partition Tolerance (โPโ) is about network partitions โ the system cannot be split into groups that cannot communicate.
- The hard drive failure is a node failure, not a partition.
A single-node system is effectively โCAโ trivially:
- C: Thereโs only one copy of data, so itโs always consistent.
- A: Itโs available as long as the node is up.
- P: โPartitionsโ are impossible because thereโs only one node.
This is why CAP is only interesting for DISTRIBUTED (multi-node) systems.
Part 2: Single PostgreSQL instance with WAL streaming to standby
CAP DOES apply here because there are two nodes connected by a network:
- If the primary fails and the standby isnโt promoted: system becomes unavailable (no A).
- If the network between primary and standby drops: the standby falls behind (no C if reads from standby).
- If you configure synchronous replication: the primary blocks writes when the standby is unreachable (no A for writes).
So even a โsimpleโ primary-standby setup is subject to CAP. The moment you have two nodes connected by a network, CAP applies.