Distributed & Decentralized Systems Curriculum
Reflection Real Systems · Redis Cluster

Key Question

In a world of strongly consistent databases, why would anyone choose Redis Cluster?

Deep Dive

Redis Cluster is not a general-purpose database. It is a specialized tool for workloads that fit a specific profile: high-throughput, low-latency, and tolerant of data loss.

The Case for Redis Cluster

The typical production use case is a session store or cache layer. Consider an e-commerce application:

  • A user browses products. Each page load reads the session → single-digit millisecond latency.
  • The user adds items to a cart. The cart lives in Redis → sub-millisecond write.
  • If a Redis node fails during a flash sale, the last 2 seconds of cart additions are lost → users re-add items. Annoying but not catastrophic.
  • Peak traffic is 500K requests/second → Redis handles it on 3 nodes.

Contrast this with a strongly consistent system:

  • Spanner would guarantee no data loss, but adds 5–15ms commit wait latency.
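  • Cassandra (QUORUM) would add latency, because every read and write waits for a majority of replicas, and would require more nodes.
  • etcd would give linearizable writes, but its consensus-replicated write path sustains only a fraction of Redis’s throughput at this load.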

When NOT to Use Redis Cluster

  • Financial transactions: losing write #4 of a series means the balance is wrong.
  • Counter data that must be exact: “How many times did this ad get viewed?” — Redis’s async replication means views can disappear.
  • Data that exceeds memory budget: Redis is an in-memory store. If your dataset is 500 GB and you have 3 × 64 GB nodes, you have a problem.
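
The memory-budget bullet is simple arithmetic, sketched below. The function name `mastersNeeded` and the 25% headroom default are assumptions made for illustration, not Redis recommendations; real sizing also depends on replication buffers, fragmentation, and persistence settings.

```typescript
// Rough capacity check: how many masters would a dataset need?
// headroomFraction is an assumed safety margin, not an official figure.
function mastersNeeded(
  datasetGb: number,
  nodeMemoryGb: number,
  headroomFraction = 0.25,
): number {
  const usablePerNodeGb = nodeMemoryGb * (1 - headroomFraction);
  return Math.ceil(datasetGb / usablePerNodeGb);
}

// The example from the text: 500 GB of data, 3 nodes of 64 GB each.
const totalRawGb = 3 * 64; // 192 GB before any headroom
const needed = mastersNeeded(500, 64); // 500 / 48 = 10.42, rounded up to 11
console.log(`raw capacity: ${totalRawGb} GB, masters needed: ${needed}`);
```

Under these assumptions the 3-node cluster holds well under half the dataset, and replicas would double the node count again.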

The Operational Reality

Running Redis Cluster in production requires:

  • Slot monitoring: Uneven slot distribution creates hot spots.
  • Resharding during low traffic: Moving slots between nodes is network-intensive.
  • Replica placement: Replicas should be in different failure domains (racks, availability zones).
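  • Cluster bus: A separate port for node-to-node gossip, by default the client port + 10000 (for example, 16379 alongside 6379). It is unencrypted unless TLS is enabled for the bus (tls-cluster yes in Redis 6 and later), so it is another thing to firewall.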
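
Two of these checks are easy to sketch. The snippet below counts slots per master and derives the default cluster bus port. The node names, the `SlotRange` shape, and the `imbalance` metric are hypothetical; a real monitor would parse the slot ranges from the cluster itself, and an even slot count does not rule out hot keys.

```typescript
// A slot range owned by one master, inclusive on both ends.
interface SlotRange { start: number; end: number }

const TOTAL_SLOTS = 16384;

// Count slots per master from a map of node name -> owned ranges.
function slotCounts(owners: Record<string, SlotRange[]>): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const [node, ranges] of Object.entries(owners)) {
    counts[node] = ranges.reduce((sum, r) => sum + (r.end - r.start + 1), 0);
  }
  return counts;
}

// Ratio of the largest share to the ideal even share (1.0 = perfectly even).
function imbalance(counts: Record<string, number>): number {
  const values = Object.values(counts);
  const ideal = TOTAL_SLOTS / values.length;
  return Math.max(...values) / ideal;
}

// Default cluster bus port: client port + 10000.
function clusterBusPort(clientPort: number): number {
  return clientPort + 10000;
}

// Hypothetical 3-master layout with an even split.
const layout: Record<string, SlotRange[]> = {
  "node-a": [{ start: 0, end: 5460 }],
  "node-b": [{ start: 5461, end: 10922 }],
  "node-c": [{ start: 10923, end: 16383 }],
};
const counts = slotCounts(layout);
console.log(counts, imbalance(counts).toFixed(3), clusterBusPort(6379));
```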

The “So What?”

Redis Cluster is the most honest distributed system of the four studied in this module. It does not pretend to be strongly consistent. Its documentation explicitly states what you lose. This honesty is valuable: when you choose Redis Cluster, you know exactly what you’re getting.

Full Source

View or download the complete implementation: redis-cluster.ts

Exercises

  1. You’re building a real-time leaderboard for a gaming platform. Would you use Redis Cluster? Why or why not?
  2. A startup uses Redis Cluster as its primary database. What risks does this pose as the company grows?
  3. Design a hybrid architecture that uses Redis Cluster for caching and another system for durable storage. How would you handle cache misses and writes?

Solutions

  1. Yes — Redis’s sorted sets (ZADD/ZRANGE) are the ideal data structure for leaderboards, and the async replication gap is acceptable for a leaderboard where scores may be slightly stale.
  2. Three risks: (a) data loss during failover erases recent writes, (b) total dataset may exceed available memory as the company grows, (c) lack of cross-slot transactions makes multi-key operations difficult.
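  3. Pattern: cache-aside reads with database-first writes. On read: check Redis (cache hit → return); on a miss, read from Postgres, write the value to Redis, and return it. On write: write to Postgres first, then update (or invalidate) the Redis entry asynchronously. A strict write-through cache would instead update Redis synchronously as part of the write. Either way, the design combines Redis’s speed with Postgres’s durability: Redis Cluster serves the bulk of reads, while Postgres remains the source of truth for every write.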
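
The read and write paths from solution 3 can be sketched as follows. This is a minimal illustration, not a production design: the two `Map` objects are in-memory stand-ins for Redis Cluster and Postgres, the function names are hypothetical, and real clients would be asynchronous and would need expiry and error handling.

```typescript
// In-memory stand-ins (assumptions): `cache` plays Redis Cluster,
// `db` plays the durable store (for example Postgres).
const cache = new Map<string, string>();
const db = new Map<string, string>();
let dbReads = 0; // counts how often a read falls through to the database

// Read path: try the cache, fall back to the database on a miss,
// then populate the cache so the next read is a hit.
function read(key: string): string | undefined {
  const cached = cache.get(key);
  if (cached !== undefined) return cached; // cache hit
  dbReads += 1;
  const value = db.get(key); // cache miss: go to the durable store
  if (value !== undefined) cache.set(key, value);
  return value;
}

// Write path: durable write first, then refresh the cache entry.
// Deleting the cache entry instead is a common alternative.
function write(key: string, value: string): void {
  db.set(key, value);
  cache.set(key, value);
}

write("cart:42", "3 items");
cache.clear();                  // simulate losing the cache node
const first = read("cart:42");  // miss, served by the database
const second = read("cart:42"); // hit, served by the cache
console.log(first, second, dbReads);
```

Because the database is written first, losing the cache costs latency on the next read rather than data.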

Redis Cluster — Exercises

Exercise 1

You have a 6-node Redis Cluster (3 masters, 3 replicas). A master node handling slots 5461–10922 crashes. Its replica is promoted. What keys are lost?

Exercise 2

Given the key set: user:100, {shard:a}:profile, order:{2024}:001, session:abc, compute the hash slot for each using CRC16 mod 16384. Which keys would land on the same node if you use hash tags?
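
To check a hand computation, the slot function can be sketched in a few lines. This assumes the CRC16 variant Redis Cluster uses (CCITT/XMODEM, polynomial 0x1021, initial value 0) and the hash tag rule (hash only the text between the first `{` and the next `}`, if non-empty). The helper names are hypothetical, and `asciiBytes` assumes ASCII-only keys, which holds for every key in this exercise.

```typescript
// Convert an ASCII string to bytes (assumption: no multi-byte characters).
function asciiBytes(text: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < text.length; i++) bytes.push(text.charCodeAt(i) & 0xff);
  return bytes;
}

// CRC16-CCITT (XMODEM variant): polynomial 0x1021, initial value 0,
// processed most significant bit first.
function crc16(bytes: number[]): number {
  let crc = 0;
  for (const byte of bytes) {
    crc ^= byte << 8;
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc & 0x8000) ? ((crc << 1) ^ 0x1021) : (crc << 1);
      crc &= 0xffff; // keep the register at 16 bits
    }
  }
  return crc;
}

// Return the part of the key that gets hashed: the hash tag if one exists,
// otherwise the whole key.
function hashTag(key: string): string {
  const open = key.indexOf("{");
  if (open === -1) return key;
  const close = key.indexOf("}", open + 1);
  if (close === -1 || close === open + 1) return key; // unclosed or empty tag
  return key.slice(open + 1, close);
}

function keyHashSlot(key: string): number {
  return crc16(asciiBytes(hashTag(key))) % 16384;
}

// Print the slot for each key in the exercise.
for (const key of ["user:100", "{shard:a}:profile", "order:{2024}:001", "session:abc"]) {
  console.log(key, "->", keyHashSlot(key));
}
```

On a live cluster, the CLUSTER KEYSLOT command reports the same value and is the authoritative check.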

Exercise 3

Redis Cluster nodes track epochs (configuration versions) and exchange them over the gossip-based cluster bus. A newly created node starts with epoch 0. There is no cluster leader: a node obtains a higher configuration epoch when it wins a failover election or takes ownership of slots during resharding. Why is this epoch necessary? (Hint: think about stale routing tables.)

Exercise 4

Compare the slot bitmap approach (Redis) with the token ring approach (Cassandra). Which requires more metadata to route a request? Which handles node addition more smoothly?


Solutions

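  1. Only writes to slots 5461–10922 that the crashed master acknowledged but had not yet replicated to its replica. Keys modified in that window revert to their last replicated value, and keys created in it disappear. Since replication is asynchronous, the window is typically 1–100 milliseconds of writes, though replication lag under heavy load can stretch it. The keys on the surviving masters are unaffected.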

  2. Hash slot computation:

    • CRC16("user:100") mod 16384 = slot X
    • CRC16("shard:a") mod 16384 = slot Y (only “shard:a” due to {})
    • CRC16("2024") mod 16384 = slot Z (only “2024” due to {})
    • CRC16("session:abc") mod 16384 = slot W
    • {shard:a}:profile and {shard:a}:name would share the same slot. order:{2024}:001 and order:{2024}:002 would share a different slot.
  3. The epoch prevents stale routing information from overwriting fresh data. If a partitioned node wakes up with an old epoch and broadcasts its slot map, nodes that have seen a higher epoch for those slots know to ignore it. The epoch plays a role similar to Raft’s term number, although Redis Cluster does not run a full consensus protocol over its data.

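  4. Redis routes with a fixed 16384-entry slot table: O(1) lookup, and each node advertises the slots it owns as a 16384-bit bitmap (2 KB) in cluster bus messages. Cassandra’s token ring uses a sorted list of tokens with binary search: O(log N) lookup, with metadata that grows with the number of tokens. Redis is simpler at the cost of operator-driven rebalancing, since slots move only when you reshard; Cassandra is more complex, but a joining node claims token ranges and streams its data automatically.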
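
The two lookup styles compared in solution 4 can be sketched side by side. This is an illustration only: the node names and the small integer tokens are made up (real Cassandra tokens are 64-bit hash values), and neither function reflects the actual source code of either system.

```typescript
// Redis-style: a fixed 16384-entry table, slot -> owning node. O(1) lookup.
const SLOT_COUNT = 16384;

interface SlotOwner { node: string; start: number; end: number }

function buildSlotTable(ranges: SlotOwner[]): string[] {
  const table: string[] = new Array<string>(SLOT_COUNT).fill("");
  for (const r of ranges) {
    for (let slot = r.start; slot <= r.end; slot++) table[slot] = r.node;
  }
  return table;
}

// Ring-style: tokens sorted ascending. A hash belongs to the first token
// that is >= the hash, wrapping to the first token. O(log N) binary search.
interface RingEntry { token: number; node: string }

function ringLookup(ring: RingEntry[], hash: number): string {
  let lo = 0;
  let hi = ring.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (ring[mid].token < hash) lo = mid + 1;
    else hi = mid;
  }
  return ring[lo === ring.length ? 0 : lo].node;
}

const table = buildSlotTable([
  { node: "A", start: 0, end: 5460 },
  { node: "B", start: 5461, end: 10922 },
  { node: "C", start: 10923, end: 16383 },
]);
const ring: RingEntry[] = [
  { token: 100, node: "A" },
  { token: 200, node: "B" },
  { token: 300, node: "C" },
];

console.log(table[12182]);          // direct index into the slot table
console.log(ringLookup(ring, 150)); // first token >= 150 is 200
console.log(ringLookup(ring, 350)); // past the last token, wraps to the first
```

The slot table has a fixed size regardless of cluster size, while the ring grows with the number of tokens, which is the metadata trade-off the solution describes.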