Production Engineering & Resilience: Rate Limiting

Production Engineering Resilience · Traffic Control

Key Question

How do we prevent a single abusive client or a “retry storm” from crashing our entire system?

Deep Dive

A distributed system has finite resources (CPU, Memory, Database connections). If the incoming request rate exceeds the system’s capacity, the result is Congestion Collapse: the system spends more time managing overhead than doing actual work, and latency spikes to infinity.

Rate Limiting is the practice of rejecting requests once a certain threshold is reached to protect the system’s health.

Common Algorithms

Token Bucket: The system has a “bucket” that refills with tokens at a fixed rate. Each request costs one token. If the bucket is empty, the request is rejected. This allows for small “bursts” of traffic but enforces a long-term average.
Leaky Bucket: Imagine a bucket with a hole in the bottom. Requests enter the bucket, but they “leak” out (are processed) at a constant rate. This smoothes out traffic perfectly, eliminating bursts.
Fixed Window: You allow 100 requests per minute. At 12:01:00, the counter resets. This is simple but dangerous: if a client sends 100 requests at 12:00:59 and another 100 at 12:01:01, the system sees a spike of 200 requests in 2 seconds.

Distributed Rate Limiting

In a distributed system, rate limiting is hard because the “counter” must be shared across multiple servers. If you store the counter in a local variable, a client could hit 10 different servers and get 10x the limit.

The solution is usually an external, fast Key-Value store like Redis. Each server checks Redis before processing a request. To minimize the network “tax” of checking Redis for every request, some systems use Local Batching: each server reserves 10 tokens from Redis at once and doles them out locally.

Key Takeaways

Rate limiting is about Fairness and Self-Preservation.
Leaky Bucket = Smooth traffic; Token Bucket = Controlled bursts.
Distributed rate limiting requires a centralized fast counter (like Redis).

Exercises

Which algorithm is better for an API that expects occasional bursts of traffic?
A client is rate-limited. What HTTP status code should the server return?
How does “Local Batching” improve performance in a distributed rate limiter?

👁️ View Solutions

Token Bucket, as it allows the client to use up accumulated tokens quickly.
429 Too Many Requests.
It reduces the number of network round-trips to the central counter (e.g., Redis), lowering latency and reducing load on the counter itself.