Key Question
How do we recover from transient network failures without accidentally duplicating operations?
Deep Dive
In a distributed system, the network is unreliable. If you send a request and it times out, you face the Three Scenarios of Uncertainty:
- The request never reached the server.
- The request reached the server and failed.
- The request reached the server, succeeded, but the response was lost.
If you blindly retry, the third scenario leads to double-billing a customer or creating duplicate records, because the server has already done the work once. This is where Idempotency becomes the load-bearing pillar of resilience. An idempotent operation is one whose effect on the system is the same whether it is applied once or many times.
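To make the distinction concrete, here is a minimal sketch contrasting an idempotent update with a non-idempotent one. The `set_email` and `charge` functions and the dictionaries they operate on are hypothetical, chosen only for illustration.

```python
def set_email(user: dict, email: str) -> dict:
    """Idempotent: applying it twice leaves the same state as applying it once."""
    user["email"] = email
    return user


def charge(account: dict, amount: int) -> dict:
    """Not idempotent: every call changes the balance again."""
    account["balance"] -= amount
    return account


user = {"email": "old@example.com"}
set_email(user, "new@example.com")
set_email(user, "new@example.com")  # a retry leaves the state unchanged

account = {"balance": 100}
charge(account, 30)
charge(account, 30)  # a blind retry bills the customer a second time
```

Retrying `set_email` is harmless, while retrying `charge` is exactly the double-billing problem described above.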
The Anatomy of a Resilient Retry
To build a system that survives “flaky” networks, you must implement three things:
- Idempotency Keys: The client attaches a unique ID (e.g., a UUID) to the request. The server stores this key. If it sees the same key again, it returns the cached success response instead of executing the logic twice.
- Exponential Backoff: Don’t retry immediately. If a service is down, a “retry storm” from thousands of clients acts like a Distributed Denial of Service (DDoS) attack. Instead, wait 1s, then 2s, then 4s…
- Jitter: If all clients wait exactly 2s, they will all retry at the same millisecond, causing “thundering herd” spikes. Add a random delay (e.g., wait between 1.5s and 2.5s) to spread the load.
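The three pieces above can be sketched together as follows. This is an illustrative sketch under stated assumptions, not a production implementation: `FlakyServer`, `backoff_delay`, and `charge_with_retry` are hypothetical names, the key store is an in-memory dictionary, and the jitter is assumed to be plus or minus 25% of the delay (which gives the 1.5s to 2.5s window around 2s mentioned above).

```python
import random
import time
import uuid


class FlakyServer:
    """Hypothetical server that runs the logic at most once per idempotency key."""

    def __init__(self, lose_first_n_responses: int = 0):
        self.responses = {}  # idempotency key -> cached response
        self.charges = []    # side effects actually performed
        self._lose = lose_first_n_responses

    def charge(self, key: str, amount: int) -> dict:
        if key not in self.responses:
            # First time this key is seen: execute the business logic once.
            self.charges.append(amount)
            self.responses[key] = {"status": "ok", "amount": amount}
        if self._lose > 0:
            # Simulate the third scenario: the work succeeded, the reply was lost.
            self._lose -= 1
            raise TimeoutError("response lost")
        return self.responses[key]


def backoff_delay(attempt: int, base: float = 1.0, jitter: float = 0.25) -> float:
    """Delay before retry `attempt` (1-based): base * 2^(attempt - 1), +/- jitter."""
    delay = base * 2 ** (attempt - 1)
    return delay * random.uniform(1 - jitter, 1 + jitter)


def charge_with_retry(server, amount, max_retries=4, base=1.0, sleep=time.sleep):
    key = str(uuid.uuid4())  # one key for the whole logical operation
    for attempt in range(max_retries + 1):
        try:
            return server.charge(key, amount)
        except TimeoutError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            sleep(backoff_delay(attempt + 1, base))


# The first two responses are lost; a small base keeps the demo fast.
server = FlakyServer(lose_first_n_responses=2)
result = charge_with_retry(server, 500, base=0.01)
```

Note that the key is generated once, outside the retry loop. Generating a fresh key on every attempt would defeat the purpose, because the server would treat each retry as a new operation.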
Key Takeaways
- Retries are necessary because "the network is reliable" is a fallacy: requests and responses can be lost at any time.
- Idempotency makes retries safe by ensuring that “exactly-once” effects are achieved despite “at-least-once” delivery.
- Backoff + Jitter protects the system from its own recovery mechanisms.
Exercises
- Why is a POST /orders request usually NOT idempotent by default, while a PUT /users/123 often is?
- If a server receives a request with an idempotency key it has already processed, what HTTP status code should it return?
- Calculate the wait time for the 4th retry using binary exponential backoff (base 2).
Solutions
- POST typically creates a new resource each time, whereas PUT (by spec) replaces the resource at a specific location, making it naturally idempotent.
- Usually a 200 OK (returning the original result) or a 204 No Content. Some systems return a 409 Conflict if the parameters differ from the original request.
- $2^{(4-1)} = 2^3 = 8$ units of time.