Distributed & Decentralized Systems Curriculum
Production Engineering Resilience · Resilience Patterns

Key Question

How do we recover from transient network failures without accidentally duplicating operations?

Deep Dive

In a distributed system, the network is unreliable. If you send a request and it times out, you cannot tell which of the Three Scenarios of Uncertainty you are in:

  1. The request never reached the server.
  2. The request reached the server and failed.
  3. The request reached the server, succeeded, but the response was lost.

A retry is harmless in scenario #1, but if you blindly retry in scenario #3 you repeat work that already succeeded: double-billing a customer or creating duplicate records. This is where Idempotency becomes the load-bearing pillar of resilience. An idempotent operation is one that leaves the system in the same state whether it is applied once or many times.
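
A minimal sketch of the distinction, using a hypothetical in-memory balance store (the names `set_balance` and `add_to_balance` are illustrative and not taken from any real API). Retrying an absolute write is harmless, while retrying a relative one compounds:

```python
# Hypothetical in-memory account store, for illustration only.
balances = {"alice": 100}

def set_balance(user: str, amount: int) -> None:
    """Idempotent: repeating the call leaves the same final state."""
    balances[user] = amount

def add_to_balance(user: str, delta: int) -> None:
    """Not idempotent: every repeat changes the state again."""
    balances[user] += delta

set_balance("alice", 150)
set_balance("alice", 150)       # the retry changes nothing
after_set = balances["alice"]

add_to_balance("alice", 50)
add_to_balance("alice", 50)     # the retry is applied a second time
after_add = balances["alice"]
```

After the two `set_balance` calls the balance is 150; after the two `add_to_balance` calls it is 250 rather than the intended 200.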

The Anatomy of a Resilient Retry

To build a system that survives “flaky” networks, you must implement three things:

  1. Idempotency Keys: The client generates a unique ID (e.g., a UUID) once per logical operation and attaches the same ID to the original request and to every retry of it. The server stores this key together with the result. If it sees the same key again, it returns the stored response instead of executing the logic twice.
  2. Exponential Backoff: Don’t retry immediately. If a service is down, a “retry storm” from thousands of clients acts like a Distributed Denial of Service (DDoS) attack. Instead, double the wait after each failed attempt: 1s, then 2s, then 4s…
  3. Jitter: If all clients wait exactly 2s, clients that failed together will also retry together, causing “thundering herd” spikes. Add a random offset to each delay (e.g., wait between 1.5s and 2.5s instead of exactly 2s) to spread the load.
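
The three pieces above can be sketched together as follows. This is a toy illustration under stated assumptions, not a production design: `FlakyPaymentServer`, `backoff_delay`, and `charge_with_retries` are hypothetical names, the key store is an in-memory dict, and the jitter is a ±25% band chosen to match the 1.5s to 2.5s example.

```python
import random
import time
import uuid

class FlakyPaymentServer:
    """Toy server: charges once per idempotency key, but loses its first N responses."""

    def __init__(self, responses_to_drop: int) -> None:
        self.responses_to_drop = responses_to_drop
        self.results = {}       # idempotency key -> stored response
        self.charges_made = 0

    def charge(self, key: str, amount: int) -> dict:
        if key not in self.results:
            # First time this key is seen: run the real logic and store the result.
            self.charges_made += 1
            self.results[key] = {"status": "charged", "amount": amount}
        if self.responses_to_drop > 0:
            # Simulate scenario 3: the work succeeded but the reply was lost.
            self.responses_to_drop -= 1
            raise TimeoutError("response lost")
        return self.results[key]

def backoff_delay(attempt: int, base: float = 1.0, jitter: float = 0.25) -> float:
    """Delay before retry number `attempt` (1-based): base * 2**(attempt - 1), +/- jitter."""
    nominal = base * 2 ** (attempt - 1)
    return nominal * random.uniform(1 - jitter, 1 + jitter)

def charge_with_retries(server, amount: int, max_retries: int = 5, sleep=time.sleep) -> dict:
    key = str(uuid.uuid4())  # one key per logical operation, reused on every retry
    for attempt in range(max_retries + 1):
        try:
            return server.charge(key, amount)
        except TimeoutError:
            if attempt == max_retries:
                raise  # out of retries: surface the failure to the caller
            sleep(backoff_delay(attempt + 1))

# Usage: record the delays in a list instead of actually sleeping.
server = FlakyPaymentServer(responses_to_drop=2)
waits = []
result = charge_with_retries(server, 500, sleep=waits.append)
```

Here the client sends the request three times, but `server.charges_made` stays at 1 because the second and third attempts hit the stored result. A real server would additionally need to make the check-and-store step atomic, persist the keys, and expire them eventually; those concerns are left out of this sketch.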

Key Takeaways

  • Retries are necessary because the network is not reliable; assuming that it is reliable is the first of the “Fallacies of Distributed Computing.”
  • Idempotency makes retries safe by ensuring that “exactly-once” effects are achieved despite “at-least-once” delivery.
  • Backoff + Jitter protects the system from its own recovery mechanisms.

Exercises

  1. Why is a POST /orders request usually NOT idempotent by default, while a PUT /users/123 often is?
  2. If a server receives a request with an idempotency key it has already processed, what HTTP status code should it return?
  3. Calculate the wait time before the 4th retry using binary exponential backoff (base 2), assuming the first retry waits 1 unit of time and no jitter is applied.

Solutions

  1. POST typically creates a new resource each time, whereas PUT (by spec) replaces the resource at a specific location, making it naturally idempotent.
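  2. Usually the server replays the original response, status code included: for example a 200 OK or 201 Created returning the original result, or a 204 No Content if the original had no body. Some systems return a 409 Conflict if the parameters differ from the original request.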
  3. $2^{(4-1)} = 2^3 = 8$ units of time, assuming the first retry waits 1 unit (the sequence runs 1, 2, 4, 8).
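
The arithmetic in solution 3 can be checked with a short sketch, assuming a base delay of 1 time unit and no jitter (the helper name `nominal_delay` is illustrative):

```python
def nominal_delay(retry_number: int, base: float = 1.0) -> float:
    """Wait before the n-th retry (1-based) under binary exponential backoff."""
    return base * 2 ** (retry_number - 1)

# Delays before retries 1 through 4.
schedule = [nominal_delay(n) for n in range(1, 5)]
```

With these assumptions `schedule` is `[1.0, 2.0, 4.0, 8.0]`, so the 4th retry waits 8 units.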