Idempotency · Retry · Exponential Backoff · Circuit Breaker · Fault Tolerance · Resilience

Reliability

Build resilient systems — idempotency keys, retry with exponential backoff, and circuit breaker patterns. Ensure correctness despite retries, duplicates, and partial failures.

28 min read · 9 sections
01

The Big Picture — Why Failures Are Inevitable

In a distributed system, failures aren't exceptions — they're the norm. Networks drop packets, servers crash, databases time out, and third-party APIs go down. The question isn't "will something fail?" — it's "when it fails, does the system handle it correctly?"

💳

The Double-Charge Problem

You're buying concert tickets online. You click 'Pay $200'. The screen freezes — did it work? You click again. Without reliability patterns, you just bought two tickets and got charged $400: the payment server received both requests and processed both.

  • With idempotency: the second click sends the same idempotency key. The server recognizes the duplicate and returns the original result — one charge, one ticket.
  • With retry + backoff: if the first request timed out, the client waits 1 second and retries; if it fails again, it waits 2 seconds. It never hammers the server.
  • With circuit breaker: if the payment service is down, the system stops trying after 5 failures and shows 'Payment service temporarily unavailable' instead of timing out for 30 seconds.

Types of Failures

⏱️

Network Timeouts

The request was sent but no response came back. Did the server process it? You don't know. Retrying might cause a duplicate. Not retrying might lose the operation.

💔

Partial Failures

The payment was charged but the order wasn't created. Or the order was created but the inventory wasn't decremented. Half the operation succeeded, half didn't.

🌊

Cascading Failures

Service A calls Service B which calls Service C. Service C is slow. B's threads are blocked waiting for C. A's threads are blocked waiting for B. The entire system freezes.

🔥 Key Insight

Reliability patterns don't prevent failures — they ensure the system behaves correctly despite failures. Idempotency handles duplicates. Retries handle transient errors. Circuit breakers prevent cascading failures. Together, they make a system that's resilient, not just functional.

02

Reliability Patterns Overview

👤 Client (sends request) → 🔁 Retry Logic (handles transient failures) → 🖥️ Service (idempotency check) → 🔌 Downstream (circuit breaker)

🛡️ Fail Safe (Retry + Idempotency)

  • Client retries with exponential backoff on timeout
  • Server uses idempotency key to detect duplicates
  • Same request processed at most once
  • Result: safe retries, no duplicate side effects

🔌 Fail Fast (Circuit Breaker)

  • Monitor failure rate of downstream calls
  • If failures exceed threshold → stop calling (circuit opens)
  • Return fallback response immediately
  • Result: no cascading failures, fast error response

03

Idempotency Keys

An idempotency key is a unique identifier attached to a request that allows the server to recognize and deduplicate retries. If the same key is seen twice, the server returns the cached result from the first execution without reprocessing.

Idempotency Key — Internal Flow
First request:
  POST /api/payments
  Idempotency-Key: idem_a1b2c3d4
  Body: { "amount": 200, "to": "merchant_42" }

  Server flow:
    1. Check Redis: EXISTS idem_a1b2c3d4 → NO
    2. Acquire lock: SET idem_a1b2c3d4:lock EX 30 NX → OK
    3. Process payment → success, txn_id = "txn_789"
    4. Store result: SET idem_a1b2c3d4 '{"txn_id":"txn_789","status":"success"}' EX 86400
    5. Release lock
    6. Return: 201 Created { "txn_id": "txn_789" }

Retry (same key, network timeout on first attempt):
  POST /api/payments
  Idempotency-Key: idem_a1b2c3d4
  Body: { "amount": 200, "to": "merchant_42" }

  Server flow:
    1. Check Redis: EXISTS idem_a1b2c3d4 → YES
    2. Return cached result: 200 OK { "txn_id": "txn_789" }
    3. Payment NOT processed again

Conflict (same key, different body):
  POST /api/payments
  Idempotency-Key: idem_a1b2c3d4
  Body: { "amount": 500, "to": "merchant_99" }  ← different!

  Server flow:
    1. Check Redis: EXISTS idem_a1b2c3d4 → YES
    2. Compare body hash → MISMATCH
    3. Return: 422 Unprocessable Entity
       "Idempotency key already used with different parameters"

Implementation Details

Rules

  • Client generates the key (UUID v4 is standard)
  • Key is sent in a header: Idempotency-Key: uuid
  • Server stores key → result in Redis (TTL: 24-48h)
  • Same key + same body = return cached result
  • Same key + different body = return 422 Unprocessable Entity
  • Lock during processing to prevent race conditions
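
A minimal sketch of these rules — assuming a redis-py client; process_payment, the key names, and the handler signature are illustrative placeholders, not a production implementation:

Idempotency Check — Python Sketch
import hashlib
import json

import redis  # assumes the redis-py package

r = redis.Redis()
RESULT_TTL = 86400  # 24h, per the rules above


def process_payment(body: dict) -> dict:
    """Placeholder for the real charge logic (illustrative only)."""
    return {"txn_id": "txn_789", "status": "success"}


def handle_payment(idempotency_key: str, body: dict) -> tuple[int, dict]:
    """Process each idempotency key at most once; replay cached results."""
    body_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    cached = r.get(idempotency_key)
    if cached is not None:
        record = json.loads(cached)
        if record["body_hash"] != body_hash:
            # Same key, different parameters → reject (see conflict flow above)
            return 422, {"error": "Idempotency key already used with different parameters"}
        return 200, record["result"]  # replay the cached result, no reprocessing

    # Lock so two concurrent retries can't both process the payment
    if not r.set(f"{idempotency_key}:lock", "1", ex=30, nx=True):
        return 409, {"error": "A request with this key is already in progress"}
    try:
        result = process_payment(body)
        r.set(idempotency_key,
              json.dumps({"body_hash": body_hash, "result": result}),
              ex=RESULT_TTL)
        return 201, result
    finally:
        r.delete(f"{idempotency_key}:lock")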

Where It's Critical

  • Payment processing (Stripe, PayPal, Square)
  • Order creation (prevent duplicate orders)
  • Email/SMS sending (prevent duplicate notifications)
  • Any POST with irreversible side effects
  • Any operation in an at-least-once delivery system

🎯 Interview Insight

Idempotency keys are the #1 answer to "how do you prevent duplicate payments?" Walk through the flow: client generates UUID → sends with request → server checks Redis → process or return cached result. Mention Stripe as the real-world example — they support idempotency keys on every mutating API call.

04

Retry with Exponential Backoff

When a request fails due to a transient error (timeout, 503, connection refused), the client should retry — but not immediately. Exponential backoff increases the delay between retries, giving the failing system time to recover without being hammered by retries.

Exponential Backoff — Internal Mechanics
Base delay: 1 second
Max retries: 5
Jitter: random(0, delay * 0.5)

Attempt 1: request fails (timeout)
wait 1s + jitter(0, 0.5s) = ~1.3s

Attempt 2: request fails (503)
wait 2s + jitter(0, 1.0s) = ~2.7s

Attempt 3: request fails (timeout)
wait 4s + jitter(0, 2.0s) = ~5.1s

Attempt 4: request succeeds
return result

If all 5 attempts fail:
return error to caller
total wait: ~1 + 2 + 4 + 8 + 16 = ~31 seconds

Why jitter?
  Without jitter: 1000 clients all retry at exactly 1s, 2s, 4s
    → synchronized retry storms overwhelm the recovering server
  With jitter: retries are spread randomly across the window
    → the server recovers gradually

What to Retry vs What Not to Retry

Error Type                | Retry?                     | Why
408 Request Timeout       | ✅ Yes                     | Transient — server might be temporarily slow
429 Too Many Requests     | ✅ Yes (after Retry-After) | Rate limited — wait and try again
500 Internal Server Error | ✅ Yes (cautiously)        | Might be transient, but could be a bug
502 Bad Gateway           | ✅ Yes                     | Upstream server temporarily unavailable
503 Service Unavailable   | ✅ Yes                     | Server overloaded, likely recovers soon
400 Bad Request           | ❌ No                      | Client error — retrying won't fix invalid input
401 Unauthorized          | ❌ No                      | Auth failed — retrying with the same token is pointless
403 Forbidden             | ❌ No                      | Permission denied — won't change on retry
404 Not Found             | ❌ No                      | Resource doesn't exist — retrying won't create it
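
The table condenses into a small predicate — a sketch covering status codes only; a real client would also treat connection errors and socket timeouts as retryable:

Retry Decision — Python Sketch
# Transient server-side failures from the table above
RETRYABLE_STATUSES = {408, 429, 500, 502, 503}


def is_retryable(status_code: int) -> bool:
    """Retry transient failures; never retry client errors like 400/401/403/404."""
    return status_code in RETRYABLE_STATUSES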

Best Practices

  • Always add jitter to prevent synchronized retry storms
  • Set a max retry count (3-5 is typical)
  • Set a max total timeout (30-60 seconds)
  • Only retry idempotent operations (or use idempotency keys)
  • Log each retry with attempt number for debugging
  • Respect Retry-After headers from the server
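
Putting the table and these practices together — a minimal sketch built on the requests library and the is_retryable helper above; the constants and error handling are illustrative:

Exponential Backoff with Jitter — Python Sketch
import logging
import random
import time

import requests  # assumes the requests package

BASE_DELAY = 1.0   # seconds
MAX_RETRIES = 5


def post_with_backoff(url: str, json_body: dict, headers: dict) -> requests.Response:
    """POST with exponential backoff, jitter, a retry cap, and per-request timeouts."""
    for attempt in range(1, MAX_RETRIES + 1):
        retry_after = 0.0
        try:
            resp = requests.post(url, json=json_body, headers=headers, timeout=5)
            if not is_retryable(resp.status_code):
                return resp  # success, or a client error that retrying won't fix
            # Respect Retry-After if present (assumes the delta-seconds form)
            retry_after = float(resp.headers.get("Retry-After", 0))
        except requests.exceptions.RequestException:
            pass  # timeouts and connection errors are retryable
        if attempt == MAX_RETRIES:
            break
        delay = BASE_DELAY * 2 ** (attempt - 1)   # 1s, 2s, 4s, 8s, ...
        delay += random.uniform(0, delay * 0.5)   # jitter, as described above
        delay = max(delay, retry_after)
        logging.warning("attempt %d failed, retrying in %.1fs", attempt, delay)
        time.sleep(delay)
    raise RuntimeError(f"request failed after {MAX_RETRIES} attempts")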

Anti-Patterns

  • Retrying immediately (hammers the failing server)
  • Retrying forever (wastes resources, blocks the caller)
  • Retrying non-idempotent operations without idempotency keys
  • Retrying 4xx errors (client errors won't fix themselves)
  • No jitter (causes thundering herd on recovery)
  • Retrying without logging (impossible to debug)

🎯 Interview Insight

Always pair retries with idempotency. Say: "I'd use exponential backoff with jitter for transient failures (5xx, timeouts). Each retry includes the same idempotency key, so even if the original request actually succeeded (but the response was lost), the retry is safe — the server returns the cached result."

05

Circuit Breaker Pattern

A circuit breaker monitors calls to a downstream service. When failures exceed a threshold, it "opens" the circuit — immediately rejecting requests instead of waiting for timeouts. This prevents cascading failures and gives the failing service time to recover.

The Electrical Circuit Breaker

An electrical circuit breaker trips when current exceeds a safe level — cutting power instantly to prevent a fire. It doesn't keep pushing more electricity through a failing wire. A software circuit breaker does the same: when a downstream service is failing, it stops sending requests (trips the breaker) to prevent the failure from spreading. After a cooldown period, it cautiously tests if the service has recovered.

Three States

Circuit Breaker — State Machine
                      success
              ┌──────────────────────────────────────┐
              │                                      │
              ▼                                      │
          ┌────────┐      ┌───────────┐      ┌───────────┐
          │ CLOSED │─────→│   OPEN    │─────→│ HALF-OPEN │
          │(normal)│      │ (failing) │      │ (testing) │
          └────────┘      └───────────┘      └───────────┘
               │               │                   │
          failures >       timeout           test request
          threshold        expires            succeeds?
                                              ├─ yes → CLOSED
                                              └─ no  → OPEN

CLOSED (normal operation):
  All requests pass through to the downstream service.
  Failures are counted. If failures exceed threshold → OPEN.

OPEN (circuit tripped):
  All requests are immediately rejected (no downstream call).
  Return fallback response or error.
  After a timeout (e.g., 30 seconds) → HALF-OPEN.

HALF-OPEN (testing recovery):
  Allow ONE test request through to the downstream service.
  If it succeeds → CLOSED (service recovered).
  If it fails → OPEN (still broken, reset timeout).
Circuit Breaker — Configuration Example
Configuration:
  failure_threshold: 5        // Open after 5 consecutive failures
  timeout: 30 seconds         // Stay open for 30s before testing
  success_threshold: 3        // Close after 3 consecutive successes in half-open

Timeline:
  T=0:   CLOSED. Request to Payment Service → success
  T=1:   CLOSED. Request → success
  T=2:   CLOSED. Request → timeout (failure 1/5)
  T=3:   CLOSED. Request → timeout (failure 2/5)
  T=4:   CLOSED. Request → 503 (failure 3/5)
  T=5:   CLOSED. Request → timeout (failure 4/5)
  T=6:   CLOSED. Request → 503 (failure 5/5) → CIRCUIT OPENS

  T=7:   OPEN. Request → immediately rejected (no downstream call)
         Return: "Payment service temporarily unavailable"
  T=8-35: OPEN. All requests rejected instantly (~0ms response)

  T=36:  HALF-OPEN. Allow 1 test request → success!
  T=37:  HALF-OPEN. Allow 1 test request → success!
  T=38:  HALF-OPEN. Allow 1 test request → success! (3/3) → CIRCUIT CLOSES

  T=39:  CLOSED. Normal operation resumes.
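
A minimal, single-threaded sketch of this state machine — the defaults mirror the configuration above; a production implementation (Resilience4j, for example) adds thread safety, metrics, and fallback hooks:

Circuit Breaker — Python Sketch
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=30.0, success_threshold=3):
        self.failure_threshold = failure_threshold
        self.timeout = timeout                  # seconds to stay OPEN
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.timeout:
                raise RuntimeError("circuit open: failing fast")  # no downstream call
            self.state = "HALF-OPEN"  # timeout expired → test recovery
            self.successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"  # trip (or re-trip) the breaker
            self.opened_at = time.monotonic()
            self.failures = 0

    def _on_success(self):
        if self.state == "HALF-OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"  # service recovered
        self.failures = 0  # any success resets the consecutive-failure count

Usage is a one-line wrap around each downstream call, e.g. breaker.call(charge, 200, "merchant_42") — one breaker instance per downstream service.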

Benefits

  • Prevents cascading failures (failing service doesn't drag others down)
  • Fails fast (immediate rejection vs 30-second timeout)
  • Gives failing service time to recover (no retry storm)
  • Enables graceful degradation (fallback responses)
  • Reduces resource waste (no threads blocked on timeouts)

Considerations

  • Adds complexity (state machine, configuration, monitoring)
  • Needs careful tuning (threshold too low = false trips, too high = slow detection)
  • Fallback responses must be designed (what to show when circuit is open?)
  • Half-open state needs careful handling (don't flood with test requests)
  • Different downstream services need different configurations

🎯 Interview Insight

Circuit breakers are essential in microservices. Say: "I'd add a circuit breaker on every downstream service call. If the payment service fails 5 times in a row, the circuit opens and we return a fallback immediately instead of waiting for timeouts. After 30 seconds, we test with one request. If it succeeds, we resume normal operation." Mention Netflix Hystrix or Resilience4j as real-world implementations.

06

End-to-End Scenario

Let's design a reliable payment system using all three patterns together.

Reliable Payment System — All Patterns Combined
User clicks "Pay $200" for order #456

1. CLIENT — Retry with Exponential Backoff
   Client generates: Idempotency-Key: idem_xyz789
   Sends: POST /api/payments { order: 456, amount: 200 }
   If timeout → retry after 1s, 2s, 4s (same idempotency key)

2. API SERVER — Idempotency Check
   Check Redis: EXISTS idem_xyz789?
   → NO: acquire lock, proceed to step 3
   → YES: return cached result (no reprocessing)

3. API SERVER → PAYMENT SERVICE — Circuit Breaker
   Circuit breaker state: CLOSED (normal)
   Call payment service: charge($200, merchant_42)

   Scenario A: Payment service responds → success
   → Store result in Redis: idem_xyz789 → { txn: "txn_123", status: "success" }
   → Return 201 Created to client

   Scenario B: Payment service times out
   → Circuit breaker records failure (3/5)
   → API server returns 503 to client
   → Client retries with same idempotency key (backoff: 2s)
   → Second attempt: payment service responds → success
   → Idempotency key stored → return result

   Scenario C: Payment service is down (5 consecutive failures)
   → Circuit breaker OPENS
   → Next request: immediately rejected (no downstream call)
   → Return 503 "Payment service temporarily unavailable"
   → Client retries after Retry-After header
   → After 30s: circuit HALF-OPEN → test request → success → CLOSED

4. RESULT
   User is charged exactly once (idempotency)
   Transient failures are handled automatically (retry + backoff)
   Cascading failures are prevented (circuit breaker)
   User sees clear feedback at every step
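
The client side of this flow, sketched with the pieces from earlier sections (post_with_backoff is the helper from section 04; the URL and field names are illustrative):

Client Call — Python Sketch
import uuid


def pay(order_id: int, amount: int) -> dict:
    # One key per logical payment attempt — reused verbatim across all retries
    headers = {"Idempotency-Key": f"idem_{uuid.uuid4().hex}"}
    resp = post_with_backoff(
        "https://api.example.com/api/payments",  # illustrative endpoint
        json_body={"order": order_id, "amount": amount},
        headers=headers,
    )
    resp.raise_for_status()
    return resp.json()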

💡 This Is How Stripe Works

Stripe supports idempotency keys on all of its mutating API calls. Its client libraries implement exponential backoff with jitter, and its internal services use circuit breakers to prevent cascading failures. The three patterns work together as a complete reliability stack.

07

Trade-offs & Decision Making

Pattern          | Problem Solved                                  | Trade-off                                                  | When to Use
Idempotency Keys | Duplicate requests cause duplicate side effects | Storage overhead (Redis), key management, lock complexity  | Any POST with irreversible side effects (payments, orders, emails)
Retry + Backoff  | Transient failures (timeouts, 503s)             | Increased latency (wait between retries), resource usage   | Any call to an external or downstream service
Circuit Breaker  | Cascading failures, resource exhaustion         | Complexity (state machine, tuning), false positives        | Any microservice-to-microservice call, external API calls

🔁 Retry vs Fail Fast

  • Retry: when the failure is likely transient (timeout, 503)
  • Fail fast: when the failure is permanent (400, 401, 404)
  • Retry + circuit breaker: retry transient failures, but stop if the service is consistently down
  • The combination prevents both under-retrying and over-retrying

🎯 Minimum Viable Reliability

  • Always: idempotency keys on payment/order endpoints
  • Always: retry with backoff + jitter on downstream calls
  • Always: circuit breaker on external service calls
  • Always: timeouts on every network call (never wait forever)
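
The last rule is the easiest to forget. A small example with the requests library, whose default is to wait indefinitely:

Timeouts — Python Sketch
import requests

# Without an explicit timeout, a hung downstream holds this thread forever.
resp = requests.get("https://api.example.com/health",  # illustrative URL
                    timeout=(3, 5))  # 3s to establish the connection, 5s to read
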
08

Interview Questions

Q: What is idempotency and why is it important?

A: An operation is idempotent if performing it multiple times produces the same result as performing it once. It's critical because networks are unreliable — requests can time out, and clients retry. Without idempotency, a payment retry charges the customer twice. Implementation: the client sends a unique idempotency key (UUID) with each request. The server stores key → result in Redis. On retry (same key), the server returns the cached result without reprocessing. Stripe, PayPal, and every major payment API support idempotency keys.

Q: Why exponential backoff instead of immediate retry?

A: Immediate retry hammers a failing server with even more load, making recovery harder. If 1,000 clients all retry immediately after a timeout, the server gets 1,000 extra requests on top of its existing load — a retry storm. Exponential backoff (1s, 2s, 4s, 8s) spreads retries over time, giving the server breathing room. Jitter (random delay added to each retry) prevents synchronized retries — without it, all 1,000 clients retry at exactly 1s, 2s, 4s, creating periodic spikes.

Q: How does a circuit breaker work and when would you use it?

A: A circuit breaker has three states: CLOSED (normal — requests pass through), OPEN (tripped — requests immediately rejected), HALF-OPEN (testing — one request allowed through). When failures exceed a threshold (e.g., 5 consecutive), the circuit opens. After a timeout (e.g., 30s), it enters half-open and tests with one request. If it succeeds, the circuit closes. Use it on every call to a downstream service — especially external APIs, payment providers, and other microservices. It prevents cascading failures where one slow service brings down the entire system.

1

Your payment system occasionally charges customers twice

How do you fix this?

Answer: Add idempotency keys. (1) Client generates a UUID for each payment attempt and sends it as an Idempotency-Key header. (2) Server checks Redis: if the key exists, return the cached result (no reprocessing). If not, acquire a lock, process the payment, store key → result with 24h TTL, release lock. (3) On retry (same key), the server returns the original result. (4) If the key exists but the body is different, return 422 Unprocessable Entity. This guarantees the payment is processed at most once per key — effectively exactly once, no matter how many times the client retries.

2

Your microservice architecture has cascading timeouts — one slow service makes everything slow

How do you prevent this?

Answer: (1) Add timeouts on every downstream call (never wait forever — 5s max). (2) Add circuit breakers: if a service fails 5 times consecutively, stop calling it for 30 seconds. Return a fallback response (cached data, degraded response, or error). (3) Add bulkheads: isolate thread pools per downstream service so one slow service can't exhaust all threads. (4) Add retries with backoff only for transient failures (5xx), not for slow responses (which would add more load). The combination of timeouts + circuit breakers + bulkheads prevents any single service failure from cascading.
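
A minimal sketch of the bulkhead idea from point (3), using one bounded semaphore per downstream service; the service names and pool sizes are illustrative:

Bulkhead — Python Sketch
import threading

# Independent permit pools: a slow payment service can exhaust its 10 permits
# without starving calls to the inventory service.
BULKHEADS = {
    "payments": threading.BoundedSemaphore(10),
    "inventory": threading.BoundedSemaphore(10),
}


def call_with_bulkhead(service: str, fn, *args, **kwargs):
    sem = BULKHEADS[service]
    if not sem.acquire(timeout=0.1):  # fail fast instead of queueing forever
        raise RuntimeError(f"{service} bulkhead full: rejecting call")
    try:
        return fn(*args, **kwargs)
    finally:
        sem.release()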

09

Common Pitfalls

🔁

Retrying non-idempotent operations

A POST /api/orders endpoint creates an order. The client retries on timeout. Without idempotency, two orders are created. The customer gets charged twice and receives two shipments. This happens thousands of times per day at scale.

Never retry a mutating operation without an idempotency key. Either make the operation naturally idempotent (PUT with full replacement) or add an Idempotency-Key header that the server uses to deduplicate. Store key → result in Redis with a 24-48h TTL.

♾️

Infinite retries

The retry logic has no max attempts or total timeout. A permanently failing request retries forever — consuming threads, connections, and memory. Multiply by thousands of concurrent requests and the client service runs out of resources.

Always set: max retries (3-5), max total timeout (30-60s), and a circuit breaker that stops retries when the downstream is consistently failing. After max retries, return an error to the caller — don't keep trying.

⚙️

Misconfigured circuit breakers

Threshold too low (2 failures): the circuit trips on normal transient errors, blocking legitimate traffic. Threshold too high (100 failures): the circuit never trips, and cascading failures happen anyway. Timeout too short (5s): the circuit keeps flapping between open and closed.

Start with: failure threshold = 5 consecutive failures, timeout = 30 seconds, success threshold = 3 in half-open. Monitor and tune based on actual failure patterns. Different services need different configurations — a flaky external API needs a lower threshold than a reliable internal service.

🙈

Ignoring failure scenarios in design

The system is designed for the happy path only. No idempotency on payments, no retries on API calls, no circuit breakers on downstream services. Everything works in development. In production, the first network hiccup causes duplicate charges, the first service outage causes cascading failures, and the first timeout causes data inconsistency.

Design for failure from day one. Every downstream call needs: a timeout (never wait forever), retry with backoff (for transient failures), idempotency (for safe retries), and a circuit breaker (for cascading failure prevention). These aren't optimizations — they're requirements for production systems.