Idempotency · Retry · Exponential Backoff · Circuit Breaker · Fault Tolerance · Resilience

Reliability

Build resilient systems — idempotency keys, retry with exponential backoff, and circuit breaker patterns. Ensure correctness despite retries, duplicates, and partial failures.

28 min read · 9 sections
01

The Big Picture — Why Failures Are Inevitable

In a distributed system, failures aren't exceptions — they're the norm. Networks drop packets, servers crash, databases time out, and third-party APIs go down. The question isn't "will something fail?" — it's "when it fails, does the system handle it correctly?"

💳

The Double-Charge Problem

You're buying concert tickets online. You click 'Pay $200'. The screen freezes — did it work? You click again. Without reliability patterns, you just bought two tickets and got charged $400: the payment server received both requests and processed both.

  • With idempotency: the second click sends the same idempotency key. The server recognizes the duplicate and returns the original result — one charge, one ticket.
  • With retry + backoff: if the first request timed out, the client waits 1 second and retries; if it fails again, it waits 2 seconds. It never hammers the server.
  • With circuit breaker: if the payment service is down, the system stops trying after 5 failures and shows 'Payment service temporarily unavailable' instead of timing out for 30 seconds.

Types of Failures

⏱️

Network Timeouts

The request was sent but no response came back. Did the server process it? You don't know. Retrying might cause a duplicate. Not retrying might lose the operation.

💔

Partial Failures

The payment was charged but the order wasn't created. Or the order was created but the inventory wasn't decremented. Half the operation succeeded, half didn't.

🌊

Cascading Failures

Service A calls Service B which calls Service C. Service C is slow. B's threads are blocked waiting for C. A's threads are blocked waiting for B. The entire system freezes.

🔥 Key Insight

Reliability patterns don't prevent failures — they ensure the system behaves correctly despite failures. Idempotency handles duplicates. Retries handle transient errors. Circuit breakers prevent cascading failures. Together, they make a system that's resilient, not just functional.

02

Reliability Patterns Overview

👤 Client (sends request) → 🔁 Retry Logic (handles transient failures) → 🖥️ Service (idempotency check) → 🔌 Downstream (circuit breaker)

🛡️ Fail Safe (Retry + Idempotency)

  • Client retries with exponential backoff on timeout
  • Server uses idempotency key to detect duplicates
  • Same request processed at most once
  • Result: safe retries, no duplicate side effects

🔌 Fail Fast (Circuit Breaker)

  • Monitor failure rate of downstream calls
  • If failures exceed threshold → stop calling (circuit opens)
  • Return fallback response immediately
  • Result: no cascading failures, fast error response

03

Idempotency Keys

An idempotency key is a unique identifier attached to a request that allows the server to recognize and deduplicate retries. If the same key is seen twice, the server returns the cached result from the first execution without reprocessing.

Idempotency Key — Internal Flow
First request:
  POST /api/payments
  Idempotency-Key: idem_a1b2c3d4
  Body: { "amount": 200, "to": "merchant_42" }

  Server flow:
    1. Check Redis: EXISTS idem_a1b2c3d4 → NO
    2. Acquire lock: SET idem_a1b2c3d4:lock EX 30 NX → OK
    3. Process payment → success, txn_id = "txn_789"
    4. Store result: SET idem_a1b2c3d4 '{"txn_id":"txn_789","status":"success"}' EX 86400
    5. Release lock
    6. Return: 201 Created { "txn_id": "txn_789" }

Retry (same key, network timeout on first attempt):
  POST /api/payments
  Idempotency-Key: idem_a1b2c3d4
  Body: { "amount": 200, "to": "merchant_42" }

  Server flow:
    1. Check Redis: EXISTS idem_a1b2c3d4 → YES
    2. Return cached result: 200 OK { "txn_id": "txn_789" }
    3. Payment NOT processed again

Conflict (same key, different body):
  POST /api/payments
  Idempotency-Key: idem_a1b2c3d4
  Body: { "amount": 500, "to": "merchant_99" }  ← different!

  Server flow:
    1. Check Redis: EXISTS idem_a1b2c3d4 → YES
    2. Compare body hash → MISMATCH
    3. Return: 422 Unprocessable Entity
       "Idempotency key already used with different parameters"

Implementation Details

Rules

  • Client generates the key (UUID v4 is standard)
  • Key is sent in a header: Idempotency-Key: uuid
  • Server stores key → result in Redis (TTL: 24-48h)
  • Same key + same body = return cached result
  • Same key + different body = return 422 Unprocessable Entity
  • Lock during processing to prevent race conditions
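
A minimal sketch of these rules — assuming a redis-py client; process_payment, the key names, and the handler signature are illustrative placeholders, not a production implementation:

Idempotency Check — Python Sketch
import hashlib
import json

import redis  # assumes the redis-py package

r = redis.Redis()
RESULT_TTL = 86400  # 24h, per the rules above


def process_payment(body: dict) -> dict:
    """Placeholder for the real charge logic (illustrative only)."""
    return {"txn_id": "txn_789", "status": "success"}


def handle_payment(idempotency_key: str, body: dict) -> tuple[int, dict]:
    """Process each idempotency key at most once; replay cached results."""
    body_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    cached = r.get(idempotency_key)
    if cached is not None:
        record = json.loads(cached)
        if record["body_hash"] != body_hash:
            # Same key, different parameters → reject (see conflict flow above)
            return 422, {"error": "Idempotency key already used with different parameters"}
        return 200, record["result"]  # replay the cached result, no reprocessing

    # Lock so two concurrent retries can't both process the payment
    if not r.set(f"{idempotency_key}:lock", "1", ex=30, nx=True):
        return 409, {"error": "A request with this key is already in progress"}
    try:
        result = process_payment(body)
        r.set(idempotency_key,
              json.dumps({"body_hash": body_hash, "result": result}),
              ex=RESULT_TTL)
        return 201, result
    finally:
        r.delete(f"{idempotency_key}:lock")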

Where It's Critical

  • Payment processing (Stripe, PayPal, Square)
  • Order creation (prevent duplicate orders)
  • Email/SMS sending (prevent duplicate notifications)
  • Any POST with irreversible side effects
  • Any operation in an at-least-once delivery system

🎯 Interview Insight

Idempotency keys are the #1 answer to "how do you prevent duplicate payments?" Walk through the flow: client generates UUID → sends with request → server checks Redis → process or return cached result. Mention Stripe as the real-world example — they support idempotency keys on every mutating API call.

04

Retry with Exponential Backoff

When a request fails due to a transient error (timeout, 503, connection refused), the client should retry — but not immediately. Exponential backoff increases the delay between retries, giving the failing system time to recover without being hammered by retries.

Exponential Backoff — Internal Mechanics
Base delay: 1 second
Max retries: 5
Jitter: random(0, delay * 0.5)

Attempt 1: request fails (timeout)
wait 1s + jitter(0, 0.5s) = ~1.3s

Attempt 2: request fails (503)
wait 2s + jitter(0, 1.0s) = ~2.7s

Attempt 3: request fails (timeout)
wait 4s + jitter(0, 2.0s) = ~5.1s

Attempt 4: request succeeds
return result

If all 5 attempts fail:
return error to caller
total wait: ~1 + 2 + 4 + 8 + 16 = ~31 seconds

Why jitter?
  Without jitter: 1000 clients all retry at exactly 1s, 2s, 4s
    → synchronized retry storms overwhelm the recovering server
  With jitter: retries are spread randomly across the window
    → the server recovers gradually

What to Retry vs What Not to Retry

Error Type                | Retry?                     | Why
408 Request Timeout       | ✅ Yes                     | Transient — server might be temporarily slow
429 Too Many Requests     | ✅ Yes (after Retry-After) | Rate limited — wait and try again
500 Internal Server Error | ✅ Yes (cautiously)        | Might be transient, but could be a bug
502 Bad Gateway           | ✅ Yes                     | Upstream server temporarily unavailable
503 Service Unavailable   | ✅ Yes                     | Server overloaded, likely recovers soon
400 Bad Request           | ❌ No                      | Client error — retrying won't fix invalid input
401 Unauthorized          | ❌ No                      | Auth failed — retrying with the same token is pointless
403 Forbidden             | ❌ No                      | Permission denied — won't change on retry
404 Not Found             | ❌ No                      | Resource doesn't exist — retrying won't create it
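
The table condenses into a small predicate — a sketch covering status codes only; a real client would also treat connection errors and socket timeouts as retryable:

Retry Decision — Python Sketch
# Transient server-side failures from the table above
RETRYABLE_STATUSES = {408, 429, 500, 502, 503}


def is_retryable(status_code: int) -> bool:
    """Retry transient failures; never retry client errors like 400/401/403/404."""
    return status_code in RETRYABLE_STATUSES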

Best Practices

  • Always add jitter to prevent synchronized retry storms
  • Set a max retry count (3-5 is typical)
  • Set a max total timeout (30-60 seconds)
  • Only retry idempotent operations (or use idempotency keys)
  • Log each retry with attempt number for debugging
  • Respect Retry-After headers from the server
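
Putting the table and these practices together — a minimal sketch built on the requests library and the is_retryable helper above; the constants and error handling are illustrative:

Exponential Backoff with Jitter — Python Sketch
import logging
import random
import time

import requests  # assumes the requests package

BASE_DELAY = 1.0   # seconds
MAX_RETRIES = 5


def post_with_backoff(url: str, json_body: dict, headers: dict) -> requests.Response:
    """POST with exponential backoff, jitter, a retry cap, and per-request timeouts."""
    for attempt in range(1, MAX_RETRIES + 1):
        retry_after = 0.0
        try:
            resp = requests.post(url, json=json_body, headers=headers, timeout=5)
            if not is_retryable(resp.status_code):
                return resp  # success, or a client error that retrying won't fix
            # Respect Retry-After if present (assumes the delta-seconds form)
            retry_after = float(resp.headers.get("Retry-After", 0))
        except requests.exceptions.RequestException:
            pass  # timeouts and connection errors are retryable
        if attempt == MAX_RETRIES:
            break
        delay = BASE_DELAY * 2 ** (attempt - 1)   # 1s, 2s, 4s, 8s, ...
        delay += random.uniform(0, delay * 0.5)   # jitter, as described above
        delay = max(delay, retry_after)
        logging.warning("attempt %d failed, retrying in %.1fs", attempt, delay)
        time.sleep(delay)
    raise RuntimeError(f"request failed after {MAX_RETRIES} attempts")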

Anti-Patterns

  • Retrying immediately (hammers the failing server)
  • Retrying forever (wastes resources, blocks the caller)
  • Retrying non-idempotent operations without idempotency keys
  • Retrying 4xx errors (client errors won't fix themselves)
  • No jitter (causes thundering herd on recovery)
  • Retrying without logging (impossible to debug)

🎯 Interview Insight

Always pair retries with idempotency. Say: "I'd use exponential backoff with jitter for transient failures (5xx, timeouts). Each retry includes the same idempotency key, so even if the original request actually succeeded (but the response was lost), the retry is safe — the server returns the cached result."

05

Circuit Breaker Pattern

A circuit breaker monitors calls to a downstream service. When failures exceed a threshold, it "opens" the circuit — immediately rejecting requests instead of waiting for timeouts. This prevents cascading failures and gives the failing service time to recover.

The Electrical Circuit Breaker

An electrical circuit breaker trips when current exceeds a safe level — cutting power instantly to prevent a fire. It doesn't keep pushing more electricity through a failing wire. A software circuit breaker does the same: when a downstream service is failing, it stops sending requests (trips the breaker) to prevent the failure from spreading. After a cooldown period, it cautiously tests if the service has recovered.

Three States

Circuit Breaker — State Machine
                      success
              ┌──────────────────────────────────────┐
              │                                      │
              ▼                                      │
          ┌────────┐      ┌───────────┐      ┌───────────┐
          │ CLOSED │─────→│   OPEN    │─────→│ HALF-OPEN │
          │(normal)│      │ (failing) │      │ (testing) │
          └────────┘      └───────────┘      └───────────┘
               │               │                   │
          failures >       timeout           test request
          threshold        expires            succeeds?
                                              ├─ yes → CLOSED
                                              └─ no  → OPEN

CLOSED (normal operation):
  All requests pass through to the downstream service.
  Failures are counted. If failures exceed threshold → OPEN.

OPEN (circuit tripped):
  All requests are immediately rejected (no downstream call).
  Return fallback response or error.
  After a timeout (e.g., 30 seconds) → HALF-OPEN.

HALF-OPEN (testing recovery):
  Allow ONE test request through to the downstream service.
  If it succeeds → CLOSED (service recovered).
  If it fails → OPEN (still broken, reset timeout).
Circuit Breaker — Configuration Example
Configuration:
  failure_threshold: 5        // Open after 5 consecutive failures
  timeout: 30 seconds         // Stay open for 30s before testing
  success_threshold: 3        // Close after 3 consecutive successes in half-open

Timeline:
  T=0:   CLOSED. Request to Payment Service → success
  T=1:   CLOSED. Request → success
  T=2:   CLOSED. Request → timeout (failure 1/5)
  T=3:   CLOSED. Request → timeout (failure 2/5)
  T=4:   CLOSED. Request → 503 (failure 3/5)
  T=5:   CLOSED. Request → timeout (failure 4/5)
  T=6:   CLOSED. Request → 503 (failure 5/5) → CIRCUIT OPENS

  T=7:   OPEN. Request → immediately rejected (no downstream call)
         Return: "Payment service temporarily unavailable"
  T=8-35: OPEN. All requests rejected instantly (~0ms response)

  T=36:  HALF-OPEN. Allow 1 test request → success!
  T=37:  HALF-OPEN. Allow 1 test request → success!
  T=38:  HALF-OPEN. Allow 1 test request → success! (3/3) → CIRCUIT CLOSES

  T=39:  CLOSED. Normal operation resumes.
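
A minimal, single-threaded sketch of this state machine — the defaults mirror the configuration above; a production implementation (Resilience4j, for example) adds thread safety, metrics, and fallback hooks:

Circuit Breaker — Python Sketch
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=30.0, success_threshold=3):
        self.failure_threshold = failure_threshold
        self.timeout = timeout                  # seconds to stay OPEN
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.timeout:
                raise RuntimeError("circuit open: failing fast")  # no downstream call
            self.state = "HALF-OPEN"  # timeout expired → test recovery
            self.successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"  # trip (or re-trip) the breaker
            self.opened_at = time.monotonic()
            self.failures = 0

    def _on_success(self):
        if self.state == "HALF-OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"  # service recovered
        self.failures = 0  # any success resets the consecutive-failure count

Usage is a one-line wrap around each downstream call, e.g. breaker.call(charge, 200, "merchant_42") — one breaker instance per downstream service.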

Benefits

  • Prevents cascading failures (failing service doesn't drag others down)
  • Fails fast (immediate rejection vs 30-second timeout)
  • Gives failing service time to recover (no retry storm)
  • Enables graceful degradation (fallback responses)
  • Reduces resource waste (no threads blocked on timeouts)

Considerations

  • Adds complexity (state machine, configuration, monitoring)
  • Needs careful tuning (threshold too low = false trips, too high = slow detection)
  • Fallback responses must be designed (what to show when circuit is open?)
  • Half-open state needs careful handling (don't flood with test requests)
  • Different downstream services need different configurations

🎯 Interview Insight

Circuit breakers are essential in microservices. Say: "I'd add a circuit breaker on every downstream service call. If the payment service fails 5 times in a row, the circuit opens and we return a fallback immediately instead of waiting for timeouts. After 30 seconds, we test with one request. If it succeeds, we resume normal operation." Mention Netflix Hystrix or Resilience4j as real-world implementations.

06

End-to-End Scenario

Let's design a reliable payment system using all three patterns together.

Reliable Payment System — All Patterns Combined
User clicks "Pay $200" for order #456

1. CLIENT — Retry with Exponential Backoff
   Client generates: Idempotency-Key: idem_xyz789
   Sends: POST /api/payments { order: 456, amount: 200 }
   If timeout → retry after 1s, 2s, 4s (same idempotency key)

2. API SERVER — Idempotency Check
   Check Redis: EXISTS idem_xyz789?
   → NO: acquire lock, proceed to step 3
   → YES: return cached result (no reprocessing)

3. API SERVER → PAYMENT SERVICE — Circuit Breaker
   Circuit breaker state: CLOSED (normal)
   Call payment service: charge($200, merchant_42)

   Scenario A: Payment service responds → success
   → Store result in Redis: idem_xyz789 → { txn: "txn_123", status: "success" }
   → Return 201 Created to client

   Scenario B: Payment service times out
   → Circuit breaker records failure (3/5)
   → API server returns 503 to client
   → Client retries with same idempotency key (backoff: 2s)
   → Second attempt: payment service responds → success
   → Idempotency key stored → return result

   Scenario C: Payment service is down (5 consecutive failures)
   → Circuit breaker OPENS
   → Next request: immediately rejected (no downstream call)
   → Return 503 "Payment service temporarily unavailable"
   → Client retries after Retry-After header
   → After 30s: circuit HALF-OPEN → test request → success → CLOSED

4. RESULT
   User is charged exactly once (idempotency)
   Transient failures are handled automatically (retry + backoff)
   Cascading failures are prevented (circuit breaker)
   User sees clear feedback at every step
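
The client side of this flow, sketched with the pieces from earlier sections (post_with_backoff is the helper from section 04; the URL and field names are illustrative):

Client Call — Python Sketch
import uuid


def pay(order_id: int, amount: int) -> dict:
    # One key per logical payment attempt — reused verbatim across all retries
    headers = {"Idempotency-Key": f"idem_{uuid.uuid4().hex}"}
    resp = post_with_backoff(
        "https://api.example.com/api/payments",  # illustrative endpoint
        json_body={"order": order_id, "amount": amount},
        headers=headers,
    )
    resp.raise_for_status()
    return resp.json()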

💡 This Is How Stripe Works

Stripe supports idempotency keys on all of its mutating API calls. Its client libraries implement exponential backoff with jitter, and its internal services use circuit breakers to prevent cascading failures. The three patterns work together as a complete reliability stack.

07

Trade-offs & Decision Making

Pattern          | Problem Solved                                  | Trade-off                                                  | When to Use
Idempotency Keys | Duplicate requests cause duplicate side effects | Storage overhead (Redis), key management, lock complexity  | Any POST with irreversible side effects (payments, orders, emails)
Retry + Backoff  | Transient failures (timeouts, 503s)             | Increased latency (wait between retries), resource usage   | Any call to an external or downstream service
Circuit Breaker  | Cascading failures, resource exhaustion         | Complexity (state machine, tuning), false positives        | Any microservice-to-microservice call, external API calls

🔁 Retry vs Fail Fast

  • Retry: when the failure is likely transient (timeout, 503)
  • Fail fast: when the failure is permanent (400, 401, 404)
  • Retry + circuit breaker: retry transient failures, but stop if the service is consistently down
  • The combination prevents both under-retrying and over-retrying

🎯 Minimum Viable Reliability

  • Always: idempotency keys on payment/order endpoints
  • Always: retry with backoff + jitter on downstream calls
  • Always: circuit breaker on external service calls
  • Always: timeouts on every network call (never wait forever)
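
The last rule is the easiest to forget. A small example with the requests library, whose default is to wait indefinitely:

Timeouts — Python Sketch
import requests

# Without an explicit timeout, a hung downstream holds this thread forever.
resp = requests.get("https://api.example.com/health",  # illustrative URL
                    timeout=(3, 5))  # 3s to establish the connection, 5s to read
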
08

Interview Questions

Q: What is idempotency and why is it important?

A: An operation is idempotent if performing it multiple times produces the same result as performing it once. It's critical because networks are unreliable — requests can time out, and clients retry. Without idempotency, a payment retry charges the customer twice. Implementation: the client sends a unique idempotency key (UUID) with each request. The server stores key → result in Redis. On retry (same key), the server returns the cached result without reprocessing. Stripe, PayPal, and every major payment API support idempotency keys.

Q: Why exponential backoff instead of immediate retry?

A: Immediate retry hammers a failing server with even more load, making recovery harder. If 1,000 clients all retry immediately after a timeout, the server gets 1,000 extra requests on top of its existing load — a retry storm. Exponential backoff (1s, 2s, 4s, 8s) spreads retries over time, giving the server breathing room. Jitter (random delay added to each retry) prevents synchronized retries — without it, all 1,000 clients retry at exactly 1s, 2s, 4s, creating periodic spikes.

Q: How does a circuit breaker work and when would you use it?

A: A circuit breaker has three states: CLOSED (normal — requests pass through), OPEN (tripped — requests immediately rejected), HALF-OPEN (testing — one request allowed through). When failures exceed a threshold (e.g., 5 consecutive), the circuit opens. After a timeout (e.g., 30s), it enters half-open and tests with one request. If it succeeds, the circuit closes. Use it on every call to a downstream service — especially external APIs, payment providers, and other microservices. It prevents cascading failures where one slow service brings down the entire system.

1

Your payment system occasionally charges customers twice

How do you fix this?

Answer: Add idempotency keys. (1) Client generates a UUID for each payment attempt and sends it as an Idempotency-Key header. (2) Server checks Redis: if the key exists, return the cached result (no reprocessing). If not, acquire a lock, process the payment, store key → result with 24h TTL, release lock. (3) On retry (same key), the server returns the original result. (4) If the key exists but the body is different, return 422 Unprocessable Entity. This guarantees the payment is processed at most once per key — effectively exactly once, no matter how many times the client retries.

2

Your microservice architecture has cascading timeouts — one slow service makes everything slow

How do you prevent this?

Answer: (1) Add timeouts on every downstream call (never wait forever — 5s max). (2) Add circuit breakers: if a service fails 5 times consecutively, stop calling it for 30 seconds. Return a fallback response (cached data, degraded response, or error). (3) Add bulkheads: isolate thread pools per downstream service so one slow service can't exhaust all threads. (4) Add retries with backoff only for transient failures (5xx), not for slow responses (which would add more load). The combination of timeouts + circuit breakers + bulkheads prevents any single service failure from cascading.
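
A minimal sketch of the bulkhead idea from point (3), using one bounded semaphore per downstream service; the service names and pool sizes are illustrative:

Bulkhead — Python Sketch
import threading

# Independent permit pools: a slow payment service can exhaust its 10 permits
# without starving calls to the inventory service.
BULKHEADS = {
    "payments": threading.BoundedSemaphore(10),
    "inventory": threading.BoundedSemaphore(10),
}


def call_with_bulkhead(service: str, fn, *args, **kwargs):
    sem = BULKHEADS[service]
    if not sem.acquire(timeout=0.1):  # fail fast instead of queueing forever
        raise RuntimeError(f"{service} bulkhead full: rejecting call")
    try:
        return fn(*args, **kwargs)
    finally:
        sem.release()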

09

Common Pitfalls

🔁

Retrying non-idempotent operations

A POST /api/orders endpoint creates an order. The client retries on timeout. Without idempotency, two orders are created. The customer gets charged twice and receives two shipments. This happens thousands of times per day at scale.

Never retry a mutating operation without an idempotency key. Either make the operation naturally idempotent (PUT with full replacement) or add an Idempotency-Key header that the server uses to deduplicate. Store key → result in Redis with a 24-48h TTL.

♾️

Infinite retries

The retry logic has no max attempts or total timeout. A permanently failing request retries forever — consuming threads, connections, and memory. Multiply by thousands of concurrent requests and the client service runs out of resources.

Always set: max retries (3-5), max total timeout (30-60s), and a circuit breaker that stops retries when the downstream is consistently failing. After max retries, return an error to the caller — don't keep trying.

⚙️

Misconfigured circuit breakers

Threshold too low (2 failures): the circuit trips on normal transient errors, blocking legitimate traffic. Threshold too high (100 failures): the circuit never trips, and cascading failures happen anyway. Timeout too short (5s): the circuit keeps flapping between open and closed.

Start with: failure threshold = 5 consecutive failures, timeout = 30 seconds, success threshold = 3 in half-open. Monitor and tune based on actual failure patterns. Different services need different configurations — a flaky external API needs a lower threshold than a reliable internal service.

🙈

Ignoring failure scenarios in design

The system is designed for the happy path only. No idempotency on payments, no retries on API calls, no circuit breakers on downstream services. Everything works in development. In production, the first network hiccup causes duplicate charges, the first service outage causes cascading failures, and the first timeout causes data inconsistency.

Design for failure from day one. Every downstream call needs: a timeout (never wait forever), retry with backoff (for transient failures), idempotency (for safe retries), and a circuit breaker (for cascading failure prevention). These aren't optimizations — they're requirements for production systems.