Rate Limiting & Throttling
Protecting your system from overload, abuse, and unfair usage. Algorithms, distributed limiting with Redis, and circuit breakers.
Why Rate Limiting at the Gateway
Rate limiting at the gateway protects your entire system from a single enforcement point. Without it, a single misbehaving client can overwhelm your backends, degrade service for everyone, and potentially cause cascading failures across your microservices.
Why Rate Limit at the Gateway
- ✅ Fair usage — prevent one client from consuming all capacity
- ✅ Backend protection — reject excess traffic before it reaches services
- ✅ Cost control — limit expensive operations (AI inference, third-party API calls)
- ✅ DDoS first line — absorb volumetric attacks at the edge
- ✅ Monetization — enforce plan-based limits (free: 100 req/min, pro: 10,000 req/min)
The Water Treatment Plant
Rate limiting is like a water treatment plant's intake valve. No matter how much water flows from the river (incoming requests), the valve controls how much enters the system. Without it, a flood (traffic spike) overwhelms the treatment capacity (backend services) and nothing works properly. The valve protects the entire downstream system.
Rate Limiting is Not Optional
Every public API needs rate limiting. Even internal APIs benefit from it. Without rate limiting: a bug in one client can DDoS your system, a retry storm after a partial outage can prevent recovery, and you have no way to prioritize traffic during degraded conditions. It's a safety mechanism, not a feature.
Rate Limiting Dimensions
Rate limits can be applied along multiple dimensions simultaneously. The key is choosing the right identifier — what uniquely identifies the entity you want to limit.
| Dimension | Identifier | Use Case |
|---|---|---|
| Per IP | Client IP address | Anonymous traffic, DDoS protection |
| Per API Key | API key / client ID | Third-party integrations, plan enforcement |
| Per User | Authenticated user ID | Fair usage across user sessions |
| Per Endpoint | HTTP method + path | Protect expensive endpoints differently |
| Per Tenant | Tenant/organization ID | Multi-tenant isolation |
| Combined | User + Endpoint | User can call /search 10/min but /orders 100/min |
```yaml
# Kong-style declarative config: a global per-consumer limit
# plus a stricter per-route override
plugins:
  # Global rate limit — per consumer
  - name: rate-limiting
    config:
      minute: 100
      hour: 5000
      policy: redis
      redis_host: redis-cluster
      redis_port: 6379
      limit_by: consumer           # per API key/consumer
      header_name: X-Consumer-ID

routes:
  # Per-route override — expensive endpoint
  - name: search-route
    paths:
      - /api/v1/search
    plugins:
      - name: rate-limiting
        config:
          minute: 10
          policy: redis
          redis_host: redis-cluster
          limit_by: consumer
```
Layered Rate Limits
Apply limits at multiple layers: a global limit (10,000 req/min per key), an endpoint limit (100 req/min for /search), and a burst limit (50 req/sec instantaneous). The most restrictive limit applies. This prevents both sustained overload and sudden bursts.
Fixed Window & Sliding Window
Fixed Window Counter
The simplest algorithm. Divide time into fixed windows (e.g., 1-minute intervals). Count requests in the current window. Reject when count exceeds the limit. Reset counter at window boundary.
```bash
# Fixed window: 100 requests per minute
# Key: rate_limit:{client_id}:{window_timestamp}

# Window = current minute (floor to minute boundary)
WINDOW=$(date +%Y%m%d%H%M)
KEY="rate_limit:client_123:${WINDOW}"

# Atomic increment + check
CURRENT=$(redis-cli INCR "$KEY")
redis-cli EXPIRE "$KEY" 60   # Auto-cleanup

if [ "$CURRENT" -gt 100 ]; then
  echo "429 Too Many Requests"
fi
```
The Boundary Spike Problem
Fixed windows have a flaw: a client can send 100 requests at 12:00:59 and 100 more at 12:01:00 — 200 requests in 2 seconds while respecting the "100 per minute" limit. The window boundary creates a spike opportunity. Sliding window algorithms solve this.
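The spike is easy to reproduce with a minimal in-memory fixed-window counter. This is an illustrative Python sketch (class and method names are not from any library), with the clock passed in explicitly:

```python
# Hypothetical in-memory fixed-window counter (illustrative names).
class FixedWindowLimiter:
    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window_secs = window_secs
        self.window = None   # index of the current window
        self.count = 0

    def allow(self, now):
        window = int(now // self.window_secs)
        if window != self.window:   # crossed a boundary: reset the counter
            self.window, self.count = window, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

limiter = FixedWindowLimiter(limit=100, window_secs=60)
# 100 requests just before the boundary, 100 just after:
late = sum(limiter.allow(59.9) for _ in range(100))
early = sum(limiter.allow(60.0) for _ in range(100))
# all 200 are admitted within 0.1 seconds, despite the "100/min" limit
```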
Sliding Window Log
Store the timestamp of every request. To check the limit, count timestamps within the last N seconds. Precise but memory-intensive — storing every timestamp is expensive at high request rates.
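A minimal in-memory sketch of the log approach (Python, illustrative names; a distributed deployment would keep the log in a Redis sorted set, as shown later):

```python
from collections import deque

# Sliding-window log: one timestamp per accepted request;
# expired entries are pruned on every check.
class SlidingWindowLog:
    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window_secs = window_secs
        self.log = deque()   # timestamps of accepted requests, oldest first

    def allow(self, now):
        # drop timestamps that fell out of the window
        while self.log and self.log[0] <= now - self.window_secs:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

The memory cost is visible in the `log` deque: it holds up to `limit` entries per client, which is what makes this algorithm expensive at high request rates.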
Sliding Window Counter
A hybrid approach: use fixed window counters but weight the previous window's count by the overlap percentage. If we're 30% into the current window, the effective count is: (previous_window × 0.7) + current_window. This approximates a true sliding window with minimal memory.
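The weighting formula above can be sketched as follows (Python, illustrative names; the estimate is approximate by design):

```python
# Sliding-window counter: two fixed-window counters, with the previous
# window weighted by its overlap with the sliding window.
class SlidingWindowCounter:
    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window_secs = window_secs
        self.curr_window = 0
        self.curr_count = 0
        self.prev_count = 0

    def allow(self, now):
        window = int(now // self.window_secs)
        if window != self.curr_window:
            # rotate counters; if whole windows were skipped, previous count is 0
            self.prev_count = self.curr_count if window == self.curr_window + 1 else 0
            self.curr_window, self.curr_count = window, 0
        # 30% into the window => previous window weighted by 0.7
        overlap = 1 - (now % self.window_secs) / self.window_secs
        estimated = self.prev_count * overlap + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```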
| Algorithm | Memory | Precision | Boundary Spike |
|---|---|---|---|
| Fixed Window | O(1) — one counter | Low | Yes — 2x burst at boundary |
| Sliding Window Log | O(n) — all timestamps | Exact | No |
| Sliding Window Counter | O(1) — two counters | Approximate | Minimal |
Token Bucket & Leaky Bucket
Token Bucket
A bucket holds tokens (capacity = burst size). Tokens are added at a fixed rate (refill rate = sustained limit). Each request consumes one token. If the bucket is empty, the request is rejected. This allows bursts up to bucket capacity while enforcing a sustained rate.
The Arcade Token Dispenser
Imagine an arcade that gives you 10 tokens per hour, and your bucket holds 20 tokens max. You can spend 20 tokens immediately (burst), but then you wait for refills. If you pace yourself at 10/hour, you always have tokens. The token bucket allows short bursts while enforcing a long-term average rate.
```
{
  "algorithm": "token_bucket",
  "bucket_capacity": 50,
  "refill_rate": 10,
  "refill_interval_ms": 1000,
  "comment": "Allows burst of 50 requests, sustained rate of 10/sec"
}

// State per client:
// {
//   "tokens": 42,
//   "last_refill": "2024-03-01T12:00:00.500Z"
// }
//
// On each request:
// 1. Calculate tokens to add since last_refill
// 2. tokens = min(tokens + added, bucket_capacity)
// 3. If tokens >= 1: consume token, allow request
// 4. If tokens < 1: reject with 429
```
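The refill-and-consume steps can be turned into a runnable sketch (Python; `TokenBucket` and its methods are illustrative names, and the clock is passed in explicitly rather than read from `time.monotonic()`):

```python
# Minimal token bucket: refill for elapsed time, then consume one token.
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity         # burst size
        self.refill_rate = refill_rate   # tokens per second (sustained rate)
        self.tokens = float(capacity)    # start with a full bucket
        self.last_refill = 0.0

    def allow(self, now):
        # steps 1-2: refill for elapsed time, capped at bucket capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.tokens + elapsed * self.refill_rate, self.capacity)
        self.last_refill = now
        # steps 3-4: consume a token if available, otherwise reject (429)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=50, refill_rate=10)
burst = sum(bucket.allow(0.0) for _ in range(60))   # 50 allowed, 10 rejected
refilled = bucket.allow(0.1)                        # 0.1 s refills one token
```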
Leaky Bucket
Requests enter a queue (bucket). The queue drains at a fixed rate. If the queue is full, new requests are rejected. Unlike token bucket, leaky bucket produces perfectly smooth output — no bursts. Requests are processed at a constant rate regardless of arrival pattern.
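The queue-and-drain behavior can be sketched as follows (Python, illustrative names; a real implementation would forward drained requests instead of discarding them):

```python
from collections import deque

# Leaky bucket: arrivals are queued and drained at a fixed rate;
# arrivals beyond queue capacity are rejected.
class LeakyBucket:
    def __init__(self, capacity, drain_rate):
        self.capacity = capacity      # max queued requests
        self.drain_rate = drain_rate  # requests processed per second
        self.queue = deque()
        self.last_drain = 0.0

    def offer(self, request, now):
        # drain whole requests for the elapsed time at the fixed rate
        drained = int((now - self.last_drain) * self.drain_rate)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()  # in a real system: forward the request
            self.last_drain += drained / self.drain_rate
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True               # queued; drains at the fixed rate
        return False                  # bucket full: reject
```

Note the O(queue_size) memory per client: unlike the token bucket, the pending requests themselves must be held somewhere.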
| Aspect | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst handling | Allows bursts up to bucket capacity | No bursts — constant output rate |
| Output pattern | Bursty (matches input up to capacity) | Smooth (fixed drain rate) |
| Implementation | Counter + timestamp | Queue + fixed-rate processor |
| Memory | O(1) per client | O(queue_size) per client |
| Best for | APIs where short bursts are acceptable | Systems needing smooth, predictable load |
| Used by | AWS API Gateway, Stripe | NGINX (limit_req with burst) |
Token Bucket is Usually the Right Choice
Most APIs should use token bucket. It's simple, memory-efficient, and allows natural burst patterns (page load triggers multiple API calls simultaneously). Leaky bucket is better when your backend truly cannot handle any burst — like a payment processor with strict TPS limits.
Distributed Rate Limiting
With multiple gateway instances, rate limit state must be shared. If each instance tracks limits independently, a client can multiply their effective limit by the number of instances. Redis is the standard solution — atomic operations ensure accurate counting across all gateway nodes.
```lua
-- Lua script for atomic sliding window rate limiting in Redis
-- Executed atomically — no race conditions between gateway instances
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window_ms = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local window_start = now - window_ms

-- Remove expired entries
redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)

-- Count requests in current window
local count = redis.call('ZCARD', key)
if count >= limit then
  return 0  -- Rate limited
end

-- Add current request
redis.call('ZADD', key, now, now .. ':' .. math.random())
redis.call('PEXPIRE', key, window_ms)
return limit - count  -- Remaining requests
```
Rate Limit Response Headers
```http
# Standard rate limit headers (429 and Retry-After from RFC 6585;
# X-RateLimit-* popularized by draft-ietf-httpapi-ratelimit-headers)
HTTP/1.1 200 OK
X-RateLimit-Limit: 100          # Max requests in window
X-RateLimit-Remaining: 42       # Requests remaining
X-RateLimit-Reset: 1709251260   # Unix timestamp when window resets

# When rate limited:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1709251260
Retry-After: 30                 # Seconds until client should retry
Content-Type: application/json

{"error": "rate_limit_exceeded", "retry_after": 30}
```
Local Cache + Sync Pattern
For ultra-low-latency rate limiting, some gateways use a hybrid: each instance maintains a local counter and periodically syncs with Redis. This avoids a Redis round-trip on every request but allows slight over-limit (by the sync interval). Acceptable when exact precision isn't critical — e.g., allowing 105 requests on a 100 limit is fine.
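A deterministic sketch of the hybrid (Python, all names hypothetical): a plain dict stands in for Redis, and the sync interval is expressed in requests rather than wall-clock time so the over-limit behavior is easy to observe.

```python
# Local counter + periodic sync. Decisions use a possibly stale snapshot
# of the global count plus the local delta, so the cluster can admit
# up to ~sync_every extra requests per instance.
class LocalSyncLimiter:
    def __init__(self, limit, shared, key, sync_every=10):
        self.limit = limit
        self.shared = shared                 # shared store (Redis in production)
        self.key = key
        self.sync_every = sync_every         # flush after this many local hits
        self.local = 0                       # hits since the last sync
        self.snapshot = shared.get(key, 0)   # last known global count

    def allow(self):
        if self.snapshot + self.local >= self.limit:
            return False
        self.local += 1
        if self.local >= self.sync_every:
            self._sync()                     # flush delta, refresh snapshot
        return True

    def _sync(self):
        self.shared[self.key] = self.shared.get(self.key, 0) + self.local
        self.local = 0
        self.snapshot = self.shared[self.key]
```

With two instances sharing one store, the combined admitted count slightly exceeds the limit, bounded by the sync interval per instance — exactly the "105 on a 100 limit" trade-off described above.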
Circuit Breaker
A circuit breaker prevents the gateway from sending requests to a failing upstream service. Instead of letting requests pile up and timeout (wasting resources), the circuit breaker fails fast — returning an error immediately without attempting the upstream call.
| State | Behavior | Transitions To |
|---|---|---|
| Closed (normal) | Requests flow through, failures are counted | Open (when failure threshold exceeded) |
| Open (tripped) | All requests fail immediately with 503 | Half-Open (after timeout period) |
| Half-Open (testing) | Allow limited requests through to test recovery | Closed (if test succeeds) or Open (if test fails) |
The Electrical Circuit Breaker
Just like a home circuit breaker trips when it detects dangerous current (preventing a fire), an API circuit breaker trips when it detects too many failures (preventing cascade failure). You don't keep pushing electricity through a short circuit — and you don't keep sending requests to a dead service. After the problem is fixed, you reset the breaker and current flows again.
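The three-state machine from the table above can be sketched in a few lines (Python, illustrative names; time is passed in explicitly):

```python
# Circuit breaker state machine: closed -> open -> half_open -> closed/open.
class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds to stay open
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self, now):
        if self.state == "open":
            if now - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"  # let a probe request through
                return True
            return False                  # fail fast: 503 without an upstream call
        return True                       # closed or half_open

    def record_success(self):
        self.state = "closed"             # probe succeeded: close the circuit
        self.failures = 0

    def record_failure(self, now):
        if self.state == "half_open":
            self.state = "open"           # probe failed: re-open
            self.opened_at = now
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"           # threshold exceeded: trip
            self.opened_at = now
            self.failures = 0
```

The gateway calls `allow_request` before each upstream call and `record_success`/`record_failure` after it; everything else is bookkeeping.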
```yaml
# Envoy circuit breaker configuration
clusters:
  - name: order-service
    connect_timeout: 5s
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 100
          max_pending_requests: 50
          max_requests: 200
          max_retries: 3
    outlier_detection:
      consecutive_5xx: 5            # Trip after 5 consecutive 5xx
      interval: 10s                 # Check every 10 seconds
      base_ejection_time: 30s       # Eject for 30 seconds minimum
      max_ejection_percent: 50      # Never eject more than 50% of hosts
      success_rate_minimum_hosts: 3
```
Per-Route Circuit Breakers
Configure circuit breakers per upstream service, not globally. A failing order-service shouldn't trip the breaker for user-service. Each upstream gets its own failure counter and state machine. This provides fault isolation — one bad service doesn't take down unrelated functionality.
Throttling vs Rate Limiting
Rate limiting is a hard reject — exceed the limit and you get 429. Throttling is a softer approach — slow down requests instead of rejecting them. Both protect backends, but throttling provides a better client experience when possible.
| Approach | Behavior | Client Experience | Use When |
|---|---|---|---|
| Hard rate limit | Reject with 429 | Immediate error, must retry | Strict enforcement, abuse prevention |
| Throttling (delay) | Queue and process slowly | Slower response, no error | Burst absorption, graceful degradation |
| Priority queuing | Process high-priority first | Paid users unaffected | Tiered service, monetization |
| Graceful degradation | Return partial/cached data | Reduced quality, no error | Read-heavy APIs during overload |
```nginx
# NGINX rate limiting with burst queue (throttling)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    location /api/ {
        # rate=10r/s sustained; burst=20 queues up to 20 excess requests
        # delay=10: the first 10 queued requests pass without delay,
        # the rest are released at 10r/s (throttled)
        # Beyond the burst: rejected with the status set below
        limit_req zone=api burst=20 delay=10;

        # Return 429 (instead of the default 503) for rejected requests
        limit_req_status 429;

        proxy_pass http://backend;
    }
}
```
Graceful Degradation Strategies
- ✅ Return cached responses when backend is overloaded
- ✅ Reduce response payload (omit optional fields)
- ✅ Disable expensive features (search suggestions, recommendations)
- ✅ Prioritize authenticated users over anonymous traffic
- ✅ Shed load from non-critical endpoints first (analytics, preferences)
Interview Questions
Q: Compare token bucket and sliding window algorithms. When would you choose each?
A: Token bucket allows bursts (up to bucket capacity) while enforcing a sustained rate — ideal for APIs where clients naturally send bursts (page loads, batch operations). Sliding window provides a strict, even limit with no burst allowance — better for protecting backends with hard capacity limits. Token bucket is simpler to implement and more forgiving; sliding window is stricter and more predictable.
Q: How do you implement rate limiting across multiple gateway instances?
A: Use Redis as a shared counter store. Each gateway instance executes atomic Lua scripts in Redis to increment counters and check limits. The Lua script ensures no race conditions between concurrent requests hitting different instances. For lower latency, use a hybrid approach: local counters with periodic Redis sync (accepting slight over-limit). Redis Cluster provides HA for the rate limit state.
Q: What's the boundary spike problem and how do you solve it?
A: With fixed window counters, a client can send the full limit at the end of one window and the full limit at the start of the next — doubling their effective rate in a short period. Solutions: (1) Sliding window counter — weights previous window's count by overlap percentage. (2) Token bucket — naturally handles this since tokens refill continuously. (3) Sliding window log — tracks exact timestamps (memory-intensive).
Q: How does a circuit breaker differ from rate limiting?
A: Rate limiting protects backends from too many requests (client-side problem). Circuit breaker protects the gateway from wasting resources on a failing backend (server-side problem). Rate limiting says 'you're sending too fast.' Circuit breaker says 'the destination is broken, I'll fail fast instead of waiting for timeouts.' They're complementary — you need both.
Q: Design a rate limiting system for a multi-tier API (free, pro, enterprise).
A: Key design: (1) Identify tier from API key/JWT claims at auth step. (2) Look up tier limits from config (free: 100/min, pro: 10K/min, enterprise: custom). (3) Use token bucket per consumer with tier-specific capacity and refill rate. (4) Store state in Redis with key pattern: ratelimit:{consumer_id}:{endpoint}. (5) Return X-RateLimit headers so clients can self-throttle. (6) Enterprise gets dedicated rate limit pools (not shared). (7) Alert on consumers consistently hitting limits — they may need an upgrade.
Common Mistakes
Rate limiting by IP only
Using client IP as the sole rate limit key — breaks for users behind NAT/corporate proxies (thousands of users share one IP).
✅ Use authenticated identity (API key, user ID) as the primary limit key. Fall back to IP only for unauthenticated endpoints. For authenticated traffic, IP-based limits are a secondary DDoS defense, not the primary fairness mechanism.
No rate limit headers in responses
Returning 429 without telling the client their limit, remaining quota, or when to retry.
✅ Always include X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After headers. Well-behaved clients use these to self-throttle, reducing the load on your rate limiter. Without them, clients retry blindly.
Same limit for all endpoints
Applying a single rate limit (e.g., 1000 req/min) uniformly across all endpoints, including expensive search and cheap health checks.
✅ Set per-endpoint limits based on cost. A /search endpoint hitting Elasticsearch might allow 10 req/min while /users allows 1000 req/min. Expensive operations (AI inference, report generation) need much tighter limits than simple CRUD.
Circuit breaker with no half-open state
Implementing a circuit breaker that stays open until manually reset, requiring human intervention to restore traffic.
✅ Always implement the half-open state: after a timeout period, allow a small number of test requests through. If they succeed, close the circuit automatically. If they fail, re-open. This enables automatic recovery without human intervention.