Rate Limiting & Throttling
Protecting your system from overload, abuse, and unfair usage. Algorithms, distributed limiting with Redis, and circuit breakers.
Why Rate Limiting at the Gateway
Rate limiting at the gateway protects your entire system from a single enforcement point. Without it, a single misbehaving client can overwhelm your backends, degrade service for everyone, and potentially cause cascading failures across your microservices.
Why Rate Limit at the Gateway
- ✅ Fair usage — prevent one client from consuming all capacity
- ✅ Backend protection — reject excess traffic before it reaches services
- ✅ Cost control — limit expensive operations (AI inference, third-party API calls)
- ✅ DDoS first line — absorb volumetric attacks at the edge
- ✅ Monetization — enforce plan-based limits (free: 100 req/min, pro: 10,000 req/min)
The Water Treatment Plant
Rate limiting is like a water treatment plant's intake valve. No matter how much water flows from the river (incoming requests), the valve controls how much enters the system. Without it, a flood (traffic spike) overwhelms the treatment capacity (backend services) and nothing works properly. The valve protects the entire downstream system.
Rate Limiting is Not Optional
Every public API needs rate limiting. Even internal APIs benefit from it. Without rate limiting: a bug in one client can DDoS your system, a retry storm after a partial outage can prevent recovery, and you have no way to prioritize traffic during degraded conditions. It's a safety mechanism, not a feature.
Rate Limiting Dimensions
Rate limits can be applied along multiple dimensions simultaneously. The key is choosing the right identifier — what uniquely identifies the entity you want to limit.
| Dimension | Identifier | Use Case |
|---|---|---|
| Per IP | Client IP address | Anonymous traffic, DDoS protection |
| Per API Key | API key / client ID | Third-party integrations, plan enforcement |
| Per User | Authenticated user ID | Fair usage across user sessions |
| Per Endpoint | HTTP method + path | Protect expensive endpoints differently |
| Per Tenant | Tenant/organization ID | Multi-tenant isolation |
| Combined | User + Endpoint | User can call /search 10/min but /orders 100/min |
```yaml
# Kong-style declarative config: a global per-consumer limit
# plus a stricter per-route override
plugins:
  # Global rate limit — per consumer
  - name: rate-limiting
    config:
      minute: 100
      hour: 5000
      policy: redis
      redis_host: redis-cluster
      redis_port: 6379
      limit_by: consumer           # per API key/consumer
      header_name: X-Consumer-ID

routes:
  # Per-route override — expensive endpoint
  - name: search-route
    paths:
      - /api/v1/search
    plugins:
      - name: rate-limiting
        config:
          minute: 10
          policy: redis
          redis_host: redis-cluster
          limit_by: consumer
```
Layered Rate Limits
Apply limits at multiple layers: a global limit (10,000 req/min per key), an endpoint limit (100 req/min for /search), and a burst limit (50 req/sec instantaneous). The most restrictive limit applies. This prevents both sustained overload and sudden bursts.
Fixed Window & Sliding Window
Fixed Window Counter
The simplest algorithm. Divide time into fixed windows (e.g., 1-minute intervals). Count requests in the current window. Reject when count exceeds the limit. Reset counter at window boundary.
```bash
# Fixed window: 100 requests per minute
# Key: rate_limit:{client_id}:{window_timestamp}

# Window = current minute (floor to minute boundary)
WINDOW=$(date +%Y%m%d%H%M)
KEY="rate_limit:client_123:${WINDOW}"

# Atomic increment + check
CURRENT=$(redis-cli INCR "$KEY")
redis-cli EXPIRE "$KEY" 60   # Auto-cleanup

if [ "$CURRENT" -gt 100 ]; then
  echo "429 Too Many Requests"
fi
```
The Boundary Spike Problem
Fixed windows have a flaw: a client can send 100 requests at 12:00:59 and 100 more at 12:01:00 — 200 requests in 2 seconds while respecting the "100 per minute" limit. The window boundary creates a spike opportunity. Sliding window algorithms solve this.
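The spike is easy to reproduce with a minimal in-memory fixed-window counter. This is an illustrative Python sketch (class and method names are not from any library), with the clock passed in explicitly:

```python
# Hypothetical in-memory fixed-window counter (illustrative names).
class FixedWindowLimiter:
    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window_secs = window_secs
        self.window = None   # index of the current window
        self.count = 0

    def allow(self, now):
        window = int(now // self.window_secs)
        if window != self.window:   # crossed a boundary: reset the counter
            self.window, self.count = window, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

limiter = FixedWindowLimiter(limit=100, window_secs=60)
# 100 requests just before the boundary, 100 just after:
late = sum(limiter.allow(59.9) for _ in range(100))
early = sum(limiter.allow(60.0) for _ in range(100))
# all 200 are admitted within 0.1 seconds, despite the "100/min" limit
```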
Sliding Window Log
Store the timestamp of every request. To check the limit, count timestamps within the last N seconds. Precise but memory-intensive — storing every timestamp is expensive at high request rates.
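A minimal in-memory sketch of the log approach (Python, illustrative names; a distributed deployment would keep the log in a Redis sorted set, as shown later):

```python
from collections import deque

# Sliding-window log: one timestamp per accepted request;
# expired entries are pruned on every check.
class SlidingWindowLog:
    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window_secs = window_secs
        self.log = deque()   # timestamps of accepted requests, oldest first

    def allow(self, now):
        # drop timestamps that fell out of the window
        while self.log and self.log[0] <= now - self.window_secs:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

The memory cost is visible in the `log` deque: it holds up to `limit` entries per client, which is what makes this algorithm expensive at high request rates.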
Sliding Window Counter
A hybrid approach: use fixed window counters but weight the previous window's count by the overlap percentage. If we're 30% into the current window, the effective count is: (previous_window × 0.7) + current_window. This approximates a true sliding window with minimal memory.
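The weighting formula above can be sketched as follows (Python, illustrative names; the estimate is approximate by design):

```python
# Sliding-window counter: two fixed-window counters, with the previous
# window weighted by its overlap with the sliding window.
class SlidingWindowCounter:
    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window_secs = window_secs
        self.curr_window = 0
        self.curr_count = 0
        self.prev_count = 0

    def allow(self, now):
        window = int(now // self.window_secs)
        if window != self.curr_window:
            # rotate counters; if whole windows were skipped, previous count is 0
            self.prev_count = self.curr_count if window == self.curr_window + 1 else 0
            self.curr_window, self.curr_count = window, 0
        # 30% into the window => previous window weighted by 0.7
        overlap = 1 - (now % self.window_secs) / self.window_secs
        estimated = self.prev_count * overlap + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```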
| Algorithm | Memory | Precision | Boundary Spike |
|---|---|---|---|
| Fixed Window | O(1) — one counter | Low | Yes — 2x burst at boundary |
| Sliding Window Log | O(n) — all timestamps | Exact | No |
| Sliding Window Counter | O(1) — two counters | Approximate | Minimal |
Token Bucket & Leaky Bucket
Token Bucket
A bucket holds tokens (capacity = burst size). Tokens are added at a fixed rate (refill rate = sustained limit). Each request consumes one token. If the bucket is empty, the request is rejected. This allows bursts up to bucket capacity while enforcing a sustained rate.
The Arcade Token Dispenser
Imagine an arcade that gives you 10 tokens per hour, and your bucket holds 20 tokens max. You can spend 20 tokens immediately (burst), but then you wait for refills. If you pace yourself at 10/hour, you always have tokens. The token bucket allows short bursts while enforcing a long-term average rate.
```
{
  "algorithm": "token_bucket",
  "bucket_capacity": 50,
  "refill_rate": 10,
  "refill_interval_ms": 1000,
  "comment": "Allows burst of 50 requests, sustained rate of 10/sec"
}

// State per client:
// {
//   "tokens": 42,
//   "last_refill": "2024-03-01T12:00:00.500Z"
// }
//
// On each request:
// 1. Calculate tokens to add since last_refill
// 2. tokens = min(tokens + added, bucket_capacity)
// 3. If tokens >= 1: consume token, allow request
// 4. If tokens < 1: reject with 429
```
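The refill-and-consume steps can be turned into a runnable sketch (Python; `TokenBucket` and its methods are illustrative names, and the clock is passed in explicitly rather than read from `time.monotonic()`):

```python
# Minimal token bucket: refill for elapsed time, then consume one token.
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity         # burst size
        self.refill_rate = refill_rate   # tokens per second (sustained rate)
        self.tokens = float(capacity)    # start with a full bucket
        self.last_refill = 0.0

    def allow(self, now):
        # steps 1-2: refill for elapsed time, capped at bucket capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.tokens + elapsed * self.refill_rate, self.capacity)
        self.last_refill = now
        # steps 3-4: consume a token if available, otherwise reject (429)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=50, refill_rate=10)
burst = sum(bucket.allow(0.0) for _ in range(60))   # 50 allowed, 10 rejected
refilled = bucket.allow(0.1)                        # 0.1 s refills one token
```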
Leaky Bucket
Requests enter a queue (bucket). The queue drains at a fixed rate. If the queue is full, new requests are rejected. Unlike token bucket, leaky bucket produces perfectly smooth output — no bursts. Requests are processed at a constant rate regardless of arrival pattern.
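The queue-and-drain behavior can be sketched as follows (Python, illustrative names; a real implementation would forward drained requests instead of discarding them):

```python
from collections import deque

# Leaky bucket: arrivals are queued and drained at a fixed rate;
# arrivals beyond queue capacity are rejected.
class LeakyBucket:
    def __init__(self, capacity, drain_rate):
        self.capacity = capacity      # max queued requests
        self.drain_rate = drain_rate  # requests processed per second
        self.queue = deque()
        self.last_drain = 0.0

    def offer(self, request, now):
        # drain whole requests for the elapsed time at the fixed rate
        drained = int((now - self.last_drain) * self.drain_rate)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()  # in a real system: forward the request
            self.last_drain += drained / self.drain_rate
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True               # queued; drains at the fixed rate
        return False                  # bucket full: reject
```

Note the O(queue_size) memory per client: unlike the token bucket, the pending requests themselves must be held somewhere.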
| Aspect | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst handling | Allows bursts up to bucket capacity | No bursts — constant output rate |
| Output pattern | Bursty (matches input up to capacity) | Smooth (fixed drain rate) |
| Implementation | Counter + timestamp | Queue + fixed-rate processor |
| Memory | O(1) per client | O(queue_size) per client |
| Best for | APIs where short bursts are acceptable | Systems needing smooth, predictable load |
| Used by | AWS API Gateway, Stripe | NGINX (limit_req with burst) |
Token Bucket is Usually the Right Choice
Most APIs should use token bucket. It's simple, memory-efficient, and allows natural burst patterns (page load triggers multiple API calls simultaneously). Leaky bucket is better when your backend truly cannot handle any burst — like a payment processor with strict TPS limits.
Distributed Rate Limiting
With multiple gateway instances, rate limit state must be shared. If each instance tracks limits independently, a client can multiply their effective limit by the number of instances. Redis is the standard solution — atomic operations ensure accurate counting across all gateway nodes.
```lua
-- Lua script for atomic sliding window rate limiting in Redis
-- Executed atomically — no race conditions between gateway instances
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window_ms = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local window_start = now - window_ms

-- Remove expired entries
redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)

-- Count requests in current window
local count = redis.call('ZCARD', key)
if count >= limit then
  return 0  -- Rate limited
end

-- Add current request
redis.call('ZADD', key, now, now .. ':' .. math.random())
redis.call('PEXPIRE', key, window_ms)
return limit - count  -- Remaining requests
```
Rate Limit Response Headers
```http
# Standard rate limit headers (429 and Retry-After from RFC 6585;
# X-RateLimit-* popularized by draft-ietf-httpapi-ratelimit-headers)
HTTP/1.1 200 OK
X-RateLimit-Limit: 100          # Max requests in window
X-RateLimit-Remaining: 42       # Requests remaining
X-RateLimit-Reset: 1709251260   # Unix timestamp when window resets

# When rate limited:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1709251260
Retry-After: 30                 # Seconds until client should retry
Content-Type: application/json

{"error": "rate_limit_exceeded", "retry_after": 30}
```
Local Cache + Sync Pattern
For ultra-low-latency rate limiting, some gateways use a hybrid: each instance maintains a local counter and periodically syncs with Redis. This avoids a Redis round-trip on every request but allows slight over-limit (by the sync interval). Acceptable when exact precision isn't critical — e.g., allowing 105 requests on a 100 limit is fine.
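A deterministic sketch of the hybrid (Python, all names hypothetical): a plain dict stands in for Redis, and the sync interval is expressed in requests rather than wall-clock time so the over-limit behavior is easy to observe.

```python
# Local counter + periodic sync. Decisions use a possibly stale snapshot
# of the global count plus the local delta, so the cluster can admit
# up to ~sync_every extra requests per instance.
class LocalSyncLimiter:
    def __init__(self, limit, shared, key, sync_every=10):
        self.limit = limit
        self.shared = shared                 # shared store (Redis in production)
        self.key = key
        self.sync_every = sync_every         # flush after this many local hits
        self.local = 0                       # hits since the last sync
        self.snapshot = shared.get(key, 0)   # last known global count

    def allow(self):
        if self.snapshot + self.local >= self.limit:
            return False
        self.local += 1
        if self.local >= self.sync_every:
            self._sync()                     # flush delta, refresh snapshot
        return True

    def _sync(self):
        self.shared[self.key] = self.shared.get(self.key, 0) + self.local
        self.local = 0
        self.snapshot = self.shared[self.key]
```

With two instances sharing one store, the combined admitted count slightly exceeds the limit, bounded by the sync interval per instance — exactly the "105 on a 100 limit" trade-off described above.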
Circuit Breaker
A circuit breaker prevents the gateway from sending requests to a failing upstream service. Instead of letting requests pile up and timeout (wasting resources), the circuit breaker fails fast — returning an error immediately without attempting the upstream call.
| State | Behavior | Transitions To |
|---|---|---|
| Closed (normal) | Requests flow through, failures are counted | Open (when failure threshold exceeded) |
| Open (tripped) | All requests fail immediately with 503 | Half-Open (after timeout period) |
| Half-Open (testing) | Allow limited requests through to test recovery | Closed (if test succeeds) or Open (if test fails) |
The Electrical Circuit Breaker
Just like a home circuit breaker trips when it detects dangerous current (preventing a fire), an API circuit breaker trips when it detects too many failures (preventing cascade failure). You don't keep pushing electricity through a short circuit — and you don't keep sending requests to a dead service. After the problem is fixed, you reset the breaker and current flows again.
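The three-state machine from the table above can be sketched in a few lines (Python, illustrative names; time is passed in explicitly):

```python
# Circuit breaker state machine: closed -> open -> half_open -> closed/open.
class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds to stay open
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self, now):
        if self.state == "open":
            if now - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"  # let a probe request through
                return True
            return False                  # fail fast: 503 without an upstream call
        return True                       # closed or half_open

    def record_success(self):
        self.state = "closed"             # probe succeeded: close the circuit
        self.failures = 0

    def record_failure(self, now):
        if self.state == "half_open":
            self.state = "open"           # probe failed: re-open
            self.opened_at = now
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"           # threshold exceeded: trip
            self.opened_at = now
            self.failures = 0
```

The gateway calls `allow_request` before each upstream call and `record_success`/`record_failure` after it; everything else is bookkeeping.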
```yaml
# Envoy circuit breaker configuration
clusters:
  - name: order-service
    connect_timeout: 5s
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 100
          max_pending_requests: 50
          max_requests: 200
          max_retries: 3
    outlier_detection:
      consecutive_5xx: 5            # Trip after 5 consecutive 5xx
      interval: 10s                 # Check every 10 seconds
      base_ejection_time: 30s       # Eject for 30 seconds minimum
      max_ejection_percent: 50      # Never eject more than 50% of hosts
      success_rate_minimum_hosts: 3
```
Per-Route Circuit Breakers
Configure circuit breakers per upstream service, not globally. A failing order-service shouldn't trip the breaker for user-service. Each upstream gets its own failure counter and state machine. This provides fault isolation — one bad service doesn't take down unrelated functionality.
Throttling vs Rate Limiting
Rate limiting is a hard reject — exceed the limit and you get 429. Throttling is a softer approach — slow down requests instead of rejecting them. Both protect backends, but throttling provides a better client experience when possible.
| Approach | Behavior | Client Experience | Use When |
|---|---|---|---|
| Hard rate limit | Reject with 429 | Immediate error, must retry | Strict enforcement, abuse prevention |
| Throttling (delay) | Queue and process slowly | Slower response, no error | Burst absorption, graceful degradation |
| Priority queuing | Process high-priority first | Paid users unaffected | Tiered service, monetization |
| Graceful degradation | Return partial/cached data | Reduced quality, no error | Read-heavy APIs during overload |
```nginx
# NGINX rate limiting with burst queue (throttling)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    location /api/ {
        # rate=10r/s sustained; burst=20 queues up to 20 excess requests
        # delay=10: the first 10 queued requests pass without delay,
        # the rest are released at 10r/s (throttled)
        # Beyond the burst: rejected with the status set below
        limit_req zone=api burst=20 delay=10;

        # Return 429 (instead of the default 503) for rejected requests
        limit_req_status 429;

        proxy_pass http://backend;
    }
}
```
Graceful Degradation Strategies
- ✅ Return cached responses when backend is overloaded
- ✅ Reduce response payload (omit optional fields)
- ✅ Disable expensive features (search suggestions, recommendations)
- ✅ Prioritize authenticated users over anonymous traffic
- ✅ Shed load from non-critical endpoints first (analytics, preferences)
Interview Questions
Q: Compare token bucket and sliding window algorithms. When would you choose each?
A: Token bucket allows bursts (up to bucket capacity) while enforcing a sustained rate — ideal for APIs where clients naturally send bursts (page loads, batch operations). Sliding window provides a strict, even limit with no burst allowance — better for protecting backends with hard capacity limits. Token bucket is simpler to implement and more forgiving; sliding window is stricter and more predictable.
Q: How do you implement rate limiting across multiple gateway instances?
A: Use Redis as a shared counter store. Each gateway instance executes atomic Lua scripts in Redis to increment counters and check limits. The Lua script ensures no race conditions between concurrent requests hitting different instances. For lower latency, use a hybrid approach: local counters with periodic Redis sync (accepting slight over-limit). Redis Cluster provides HA for the rate limit state.
Q: What's the boundary spike problem and how do you solve it?
A: With fixed window counters, a client can send the full limit at the end of one window and the full limit at the start of the next — doubling their effective rate in a short period. Solutions: (1) Sliding window counter — weights previous window's count by overlap percentage. (2) Token bucket — naturally handles this since tokens refill continuously. (3) Sliding window log — tracks exact timestamps (memory-intensive).
Q: How does a circuit breaker differ from rate limiting?
A: Rate limiting protects backends from too many requests (client-side problem). Circuit breaker protects the gateway from wasting resources on a failing backend (server-side problem). Rate limiting says 'you're sending too fast.' Circuit breaker says 'the destination is broken, I'll fail fast instead of waiting for timeouts.' They're complementary — you need both.
Q: Design a rate limiting system for a multi-tier API (free, pro, enterprise).
A: Key design: (1) Identify tier from API key/JWT claims at auth step. (2) Look up tier limits from config (free: 100/min, pro: 10K/min, enterprise: custom). (3) Use token bucket per consumer with tier-specific capacity and refill rate. (4) Store state in Redis with key pattern: ratelimit:{consumer_id}:{endpoint}. (5) Return X-RateLimit headers so clients can self-throttle. (6) Enterprise gets dedicated rate limit pools (not shared). (7) Alert on consumers consistently hitting limits — they may need an upgrade.
Common Mistakes
Rate limiting by IP only
Using client IP as the sole rate limit key — breaks for users behind NAT/corporate proxies (thousands of users share one IP).
✅ Use authenticated identity (API key, user ID) as the primary limit key. Fall back to IP only for unauthenticated endpoints. For authenticated traffic, IP-based limits are a secondary DDoS defense, not the primary fairness mechanism.
No rate limit headers in responses
Returning 429 without telling the client their limit, remaining quota, or when to retry.
✅ Always include X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After headers. Well-behaved clients use these to self-throttle, reducing the load on your rate limiter. Without them, clients retry blindly.
Same limit for all endpoints
Applying a single rate limit (e.g., 1000 req/min) uniformly across all endpoints, including expensive search and cheap health checks.
✅ Set per-endpoint limits based on cost. A /search endpoint hitting Elasticsearch might allow 10 req/min while /users allows 1000 req/min. Expensive operations (AI inference, report generation) need much tighter limits than simple CRUD.
Circuit breaker with no half-open state
Implementing a circuit breaker that stays open until manually reset, requiring human intervention to restore traffic.
✅ Always implement the half-open state: after a timeout period, allow a small number of test requests through. If they succeed, close the circuit automatically. If they fail, re-open. This enables automatic recovery without human intervention.