Rate Limiting
Control traffic and prevent abuse — token bucket, leaky bucket, sliding window counters, and distributed rate limiting. Protect systems from overload while ensuring fair usage.
The Big Picture — Why Rate Limiting Is Necessary
Without rate limiting, a single misbehaving client — whether it's a bug, a bot, or a deliberate attack — can consume all your system's resources and bring it down for everyone. Rate limiting is the bouncer at the door: it controls how many requests each client can make in a given time period.
The Water Tap Analogy
Your water system has a maximum flow capacity. Without a flow regulator, one house could open all taps full blast and drain pressure for the entire neighborhood. A rate limiter is the flow regulator — it ensures each house gets a fair share of water, prevents any single house from overwhelming the system, and keeps the overall pressure stable. The token bucket is a tap that fills a small tank — you can use water in bursts (empty the tank) but can't exceed the refill rate long-term. The leaky bucket is a tap with a constant drip — water flows out at a fixed rate regardless of how much is poured in.
What Happens Without Rate Limiting
System Overload
A traffic spike (legitimate or attack) overwhelms your servers. CPU maxes out, memory is exhausted, the database connection pool fills up. The entire system crashes for all users.
Abuse & Spam
Bots create thousands of accounts per minute, scrape your entire database, or brute-force login credentials. Without limits, there's no defense.
Resource Starvation
One heavy API consumer uses 90% of your capacity. The other 10,000 users share the remaining 10%. Unfair and unsustainable.
🔥 Key Insight
Rate limiting isn't just about security — it's about fairness and stability. Every public API (Stripe, GitHub, Twitter) rate-limits clients. Every internal service should rate-limit callers. It's a fundamental building block of reliable systems.
Where Rate Limiting Sits
Client (sends request) → API Gateway (first line of defense) → Rate Limiter (allow or reject) → Application (process request) → Database (store/retrieve)
Request:

```http
GET /api/users HTTP/1.1
Authorization: Bearer token_abc123
```

Response (rate limited):

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1706140800

{
  "error": {
    "code": "RATE_LIMITED",
    "message": "Rate limit exceeded. 100 requests per minute allowed.",
    "retry_after": 30
  }
}
```

Headers tell the client:
→ How many requests are allowed (Limit)
→ How many remain in this window (Remaining)
→ When the window resets (Reset)
→ How long to wait before retrying (Retry-After)
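On the client side, these headers let well-behaved clients self-throttle. A minimal sketch using the `requests` library (the URL, token, and retry policy are illustrative, not part of any specific API):

```python
import time
import requests

def get_with_backoff(url: str, token: str, max_retries: int = 3):
    """Retry on 429, honoring the server's Retry-After hint."""
    headers = {"Authorization": f"Bearer {token}"}
    for _ in range(max_retries):
        resp = requests.get(url, headers=headers)
        if resp.status_code != 429:
            return resp
        # Sleep for the server-suggested interval instead of hammering it.
        wait = int(resp.headers.get("Retry-After", "1"))
        time.sleep(wait)
    return resp
```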
Where to Apply
- ✅ API Gateway — first line, blocks bad traffic early
- ✅ Per-user — 100 requests/min per API key
- ✅ Per-endpoint — /api/login gets stricter limits than /api/products
- ✅ Per-IP — blocks DDoS before it reaches your app
- ✅ Internal services — prevent one service from overwhelming another
What to Limit By
- ✅ API key / auth token (most common for APIs)
- ✅ IP address (for unauthenticated endpoints)
- ✅ User ID (for authenticated users)
- ✅ Endpoint + method (different limits per route)
- ✅ Tenant ID (for multi-tenant SaaS)
Token Bucket Algorithm
The token bucket is the most widely used rate limiting algorithm. A bucket holds tokens. Tokens are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, allowing bursts up to that capacity.
```text
Configuration:
  Bucket capacity: 10 tokens (max burst)
  Refill rate:     1 token/second

Timeline:
  T=0:  Bucket has 10 tokens (full)
  T=0:  5 requests arrive → 5 tokens consumed → 5 remaining ✅
  T=0:  3 more requests → 3 tokens consumed → 2 remaining ✅
  T=0:  3 more requests → only 2 tokens → 2 allowed, 1 REJECTED ❌
  T=1:  1 token added (refill) → 3 remaining
  T=2:  1 token added → 4 remaining
  T=5:  3 more tokens added → 7 remaining
  T=10: Bucket full again (10 tokens, capped at capacity)

Key behavior:
  → Allows bursts up to bucket capacity (10 requests instantly)
  → Long-term rate is limited to the refill rate (1 req/sec)
  → After a burst, must wait for tokens to refill
  → Unused tokens accumulate (up to capacity)
```
```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max tokens (burst size)
        self.tokens = capacity          # current tokens
        self.refill_rate = refill_rate  # tokens added per second
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        # Refill tokens based on elapsed time, capped at capacity
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True   # ALLOW
        return False      # REJECT → 429 Too Many Requests
```
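A quick check of the timeline behavior above, using this class (and its `time` import):

```python
bucket = TokenBucket(capacity=10, refill_rate=1.0)

# Burst: the first 10 calls drain the bucket, the 11th is rejected.
print([bucket.allow_request() for _ in range(11)])
# → [True, True, True, True, True, True, True, True, True, True, False]

time.sleep(2)                   # ~2 tokens refill at 1 token/sec
print(bucket.allow_request())   # → True
```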
Strengths
- ✅ Allows bursts (great for real-world traffic patterns)
- ✅ Simple to implement and understand
- ✅ Memory-efficient (just 2 numbers per bucket)
- ✅ Used by AWS, Stripe, GitHub, most major APIs
- ✅ Easy to tune: adjust capacity (burst) and rate independently
Weaknesses
- ❌ Bursts can still overwhelm downstream if capacity is too high
- ❌ Doesn't guarantee smooth traffic (bursty by design)
- ❌ Distributed implementation needs shared state
- ❌ Clock skew can cause issues in distributed setups
🎯 Interview Insight
Token bucket is the default answer for "how would you implement rate limiting?" Say: "I'd use a token bucket with capacity of 100 and refill rate of 10/sec. This allows bursts of up to 100 requests but limits sustained throughput to 10/sec. I'd store the bucket state in Redis for distributed enforcement."
Leaky Bucket Algorithm
The leaky bucket processes requests at a constant rate, regardless of how fast they arrive. Incoming requests enter a queue (the bucket). The queue drains at a fixed rate. If the queue is full, new requests are dropped.
```text
Configuration:
  Queue capacity: 10 requests
  Drain rate:     1 request/second (constant output)

Timeline:
  T=0: 8 requests arrive at once → all enter queue (8/10)
  T=0: 1 request processed (drained) → 7 in queue
  T=1: 1 request processed → 6 in queue
  T=1: 5 more requests arrive → 11 total, queue holds 10 → 1 DROPPED ❌
  T=2: 1 request processed → 9 in queue
  ...
  T=9: Queue empty, all processed

Key behavior:
  → Output rate is ALWAYS constant (1 req/sec)
  → No bursts — traffic is smoothed
  → Excess requests are dropped (not queued beyond capacity)
  → Acts as a traffic shaper, not just a limiter
```
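A minimal in-memory sketch of this queue-and-drain behavior (single-process; in practice the drain would be driven by a worker loop or scheduler rather than piggybacking on `submit`):

```python
import time
from collections import deque

class LeakyBucket:
    """Queue incoming requests; release them at a constant rate, drop overflow."""
    def __init__(self, capacity: int, drain_rate: float):
        self.capacity = capacity      # max queued requests
        self.drain_rate = drain_rate  # requests released per second
        self.queue = deque()
        self.last_drain = time.monotonic()

    def _drain(self):
        # Release however many queued requests the elapsed time allows.
        now = time.monotonic()
        to_release = int((now - self.last_drain) * self.drain_rate)
        if to_release > 0:
            for _ in range(min(to_release, len(self.queue))):
                handler = self.queue.popleft()
                handler()  # processed at the constant output rate
            # Advance by whole releases, keeping the fractional remainder.
            self.last_drain += to_release / self.drain_rate

    def submit(self, handler) -> bool:
        self._drain()
        if len(self.queue) >= self.capacity:
            return False              # bucket full → DROP
        self.queue.append(handler)
        return True                   # queued; drains at the fixed rate
```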
| Feature | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst support | ✅ Yes (up to bucket capacity) | ❌ No (constant output rate) |
| Output pattern | Bursty (matches input pattern) | Smooth (constant rate) |
| When bucket is full | Rejects new requests | Drops new requests |
| Memory | 2 numbers (tokens, timestamp) | Queue of pending requests |
| Best for | APIs (allow bursts) | Network traffic shaping (smooth flow) |
| Real-world use | Stripe, GitHub, AWS APIs | Network routers, video streaming |
🎯 Interview Insight
Leaky bucket is less common in API rate limiting but important for traffic shaping. The key distinction: token bucket allows bursts (good for APIs where users make several requests quickly then pause), leaky bucket enforces a constant rate (good for network bandwidth or processing pipelines where smooth flow matters).
Sliding Window Counters
Sliding window counters count requests in a rolling time window. Instead of "100 requests per minute starting at :00" (fixed window), it's "100 requests in the last 60 seconds from right now" (sliding window). This is fairer and avoids the boundary spike problem.
Fixed Window vs Sliding Window
```text
Fixed window: 100 requests per minute, window resets at :00

12:00:00 ─────────────────── 12:01:00 ─────────────────── 12:02:00
[        window 1          ][        window 2          ]

User sends 0 requests from 12:00:00 to 12:00:55
User sends 100 requests at 12:00:55-12:00:59 ✅ (within limit)
User sends 100 requests at 12:01:00-12:01:05 ✅ (new window!)

Result: 200 requests in 10 seconds — 2x the intended rate!
The boundary between windows creates a loophole.

Sliding window: 100 requests in the last 60 seconds

At 12:01:05, the window is [12:00:05 → 12:01:05]
It counts ALL requests in that 60-second range
The 100 requests from 12:00:55 are still in the window
→ Only 0 more allowed until 12:00:55's requests expire

Result: never exceeds 100 requests in ANY 60-second period. Fair.
```
Sliding Window Log vs Sliding Window Counter
| Approach | How It Works | Accuracy | Memory | Best For |
|---|---|---|---|---|
| Fixed Window | Count per fixed time bucket (e.g., per minute at :00) | Low (boundary spike) | Very low (1 counter) | Simple, low-stakes limiting |
| Sliding Window Log | Store timestamp of every request, count within window | Perfect | High (1 entry per request) | When accuracy is critical |
| Sliding Window Counter | Weighted average of current and previous fixed windows | Very good (~99.7%) | Low (2 counters) | Production APIs (best balance) |
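For contrast, the sliding window log is just a pruned list of timestamps: exact, but one entry per request. A minimal in-memory sketch:

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact limiting: keep one timestamp per request, evict expired ones."""
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow_request(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that have fallen out of the rolling window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```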
```text
Limit: 100 requests per 60 seconds

Current window:  12:01:00 - 12:02:00 → count = 30
Previous window: 12:00:00 - 12:01:00 → count = 80
Current time:    12:01:20 (20 seconds into current window)

Window overlap: (60 - 20) / 60 = 66.7% of previous window

Weighted count = current_count + previous_count × overlap
               = 30 + 80 × 0.667
               = 30 + 53.3
               = 83.3

83.3 < 100 → ALLOW ✅
```

This approximation is ~99.7% accurate and uses only 2 counters. Cloudflare uses this approach for their rate limiting.
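The weighted-count check translates directly into code. A single-process sketch (production versions keep the two counters per key in Redis):

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: weight the previous fixed window's count."""
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.current_idx = 0     # index of the current fixed window
        self.current_count = 0
        self.previous_count = 0

    def allow_request(self) -> bool:
        now = time.time()
        idx = int(now // self.window)
        if idx != self.current_idx:
            # Rolled into a new window; if windows were skipped, previous is 0.
            self.previous_count = self.current_count if idx == self.current_idx + 1 else 0
            self.current_idx = idx
            self.current_count = 0

        # Fraction of the previous window still inside the sliding window.
        elapsed = now - idx * self.window
        overlap = (self.window - elapsed) / self.window

        weighted = self.current_count + self.previous_count * overlap
        if weighted < self.limit:
            self.current_count += 1
            return True
        return False
```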
🎯 Interview Insight
Sliding window counter is the best answer for production rate limiting. It's accurate (no boundary spikes), memory-efficient (2 counters per key), and fast (O(1) per check). Mention the weighted average formula and explain why fixed windows have the boundary spike problem.
Distributed Rate Limiting
In a single-server setup, rate limiting is simple — one counter in memory. But with 20 API servers behind a load balancer, each server has its own counter. A client sending 100 requests gets ~5 per server — each server thinks the client sent only 5. The global limit of 100 is never enforced.
```text
Limit: 100 requests/minute per user

Single server:
  All requests → 1 server → 1 counter → enforced correctly ✅

20 servers behind load balancer:
  100 requests → ~5 per server → each server's counter = 5
  Each server thinks: "5 < 100, allow" ✅
  Global total: 100 requests → all allowed

But if client sends 2,000 requests:
  → ~100 per server
  → each server: "100 = 100, limit reached"
  → Global total: 2,000 allowed instead of 100 ❌

The fix: shared state that ALL servers read/write.
```
Solutions
Centralized Store (Redis)
All servers check and increment a counter in Redis. INCR + EXPIRE gives atomic counter with TTL. Every request: INCR user:42:rate_limit → if > 100 → reject. Redis handles ~100K operations/sec — sufficient for most systems.
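In Python with redis-py, the fixed-window version of this is a few lines (key naming and limits are illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def allow_request(user_id: str, limit: int = 100, window: int = 60) -> bool:
    """Fixed-window counter in Redis: INCR, then EXPIRE on first hit."""
    key = f"rate_limit:user:{user_id}"
    count = r.incr(key)        # atomic increment; creates the key at 1
    if count == 1:
        r.expire(key, window)  # start the window on the first request
    return count <= limit
```

Note the INCR and EXPIRE are two separate round trips here; the Lua variant below folds the whole check into one atomic step.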
Redis + Lua Script (Atomic)
Use a Lua script in Redis to atomically check the counter, increment it, and set TTL — all in one round trip. This prevents race conditions where two servers read the same count simultaneously and both allow the request.
Local + Sync (Hybrid)
Each server maintains a local counter and periodically syncs with a central store. Faster (no Redis call per request) but less accurate (local counters can drift). Good for high-throughput, approximate limiting.
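A rough sketch of that hybrid: a local counter flushed to Redis on a timer. The class name and sync interval are illustrative, and window expiry is omitted for brevity:

```python
import threading
import redis

class LocalSyncLimiter:
    """Approximate limiter: count locally, sync to shared Redis periodically."""
    def __init__(self, r: redis.Redis, key: str, limit: int, sync_every: float = 1.0):
        self.r = r
        self.key = key
        self.limit = limit
        self.local_count = 0   # requests seen since the last sync
        self.global_count = 0  # last known cluster-wide total
        self.lock = threading.Lock()
        self.sync_every = sync_every
        self._arm_timer()

    def _arm_timer(self):
        timer = threading.Timer(self.sync_every, self._sync)
        timer.daemon = True
        timer.start()

    def _sync(self):
        with self.lock:
            delta, self.local_count = self.local_count, 0
        # Push our delta; INCRBY returns the new cluster-wide total.
        self.global_count = self.r.incrby(self.key, delta)
        self._arm_timer()  # NOTE: window reset / TTL handling omitted

    def allow_request(self) -> bool:
        with self.lock:
            if self.global_count + self.local_count >= self.limit:
                return False  # approximate: counters can drift between syncs
            self.local_count += 1
            return True
```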
Consistent Hashing for Sharding
Route all requests from the same user to the same server (sticky sessions via consistent hashing). That server's local counter is the global counter for that user. No shared state needed. Trade-off: uneven load distribution.
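A minimal consistent-hash ring that pins each user to one server (the virtual-node count and hash function are arbitrary choices for the sketch):

```python
import bisect
import hashlib

class HashRing:
    """Map each user to a stable server so its local counter is authoritative."""
    def __init__(self, servers: list[str], replicas: int = 100):
        self.ring = []  # sorted (point, server) pairs; replicas smooth the load
        for server in servers:
            for i in range(replicas):
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def server_for(self, user_id: str) -> str:
        # First ring point clockwise from the user's hash (wrapping around).
        idx = bisect.bisect(self.points, self._hash(user_id)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["api-1", "api-2", "api-3"])
print(ring.server_for("user:42"))  # always the same server for this user
```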
Lua script (runs atomically in Redis):

```lua
local key    = KEYS[1]            -- "rate_limit:user:42"
local limit  = tonumber(ARGV[1])  -- 100
local window = tonumber(ARGV[2])  -- 60 (seconds)

local current = redis.call("INCR", key)
if current == 1 then
    redis.call("EXPIRE", key, window)
end

if current > limit then
    return 0  -- REJECT
else
    return 1  -- ALLOW
end
```

Why Lua? The INCR + EXPIRE + comparison must be atomic. Without Lua, a race condition between INCR and the check could allow requests beyond the limit.
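Calling that script from an application server with redis-py might look like this (the script body matches the one above; key name and limits are illustrative):

```python
import redis

r = redis.Redis()

RATE_LIMIT_LUA = """
local key    = KEYS[1]
local limit  = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current = redis.call("INCR", key)
if current == 1 then
    redis.call("EXPIRE", key, window)
end
if current > limit then return 0 else return 1 end
"""

check_rate_limit = r.register_script(RATE_LIMIT_LUA)

def allow_request(user_id: str, limit: int = 100, window: int = 60) -> bool:
    # One round trip; check-and-increment runs atomically inside Redis.
    return check_rate_limit(keys=[f"rate_limit:user:{user_id}"],
                            args=[limit, window]) == 1
```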
| Approach | Accuracy | Latency | Complexity | Best For |
|---|---|---|---|---|
| Redis (centralized) | High | +0.5ms per request | Low | Most production APIs |
| Redis + Lua | Very high (atomic) | +0.5ms per request | Medium | Strict enforcement needed |
| Local + sync | Approximate | ~0ms (local) | Medium | Very high throughput, approximate OK |
| Sticky sessions | High (per-user) | ~0ms (local) | Low | When load balancer supports it |
🎯 Interview Insight
Distributed rate limiting with Redis is the standard answer. Say: "I'd use Redis with a Lua script for atomic check-and-increment. The key is user:{id}:rate_limit with a 60-second TTL. Each API server calls Redis before processing the request. Redis handles 100K+ ops/sec, so it won't be the bottleneck."
End-to-End Scenario
Let's design the rate limiting system for a public API platform like Stripe or GitHub.
```text
System: Public API with 50K developers, 20 API servers

Rate limits:
  Free tier:   60 requests/minute, burst of 10
  Pro tier:    1,000 requests/minute, burst of 100
  Enterprise:  10,000 requests/minute, burst of 500

Algorithm:   Token bucket (allows bursts, industry standard)
State store: Redis cluster (shared across all 20 API servers)

Request flow:
1. Client sends: GET /api/charges (API key in header)
2. API Gateway extracts API key → looks up tier
3. Rate limiter checks Redis:
   Key: "rate:{api_key}" → { tokens: 87, last_refill: 1706140790 }
   Refill tokens based on elapsed time
   If tokens > 0 → decrement, ALLOW
   If tokens = 0 → REJECT with 429
4. Response headers (always included):
   X-RateLimit-Limit: 1000
   X-RateLimit-Remaining: 87
   X-RateLimit-Reset: 1706140860
5. If rejected:
   HTTP 429 Too Many Requests
   Retry-After: 15
   Body: { "error": "Rate limit exceeded", "retry_after": 15 }

Defense layers:
  Layer 1: IP-based rate limit at CDN/WAF (10K req/min per IP)
  Layer 2: API key rate limit at gateway (per-tier limits)
  Layer 3: Per-endpoint limits (/api/login: 10/min, /api/charges: 1000/min)
  Layer 4: Global system limit (circuit breaker if total QPS > capacity)
```
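Step 3 of the flow can be sketched as a Redis-side token bucket. The Lua below is a hypothetical refill-and-take script in the spirit of the scenario, not Stripe's or GitHub's actual implementation; the key layout and idle TTL are assumptions:

```python
import time
import redis

r = redis.Redis()

TOKEN_BUCKET_LUA = """
local key         = KEYS[1]
local capacity    = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local now         = tonumber(ARGV[3])

local state  = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(state[1]) or capacity
local last   = tonumber(state[2]) or now

-- Refill based on elapsed time, capped at capacity
tokens = math.min(capacity, tokens + (now - last) * refill_rate)

local allowed = 0
if tokens >= 1 then
    tokens = tokens - 1
    allowed = 1
end

redis.call("HSET", key, "tokens", tokens, "last_refill", now)
redis.call("EXPIRE", key, 120)  -- clean up idle buckets
return allowed
"""

take_token = r.register_script(TOKEN_BUCKET_LUA)

def allow_request(api_key: str, capacity: int, refill_rate: float) -> bool:
    return take_token(keys=[f"rate:{api_key}"],
                      args=[capacity, refill_rate, time.time()]) == 1

# Pro tier: 1,000 req/min sustained ≈ 16.7 tokens/sec, burst of 100
allow_request("key_abc123", capacity=100, refill_rate=1000 / 60)
```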
💡 This Is How Stripe / GitHub Works
Multiple layers of rate limiting, token bucket for burst support, Redis for distributed state, and clear response headers so clients can self-throttle. The rate limiter is the first thing a request hits — before authentication, before business logic, before the database.
Trade-offs & Decision Making
| Algorithm | Burst Support | Accuracy | Complexity | Memory | Best For |
|---|---|---|---|---|---|
| Token Bucket | ✅ Yes | Medium | Low | Very low | APIs (most common) |
| Leaky Bucket | ❌ No | High | Low | Medium (queue) | Traffic shaping, streaming |
| Fixed Window | ❌ No | Low (boundary spike) | Very low | Very low | Simple, non-critical limits |
| Sliding Window Log | ❌ No | Perfect | Medium | High | When accuracy is critical |
| Sliding Window Counter | Limited | Very high | Medium | Low | Production APIs (best balance) |
🎯 Choose Token Bucket When
- Clients make bursty requests (common in APIs)
- You want simple, well-understood behavior
- Burst capacity and sustained rate need independent tuning
- Industry standard is expected (Stripe, AWS, GitHub)
🎯 Choose Sliding Window When
- Fairness is critical (no boundary exploitation)
- Strict per-minute/per-hour limits needed
- Login attempt limiting (security-sensitive)
- Compliance requires exact counting
Interview Questions
Q: What's the difference between token bucket and leaky bucket?
A: Token bucket allows bursts — tokens accumulate when the client is idle, and can be spent in a burst (up to bucket capacity). The long-term rate is limited to the refill rate. Leaky bucket enforces a constant output rate — requests enter a queue and are processed at a fixed rate, regardless of arrival pattern. No bursts allowed. Use token bucket for APIs (users make several requests quickly, then pause). Use leaky bucket for traffic shaping (need smooth, constant throughput).
Q: How do you implement distributed rate limiting?
A: Use Redis as a shared counter store. All API servers check and increment the same counter in Redis before processing a request. Use a Lua script for atomic check-and-increment (prevents race conditions). Key design: 'rate_limit:{user_id}' with INCR + EXPIRE. Redis handles 100K+ ops/sec — not a bottleneck. For very high throughput where Redis latency matters, use a hybrid approach: local counters with periodic sync to Redis (approximate but faster).
Q: Why is sliding window more accurate than fixed window?
A: Fixed window resets at boundaries (e.g., every minute at :00). A client can send 100 requests at :59 and 100 more at :01 — 200 requests in 2 seconds, double the intended rate. Sliding window counts requests in the last N seconds from NOW, so there's no boundary to exploit. The sliding window counter approximation (weighted average of current and previous windows) is ~99.7% accurate with only 2 counters per key.
You're designing rate limiting for a payment API
What algorithm and limits would you use?
Answer: Token bucket with strict limits. Configuration: capacity=5 (small burst for retries), refill rate=2/sec (sustained rate). Why token bucket: payment retries come in bursts (client retries 3 times quickly). Why strict: payments are irreversible — over-allowing is worse than over-rejecting. Distributed enforcement via Redis + Lua (atomic). Per-API-key limits (not per-IP, since multiple users share IPs). Additional: idempotency keys to handle retries safely, and a separate stricter limit on /create-charge vs /list-charges.
Your rate limiter uses fixed windows and users complain about unfair limiting
What's happening and how do you fix it?
Answer: Users are hitting the boundary spike problem. A user sends 50 requests at :59 (within the 100/min limit), then 50 more at :01 (new window, also within limit). But another user who sends 100 requests evenly throughout the minute gets rate-limited at :30. The first user got 100 requests in 2 seconds; the second got limited at 50 seconds. Fix: switch to sliding window counter. It counts requests in the last 60 seconds from NOW, so both users are treated fairly regardless of when in the minute they send requests.
Common Pitfalls
Not handling distributed state
Each API server has its own in-memory rate limiter. With 20 servers, a client gets 20x the intended limit (each server allows the full quota independently). The rate limiter is effectively useless.
✅ Use a shared store (Redis) for rate limit counters. All servers read/write the same counter. Use Lua scripts for atomic operations. If Redis latency is a concern, use local counters with periodic sync — approximate but still effective.
Using fixed windows (boundary spike)
Rate limit is 100/minute with fixed windows resetting at :00. Users exploit the boundary: 100 requests at :59, 100 more at :01 = 200 in 2 seconds. The system is overloaded despite the 'limit' being in place.
✅ Use sliding window counters. The weighted average of current and previous windows eliminates the boundary spike with minimal overhead (2 counters per key). If you must use fixed windows, set the limit to half the actual capacity to account for boundary doubling.
Ignoring burst requirements
Using a leaky bucket (constant rate) for an API where clients naturally make bursty requests. A mobile app loads a screen and makes 10 API calls in 500ms. The leaky bucket queues 9 of them, making the app feel slow. Users complain about performance.
✅ Use token bucket for APIs. Set the bucket capacity to the expected burst size (e.g., 10-20 for a mobile app's screen load). The sustained rate limits long-term usage while the burst capacity handles normal usage patterns without artificial delays.
Over-restricting legitimate users
Rate limits are too aggressive. Normal usage patterns trigger 429 errors. Users can't complete basic workflows. Support tickets pile up. Developers abandon your API for a competitor.
✅ Set limits based on actual usage data, not guesses. Monitor the 95th and 99th percentile of request rates per user. Set limits at 2-3x the p99 to allow headroom. Provide clear rate limit headers so clients can self-throttle. Offer higher tiers for power users.