Rate Limiting
Control traffic and prevent abuse — token bucket, leaky bucket, sliding window counters, and distributed rate limiting. Protect systems from overload while ensuring fair usage.
The Big Picture — Why Rate Limiting Is Necessary
Without rate limiting, a single misbehaving client — whether it's a bug, a bot, or a deliberate attack — can consume all your system's resources and bring it down for everyone. Rate limiting is the bouncer at the door: it controls how many requests each client can make in a given time period.
The Water Tap Analogy
Your water system has a maximum flow capacity. Without a flow regulator, one house could open all taps full blast and drain pressure for the entire neighborhood. A rate limiter is the flow regulator — it ensures each house gets a fair share of water, prevents any single house from overwhelming the system, and keeps the overall pressure stable. The token bucket is a tap that fills a small tank — you can use water in bursts (empty the tank) but can't exceed the refill rate long-term. The leaky bucket is a tap with a constant drip — water flows out at a fixed rate regardless of how much is poured in.
What Happens Without Rate Limiting
System Overload
A traffic spike (legitimate or attack) overwhelms your servers. CPU maxes out, memory is exhausted, the database connection pool fills up. The entire system crashes for all users.
Abuse & Spam
Bots create thousands of accounts per minute, scrape your entire database, or brute-force login credentials. Without limits, there's no defense.
Resource Starvation
One heavy API consumer uses 90% of your capacity. The other 10,000 users share the remaining 10%. Unfair and unsustainable.
🔥 Key Insight
Rate limiting isn't just about security — it's about fairness and stability. Every public API (Stripe, GitHub, Twitter) rate-limits clients. Every internal service should rate-limit callers. It's a fundamental building block of reliable systems.
Where Rate Limiting Sits
Client (sends request) → API Gateway (first line of defense) → Rate Limiter (allow or reject) → Application (process request) → Database (store/retrieve)
Request:

```http
GET /api/users HTTP/1.1
Authorization: Bearer token_abc123
```

Response (rate limited):

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1706140800

{
  "error": {
    "code": "RATE_LIMITED",
    "message": "Rate limit exceeded. 100 requests per minute allowed.",
    "retry_after": 30
  }
}
```

Headers tell the client:
→ How many requests are allowed (Limit)
→ How many remain in this window (Remaining)
→ When the window resets (Reset)
→ How long to wait before retrying (Retry-After)
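On the client side, these headers let well-behaved clients self-throttle. A minimal sketch using the `requests` library (the URL, token, and retry policy are illustrative, not part of any specific API):

```python
import time
import requests

def get_with_backoff(url: str, token: str, max_retries: int = 3):
    """Retry on 429, honoring the server's Retry-After hint."""
    headers = {"Authorization": f"Bearer {token}"}
    for _ in range(max_retries):
        resp = requests.get(url, headers=headers)
        if resp.status_code != 429:
            return resp
        # Sleep for the server-suggested interval instead of hammering it.
        wait = int(resp.headers.get("Retry-After", "1"))
        time.sleep(wait)
    return resp
```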
Where to Apply
- ✅ API Gateway — first line, blocks bad traffic early
- ✅ Per-user — 100 requests/min per API key
- ✅ Per-endpoint — /api/login gets stricter limits than /api/products
- ✅ Per-IP — blocks DDoS before it reaches your app
- ✅ Internal services — prevent one service from overwhelming another
What to Limit By
- ✅ API key / auth token (most common for APIs)
- ✅ IP address (for unauthenticated endpoints)
- ✅ User ID (for authenticated users)
- ✅ Endpoint + method (different limits per route)
- ✅ Tenant ID (for multi-tenant SaaS)
Token Bucket Algorithm
The token bucket is the most widely used rate limiting algorithm. A bucket holds tokens. Tokens are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, allowing bursts up to that capacity.
```text
Configuration:
  Bucket capacity: 10 tokens (max burst)
  Refill rate:     1 token/second

Timeline:
  T=0:  Bucket has 10 tokens (full)
  T=0:  5 requests arrive → 5 tokens consumed → 5 remaining ✅
  T=0:  3 more requests → 3 tokens consumed → 2 remaining ✅
  T=0:  3 more requests → only 2 tokens → 2 allowed, 1 REJECTED ❌
  T=1:  1 token added (refill) → 3 remaining
  T=2:  1 token added → 4 remaining
  T=5:  3 more tokens added → 7 remaining
  T=10: Bucket full again (10 tokens, capped at capacity)

Key behavior:
  → Allows bursts up to bucket capacity (10 requests instantly)
  → Long-term rate is limited to the refill rate (1 req/sec)
  → After a burst, must wait for tokens to refill
  → Unused tokens accumulate (up to capacity)
```
```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max tokens (burst size)
        self.tokens = capacity          # current tokens
        self.refill_rate = refill_rate  # tokens added per second
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        # Refill tokens based on elapsed time, capped at capacity
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True   # ALLOW
        return False      # REJECT → 429 Too Many Requests
```
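A quick check of the timeline behavior above, using this class (and its `time` import):

```python
bucket = TokenBucket(capacity=10, refill_rate=1.0)

# Burst: the first 10 calls drain the bucket, the 11th is rejected.
print([bucket.allow_request() for _ in range(11)])
# → [True, True, True, True, True, True, True, True, True, True, False]

time.sleep(2)                   # ~2 tokens refill at 1 token/sec
print(bucket.allow_request())   # → True
```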
Strengths
- ✅ Allows bursts (great for real-world traffic patterns)
- ✅ Simple to implement and understand
- ✅ Memory-efficient (just 2 numbers per bucket)
- ✅ Used by AWS, Stripe, GitHub, most major APIs
- ✅ Easy to tune: adjust capacity (burst) and rate independently
Weaknesses
- ❌ Bursts can still overwhelm downstream if capacity is too high
- ❌ Doesn't guarantee smooth traffic (bursty by design)
- ❌ Distributed implementation needs shared state
- ❌ Clock skew can cause issues in distributed setups
🎯 Interview Insight
Token bucket is the default answer for "how would you implement rate limiting?" Say: "I'd use a token bucket with capacity of 100 and refill rate of 10/sec. This allows bursts of up to 100 requests but limits sustained throughput to 10/sec. I'd store the bucket state in Redis for distributed enforcement."
Leaky Bucket Algorithm
The leaky bucket processes requests at a constant rate, regardless of how fast they arrive. Incoming requests enter a queue (the bucket). The queue drains at a fixed rate. If the queue is full, new requests are dropped.
```text
Configuration:
  Queue capacity: 10 requests
  Drain rate:     1 request/second (constant output)

Timeline:
  T=0: 8 requests arrive at once → all enter queue (8/10)
  T=0: 1 request processed (drained) → 7 in queue
  T=1: 1 request processed → 6 in queue
  T=1: 5 more requests arrive → 11 total, queue holds 10 → 1 DROPPED ❌
  T=2: 1 request processed → 9 in queue
  ...
  T=9: Queue empty, all processed

Key behavior:
  → Output rate is ALWAYS constant (1 req/sec)
  → No bursts — traffic is smoothed
  → Excess requests are dropped (not queued beyond capacity)
  → Acts as a traffic shaper, not just a limiter
```
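A minimal in-memory sketch of this queue-and-drain behavior (single-process; in practice the drain would be driven by a worker loop or scheduler rather than piggybacking on `submit`):

```python
import time
from collections import deque

class LeakyBucket:
    """Queue incoming requests; release them at a constant rate, drop overflow."""
    def __init__(self, capacity: int, drain_rate: float):
        self.capacity = capacity      # max queued requests
        self.drain_rate = drain_rate  # requests released per second
        self.queue = deque()
        self.last_drain = time.monotonic()

    def _drain(self):
        # Release however many queued requests the elapsed time allows.
        now = time.monotonic()
        to_release = int((now - self.last_drain) * self.drain_rate)
        if to_release > 0:
            for _ in range(min(to_release, len(self.queue))):
                handler = self.queue.popleft()
                handler()  # processed at the constant output rate
            # Advance by whole releases, keeping the fractional remainder.
            self.last_drain += to_release / self.drain_rate

    def submit(self, handler) -> bool:
        self._drain()
        if len(self.queue) >= self.capacity:
            return False              # bucket full → DROP
        self.queue.append(handler)
        return True                   # queued; drains at the fixed rate
```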
| Feature | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst support | ✅ Yes (up to bucket capacity) | ❌ No (constant output rate) |
| Output pattern | Bursty (matches input pattern) | Smooth (constant rate) |
| When bucket is full | Rejects new requests | Drops new requests |
| Memory | 2 numbers (tokens, timestamp) | Queue of pending requests |
| Best for | APIs (allow bursts) | Network traffic shaping (smooth flow) |
| Real-world use | Stripe, GitHub, AWS APIs | Network routers, video streaming |
🎯 Interview Insight
Leaky bucket is less common in API rate limiting but important for traffic shaping. The key distinction: token bucket allows bursts (good for APIs where users make several requests quickly then pause), leaky bucket enforces a constant rate (good for network bandwidth or processing pipelines where smooth flow matters).
Sliding Window Counters
Sliding window counters count requests in a rolling time window. Instead of "100 requests per minute starting at :00" (fixed window), it's "100 requests in the last 60 seconds from right now" (sliding window). This is fairer and avoids the boundary spike problem.
Fixed Window vs Sliding Window
```text
Fixed window: 100 requests per minute, window resets at :00

12:00:00 ─────────────────── 12:01:00 ─────────────────── 12:02:00
[        window 1          ][        window 2          ]

User sends 0 requests from 12:00:00 to 12:00:55
User sends 100 requests at 12:00:55-12:00:59 ✅ (within limit)
User sends 100 requests at 12:01:00-12:01:05 ✅ (new window!)

Result: 200 requests in 10 seconds — 2x the intended rate!
The boundary between windows creates a loophole.

Sliding window: 100 requests in the last 60 seconds

At 12:01:05, the window is [12:00:05 → 12:01:05]
It counts ALL requests in that 60-second range
The 100 requests from 12:00:55 are still in the window
→ Only 0 more allowed until 12:00:55's requests expire

Result: never exceeds 100 requests in ANY 60-second period. Fair.
```
Sliding Window Log vs Sliding Window Counter
| Approach | How It Works | Accuracy | Memory | Best For |
|---|---|---|---|---|
| Fixed Window | Count per fixed time bucket (e.g., per minute at :00) | Low (boundary spike) | Very low (1 counter) | Simple, low-stakes limiting |
| Sliding Window Log | Store timestamp of every request, count within window | Perfect | High (1 entry per request) | When accuracy is critical |
| Sliding Window Counter | Weighted average of current and previous fixed windows | Very good (~99.7%) | Low (2 counters) | Production APIs (best balance) |
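For contrast, the sliding window log is just a pruned list of timestamps: exact, but one entry per request. A minimal in-memory sketch:

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact limiting: keep one timestamp per request, evict expired ones."""
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow_request(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that have fallen out of the rolling window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```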
```text
Limit: 100 requests per 60 seconds

Current window:  12:01:00 - 12:02:00 → count = 30
Previous window: 12:00:00 - 12:01:00 → count = 80
Current time:    12:01:20 (20 seconds into current window)

Window overlap: (60 - 20) / 60 = 66.7% of previous window

Weighted count = current_count + previous_count × overlap
               = 30 + 80 × 0.667
               = 30 + 53.3
               = 83.3

83.3 < 100 → ALLOW ✅
```

This approximation is ~99.7% accurate and uses only 2 counters. Cloudflare uses this approach for their rate limiting.
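The weighted-count check translates directly into code. A single-process sketch (production versions keep the two counters per key in Redis):

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: weight the previous fixed window's count."""
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.current_idx = 0     # index of the current fixed window
        self.current_count = 0
        self.previous_count = 0

    def allow_request(self) -> bool:
        now = time.time()
        idx = int(now // self.window)
        if idx != self.current_idx:
            # Rolled into a new window; if windows were skipped, previous is 0.
            self.previous_count = self.current_count if idx == self.current_idx + 1 else 0
            self.current_idx = idx
            self.current_count = 0

        # Fraction of the previous window still inside the sliding window.
        elapsed = now - idx * self.window
        overlap = (self.window - elapsed) / self.window

        weighted = self.current_count + self.previous_count * overlap
        if weighted < self.limit:
            self.current_count += 1
            return True
        return False
```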
🎯 Interview Insight
Sliding window counter is the best answer for production rate limiting. It's accurate (no boundary spikes), memory-efficient (2 counters per key), and fast (O(1) per check). Mention the weighted average formula and explain why fixed windows have the boundary spike problem.
Distributed Rate Limiting
In a single-server setup, rate limiting is simple — one counter in memory. But with 20 API servers behind a load balancer, each server has its own counter. A client sending 100 requests gets ~5 per server — each server thinks the client sent only 5. The global limit of 100 is never enforced.
```text
Limit: 100 requests/minute per user

Single server:
  All requests → 1 server → 1 counter → enforced correctly ✅

20 servers behind load balancer:
  100 requests → ~5 per server → each server's counter = 5
  Each server thinks: "5 < 100, allow" ✅
  Global total: 100 requests → all allowed

But if client sends 2,000 requests:
  → ~100 per server
  → each server: "100 = 100, limit reached"
  → Global total: 2,000 allowed instead of 100 ❌

The fix: shared state that ALL servers read/write.
```
Solutions
Centralized Store (Redis)
All servers check and increment a counter in Redis. INCR + EXPIRE gives atomic counter with TTL. Every request: INCR user:42:rate_limit → if > 100 → reject. Redis handles ~100K operations/sec — sufficient for most systems.
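In Python with redis-py, the fixed-window version of this is a few lines (key naming and limits are illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def allow_request(user_id: str, limit: int = 100, window: int = 60) -> bool:
    """Fixed-window counter in Redis: INCR, then EXPIRE on first hit."""
    key = f"rate_limit:user:{user_id}"
    count = r.incr(key)        # atomic increment; creates the key at 1
    if count == 1:
        r.expire(key, window)  # start the window on the first request
    return count <= limit
```

Note the INCR and EXPIRE are two separate round trips here; the Lua variant below folds the whole check into one atomic step.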
Redis + Lua Script (Atomic)
Use a Lua script in Redis to atomically check the counter, increment it, and set TTL — all in one round trip. This prevents race conditions where two servers read the same count simultaneously and both allow the request.
Local + Sync (Hybrid)
Each server maintains a local counter and periodically syncs with a central store. Faster (no Redis call per request) but less accurate (local counters can drift). Good for high-throughput, approximate limiting.
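A rough sketch of that hybrid: a local counter flushed to Redis on a timer. The class name and sync interval are illustrative, and window expiry is omitted for brevity:

```python
import threading
import redis

class LocalSyncLimiter:
    """Approximate limiter: count locally, sync to shared Redis periodically."""
    def __init__(self, r: redis.Redis, key: str, limit: int, sync_every: float = 1.0):
        self.r = r
        self.key = key
        self.limit = limit
        self.local_count = 0   # requests seen since the last sync
        self.global_count = 0  # last known cluster-wide total
        self.lock = threading.Lock()
        self.sync_every = sync_every
        self._arm_timer()

    def _arm_timer(self):
        timer = threading.Timer(self.sync_every, self._sync)
        timer.daemon = True
        timer.start()

    def _sync(self):
        with self.lock:
            delta, self.local_count = self.local_count, 0
        # Push our delta; INCRBY returns the new cluster-wide total.
        self.global_count = self.r.incrby(self.key, delta)
        self._arm_timer()  # NOTE: window reset / TTL handling omitted

    def allow_request(self) -> bool:
        with self.lock:
            if self.global_count + self.local_count >= self.limit:
                return False  # approximate: counters can drift between syncs
            self.local_count += 1
            return True
```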
Consistent Hashing for Sharding
Route all requests from the same user to the same server (sticky sessions via consistent hashing). That server's local counter is the global counter for that user. No shared state needed. Trade-off: uneven load distribution.
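A minimal consistent-hash ring that pins each user to one server (the virtual-node count and hash function are arbitrary choices for the sketch):

```python
import bisect
import hashlib

class HashRing:
    """Map each user to a stable server so its local counter is authoritative."""
    def __init__(self, servers: list[str], replicas: int = 100):
        self.ring = []  # sorted (point, server) pairs; replicas smooth the load
        for server in servers:
            for i in range(replicas):
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def server_for(self, user_id: str) -> str:
        # First ring point clockwise from the user's hash (wrapping around).
        idx = bisect.bisect(self.points, self._hash(user_id)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["api-1", "api-2", "api-3"])
print(ring.server_for("user:42"))  # always the same server for this user
```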
Lua script (runs atomically in Redis):

```lua
local key    = KEYS[1]            -- "rate_limit:user:42"
local limit  = tonumber(ARGV[1])  -- 100
local window = tonumber(ARGV[2])  -- 60 (seconds)

local current = redis.call("INCR", key)
if current == 1 then
    redis.call("EXPIRE", key, window)
end

if current > limit then
    return 0  -- REJECT
else
    return 1  -- ALLOW
end
```

Why Lua? The INCR + EXPIRE + comparison must be atomic. Without Lua, a race condition between INCR and the check could allow requests beyond the limit.
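Calling that script from an application server with redis-py might look like this (the script body matches the one above; key name and limits are illustrative):

```python
import redis

r = redis.Redis()

RATE_LIMIT_LUA = """
local key    = KEYS[1]
local limit  = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current = redis.call("INCR", key)
if current == 1 then
    redis.call("EXPIRE", key, window)
end
if current > limit then return 0 else return 1 end
"""

check_rate_limit = r.register_script(RATE_LIMIT_LUA)

def allow_request(user_id: str, limit: int = 100, window: int = 60) -> bool:
    # One round trip; check-and-increment runs atomically inside Redis.
    return check_rate_limit(keys=[f"rate_limit:user:{user_id}"],
                            args=[limit, window]) == 1
```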
| Approach | Accuracy | Latency | Complexity | Best For |
|---|---|---|---|---|
| Redis (centralized) | High | +0.5ms per request | Low | Most production APIs |
| Redis + Lua | Very high (atomic) | +0.5ms per request | Medium | Strict enforcement needed |
| Local + sync | Approximate | ~0ms (local) | Medium | Very high throughput, approximate OK |
| Sticky sessions | High (per-user) | ~0ms (local) | Low | When load balancer supports it |
🎯 Interview Insight
Distributed rate limiting with Redis is the standard answer. Say: "I'd use Redis with a Lua script for atomic check-and-increment. The key is user:{id}:rate_limit with a 60-second TTL. Each API server calls Redis before processing the request. Redis handles 100K+ ops/sec, so it won't be the bottleneck."
End-to-End Scenario
Let's design the rate limiting system for a public API platform like Stripe or GitHub.
```text
System: Public API with 50K developers, 20 API servers

Rate limits:
  Free tier:   60 requests/minute, burst of 10
  Pro tier:    1,000 requests/minute, burst of 100
  Enterprise:  10,000 requests/minute, burst of 500

Algorithm:   Token bucket (allows bursts, industry standard)
State store: Redis cluster (shared across all 20 API servers)

Request flow:
1. Client sends: GET /api/charges (API key in header)
2. API Gateway extracts API key → looks up tier
3. Rate limiter checks Redis:
   Key: "rate:{api_key}" → { tokens: 87, last_refill: 1706140790 }
   Refill tokens based on elapsed time
   If tokens > 0 → decrement, ALLOW
   If tokens = 0 → REJECT with 429
4. Response headers (always included):
   X-RateLimit-Limit: 1000
   X-RateLimit-Remaining: 87
   X-RateLimit-Reset: 1706140860
5. If rejected:
   HTTP 429 Too Many Requests
   Retry-After: 15
   Body: { "error": "Rate limit exceeded", "retry_after": 15 }

Defense layers:
  Layer 1: IP-based rate limit at CDN/WAF (10K req/min per IP)
  Layer 2: API key rate limit at gateway (per-tier limits)
  Layer 3: Per-endpoint limits (/api/login: 10/min, /api/charges: 1000/min)
  Layer 4: Global system limit (circuit breaker if total QPS > capacity)
```
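Step 3 of the flow can be sketched as a Redis-side token bucket. The Lua below is a hypothetical refill-and-take script in the spirit of the scenario, not Stripe's or GitHub's actual implementation; the key layout and idle TTL are assumptions:

```python
import time
import redis

r = redis.Redis()

TOKEN_BUCKET_LUA = """
local key         = KEYS[1]
local capacity    = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local now         = tonumber(ARGV[3])

local state  = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(state[1]) or capacity
local last   = tonumber(state[2]) or now

-- Refill based on elapsed time, capped at capacity
tokens = math.min(capacity, tokens + (now - last) * refill_rate)

local allowed = 0
if tokens >= 1 then
    tokens = tokens - 1
    allowed = 1
end

redis.call("HSET", key, "tokens", tokens, "last_refill", now)
redis.call("EXPIRE", key, 120)  -- clean up idle buckets
return allowed
"""

take_token = r.register_script(TOKEN_BUCKET_LUA)

def allow_request(api_key: str, capacity: int, refill_rate: float) -> bool:
    return take_token(keys=[f"rate:{api_key}"],
                      args=[capacity, refill_rate, time.time()]) == 1

# Pro tier: 1,000 req/min sustained ≈ 16.7 tokens/sec, burst of 100
allow_request("key_abc123", capacity=100, refill_rate=1000 / 60)
```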
💡 This Is How Stripe / GitHub Works
Multiple layers of rate limiting, token bucket for burst support, Redis for distributed state, and clear response headers so clients can self-throttle. The rate limiter is the first thing a request hits — before authentication, before business logic, before the database.
Trade-offs & Decision Making
| Algorithm | Burst Support | Accuracy | Complexity | Memory | Best For |
|---|---|---|---|---|---|
| Token Bucket | ✅ Yes | Medium | Low | Very low | APIs (most common) |
| Leaky Bucket | ❌ No | High | Low | Medium (queue) | Traffic shaping, streaming |
| Fixed Window | ❌ No | Low (boundary spike) | Very low | Very low | Simple, non-critical limits |
| Sliding Window Log | ❌ No | Perfect | Medium | High | When accuracy is critical |
| Sliding Window Counter | Limited | Very high | Medium | Low | Production APIs (best balance) |
🎯 Choose Token Bucket When
- Clients make bursty requests (common in APIs)
- You want simple, well-understood behavior
- Burst capacity and sustained rate need independent tuning
- Industry standard is expected (Stripe, AWS, GitHub)
🎯 Choose Sliding Window When
- Fairness is critical (no boundary exploitation)
- Strict per-minute/per-hour limits needed
- Login attempt limiting (security-sensitive)
- Compliance requires exact counting
Interview Questions
Q: What's the difference between token bucket and leaky bucket?
A: Token bucket allows bursts — tokens accumulate when the client is idle, and can be spent in a burst (up to bucket capacity). The long-term rate is limited to the refill rate. Leaky bucket enforces a constant output rate — requests enter a queue and are processed at a fixed rate, regardless of arrival pattern. No bursts allowed. Use token bucket for APIs (users make several requests quickly, then pause). Use leaky bucket for traffic shaping (need smooth, constant throughput).
Q: How do you implement distributed rate limiting?
A: Use Redis as a shared counter store. All API servers check and increment the same counter in Redis before processing a request. Use a Lua script for atomic check-and-increment (prevents race conditions). Key design: 'rate_limit:{user_id}' with INCR + EXPIRE. Redis handles 100K+ ops/sec — not a bottleneck. For very high throughput where Redis latency matters, use a hybrid approach: local counters with periodic sync to Redis (approximate but faster).
Q: Why is sliding window more accurate than fixed window?
A: Fixed window resets at boundaries (e.g., every minute at :00). A client can send 100 requests at :59 and 100 more at :01 — 200 requests in 2 seconds, double the intended rate. Sliding window counts requests in the last N seconds from NOW, so there's no boundary to exploit. The sliding window counter approximation (weighted average of current and previous windows) is ~99.7% accurate with only 2 counters per key.
You're designing rate limiting for a payment API
What algorithm and limits would you use?
Answer: Token bucket with strict limits. Configuration: capacity=5 (small burst for retries), refill rate=2/sec (sustained rate). Why token bucket: payment retries come in bursts (client retries 3 times quickly). Why strict: payments are irreversible — over-allowing is worse than over-rejecting. Distributed enforcement via Redis + Lua (atomic). Per-API-key limits (not per-IP, since multiple users share IPs). Additional: idempotency keys to handle retries safely, and a separate stricter limit on /create-charge vs /list-charges.
Your rate limiter uses fixed windows and users complain about unfair limiting
What's happening and how do you fix it?
Answer: Users are hitting the boundary spike problem. A user sends 50 requests at :59 (within the 100/min limit), then 50 more at :01 (new window, also within limit). But another user who sends 100 requests evenly throughout the minute gets rate-limited at :30. The first user got 100 requests in 2 seconds; the second got limited at 50 seconds. Fix: switch to sliding window counter. It counts requests in the last 60 seconds from NOW, so both users are treated fairly regardless of when in the minute they send requests.
Common Pitfalls
Not handling distributed state
Each API server has its own in-memory rate limiter. With 20 servers, a client gets 20x the intended limit (each server allows the full quota independently). The rate limiter is effectively useless.
✅ Use a shared store (Redis) for rate limit counters. All servers read/write the same counter. Use Lua scripts for atomic operations. If Redis latency is a concern, use local counters with periodic sync — approximate but still effective.
Using fixed windows (boundary spike)
Rate limit is 100/minute with fixed windows resetting at :00. Users exploit the boundary: 100 requests at :59, 100 more at :01 = 200 in 2 seconds. The system is overloaded despite the 'limit' being in place.
✅ Use sliding window counters. The weighted average of current and previous windows eliminates the boundary spike with minimal overhead (2 counters per key). If you must use fixed windows, set the limit to half the actual capacity to account for boundary doubling.
Ignoring burst requirements
Using a leaky bucket (constant rate) for an API where clients naturally make bursty requests. A mobile app loads a screen and makes 10 API calls in 500ms. The leaky bucket queues 9 of them, making the app feel slow. Users complain about performance.
✅ Use token bucket for APIs. Set the bucket capacity to the expected burst size (e.g., 10-20 for a mobile app's screen load). The sustained rate limits long-term usage while the burst capacity handles normal usage patterns without artificial delays.
Over-restricting legitimate users
Rate limits are too aggressive. Normal usage patterns trigger 429 errors. Users can't complete basic workflows. Support tickets pile up. Developers abandon your API for a competitor.
✅ Set limits based on actual usage data, not guesses. Monitor the 95th and 99th percentile of request rates per user. Set limits at 2-3x the p99 to allow headroom. Provide clear rate limit headers so clients can self-throttle. Offer higher tiers for power users.