Capacity Estimation
Master back-of-the-envelope calculations — QPS estimation, storage sizing, and bandwidth calculations. The math that drives every system design interview.
The Big Picture — What Is Capacity Estimation?
Capacity estimation is the art of making quick, reasonable calculations about how much a system needs — how many requests per second, how much storage, how much bandwidth. It's not about being precise to the byte. It's about getting within the right order of magnitude so you can make informed design decisions.
The Restaurant Planning Analogy
You're opening a restaurant. Before building anything, you estimate:
- How many customers per hour? (QPS) → Determines how many tables, chefs, and waiters you need.
- How much food to stock? (Storage) → Determines fridge size and supply orders.
- How fast can the kitchen serve? (Bandwidth) → Determines if you need a bigger kitchen or faster equipment.
You don't need exact numbers. You need to know: are we serving 50 people or 5,000? That's the difference between a food truck and a banquet hall. Getting the order of magnitude right is what matters.
🔥 Key Insight
In system design interviews, capacity estimation isn't a math test. It's a communication exercise. Interviewers want to see you make reasonable assumptions, state them clearly, and arrive at numbers that guide your architecture. The process matters more than the exact answer.
Why Estimations Matter
Avoid Over-Engineering
If your system handles 100 QPS, you don't need Kafka, Cassandra, and 50 microservices. A single PostgreSQL instance is fine. Estimation prevents building for scale you'll never reach.
Prevent Failure at Scale
If your system will handle 100K QPS and you designed for 1K, it will crash on launch day. Estimation reveals bottlenecks before they become outages.
Drive Architecture Decisions
The numbers tell you: do you need a cache? How many servers? Should you shard the database? Estimation turns vague requirements into concrete infrastructure.
Orders of Magnitude — The Only Precision You Need
| Power | Value | Name | Context |
|---|---|---|---|
| 10³ | 1,000 | Thousand | Small app, internal tool |
| 10⁴ | 10,000 | Ten thousand | Growing startup |
| 10⁵ | 100,000 | Hundred thousand | Medium-scale product |
| 10⁶ | 1,000,000 | Million | Large-scale product |
| 10⁷ | 10,000,000 | Ten million | Major platform |
| 10⁹ | 1,000,000,000 | Billion | Global-scale (Google, Meta) |
Keep these conversions at your fingertips. You'll use them in every estimation — seconds per day, bytes per data type, and typical sizes for common objects. Round aggressively: 86,400 seconds/day becomes 100K, and that's close enough.
Time:
1 day = 86,400 seconds ≈ 10⁵ seconds (use 100K)
1 month = 2.6M seconds ≈ 2.5 × 10⁶
1 year = 31.5M seconds ≈ 3 × 10⁷
Storage:
1 KB = 1,000 bytes (a short text message)
1 MB = 1,000 KB (a high-res photo)
1 GB = 1,000 MB (a movie)
1 TB = 1,000 GB (a small database)
1 PB = 1,000 TB (a large-scale system)
Characters:
1 char = 1 byte (ASCII) or 1-4 bytes (UTF-8; non-ASCII characters take 2-4)
A tweet (280 chars) ≈ 280 bytes ≈ 0.3 KB
A JSON API response ≈ 1-10 KB
A photo ≈ 200 KB - 2 MB
A video (1 min, 720p) ≈ 50-100 MB
💡 The 80/20 Rule of Estimation
Round aggressively. Use 100K instead of 86,400. Use 1 million instead of 1,048,576. The goal is speed and clarity, not precision. If your estimate is within 2-5x of reality, you've done well.
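To see how little the rounding costs, here is a minimal Python sketch. The helper name `relative_error` is hypothetical, introduced only for this illustration.

```python
def relative_error(exact: float, rounded: float) -> float:
    """Relative error introduced by using `rounded` in place of `exact`."""
    return abs(rounded - exact) / exact

# Seconds per day: 86,400 exact vs 100,000 rounded.
day_error = relative_error(86_400, 100_000)
# "One million" bytes: 1,048,576 (2**20) exact vs 1,000,000 rounded.
mega_error = relative_error(2**20, 1_000_000)

print(f"seconds/day: {day_error:.1%} off")
print(f"megabyte:    {mega_error:.1%} off")
```

Both errors stay under 20%, which is small next to the 2-5x tolerance an interview estimate allows.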
QPS / Throughput Estimation
QPS (Queries Per Second) is the number of requests your system handles every second. It's the most fundamental capacity metric — it determines how many servers, how much caching, and what database you need.
DAU → QPS Conversion
The most common starting point is converting Daily Active Users (DAU) into requests per second. The formula divides total daily actions by the number of seconds in a day. Use 100,000 seconds (not the exact 86,400) for cleaner math — precision doesn't matter at this stage. Always separate read QPS from write QPS since they have completely different scaling strategies.
QPS = (DAU × actions per user per day) / seconds per day
Where:
DAU = Daily Active Users
seconds per day ≈ 100,000 (use 10⁵ for easy math)
Example: Twitter-like service
DAU = 10 million (10⁷)
Each user reads 20 tweets/day and posts 2 tweets/day
Read QPS = (10M × 20) / 100K = 200M / 100K = 2,000 QPS
Write QPS = (10M × 2) / 100K = 20M / 100K = 200 QPS
→ Read-heavy system (10:1 read-to-write ratio)
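The same conversion can be sketched in a few lines of Python. The function name `average_qps` and the constant `SECONDS_PER_DAY` are illustrative names, not part of any library.

```python
SECONDS_PER_DAY = 100_000  # rounded from 86,400 for easy math

def average_qps(dau: int, actions_per_user_per_day: float,
                seconds_per_day: int = SECONDS_PER_DAY) -> float:
    """Average requests per second for a given user base and activity level."""
    return dau * actions_per_user_per_day / seconds_per_day

# Twitter-like service from the example: 10M DAU, 20 reads and 2 writes per user per day.
read_qps = average_qps(10_000_000, 20)
write_qps = average_qps(10_000_000, 2)

print(f"read QPS:  {read_qps:,.0f}")
print(f"write QPS: {write_qps:,.0f}")
print(f"read:write ratio: {read_qps / write_qps:.0f}:1")
```

Keeping reads and writes as separate calls mirrors the advice to size them separately.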
Peak vs Average
Traffic is never evenly distributed. Peak hours can be 2-5x the average. Design for peak, not average — otherwise your system crashes during rush hour.
Average QPS = 2,000 (from above)
Peak multiplier = 3x (typical for social media)
Peak QPS = 2,000 × 3 = 6,000 QPS
For spiky events (Black Friday, viral content):
Spike multiplier = 5-10x
Spike QPS = 2,000 × 10 = 20,000 QPS
Rule of thumb:
Design for 2-3x average for normal peak
Design for 5-10x average if you expect viral/seasonal spikes
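As a rough sketch, the peak multipliers can be turned into a server count. The per-server capacity of 1,000 QPS below is purely an assumption for illustration; the real figure depends on the workload and has to be measured.

```python
import math

def servers_needed(peak_qps: float, qps_per_server: float) -> int:
    """Smallest whole number of servers that covers the peak load."""
    return math.ceil(peak_qps / qps_per_server)

AVERAGE_QPS = 2_000
QPS_PER_SERVER = 1_000  # assumed capacity, not a measured number

normal_peak = AVERAGE_QPS * 3   # typical daily peak
spike_peak = AVERAGE_QPS * 10   # viral or seasonal spike

normal_servers = servers_needed(normal_peak, QPS_PER_SERVER)
spike_servers = servers_needed(spike_peak, QPS_PER_SERVER)

print(f"normal peak: {normal_peak:,} QPS -> {normal_servers} servers")
print(f"spike peak:  {spike_peak:,} QPS -> {spike_servers} servers")
```

The gap between the two counts is the capacity you either keep idle or add on demand.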
📖 Read-Heavy Systems
- Social media feeds, news sites, product catalogs
- Read:Write ratio typically 10:1 to 100:1
- Strategy: cache aggressively; a cache hit rate of 80-95% reduces DB load dramatically
✏️ Write-Heavy Systems
- Logging, analytics, IoT sensor data, chat messages
- Write:Read ratio can be 10:1 or higher
- Strategy: write-optimized storage such as Cassandra, Kafka, or append-only storage
🎯 Interview Insight
Always clarify assumptions with the interviewer: "I'll assume 10M DAU, each user makes 20 reads and 2 writes per day. Does that sound reasonable?" This shows structured thinking and gives the interviewer a chance to adjust the scope.
Storage Sizing
Storage estimation answers: how much disk space does this system need? The formula is simple, but the details matter — replication, indexes, growth over time, and hot vs cold storage.
The Core Formula
Storage estimation starts with three numbers: how many new records are created each day, how large each record is, and how long you need to keep them. Multiply these together, then layer on the real-world multipliers — replication for durability, indexes for query performance, and a growth buffer so you don't run out of disk mid-quarter.
Daily storage = (new records per day) × (size per record)
Yearly storage = daily storage × 365
Total storage = yearly storage × retention years × replication factor
Example: Chat application (like WhatsApp)
Assumptions:
DAU = 50M users
Each user sends 40 messages/day
Average message size = 100 bytes (text only)
Daily new messages = 50M × 40 = 2 billion messages/day
Daily storage = 2B × 100 bytes = 200 GB/day
Yearly storage = 200 GB × 365 = 73 TB/year
With 3x replication = 73 × 3 = 219 TB/year
With 20% index overhead = 219 × 1.2 = ~263 TB/year
5-year plan: 263 × 5 = ~1.3 PB
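The chat example can be replayed as a short Python sketch. The function name `yearly_storage_bytes` and its default multipliers (3x replication, 20% index overhead) are assumptions taken from the example above, not fixed rules.

```python
TB = 10**12
PB = 10**15

def yearly_storage_bytes(records_per_day: int, bytes_per_record: int,
                         replication: int = 3, index_overhead: float = 1.2) -> float:
    """Bytes written per year, including replication and index overhead."""
    daily = records_per_day * bytes_per_record
    return daily * 365 * replication * index_overhead

daily_messages = 50_000_000 * 40          # 50M DAU x 40 messages each
per_year = yearly_storage_bytes(daily_messages, 100)
five_year = per_year * 5

print(f"per year:  {per_year / TB:.1f} TB")
print(f"five-year: {five_year / PB:.2f} PB")
```

Changing one assumption (for example, 200-byte messages) and rerunning shows how sensitive the total is to record size.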
Don't Forget These Multipliers
| Factor | Multiplier | Why |
|---|---|---|
| Replication | 2-3x | Data is copied across nodes for durability and read performance |
| Index overhead | 1.1-1.3x | B-tree indexes, secondary indexes take additional space |
| Metadata | 1.05-1.1x | Timestamps, IDs, internal DB overhead per row |
| Growth buffer | 1.5-2x | Plan for 1-2 years of growth beyond current estimates |
| Media storage | 10-100x text | Images (200KB-2MB), videos (50-100MB) dwarf text data |
Replication and index overhead are the most commonly forgotten multipliers in interview estimates.
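Because these multipliers compound, it helps to see their combined effect. The sketch below multiplies the low and high ends of each range from the table; media is left out because it is a separate data set rather than a multiplier on the same rows. The dictionary name `MULTIPLIER_RANGES` is hypothetical.

```python
import math

# (low, high) ends of each multiplier range from the table above.
MULTIPLIER_RANGES = {
    "replication": (2, 3),
    "index overhead": (1.1, 1.3),
    "metadata": (1.05, 1.1),
    "growth buffer": (1.5, 2),
}

low = math.prod(lo for lo, _ in MULTIPLIER_RANGES.values())
high = math.prod(hi for _, hi in MULTIPLIER_RANGES.values())

raw_tb = 1  # raw data size before any overhead
print(f"{raw_tb} TB raw -> {raw_tb * low:.1f} to {raw_tb * high:.1f} TB provisioned")
```

Under these ranges, every raw terabyte turns into roughly 3.5 to 8.6 TB of provisioned capacity.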
Hot vs Cold Storage
🔥 Hot Storage (Frequently Accessed)
- Recent data (last 30 days of messages, recent orders)
- Stored on SSD or in-memory (Redis)
- Fast access, expensive per GB
- Typically 5-20% of total data
❄️ Cold Storage (Rarely Accessed)
- Old data (messages from 2 years ago, archived logs)
- Stored on HDD or object storage (S3)
- Slow access, cheap per GB
- Typically 80-95% of total data
When a system handles user-uploaded media (photos, videos), the storage math changes dramatically. A single image is on the order of 1,000x larger than a text message (hundreds of KB versus roughly 100 bytes), and most systems store multiple resolutions of each image for responsive delivery. Media storage almost always dominates text storage by 100x or more.
Assumptions:
DAU = 100M users
10% post a photo daily = 10M photos/day
Average photo size = 500 KB (after compression)
Store 3 sizes: thumbnail (50KB) + medium (200KB) + original (500KB)
Daily photo storage = 10M × (50 + 200 + 500) KB
= 10M × 750 KB
= 7.5 TB/day
Yearly = 7.5 TB × 365 = ~2.7 PB/year
With replication (3x) = ~8.1 PB/year
→ This is why photo services at Instagram's scale keep media in object storage (such as S3), not a database.
→ Photos are served via CDN, not from the origin.
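A minimal Python sketch of the photo calculation, using integer arithmetic so nothing is rounded along the way. The dictionary `VARIANTS_KB` simply restates the three sizes assumed above.

```python
KB = 10**3
TB = 10**12
PB = 10**15

# Stored renditions per photo, in KB (from the assumptions above).
VARIANTS_KB = {"thumbnail": 50, "medium": 200, "original": 500}

photos_per_day = 100_000_000 // 10            # 10% of 100M DAU post a photo
bytes_per_photo = sum(VARIANTS_KB.values()) * KB

daily = photos_per_day * bytes_per_photo
yearly = daily * 365
replicated = yearly * 3

print(f"daily:      {daily / TB:.1f} TB")
print(f"yearly:     {yearly / PB:.2f} PB")
print(f"replicated: {replicated / PB:.2f} PB")
```

Without intermediate rounding the replicated total comes out near 8.2 PB; the hand calculation's ~8.1 PB results from rounding the yearly figure to 2.7 PB first. Either way the conclusion is the same.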
🎯 Interview Insight
Always plan for growth. If the system needs 10 TB today, estimate for 3-5 years. Storage is cheap — running out of it is not. Mention hot/cold tiering to show you think about cost optimization, not just raw capacity.
Bandwidth Calculations
Bandwidth is the amount of data flowing through your system per second. It determines whether your network, servers, and CDN can handle the load — or become the bottleneck.
The Core Formula
Bandwidth is simply QPS multiplied by average response size. For API services returning JSON, this is usually modest (tens of MB/s). For media-heavy services (video streaming, image delivery), bandwidth grows into hundreds of Gbps and becomes the primary bottleneck — far before CPU or storage becomes a concern.
Bandwidth = QPS × average response size
Example: API service
QPS = 5,000 requests/sec
Average response = 10 KB (JSON)
Bandwidth = 5,000 × 10 KB = 50,000 KB/s = 50 MB/s = 400 Mbps
Example: Video streaming service
Concurrent viewers = 100,000
Bitrate = 5 Mbps (1080p)
Bandwidth = 100,000 × 5 Mbps = 500 Gbps
→ This is why Netflix uses CDN edge servers.
→ Serving 500 Gbps from a single data center is impractical.
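The byte-to-bit conversion is where bandwidth estimates most often slip, so a small sketch helps. The helper `bytes_per_sec_to_mbps` is a hypothetical name; it assumes decimal units (1 MB = 10⁶ bytes), as the rest of this guide does.

```python
def bytes_per_sec_to_mbps(bytes_per_sec: float) -> float:
    """Convert bytes/second to megabits/second (1 byte = 8 bits)."""
    return bytes_per_sec * 8 / 1_000_000

# API service: 5,000 requests/sec, 10 KB per response.
api_bytes_per_sec = 5_000 * 10_000
api_mbps = bytes_per_sec_to_mbps(api_bytes_per_sec)

# Video streaming: 100,000 concurrent viewers at 5 Mbps each.
video_gbps = 100_000 * 5 / 1_000

print(f"API:   {api_bytes_per_sec / 1_000_000:.0f} MB/s = {api_mbps:.0f} Mbps")
print(f"Video: {video_gbps:.0f} Gbps")
```

Note that the streaming case starts from a bitrate, so no factor of 8 is needed there.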
Ingress vs Egress
⬆️ Ingress (Incoming Traffic)
- Data flowing INTO your system
- User uploads: photos, videos, files
- API requests with payloads (POST/PUT bodies)
- Usually smaller than egress for most web apps
⬇️ Egress (Outgoing Traffic)
- Data flowing OUT of your system
- API responses, page loads, media delivery
- Usually the bottleneck (and the expensive part)
- CDN offloads 60-90% of egress traffic
A CDN dramatically changes the bandwidth equation. When 90% of requests are served from edge servers (cache hits), your origin only handles 10% of the total egress. This reduces your origin bandwidth requirement by 10x and cuts costs significantly — edge delivery is cheaper than origin delivery.
Without CDN:
Total egress = 50 MB/s (all from origin servers)
Origin bandwidth cost: HIGH
With CDN (90% cache hit rate):
CDN serves: 50 × 0.9 = 45 MB/s (from edge, cheap)
Origin serves: 50 × 0.1 = 5 MB/s (cache misses only)
Origin bandwidth reduced by 90%
This is why CDN is the first thing you add when bandwidth
becomes a concern. It's cheaper and faster than adding servers.
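The effect of cache hit rate on origin traffic can be sketched as follows. The function `split_egress` is a hypothetical helper, and the hit rates tried are the ones mentioned in this section.

```python
def split_egress(total_mb_per_sec: float, cache_hit_rate: float):
    """Split total egress into (edge, origin) traffic for a given CDN hit rate."""
    edge = total_mb_per_sec * cache_hit_rate
    origin = total_mb_per_sec - edge
    return edge, origin

TOTAL_EGRESS_MB_S = 50

for hit_rate in (0.60, 0.90, 0.95):
    edge, origin = split_egress(TOTAL_EGRESS_MB_S, hit_rate)
    print(f"hit rate {hit_rate:.0%}: edge {edge:.1f} MB/s, origin {origin:.1f} MB/s")
```

Going from a 90% to a 95% hit rate halves origin traffic, which is why hit rate is worth tuning once a CDN is in place.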
🎯 Interview Insight
Bandwidth becomes the bottleneck for media-heavy systems (video, images) long before CPU or storage does. Always mention a CDN when discussing bandwidth. For API-only services, bandwidth is rarely the bottleneck; QPS and database throughput matter more.
End-to-End Estimation Example
Let's estimate capacity for a URL shortener (like bit.ly) end-to-end. This is a classic interview question.
Step 1: State Assumptions
Always start by explicitly listing your assumptions. This grounds the entire estimation — if the interviewer thinks your numbers are off, they can correct you here before you build on them. For a URL shortener, the key inputs are: how many URLs are created, how often they're accessed, how large each record is, and how long we keep them.
Users:
100M total URLs created per month
Read:Write ratio = 100:1 (URLs are created once, read many times)
Data:
Short URL: 7 characters = 7 bytes
Long URL: average 200 characters = 200 bytes
Metadata (created_at, user_id, etc.): ~100 bytes
Total per record: ~300 bytes
Retention: 5 years
Step 2: QPS Estimation
With 100M URLs created per month, we divide by seconds-per-month to get write QPS. Then we use the read:write ratio (100:1 for a URL shortener — URLs are created once but clicked many times) to derive read QPS. Finally, multiply by 3x for peak traffic to ensure the system can handle real-world spikes.
Write QPS (URL creation):
100M URLs/month ÷ (30 days × 100K seconds/day)
= 100M / 3M
≈ 33 writes/sec
Read QPS (URL redirects):
Read:Write = 100:1
= 33 × 100 = 3,300 reads/sec
Peak QPS (3x average):
Peak writes = ~100/sec
Peak reads = ~10,000/sec
→ Read-heavy system. Cache the most popular URLs.
Step 3: Storage Estimation
Each URL record is about 300 bytes (short code + long URL + metadata). Multiply by the number of new records per month, scale to 5 years, then add the overhead multipliers for replication and indexes. The resulting 6.5 TB is very manageable for a modern database — the real challenge is indexing 6 billion records for fast lookups.
New URLs per month: 100M
Size per URL record: 300 bytes
Monthly storage = 100M × 300 bytes = 30 GB/month
Yearly storage = 30 GB × 12 = 360 GB/year
5-year storage = 360 GB × 5 = 1.8 TB
With replication (3x): 1.8 × 3 = 5.4 TB
With index overhead (20%): 5.4 × 1.2 = ~6.5 TB
Total URLs in 5 years: 100M × 12 × 5 = 6 billion URLs
→ 6.5 TB fits comfortably in one database cluster; sharding is optional at this size.
→ 6 billion records needs a good indexing strategy.
Step 4: Bandwidth Estimation
URL shorteners have tiny responses — just an HTTP 301 redirect with a Location header (~500 bytes). Even at 10K reads/sec, the bandwidth is only 40 Mbps — trivial for any modern server. This tells us bandwidth is not the bottleneck. The bottleneck is read QPS (thousands of key lookups per second), which is easily solved with an in-memory cache like Redis.
Write bandwidth (ingress):
33 writes/sec × 300 bytes = ~10 KB/s (negligible)
Read bandwidth (egress):
Each redirect response: HTTP 301 + Location header ≈ 500 bytes
3,300 reads/sec × 500 bytes = 1.65 MB/s ≈ 13 Mbps
Peak read bandwidth:
10,000 reads/sec × 500 bytes = 5 MB/s ≈ 40 Mbps
→ Bandwidth is NOT the bottleneck for a URL shortener.
→ The bottleneck is read QPS → solved with caching (Redis).
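Steps 1 through 4 can be replayed as one short script. This is a sketch under the assumptions stated in Step 1; all variable names are illustrative.

```python
SECONDS_PER_DAY = 100_000      # rounded from 86,400
TB = 10**12

# Step 1: assumptions
urls_per_month = 100_000_000
read_write_ratio = 100
record_bytes = 300
redirect_bytes = 500
retention_years = 5
peak_multiplier = 3

# Step 2: QPS
write_qps = urls_per_month / (30 * SECONDS_PER_DAY)
read_qps = write_qps * read_write_ratio
peak_read_qps = read_qps * peak_multiplier

# Step 3: storage (3x replication, 20% index overhead)
raw_bytes = urls_per_month * 12 * retention_years * record_bytes
total_bytes = raw_bytes * 3 * 1.2

# Step 4: egress bandwidth in Mbps
avg_read_mbps = read_qps * redirect_bytes * 8 / 1_000_000
peak_read_mbps = peak_read_qps * redirect_bytes * 8 / 1_000_000

print(f"writes/sec: {write_qps:.0f}, reads/sec: {read_qps:,.0f}, peak reads/sec: {peak_read_qps:,.0f}")
print(f"storage: {total_bytes / TB:.1f} TB over {retention_years} years")
print(f"egress: {avg_read_mbps:.0f} Mbps average, {peak_read_mbps:.0f} Mbps peak")
```

The script carries full precision, so it reports about 3,333 reads/sec where the hand calculation rounds to 3,300. The difference does not change any design decision.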
Step 5: Architecture Implications
This is the payoff: the numbers directly tell you what architecture to build. 3,300 reads/sec with tiny payloads means a Redis cache can absorb the hot path. 6.5 TB over 5 years fits in a modest PostgreSQL cluster or a DynamoDB table. 13 Mbps means no CDN is needed. Compare this to a media-heavy system like Instagram, where storage and bandwidth would be orders of magnitude larger and the architecture would be fundamentally different.
QPS: ~3,300 reads/sec → Redis cache handles this easily
Storage: ~6.5 TB over 5 years → PostgreSQL with sharding, or DynamoDB
Bandwidth: ~13 Mbps → not a concern
Architecture:
→ Redis cache for hot URLs (top 20% of URLs get 80% of traffic)
→ PostgreSQL or DynamoDB for persistent storage
→ No CDN needed (responses are tiny redirects, not media)
→ Single server can handle this; add a second for redundancy
If this were Instagram (media-heavy):
→ Storage would be PBs, not TBs
→ Bandwidth would be Gbps, not Mbps
→ CDN would be essential
→ Object storage (S3) instead of database for media
🔥 This Is What Interviewers Want
State assumptions → calculate QPS → calculate storage → calculate bandwidth → derive architecture decisions. The numbers should drive the design, not the other way around. Show this process clearly and you'll ace the estimation portion.
Trade-offs & Design Decisions
Accuracy vs Speed of Estimation
| Dimension | Quick Estimate (Interview) | Detailed Estimate (Production) |
|---|---|---|
| Time spent | 2-5 minutes | Days to weeks |
| Precision | Within 2-5x of reality | Within 10-20% of reality |
| Assumptions | Round aggressively (100K sec/day) | Measure actual traffic patterns |
| Purpose | Guide architecture decisions | Size infrastructure, plan budget |
| When to use | System design interviews, early design | Capacity planning, procurement |
Overestimation vs Underestimation
Overestimation risks
- ❌ Wasted money on unused infrastructure
- ❌ Over-engineered architecture (complexity without need)
- ❌ Slower development (building for scale you don't have)
- ❌ Premature optimization
Underestimation risks
- ❌ System crashes under real load
- ❌ Emergency scaling (expensive, stressful)
- ❌ Data loss if storage runs out
- ❌ Poor user experience (slow responses, timeouts)
💡 The Sweet Spot
Estimate for 3-5x your expected load. This gives you headroom for growth and traffic spikes without massively over-provisioning. Cloud infrastructure makes this easier — you can scale up when needed, so slight underestimation is less catastrophic than it used to be.
Cost vs Performance
| Decision | Cheaper Option | Faster Option |
|---|---|---|
| Storage | HDD / S3 Standard (~$0.023/GB-month) | SSD / S3 Express One Zone (~$0.16/GB-month) |
| Caching | No cache (hit DB every time) | Redis cache ($$$, but 100x faster reads) |
| Replication | Single copy (risk of data loss) | 3x replication (3x storage cost, high durability) |
| CDN | Serve from origin (high latency) | CDN edge delivery (CDN cost, but 10x faster) |
| Compute | Fewer, larger servers | More, smaller servers (better fault tolerance) |
Interview Questions
Estimation-based and scenario-based questions you're likely to encounter.
Q: How do you calculate QPS from DAU?
A: QPS = (DAU × actions per user per day) / seconds per day. Use 100,000 seconds per day for easy math (actual is 86,400). Example: 10M DAU, 20 reads per user per day → 10M × 20 / 100K = 2,000 read QPS. Always separate read QPS and write QPS — they have different scaling strategies. Multiply by 2-3x for peak traffic.
Q: Estimate storage for WhatsApp messages
A: Assumptions: 2B users, 100M DAU, 50 messages per user per day, average message 100 bytes. Daily messages: 100M × 50 = 5B messages. Daily storage: 5B × 100 bytes = 500 GB/day. Yearly: 500 GB × 365 = ~180 TB/year. With 3x replication: ~540 TB/year. With media (10% of messages have a 200KB image): 100M × 50 × 0.1 × 200KB = 100 TB/day for images alone. Media dominates storage — text is negligible in comparison.
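A quick sketch of the text-versus-media comparison from this answer, using the same assumptions (100M DAU, 50 messages per user, 100-byte text messages, 10% of messages carrying a 200KB image). Variable names are illustrative.

```python
KB = 10**3
GB = 10**9
TB = 10**12

dau = 100_000_000
messages_per_day = dau * 50

text_bytes = messages_per_day * 100            # 100 bytes per text message
image_messages = messages_per_day // 10        # 10% of messages carry an image
image_bytes = image_messages * 200 * KB        # 200 KB per image

ratio = image_bytes / text_bytes

print(f"text:   {text_bytes / GB:.0f} GB/day")
print(f"images: {image_bytes / TB:.0f} TB/day")
print(f"images are {ratio:.0f}x the text volume")
```

Even with only one message in ten carrying an image, media outweighs text by a factor of 200 under these assumptions.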
Design a notification system for 50M users
Estimate the QPS and storage requirements
Answer: Assumptions: 50M DAU, each user receives 10 notifications/day, each notification is 500 bytes (title, body, metadata, timestamp). Write QPS: 50M × 10 / 100K = 5,000 writes/sec. Peak: 15,000 writes/sec. Daily storage: 50M × 10 × 500 bytes = 250 GB/day. Yearly: ~91 TB. With 3x replication: ~273 TB. Architecture: write-optimized DB (Cassandra) for notifications, Redis for unread counts, push via WebSockets. Notifications older than 30 days → cold storage.
Your image hosting service gets 1M uploads per day
Estimate storage and bandwidth needs
Answer: Assumptions: 1M uploads/day, average image 1MB, store 3 sizes (thumbnail 50KB, medium 300KB, original 1MB). Daily storage: 1M × (50 + 300 + 1000) KB = 1M × 1.35 MB = 1.35 TB/day. Yearly: ~490 TB. With 3x replication: ~1.5 PB/year. Read bandwidth: if each image is viewed 100 times on average, that is 100M views/day. Treating every view as a 50KB thumbnail (80% of views are) gives a floor: 100M × 50KB / 100K sec = 50 MB/s ≈ 400 Mbps on average. The 20% of views that load larger sizes, plus a 3x peak, push this several times higher. A CDN is still the right call for latency and egress cost: with a 95% cache hit rate, the origin serves only about 2.5 MB/s of that baseline.
You're told the system has 500M DAU
What's the first thing you estimate?
Answer: QPS. 500M DAU is meaningless without knowing actions per user. Ask: 'What does each user do?' If it's a read-heavy feed (20 reads/day): 500M × 20 / 100K = 100K read QPS. That's serious scale: you need caching (Redis cluster), read replicas, CDN, and likely database sharding. If it's a messaging app (50 messages/day): 500M × 50 / 100K = 250K write QPS. That's write-heavy: you need a write-optimized store such as Cassandra, or a log like Kafka in front of storage, rather than a single PostgreSQL instance. The DAU alone doesn't tell you the architecture; the access pattern does.
Common Mistakes
These mistakes lead to wrong estimates and bad architecture decisions.
Ignoring peak traffic
Designing for average QPS and wondering why the system crashes at 6 PM. Average QPS is 2,000 but peak is 10,000. If your system handles 3,000, it fails during peak hours — exactly when the most users are online.
✅ Always calculate peak QPS (2-3x average for normal traffic, 5-10x for events like Black Friday). Design your system to handle peak, not average. Use auto-scaling to handle spikes cost-effectively.
Forgetting replication and indexes
Estimating 1 TB of storage and provisioning exactly 1 TB. With 3x replication, you need 3 TB. With indexes (20% overhead), you need 3.6 TB. With growth buffer, you need 5+ TB. Running out of storage in production is a crisis.
✅ Always multiply raw storage by: replication factor (2-3x) × index overhead (1.2x) × growth buffer (1.5-2x). A 1 TB estimate becomes 4-7 TB in practice.
Unrealistic assumptions
Assuming every user is active 24/7, or that all 1 billion registered users are DAU. If you have 1B registered users, DAU is typically 10-30% (100-300M). Not all users are equally active.
✅ Use realistic ratios: DAU is typically 10-30% of total users. Actions per user vary by product (social media: 20-50 actions/day, e-commerce: 5-10 actions/day). State your assumptions explicitly and ask the interviewer if they're reasonable.
Not explaining reasoning
Jumping to '10,000 QPS' without showing how you got there. The interviewer can't evaluate your thinking if you just state a number. The process is more important than the answer.
✅ Always show your work: 'We have 10M DAU, each user reads 20 tweets per day, so read QPS = 10M × 20 / 100K = 2,000. With 3x peak multiplier, that's 6,000 peak QPS.' This takes 30 seconds and demonstrates structured thinking.
Forgetting media dominates storage
Carefully estimating text storage (messages, metadata) and ignoring that a single image is 1000x larger than a text message. For any system with user-uploaded media, images and videos will be 95%+ of total storage.
✅ Always ask: 'Does this system handle media?' If yes, estimate media storage separately, because it will dwarf everything else. A chat app's text messages might be 500 GB/day, but image attachments could be 50 TB/day.