Capacity Estimation
Master back-of-the-envelope calculations — QPS estimation, storage sizing, and bandwidth calculations. The math that drives every system design interview.
The Big Picture — What Is Capacity Estimation?
Capacity estimation is the art of making quick, reasonable calculations about how much a system needs — how many requests per second, how much storage, how much bandwidth. It's not about being precise to the byte. It's about getting within the right order of magnitude so you can make informed design decisions.
The Restaurant Planning Analogy
You're opening a restaurant. Before building anything, you estimate: How many customers per hour? (QPS) → Determines how many tables, chefs, and waiters you need. How much food to stock? (Storage) → Determines fridge size and supply orders. How fast can the kitchen serve? (Bandwidth) → Determines if you need a bigger kitchen or faster equipment. You don't need exact numbers — you need to know: are we serving 50 people or 5,000? That's the difference between a food truck and a banquet hall. Getting the order of magnitude right is what matters.
🔥 Key Insight
In system design interviews, capacity estimation isn't a math test. It's a communication exercise. Interviewers want to see you make reasonable assumptions, state them clearly, and arrive at numbers that guide your architecture. The process matters more than the exact answer.
Why Estimations Matter
Avoid Over-Engineering
If your system handles 100 QPS, you don't need Kafka, Cassandra, and 50 microservices. A single PostgreSQL instance is fine. Estimation prevents building for scale you'll never reach.
Prevent Failure at Scale
If your system will handle 100K QPS and you designed for 1K, it will crash on launch day. Estimation reveals bottlenecks before they become outages.
Drive Architecture Decisions
The numbers tell you: do you need a cache? How many servers? Should you shard the database? Estimation turns vague requirements into concrete infrastructure.
Orders of Magnitude — The Only Precision You Need
| Power | Value | Name | Context |
|---|---|---|---|
| 10³ | 1,000 | Thousand | Small app, internal tool |
| 10⁴ | 10,000 | Ten thousand | Growing startup |
| 10⁵ | 100,000 | Hundred thousand | Medium-scale product |
| 10⁶ | 1,000,000 | Million | Large-scale product |
| 10⁷ | 10,000,000 | Ten million | Major platform |
| 10⁹ | 1,000,000,000 | Billion | Global-scale (Google, Meta) |
Time:
- 1 day = 86,400 seconds ≈ 10⁵ seconds (use 100K)
- 1 month ≈ 2.6M seconds ≈ 2.5 × 10⁶
- 1 year ≈ 31.5M seconds ≈ 3 × 10⁷

Storage:
- 1 KB = 1,000 bytes (a short text message)
- 1 MB = 1,000 KB (a high-res photo)
- 1 GB = 1,000 MB (a movie)
- 1 TB = 1,000 GB (a small database)
- 1 PB = 1,000 TB (a large-scale system)

Typical sizes:
- 1 char = 1 byte (ASCII) or up to 4 bytes (UTF-8)
- A tweet (280 chars) ≈ 280 bytes ≈ 0.3 KB
- A JSON API response ≈ 1-10 KB
- A photo ≈ 200 KB - 2 MB
- A video (1 min, 720p) ≈ 50-100 MB
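These shortcuts are reused throughout this page. A minimal Python sketch of the same constants, rounded aggressively on purpose (the payload sizes at the bottom are illustrative assumptions, not measurements):

```python
# Rounded constants for back-of-the-envelope math (precision is intentionally sacrificed).
SECONDS_PER_DAY = 100_000        # actual: 86,400
SECONDS_PER_MONTH = 2_500_000    # actual: ~2.6M
SECONDS_PER_YEAR = 30_000_000    # actual: ~31.5M

KB = 1_000            # bytes; decimal units are close enough for estimation
MB = 1_000 * KB
GB = 1_000 * MB
TB = 1_000 * GB
PB = 1_000 * TB

# Rough payload sizes (assumed midpoints of the ranges above)
TWEET = 300                      # bytes
JSON_RESPONSE = 5 * KB
PHOTO = 500 * KB
VIDEO_MINUTE_720P = 75 * MB
```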
💡 The 80/20 Rule of Estimation
Round aggressively. Use 100K instead of 86,400. Use 1 million instead of 1,048,576. The goal is speed and clarity, not precision. If your estimate is within 2-5x of reality, you've done well.
QPS / Throughput Estimation
QPS (Queries Per Second) is the number of requests your system handles every second. It's the most fundamental capacity metric — it determines how many servers, how much caching, and what database you need.
DAU → QPS Conversion
QPS = (DAU × actions per user per day) / seconds per day

Where:
DAU = Daily Active Users
seconds per day ≈ 100,000 (use 10⁵ for easy math)

Example: Twitter-like service
DAU = 10 million (10⁷)
Each user reads 20 tweets/day and posts 2 tweets/day

Read QPS = (10M × 20) / 100K = 200M / 100K = 2,000 QPS
Write QPS = (10M × 2) / 100K = 20M / 100K = 200 QPS

→ Read-heavy system (10:1 read-to-write ratio)
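The same conversion as a tiny helper. This is a sketch only; the function name and inputs are the hypothetical Twitter-like numbers from the example above:

```python
SECONDS_PER_DAY = 100_000  # rounded from 86,400

def qps(dau: int, actions_per_user_per_day: float) -> float:
    """Average queries per second for a single action type."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY

dau = 10_000_000              # assumption: 10M daily active users
read_qps = qps(dau, 20)       # 20 feed reads per user per day
write_qps = qps(dau, 2)       # 2 posts per user per day
print(read_qps, write_qps)    # 2000.0 200.0 → 10:1 read-heavy
```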
Peak vs Average
Traffic is never evenly distributed. Peak hours can be 2-5x the average. Design for peak, not average — otherwise your system crashes during rush hour.
Average QPS = 2,000 (from above)
Peak multiplier = 3x (typical for social media)
Peak QPS = 2,000 × 3 = 6,000 QPS

For spiky events (Black Friday, viral content):
Spike multiplier = 5-10x
Spike QPS = 2,000 × 10 = 20,000 QPS

Rule of thumb:
- Design for 2-3x average for normal peak
- Design for 5-10x average if you expect viral/seasonal spikes
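Extending the helper with a peak multiplier is a one-liner; the 3x and 10x factors below are assumptions you would adjust to your own traffic pattern:

```python
def peak_qps(average_qps: float, multiplier: float = 3.0) -> float:
    """Scale average QPS up to the peak you actually design for."""
    return average_qps * multiplier

print(peak_qps(2_000))        # 6000.0  → typical daily peak
print(peak_qps(2_000, 10))    # 20000.0 → viral / Black Friday spike
```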
📖 Read-Heavy Systems
- Social media feeds, news sites, product catalogs
- Read:Write ratio typically 10:1 to 100:1
- Strategy: caching, read replicas, CDN
- Cache hit rate of 80-95% reduces DB load dramatically
✏️ Write-Heavy Systems
- Logging, analytics, IoT sensor data, chat messages
- Write:Read ratio can be 10:1 or higher
- Strategy: write-optimized DBs, message queues, batching
- Cassandra, Kafka, or append-only storage
🎯 Interview Insight
Always clarify assumptions with the interviewer: "I'll assume 10M DAU, each user makes 20 reads and 2 writes per day. Does that sound reasonable?" This shows structured thinking and gives the interviewer a chance to adjust the scope.
Storage Sizing
Storage estimation answers: how much disk space does this system need? The formula is simple, but the details matter — replication, indexes, growth over time, and hot vs cold storage.
The Core Formula
Daily storage = (new records per day) × (size per record)
Yearly storage = daily storage × 365
Total storage = yearly storage × retention years × replication factor

Example: Chat application (like WhatsApp)

Assumptions:
- DAU = 50M users
- Each user sends 40 messages/day
- Average message size = 100 bytes (text only)

Daily new messages = 50M × 40 = 2 billion messages/day
Daily storage = 2B × 100 bytes = 200 GB/day
Yearly storage = 200 GB × 365 = 73 TB/year
With 3x replication = 73 × 3 = 219 TB/year
With 20% index overhead = 219 × 1.2 ≈ 263 TB/year
5-year plan: 263 × 5 ≈ 1.3 PB
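The same arithmetic as a sketch, using the WhatsApp-style assumptions above; the replication and index multipliers are the typical defaults discussed in the next subsection:

```python
TB = 1_000**4  # bytes

def yearly_storage_bytes(records_per_day: float, bytes_per_record: float,
                         replication: float = 3.0, index_overhead: float = 1.2) -> float:
    """Yearly storage in bytes, including replication and index overhead."""
    daily_bytes = records_per_day * bytes_per_record
    return daily_bytes * 365 * replication * index_overhead

# Assumptions: 50M DAU × 40 messages/day, 100 bytes per message
per_year = yearly_storage_bytes(50_000_000 * 40, 100)
print(per_year / TB)          # ≈ 263 TB/year
print(per_year * 5 / TB)      # ≈ 1,314 TB ≈ 1.3 PB over 5 years
```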
Don't Forget These Multipliers
| Factor | Multiplier | Why |
|---|---|---|
| Replication | 2-3x | Data is copied across nodes for durability and read performance |
| Index overhead | 1.1-1.3x | B-tree indexes, secondary indexes take additional space |
| Metadata | 1.05-1.1x | Timestamps, IDs, internal DB overhead per row |
| Growth buffer | 1.5-2x | Plan for 1-2 years of growth beyond current estimates |
| Media storage | 10-100x text | Images (200KB-2MB), videos (50-100MB) dwarf text data |
Hot vs Cold Storage
🔥 Hot Storage (Frequently Accessed)
- Recent data (last 30 days of messages, recent orders)
- Stored on SSD or in-memory (Redis)
- Fast access, expensive per GB
- Typically 5-20% of total data
❄️ Cold Storage (Rarely Accessed)
- Old data (messages from 2 years ago, archived logs)
- Stored on HDD or object storage (S3)
- Slow access, cheap per GB
- Typically 80-95% of total data
Example: Photo-sharing service (Instagram-like)

Assumptions:
- DAU = 100M users
- 10% post a photo daily = 10M photos/day
- Average photo size = 500 KB (after compression)
- Store 3 sizes: thumbnail (50 KB) + medium (200 KB) + original (500 KB)

Daily photo storage = 10M × (50 + 200 + 500) KB = 10M × 750 KB = 7.5 TB/day
Yearly = 7.5 TB × 365 ≈ 2.7 PB/year
With replication (3x) ≈ 8.1 PB/year

→ This is why Instagram uses object storage (S3), not a database.
→ Photos are served via CDN, not from the origin.
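A rough sketch of how hot/cold tiering changes the cost picture for storage at this scale; the 10% hot fraction and the per-TB prices are placeholder assumptions for illustration only:

```python
def tier_split(total_tb: float, hot_fraction: float = 0.10,
               hot_cost_per_tb: float = 100.0, cold_cost_per_tb: float = 10.0):
    """Split storage into hot (SSD, recent) and cold (object storage, archive) tiers.
    Prices are made-up $/TB-month figures, not real quotes."""
    hot_tb = total_tb * hot_fraction
    cold_tb = total_tb - hot_tb
    monthly_cost = hot_tb * hot_cost_per_tb + cold_tb * cold_cost_per_tb
    return hot_tb, cold_tb, monthly_cost

# e.g. ~8,100 TB of photo storage with 10% kept hot
print(tier_split(8_100))      # ≈ (810 TB hot, 7,290 TB cold, $153,900/month)
```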
🎯 Interview Insight
Always plan for growth. If the system needs 10 TB today, estimate for 3-5 years. Storage is cheap — running out of it is not. Mention hot/cold tiering to show you think about cost optimization, not just raw capacity.
Bandwidth Calculations
Bandwidth is the amount of data flowing through your system per second. It determines whether your network, servers, and CDN can handle the load — or become the bottleneck.
The Core Formula
Bandwidth = QPS × average response size

Example: API service
- QPS = 5,000 requests/sec
- Average response = 10 KB (JSON)
- Bandwidth = 5,000 × 10 KB = 50,000 KB/s = 50 MB/s = 400 Mbps

Example: Video streaming service
- Concurrent viewers = 100,000
- Bitrate = 5 Mbps (1080p)
- Bandwidth = 100,000 × 5 Mbps = 500 Gbps

→ This is why Netflix uses CDN edge servers.
→ Serving 500 Gbps from a single data center is impractical.
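A sketch of the conversion; the only subtlety worth encoding is bytes versus bits (8 bits per byte), since network capacity is quoted in bits:

```python
def egress_mbps(qps: float, avg_response_kb: float) -> float:
    """Average egress bandwidth in Mbps for an API-style service."""
    bytes_per_sec = qps * avg_response_kb * 1_000
    return bytes_per_sec * 8 / 1_000_000      # bytes/s → bits/s → Mbps

print(egress_mbps(5_000, 10))   # 400.0 → matches the API example above
```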
Ingress vs Egress
⬆️ Ingress (Incoming Traffic)
- Data flowing INTO your system
- User uploads: photos, videos, files
- API requests with payloads (POST/PUT bodies)
- Usually smaller than egress for most web apps
⬇️ Egress (Outgoing Traffic)
- Data flowing OUT of your system
- API responses, page loads, media delivery
- Usually the bottleneck (and the expensive part)
- CDN offloads 60-90% of egress traffic
Without CDN:
- Total egress = 50 MB/s (all from origin servers)
- Origin bandwidth cost: HIGH

With CDN (90% cache hit rate):
- CDN serves: 50 × 0.9 = 45 MB/s (from edge, cheap)
- Origin serves: 50 × 0.1 = 5 MB/s (cache misses only)
- Origin bandwidth reduced by 90%

This is why a CDN is the first thing you add when bandwidth becomes a concern. It's cheaper and faster than adding servers.
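The offload math is a one-liner; the cache hit rate is the assumption that drives everything:

```python
def origin_egress_mb(total_mb_per_sec: float, cdn_hit_rate: float) -> float:
    """Traffic the origin still serves after CDN offload (cache misses only)."""
    return total_mb_per_sec * (1 - cdn_hit_rate)

print(origin_egress_mb(50, 0.90))   # ≈ 5 MB/s  → 90% of egress moved to the edge
print(origin_egress_mb(50, 0.99))   # ≈ 0.5 MB/s → every extra "9" shrinks origin load further
```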
🎯 Interview Insight
Bandwidth becomes the bottleneck for media-heavy systems (video, images) long before CPU or storage does. Always mention CDN when discussing bandwidth. For API-only services, bandwidth is rarely the bottleneck — QPS and database throughput matter more.
End-to-End Estimation Example
Let's estimate capacity for a URL shortener (like bit.ly) end-to-end. This is a classic interview question.
Step 1: State Assumptions
Traffic:
- 100M new URLs created per month
- Read:Write ratio = 100:1 (URLs are created once, read many times)

Data:
- Short URL: 7 characters = 7 bytes
- Long URL: average 200 characters = 200 bytes
- Metadata (created_at, user_id, etc.): ~100 bytes
- Total per record: ~300 bytes

Retention: 5 years
Step 2: QPS Estimation
Write QPS (URL creation):
100M URLs/month ÷ (30 days × 100K seconds/day) = 100M / 3M ≈ 33 writes/sec

Read QPS (URL redirects):
Read:Write = 100:1 → 33 × 100 = 3,300 reads/sec

Peak QPS (3x average):
Peak writes ≈ 100/sec
Peak reads ≈ 10,000/sec

→ Read-heavy system. Cache the most popular URLs.
Step 3: Storage Estimation
New URLs per month: 100M
Size per URL record: 300 bytes

Monthly storage = 100M × 300 bytes = 30 GB/month
Yearly storage = 30 GB × 12 = 360 GB/year
5-year storage = 360 GB × 5 = 1.8 TB
With replication (3x): 1.8 × 3 = 5.4 TB
With index overhead (20%): 5.4 × 1.2 ≈ 6.5 TB

Total URLs in 5 years: 100M × 12 × 5 = 6 billion URLs

→ 6.5 TB fits comfortably on a single database node; sharding is optional at this scale but keeps future growth painless.
→ 6 billion records needs a good indexing strategy.
Step 4: Bandwidth Estimation
Write bandwidth (ingress):
33 writes/sec × 300 bytes ≈ 10 KB/s (negligible)

Read bandwidth (egress):
Each redirect response: HTTP 301 + Location header ≈ 500 bytes
3,300 reads/sec × 500 bytes = 1.65 MB/s ≈ 13 Mbps

Peak read bandwidth:
10,000 reads/sec × 500 bytes = 5 MB/s ≈ 40 Mbps

→ Bandwidth is NOT the bottleneck for a URL shortener.
→ The bottleneck is read QPS → solved with caching (Redis).
Step 5: Architecture Implications
QPS: ~3,300 reads/sec → Redis cache handles this easily
Storage: ~6.5 TB over 5 years → PostgreSQL with sharding, or DynamoDB
Bandwidth: ~13 Mbps → not a concern

Architecture:
→ Redis cache for hot URLs (top 20% of URLs get 80% of traffic)
→ PostgreSQL or DynamoDB for persistent storage
→ No CDN needed (responses are tiny redirects, not media)
→ A single server can handle this; add a second for redundancy

If this were Instagram (media-heavy):
→ Storage would be PBs, not TBs
→ Bandwidth would be Gbps, not Mbps
→ CDN would be essential
→ Object storage (S3) instead of a database for media
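The whole walkthrough fits in a few lines of Python. This is a sketch of the estimation flow only; every input is an assumption carried over from Step 1:

```python
SECONDS_PER_DAY = 100_000
TB = 1_000**4

# Step 1: assumptions
urls_per_month = 100_000_000
bytes_per_record = 300
read_write_ratio = 100
redirect_bytes = 500
years = 5

# Step 2: QPS
write_qps = urls_per_month / (30 * SECONDS_PER_DAY)            # ≈ 33
read_qps = write_qps * read_write_ratio                        # ≈ 3,300
peak_read_qps = read_qps * 3                                   # ≈ 10,000

# Step 3: storage (3x replication, 20% index overhead)
storage_tb = (urls_per_month * 12 * years * bytes_per_record
              * 3 * 1.2 / TB)                                  # ≈ 6.5 TB

# Step 4: bandwidth
egress_mbps = read_qps * redirect_bytes * 8 / 1_000_000        # ≈ 13 Mbps

print(write_qps, read_qps, peak_read_qps, storage_tb, egress_mbps)
```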
🔥 This Is What Interviewers Want
State assumptions → calculate QPS → calculate storage → calculate bandwidth → derive architecture decisions. The numbers should drive the design, not the other way around. Show this process clearly and you'll ace the estimation portion.
Trade-offs & Design Decisions
Accuracy vs Speed of Estimation
| Dimension | Quick Estimate (Interview) | Detailed Estimate (Production) |
|---|---|---|
| Time spent | 2-5 minutes | Days to weeks |
| Precision | Within 2-5x of reality | Within 10-20% of reality |
| Assumptions | Round aggressively (100K sec/day) | Measure actual traffic patterns |
| Purpose | Guide architecture decisions | Size infrastructure, plan budget |
| When to use | System design interviews, early design | Capacity planning, procurement |
Overestimation vs Underestimation
Overestimation risks
- ❌ Wasted money on unused infrastructure
- ❌ Over-engineered architecture (complexity without need)
- ❌ Slower development (building for scale you don't have)
- ❌ Premature optimization
Underestimation risks
- ❌ System crashes under real load
- ❌ Emergency scaling (expensive, stressful)
- ❌ Data loss if storage runs out
- ❌ Poor user experience (slow responses, timeouts)
💡 The Sweet Spot
Estimate for 3-5x your expected load. This gives you headroom for growth and traffic spikes without massively over-provisioning. Cloud infrastructure makes this easier — you can scale up when needed, so slight underestimation is less catastrophic than it used to be.
Cost vs Performance
| Decision | Cheaper Option | Faster Option |
|---|---|---|
| Storage | HDD / S3 Standard ($0.023/GB-month) | SSD / S3 Express ($0.16/GB-month) |
| Caching | No cache (hit DB every time) | Redis cache ($$$, but 100x faster reads) |
| Replication | Single copy (risk of data loss) | 3x replication (3x storage cost, high durability) |
| CDN | Serve from origin (high latency) | CDN edge delivery (CDN cost, but 10x faster) |
| Compute | Fewer, larger servers | More, smaller servers (better fault tolerance) |
Interview Questions
Estimation-based and scenario-based questions you're likely to encounter.
Q: How do you calculate QPS from DAU?
A: QPS = (DAU × actions per user per day) / seconds per day. Use 100,000 seconds per day for easy math (actual is 86,400). Example: 10M DAU, 20 reads per user per day → 10M × 20 / 100K = 2,000 read QPS. Always separate read QPS and write QPS — they have different scaling strategies. Multiply by 2-3x for peak traffic.
Q: Estimate storage for WhatsApp messages
A: Assumptions: 2B users, 100M DAU, 50 messages per user per day, average message 100 bytes. Daily messages: 100M × 50 = 5B messages. Daily storage: 5B × 100 bytes = 500 GB/day. Yearly: 500 GB × 365 = ~180 TB/year. With 3x replication: ~540 TB/year. With media (10% of messages have a 200KB image): 100M × 50 × 0.1 × 200KB = 100 TB/day for images alone. Media dominates storage — text is negligible in comparison.
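A quick sanity check of the text-versus-media split, using the assumptions stated in the answer above (100M DAU, 50 messages/day, 10% attachment rate, 200 KB images):

```python
TB = 1_000**4
messages_per_day = 100_000_000 * 50                 # 100M DAU × 50 messages

text_bytes = messages_per_day * 100                 # 100-byte text messages
image_bytes = (messages_per_day // 10) * 200_000    # 10% carry a 200 KB image

print(text_bytes / TB)    # 0.5 TB/day of text
print(image_bytes / TB)   # 100.0 TB/day of images → media dominates by ~200x
```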
Design a notification system for 50M users
Estimate the QPS and storage requirements
Answer: Assumptions: 50M DAU, each user receives 10 notifications/day, each notification is 500 bytes (title, body, metadata, timestamp). Write QPS: 50M × 10 / 100K = 5,000 writes/sec. Peak: 15,000 writes/sec. Daily storage: 50M × 10 × 500 bytes = 250 GB/day. Yearly: ~91 TB. With 3x replication: ~273 TB. Architecture: write-optimized DB (Cassandra) for notifications, Redis for unread counts, push via WebSockets. Notifications older than 30 days → cold storage.
Your image hosting service gets 1M uploads per day
Estimate storage and bandwidth needs
Answer: Assumptions: 1M uploads/day, average image 1MB, store 3 sizes (thumbnail 50KB, medium 300KB, original 1MB). Daily storage: 1M × (50 + 300 + 1000) KB = 1M × 1.35 MB = 1.35 TB/day. Yearly: ~490 TB. With replication: ~1.5 PB/year. Read bandwidth: if each image is viewed 100 times on average (100M views/day), with 80% thumbnail views (50KB) and 20% medium views (300KB), the average view is ~100KB: 100M × 100KB / 100K sec = 100 MB/s ≈ 800 Mbps average, and 2-3 Gbps at peak — a hefty egress bill to serve from origin. CDN is mandatory — with a 95% cache hit rate, the origin handles only ~5 MB/s.
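A sketch of that bandwidth arithmetic, with the 80/20 view mix as the key assumption:

```python
views_per_day = 1_000_000 * 100                 # 1M uploads × 100 views each
avg_view_kb = 0.8 * 50 + 0.2 * 300              # 80% thumbnails, 20% medium → ≈ 100 KB

avg_mb_per_sec = views_per_day * avg_view_kb / 1_000 / 100_000   # ÷ rounded seconds/day
origin_mb_per_sec = avg_mb_per_sec * (1 - 0.95)                  # 95% CDN hit rate

print(avg_mb_per_sec)      # ≈ 100 MB/s (~800 Mbps) average egress
print(origin_mb_per_sec)   # ≈ 5 MB/s left for the origin
```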
You're told the system has 500M DAU
What's the first thing you estimate?
Answer: QPS. 500M DAU is meaningless without knowing actions per user. Ask: 'What does each user do?' If it's a read-heavy feed (20 reads/day): 500M × 20 / 100K = 100K read QPS. That's serious scale — you need caching (Redis cluster), read replicas, CDN, and likely database sharding. If it's a messaging app (50 messages/day): 500M × 50 / 100K = 250K write QPS. That's write-heavy — you need Cassandra or Kafka, not PostgreSQL. The DAU alone doesn't tell you the architecture; the access pattern does.
Common Mistakes
These mistakes lead to wrong estimates and bad architecture decisions.
Ignoring peak traffic
Designing for average QPS and wondering why the system crashes at 6 PM. Average QPS is 2,000 but peak is 10,000. If your system handles 3,000, it fails during peak hours — exactly when the most users are online.
✅ Always calculate peak QPS (2-3x average for normal traffic, 5-10x for events like Black Friday). Design your system to handle peak, not average. Use auto-scaling to handle spikes cost-effectively.
Forgetting replication and indexes
Estimating 1 TB of storage and provisioning exactly 1 TB. With 3x replication, you need 3 TB. With indexes (20% overhead), you need 3.6 TB. With growth buffer, you need 5+ TB. Running out of storage in production is a crisis.
✅ Always multiply raw storage by: replication factor (2-3x) × index overhead (1.2x) × growth buffer (1.5-2x). A 1 TB estimate becomes 4-7 TB in practice.
Unrealistic assumptions
Assuming every user is active 24/7, or that all 1 billion registered users are DAU. If you have 1B registered users, DAU is typically 10-30% (100-300M). Not all users are equally active.
✅ Use realistic ratios: DAU is typically 10-30% of total users. Actions per user vary by product (social media: 20-50 actions/day, e-commerce: 5-10 actions/day). State your assumptions explicitly and ask the interviewer if they're reasonable.
Not explaining reasoning
Jumping to '10,000 QPS' without showing how you got there. The interviewer can't evaluate your thinking if you just state a number. The process is more important than the answer.
✅ Always show your work: 'We have 10M DAU, each user reads 20 tweets per day, so read QPS = 10M × 20 / 100K = 2,000. With 3x peak multiplier, that's 6,000 peak QPS.' This takes 30 seconds and demonstrates structured thinking.
Forgetting media dominates storage
Carefully estimating text storage (messages, metadata) and ignoring that a single image is 1000x larger than a text message. For any system with user-uploaded media, images and videos will be 95%+ of total storage.
✅ Always ask: 'Does this system handle media?' If yes, estimate media storage separately — it will dwarf everything else. A chat app's text messages might be 500 GB/day, but image attachments could be 50 TB/day.