Capacity Estimation
Master back-of-the-envelope calculations — QPS estimation, storage sizing, and bandwidth calculations. The math that drives every system design interview.
The Big Picture — What Is Capacity Estimation?
Capacity estimation is the art of making quick, reasonable calculations about how much a system needs — how many requests per second, how much storage, how much bandwidth. It's not about being precise to the byte. It's about getting within the right order of magnitude so you can make informed design decisions.
The Restaurant Planning Analogy
You're opening a restaurant. Before building anything, you estimate: How many customers per hour? (QPS) → Determines how many tables, chefs, and waiters you need. How much food to stock? (Storage) → Determines fridge size and supply orders. How fast can the kitchen serve? (Bandwidth) → Determines if you need a bigger kitchen or faster equipment. You don't need exact numbers — you need to know: are we serving 50 people or 5,000? That's the difference between a food truck and a banquet hall. Getting the order of magnitude right is what matters.
🔥 Key Insight
In system design interviews, capacity estimation isn't a math test. It's a communication exercise. Interviewers want to see you make reasonable assumptions, state them clearly, and arrive at numbers that guide your architecture. The process matters more than the exact answer.
Why Estimations Matter
Avoid Over-Engineering
If your system handles 100 QPS, you don't need Kafka, Cassandra, and 50 microservices. A single PostgreSQL instance is fine. Estimation prevents building for scale you'll never reach.
Prevent Failure at Scale
If your system will handle 100K QPS and you designed for 1K, it will crash on launch day. Estimation reveals bottlenecks before they become outages.
Drive Architecture Decisions
The numbers tell you: do you need a cache? How many servers? Should you shard the database? Estimation turns vague requirements into concrete infrastructure.
Orders of Magnitude — The Only Precision You Need
| Power | Value | Name | Context |
|---|---|---|---|
| 10³ | 1,000 | Thousand | Small app, internal tool |
| 10⁴ | 10,000 | Ten thousand | Growing startup |
| 10⁵ | 100,000 | Hundred thousand | Medium-scale product |
| 10⁶ | 1,000,000 | Million | Large-scale product |
| 10⁷ | 10,000,000 | Ten million | Major platform |
| 10⁹ | 1,000,000,000 | Billion | Global-scale (Google, Meta) |
Time:
- 1 day = 86,400 seconds ≈ 10⁵ seconds (use 100K)
- 1 month ≈ 2.6M seconds ≈ 2.5 × 10⁶
- 1 year ≈ 31.5M seconds ≈ 3 × 10⁷

Storage:
- 1 KB = 1,000 bytes (a short text message)
- 1 MB = 1,000 KB (a high-res photo)
- 1 GB = 1,000 MB (a movie)
- 1 TB = 1,000 GB (a small database)
- 1 PB = 1,000 TB (a large-scale system)

Typical sizes:
- 1 char = 1 byte (ASCII) or up to 4 bytes (UTF-8)
- A tweet (280 chars) ≈ 280 bytes ≈ 0.3 KB
- A JSON API response ≈ 1-10 KB
- A photo ≈ 200 KB - 2 MB
- A video (1 min, 720p) ≈ 50-100 MB
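These shortcuts are reused throughout this page. A minimal Python sketch of the same constants, rounded aggressively on purpose (the payload sizes at the bottom are illustrative assumptions, not measurements):

```python
# Rounded constants for back-of-the-envelope math (precision is intentionally sacrificed).
SECONDS_PER_DAY = 100_000        # actual: 86,400
SECONDS_PER_MONTH = 2_500_000    # actual: ~2.6M
SECONDS_PER_YEAR = 30_000_000    # actual: ~31.5M

KB = 1_000            # bytes; decimal units are close enough for estimation
MB = 1_000 * KB
GB = 1_000 * MB
TB = 1_000 * GB
PB = 1_000 * TB

# Rough payload sizes (assumed midpoints of the ranges above)
TWEET = 300                      # bytes
JSON_RESPONSE = 5 * KB
PHOTO = 500 * KB
VIDEO_MINUTE_720P = 75 * MB
```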
💡 The 80/20 Rule of Estimation
Round aggressively. Use 100K instead of 86,400. Use 1 million instead of 1,048,576. The goal is speed and clarity, not precision. If your estimate is within 2-5x of reality, you've done well.
QPS / Throughput Estimation
QPS (Queries Per Second) is the number of requests your system handles every second. It's the most fundamental capacity metric — it determines how many servers, how much caching, and what database you need.
DAU → QPS Conversion
QPS = (DAU × actions per user per day) / seconds per day

Where:
DAU = Daily Active Users
seconds per day ≈ 100,000 (use 10⁵ for easy math)

Example: Twitter-like service
DAU = 10 million (10⁷)
Each user reads 20 tweets/day and posts 2 tweets/day

Read QPS = (10M × 20) / 100K = 200M / 100K = 2,000 QPS
Write QPS = (10M × 2) / 100K = 20M / 100K = 200 QPS

→ Read-heavy system (10:1 read-to-write ratio)
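The same conversion as a tiny helper. This is a sketch only; the function name and inputs are the hypothetical Twitter-like numbers from the example above:

```python
SECONDS_PER_DAY = 100_000  # rounded from 86,400

def qps(dau: int, actions_per_user_per_day: float) -> float:
    """Average queries per second for a single action type."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY

dau = 10_000_000              # assumption: 10M daily active users
read_qps = qps(dau, 20)       # 20 feed reads per user per day
write_qps = qps(dau, 2)       # 2 posts per user per day
print(read_qps, write_qps)    # 2000.0 200.0 → 10:1 read-heavy
```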
Peak vs Average
Traffic is never evenly distributed. Peak hours can be 2-5x the average. Design for peak, not average — otherwise your system crashes during rush hour.
Average QPS = 2,000 (from above)
Peak multiplier = 3x (typical for social media)
Peak QPS = 2,000 × 3 = 6,000 QPS

For spiky events (Black Friday, viral content):
Spike multiplier = 5-10x
Spike QPS = 2,000 × 10 = 20,000 QPS

Rule of thumb:
- Design for 2-3x average for normal peak
- Design for 5-10x average if you expect viral/seasonal spikes
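Extending the helper with a peak multiplier is a one-liner; the 3x and 10x factors below are assumptions you would adjust to your own traffic pattern:

```python
def peak_qps(average_qps: float, multiplier: float = 3.0) -> float:
    """Scale average QPS up to the peak you actually design for."""
    return average_qps * multiplier

print(peak_qps(2_000))        # 6000.0  → typical daily peak
print(peak_qps(2_000, 10))    # 20000.0 → viral / Black Friday spike
```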
📖 Read-Heavy Systems
- Social media feeds, news sites, product catalogs
- Read:Write ratio typically 10:1 to 100:1
- Strategy: caching, read replicas, CDN
- Cache hit rate of 80-95% reduces DB load dramatically
✏️ Write-Heavy Systems
- Logging, analytics, IoT sensor data, chat messages
- Write:Read ratio can be 10:1 or higher
- Strategy: write-optimized DBs, message queues, batching
- Cassandra, Kafka, or append-only storage
🎯 Interview Insight
Always clarify assumptions with the interviewer: "I'll assume 10M DAU, each user makes 20 reads and 2 writes per day. Does that sound reasonable?" This shows structured thinking and gives the interviewer a chance to adjust the scope.
Storage Sizing
Storage estimation answers: how much disk space does this system need? The formula is simple, but the details matter — replication, indexes, growth over time, and hot vs cold storage.
The Core Formula
Daily storage = (new records per day) × (size per record)
Yearly storage = daily storage × 365
Total storage = yearly storage × retention years × replication factor

Example: Chat application (like WhatsApp)

Assumptions:
- DAU = 50M users
- Each user sends 40 messages/day
- Average message size = 100 bytes (text only)

Daily new messages = 50M × 40 = 2 billion messages/day
Daily storage = 2B × 100 bytes = 200 GB/day
Yearly storage = 200 GB × 365 = 73 TB/year
With 3x replication = 73 × 3 = 219 TB/year
With 20% index overhead = 219 × 1.2 ≈ 263 TB/year
5-year plan: 263 × 5 ≈ 1.3 PB
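The same arithmetic as a sketch, using the WhatsApp-style assumptions above; the replication and index multipliers are the typical defaults discussed in the next subsection:

```python
TB = 1_000**4  # bytes

def yearly_storage_bytes(records_per_day: float, bytes_per_record: float,
                         replication: float = 3.0, index_overhead: float = 1.2) -> float:
    """Yearly storage in bytes, including replication and index overhead."""
    daily_bytes = records_per_day * bytes_per_record
    return daily_bytes * 365 * replication * index_overhead

# Assumptions: 50M DAU × 40 messages/day, 100 bytes per message
per_year = yearly_storage_bytes(50_000_000 * 40, 100)
print(per_year / TB)          # ≈ 263 TB/year
print(per_year * 5 / TB)      # ≈ 1,314 TB ≈ 1.3 PB over 5 years
```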
Don't Forget These Multipliers
| Factor | Multiplier | Why |
|---|---|---|
| Replication | 2-3x | Data is copied across nodes for durability and read performance |
| Index overhead | 1.1-1.3x | B-tree indexes, secondary indexes take additional space |
| Metadata | 1.05-1.1x | Timestamps, IDs, internal DB overhead per row |
| Growth buffer | 1.5-2x | Plan for 1-2 years of growth beyond current estimates |
| Media storage | 10-100x text | Images (200KB-2MB), videos (50-100MB) dwarf text data |
Hot vs Cold Storage
🔥 Hot Storage (Frequently Accessed)
- Recent data (last 30 days of messages, recent orders)
- Stored on SSD or in-memory (Redis)
- Fast access, expensive per GB
- Typically 5-20% of total data
❄️ Cold Storage (Rarely Accessed)
- Old data (messages from 2 years ago, archived logs)
- Stored on HDD or object storage (S3)
- Slow access, cheap per GB
- Typically 80-95% of total data
Example: Photo-sharing service (Instagram-like)

Assumptions:
- DAU = 100M users
- 10% post a photo daily = 10M photos/day
- Average photo size = 500 KB (after compression)
- Store 3 sizes: thumbnail (50 KB) + medium (200 KB) + original (500 KB)

Daily photo storage = 10M × (50 + 200 + 500) KB = 10M × 750 KB = 7.5 TB/day
Yearly = 7.5 TB × 365 ≈ 2.7 PB/year
With replication (3x) ≈ 8.1 PB/year

→ This is why Instagram uses object storage (S3), not a database.
→ Photos are served via CDN, not from the origin.
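A rough sketch of how hot/cold tiering changes the cost picture for storage at this scale; the 10% hot fraction and the per-TB prices are placeholder assumptions for illustration only:

```python
def tier_split(total_tb: float, hot_fraction: float = 0.10,
               hot_cost_per_tb: float = 100.0, cold_cost_per_tb: float = 10.0):
    """Split storage into hot (SSD, recent) and cold (object storage, archive) tiers.
    Prices are made-up $/TB-month figures, not real quotes."""
    hot_tb = total_tb * hot_fraction
    cold_tb = total_tb - hot_tb
    monthly_cost = hot_tb * hot_cost_per_tb + cold_tb * cold_cost_per_tb
    return hot_tb, cold_tb, monthly_cost

# e.g. ~8,100 TB of photo storage with 10% kept hot
print(tier_split(8_100))      # ≈ (810 TB hot, 7,290 TB cold, $153,900/month)
```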
🎯 Interview Insight
Always plan for growth. If the system needs 10 TB today, estimate for 3-5 years. Storage is cheap — running out of it is not. Mention hot/cold tiering to show you think about cost optimization, not just raw capacity.
Bandwidth Calculations
Bandwidth is the amount of data flowing through your system per second. It determines whether your network, servers, and CDN can handle the load — or become the bottleneck.
The Core Formula
Bandwidth = QPS × average response size

Example: API service
- QPS = 5,000 requests/sec
- Average response = 10 KB (JSON)
- Bandwidth = 5,000 × 10 KB = 50,000 KB/s = 50 MB/s = 400 Mbps

Example: Video streaming service
- Concurrent viewers = 100,000
- Bitrate = 5 Mbps (1080p)
- Bandwidth = 100,000 × 5 Mbps = 500 Gbps

→ This is why Netflix uses CDN edge servers.
→ Serving 500 Gbps from a single data center is impractical.
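A sketch of the conversion; the only subtlety worth encoding is bytes versus bits (8 bits per byte), since network capacity is quoted in bits:

```python
def egress_mbps(qps: float, avg_response_kb: float) -> float:
    """Average egress bandwidth in Mbps for an API-style service."""
    bytes_per_sec = qps * avg_response_kb * 1_000
    return bytes_per_sec * 8 / 1_000_000      # bytes/s → bits/s → Mbps

print(egress_mbps(5_000, 10))   # 400.0 → matches the API example above
```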
Ingress vs Egress
⬆️ Ingress (Incoming Traffic)
- Data flowing INTO your system
- User uploads: photos, videos, files
- API requests with payloads (POST/PUT bodies)
- Usually smaller than egress for most web apps
⬇️ Egress (Outgoing Traffic)
- Data flowing OUT of your system
- API responses, page loads, media delivery
- Usually the bottleneck (and the expensive part)
- CDN offloads 60-90% of egress traffic
Without CDN:
- Total egress = 50 MB/s (all from origin servers)
- Origin bandwidth cost: HIGH

With CDN (90% cache hit rate):
- CDN serves: 50 × 0.9 = 45 MB/s (from edge, cheap)
- Origin serves: 50 × 0.1 = 5 MB/s (cache misses only)
- Origin bandwidth reduced by 90%

This is why a CDN is the first thing you add when bandwidth becomes a concern. It's cheaper and faster than adding servers.
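The offload math is a one-liner; the cache hit rate is the assumption that drives everything:

```python
def origin_egress_mb(total_mb_per_sec: float, cdn_hit_rate: float) -> float:
    """Traffic the origin still serves after CDN offload (cache misses only)."""
    return total_mb_per_sec * (1 - cdn_hit_rate)

print(origin_egress_mb(50, 0.90))   # ≈ 5 MB/s  → 90% of egress moved to the edge
print(origin_egress_mb(50, 0.99))   # ≈ 0.5 MB/s → every extra "9" shrinks origin load further
```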
🎯 Interview Insight
Bandwidth becomes the bottleneck for media-heavy systems (video, images) long before CPU or storage does. Always mention CDN when discussing bandwidth. For API-only services, bandwidth is rarely the bottleneck — QPS and database throughput matter more.
End-to-End Estimation Example
Let's estimate capacity for a URL shortener (like bit.ly) end-to-end. This is a classic interview question.
Step 1: State Assumptions
Traffic:
- 100M new URLs created per month
- Read:Write ratio = 100:1 (URLs are created once, read many times)

Data:
- Short URL: 7 characters = 7 bytes
- Long URL: average 200 characters = 200 bytes
- Metadata (created_at, user_id, etc.): ~100 bytes
- Total per record: ~300 bytes

Retention: 5 years
Step 2: QPS Estimation
Write QPS (URL creation):
100M URLs/month ÷ (30 days × 100K seconds/day) = 100M / 3M ≈ 33 writes/sec

Read QPS (URL redirects):
Read:Write = 100:1 → 33 × 100 = 3,300 reads/sec

Peak QPS (3x average):
Peak writes ≈ 100/sec
Peak reads ≈ 10,000/sec

→ Read-heavy system. Cache the most popular URLs.
Step 3: Storage Estimation
New URLs per month: 100M
Size per URL record: 300 bytes

Monthly storage = 100M × 300 bytes = 30 GB/month
Yearly storage = 30 GB × 12 = 360 GB/year
5-year storage = 360 GB × 5 = 1.8 TB
With replication (3x): 1.8 × 3 = 5.4 TB
With index overhead (20%): 5.4 × 1.2 ≈ 6.5 TB

Total URLs in 5 years: 100M × 12 × 5 = 6 billion URLs

→ 6.5 TB fits comfortably on a single database node; sharding is optional at this scale but keeps future growth painless.
→ 6 billion records needs a good indexing strategy.
Step 4: Bandwidth Estimation
Write bandwidth (ingress):
33 writes/sec × 300 bytes ≈ 10 KB/s (negligible)

Read bandwidth (egress):
Each redirect response: HTTP 301 + Location header ≈ 500 bytes
3,300 reads/sec × 500 bytes = 1.65 MB/s ≈ 13 Mbps

Peak read bandwidth:
10,000 reads/sec × 500 bytes = 5 MB/s ≈ 40 Mbps

→ Bandwidth is NOT the bottleneck for a URL shortener.
→ The bottleneck is read QPS → solved with caching (Redis).
Step 5: Architecture Implications
QPS: ~3,300 reads/sec → Redis cache handles this easily
Storage: ~6.5 TB over 5 years → PostgreSQL with sharding, or DynamoDB
Bandwidth: ~13 Mbps → not a concern

Architecture:
→ Redis cache for hot URLs (top 20% of URLs get 80% of traffic)
→ PostgreSQL or DynamoDB for persistent storage
→ No CDN needed (responses are tiny redirects, not media)
→ A single server can handle this; add a second for redundancy

If this were Instagram (media-heavy):
→ Storage would be PBs, not TBs
→ Bandwidth would be Gbps, not Mbps
→ CDN would be essential
→ Object storage (S3) instead of a database for media
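The whole walkthrough fits in a few lines of Python. This is a sketch of the estimation flow only; every input is an assumption carried over from Step 1:

```python
SECONDS_PER_DAY = 100_000
TB = 1_000**4

# Step 1: assumptions
urls_per_month = 100_000_000
bytes_per_record = 300
read_write_ratio = 100
redirect_bytes = 500
years = 5

# Step 2: QPS
write_qps = urls_per_month / (30 * SECONDS_PER_DAY)            # ≈ 33
read_qps = write_qps * read_write_ratio                        # ≈ 3,300
peak_read_qps = read_qps * 3                                   # ≈ 10,000

# Step 3: storage (3x replication, 20% index overhead)
storage_tb = (urls_per_month * 12 * years * bytes_per_record
              * 3 * 1.2 / TB)                                  # ≈ 6.5 TB

# Step 4: bandwidth
egress_mbps = read_qps * redirect_bytes * 8 / 1_000_000        # ≈ 13 Mbps

print(write_qps, read_qps, peak_read_qps, storage_tb, egress_mbps)
```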
🔥 This Is What Interviewers Want
State assumptions → calculate QPS → calculate storage → calculate bandwidth → derive architecture decisions. The numbers should drive the design, not the other way around. Show this process clearly and you'll ace the estimation portion.
Trade-offs & Design Decisions
Accuracy vs Speed of Estimation
| Dimension | Quick Estimate (Interview) | Detailed Estimate (Production) |
|---|---|---|
| Time spent | 2-5 minutes | Days to weeks |
| Precision | Within 2-5x of reality | Within 10-20% of reality |
| Assumptions | Round aggressively (100K sec/day) | Measure actual traffic patterns |
| Purpose | Guide architecture decisions | Size infrastructure, plan budget |
| When to use | System design interviews, early design | Capacity planning, procurement |
Overestimation vs Underestimation
Overestimation risks
- ❌ Wasted money on unused infrastructure
- ❌ Over-engineered architecture (complexity without need)
- ❌ Slower development (building for scale you don't have)
- ❌ Premature optimization
Underestimation risks
- ❌ System crashes under real load
- ❌ Emergency scaling (expensive, stressful)
- ❌ Data loss if storage runs out
- ❌ Poor user experience (slow responses, timeouts)
💡 The Sweet Spot
Estimate for 3-5x your expected load. This gives you headroom for growth and traffic spikes without massively over-provisioning. Cloud infrastructure makes this easier — you can scale up when needed, so slight underestimation is less catastrophic than it used to be.
Cost vs Performance
| Decision | Cheaper Option | Faster Option |
|---|---|---|
| Storage | HDD / S3 Standard ($0.023/GB-month) | SSD / S3 Express ($0.16/GB-month) |
| Caching | No cache (hit DB every time) | Redis cache ($$$, but 100x faster reads) |
| Replication | Single copy (risk of data loss) | 3x replication (3x storage cost, high durability) |
| CDN | Serve from origin (high latency) | CDN edge delivery (CDN cost, but 10x faster) |
| Compute | Fewer, larger servers | More, smaller servers (better fault tolerance) |
Interview Questions
Estimation-based and scenario-based questions you're likely to encounter.
Q: How do you calculate QPS from DAU?
A: QPS = (DAU × actions per user per day) / seconds per day. Use 100,000 seconds per day for easy math (actual is 86,400). Example: 10M DAU, 20 reads per user per day → 10M × 20 / 100K = 2,000 read QPS. Always separate read QPS and write QPS — they have different scaling strategies. Multiply by 2-3x for peak traffic.
Q: Estimate storage for WhatsApp messages
A: Assumptions: 2B users, 100M DAU, 50 messages per user per day, average message 100 bytes. Daily messages: 100M × 50 = 5B messages. Daily storage: 5B × 100 bytes = 500 GB/day. Yearly: 500 GB × 365 = ~180 TB/year. With 3x replication: ~540 TB/year. With media (10% of messages have a 200KB image): 100M × 50 × 0.1 × 200KB = 100 TB/day for images alone. Media dominates storage — text is negligible in comparison.
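A quick sanity check of the text-versus-media split, using the assumptions stated in the answer above (100M DAU, 50 messages/day, 10% attachment rate, 200 KB images):

```python
TB = 1_000**4
messages_per_day = 100_000_000 * 50                 # 100M DAU × 50 messages

text_bytes = messages_per_day * 100                 # 100-byte text messages
image_bytes = (messages_per_day // 10) * 200_000    # 10% carry a 200 KB image

print(text_bytes / TB)    # 0.5 TB/day of text
print(image_bytes / TB)   # 100.0 TB/day of images → media dominates by ~200x
```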
Design a notification system for 50M users
Estimate the QPS and storage requirements
Answer: Assumptions: 50M DAU, each user receives 10 notifications/day, each notification is 500 bytes (title, body, metadata, timestamp). Write QPS: 50M × 10 / 100K = 5,000 writes/sec. Peak: 15,000 writes/sec. Daily storage: 50M × 10 × 500 bytes = 250 GB/day. Yearly: ~91 TB. With 3x replication: ~273 TB. Architecture: write-optimized DB (Cassandra) for notifications, Redis for unread counts, push via WebSockets. Notifications older than 30 days → cold storage.
Your image hosting service gets 1M uploads per day
Estimate storage and bandwidth needs
Answer: Assumptions: 1M uploads/day, average image 1MB, store 3 sizes (thumbnail 50KB, medium 300KB, original 1MB). Daily storage: 1M × (50 + 300 + 1000) KB = 1M × 1.35 MB = 1.35 TB/day. Yearly: ~490 TB. With replication: ~1.5 PB/year. Read bandwidth: if each image is viewed 100 times on average (100M views/day), with 80% thumbnail views (50KB) and 20% medium views (300KB), the average view is ~100KB: 100M × 100KB / 100K sec = 100 MB/s ≈ 800 Mbps average, and 2-3 Gbps at peak — a hefty egress bill to serve from origin. CDN is mandatory — with a 95% cache hit rate, the origin handles only ~5 MB/s.
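A sketch of that bandwidth arithmetic, with the 80/20 view mix as the key assumption:

```python
views_per_day = 1_000_000 * 100                 # 1M uploads × 100 views each
avg_view_kb = 0.8 * 50 + 0.2 * 300              # 80% thumbnails, 20% medium → ≈ 100 KB

avg_mb_per_sec = views_per_day * avg_view_kb / 1_000 / 100_000   # ÷ rounded seconds/day
origin_mb_per_sec = avg_mb_per_sec * (1 - 0.95)                  # 95% CDN hit rate

print(avg_mb_per_sec)      # ≈ 100 MB/s (~800 Mbps) average egress
print(origin_mb_per_sec)   # ≈ 5 MB/s left for the origin
```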
You're told the system has 500M DAU
What's the first thing you estimate?
Answer: QPS. 500M DAU is meaningless without knowing actions per user. Ask: 'What does each user do?' If it's a read-heavy feed (20 reads/day): 500M × 20 / 100K = 100K read QPS. That's serious scale — you need caching (Redis cluster), read replicas, CDN, and likely database sharding. If it's a messaging app (50 messages/day): 500M × 50 / 100K = 250K write QPS. That's write-heavy — you need Cassandra or Kafka, not PostgreSQL. The DAU alone doesn't tell you the architecture; the access pattern does.
Common Mistakes
These mistakes lead to wrong estimates and bad architecture decisions.
Ignoring peak traffic
Designing for average QPS and wondering why the system crashes at 6 PM. Average QPS is 2,000 but peak is 10,000. If your system handles 3,000, it fails during peak hours — exactly when the most users are online.
✅ Always calculate peak QPS (2-3x average for normal traffic, 5-10x for events like Black Friday). Design your system to handle peak, not average. Use auto-scaling to handle spikes cost-effectively.
Forgetting replication and indexes
Estimating 1 TB of storage and provisioning exactly 1 TB. With 3x replication, you need 3 TB. With indexes (20% overhead), you need 3.6 TB. With growth buffer, you need 5+ TB. Running out of storage in production is a crisis.
✅ Always multiply raw storage by: replication factor (2-3x) × index overhead (1.2x) × growth buffer (1.5-2x). A 1 TB estimate becomes 4-7 TB in practice.
Unrealistic assumptions
Assuming every user is active 24/7, or that all 1 billion registered users are DAU. If you have 1B registered users, DAU is typically 10-30% (100-300M). Not all users are equally active.
✅ Use realistic ratios: DAU is typically 10-30% of total users. Actions per user vary by product (social media: 20-50 actions/day, e-commerce: 5-10 actions/day). State your assumptions explicitly and ask the interviewer if they're reasonable.
Not explaining reasoning
Jumping to '10,000 QPS' without showing how you got there. The interviewer can't evaluate your thinking if you just state a number. The process is more important than the answer.
✅ Always show your work: 'We have 10M DAU, each user reads 20 tweets per day, so read QPS = 10M × 20 / 100K = 2,000. With 3x peak multiplier, that's 6,000 peak QPS.' This takes 30 seconds and demonstrates structured thinking.
Forgetting media dominates storage
Carefully estimating text storage (messages, metadata) and ignoring that a single image is 1000x larger than a text message. For any system with user-uploaded media, images and videos will be 95%+ of total storage.
✅ Always ask: 'Does this system handle media?' If yes, estimate media storage separately — it will dwarf everything else. A chat app's text messages might be 500 GB/day, but image attachments could be 50 TB/day.