
Capacity Estimation

Master back-of-the-envelope calculations — QPS estimation, storage sizing, and bandwidth calculations. The math that drives every system design interview.

01

The Big Picture — What Is Capacity Estimation?

Capacity estimation is the art of making quick, reasonable calculations about how much a system needs — how many requests per second, how much storage, how much bandwidth. It's not about being precise to the byte. It's about getting within the right order of magnitude so you can make informed design decisions.

🍽️

The Restaurant Planning Analogy

You're opening a restaurant. Before building anything, you estimate: How many customers per hour? (QPS) → Determines how many tables, chefs, and waiters you need. How much food to stock? (Storage) → Determines fridge size and supply orders. How fast can the kitchen serve? (Bandwidth) → Determines if you need a bigger kitchen or faster equipment. You don't need exact numbers — you need to know: are we serving 50 people or 5,000? That's the difference between a food truck and a banquet hall. Getting the order of magnitude right is what matters.

🔥 Key Insight

In system design interviews, capacity estimation isn't a math test. It's a communication exercise. Interviewers want to see you make reasonable assumptions, state them clearly, and arrive at numbers that guide your architecture. The process matters more than the exact answer.

02

Why Estimations Matter

🎯

Avoid Over-Engineering

If your system handles 100 QPS, you don't need Kafka, Cassandra, and 50 microservices. A single PostgreSQL instance is fine. Estimation prevents building for scale you'll never reach.

🚨

Prevent Failure at Scale

If your system will handle 100K QPS and you designed for 1K, it will crash on launch day. Estimation reveals bottlenecks before they become outages.

🏗️

Drive Architecture Decisions

The numbers tell you: do you need a cache? How many servers? Should you shard the database? Estimation turns vague requirements into concrete infrastructure.

Orders of Magnitude — The Only Precision You Need

Power   Value           Name              Context
10³     1,000           Thousand          Small app, internal tool
10⁴     10,000          Ten thousand      Growing startup
10⁵     100,000         Hundred thousand  Medium-scale product
10⁶     1,000,000       Million           Large-scale product
10⁷     10,000,000      Ten million       Major platform
10⁹     1,000,000,000   Billion           Global-scale (Google, Meta)
Quick Reference — Useful Numbers

Time:
  1 day    = 86,400 seconds ≈ 10⁵ seconds (use 100K)
  1 month  = 2.6M seconds   ≈ 2.5 × 10⁶
  1 year   = 31.5M seconds  ≈ 3 × 10⁷

Storage:
  1 KB  = 1,000 bytes       (a short text message)
  1 MB  = 1,000 KB          (a high-res photo)
  1 GB  = 1,000 MB          (a movie)
  1 TB  = 1,000 GB          (a small database)
  1 PB  = 1,000 TB          (a large-scale system)

Characters:
  1 char = 1 byte (ASCII) or 2-4 bytes (UTF-8)
  A tweet (280 chars)   ≈ 280 bytes ≈ 0.3 KB
  A JSON API response   ≈ 1-10 KB
  A photo               ≈ 200 KB - 2 MB
  A video (1 min, 720p) ≈ 50-100 MB

💡 The 80/20 Rule of Estimation

Round aggressively. Use 100K instead of 86,400. Use 1 million instead of 1,048,576. The goal is speed and clarity, not precision. If your estimate is within 2-5x of reality, you've done well.

03

QPS / Throughput Estimation

QPS (Queries Per Second) is the number of requests your system handles every second. It's the most fundamental capacity metric — it determines how many servers, how much caching, and what database you need.

DAU → QPS Conversion

The Core Formula
QPS = (DAU × actions per user per day) / seconds per day

Where:
  DAU = Daily Active Users
  seconds per day ≈ 100,000 (use 10⁵ for easy math)

Example: Twitter-like service
  DAU = 10 million (10⁷)
  Each user reads 20 tweets/day and posts 2 tweets/day

  Read QPS  = (10M × 20) / 100K = 200M / 100K = 2,000 QPS
  Write QPS = (10M × 2)  / 100K = 20M  / 100K = 200 QPS

Read-heavy system (10:1 read-to-write ratio)
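The conversion is mechanical enough to script. A minimal Python sketch of the formula, using the Twitter-like numbers from the example above (the 100K seconds/day rounding is this article's convention, not the exact 86,400):

```python
SECONDS_PER_DAY = 100_000  # rounded from 86,400 for back-of-the-envelope math

def qps(dau: int, actions_per_user_per_day: float) -> float:
    """QPS = (DAU x actions per user per day) / seconds per day."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY

dau = 10_000_000             # 10M daily active users
read_qps = qps(dau, 20)      # 20 tweet reads/user/day -> 2,000 QPS
write_qps = qps(dau, 2)      # 2 tweet posts/user/day  -> 200 QPS
```

Keeping reads and writes separate matters: the two numbers drive different parts of the architecture (caching vs write path).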

Peak vs Average

Traffic is never evenly distributed. Peak hours can be 2-5x the average. Design for peak, not average — otherwise your system crashes during rush hour.

Peak QPS Estimation
Average QPS = 2,000 (from above)
Peak multiplier = 3x (typical for social media)

Peak QPS = 2,000 × 3 = 6,000 QPS

For spiky events (Black Friday, viral content):
  Spike multiplier = 5-10x
  Spike QPS = 2,000 × 10 = 20,000 QPS

Rule of thumb:
  Design for 2-3x average for normal peak
  Design for 5-10x average if you expect viral/seasonal spikes
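The rule of thumb in code, as a sketch (the 3x and 10x multipliers are the ones suggested above, not universal constants):

```python
def design_qps(average_qps: float, peak_multiplier: float = 3.0) -> float:
    """Capacity target: design for peak traffic, not the average."""
    return average_qps * peak_multiplier

normal_peak = design_qps(2_000)        # typical daily peak (3x) -> 6,000 QPS
viral_peak = design_qps(2_000, 10.0)   # viral/seasonal spike (10x) -> 20,000 QPS
```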

📖 Read-Heavy Systems

  • Social media feeds, news sites, product catalogs
  • Read:Write ratio typically 10:1 to 100:1
  • Strategy: caching, read replicas, CDN
  • Cache hit rate of 80-95% reduces DB load dramatically

✏️ Write-Heavy Systems

  • Logging, analytics, IoT sensor data, chat messages
  • Write:Read ratio can be 10:1 or higher
  • Strategy: write-optimized DBs, message queues, batching
  • Cassandra, Kafka, or append-only storage

🎯 Interview Insight

Always clarify assumptions with the interviewer: "I'll assume 10M DAU, each user makes 20 reads and 2 writes per day. Does that sound reasonable?" This shows structured thinking and gives the interviewer a chance to adjust the scope.

04

Storage Sizing

Storage estimation answers: how much disk space does this system need? The formula is simple, but the details matter — replication, indexes, growth over time, and hot vs cold storage.

The Core Formula

Storage Estimation Formula
Daily storage = (new records per day) × (size per record)
Yearly storage = daily storage × 365
Total storage  = yearly storage × retention years × replication factor

Example: Chat application (like WhatsApp)
  Assumptions:
    DAU = 50M users
    Each user sends 40 messages/day
    Average message size = 100 bytes (text only)

  Daily new messages = 50M × 40 = 2 billion messages/day
  Daily storage      = 2B × 100 bytes = 200 GB/day

  Yearly storage     = 200 GB × 365 = 73 TB/year
  With 3x replication = 73 × 3 = 219 TB/year
  With 20% index overhead = 219 × 1.2 = ~263 TB/year

  5-year plan: 263 × 5 = ~1.3 PB
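The multipliers chain naturally into a helper function. A sketch that reproduces the chat-app numbers (365 days/year, 3x replication, and 20% index overhead are the assumptions from the example above):

```python
def total_storage_bytes(records_per_day: float, bytes_per_record: float,
                        retention_years: float,
                        replication: float = 3.0,
                        index_overhead: float = 1.2) -> float:
    """Total bytes = daily bytes x 365 x years x replication x index overhead."""
    daily = records_per_day * bytes_per_record
    return daily * 365 * retention_years * replication * index_overhead

# 50M DAU x 40 messages/day, 100 bytes per message, kept for 5 years
total = total_storage_bytes(50e6 * 40, 100, retention_years=5)
print(f"{total / 1e15:.2f} PB")  # prints 1.31 PB, matching the ~1.3 PB estimate
```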

Don't Forget These Multipliers

Factor           Multiplier        Why
Replication      2-3x              Data is copied across nodes for durability and read performance
Index overhead   1.1-1.3x          B-tree indexes, secondary indexes take additional space
Metadata         1.05-1.1x         Timestamps, IDs, internal DB overhead per row
Growth buffer    1.5-2x            Plan for 1-2 years of growth beyond current estimates
Media storage    10-100x of text   Images (200KB-2MB), videos (50-100MB) dwarf text data

Hot vs Cold Storage

🔥 Hot Storage (Frequently Accessed)

  • Recent data (last 30 days of messages, recent orders)
  • Stored on SSD or in-memory (Redis)
  • Fast access, expensive per GB
  • Typically 5-20% of total data

❄️ Cold Storage (Rarely Accessed)

  • Old data (messages from 2 years ago, archived logs)
  • Stored on HDD or object storage (S3)
  • Slow access, cheap per GB
  • Typically 80-95% of total data
Storage with Media — Instagram Example
Assumptions:
  DAU = 100M users
  10% post a photo daily = 10M photos/day
  Average photo size = 500 KB (after compression)
  Store 3 sizes: thumbnail (50KB) + medium (200KB) + original (500KB)

  Daily photo storage = 10M × (50 + 200 + 500) KB
                      = 10M × 750 KB
                      = 7.5 TB/day

  Yearly = 7.5 TB × 365 = ~2.7 PB/year
  With replication (3x) = ~8.1 PB/year

This is why Instagram uses object storage (S3), not a database.
Photos are served via CDN, not from the origin.
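Multi-size media is just a sum over the stored variants. A sketch of the Instagram-style math (the variant sizes and the 10M daily uploads are the example's assumptions):

```python
VARIANT_SIZES_KB = {"thumbnail": 50, "medium": 200, "original": 500}

def daily_media_bytes(uploads_per_day: float, variant_sizes_kb: dict) -> float:
    """Each upload is stored once per variant size."""
    return uploads_per_day * sum(variant_sizes_kb.values()) * 1_000  # KB -> bytes

daily = daily_media_bytes(10e6, VARIANT_SIZES_KB)  # 7.5e12 bytes = 7.5 TB/day
yearly_replicated = daily * 365 * 3                # ~8.2e15 bytes, roughly 8 PB/year
```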

🎯 Interview Insight

Always plan for growth. If the system needs 10 TB today, estimate for 3-5 years. Storage is cheap — running out of it is not. Mention hot/cold tiering to show you think about cost optimization, not just raw capacity.

05

Bandwidth Calculations

Bandwidth is the amount of data flowing through your system per second. It determines whether your network, servers, and CDN can handle the load — or become the bottleneck.

The Core Formula

Bandwidth Estimation
Bandwidth = QPS × average response size

Example: API service
  QPS = 5,000 requests/sec
  Average response = 10 KB (JSON)

  Bandwidth = 5,000 × 10 KB = 50,000 KB/s = 50 MB/s = 400 Mbps

Example: Video streaming service
  Concurrent viewers = 100,000
  Bitrate = 5 Mbps (1080p)

  Bandwidth = 100,000 × 5 Mbps = 500 Gbps

This is why Netflix uses CDN edge servers.
Serving 500 Gbps from a single data center is impossible.
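The bytes-to-bits conversion (×8) is where bandwidth estimates most often slip, so it is worth scripting. A sketch reproducing both examples above:

```python
def egress_mbps(qps: float, response_bytes: float) -> float:
    """Bandwidth in megabits/sec: QPS x bytes per response x 8 bits / 1e6."""
    return qps * response_bytes * 8 / 1e6

api_mbps = egress_mbps(5_000, 10_000)  # 10 KB JSON at 5,000 QPS -> 400 Mbps

# Streaming is already quoted in bits: concurrent viewers x bitrate
video_gbps = 100_000 * 5 / 1_000       # 100K viewers at 5 Mbps -> 500 Gbps
```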

Ingress vs Egress

⬆️ Ingress (Incoming Traffic)

  • Data flowing INTO your system
  • User uploads: photos, videos, files
  • API requests with payloads (POST/PUT bodies)
  • Usually smaller than egress for most web apps

⬇️ Egress (Outgoing Traffic)

  • Data flowing OUT of your system
  • API responses, page loads, media delivery
  • Usually the bottleneck (and the expensive part)
  • CDN offloads 60-90% of egress traffic
Bandwidth with CDN Impact
Without CDN:
  Total egress = 50 MB/s (all from origin servers)
  Origin bandwidth cost: HIGH

With CDN (90% cache hit rate):
  CDN serves: 50 × 0.9 = 45 MB/s (from edge, cheap)
  Origin serves: 50 × 0.1 = 5 MB/s (cache misses only)
  Origin bandwidth reduced by 90%

This is why CDN is the first thing you add when bandwidth
becomes a concern. It's cheaper and faster than adding servers.
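The CDN offload split is a one-liner. A sketch using the 90% hit rate from the example above:

```python
def egress_split(total_mb_per_s: float, cdn_hit_rate: float) -> tuple[float, float]:
    """Split total egress into (served by CDN edge, served by origin), in MB/s."""
    edge = total_mb_per_s * cdn_hit_rate
    return edge, total_mb_per_s - edge

edge_mb, origin_mb = egress_split(50, 0.9)  # 45 MB/s from edge, 5 MB/s from origin
```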

🎯 Interview Insight

Bandwidth becomes the bottleneck for media-heavy systems (video, images) long before CPU or storage does. Always mention CDN when discussing bandwidth. For API-only services, bandwidth is rarely the bottleneck — QPS and database throughput matter more.

06

End-to-End Estimation Example

Let's estimate capacity for a URL shortener (like bit.ly) end-to-end. This is a classic interview question.

Step 1: State Assumptions

Assumptions
Users:
  100M total URLs created per month
  Read:Write ratio = 100:1 (URLs are created once, read many times)

Data:
  Short URL: 7 characters = 7 bytes
  Long URL: average 200 characters = 200 bytes
  Metadata (created_at, user_id, etc.): ~100 bytes
  Total per record: ~300 bytes

Retention: 5 years

Step 2: QPS Estimation

QPS Calculation
Write QPS (URL creation):
  100M URLs/month ÷ (30 days × 100K seconds/day)
  = 100M / 3M
  ≈ 33 writes/sec

Read QPS (URL redirects):
  Read:Write = 100:1
  = 33 × 100 = 3,300 reads/sec

Peak QPS (3x average):
  Peak writes = ~100/sec
  Peak reads  = ~10,000/sec

Read-heavy system. Cache the most popular URLs.

Step 3: Storage Estimation

Storage Calculation
New URLs per month: 100M
Size per URL record: 300 bytes

Monthly storage = 100M × 300 bytes = 30 GB/month
Yearly storage  = 30 GB × 12 = 360 GB/year
5-year storage  = 360 GB × 5 = 1.8 TB

With replication (3x): 1.8 × 3 = 5.4 TB
With index overhead (20%): 5.4 × 1.2 = ~6.5 TB

Total URLs in 5 years: 100M × 12 × 5 = 6 billion URLs

6.5 TB is easily handled by a single (or lightly sharded) database.
6 billion records needs a good indexing strategy.

Step 4: Bandwidth Estimation

Bandwidth Calculation
Write bandwidth (ingress):
  33 writes/sec × 300 bytes = ~10 KB/s (negligible)

Read bandwidth (egress):
  Each redirect response: HTTP 301 + Location header ≈ 500 bytes
  3,300 reads/sec × 500 bytes = 1.65 MB/s ≈ 13 Mbps

Peak read bandwidth:
  10,000 reads/sec × 500 bytes = 5 MB/s = 40 Mbps

Bandwidth is NOT the bottleneck for a URL shortener.
The bottleneck is read QPS, solved with caching (Redis).

Step 5: Architecture Implications

What the Numbers Tell Us

QPS:       ~3,300 reads/sec → Redis cache handles this easily
Storage:   ~6.5 TB over 5 years → PostgreSQL with sharding, or DynamoDB
Bandwidth: ~13 Mbps → not a concern

Architecture:
  • Redis cache for hot URLs (top 20% of URLs get 80% of traffic)
  • PostgreSQL or DynamoDB for persistent storage
  • No CDN needed (responses are tiny redirects, not media)
  • Single server can handle this; add a second for redundancy

If this were Instagram (media-heavy):
  • Storage would be PBs, not TBs
  • Bandwidth would be Gbps, not Mbps
  • CDN would be essential
  • Object storage (S3) instead of database for media

🔥 This Is What Interviewers Want

State assumptions → calculate QPS → calculate storage → calculate bandwidth → derive architecture decisions. The numbers should drive the design, not the other way around. Show this process clearly and you'll ace the estimation portion.
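The whole walkthrough fits in a dozen lines. A Python sketch tying the five steps together (same assumptions as above; tiny differences from the text come from carrying exact intermediates instead of rounding at each step):

```python
SECONDS_PER_DAY = 100_000
SECONDS_PER_MONTH = 30 * SECONDS_PER_DAY           # 3M seconds

urls_per_month, record_bytes = 100e6, 300
read_write_ratio, retention_years = 100, 5

# Step 2: QPS
write_qps = urls_per_month / SECONDS_PER_MONTH     # ~33 writes/sec
read_qps = write_qps * read_write_ratio            # ~3,333 reads/sec
peak_read_qps = read_qps * 3                       # ~10,000 reads/sec

# Step 3: storage (3x replication, 20% index overhead)
raw_tb = urls_per_month * 12 * retention_years * record_bytes / 1e12  # 1.8 TB
total_tb = raw_tb * 3 * 1.2                                           # ~6.5 TB

# Step 4: bandwidth (egress of ~500-byte redirect responses)
egress_mb_per_s = read_qps * 500 / 1e6             # ~1.7 MB/s, not a bottleneck
```

Changing one assumption (say, the read:write ratio) and re-running is exactly the "adjust the scope" conversation an interviewer expects.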

07

Trade-offs & Design Decisions

Accuracy vs Speed of Estimation

Dimension     Quick Estimate (Interview)               Detailed Estimate (Production)
Time spent    2-5 minutes                              Days to weeks
Precision     Within 2-5x of reality                   Within 10-20% of reality
Assumptions   Round aggressively (100K sec/day)        Measure actual traffic patterns
Purpose       Guide architecture decisions             Size infrastructure, plan budget
When to use   System design interviews, early design   Capacity planning, procurement

Overestimation vs Underestimation

Overestimation risks

  • Wasted money on unused infrastructure
  • Over-engineered architecture (complexity without need)
  • Slower development (building for scale you don't have)
  • Premature optimization

Underestimation risks

  • System crashes under real load
  • Emergency scaling (expensive, stressful)
  • Data loss if storage runs out
  • Poor user experience (slow responses, timeouts)

💡 The Sweet Spot

Estimate for 3-5x your expected load. This gives you headroom for growth and traffic spikes without massively over-provisioning. Cloud infrastructure makes this easier — you can scale up when needed, so slight underestimation is less catastrophic than it used to be.

Cost vs Performance

Decision      Cheaper Option                     Faster Option
Storage       HDD / S3 Standard ($0.023/GB)      SSD / S3 Express ($0.16/GB)
Caching       No cache (hit DB every time)       Redis cache ($$$, but 100x faster reads)
Replication   Single copy (risk of data loss)    3x replication (3x storage cost, high durability)
CDN           Serve from origin (high latency)   CDN edge delivery (CDN cost, but 10x faster)
Compute       Fewer, larger servers              More, smaller servers (better fault tolerance)
08

Interview Questions

Estimation-based and scenario-based questions you're likely to encounter.

Q: How do you calculate QPS from DAU?

A: QPS = (DAU × actions per user per day) / seconds per day. Use 100,000 seconds per day for easy math (actual is 86,400). Example: 10M DAU, 20 reads per user per day → 10M × 20 / 100K = 2,000 read QPS. Always separate read QPS and write QPS — they have different scaling strategies. Multiply by 2-3x for peak traffic.

Q: Estimate storage for WhatsApp messages

A: Assumptions: 2B users, 100M DAU, 50 messages per user per day, average message 100 bytes. Daily messages: 100M × 50 = 5B messages. Daily storage: 5B × 100 bytes = 500 GB/day. Yearly: 500 GB × 365 = ~180 TB/year. With 3x replication: ~540 TB/year. With media (10% of messages have a 200KB image): 100M × 50 × 0.1 × 200KB = 100 TB/day for images alone. Media dominates storage — text is negligible in comparison.

1

Design a notification system for 50M users

Estimate the QPS and storage requirements

Answer: Assumptions: 50M DAU, each user receives 10 notifications/day, each notification is 500 bytes (title, body, metadata, timestamp). Write QPS: 50M × 10 / 100K = 5,000 writes/sec. Peak: 15,000 writes/sec. Daily storage: 50M × 10 × 500 bytes = 250 GB/day. Yearly: ~91 TB. With 3x replication: ~273 TB. Architecture: write-optimized DB (Cassandra) for notifications, Redis for unread counts, push via WebSockets. Notifications older than 30 days → cold storage.

2

Your image hosting service gets 1M uploads per day

Estimate storage and bandwidth needs

Answer: Assumptions: 1M uploads/day, average image 1MB, store 3 sizes (thumbnail 50KB, medium 300KB, original 1MB). Daily storage: 1M × (50 + 300 + 1000) KB = 1M × 1.35 MB = 1.35 TB/day. Yearly: ~490 TB. With replication: ~1.5 PB/year. Read bandwidth: if each image is viewed 100 times on average, and 80% are thumbnails (50KB): 100M views/day × 50KB / 100K sec = 50 MB/s ≈ 400 Mbps on average, with peaks several times higher, which is too much to serve comfortably from origin. CDN is mandatory — with 95% cache hit rate, origin handles only 2.5 MB/s.

3

You're told the system has 500M DAU

What's the first thing you estimate?

Answer: QPS. 500M DAU is meaningless without knowing actions per user. Ask: 'What does each user do?' If it's a read-heavy feed (20 reads/day): 500M × 20 / 100K = 100K read QPS. That's serious scale — you need caching (Redis cluster), read replicas, CDN, and likely database sharding. If it's a messaging app (50 messages/day): 500M × 50 / 100K = 250K write QPS. That's write-heavy — you need Cassandra or Kafka, not PostgreSQL. The DAU alone doesn't tell you the architecture; the access pattern does.

09

Common Mistakes

These mistakes lead to wrong estimates and bad architecture decisions.

📈

Ignoring peak traffic

Designing for average QPS and wondering why the system crashes at 6 PM. Average QPS is 2,000 but peak is 10,000. If your system handles 3,000, it fails during peak hours — exactly when the most users are online.

Always calculate peak QPS (2-3x average for normal traffic, 5-10x for events like Black Friday). Design your system to handle peak, not average. Use auto-scaling to handle spikes cost-effectively.

💾

Forgetting replication and indexes

Estimating 1 TB of storage and provisioning exactly 1 TB. With 3x replication, you need 3 TB. With indexes (20% overhead), you need 3.6 TB. With growth buffer, you need 5+ TB. Running out of storage in production is a crisis.

Always multiply raw storage by: replication factor (2-3x) × index overhead (1.2x) × growth buffer (1.5-2x). A 1 TB estimate becomes 4-7 TB in practice.

🤷

Unrealistic assumptions

Assuming every user is active 24/7, or that all 1 billion registered users are DAU. If you have 1B registered users, DAU is typically 10-30% (100-300M). Not all users are equally active.

Use realistic ratios: DAU is typically 10-30% of total users. Actions per user vary by product (social media: 20-50 actions/day, e-commerce: 5-10 actions/day). State your assumptions explicitly and ask the interviewer if they're reasonable.

🤐

Not explaining reasoning

Jumping to '10,000 QPS' without showing how you got there. The interviewer can't evaluate your thinking if you just state a number. The process is more important than the answer.

Always show your work: 'We have 10M DAU, each user reads 20 tweets per day, so read QPS = 10M × 20 / 100K = 2,000. With 3x peak multiplier, that's 6,000 peak QPS.' This takes 30 seconds and demonstrates structured thinking.

📊

Forgetting media dominates storage

Carefully estimating text storage (messages, metadata) and ignoring that a single image is 1000x larger than a text message. For any system with user-uploaded media, images and videos will be 95%+ of total storage.

Always ask: 'Does this system handle media?' If yes, estimate media storage separately — it will dwarf everything else. A chat app's text messages might be 500 GB/day, but image attachments could be 50 TB/day.