Latency Reference
Understand the latency hierarchy — from L1 cache to network round trips. Build the intuition for why caching, batching, and data locality drive every system design decision.
The Big Picture — What Is Latency?
Latency is the time it takes for an operation to complete — the delay between asking for something and getting it. In backend systems, latency determines how fast your API responds, how quickly your database returns results, and ultimately how snappy your product feels to users.
The critical insight is that not all operations are created equal. Reading from CPU cache is tens of thousands of times faster than reading from an SSD, and millions of times faster than reading from a spinning disk. A network round trip to another continent is 1,000,000x slower than reading from RAM. These aren't small differences — they're orders of magnitude that fundamentally shape how systems are designed.
The Distance Analogy
Imagine you need to fetch a piece of information. L1 cache is reaching into your pocket — instant, you already have it. L2 cache is grabbing something from your desk — a quick reach. RAM is walking to the bookshelf across the room — a few seconds. SSD is driving to a nearby warehouse — minutes. HDD is driving to a warehouse across town, but the warehouse uses a mechanical crane to find your item — much slower. Network round trip is flying to another city, finding the item, and flying back. Same-region network is a short domestic flight. Cross-continent is an international flight. Every system design decision is about keeping data as close to 'your pocket' as possible.
🔥 Why This Matters
Every system design interview involves latency trade-offs. "Why use a cache?" Because RAM is roughly 160x faster than an SSD for random reads. "Why use a CDN?" Because a nearby edge server answers in a few milliseconds instead of the 50-150 ms a cross-continent round trip costs. "Why denormalize?" To avoid an extra disk read. If you internalize these numbers, every design decision becomes intuitive.
The Latency Hierarchy
Here are the numbers every engineer should know. These are approximate — actual values vary by hardware — but the orders of magnitude are what matter.
| Operation | Time | Relative to L1 |
|---|---|---|
| L1 cache reference | 0.5 ns | 1x (baseline) |
| L2 cache reference | 7 ns | 14x |
| RAM reference | 100 ns | 200x |
| SSD random read | 16 μs | 32,000x |
| HDD random read | 2 ms | 4,000,000x |
| Same-datacenter round trip | 0.5 ms | 1,000,000x |
| Send 1 MB over 1 Gbps network | 10 ms | 20,000,000x |
| HDD sequential read (1 MB) | 2 ms | 4,000,000x |
| SSD sequential read (1 MB) | 0.2 ms | 400,000x |
| Same-region network round trip | 1-5 ms | ~5,000,000x |
| Cross-continent round trip | 50-150 ms | ~200,000,000x |

Key insight: each level is 10-1000x slower than the previous. This is why caching exists at every layer of the stack.
Visualizing the Scale
If an L1 cache reference took 1 second (instead of 0.5 nanoseconds), here's how long everything else would take at the same scale:
- L1 cache reference → 1 second
- L2 cache reference → 14 seconds
- RAM reference → 3.3 minutes
- SSD random read → 8.9 hours
- HDD random read → 46 days
- Same-datacenter round trip → 11.6 days
- Cross-continent round trip → 9.5 years

This is why:

- Databases cache hot data in RAM (avoid 46-day disk reads)
- CDNs exist (avoid 9.5-year cross-continent trips)
- Redis sits in front of everything (3-minute RAM vs 46-day disk)
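The scaled figures come from multiplying each raw latency by the same factor (1 second per 0.5 ns). A minimal Go sketch of that arithmetic, using the values from the table above:

```go
package main

import "fmt"

func main() {
	// Raw latencies in nanoseconds, taken from the table above.
	latencies := []struct {
		name string
		ns   float64
	}{
		{"L1 cache reference", 0.5},
		{"L2 cache reference", 7},
		{"RAM reference", 100},
		{"SSD random read", 16_000},
		{"HDD random read", 2_000_000},
		{"Same-datacenter round trip", 500_000},
		{"Cross-continent round trip", 150_000_000},
	}

	// Scale so that 0.5 ns becomes exactly 1 second.
	const scale = 1.0 / 0.5 // seconds per nanosecond at this scale

	for _, l := range latencies {
		fmt.Printf("%-30s %s\n", l.name, humanize(l.ns*scale))
	}
}

// humanize converts a number of seconds into the largest sensible unit.
func humanize(s float64) string {
	switch {
	case s < 60:
		return fmt.Sprintf("%.1f seconds", s)
	case s < 3600:
		return fmt.Sprintf("%.1f minutes", s/60)
	case s < 86400:
		return fmt.Sprintf("%.1f hours", s/3600)
	case s < 86400*365:
		return fmt.Sprintf("%.1f days", s/86400)
	default:
		return fmt.Sprintf("%.1f years", s/(86400*365))
	}
}
```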
⚡ Fast (nanoseconds)
- L1 cache: 0.5 ns
- L2 cache: 7 ns
- RAM: 100 ns
- These are in-process, no I/O involved
🐢 Slow (micro/milliseconds)
- SSD random read: 16 μs
- HDD random read: 2 ms
- Network (same DC): 0.5 ms
- Network (cross-continent): 50-150 ms
💡 The Compounding Effect
A single 2ms disk read seems harmless. But a page load that triggers 50 database queries, each doing 2 disk reads = 200ms just in disk I/O. Add network latency, serialization, and application logic, and you're at 500ms+. This is why small delays compound catastrophically at scale.
CPU Cache (L1 / L2)
CPU cache is a tiny, ultra-fast memory built directly into the processor. It exists because RAM, despite being fast, is still too slow for the CPU. A modern CPU can execute billions of operations per second, but waiting 100ns for RAM on every operation would waste most of that speed.
Your Pocket vs Your Desk
L1 cache is your pocket — the things you use constantly (phone, keys, wallet) are right there. Instant access. L2 cache is your desk — slightly further, but still within arm's reach. RAM is the bookshelf across the room — you have to get up and walk. The CPU keeps the most frequently accessed data in L1/L2 so it doesn't have to 'walk to the bookshelf' on every operation.
L1 vs L2 — The Basics
| Feature | L1 Cache | L2 Cache |
|---|---|---|
| Latency | ~0.5 ns (1-2 CPU cycles) | ~7 ns (10-20 CPU cycles) |
| Size | 32-64 KB per core | 256 KB - 1 MB per core |
| Location | Inside the CPU core | Adjacent to the CPU core |
| Purpose | Most frequently accessed data | Overflow from L1, still very hot data |
| Analogy | Your pocket | Your desk |
Why This Matters for Code
Cache-friendly (sequential access):

```c
// Elements are adjacent in memory, so the CPU prefetches the next
// elements into cache: almost every access is a cache HIT.
for (int i = 0; i < N; i++)
    sum += array[i];
```

Cache-unfriendly (random access):

```c
// Elements are visited in a scattered order, so the CPU can't predict
// what's needed next: almost every access is a cache MISS that falls through to RAM.
for (int i = 0; i < N; i++)
    sum += array[rand() % N];
```

Performance difference: 5-10x for large arrays. This is why arrays are faster than linked lists for iteration — array elements are contiguous in memory (cache-friendly).
Cache-Friendly Patterns
- ✅ Sequential array access (iterate in order)
- ✅ Small, contiguous data structures
- ✅ Struct of arrays (SoA) over array of structs (AoS); see the sketch after this list
- ✅ Keeping hot data small (fits in cache)
- ✅ Batch processing (process all data, then move on)
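To make the SoA-over-AoS item concrete, here is a minimal Go sketch; the Particle type and its fields are invented for illustration. Summing one field of an array of structs drags every unused field through the cache, while the struct-of-arrays layout keeps the values actually being read contiguous.

```go
package main

import "fmt"

// Array of structs (AoS): the four fields of each particle sit together,
// so summing only X still pulls Y, Z and Mass into cache alongside it.
type ParticleAoS struct {
	X, Y, Z, Mass float64
}

// Struct of arrays (SoA): all X values are contiguous, so a pass over X
// touches roughly a quarter of the cache lines the AoS layout would.
type ParticlesSoA struct {
	X, Y, Z, Mass []float64
}

func sumXAoS(ps []ParticleAoS) float64 {
	var sum float64
	for i := range ps {
		sum += ps[i].X
	}
	return sum
}

func sumXSoA(ps *ParticlesSoA) float64 {
	var sum float64
	for _, x := range ps.X {
		sum += x
	}
	return sum
}

func main() {
	const n = 1_000_000
	aos := make([]ParticleAoS, n)
	soa := &ParticlesSoA{
		X: make([]float64, n), Y: make([]float64, n),
		Z: make([]float64, n), Mass: make([]float64, n),
	}
	for i := 0; i < n; i++ {
		aos[i].X, soa.X[i] = 1, 1
	}
	fmt.Println(sumXAoS(aos), sumXSoA(soa)) // same result, different cache behavior
}
```

The difference only shows up once the working set is much larger than L1/L2, which is exactly when cache behavior starts to dominate.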
Cache-Unfriendly Patterns
- ❌ Random access across large datasets
- ❌ Pointer-heavy structures (linked lists, trees)
- ❌ Large objects with rarely-used fields
- ❌ Frequent context switches between data
- ❌ Hash tables with poor locality
🎯 Interview Insight
You won't be asked to optimize L1 cache in a system design interview. But understanding that in-memory operations are orders of magnitude faster than disk or network explains why every system uses caching. When you say "I'd add a Redis cache here," you're moving data from the "bookshelf" (disk) to the "desk" (RAM).
RAM vs SSD vs HDD
These three storage types form the backbone of every server. Understanding their latency differences explains why databases are designed the way they are — buffer pools, write-ahead logs, page caches, and the entire caching industry.
Desk → Cupboard → Warehouse
RAM is your desk — everything you're actively working on is right there. Fast to grab, but limited space (and expensive). SSD is the cupboard in the next room — you have to walk there, but it's organized and reasonably quick. Much more space than your desk. HDD is a warehouse across town — massive storage, very cheap, but getting something requires driving there and waiting for a forklift to find your item on a spinning shelf. Every database tries to keep 'hot' data on the desk (RAM) and only goes to the cupboard (SSD) or warehouse (HDD) when necessary.
| Feature | RAM | SSD | HDD |
|---|---|---|---|
| Random Read Latency | ~100 ns | ~16 μs (160x slower) | ~2 ms (20,000x slower) |
| Sequential Read (1 MB) | ~3 μs | ~0.2 ms | ~2 ms |
| Cost per GB (approx) | $3-8 | $0.10-0.30 | $0.02-0.05 |
| Capacity (typical server) | 64-512 GB | 1-8 TB | 4-16 TB |
| Durability | Volatile (lost on power off) | Persistent | Persistent |
| Moving Parts | None | None (flash chips) | Yes (spinning platters + arm) |
| Analogy | Your desk | Nearby cupboard | Warehouse across town |
Why HDD Is So Slow
An HDD has a spinning magnetic platter and a mechanical arm that moves to the right position to read data. This physical movement (seek time) takes 2-10ms. An SSD has no moving parts — it reads from flash memory chips electronically. RAM has no I/O at all — it's directly wired to the CPU bus.
```
PostgreSQL query: SELECT * FROM users WHERE id = 42

Without caching (cold start):
  1. Parse query                      → ~0.1 ms
  2. Plan execution                   → ~0.1 ms
  3. Read index from disk (SSD)       → ~0.016 ms
  4. Read data page from disk (SSD)   → ~0.016 ms
  5. Return result                    → ~0.1 ms
  Total: ~0.3 ms

With buffer pool (hot data in RAM):
  1. Parse query                      → ~0.1 ms
  2. Plan execution                   → ~0.1 ms
  3. Read index from RAM              → ~0.0001 ms
  4. Read data page from RAM          → ~0.0001 ms
  5. Return result                    → ~0.1 ms
  Total: ~0.3 ms → but steps 3-4 are 160x faster

At 10,000 queries/second, this difference is:
  Disk: 10,000 × 0.032 ms  = 320 ms of disk I/O per second
  RAM:  10,000 × 0.0002 ms = 2 ms of RAM reads per second
```

This is why PostgreSQL's shared_buffers (RAM cache) is critical. This is why Redis exists — keep hot data in RAM, skip disk entirely.
Database Buffer Pool
PostgreSQL, MySQL, and every major database keeps frequently accessed pages in RAM. The buffer pool is the #1 performance lever — more RAM = fewer disk reads = faster queries.
OS Page Cache
The operating system caches recently read disk pages in unused RAM. Even without database-level caching, the OS tries to keep hot data in memory.
Application Cache (Redis)
For data that's read far more than written, Redis keeps it in RAM with sub-millisecond access. Eliminates database queries entirely for cached data.
🎯 Interview Insight
When an interviewer asks "why is this query slow?" — the first question is: "Is the data in RAM or on disk?" If the working set fits in RAM (buffer pool), queries are fast. If it doesn't, every query triggers disk I/O — 160x slower on SSD, 20,000x slower on HDD. The fix is usually: add more RAM, add a cache layer, or reduce the working set size.
Network Round Trips
A network round trip is the time for a request to travel from the client to the server and for the response to travel back. It's the single largest source of latency in most distributed systems — and the hardest to eliminate because it's bounded by the speed of light.
Sending a Letter and Waiting for a Reply
A network round trip is like mailing a letter and waiting for a reply. Same-datacenter is like sending a letter across the office — it arrives in minutes. Same-region is like mailing across the city — a few hours. Cross-continent is like international mail — days. You can't make the mail truck faster (speed of light), but you can: send fewer letters (batching), keep a copy of common replies (caching), or move closer to the recipient (CDN/edge).
| Route | Latency | Analogy | Example |
|---|---|---|---|
| Same machine (localhost) | ~0.01 ms | Talking to yourself | App → local Redis |
| Same datacenter | ~0.5 ms | Across the office | App server → database |
| Same region (e.g., us-east) | ~1-5 ms | Across the city | Service A → Service B |
| Cross-continent | ~50-150 ms | International mail | US user → EU server |
| Satellite / remote | ~500-600 ms | Space mail | Ground → satellite → ground |
Why Network Calls Are the Bottleneck
User loads a product page. The API gateway calls:

```
1. User Service      → 2 ms (same DC)
2. Product Service   → 3 ms (same DC)
3. Inventory Service → 2 ms (same DC)
4. Pricing Service   → 2 ms (same DC)
5. Review Service    → 4 ms (same DC)

Sequential calls: 2 + 3 + 2 + 2 + 4 = 13 ms in network alone
Plus processing time in each service: ~5 ms each = 25 ms
Total: ~38 ms

Now add a database query in each service (2 ms each): +10 ms
Total: ~48 ms for ONE page load

With 3 levels of microservice depth (service calls service calls service),
network hops multiply: 10 hops × 2 ms = 20 ms just in network.
```

This is why:

- Parallel calls (fan-out) reduce sequential latency
- Caching eliminates repeated calls
- Batching combines multiple calls into one
- Data locality (keep data close) reduces hops
Reducing Network Latency
Caching
Cache responses from other services. If the product data hasn't changed in 5 minutes, serve it from Redis instead of making a network call to the Product Service.
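A minimal cache-aside sketch in Go, under two stated assumptions: an in-process map stands in for Redis, and fetchProductFromService is a made-up stand-in for the real network call to the Product Service.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry is a cached value plus the time it expires.
type entry struct {
	value   string
	expires time.Time
}

// An in-process stand-in for Redis: a map guarded by a mutex.
var (
	mu    sync.Mutex
	cache = map[string]entry{}
)

// fetchProductFromService is a hypothetical call to the Product Service.
func fetchProductFromService(id int) string {
	time.Sleep(3 * time.Millisecond) // simulate a same-DC round trip + DB query
	return fmt.Sprintf(`{"id": %d, "name": "example product"}`, id)
}

// getProduct implements cache-aside: check the cache, fall back to the
// service on a miss, and populate the cache with a 60-second TTL.
func getProduct(id int) string {
	key := fmt.Sprintf("product:%d", id)

	mu.Lock()
	e, ok := cache[key]
	mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.value // cache HIT: no network call
	}

	value := fetchProductFromService(id) // cache MISS: one slow call
	mu.Lock()
	cache[key] = entry{value: value, expires: time.Now().Add(60 * time.Second)}
	mu.Unlock()
	return value
}

func main() {
	start := time.Now()
	getProduct(42) // miss: pays the network latency
	fmt.Println("first read: ", time.Since(start))

	start = time.Now()
	getProduct(42) // hit: served from memory
	fmt.Println("second read:", time.Since(start))
}
```

In production the map would be a shared Redis instance, but the flow is the same: check the cache, fall back to the service on a miss, and write the result back with a TTL.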
Batching
Instead of 50 individual requests to the User Service (one per user ID), send one batch request with all 50 IDs. One round trip instead of fifty.
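A sketch of the difference, with a hypothetical User Service client whose round trip is simulated by a 2 ms sleep: fifty individual calls pay fifty round trips, one batch call pays one.

```go
package main

import (
	"fmt"
	"time"
)

const roundTrip = 2 * time.Millisecond // simulated same-DC round trip

// getUser fetches one user: one network round trip per call (hypothetical API).
func getUser(id int) string {
	time.Sleep(roundTrip)
	return fmt.Sprintf("user-%d", id)
}

// getUsersBatch fetches many users in a single request: one round trip total.
func getUsersBatch(ids []int) []string {
	time.Sleep(roundTrip)
	users := make([]string, len(ids))
	for i, id := range ids {
		users[i] = fmt.Sprintf("user-%d", id)
	}
	return users
}

func main() {
	ids := make([]int, 50)
	for i := range ids {
		ids[i] = i
	}

	start := time.Now()
	for _, id := range ids {
		getUser(id) // 50 sequential round trips ≈ 100 ms
	}
	fmt.Println("one call per ID:", time.Since(start))

	start = time.Now()
	getUsersBatch(ids) // 1 round trip ≈ 2 ms
	fmt.Println("single batch:   ", time.Since(start))
}
```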
Parallel Calls
If calls to User Service and Product Service are independent, make them simultaneously. Total latency = max(2ms, 3ms) = 3ms instead of 2 + 3 = 5ms.
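A minimal Go sketch of that fan-out, with the two calls simulated by sleeps: a WaitGroup waits for both goroutines, so the elapsed time tracks the slower call rather than the sum.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical service calls, simulated with sleeps.
func fetchUser() string    { time.Sleep(2 * time.Millisecond); return "user" }
func fetchProduct() string { time.Sleep(3 * time.Millisecond); return "product" }

func main() {
	var (
		wg            sync.WaitGroup
		user, product string
	)

	start := time.Now()
	wg.Add(2)
	go func() { defer wg.Done(); user = fetchUser() }()
	go func() { defer wg.Done(); product = fetchProduct() }()
	wg.Wait() // total ≈ max(2 ms, 3 ms), not 2 + 3

	fmt.Println(user, product, "in", time.Since(start))
}
```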
Data Locality
Keep data close to where it's needed. CDN for static assets. Read replicas in each region. Edge computing for latency-sensitive logic.
🎯 Interview Insight
In every system design interview, count the network round trips. If your design requires 10 sequential service calls to serve one request, that's a red flag. Interviewers want to see you reduce round trips through caching, batching, parallel calls, and denormalization.
End-to-End Example
Let's trace what happens when a user loads a product page, and where latency is introduced at every layer.
User clicks: https://shop.example.com/products/42

```
 1. DNS Resolution                     ~1 ms
    Browser cache → OS cache → recursive resolver
    (cached after first visit: 0 ms)
 2. TCP + TLS Handshake                ~30 ms
    Multiple round trips to establish a secure connection
    (reused on subsequent requests: 0 ms)
 3. HTTP request travels to server     ~20 ms
    User in NYC → server in us-east-1 (Virginia)
 4. Load balancer → app server         ~0.5 ms
    Same-datacenter hop
 5. App server checks Redis cache      ~0.2 ms
    GET product:42 → cache HIT? Return cached JSON
    (if HIT: skip steps 6-8; running total so far: ~52 ms)
 6. Cache MISS → query PostgreSQL      ~2 ms
    SELECT * FROM products WHERE id = 42
    Data is in the buffer pool (RAM) → fast
 7. Query Reviews Service              ~3 ms
    GET /api/reviews?product_id=42
    Same-datacenter network call + DB query
 8. Serialize response to JSON         ~0.1 ms
    CPU-bound, data is in L1/L2 cache
 9. Response travels back to user      ~20 ms
    Server → user's browser
10. Browser renders the page           ~50 ms
    Parse HTML, fetch CSS/JS (cached), render DOM

TOTAL (cache miss):   ~127 ms
TOTAL (cache hit):    ~122 ms (steps 6-8 eliminated)
TOTAL (repeat visit): ~91 ms (DNS + TLS reused, Redis hit)
```
Where Optimization Happens
CDN eliminates network latency
Static assets (CSS, JS, images) served from a CDN edge in NYC instead of Virginia. Saves ~40ms round trip. Dynamic content can also be cached at the edge for public pages.
Redis eliminates disk I/O
Product data cached in Redis (RAM). Reads take 0.2ms instead of 2ms from PostgreSQL. At 10K requests/second, this saves 18 seconds of cumulative latency per second.
Connection pooling eliminates handshakes
Keep persistent connections to the database and between services. Eliminates TCP/TLS handshake overhead on every request. A connection pool of 20 connections serves thousands of queries.
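A sketch of pool configuration using Go's database/sql, assuming a Postgres driver such as github.com/lib/pq; the DSN and table name are placeholders.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // one possible Postgres driver; any registered driver works
)

func main() {
	// Placeholder DSN; replace with your real connection string.
	db, err := sql.Open("postgres", "postgres://app:secret@db.internal:5432/shop?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// A small pool of long-lived connections serves thousands of queries
	// without re-doing the TCP/TLS handshake on every request.
	db.SetMaxOpenConns(20)                  // upper bound on concurrent connections
	db.SetMaxIdleConns(20)                  // keep them warm instead of closing them
	db.SetConnMaxLifetime(30 * time.Minute) // recycle occasionally to avoid stale connections

	// Every query borrows a pooled connection and returns it when done.
	var name string
	if err := db.QueryRow("SELECT name FROM products WHERE id = $1", 42).Scan(&name); err != nil {
		log.Fatal(err)
	}
	log.Println("product:", name)
}
```

Pool size is a tuning knob: too small and requests queue waiting for a connection, too large and the database spends its time juggling sessions.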
Parallel calls reduce sequential latency
Fetch product data and reviews simultaneously instead of sequentially. Total: max(2ms, 3ms) = 3ms instead of 2 + 3 = 5ms.
💡 The 80/20 Rule of Latency
Network round trips and disk I/O account for 80%+ of latency in most systems. CPU time (parsing, serialization, business logic) is usually negligible. When optimizing, always start with: "Can I eliminate a network call?" and "Can I serve this from RAM instead of disk?"
Trade-offs & Design Decisions
Every latency optimization involves a trade-off. Faster access means higher cost, less capacity, or weaker consistency.
| Trade-off | Faster Option | Slower Option | When to Choose Faster |
|---|---|---|---|
| Memory vs Cost | RAM ($5/GB) | SSD ($0.20/GB) | Hot data accessed thousands of times/second |
| Speed vs Capacity | Redis (64 GB) | PostgreSQL (1 TB) | Working set fits in RAM, read-heavy workload |
| Caching vs Consistency | Serve stale data (0.2ms) | Query DB for fresh data (2ms) | Data can be 5-60 seconds stale (product pages, feeds) |
| Denormalization vs Simplicity | One read, duplicated data | JOIN across tables | Read-heavy, latency-sensitive paths |
| Precomputation vs Freshness | Pre-built aggregates (0.1ms) | Compute on request (50ms) | Dashboards, analytics, leaderboards |
Why Systems Prefer These Patterns
Caching
Move data from disk (ms) to RAM (ns). Trade: stale data for 100x speed. Used everywhere — browser cache, CDN, Redis, database buffer pool, CPU cache.
Denormalization
Store data redundantly to avoid JOINs. Trade: storage space and update complexity for single-read performance. Used in NoSQL, read-heavy SQL tables.
Precomputation
Calculate results ahead of time instead of on every request. Trade: freshness for instant reads. Used for dashboards, search indexes, materialized views.
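A minimal sketch of the idea, using an invented revenue example: the write path updates a running aggregate, so the read path never has to scan the raw records.

```go
package main

import "fmt"

// Order is a single purchase; the dashboard wants total revenue.
type Order struct{ Amount float64 }

var (
	orders       []Order // source of truth (in a real system: the database)
	totalRevenue float64 // precomputed aggregate, updated on every write
)

// recordOrder is the write path: store the order AND update the aggregate.
func recordOrder(o Order) {
	orders = append(orders, o)
	totalRevenue += o.Amount // a little extra work per write
}

// revenueOnRequest is the slow read path: recompute the aggregate every time.
// With millions of orders on disk, this is the "compute on request" case.
func revenueOnRequest() float64 {
	var sum float64
	for _, o := range orders {
		sum += o.Amount
	}
	return sum
}

// revenuePrecomputed is the fast read path: the answer is already there.
func revenuePrecomputed() float64 {
	return totalRevenue
}

func main() {
	for i := 0; i < 1_000_000; i++ {
		recordOrder(Order{Amount: 9.99})
	}
	fmt.Println(revenueOnRequest(), revenuePrecomputed()) // same number, very different cost
}
```

In a real system the aggregate usually lives in its own table or a materialized view and is refreshed asynchronously, which is where the freshness trade-off comes from.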
🎯 Interview Framework
When making any design decision, state the latency trade-off: "I'd cache this in Redis because the data is read 1000x more than it's written. The trade-off is serving stale data for up to 60 seconds, which is acceptable for product listings but not for inventory counts."
Interview Questions
These questions test whether you've internalized the latency hierarchy and can apply it to design decisions.
Q: Why is network latency more expensive than disk latency?
A: A same-datacenter network round trip (~0.5ms) sits between an SSD random read (~16μs) and an HDD random read (~2ms), so a spinning disk can actually be slower than a local network call. But the real cost of network calls is: (1) They're sequential by default — each call blocks until the response arrives. (2) They compound — a microservice calling 5 other services adds 5 round trips. (3) They're unreliable — timeouts, retries, and failures add tail latency. (4) Cross-region calls are 50-150ms — orders of magnitude slower than any local operation. The key insight: a single network call isn't expensive, but systems make thousands of them per request.
Q: Why do we cache data instead of just reading from the database?
A: A database query involves: network round trip to the DB server (~0.5ms), query parsing and planning (~0.1ms), disk I/O if data isn't in buffer pool (~2ms for SSD), and response serialization (~0.1ms). Total: 0.7-2.7ms. A Redis cache read involves: network round trip to Redis (~0.2ms) and memory lookup (~0.001ms). Total: ~0.2ms. That's 3-13x faster. At 10,000 requests/second, caching saves 5-25 seconds of cumulative latency per second. Plus, it reduces database load, allowing the DB to handle writes and complex queries.
Q: What is the cost of a database query?
A: It depends on whether the data is in RAM or on disk. Best case (buffer pool hit): ~0.5ms — network to DB + RAM read + response. Typical case (index scan, some disk): ~2-5ms — network + index lookup + 1-2 disk reads + response. Worst case (full table scan, cold cache): ~100ms+ — scanning millions of rows from disk. The lesson: design your queries to hit indexes (avoid full scans), size your buffer pool to fit the working set (avoid disk), and cache hot results in Redis (avoid the query entirely).
Your API has a p99 latency of 2 seconds
How would you diagnose and fix this?
Answer: Start by tracing where time is spent: (1) Network — are there sequential service calls that could be parallelized? (2) Database — are queries hitting disk instead of buffer pool? Check slow query logs. (3) Missing cache — is the same data being fetched repeatedly? Add Redis. (4) N+1 queries — is the ORM making 100 queries instead of 1? Use eager loading or batching. (5) External calls — is a third-party API slow? Add a timeout and cache. The p99 (99th percentile) is usually caused by occasional disk I/O, garbage collection pauses, or network timeouts — not the average case.
You need to serve 100,000 reads per second with < 5ms latency
How would you architect this?
Answer: At 100K reads/sec with < 5ms, you can't hit disk on every request. Architecture: (1) Redis cluster as the primary read path — sub-millisecond reads from RAM. (2) PostgreSQL as the source of truth — writes go here. (3) Cache-aside pattern — read from Redis, on miss read from DB and populate cache. (4) TTL of 30-60 seconds — acceptable staleness for most read-heavy workloads. (5) Multiple Redis replicas — distribute read load. The math: 100K × 0.2ms (Redis) = 20 seconds of cumulative latency/sec. 100K × 2ms (DB) = 200 seconds — impossible without caching.
Common Mistakes
These mistakes come from not internalizing the latency hierarchy. Each one has caused real production incidents.
Assuming all operations take similar time
A developer treats a Redis read (0.2ms), a database query (2ms), and a cross-service network call (5ms) as roughly equivalent. They make 20 sequential service calls in a request handler and wonder why latency is 100ms+. The 25x difference between Redis and a network call, multiplied by 20 calls, is the entire problem.
✅ Know the orders of magnitude. RAM is 100ns, SSD is 16μs (160x slower), network is 500μs (5,000x slower). Count your network hops and disk reads. If a request makes more than 3-5 sequential network calls, redesign with caching, batching, or parallel calls.
Overusing network calls in hot paths
A microservice architecture where every request fans out to 8 services sequentially. Each call is 'only' 3ms, but 8 × 3ms = 24ms just in network latency — before any processing. Add database queries in each service and you're at 50ms+.
✅ Identify the hot path (the most common request flow) and minimize network hops. Cache aggressively. Make independent calls in parallel. Consider combining frequently-co-called services. Use async processing for non-critical work.
Not caching frequently accessed data
A product page queries the database on every request — even though the product data changes once a day. At 10K requests/second, that's 10K unnecessary database queries per second, each taking 2ms. The database is overloaded, latency spikes, and the team adds more database replicas instead of a cache.
✅ If data is read 100x more than it's written, cache it. Redis with a 60-second TTL eliminates 99% of database reads. The rule of thumb: if you're querying the same data more than once per second, it should be cached.
Ignoring the compounding effect
'Each query is only 2ms, that's fine.' But a page load triggers 50 queries (ORM eager loading, N+1 problems, multiple tables). 50 × 2ms = 100ms just in database time. Add serialization, network, and rendering — the page takes 300ms. Users perceive it as slow.
✅ Profile the full request path, not individual operations. Use distributed tracing (Jaeger, Datadog) to see where time is spent. Optimize the total, not the parts. Often, reducing 50 queries to 5 (via JOINs, batching, or caching) has more impact than making each query 10% faster.