
Latency Reference

Understand the latency hierarchy — from L1 cache to network round trips. Build the intuition for why caching, batching, and data locality drive every system design decision.

25 min read · 9 sections
01

The Big Picture — What Is Latency?

Latency is the time it takes for an operation to complete — the delay between asking for something and getting it. In backend systems, latency determines how fast your API responds, how quickly your database returns results, and ultimately how snappy your product feels to users.

The critical insight is that not all operations are created equal. Reading from CPU cache is tens of thousands of times faster than reading from an SSD, and millions of times faster than a spinning disk. A cross-continent network round trip is 1,000,000x slower than reading from RAM. These aren't small differences — they're orders of magnitude that fundamentally shape how systems are designed.

🏃

The Distance Analogy

Imagine you need to fetch a piece of information. L1 cache is reaching into your pocket — instant, you already have it. L2 cache is grabbing something from your desk — a quick reach. RAM is walking to the bookshelf across the room — a few seconds. SSD is driving to a nearby warehouse — minutes. HDD is driving to a warehouse across town, but the warehouse uses a mechanical crane to find your item — much slower. Network round trip is flying to another city, finding the item, and flying back. Same-region network is a short domestic flight. Cross-continent is an international flight. Every system design decision is about keeping data as close to 'your pocket' as possible.

🔥 Why This Matters

Every system design interview involves latency trade-offs. "Why use a cache?" Because RAM is ~160x faster than SSD for random reads. "Why use a CDN?" Because a nearby server is 10x faster than a distant one. "Why denormalize?" To avoid an extra disk read. If you internalize these numbers, every design decision becomes intuitive.

02

The Latency Hierarchy

Here are the numbers every engineer should know. These are approximate — actual values vary by hardware — but the orders of magnitude are what matter.

Latency Numbers Every Engineer Should Know
Operation                              Time          Relative

L1 cache reference                     0.5 ns        1x (baseline)
L2 cache reference                     7 ns          14x
RAM reference                          100 ns        200x
SSD random read                        16 μs         32,000x
SSD sequential read (1 MB)             0.2 ms        400,000x
Same-datacenter round trip             0.5 ms        1,000,000x
Same-region network round trip         1-5 ms        ~5,000,000x
HDD random read                        2 ms          4,000,000x
HDD sequential read (1 MB)             2 ms          4,000,000x
Send 1 MB over 1 Gbps network          10 ms         20,000,000x
Cross-continent round trip             50-150 ms     ~200,000,000x

Key insight: each level is 10-1000x slower than the previous.
This is why caching exists at every layer of the stack.

Visualizing the Scale

If an L1 cache reference took 1 second (instead of 0.5 nanoseconds), here's how long everything else would take at the same scale:

If L1 Cache = 1 Second...
L1 cache reference           1 second
L2 cache reference           14 seconds
RAM reference                3.3 minutes
SSD random read              8.9 hours
HDD random read              46 days
Same-datacenter round trip   11.6 days
Cross-continent round trip   9.5 YEARS

This is why:
  • Databases cache hot data in RAM (avoid 46-day disk reads)
  • CDNs exist (avoid 9.5-year cross-continent trips)
  • Redis sits in front of everything (3-minute RAM vs 46-day disk)
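The scaled numbers above can be reproduced with a short script; the raw values are taken straight from the hierarchy table:

```python
# Rescale each latency so that an L1 cache reference takes 1 second.
L1_NS = 0.5  # nanoseconds, the baseline

latencies_ns = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "RAM reference": 100,
    "SSD random read": 16_000,
    "HDD random read": 2_000_000,
    "Same-datacenter round trip": 500_000,
    "Cross-continent round trip": 150_000_000,
}

for name, ns in latencies_ns.items():
    scaled_s = ns / L1_NS                 # seconds at the "L1 = 1 second" scale
    scaled_days = scaled_s / 86_400
    print(f"{name:28s} {scaled_s:>13,.0f} s  (~{scaled_days:,.1f} days)")
```

Running it confirms the table: an HDD random read scales to 4,000,000 seconds (~46 days), and a 150 ms cross-continent round trip to ~9.5 years.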

⚡ Fast (nanoseconds)

  • L1 cache: 0.5 ns
  • L2 cache: 7 ns
  • RAM: 100 ns
  • These are in-process, no I/O involved

🐢 Slow (micro/milliseconds)

  • SSD random read: 16 μs
  • HDD random read: 2 ms
  • Network (same DC): 0.5 ms
  • Network (cross-continent): 50-150 ms

💡 The Compounding Effect

A single 2ms disk read seems harmless. But a page load that triggers 50 database queries, each doing 2 disk reads = 200ms just in disk I/O. Add network latency, serialization, and application logic, and you're at 500ms+. This is why small delays compound catastrophically at scale.
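The arithmetic behind that compounding effect is worth making explicit:

```python
# Back-of-envelope latency budget for one page load (numbers from the text).
disk_read_ms = 2          # one disk-bound random read
queries = 50              # database queries triggered by the page
reads_per_query = 2       # disk reads per query

disk_io_ms = queries * reads_per_query * disk_read_ms
print(f"disk I/O alone: {disk_io_ms} ms")   # before network, serialization, logic
```

200 ms of pure disk I/O, and the page hasn't done any application work yet.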

03

CPU Cache (L1 / L2)

CPU cache is a tiny, ultra-fast memory built directly into the processor. It exists because RAM, despite being fast, is still too slow for the CPU. A modern CPU can execute billions of operations per second, but waiting 100ns for RAM on every operation would waste most of that speed.

👖

Your Pocket vs Your Desk

L1 cache is your pocket — the things you use constantly (phone, keys, wallet) are right there. Instant access. L2 cache is your desk — slightly further, but still within arm's reach. RAM is the bookshelf across the room — you have to get up and walk. The CPU keeps the most frequently accessed data in L1/L2 so it doesn't have to 'walk to the bookshelf' on every operation.

L1 vs L2 — The Basics

Feature     L1 Cache                        L2 Cache
Latency     ~0.5 ns (1-2 CPU cycles)        ~7 ns (10-20 CPU cycles)
Size        32-64 KB per core               256 KB - 1 MB per core
Location    Inside the CPU core             Adjacent to the CPU core
Purpose     Most frequently accessed data   Overflow from L1, still very hot data
Analogy     Your pocket                     Your desk

Why This Matters for Code

Cache-Friendly vs Cache-Unfriendly Code
Cache-friendly (sequential access):
  for (int i = 0; i < N; i++)
    sum += array[i];          // Elements are adjacent in memory
                               // CPU prefetches next elements into cache
                               // Almost every access is a cache HIT

Cache-unfriendly (random access):
  for (int i = 0; i < N; i++)
    sum += array[random()];   // Elements are scattered in memory
                               // CPU can't predict what's needed next
                               // Almost every access is a cache MISS → RAM

Performance difference: 5-10x for large arrays.
This is why arrays are faster than linked lists for iteration:
array elements are contiguous in memory (cache-friendly).
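A version of this effect is observable even from Python, though interpreter overhead shrinks the gap compared with C. The two traversals below do identical work; only the memory access order differs:

```python
import random
import time

N = 2_000_000
data = list(range(N))            # int objects allocated roughly in order
seq_idx = list(range(N))         # sequential order
rnd_idx = seq_idx[:]
random.shuffle(rnd_idx)          # same indices, scattered order

def timed_sum(indices):
    start = time.perf_counter()
    total = 0
    for i in indices:
        total += data[i]
    return total, time.perf_counter() - start

sum_seq, t_seq = timed_sum(seq_idx)
sum_rnd, t_rnd = timed_sum(rnd_idx)
assert sum_seq == sum_rnd        # identical result, different access pattern
print(f"sequential: {t_seq:.2f}s   random: {t_rnd:.2f}s")
```

On typical hardware the random-order pass runs noticeably slower, because each scattered access is likely a cache miss that falls through to RAM.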

Cache-Friendly Patterns

  • Sequential array access (iterate in order)
  • Small, contiguous data structures
  • Struct of arrays (SoA) over array of structs (AoS)
  • Keeping hot data small (fits in cache)
  • Batch processing (process all data, then move on)

Cache-Unfriendly Patterns

  • Random access across large datasets
  • Pointer-heavy structures (linked lists, trees)
  • Large objects with rarely-used fields
  • Frequent context switches between data
  • Hash tables with poor locality

🎯 Interview Insight

You won't be asked to optimize L1 cache in a system design interview. But understanding that in-memory operations are orders of magnitude faster than disk or network explains why every system uses caching. When you say "I'd add a Redis cache here," you're moving data from the "bookshelf" (disk) to the "desk" (RAM).

04

RAM vs SSD vs HDD

These three storage types form the backbone of every server. Understanding their latency differences explains why databases are designed the way they are — buffer pools, write-ahead logs, page caches, and the entire caching industry.

🏠

Desk → Cupboard → Warehouse

RAM is your desk — everything you're actively working on is right there. Fast to grab, but limited space (and expensive). SSD is the cupboard in the next room — you have to walk there, but it's organized and reasonably quick. Much more space than your desk. HDD is a warehouse across town — massive storage, very cheap, but getting something requires driving there and waiting for a forklift to find your item on a spinning shelf. Every database tries to keep 'hot' data on the desk (RAM) and only goes to the cupboard (SSD) or warehouse (HDD) when necessary.

Feature                   RAM                            SSD                        HDD
Random Read Latency       ~100 ns                        ~16 μs (160x slower)       ~2 ms (20,000x slower)
Sequential Read (1 MB)    ~3 μs                          ~0.2 ms                    ~2 ms
Cost per GB (approx)      $3-8                           $0.10-0.30                 $0.02-0.05
Capacity (typical server) 64-512 GB                      1-8 TB                     4-16 TB
Durability                Volatile (lost on power off)   Persistent                 Persistent
Moving Parts              None                           None (flash chips)         Yes (spinning platters + arm)
Analogy                   Your desk                      Nearby cupboard            Warehouse across town

Why HDD Is So Slow

An HDD has a spinning magnetic platter and a mechanical arm that moves to the right position to read data. This physical movement (seek time) takes 2-10ms. An SSD has no moving parts — it reads from flash memory chips electronically. RAM has no I/O at all — it's directly wired to the CPU bus.

Why Databases Cache Everything in RAM
PostgreSQL query: SELECT * FROM users WHERE id = 42

Without caching (cold start):
  1. Parse query                    → ~0.1 ms
  2. Plan execution                 → ~0.1 ms
  3. Read index from disk (SSD)     → ~0.016 ms
  4. Read data page from disk (SSD) → ~0.016 ms
  5. Return result                  → ~0.1 ms
  Total: ~0.3 ms

With buffer pool (hot data in RAM):
  1. Parse query                    → ~0.1 ms
  2. Plan execution                 → ~0.1 ms
  3. Read index from RAM            → ~0.0001 ms
  4. Read data page from RAM        → ~0.0001 ms
  5. Return result                  → ~0.1 ms
  Total: ~0.3 ms (but steps 3-4 are 160x faster)

At 10,000 queries/second, this difference is:
  Disk: 10,000 × 0.032ms = 320ms of disk I/O per second
  RAM:  10,000 × 0.0002ms = 2ms of RAM reads per second

This is why PostgreSQL's shared_buffers (RAM cache) is critical.
This is why Redis exists: keep hot data in RAM, skip disk entirely.

🗄️

Database Buffer Pool

PostgreSQL, MySQL, and every major database keeps frequently accessed pages in RAM. The buffer pool is the #1 performance lever — more RAM = fewer disk reads = faster queries.

💻

OS Page Cache

The operating system caches recently read disk pages in unused RAM. Even without database-level caching, the OS tries to keep hot data in memory.

Application Cache (Redis)

For data that's read far more than written, Redis keeps it in RAM with sub-millisecond access. Eliminates database queries entirely for cached data.
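The value of any of these cache layers comes down to hit rate. A sketch of the expected read latency, using the ~0.2 ms Redis and ~2 ms database figures from this section:

```python
def avg_read_ms(hit_rate, cache_ms=0.2, db_ms=2.0):
    """Expected latency of a cached read path: hits stay in RAM,
    misses pay the full database round trip."""
    return hit_rate * cache_ms + (1 - hit_rate) * db_ms

for hr in (0.0, 0.50, 0.90, 0.99):
    print(f"hit rate {hr:4.0%}: {avg_read_ms(hr):.2f} ms average")
```

At a 99% hit rate the average read costs ~0.22 ms, roughly 9x better than always querying the database; at 50% the cache only buys you about half the gap.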

🎯 Interview Insight

When an interviewer asks "why is this query slow?" — the first question is: "Is the data in RAM or on disk?" If the working set fits in RAM (buffer pool), queries are fast. If it doesn't, every query triggers disk I/O — 160x slower on SSD, 20,000x slower on HDD. The fix is usually: add more RAM, add a cache layer, or reduce the working set size.

05

Network Round Trips

A network round trip is the time for a request to travel from the client to the server and for the response to travel back. It's the single largest source of latency in most distributed systems — and the hardest to eliminate because it's bounded by the speed of light.

✉️

Sending a Letter and Waiting for a Reply

A network round trip is like mailing a letter and waiting for a reply. Same-datacenter is like sending a letter across the office — it arrives in minutes. Same-region is like mailing across the city — a few hours. Cross-continent is like international mail — days. You can't make the mail truck faster (speed of light), but you can: send fewer letters (batching), keep a copy of common replies (caching), or move closer to the recipient (CDN/edge).

Route                        Latency       Analogy               Example
Same machine (localhost)     ~0.01 ms      Talking to yourself   App → local Redis
Same datacenter              ~0.5 ms       Across the office     App server → database
Same region (e.g., us-east)  ~1-5 ms       Across the city       Service A → Service B
Cross-continent              ~50-150 ms    International mail    US user → EU server
Satellite / remote           ~500-600 ms   Space mail            Ground → satellite → ground

Why Network Calls Are the Bottleneck

The Microservices Latency Problem
User loads a product page. The API gateway calls:

  1. User Service      → 2ms (same DC)
  2. Product Service   → 3ms (same DC)
  3. Inventory Service → 2ms (same DC)
  4. Pricing Service   → 2ms (same DC)
  5. Review Service    → 4ms (same DC)

Sequential calls: 2 + 3 + 2 + 2 + 4 = 13ms in network alone
Plus processing time in each service: ~5ms each = 25ms
Total: ~38ms

Now add a database query in each service (2ms each): +10ms
Total: ~48ms for ONE page load

With 3 levels of microservice depth (service calls service calls service):
  Network hops multiply. 10 hops × 2ms = 20ms just in network.

This is why:
  • Parallel calls (fan-out) reduce sequential latency
  • Caching eliminates repeated calls
  • Batching combines multiple calls into one
  • Data locality (keep data close) reduces hops

Reducing Network Latency

Caching

Cache responses from other services. If the product data hasn't changed in 5 minutes, serve it from Redis instead of making a network call to the Product Service.

📦

Batching

Instead of 50 individual requests to the User Service (one per user ID), send one batch request with all 50 IDs. One round trip instead of fifty.
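Under assumed numbers (a 2 ms round trip, ~0.05 ms of server-side work per user ID), the payoff of batching is easy to quantify:

```python
rtt_ms = 2.0          # same-datacenter round trip (assumed)
work_ms = 0.05        # server-side work per user ID (assumed)
n_users = 50

individual = n_users * (rtt_ms + work_ms)    # 50 separate requests
batched = rtt_ms + n_users * work_ms         # one request carrying 50 IDs

print(f"individual: {individual:.1f} ms   batched: {batched:.1f} ms")
```

One round trip instead of fifty: ~4.5 ms instead of ~102.5 ms, because the fixed network cost is paid once rather than per item.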

🔀

Parallel Calls

If calls to User Service and Product Service are independent, make them simultaneously. Total latency = max(2ms, 3ms) = 3ms instead of 2 + 3 = 5ms.

📍

Data Locality

Keep data close to where it's needed. CDN for static assets. Read replicas in each region. Edge computing for latency-sensitive logic.
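A minimal asyncio sketch of the parallel fan-out pattern, using sleeps as stand-ins for real service calls:

```python
import asyncio
import time

async def call_service(name, latency_s):
    await asyncio.sleep(latency_s)   # stand-in for a network round trip
    return name

async def sequential():
    await call_service("user", 0.05)
    await call_service("product", 0.10)

async def parallel():
    # Independent calls: total latency = max(latencies), not sum(latencies)
    await asyncio.gather(
        call_service("user", 0.05),
        call_service("product", 0.10),
    )

elapsed = {}
for coro_fn in (sequential, parallel):
    start = time.perf_counter()
    asyncio.run(coro_fn())
    elapsed[coro_fn.__name__] = time.perf_counter() - start
    print(f"{coro_fn.__name__}: {elapsed[coro_fn.__name__]:.2f}s")
```

The sequential version takes roughly the sum of the two latencies; the parallel version takes roughly the slower of the two.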

🎯 Interview Insight

In every system design interview, count the network round trips. If your design requires 10 sequential service calls to serve one request, that's a red flag. Interviewers want to see you reduce round trips through caching, batching, parallel calls, and denormalization.

06

End-to-End Example

Let's trace what happens when a user loads a product page, and where latency is introduced at every layer.

Loading a Product Page — Latency Breakdown
User clicks: https://shop.example.com/products/42

1. DNS Resolution                                    ~1 ms
   Browser cache → OS cache → recursive resolver
   (cached after first visit: 0 ms)

2. TCP + TLS Handshake                               ~30 ms
   3 round trips to establish secure connection
   (reused on subsequent requests: 0 ms)

3. HTTP Request travels to server                    ~20 ms
   User in NYC → server in us-east-1 (Virginia)

4. Load Balancer → App Server                        ~0.5 ms
   Same datacenter hop

5. App Server checks Redis cache                     ~0.2 ms
   GET product:42 → cache HIT? Return cached JSON
   (If HIT: skip steps 6-8, total so far: ~52 ms)

6. Cache MISS → Query PostgreSQL                     ~2 ms
   SELECT * FROM products WHERE id = 42
   Data is in buffer pool (RAM) → fast

7. Query Reviews Service                             ~3 ms
   GET /api/reviews?product_id=42
   Same-datacenter network call + DB query

8. Serialize response to JSON                        ~0.1 ms
   CPU-bound, data is in L1/L2 cache

9. Response travels back to user                     ~20 ms
   Server → user's browser

10. Browser renders the page                         ~50 ms
    Parse HTML, fetch CSS/JS (cached), render DOM

TOTAL (cache miss):   ~127 ms
TOTAL (cache hit):    ~122 ms (steps 6-8 eliminated)
TOTAL (repeat visit): ~91 ms  (DNS + TLS cached, Redis hit)

Where Optimization Happens

1

CDN eliminates network latency

Static assets (CSS, JS, images) served from a CDN edge in NYC instead of Virginia. Saves ~40ms round trip. Dynamic content can also be cached at the edge for public pages.

2

Redis eliminates disk I/O

Product data cached in Redis (RAM). Reads take 0.2ms instead of 2ms from PostgreSQL. At 10K requests/second, this saves 18 seconds of cumulative latency per second.

3

Connection pooling eliminates handshakes

Keep persistent connections to the database and between services. Eliminates TCP/TLS handshake overhead on every request. A connection pool of 20 connections serves thousands of queries.

4

Parallel calls reduce sequential latency

Fetch product data and reviews simultaneously instead of sequentially. Total: max(2ms, 3ms) = 3ms instead of 2 + 3 = 5ms.

💡 The 80/20 Rule of Latency

Network round trips and disk I/O account for 80%+ of latency in most systems. CPU time (parsing, serialization, business logic) is usually negligible. When optimizing, always start with: "Can I eliminate a network call?" and "Can I serve this from RAM instead of disk?"

07

Trade-offs & Design Decisions

Every latency optimization involves a trade-off. Faster access means higher cost, less capacity, or weaker consistency.

Trade-off                       Faster Option                  Slower Option                   When to Choose Faster
Memory vs Cost                  RAM ($5/GB)                    SSD ($0.20/GB)                  Hot data accessed thousands of times/second
Speed vs Capacity               Redis (64 GB)                  PostgreSQL (1 TB)               Working set fits in RAM, read-heavy workload
Caching vs Consistency          Serve stale data (0.2ms)       Query DB for fresh data (2ms)   Data can be 5-60 seconds stale (product pages, feeds)
Denormalization vs Simplicity   One read, duplicated data      JOIN across tables              Read-heavy, latency-sensitive paths
Precomputation vs Freshness     Pre-built aggregates (0.1ms)   Compute on request (50ms)       Dashboards, analytics, leaderboards

Why Systems Prefer These Patterns

Caching

Move data from disk (ms) to RAM (ns). Trade: stale data for 100x speed. Used everywhere — browser cache, CDN, Redis, database buffer pool, CPU cache.

📋

Denormalization

Store data redundantly to avoid JOINs. Trade: storage space and update complexity for single-read performance. Used in NoSQL, read-heavy SQL tables.

🔮

Precomputation

Calculate results ahead of time instead of on every request. Trade: freshness for instant reads. Used for dashboards, search indexes, materialized views.

🎯 Interview Framework

When making any design decision, state the latency trade-off: "I'd cache this in Redis because the data is read 1000x more than it's written. The trade-off is serving stale data for up to 60 seconds, which is acceptable for product listings but not for inventory counts."

08

Interview Questions

These questions test whether you've internalized the latency hierarchy and can apply it to design decisions.

Q:Why is network latency more expensive than disk latency?

A: A same-datacenter network round trip (~0.5ms) actually sits between an SSD random read (~0.016ms) and an HDD random read (~2ms) — a spinning disk can be slower than a local network hop. But the real cost of network calls is: (1) They're sequential by default — each call blocks until the response arrives. (2) They compound — a microservice calling 5 other services adds 5 round trips. (3) They're unreliable — timeouts, retries, and failures add tail latency. (4) Cross-region calls are 50-150ms — orders of magnitude slower than any local operation. The key insight: a single network call isn't expensive, but systems make thousands of them per request.

Q:Why do we cache data instead of just reading from the database?

A: A database query involves: network round trip to the DB server (~0.5ms), query parsing and planning (~0.1ms), disk I/O if data isn't in buffer pool (~2ms for SSD), and response serialization (~0.1ms). Total: 0.7-2.7ms. A Redis cache read involves: network round trip to Redis (~0.2ms) and memory lookup (~0.001ms). Total: ~0.2ms. That's 3-13x faster. At 10,000 requests/second, caching saves 5-25 seconds of cumulative latency per second. Plus, it reduces database load, allowing the DB to handle writes and complex queries.

Q:What is the cost of a database query?

A: It depends on whether the data is in RAM or on disk. Best case (buffer pool hit): ~0.5ms — network to DB + RAM read + response. Typical case (index scan, some disk): ~2-5ms — network + index lookup + 1-2 disk reads + response. Worst case (full table scan, cold cache): ~100ms+ — scanning millions of rows from disk. The lesson: design your queries to hit indexes (avoid full scans), size your buffer pool to fit the working set (avoid disk), and cache hot results in Redis (avoid the query entirely).

1

Your API has a p99 latency of 2 seconds

How would you diagnose and fix this?

Answer: Start by tracing where time is spent: (1) Network — are there sequential service calls that could be parallelized? (2) Database — are queries hitting disk instead of buffer pool? Check slow query logs. (3) Missing cache — is the same data being fetched repeatedly? Add Redis. (4) N+1 queries — is the ORM making 100 queries instead of 1? Use eager loading or batching. (5) External calls — is a third-party API slow? Add a timeout and cache. The p99 (99th percentile) is usually caused by occasional disk I/O, garbage collection pauses, or network timeouts — not the average case.

2

You need to serve 100,000 reads per second with < 5ms latency

How would you architect this?

Answer: At 100K reads/sec with < 5ms, you can't hit disk on every request. Architecture: (1) Redis cluster as the primary read path — sub-millisecond reads from RAM. (2) PostgreSQL as the source of truth — writes go here. (3) Cache-aside pattern — read from Redis, on miss read from DB and populate cache. (4) TTL of 30-60 seconds — acceptable staleness for most read-heavy workloads. (5) Multiple Redis replicas — distribute read load. The math: 100K × 0.2ms (Redis) = 20 seconds of cumulative latency/sec. 100K × 2ms (DB) = 200 seconds — impossible without caching.

09

Common Mistakes

These mistakes come from not internalizing the latency hierarchy. Each one has caused real production incidents.

🔢

Assuming all operations take similar time

A developer treats a Redis read (0.2ms), a database query (2ms), and a cross-service network call (5ms) as roughly equivalent. They make 20 sequential service calls in a request handler and wonder why latency is 100ms+. The 25x difference between Redis and a network call, multiplied by 20 calls, is the entire problem.

Know the orders of magnitude. RAM is 100ns, SSD is 16μs (160x slower), network is 500μs (5,000x slower). Count your network hops and disk reads. If a request makes more than 3-5 sequential network calls, redesign with caching, batching, or parallel calls.

🌐

Overusing network calls in hot paths

A microservice architecture where every request fans out to 8 services sequentially. Each call is 'only' 3ms, but 8 × 3ms = 24ms just in network latency — before any processing. Add database queries in each service and you're at 50ms+.

Identify the hot path (the most common request flow) and minimize network hops. Cache aggressively. Make independent calls in parallel. Consider combining frequently-co-called services. Use async processing for non-critical work.

💾

Not caching frequently accessed data

A product page queries the database on every request — even though the product data changes once a day. At 10K requests/second, that's 10K unnecessary database queries per second, each taking 2ms. The database is overloaded, latency spikes, and the team adds more database replicas instead of a cache.

If data is read 100x more than it's written, cache it. Redis with a 60-second TTL eliminates 99% of database reads. The rule of thumb: if you're querying the same data more than once per second, it should be cached.

📊

Ignoring the compounding effect

'Each query is only 2ms, that's fine.' But a page load triggers 50 queries (ORM eager loading, N+1 problems, multiple tables). 50 × 2ms = 100ms just in database time. Add serialization, network, and rendering — the page takes 300ms. Users perceive it as slow.

Profile the full request path, not individual operations. Use distributed tracing (Jaeger, Datadog) to see where time is spent. Optimize the total, not the parts. Often, reducing 50 queries to 5 (via JOINs, batching, or caching) has more impact than making each query 10% faster.