Latency Reference
Understand the latency hierarchy — from L1 cache to network round trips. Build the intuition for why caching, batching, and data locality drive every system design decision.
The Big Picture — What Is Latency?
Latency is the time it takes for an operation to complete — the delay between asking for something and getting it. In backend systems, latency determines how fast your API responds, how quickly your database returns results, and ultimately how snappy your product feels to users.
The critical insight is that not all operations are created equal. Reading from CPU cache is tens of thousands of times faster than reading from an SSD, and millions of times faster than reading from a spinning disk. A network round trip to another continent is 1,000,000x slower than reading from RAM. These aren't small differences — they're orders of magnitude that fundamentally shape how systems are designed.
The Distance Analogy
Imagine you need to fetch a piece of information. L1 cache is reaching into your pocket — instant, you already have it. L2 cache is grabbing something from your desk — a quick reach. RAM is walking to the bookshelf across the room — a few seconds. SSD is driving to a nearby warehouse — minutes. HDD is driving to a warehouse across town, but the warehouse uses a mechanical crane to find your item — much slower. Network round trip is flying to another city, finding the item, and flying back. Same-region network is a short domestic flight. Cross-continent is an international flight. Every system design decision is about keeping data as close to 'your pocket' as possible.
🔥 Why This Matters
Every system design interview involves latency trade-offs. "Why use a cache?" Because RAM is roughly 160x faster than an SSD for random reads. "Why use a CDN?" Because a nearby edge server answers in a few milliseconds instead of the 50-150 ms a cross-continent round trip costs. "Why denormalize?" To avoid an extra disk read. If you internalize these numbers, every design decision becomes intuitive.
The Latency Hierarchy
Here are the numbers every engineer should know. These are approximate — actual values vary by hardware — but the orders of magnitude are what matter.
| Operation | Time | Relative to L1 |
|---|---|---|
| L1 cache reference | 0.5 ns | 1x (baseline) |
| L2 cache reference | 7 ns | 14x |
| RAM reference | 100 ns | 200x |
| SSD random read | 16 μs | 32,000x |
| HDD random read | 2 ms | 4,000,000x |
| Same-datacenter round trip | 0.5 ms | 1,000,000x |
| Send 1 MB over 1 Gbps network | 10 ms | 20,000,000x |
| HDD sequential read (1 MB) | 2 ms | 4,000,000x |
| SSD sequential read (1 MB) | 0.2 ms | 400,000x |
| Same-region network round trip | 1-5 ms | ~5,000,000x |
| Cross-continent round trip | 50-150 ms | ~200,000,000x |

Key insight: each level is 10-1000x slower than the previous. This is why caching exists at every layer of the stack.
Visualizing the Scale
If an L1 cache reference took 1 second (instead of 0.5 nanoseconds), here's how long everything else would take at the same scale:
- L1 cache reference → 1 second
- L2 cache reference → 14 seconds
- RAM reference → 3.3 minutes
- SSD random read → 8.9 hours
- HDD random read → 46 days
- Same-datacenter round trip → 11.6 days
- Cross-continent round trip → 9.5 years

This is why:

- Databases cache hot data in RAM (avoid 46-day disk reads)
- CDNs exist (avoid 9.5-year cross-continent trips)
- Redis sits in front of everything (3-minute RAM vs 46-day disk)
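The scaled figures come from multiplying each raw latency by the same factor (1 second per 0.5 ns). A minimal Go sketch of that arithmetic, using the values from the table above:

```go
package main

import "fmt"

func main() {
	// Raw latencies in nanoseconds, taken from the table above.
	latencies := []struct {
		name string
		ns   float64
	}{
		{"L1 cache reference", 0.5},
		{"L2 cache reference", 7},
		{"RAM reference", 100},
		{"SSD random read", 16_000},
		{"HDD random read", 2_000_000},
		{"Same-datacenter round trip", 500_000},
		{"Cross-continent round trip", 150_000_000},
	}

	// Scale so that 0.5 ns becomes exactly 1 second.
	const scale = 1.0 / 0.5 // seconds per nanosecond at this scale

	for _, l := range latencies {
		fmt.Printf("%-30s %s\n", l.name, humanize(l.ns*scale))
	}
}

// humanize converts a number of seconds into the largest sensible unit.
func humanize(s float64) string {
	switch {
	case s < 60:
		return fmt.Sprintf("%.1f seconds", s)
	case s < 3600:
		return fmt.Sprintf("%.1f minutes", s/60)
	case s < 86400:
		return fmt.Sprintf("%.1f hours", s/3600)
	case s < 86400*365:
		return fmt.Sprintf("%.1f days", s/86400)
	default:
		return fmt.Sprintf("%.1f years", s/(86400*365))
	}
}
```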
⚡ Fast (nanoseconds)
- L1 cache: 0.5 ns
- L2 cache: 7 ns
- RAM: 100 ns
- These are in-process, no I/O involved
🐢 Slow (micro/milliseconds)
- SSD random read: 16 μs
- HDD random read: 2 ms
- Network (same DC): 0.5 ms
- Network (cross-continent): 50-150 ms
💡 The Compounding Effect
A single 2ms disk read seems harmless. But a page load that triggers 50 database queries, each doing 2 disk reads = 200ms just in disk I/O. Add network latency, serialization, and application logic, and you're at 500ms+. This is why small delays compound catastrophically at scale.
CPU Cache (L1 / L2)
CPU cache is a tiny, ultra-fast memory built directly into the processor. It exists because RAM, despite being fast, is still too slow for the CPU. A modern CPU can execute billions of operations per second, but waiting 100ns for RAM on every operation would waste most of that speed.
Your Pocket vs Your Desk
L1 cache is your pocket — the things you use constantly (phone, keys, wallet) are right there. Instant access. L2 cache is your desk — slightly further, but still within arm's reach. RAM is the bookshelf across the room — you have to get up and walk. The CPU keeps the most frequently accessed data in L1/L2 so it doesn't have to 'walk to the bookshelf' on every operation.
L1 vs L2 — The Basics
| Feature | L1 Cache | L2 Cache |
|---|---|---|
| Latency | ~0.5 ns (1-2 CPU cycles) | ~7 ns (10-20 CPU cycles) |
| Size | 32-64 KB per core | 256 KB - 1 MB per core |
| Location | Inside the CPU core | Adjacent to the CPU core |
| Purpose | Most frequently accessed data | Overflow from L1, still very hot data |
| Analogy | Your pocket | Your desk |
Why This Matters for Code
Cache-friendly (sequential access):

```c
// Elements are adjacent in memory, so the CPU prefetches the next
// elements into cache: almost every access is a cache HIT.
for (int i = 0; i < N; i++)
    sum += array[i];
```

Cache-unfriendly (random access):

```c
// Elements are visited in a scattered order, so the CPU can't predict
// what's needed next: almost every access is a cache MISS that falls through to RAM.
for (int i = 0; i < N; i++)
    sum += array[rand() % N];
```

Performance difference: 5-10x for large arrays. This is why arrays are faster than linked lists for iteration — array elements are contiguous in memory (cache-friendly).
Cache-Friendly Patterns
- ✅ Sequential array access (iterate in order)
- ✅ Small, contiguous data structures
- ✅ Struct of arrays (SoA) over array of structs (AoS); see the sketch after this list
- ✅ Keeping hot data small (fits in cache)
- ✅ Batch processing (process all data, then move on)
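To make the SoA-over-AoS item concrete, here is a minimal Go sketch; the Particle type and its fields are invented for illustration. Summing one field of an array of structs drags every unused field through the cache, while the struct-of-arrays layout keeps the values actually being read contiguous.

```go
package main

import "fmt"

// Array of structs (AoS): the four fields of each particle sit together,
// so summing only X still pulls Y, Z and Mass into cache alongside it.
type ParticleAoS struct {
	X, Y, Z, Mass float64
}

// Struct of arrays (SoA): all X values are contiguous, so a pass over X
// touches roughly a quarter of the cache lines the AoS layout would.
type ParticlesSoA struct {
	X, Y, Z, Mass []float64
}

func sumXAoS(ps []ParticleAoS) float64 {
	var sum float64
	for i := range ps {
		sum += ps[i].X
	}
	return sum
}

func sumXSoA(ps *ParticlesSoA) float64 {
	var sum float64
	for _, x := range ps.X {
		sum += x
	}
	return sum
}

func main() {
	const n = 1_000_000
	aos := make([]ParticleAoS, n)
	soa := &ParticlesSoA{
		X: make([]float64, n), Y: make([]float64, n),
		Z: make([]float64, n), Mass: make([]float64, n),
	}
	for i := 0; i < n; i++ {
		aos[i].X, soa.X[i] = 1, 1
	}
	fmt.Println(sumXAoS(aos), sumXSoA(soa)) // same result, different cache behavior
}
```

The difference only shows up once the working set is much larger than L1/L2, which is exactly when cache behavior starts to dominate.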
Cache-Unfriendly Patterns
- ❌ Random access across large datasets
- ❌ Pointer-heavy structures (linked lists, trees)
- ❌ Large objects with rarely-used fields
- ❌ Frequent context switches between data
- ❌ Hash tables with poor locality
🎯 Interview Insight
You won't be asked to optimize L1 cache in a system design interview. But understanding that in-memory operations are orders of magnitude faster than disk or network explains why every system uses caching. When you say "I'd add a Redis cache here," you're moving data from the "bookshelf" (disk) to the "desk" (RAM).
RAM vs SSD vs HDD
These three storage types form the backbone of every server. Understanding their latency differences explains why databases are designed the way they are — buffer pools, write-ahead logs, page caches, and the entire caching industry.
Desk → Cupboard → Warehouse
RAM is your desk — everything you're actively working on is right there. Fast to grab, but limited space (and expensive). SSD is the cupboard in the next room — you have to walk there, but it's organized and reasonably quick. Much more space than your desk. HDD is a warehouse across town — massive storage, very cheap, but getting something requires driving there and waiting for a forklift to find your item on a spinning shelf. Every database tries to keep 'hot' data on the desk (RAM) and only goes to the cupboard (SSD) or warehouse (HDD) when necessary.
| Feature | RAM | SSD | HDD |
|---|---|---|---|
| Random Read Latency | ~100 ns | ~16 μs (160x slower) | ~2 ms (20,000x slower) |
| Sequential Read (1 MB) | ~3 μs | ~0.2 ms | ~2 ms |
| Cost per GB (approx) | $3-8 | $0.10-0.30 | $0.02-0.05 |
| Capacity (typical server) | 64-512 GB | 1-8 TB | 4-16 TB |
| Durability | Volatile (lost on power off) | Persistent | Persistent |
| Moving Parts | None | None (flash chips) | Yes (spinning platters + arm) |
| Analogy | Your desk | Nearby cupboard | Warehouse across town |
Why HDD Is So Slow
An HDD has a spinning magnetic platter and a mechanical arm that moves to the right position to read data. This physical movement (seek time) takes 2-10ms. An SSD has no moving parts — it reads from flash memory chips electronically. RAM has no I/O at all — it's directly wired to the CPU bus.
```
PostgreSQL query: SELECT * FROM users WHERE id = 42

Without caching (cold start):
  1. Parse query                      → ~0.1 ms
  2. Plan execution                   → ~0.1 ms
  3. Read index from disk (SSD)       → ~0.016 ms
  4. Read data page from disk (SSD)   → ~0.016 ms
  5. Return result                    → ~0.1 ms
  Total: ~0.3 ms

With buffer pool (hot data in RAM):
  1. Parse query                      → ~0.1 ms
  2. Plan execution                   → ~0.1 ms
  3. Read index from RAM              → ~0.0001 ms
  4. Read data page from RAM          → ~0.0001 ms
  5. Return result                    → ~0.1 ms
  Total: ~0.3 ms → but steps 3-4 are 160x faster

At 10,000 queries/second, this difference is:
  Disk: 10,000 × 0.032 ms  = 320 ms of disk I/O per second
  RAM:  10,000 × 0.0002 ms = 2 ms of RAM reads per second
```

This is why PostgreSQL's shared_buffers (RAM cache) is critical. This is why Redis exists — keep hot data in RAM, skip disk entirely.
Database Buffer Pool
PostgreSQL, MySQL, and every major database keeps frequently accessed pages in RAM. The buffer pool is the #1 performance lever — more RAM = fewer disk reads = faster queries.
OS Page Cache
The operating system caches recently read disk pages in unused RAM. Even without database-level caching, the OS tries to keep hot data in memory.
Application Cache (Redis)
For data that's read far more than written, Redis keeps it in RAM with sub-millisecond access. Eliminates database queries entirely for cached data.
🎯 Interview Insight
When an interviewer asks "why is this query slow?" — the first question is: "Is the data in RAM or on disk?" If the working set fits in RAM (buffer pool), queries are fast. If it doesn't, every query triggers disk I/O — 160x slower on SSD, 20,000x slower on HDD. The fix is usually: add more RAM, add a cache layer, or reduce the working set size.
Network Round Trips
A network round trip is the time for a request to travel from the client to the server and for the response to travel back. It's the single largest source of latency in most distributed systems — and the hardest to eliminate because it's bounded by the speed of light.
Sending a Letter and Waiting for a Reply
A network round trip is like mailing a letter and waiting for a reply. Same-datacenter is like sending a letter across the office — it arrives in minutes. Same-region is like mailing across the city — a few hours. Cross-continent is like international mail — days. You can't make the mail truck faster (speed of light), but you can: send fewer letters (batching), keep a copy of common replies (caching), or move closer to the recipient (CDN/edge).
| Route | Latency | Analogy | Example |
|---|---|---|---|
| Same machine (localhost) | ~0.01 ms | Talking to yourself | App → local Redis |
| Same datacenter | ~0.5 ms | Across the office | App server → database |
| Same region (e.g., us-east) | ~1-5 ms | Across the city | Service A → Service B |
| Cross-continent | ~50-150 ms | International mail | US user → EU server |
| Satellite / remote | ~500-600 ms | Space mail | Ground → satellite → ground |
Why Network Calls Are the Bottleneck
User loads a product page. The API gateway calls:

```
1. User Service      → 2 ms (same DC)
2. Product Service   → 3 ms (same DC)
3. Inventory Service → 2 ms (same DC)
4. Pricing Service   → 2 ms (same DC)
5. Review Service    → 4 ms (same DC)

Sequential calls: 2 + 3 + 2 + 2 + 4 = 13 ms in network alone
Plus processing time in each service: ~5 ms each = 25 ms
Total: ~38 ms

Now add a database query in each service (2 ms each): +10 ms
Total: ~48 ms for ONE page load

With 3 levels of microservice depth (service calls service calls service),
network hops multiply: 10 hops × 2 ms = 20 ms just in network.
```

This is why:

- Parallel calls (fan-out) reduce sequential latency
- Caching eliminates repeated calls
- Batching combines multiple calls into one
- Data locality (keep data close) reduces hops
Reducing Network Latency
Caching
Cache responses from other services. If the product data hasn't changed in 5 minutes, serve it from Redis instead of making a network call to the Product Service.
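A minimal cache-aside sketch in Go, under two stated assumptions: an in-process map stands in for Redis, and fetchProductFromService is a made-up stand-in for the real network call to the Product Service.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry is a cached value plus the time it expires.
type entry struct {
	value   string
	expires time.Time
}

// An in-process stand-in for Redis: a map guarded by a mutex.
var (
	mu    sync.Mutex
	cache = map[string]entry{}
)

// fetchProductFromService is a hypothetical call to the Product Service.
func fetchProductFromService(id int) string {
	time.Sleep(3 * time.Millisecond) // simulate a same-DC round trip + DB query
	return fmt.Sprintf(`{"id": %d, "name": "example product"}`, id)
}

// getProduct implements cache-aside: check the cache, fall back to the
// service on a miss, and populate the cache with a 60-second TTL.
func getProduct(id int) string {
	key := fmt.Sprintf("product:%d", id)

	mu.Lock()
	e, ok := cache[key]
	mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.value // cache HIT: no network call
	}

	value := fetchProductFromService(id) // cache MISS: one slow call
	mu.Lock()
	cache[key] = entry{value: value, expires: time.Now().Add(60 * time.Second)}
	mu.Unlock()
	return value
}

func main() {
	start := time.Now()
	getProduct(42) // miss: pays the network latency
	fmt.Println("first read: ", time.Since(start))

	start = time.Now()
	getProduct(42) // hit: served from memory
	fmt.Println("second read:", time.Since(start))
}
```

In production the map would be a shared Redis instance, but the flow is the same: check the cache, fall back to the service on a miss, and write the result back with a TTL.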
Batching
Instead of 50 individual requests to the User Service (one per user ID), send one batch request with all 50 IDs. One round trip instead of fifty.
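A sketch of the difference, with a hypothetical User Service client whose round trip is simulated by a 2 ms sleep: fifty individual calls pay fifty round trips, one batch call pays one.

```go
package main

import (
	"fmt"
	"time"
)

const roundTrip = 2 * time.Millisecond // simulated same-DC round trip

// getUser fetches one user: one network round trip per call (hypothetical API).
func getUser(id int) string {
	time.Sleep(roundTrip)
	return fmt.Sprintf("user-%d", id)
}

// getUsersBatch fetches many users in a single request: one round trip total.
func getUsersBatch(ids []int) []string {
	time.Sleep(roundTrip)
	users := make([]string, len(ids))
	for i, id := range ids {
		users[i] = fmt.Sprintf("user-%d", id)
	}
	return users
}

func main() {
	ids := make([]int, 50)
	for i := range ids {
		ids[i] = i
	}

	start := time.Now()
	for _, id := range ids {
		getUser(id) // 50 sequential round trips ≈ 100 ms
	}
	fmt.Println("one call per ID:", time.Since(start))

	start = time.Now()
	getUsersBatch(ids) // 1 round trip ≈ 2 ms
	fmt.Println("single batch:   ", time.Since(start))
}
```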
Parallel Calls
If calls to User Service and Product Service are independent, make them simultaneously. Total latency = max(2ms, 3ms) = 3ms instead of 2 + 3 = 5ms.
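A minimal Go sketch of that fan-out, with the two calls simulated by sleeps: a WaitGroup waits for both goroutines, so the elapsed time tracks the slower call rather than the sum.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical service calls, simulated with sleeps.
func fetchUser() string    { time.Sleep(2 * time.Millisecond); return "user" }
func fetchProduct() string { time.Sleep(3 * time.Millisecond); return "product" }

func main() {
	var (
		wg            sync.WaitGroup
		user, product string
	)

	start := time.Now()
	wg.Add(2)
	go func() { defer wg.Done(); user = fetchUser() }()
	go func() { defer wg.Done(); product = fetchProduct() }()
	wg.Wait() // total ≈ max(2 ms, 3 ms), not 2 + 3

	fmt.Println(user, product, "in", time.Since(start))
}
```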
Data Locality
Keep data close to where it's needed. CDN for static assets. Read replicas in each region. Edge computing for latency-sensitive logic.
🎯 Interview Insight
In every system design interview, count the network round trips. If your design requires 10 sequential service calls to serve one request, that's a red flag. Interviewers want to see you reduce round trips through caching, batching, parallel calls, and denormalization.
End-to-End Example
Let's trace what happens when a user loads a product page, and where latency is introduced at every layer.
User clicks: https://shop.example.com/products/42

```
 1. DNS Resolution                     ~1 ms
    Browser cache → OS cache → recursive resolver
    (cached after first visit: 0 ms)
 2. TCP + TLS Handshake                ~30 ms
    Multiple round trips to establish a secure connection
    (reused on subsequent requests: 0 ms)
 3. HTTP request travels to server     ~20 ms
    User in NYC → server in us-east-1 (Virginia)
 4. Load balancer → app server         ~0.5 ms
    Same-datacenter hop
 5. App server checks Redis cache      ~0.2 ms
    GET product:42 → cache HIT? Return cached JSON
    (if HIT: skip steps 6-8; running total so far: ~52 ms)
 6. Cache MISS → query PostgreSQL      ~2 ms
    SELECT * FROM products WHERE id = 42
    Data is in the buffer pool (RAM) → fast
 7. Query Reviews Service              ~3 ms
    GET /api/reviews?product_id=42
    Same-datacenter network call + DB query
 8. Serialize response to JSON         ~0.1 ms
    CPU-bound, data is in L1/L2 cache
 9. Response travels back to user      ~20 ms
    Server → user's browser
10. Browser renders the page           ~50 ms
    Parse HTML, fetch CSS/JS (cached), render DOM

TOTAL (cache miss):   ~127 ms
TOTAL (cache hit):    ~122 ms (steps 6-8 eliminated)
TOTAL (repeat visit): ~91 ms (DNS + TLS reused, Redis hit)
```
Where Optimization Happens
CDN eliminates network latency
Static assets (CSS, JS, images) served from a CDN edge in NYC instead of Virginia. Saves ~40ms round trip. Dynamic content can also be cached at the edge for public pages.
Redis eliminates disk I/O
Product data cached in Redis (RAM). Reads take 0.2ms instead of 2ms from PostgreSQL. At 10K requests/second, this saves 18 seconds of cumulative latency per second.
Connection pooling eliminates handshakes
Keep persistent connections to the database and between services. Eliminates TCP/TLS handshake overhead on every request. A connection pool of 20 connections serves thousands of queries.
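A sketch of pool configuration using Go's database/sql, assuming a Postgres driver such as github.com/lib/pq; the DSN and table name are placeholders.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // one possible Postgres driver; any registered driver works
)

func main() {
	// Placeholder DSN; replace with your real connection string.
	db, err := sql.Open("postgres", "postgres://app:secret@db.internal:5432/shop?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// A small pool of long-lived connections serves thousands of queries
	// without re-doing the TCP/TLS handshake on every request.
	db.SetMaxOpenConns(20)                  // upper bound on concurrent connections
	db.SetMaxIdleConns(20)                  // keep them warm instead of closing them
	db.SetConnMaxLifetime(30 * time.Minute) // recycle occasionally to avoid stale connections

	// Every query borrows a pooled connection and returns it when done.
	var name string
	if err := db.QueryRow("SELECT name FROM products WHERE id = $1", 42).Scan(&name); err != nil {
		log.Fatal(err)
	}
	log.Println("product:", name)
}
```

Pool size is a tuning knob: too small and requests queue waiting for a connection, too large and the database spends its time juggling sessions.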
Parallel calls reduce sequential latency
Fetch product data and reviews simultaneously instead of sequentially. Total: max(2ms, 3ms) = 3ms instead of 2 + 3 = 5ms.
💡 The 80/20 Rule of Latency
Network round trips and disk I/O account for 80%+ of latency in most systems. CPU time (parsing, serialization, business logic) is usually negligible. When optimizing, always start with: "Can I eliminate a network call?" and "Can I serve this from RAM instead of disk?"
Trade-offs & Design Decisions
Every latency optimization involves a trade-off. Faster access means higher cost, less capacity, or weaker consistency.
| Trade-off | Faster Option | Slower Option | When to Choose Faster |
|---|---|---|---|
| Memory vs Cost | RAM ($5/GB) | SSD ($0.20/GB) | Hot data accessed thousands of times/second |
| Speed vs Capacity | Redis (64 GB) | PostgreSQL (1 TB) | Working set fits in RAM, read-heavy workload |
| Caching vs Consistency | Serve stale data (0.2ms) | Query DB for fresh data (2ms) | Data can be 5-60 seconds stale (product pages, feeds) |
| Denormalization vs Simplicity | One read, duplicated data | JOIN across tables | Read-heavy, latency-sensitive paths |
| Precomputation vs Freshness | Pre-built aggregates (0.1ms) | Compute on request (50ms) | Dashboards, analytics, leaderboards |
Why Systems Prefer These Patterns
Caching
Move data from disk (ms) to RAM (ns). Trade: stale data for 100x speed. Used everywhere — browser cache, CDN, Redis, database buffer pool, CPU cache.
Denormalization
Store data redundantly to avoid JOINs. Trade: storage space and update complexity for single-read performance. Used in NoSQL, read-heavy SQL tables.
Precomputation
Calculate results ahead of time instead of on every request. Trade: freshness for instant reads. Used for dashboards, search indexes, materialized views.
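A minimal sketch of the idea, using an invented revenue example: the write path updates a running aggregate, so the read path never has to scan the raw records.

```go
package main

import "fmt"

// Order is a single purchase; the dashboard wants total revenue.
type Order struct{ Amount float64 }

var (
	orders       []Order // source of truth (in a real system: the database)
	totalRevenue float64 // precomputed aggregate, updated on every write
)

// recordOrder is the write path: store the order AND update the aggregate.
func recordOrder(o Order) {
	orders = append(orders, o)
	totalRevenue += o.Amount // a little extra work per write
}

// revenueOnRequest is the slow read path: recompute the aggregate every time.
// With millions of orders on disk, this is the "compute on request" case.
func revenueOnRequest() float64 {
	var sum float64
	for _, o := range orders {
		sum += o.Amount
	}
	return sum
}

// revenuePrecomputed is the fast read path: the answer is already there.
func revenuePrecomputed() float64 {
	return totalRevenue
}

func main() {
	for i := 0; i < 1_000_000; i++ {
		recordOrder(Order{Amount: 9.99})
	}
	fmt.Println(revenueOnRequest(), revenuePrecomputed()) // same number, very different cost
}
```

In a real system the aggregate usually lives in its own table or a materialized view and is refreshed asynchronously, which is where the freshness trade-off comes from.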
🎯 Interview Framework
When making any design decision, state the latency trade-off: "I'd cache this in Redis because the data is read 1000x more than it's written. The trade-off is serving stale data for up to 60 seconds, which is acceptable for product listings but not for inventory counts."
Interview Questions
These questions test whether you've internalized the latency hierarchy and can apply it to design decisions.
Q: Why is network latency more expensive than disk latency?
A: A same-datacenter network round trip (~0.5ms) sits between an SSD random read (~16μs) and an HDD random read (~2ms), so a spinning disk can actually be slower than a local network call. But the real cost of network calls is: (1) They're sequential by default — each call blocks until the response arrives. (2) They compound — a microservice calling 5 other services adds 5 round trips. (3) They're unreliable — timeouts, retries, and failures add tail latency. (4) Cross-region calls are 50-150ms — orders of magnitude slower than any local operation. The key insight: a single network call isn't expensive, but systems make thousands of them per request.
Q: Why do we cache data instead of just reading from the database?
A: A database query involves: network round trip to the DB server (~0.5ms), query parsing and planning (~0.1ms), disk I/O if data isn't in buffer pool (~2ms for SSD), and response serialization (~0.1ms). Total: 0.7-2.7ms. A Redis cache read involves: network round trip to Redis (~0.2ms) and memory lookup (~0.001ms). Total: ~0.2ms. That's 3-13x faster. At 10,000 requests/second, caching saves 5-25 seconds of cumulative latency per second. Plus, it reduces database load, allowing the DB to handle writes and complex queries.
Q: What is the cost of a database query?
A: It depends on whether the data is in RAM or on disk. Best case (buffer pool hit): ~0.5ms — network to DB + RAM read + response. Typical case (index scan, some disk): ~2-5ms — network + index lookup + 1-2 disk reads + response. Worst case (full table scan, cold cache): ~100ms+ — scanning millions of rows from disk. The lesson: design your queries to hit indexes (avoid full scans), size your buffer pool to fit the working set (avoid disk), and cache hot results in Redis (avoid the query entirely).
Your API has a p99 latency of 2 seconds
How would you diagnose and fix this?
Answer: Start by tracing where time is spent: (1) Network — are there sequential service calls that could be parallelized? (2) Database — are queries hitting disk instead of buffer pool? Check slow query logs. (3) Missing cache — is the same data being fetched repeatedly? Add Redis. (4) N+1 queries — is the ORM making 100 queries instead of 1? Use eager loading or batching. (5) External calls — is a third-party API slow? Add a timeout and cache. The p99 (99th percentile) is usually caused by occasional disk I/O, garbage collection pauses, or network timeouts — not the average case.
You need to serve 100,000 reads per second with < 5ms latency
How would you architect this?
Answer: At 100K reads/sec with < 5ms, you can't hit disk on every request. Architecture: (1) Redis cluster as the primary read path — sub-millisecond reads from RAM. (2) PostgreSQL as the source of truth — writes go here. (3) Cache-aside pattern — read from Redis, on miss read from DB and populate cache. (4) TTL of 30-60 seconds — acceptable staleness for most read-heavy workloads. (5) Multiple Redis replicas — distribute read load. The math: 100K × 0.2ms (Redis) = 20 seconds of cumulative latency/sec. 100K × 2ms (DB) = 200 seconds — impossible without caching.
Common Mistakes
These mistakes come from not internalizing the latency hierarchy. Each one has caused real production incidents.
Assuming all operations take similar time
A developer treats a Redis read (0.2ms), a database query (2ms), and a cross-service network call (5ms) as roughly equivalent. They make 20 sequential service calls in a request handler and wonder why latency is 100ms+. The 25x difference between Redis and a network call, multiplied by 20 calls, is the entire problem.
✅ Know the orders of magnitude. RAM is 100ns, SSD is 16μs (160x slower), network is 500μs (5,000x slower). Count your network hops and disk reads. If a request makes more than 3-5 sequential network calls, redesign with caching, batching, or parallel calls.
Overusing network calls in hot paths
A microservice architecture where every request fans out to 8 services sequentially. Each call is 'only' 3ms, but 8 × 3ms = 24ms just in network latency — before any processing. Add database queries in each service and you're at 50ms+.
✅ Identify the hot path (the most common request flow) and minimize network hops. Cache aggressively. Make independent calls in parallel. Consider combining frequently-co-called services. Use async processing for non-critical work.
Not caching frequently accessed data
A product page queries the database on every request — even though the product data changes once a day. At 10K requests/second, that's 10K unnecessary database queries per second, each taking 2ms. The database is overloaded, latency spikes, and the team adds more database replicas instead of a cache.
✅ If data is read 100x more than it's written, cache it. Redis with a 60-second TTL eliminates 99% of database reads. The rule of thumb: if you're querying the same data more than once per second, it should be cached.
Ignoring the compounding effect
'Each query is only 2ms, that's fine.' But a page load triggers 50 queries (ORM eager loading, N+1 problems, multiple tables). 50 × 2ms = 100ms just in database time. Add serialization, network, and rendering — the page takes 300ms. Users perceive it as slow.
✅ Profile the full request path, not individual operations. Use distributed tracing (Jaeger, Datadog) to see where time is spent. Optimize the total, not the parts. Often, reducing 50 queries to 5 (via JOINs, batching, or caching) has more impact than making each query 10% faster.