Query Optimization
Optimize read performance with denormalization, materialized views, CQRS pattern, and search indexes. Trade write complexity for blazing-fast reads at scale.
The Big Picture — Why Reads Get Slow
Normalized databases are designed for correctness — no redundancy, no anomalies, clean relationships. But correctness has a cost: to display a product page, you JOIN products, categories, reviews, inventory, and images — 5 tables, every single page load. At 50,000 page views per second, those JOINs become the bottleneck.
The Pre-Cooked Meal Analogy
A restaurant can prepare every dish from scratch — chop vegetables, marinate meat, cook rice — for each order. That's a normalized database: correct, no waste, but slow. Or the kitchen can pre-cook popular meals during off-hours and keep them warm. When a customer orders, the waiter grabs a ready plate in 10 seconds instead of waiting 30 minutes. That's denormalization and precomputation: you do extra work during writes so reads are instant.
🔥 Key Insight
Query optimization for reads is about shifting work from read time to write time. Instead of computing results on every read (expensive JOINs, aggregations), you precompute and store the results. Reads become simple lookups. The trade-off: writes are more complex, and data might be briefly stale.
The Two Paths: Write vs Read
```
WRITE PATH (optimize for correctness):

  User creates order → Normalize → Store in tables
    orders(id, user_id, total)
    order_items(order_id, product_id, qty)
  → Clean, no redundancy, easy to update

READ PATH (optimize for speed):

  User views dashboard → Precomputed view → Instant response
    user_dashboard(user_id, order_count, total_spent, last_order_date)
  → Redundant, but no JOINs needed, 1ms lookup

The pattern:

  Write → Normalized tables (source of truth)
        → Trigger/event → Update denormalized views, indexes, caches

  Read  → Denormalized view / materialized view / search index
        → Fast response (no JOINs, no aggregations)
```
Reduce JOINs
Embed related data directly in the row. Instead of joining users + orders + products, store user_name and product_name in the order row. One table scan, no JOINs.
Precompute Results
Calculate aggregates during writes, not reads. When an order is placed, increment user.order_count. The dashboard reads one column instead of COUNT(*) across millions of rows.
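To make this concrete, here is a minimal sketch in Python, assuming the psycopg2 driver and illustrative table names (orders, users.order_count). The order insert and the counter increment share one transaction, so the count can never drift from the source of truth:

```python
import psycopg2

# Illustrative sketch: maintain users.order_count at write time so the
# dashboard reads one column instead of running COUNT(*) over orders.
conn = psycopg2.connect("dbname=shop")

def place_order(user_id: int, total: float) -> None:
    with conn:  # one transaction: both statements commit or neither does
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (user_id, total) VALUES (%s, %s)",
                (user_id, total),
            )
            cur.execute(
                "UPDATE users SET order_count = order_count + 1 WHERE id = %s",
                (user_id,),
            )
```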
Use Specialized Indexes
Full-text search doesn't belong in PostgreSQL. Use Elasticsearch with inverted indexes. Product filtering doesn't need 5 JOINs — use a denormalized search index.
Separate Read/Write Models
CQRS: writes go to a normalized database. An event stream updates a read-optimized store. Reads never touch the write database. Each is optimized for its purpose.
Denormalization for Reads
Denormalization stores redundant data to eliminate JOINs at read time. Instead of looking up the user's name from the users table every time you display an order, you store the name directly in the orders table.
```sql
-- NORMALIZED (correct, slow reads):
-- To display "Alice ordered Laptop":
SELECT u.name, p.name, o.total
FROM orders o
JOIN users u        ON u.id = o.user_id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p     ON p.id = oi.product_id
WHERE o.id = 101;
-- 3 JOINs, touches 4 tables, ~20ms

-- DENORMALIZED (redundant, fast reads):
-- Store user_name and product_name directly in orders
SELECT user_name, product_name, total
FROM orders_denormalized
WHERE id = 101;
-- 0 JOINs, 1 table, ~1ms

-- The trade-off: when Alice changes her name,
-- you must update it in the users table AND in every order row.
-- Write complexity increases, but reads are 20x faster.
```
Real-World Examples
📱 Social Media Feed
Each feed item stores: author_name, author_avatar, post_text, like_count, comment_count — all embedded. No JOINs to render a feed. When a user changes their avatar, a background job updates all their feed items.
🛒 Product Listing
Each product row stores: category_name, avg_rating, review_count, seller_name — all embedded. The listing page reads one table. When a new review is posted, update avg_rating and review_count on the product row.
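Here is what that review write path might look like, as a hedged sketch reusing a psycopg2 connection like the one above; the column names are illustrative. The average is updated incrementally, so no AVG() over millions of review rows ever runs at read time:

```python
def on_review_posted(conn, product_id: int, rating: int) -> None:
    # Incremental mean: new_avg = (old_avg * n + rating) / (n + 1).
    # One UPDATE on the product row instead of re-aggregating all reviews.
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE products
            SET avg_rating   = (avg_rating * review_count + %s) / (review_count + 1.0),
                review_count = review_count + 1
            WHERE id = %s
            """,
            (rating, product_id),
        )
```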
Strengths
- ✅ Dramatically faster reads (no JOINs)
- ✅ Simpler read queries (single table)
- ✅ Predictable read latency (no query plan surprises)
- ✅ Works well with caching (simple key-value shape)
- ✅ Standard practice in read-heavy systems
Costs
- ❌ Data duplication (same data in multiple places)
- ❌ Complex writes (must update all copies on change)
- ❌ Risk of inconsistency (if an update is missed)
- ❌ More storage (redundant columns)
- ❌ Harder to change schema (data spread across tables)
🎯 Interview Insight
Denormalization is the most common read optimization in large-scale systems. In interviews, say: "I'd normalize the write path for correctness, then denormalize the read path for performance. When a user updates their name, a background job propagates the change to all denormalized copies."
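The propagation job from that answer might look like the following sketch (assumes psycopg2; orders_denormalized comes from the earlier example, and the reviews copy is illustrative). The users table is updated first as the source of truth, then the job fans the change out:

```python
import psycopg2

def propagate_name_change(user_id: int, new_name: str) -> None:
    # Background job: fan the new name out to every denormalized copy.
    # If this job fails it must retry; a missed copy means inconsistency.
    conn = psycopg2.connect("dbname=shop")
    with conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE orders_denormalized SET user_name = %s WHERE user_id = %s",
            (new_name, user_id),
        )
        cur.execute(
            "UPDATE reviews SET user_name = %s WHERE user_id = %s",
            (new_name, user_id),
        )
```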
Materialized Views
A materialized view is a precomputed query result stored as a table. Instead of running an expensive aggregation on every read, you run it once and store the result. Reads fetch the stored result instantly.
```sql
-- Base tables: orders (200M rows), order_items (1B rows), products (5M rows)

-- Expensive query (runs on every dashboard load):
SELECT category, SUM(total) AS revenue, COUNT(*) AS order_count
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
JOIN products p     ON p.id = oi.product_id
GROUP BY category;
-- → Scans 1.2B rows, takes 30 seconds

-- Materialized view:
CREATE MATERIALIZED VIEW category_revenue AS
SELECT category, SUM(total) AS revenue, COUNT(*) AS order_count
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
JOIN products p     ON p.id = oi.product_id
GROUP BY category;

-- Dashboard query:
SELECT * FROM category_revenue;
-- → Reads the precomputed table, takes 2ms

-- Refresh:
REFRESH MATERIALIZED VIEW category_revenue;
-- → Re-runs the expensive query, updates the stored result
-- → Schedule: every 5 minutes, or trigger on write
```
Refresh Strategies
| Strategy | Freshness | Cost | Best For |
|---|---|---|---|
| On every write | Always fresh | High (refresh on each INSERT/UPDATE) | Low-write, high-read data |
| Scheduled (every N minutes) | Stale up to N minutes | Low (periodic batch) | Dashboards, reports, analytics |
| Incremental refresh | Near real-time | Medium (only process new data) | Large datasets with append-only writes |
| Manual / on-demand | Stale until triggered | Lowest | Ad-hoc reports, monthly summaries |
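For the scheduled strategy in the table above, the refresher can be a trivial worker loop. A minimal sketch in Python against the category_revenue view from earlier (in production, pg_cron or a cron job would replace the loop):

```python
import time
import psycopg2

def refresh_loop(interval_seconds: int = 300) -> None:
    conn = psycopg2.connect("dbname=shop")
    conn.autocommit = True  # REFRESH ... CONCURRENTLY cannot run inside a transaction
    while True:
        with conn.cursor() as cur:
            # CONCURRENTLY lets readers keep querying the old contents during
            # the refresh; it requires a unique index on the materialized view.
            cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY category_revenue")
        time.sleep(interval_seconds)
```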
Strengths
- ✅ Turns 30-second queries into 2ms lookups
- ✅ No application code changes (it's a database feature)
- ✅ Perfect for predictable, expensive queries
- ✅ Supported by PostgreSQL, Oracle, SQL Server
Limitations
- ❌ Data is stale between refreshes
- ❌ Refresh itself is expensive (re-runs the full query)
- ❌ Storage overhead (stores the full result set)
- ❌ Not suitable for user-specific or highly dynamic data
🎯 Interview Insight
Materialized views are ideal when queries are expensive but predictable — dashboards, leaderboards, analytics summaries. Say: "The admin dashboard runs a 30-second aggregation. I'd create a materialized view refreshed every 5 minutes. Dashboard reads are instant, and 5-minute staleness is acceptable for analytics."
CQRS Pattern
CQRS (Command Query Responsibility Segregation) takes denormalization to its logical extreme: completely separate systems for writes and reads. The write database is normalized for correctness. The read database is denormalized for speed. An event stream keeps them in sync.
```
WRITE PATH                              READ PATH
──────────                              ─────────
User places order                       User views order history
        │                                       │
        ▼                                       ▼
┌──────────────┐                        ┌──────────────────┐
│   Write DB   │                        │     Read DB      │
│ (normalized) │                        │  (denormalized)  │
│              │                        │                  │
│ orders       │ ───── Event ────────→  │ order_history    │
│ order_items  │       Stream           │  (user_name,     │
│ products     │       (Kafka/SQS)      │   product_name,  │
│ users        │                        │   total, status) │
└──────────────┘                        └──────────────────┘

Write:          INSERT into normalized tables → publish event
Event consumer: reads event → updates denormalized read model
Read:           SELECT from denormalized read model → instant response

Write DB: PostgreSQL (ACID, normalized, source of truth)
Read DB:  MongoDB / Elasticsearch / Redis (optimized for queries)
```
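A sketch of the event consumer in the middle of this diagram, assuming kafka-python and pymongo; the topic name, event fields, and collection name are illustrative:

```python
import json
from kafka import KafkaConsumer
from pymongo import MongoClient

consumer = KafkaConsumer("order-events", bootstrap_servers="localhost:9092")
read_db = MongoClient("mongodb://localhost:27017").shop

for message in consumer:
    event = json.loads(message.value)
    if event["type"] == "OrderPlaced":
        # Project the event into a fully denormalized document shaped
        # exactly like the order-history page needs it: no JOINs on read.
        read_db.order_history.update_one(
            {"order_id": event["order_id"]},
            {
                "$set": {
                    "user_id": event["user_id"],
                    "user_name": event["user_name"],
                    "items": event["items"],  # [{product_name, image_url, qty}, ...]
                    "total": event["total"],
                    "status": "placed",
                }
            },
            upsert=True,  # idempotent: replaying the event is safe
        )
```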
Strengths
- ✅ Read and write models independently optimized
- ✅ Read DB can use a completely different technology
- ✅ Scales reads and writes independently
- ✅ Write DB stays clean and normalized
- ✅ Read DB can have multiple projections for different queries
Costs
- ❌ Significant complexity (two databases, event stream, consumers)
- ❌ Eventual consistency (read model lags behind writes)
- ❌ More infrastructure to operate and monitor
- ❌ Debugging is harder (data flows through multiple systems)
- ❌ Overkill for simple CRUD applications
🎯 Interview Insight
CQRS is a powerful pattern but don't reach for it by default. Use it when: (1) read and write patterns are fundamentally different, (2) you need different data models for reads vs writes, (3) you need to scale reads and writes independently. For a simple CRUD app, CQRS is over-engineering. For a social media feed with 100:1 read-to-write ratio, it's the right call.
Search Indexes (Elasticsearch)
Traditional databases use B-tree indexes — great for exact matches and ranges, terrible for full-text search. Searching "comfortable running shoes" across 5 million products with SQL LIKE is painfully slow. Elasticsearch uses inverted indexes — the same structure that powers Google Search.
```
Documents:
  doc1: "comfortable running shoes for men"
  doc2: "lightweight running sneakers"
  doc3: "comfortable leather shoes"

Inverted index (built during indexing):
  comfortable → [doc1, doc3]
  running     → [doc1, doc2]
  shoes       → [doc1, doc3]
  men         → [doc1]
  lightweight → [doc2]
  sneakers    → [doc2]
  leather     → [doc3]

Search: "comfortable running shoes"
  comfortable → [doc1, doc3]
  running     → [doc1, doc2]
  shoes       → [doc1, doc3]
  Intersection: doc1 (matches all 3 terms) → highest relevance

SQL equivalent:
  WHERE name LIKE '%comfortable%' AND name LIKE '%running%'
  → Full table scan on 5M rows, no index used, 10+ seconds

Elasticsearch:
  → Inverted index lookup, 3 set intersections, 5ms
```
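The mechanism is simple enough to demonstrate in a few lines of Python. This toy version shows the idea, not Elasticsearch's actual implementation (which adds tokenization, relevance scoring, and compressed posting lists):

```python
from collections import defaultdict

docs = {
    1: "comfortable running shoes for men",
    2: "lightweight running sneakers",
    3: "comfortable leather shoes",
}

# Build: each term maps to the set of documents containing it.
index: dict[str, set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query: str) -> set[int]:
    # Look up each query term's posting list, then intersect them.
    postings = [index[term] for term in query.split()]
    return set.intersection(*postings) if postings else set()

print(search("comfortable running shoes"))  # {1}
```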
Product Search
'Red Nike shoes size 10' — filters by color, brand, category, size with full-text matching on the name. Elasticsearch handles this in milliseconds across millions of products.
Log Search
'Find all ERROR logs from payment-service in the last hour.' Elasticsearch indexes log lines and enables instant search across billions of log entries.
Content Discovery
'Articles about machine learning for beginners.' Full-text search with relevance scoring, synonyms, and fuzzy matching. SQL can't do this efficiently.
```
Architecture:
  PostgreSQL (source of truth) → Change Data Capture → Elasticsearch

Write path:
  1. Product created/updated in PostgreSQL
  2. CDC (Debezium) captures the change
  3. Change event sent to Kafka
  4. Consumer indexes the product in Elasticsearch

Read path (search):
  1. User searches "comfortable running shoes"
  2. App queries Elasticsearch (not PostgreSQL)
  3. Elasticsearch returns matching product IDs + relevance scores
  4. App fetches full product details from PostgreSQL/cache

Key insight:
  PostgreSQL    = source of truth (writes, transactions, consistency)
  Elasticsearch = search index (reads, full-text, relevance)
  They serve different purposes. Don't replace one with the other.
```
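Here is a hedged sketch of that search read path, assuming the official Elasticsearch Python client (8.x style) and an illustrative products index with name, brand, and price fields:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Full-text match on the name plus structured filters: the kind of
# query that would need LIKE scans and several JOINs in SQL.
resp = es.search(
    index="products",
    query={
        "bool": {
            "must": {"match": {"name": "comfortable running shoes"}},
            "filter": [
                {"term": {"brand": "nike"}},
                {"range": {"price": {"lte": 100}}},
            ],
        }
    },
    size=20,
)

product_ids = [hit["_id"] for hit in resp["hits"]["hits"]]
# Then hydrate full product details from PostgreSQL or the cache by ID.
```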
Strengths
- ✅ Full-text search with relevance scoring
- ✅ Inverted indexes: millisecond search across millions of docs
- ✅ Advanced features: fuzzy matching, synonyms, autocomplete
- ✅ Faceted search (filter by category, price range, brand)
- ✅ Horizontal scaling (distributed across nodes)
Costs
- ❌ Extra system to maintain (cluster, monitoring, upgrades)
- ❌ Data sync: must keep ES in sync with primary DB
- ❌ Not a database: no ACID, no transactions, no JOINs
- ❌ Memory-hungry (inverted indexes live in RAM)
- ❌ Eventual consistency with the source of truth
🎯 Interview Insight
When a system needs search — product search, log search, content discovery — always mention Elasticsearch. SQL LIKE is not a search solution at scale. Say: "I'd use PostgreSQL as the source of truth and sync to Elasticsearch via CDC for search queries. This gives us ACID for writes and millisecond full-text search for reads."
End-to-End Scenario
Let's design the read optimization layer for an e-commerce platform — using all four techniques together.
🛒 E-Commerce — 100K Product Views/sec
5M products, 200M orders, 50M users. Read:Write = 100:1.
Pages: product listing, product detail, search, admin dashboard.
Product Listing → Denormalization
The listing page shows: name, price, image, avg_rating, review_count, category_name. Instead of JOINing 4 tables, store all fields in a denormalized products_listing table. One table scan, no JOINs. When a review is posted, a background job updates avg_rating and review_count.
Admin Dashboard → Materialized Views
Dashboard shows: revenue by category, top sellers, order trends. These aggregations scan 200M+ rows. Create materialized views refreshed every 5 minutes. Dashboard reads precomputed results in 2ms instead of running 30-second queries.
Product Search → Elasticsearch
'Red Nike shoes size 10 under $100' — full-text search with filters. PostgreSQL can't do this efficiently. Sync products to Elasticsearch via CDC. Search queries hit ES (5ms), product detail pages hit PostgreSQL/cache.
Order History → CQRS
Write path: orders are created in normalized PostgreSQL (ACID for payments). Read path: an event consumer builds a denormalized order_history in MongoDB, optimized for 'show me my orders with product names and images.' Reads never touch the transactional database.
```
WRITE PATH:
  Order placed → PostgreSQL (normalized, ACID)
    → Kafka event → Update denormalized views
    → Kafka event → Update Elasticsearch index
    → Kafka event → Update MongoDB read model

READ PATH:
  Product listing → Denormalized table (PostgreSQL) → Redis cache
  Product search  → Elasticsearch (inverted index)
  Admin dashboard → Materialized views (PostgreSQL)
  Order history   → MongoDB read model (CQRS)
  Product detail  → Redis cache → PostgreSQL fallback

Result:
  Write DB (PostgreSQL): handles 1K writes/sec (orders, updates)
  Read layer: handles 100K reads/sec across specialized stores
  Each read path is optimized for its specific query pattern
```
Trade-offs & Decision Making
| Technique | Read Speed | Write Complexity | Consistency | Best For |
|---|---|---|---|---|
| Denormalization | Very fast (no JOINs) | Medium (update copies) | Eventual | Listings, feeds, embedded data |
| Materialized Views | Instant (precomputed) | Low (DB handles refresh) | Stale between refreshes | Dashboards, reports, aggregations |
| CQRS | Optimized per query | High (event stream, consumers) | Eventual | Complex systems, different read/write models |
| Elasticsearch | Millisecond search | Medium (sync via CDC) | Eventual | Full-text search, filtering, discovery |
When to Use What
| Scenario | Technique | Why |
|---|---|---|
| Product listing with ratings | Denormalization | Embed avg_rating in product row, avoid JOIN on every listing |
| Revenue dashboard | Materialized View | Expensive aggregation, predictable query, 5-min staleness OK |
| Order history (100:1 read:write) | CQRS | Write model (ACID) and read model (fast) need different shapes |
| Product search | Elasticsearch | Full-text + filters, SQL can't do this at scale |
| Simple CRUD app | None of the above | Normalized DB + indexes is sufficient. Don't over-engineer. |
🎯 Decision Framework
Start with a normalized database + proper indexes. When specific queries become slow, apply the right technique: JOINs too slow → denormalize. Aggregations too slow → materialized views. Search too slow → Elasticsearch. Read/write models diverging → CQRS. Don't apply all four on day one — add complexity only when the simpler approach fails.
Interview Questions
Q: Why denormalize data?
A: To eliminate JOINs at read time. In a normalized schema, displaying a product listing requires joining products, categories, reviews, and inventory — 4 tables on every page load. Denormalization embeds category_name, avg_rating, and stock_count directly in the product row. Reads become single-table lookups. The trade-off: when a category name changes, you must update it in every product row. This is acceptable when reads outnumber writes 100:1.
Q: What is a materialized view?
A: A precomputed query result stored as a table. Instead of running an expensive aggregation (SUM, COUNT, GROUP BY across millions of rows) on every dashboard load, you run it once and store the result. Dashboard reads fetch the stored result in milliseconds. The view is refreshed periodically (every 5 minutes) or on demand. Trade-off: data is stale between refreshes. Perfect for analytics dashboards, leaderboards, and reporting.
Q: When would you use CQRS?
A: When read and write patterns are fundamentally different. Example: an e-commerce system where writes are normalized (orders, items, payments — ACID transactions) but reads need a completely different shape (order history with product names, images, tracking info — denormalized). CQRS separates them: writes go to PostgreSQL, events flow to a read-optimized MongoDB. Each database is optimized for its purpose. Don't use CQRS for simple CRUD — the complexity isn't justified.
Q: Why use Elasticsearch over SQL for search?
A: SQL databases use B-tree indexes — great for exact matches (WHERE id = 42) and ranges (WHERE price > 100), terrible for full-text search. 'Comfortable running shoes' with SQL LIKE '%comfortable%' requires a full table scan — no index helps. Elasticsearch uses inverted indexes: each word maps to the documents containing it. Searching 3 words = 3 hash lookups + set intersection = 5ms across millions of documents. Plus: relevance scoring, fuzzy matching, synonyms, autocomplete — none of which SQL supports natively.
Pitfalls
Over-denormalizing
Embedding everything everywhere. User name stored in orders, reviews, comments, messages, notifications — 5 copies. When the user changes their name, you must update all 5 tables. Miss one and the data is inconsistent. The more copies, the harder it is to keep them in sync.
✅ Denormalize only the fields that are read frequently and change rarely. User name in orders? Yes (orders are read often, names change rarely). Real-time stock count in product listing? No (changes every second — use a short TTL cache instead).
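The short-TTL cache suggested above might look like this sketch (assumes redis-py and psycopg2; the inventory table, key naming, and 5-second TTL are illustrative):

```python
import psycopg2
import redis

cache = redis.Redis()
conn = psycopg2.connect("dbname=shop")

def get_stock(product_id: int) -> int:
    key = f"stock:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return int(cached)
    with conn.cursor() as cur:
        cur.execute("SELECT stock FROM inventory WHERE product_id = %s", (product_id,))
        stock = cur.fetchone()[0]
    cache.set(key, stock, ex=5)  # at most 5 seconds stale, no denormalized copy to sync
    return stock
```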
Ignoring data consistency in views
Creating a materialized view refreshed every hour for a dashboard that shows 'current inventory.' Users see inventory from an hour ago and make decisions on stale data. Or: CQRS read model is 30 seconds behind, and a user doesn't see their own order.
✅ Match refresh frequency to the data's staleness tolerance. Inventory: refresh every 10 seconds or use a live query. Analytics: every 5 minutes is fine. For CQRS: implement read-your-own-writes — after a write, route that user's reads to the write DB briefly.
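A minimal sketch of read-your-own-writes routing, with an in-memory dict standing in for what would typically be a Redis key with a TTL in production:

```python
import time

RECENT_WRITE_WINDOW = 30  # seconds; roughly the read model's worst-case lag

last_write_at: dict[int, float] = {}

def record_write(user_id: int) -> None:
    last_write_at[user_id] = time.time()

def choose_read_store(user_id: int) -> str:
    # Users who just wrote read from the write DB until the read model catches up.
    wrote_recently = time.time() - last_write_at.get(user_id, 0.0) < RECENT_WRITE_WINDOW
    return "write_db" if wrote_recently else "read_model"
```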
Overengineering with CQRS
Using CQRS for a simple blog or CRUD admin panel. Two databases, an event stream, event consumers, eventual consistency — all for a system that handles 100 reads/sec. The complexity far outweighs the benefit.
✅ CQRS is for systems where read and write patterns are fundamentally different AND scale is significant. If a normalized PostgreSQL with proper indexes handles your reads fine, you don't need CQRS. Start simple, add complexity when the simple approach fails.
Not keeping Elasticsearch in sync
Indexing products in Elasticsearch on day one, then forgetting to update the index when products change. Search results show deleted products, wrong prices, and outdated descriptions. Users lose trust in search.
✅ Use Change Data Capture (CDC) with Debezium or a similar tool. Every change in PostgreSQL automatically flows to Elasticsearch via Kafka. No manual sync, no missed updates. Monitor the sync lag — if ES falls behind, search results are stale.