Relational Data Modeling
Master relational database design — schema normalization, ACID transactions, indexing strategies, and query optimization for production systems.
The Big Picture — What Are Relational Databases?
A relational database stores data in tables with rows and columns, and lets you define relationships between those tables. It's the most battle-tested way to store structured data — powering everything from banking systems to e-commerce platforms for over 40 years.
The Spreadsheet Analogy
Think of a relational database as a collection of interconnected spreadsheets. Each table is a sheet — 'Users', 'Orders', 'Products'. Each row is a record — one specific user, one specific order. Each column is an attribute — name, email, price. The magic: sheets can reference each other. The 'Orders' sheet has a 'user_id' column that points to a row in the 'Users' sheet. That's a relationship. Now imagine these spreadsheets enforce rules: 'email must be unique', 'every order must belong to a real user', 'price can't be negative'. That's what makes it a relational database — not just storage, but structured, validated, interconnected storage.
Why are relational databases so dominant? Because most real-world data is inherently relational. A user has orders. An order has items. Items belong to products. Products have categories. These relationships are natural, and relational databases model them directly.
🔥 Key Insight
Relational databases aren't just about storing data — they're about enforcing correctness. Constraints, foreign keys, and transactions guarantee that your data is always in a valid state. This is why banks, hospitals, and financial systems rely on them.
Data Modeling Overview
Data modeling is the process of deciding how to structure your data — what tables to create, what columns they have, and how they relate to each other. A good data model makes your system fast, maintainable, and correct. A bad one creates bugs, performance nightmares, and technical debt that compounds over years.
Core Building Blocks
Entities & Tables
Each real-world concept becomes a table: Users, Products, Orders. Each row is one instance — one user, one product. Columns define the attributes — name, email, price.
Primary Keys
Every table needs a unique identifier for each row. Usually an auto-incrementing integer (id) or a UUID. This is how you reference a specific row.
Foreign Keys
A column that references a primary key in another table. orders.user_id → users.id. This creates the relationship and enforces referential integrity — you can't create an order for a user that doesn't exist.
Constraints
Rules the database enforces: NOT NULL (field required), UNIQUE (no duplicates), CHECK (price > 0), FOREIGN KEY (must reference a real row). Constraints prevent bad data at the database level.
Relationship Types
| Type | Example | Implementation |
|---|---|---|
| One-to-One | User → Profile | Foreign key with a UNIQUE constraint on one of the two tables (e.g., profiles.user_id → users.id) |
| One-to-Many | User → Orders | Foreign key on the 'many' side (orders.user_id → users.id) |
| Many-to-Many | Students ↔ Courses | Junction table: student_courses(student_id, course_id) |
-- Users table
CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  email VARCHAR(255) NOT NULL UNIQUE,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Products table
CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  name VARCHAR(200) NOT NULL,
  price DECIMAL(10,2) NOT NULL CHECK (price > 0),
  stock INTEGER NOT NULL DEFAULT 0
);

-- Orders table (one-to-many: user → orders)
CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  user_id INTEGER NOT NULL REFERENCES users(id),
  status VARCHAR(20) DEFAULT 'pending',
  total DECIMAL(10,2) NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Order items (junction table, many-to-many: orders ↔ products)
CREATE TABLE order_items (
  id SERIAL PRIMARY KEY,
  order_id INTEGER NOT NULL REFERENCES orders(id),
  product_id INTEGER NOT NULL REFERENCES products(id),
  quantity INTEGER NOT NULL CHECK (quantity > 0),
  unit_price DECIMAL(10,2) NOT NULL
);
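The schema above covers the one-to-many and many-to-many cases from the table. For completeness, here is a minimal sketch of a one-to-one relationship using a hypothetical profiles table (not part of the schema above): the UNIQUE constraint on the foreign key is what limits each user to a single profile.

-- One-to-one: each user has at most one profile
CREATE TABLE profiles (
  id SERIAL PRIMARY KEY,
  user_id INTEGER NOT NULL UNIQUE REFERENCES users(id),  -- UNIQUE makes it one-to-one
  bio TEXT,
  avatar_url VARCHAR(500)
);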
💡 Why Schema Design Matters
A poorly designed schema is like a house built on a bad foundation. You can paint the walls and add furniture, but the cracks will keep growing. Schema changes in production are expensive and risky — get it right early.
Schema Design & Normalization
Normalization is the process of organizing your tables to eliminate redundancy and prevent anomalies. The goal: every piece of data lives in exactly one place. If you need to update a user's email, you change it in one row, not in 50 rows across 5 tables.
The Problem: Unnormalized Data
orders_flat table:

┌──────────┬───────────┬─────────────────┬──────────────┬───────┬─────┐
│ order_id │ user_name │ user_email      │ product_name │ price │ qty │
├──────────┼───────────┼─────────────────┼──────────────┼───────┼─────┤
│ 1        │ Alice     │ alice@mail.com  │ Laptop       │ 999   │ 1   │
│ 1        │ Alice     │ alice@mail.com  │ Mouse        │ 29    │ 2   │
│ 2        │ Alice     │ alice@mail.com  │ Keyboard     │ 79    │ 1   │
│ 3        │ Bob       │ bob@mail.com    │ Laptop       │ 999   │ 1   │
└──────────┴───────────┴─────────────────┴──────────────┴───────┴─────┘

Problems:
❌ Alice's email is stored 3 times (redundancy)
❌ "Laptop" price is stored 2 times (redundancy)
❌ If Alice changes email → update 3 rows (update anomaly)
❌ If we delete order #3 → we lose Bob's info (delete anomaly)
❌ Can't add a product without an order (insert anomaly)
Normal Forms — Step by Step
1NF — First Normal Form
Each column holds a single atomic value (no arrays, no comma-separated lists). Each row is unique (has a primary key). Example violation: a 'phone_numbers' column containing '555-1234, 555-5678'. Fix: create a separate phone_numbers table.
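A rough sketch of that fix (table and column names are illustrative): each phone number becomes its own row in a child table instead of part of a comma-separated string.

-- 1NF violation: users.phone_numbers = '555-1234, 555-5678'
-- Fix: one row per phone number in a separate table
CREATE TABLE phone_numbers (
  id SERIAL PRIMARY KEY,
  user_id INTEGER NOT NULL REFERENCES users(id),
  phone VARCHAR(20) NOT NULL
);

-- Alice's two numbers become two rows:
INSERT INTO phone_numbers (user_id, phone) VALUES (1, '555-1234'), (1, '555-5678');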
2NF — Second Normal Form
Must be in 1NF + every non-key column depends on the ENTIRE primary key (not just part of it). Only matters for composite keys. Example violation: in order_items(order_id, product_id, product_name), product_name depends only on product_id, not on the full key. Fix: move product_name to the products table.
3NF — Third Normal Form
Must be in 2NF + no transitive dependencies. Every non-key column depends ONLY on the primary key, not on other non-key columns. Example violation: orders(id, user_id, user_email) — user_email depends on user_id, not on order id. Fix: keep user_email only in the users table.
-- Each entity in its own table, no redundancy
users:       id | name | email
products:    id | name | price
orders:      id | user_id (FK) | status | total | created_at
order_items: id | order_id (FK) | product_id (FK) | quantity | unit_price

✅ Alice's email stored once (in users)
✅ Laptop price stored once (in products)
✅ Update email → change 1 row
✅ Delete an order → user data is safe
✅ Add a product without needing an order
When to Denormalize
Normalization optimizes for data integrity. But sometimes you need to optimize for read performance. Denormalization intentionally adds redundancy to avoid expensive JOINs.
Normalize when
- ✅ Data integrity is critical (financial systems, healthcare)
- ✅ Write-heavy workloads (fewer places to update)
- ✅ Data changes frequently (avoid stale copies)
- ✅ Storage is a concern (no redundant data)
- ✅ The system is still evolving (flexible schema)
Denormalize when
- ✅ Read performance is critical (dashboards, analytics)
- ✅ JOINs across many tables are too slow
- ✅ Data is read 100x more than it's written
- ✅ You need precomputed aggregates (total_orders on user)
- ✅ Caching isn't sufficient for your read patterns
-- Normalized: to show "Alice — 47 orders" you need a JOIN + COUNT
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.id = 42
GROUP BY u.name;

-- Denormalized: add order_count directly to users table
ALTER TABLE users ADD COLUMN order_count INTEGER DEFAULT 0;

-- Now the query is instant:
SELECT name, order_count FROM users WHERE id = 42;

-- Trade-off: you must update order_count every time an order
-- is created or deleted. If you forget → stale data.
🎯 Interview Insight
In interviews, always start with a normalized design and then explain where you'd denormalize for performance. Say: "I'd normalize to 3NF first, then denormalize the user's order count because the dashboard reads it 1000x/sec but orders are created 10x/sec." This shows you understand both sides of the trade-off.
ACID Transactions
ACID is the set of guarantees that relational databases provide for transactions. A transaction is a group of operations that must succeed or fail as a unit — there's no in-between.
The Bank Transfer Analogy
You transfer $500 from Account A to Account B. Two operations: deduct $500 from A, add $500 to B. What if the system crashes after deducting from A but before adding to B? Without ACID, $500 vanishes. With ACID: either BOTH operations complete, or NEITHER does. The money is never lost, never duplicated, never in limbo.
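A minimal sketch of that transfer as a single transaction, assuming a simple accounts table with id and balance columns: either both UPDATEs take effect, or neither does.

BEGIN TRANSACTION;

-- Deduct $500 from Account A
UPDATE accounts SET balance = balance - 500 WHERE id = 'A';

-- Add $500 to Account B
UPDATE accounts SET balance = balance + 500 WHERE id = 'B';

COMMIT;

-- If the server crashes between the two UPDATEs, the uncommitted
-- transaction rolls back on recovery: the $500 is never lost,
-- never duplicated, never in limbo.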
Atomicity
All or nothing. A transaction either completes entirely or rolls back entirely. If step 3 of 5 fails, steps 1 and 2 are undone. No partial state.
Consistency
A transaction moves the database from one valid state to another. All constraints, foreign keys, and rules are satisfied after the transaction. Invalid data is rejected.
Isolation
Concurrent transactions don't interfere with each other. Transaction A doesn't see Transaction B's uncommitted changes. It's as if transactions run one at a time.
Durability
Once a transaction is committed, it's permanent — even if the server crashes immediately after. Data is written to disk (WAL — Write-Ahead Log) before the commit is acknowledged.
-- Place an order: multiple operations that MUST succeed together
BEGIN TRANSACTION;

-- 1. Check stock
SELECT stock FROM products WHERE id = 42 FOR UPDATE;
-- stock = 5, we need 2 → OK

-- 2. Reduce stock
UPDATE products SET stock = stock - 2 WHERE id = 42;

-- 3. Create order
INSERT INTO orders (user_id, total, status)
VALUES (7, 199.98, 'confirmed');

-- 4. Create order items
INSERT INTO order_items (order_id, product_id, quantity, unit_price)
VALUES (currval('orders_id_seq'), 42, 2, 99.99);

-- 5. Charge payment (if this fails, EVERYTHING rolls back)
-- ... payment logic ...

COMMIT;

-- If ANY step fails → ROLLBACK
-- Stock is restored, order is not created, payment is not charged
-- The database is back to its original state
Isolation Levels
Isolation is the most nuanced ACID property. Stricter isolation means more correctness but less concurrency (slower). The database lets you choose the trade-off.
| Level | Dirty Reads | Non-Repeatable Reads | Phantom Reads | Performance |
|---|---|---|---|---|
| Read Uncommitted | ✅ Possible | ✅ Possible | ✅ Possible | Fastest |
| Read Committed | ❌ Prevented | ✅ Possible | ✅ Possible | Fast (PostgreSQL default) |
| Repeatable Read | ❌ Prevented | ❌ Prevented | ✅ Possible | Moderate (MySQL default) |
| Serializable | ❌ Prevented | ❌ Prevented | ❌ Prevented | Slowest (full isolation) |
👻 Dirty Read
Transaction A reads data that Transaction B has modified but not yet committed. If B rolls back, A read data that never existed.
🔄 Non-Repeatable Read
Transaction A reads a row, Transaction B updates it and commits, Transaction A reads the same row again and gets a different value.
👤 Phantom Read
Transaction A queries rows matching a condition, Transaction B inserts a new row matching that condition, Transaction A re-queries and sees a new "phantom" row.
Each Isolation Level — Explained
Read Uncommitted — No Isolation
A transaction can read data that another transaction has modified but not yet committed. If that other transaction rolls back, you've read data that never actually existed (a 'dirty read'). This is the fastest level because there's zero locking or coordination, but it's almost never used in practice, because the risk of acting on uncommitted data that may later be rolled back is too high. Use case: rough analytics where approximate numbers are acceptable.
Read Committed — See Only Committed Data
A transaction only sees data that has been committed by other transactions. No dirty reads. However, if you read the same row twice within your transaction, you might get different values — because another transaction committed a change between your two reads (a 'non-repeatable read'). This is PostgreSQL's default and the most common level in production. It's a good balance of safety and performance.
Repeatable Read — Snapshot of the World
When your transaction starts, it takes a snapshot of the database. Every read within the transaction sees data as it was at that snapshot — even if other transactions commit changes in the meantime. No dirty reads, no non-repeatable reads. However, new rows inserted by other transactions can still appear ('phantom reads'). This is MySQL/InnoDB's default. Use case: reports that need a consistent view of data across multiple queries.
Serializable — Full Isolation
Transactions execute as if they ran one after another, in serial order. No dirty reads, no non-repeatable reads, no phantom reads. The database achieves this through heavy locking or optimistic concurrency control (abort and retry on conflict). This is the slowest level — concurrent transactions block each other or get aborted frequently. Use case: critical financial operations where any anomaly is unacceptable, like transferring money between accounts.
Table: products (id=42, stock=10, price=99.99)
Two concurrent transactions:

READ UNCOMMITTED:
  Tx A: UPDATE products SET price = 79.99 WHERE id = 42;   (not committed yet)
  Tx B: SELECT price FROM products WHERE id = 42;
        Tx B sees: 79.99  ← dirty read! Tx A might roll back.
  Tx A: ROLLBACK;
        Tx B used a price that never existed.

READ COMMITTED:
  Tx A: UPDATE products SET price = 79.99 WHERE id = 42;   (not committed yet)
  Tx B: SELECT price FROM products WHERE id = 42;
        Tx B sees: 99.99  ← only sees committed data ✅
  Tx A: COMMIT;
  Tx B: SELECT price FROM products WHERE id = 42;
        Tx B sees: 79.99  ← different value on second read (non-repeatable)

REPEATABLE READ:
  Tx B: SELECT price FROM products WHERE id = 42;
        Tx B sees: 99.99  ← snapshot taken
  Tx A: UPDATE products SET price = 79.99 WHERE id = 42; COMMIT;
  Tx B: SELECT price FROM products WHERE id = 42;
        Tx B sees: 99.99  ← still sees snapshot value ✅

SERIALIZABLE:
  Tx A and Tx B both try to update stock:
  Tx A: UPDATE products SET stock = stock - 1 WHERE id = 42;
  Tx B: UPDATE products SET stock = stock - 1 WHERE id = 42;
  → One succeeds, the other is ABORTED and must retry.
  → Guarantees no lost updates, no anomalies.
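You choose the level per transaction. A minimal sketch using PostgreSQL-style syntax (MySQL also accepts SET TRANSACTION, but there it must be issued before the transaction starts):

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- All reads in this transaction now see a single consistent snapshot
SELECT price FROM products WHERE id = 42;
-- ... more queries ...

COMMIT;

-- Or set a default for the whole session:
SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL SERIALIZABLE;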
🎯 Interview Insight
Most production systems use Read Committed (PostgreSQL default) or Repeatable Read (MySQL default). Serializable is rarely used because the performance cost is too high. Know when to use stricter isolation: financial transactions, inventory management, seat booking — anywhere double-spending or double-booking is catastrophic.
Indexing Strategies
An index is a data structure that makes queries faster by avoiding full table scans. Without an index, the database reads every single row to find what you're looking for. With an index, it jumps directly to the matching rows.
The Book Index Analogy
Imagine a 500-page textbook with no index. To find 'B-tree', you'd flip through every page. With an index at the back, you look up 'B-tree → page 247' and jump directly there. A database index works the same way — it's a sorted lookup structure that maps column values to row locations. The trade-off: the index takes up space (extra pages at the back of the book) and must be updated every time content changes (new edition = rebuild the index).
How Indexes Work — B-Tree Intuition
Most database indexes use a B-tree (balanced tree). Think of it as a sorted, hierarchical structure that narrows down the search at each level.
Table: users (1,000,000 rows)
Query: SELECT * FROM users WHERE email = 'alice@mail.com'

Without index (full table scan):
  → Read all 1,000,000 rows
  → Check each row's email
  → Time: O(n) = ~1,000,000 comparisons

With B-tree index on email:
  Level 1 (root):  "Is 'alice' < 'm'?"  → Go left
  Level 2:         "Is 'alice' < 'f'?"  → Go left
  Level 3:         "Is 'alice' < 'c'?"  → Go left
  Level 4 (leaf):  Found! Row pointer → disk location
  → Time: O(log n) = ~20 comparisons

That's 1,000,000 vs 20. This is why indexes matter.
Types of Indexes
Primary Index
Automatically created on the primary key. In databases with clustered primary keys (e.g., MySQL/InnoDB, SQL Server), the table data is physically ordered by this index; in PostgreSQL the primary key is a regular unique B-tree index over a heap table. Every table has one, and you don't need to create it manually.
Secondary Index
Created on any non-primary column you frequently query. Example: CREATE INDEX idx_email ON users(email). The index is a separate structure that points back to the table rows. You can have many secondary indexes per table.
Composite Index
An index on multiple columns: CREATE INDEX idx_user_status ON orders(user_id, status). The order matters — this index helps queries that filter by user_id, or by user_id AND status, but NOT by status alone. Think of it like a phone book sorted by last name, then first name.
CREATE INDEX idx_user_status ON orders(user_id, status);

-- ✅ Uses the index (leftmost prefix)
SELECT * FROM orders WHERE user_id = 42;

-- ✅ Uses the index (both columns)
SELECT * FROM orders WHERE user_id = 42 AND status = 'shipped';

-- ❌ Cannot use this index (skips user_id)
SELECT * FROM orders WHERE status = 'shipped';
-- For this query, you'd need a separate index on (status)

-- Think of it like a phone book:
-- Sorted by (last_name, first_name)
-- You can look up "Smith" → ✅
-- You can look up "Smith, John" → ✅
-- You can't look up just "John" efficiently → ❌
Index Trade-offs
Benefits
- ✅ Dramatically faster reads (O(log n) vs O(n))
- ✅ Enables efficient sorting (ORDER BY uses the index)
- ✅ Speeds up JOINs (index on foreign keys)
- ✅ Supports unique constraints (UNIQUE index)
- ✅ Enables range queries (BETWEEN, >, <)
Costs
- ❌ Slower writes — every INSERT/UPDATE/DELETE must update the index
- ❌ Storage overhead — indexes take disk space (can be 10-30% of table size)
- ❌ Too many indexes → write performance degrades significantly
- ❌ Unused indexes waste space and slow writes for no benefit
- ❌ Wrong index order (composite) → index is useless for your queries
🎯 How to Choose the Right Index
Index the columns you filter on (WHERE), join on (JOIN ... ON), and sort by (ORDER BY). Check your slow query log — the queries that run most often and take the longest are your indexing priorities. Don't index everything — every index slows down writes.
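One way to find those priorities in PostgreSQL is the pg_stat_statements extension, a minimal sketch assuming the extension is installed and enabled (column names such as total_exec_time apply to recent versions; older versions use total_time):

-- Top 10 queries by total time spent (PostgreSQL, pg_stat_statements)
SELECT query,
       calls,
       total_exec_time,   -- cumulative ms across all calls
       mean_exec_time     -- average ms per call
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;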
Query Optimization
A query that works in development with 100 rows can grind to a halt in production with 10 million rows. Query optimization is about understanding why queries are slow and how to fix them.
Why Queries Become Slow
Full Table Scans
No index on the filtered column → database reads every row. Fine for 1,000 rows, catastrophic for 10,000,000. The #1 cause of slow queries.
N+1 Query Problem
Fetching a list of users, then making a separate query for each user's orders. 1 query for users + N queries for orders = N+1 total. Should be 1 query with a JOIN.
Expensive JOINs
Joining large tables without indexes on the join columns. The database must compare every row in table A with every row in table B (nested loop). Index the foreign keys.
SELECT *
Fetching all columns when you only need 3. More data to read from disk, more data to transfer over the network, more memory used. Select only what you need.
The Query Planner — EXPLAIN
Every database has a query planner that decides how to execute your query. The EXPLAIN command shows you the plan — whether it's using an index, doing a full scan, or choosing a specific join strategy.
-- Slow query: "Find all orders for user 42"
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;

-- BAD output (no index):
--   Seq Scan on orders  (cost=0.00..25000.00 rows=47 width=64)
--     Filter: (user_id = 42)
--     Rows Removed by Filter: 999953
--   Execution Time: 450.23 ms
-- ↑ Sequential scan = reading ALL 1M rows, filtering out 999,953

-- Fix: add an index
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- GOOD output (with index):
--   Index Scan using idx_orders_user_id on orders
--     (cost=0.42..8.44 rows=47 width=64)
--     Index Cond: (user_id = 42)
--   Execution Time: 0.12 ms
-- ↑ Index scan = jumped directly to the 47 matching rows
The N+1 Problem — In Detail
-- ❌ N+1 Problem (application code makes N+1 queries):
--   Query 1:  SELECT * FROM users LIMIT 50;
--   Query 2:  SELECT * FROM orders WHERE user_id = 1;
--   Query 3:  SELECT * FROM orders WHERE user_id = 2;
--   ...
--   Query 51: SELECT * FROM orders WHERE user_id = 50;
-- Total: 51 queries! Each has network round-trip overhead.

-- ✅ Fix: Single query with JOIN
SELECT u.id, u.name, o.id AS order_id, o.total
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.id IN (1, 2, 3, ..., 50);
-- Total: 1 query. Database handles the join efficiently.

-- ✅ Alternative fix: Batch query
SELECT * FROM users LIMIT 50;                          -- 1 query
SELECT * FROM orders WHERE user_id IN (1,2,3,...,50);  -- 1 query
-- Total: 2 queries. Application joins the results in memory.
Optimization Checklist
Add indexes on WHERE, JOIN, and ORDER BY columns
This is the single biggest performance improvement. Check EXPLAIN output — if you see 'Seq Scan' on a large table, you're missing an index.
Select only the columns you need
SELECT id, name, email instead of SELECT *. Less data read from disk, less data transferred, less memory used.
Use LIMIT for paginated queries
Don't fetch 100,000 rows when the user sees 20. Use LIMIT 20 OFFSET 0, or better, cursor-based (keyset) pagination with WHERE id > last_id LIMIT 20 (see the sketch after this checklist).
Avoid N+1 — use JOINs or batch queries
If your ORM makes a query per row, switch to eager loading (JOIN) or batch loading (WHERE id IN (...)). Check your query log.
Precompute expensive aggregations
If a dashboard shows 'total revenue per month' and it's slow, precompute it into a summary table updated by a background job. Don't re-aggregate 10M rows on every page load.
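For the pagination item above, here is a minimal sketch of cursor-based (keyset) pagination versus OFFSET, using the products table from earlier: OFFSET still reads and discards every skipped row, while the keyset version uses the primary-key index to start exactly where the previous page ended.

-- OFFSET pagination: page 500 still reads and throws away 9,980 rows
SELECT id, name, price FROM products ORDER BY id LIMIT 20 OFFSET 9980;

-- Keyset pagination: the client remembers the last id it saw (e.g. 9980)
-- and the index jumps straight to the next 20 rows
SELECT id, name, price FROM products WHERE id > 9980 ORDER BY id LIMIT 20;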
💡 Real-World Example
A dashboard loads slowly because it runs SELECT status, COUNT(*) FROM orders WHERE user_id = ? GROUP BY status for every user on the page. Fix: add an index on (user_id, status), or precompute order counts per status into a user_order_stats table.
End-to-End Scenario
Let's design a relational database for an e-commerce system end-to-end — applying everything from this page.
Step 1: Schema Design (3NF)
users:       id | name | email (UNIQUE) | created_at
products:    id | name | price | stock | category_id (FK)
categories:  id | name
orders:      id | user_id (FK) | status | total | created_at
order_items: id | order_id (FK) | product_id (FK) | quantity | unit_price
payments:    id | order_id (FK) | amount | method | status | paid_at
reviews:     id | user_id (FK) | product_id (FK) | rating | comment
Step 2: Denormalization Decisions
-- Denormalize: add order_count to users (dashboard reads it 1000x/sec)
ALTER TABLE users ADD COLUMN order_count INTEGER DEFAULT 0;

-- Denormalize: add avg_rating to products (product listing page)
ALTER TABLE products ADD COLUMN avg_rating DECIMAL(2,1) DEFAULT 0;
ALTER TABLE products ADD COLUMN review_count INTEGER DEFAULT 0;

-- Keep updated via triggers or application logic:
--   On new order  → UPDATE users SET order_count = order_count + 1
--   On new review → UPDATE products SET avg_rating = ..., review_count = ...
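A rough sketch of the trigger approach in PostgreSQL (the function and trigger names are illustrative, and the same logic can live in application code instead):

-- Keep users.order_count in sync automatically on every new order
CREATE FUNCTION bump_order_count() RETURNS trigger AS $$
BEGIN
  UPDATE users SET order_count = order_count + 1 WHERE id = NEW.user_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_after_insert
AFTER INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION bump_order_count();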
Step 3: Indexing Strategy
-- Users: lookup by email (login)
CREATE UNIQUE INDEX idx_users_email ON users(email);

-- Orders: filter by user (order history page)
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Orders: filter by status (admin dashboard)
CREATE INDEX idx_orders_status ON orders(status);

-- Order items: lookup by order (order detail page)
CREATE INDEX idx_order_items_order_id ON order_items(order_id);

-- Products: filter by category + sort by price (product listing)
CREATE INDEX idx_products_category_price ON products(category_id, price);

-- Reviews: filter by product (product detail page)
CREATE INDEX idx_reviews_product_id ON reviews(product_id);

-- Payments: lookup by order (payment status check)
CREATE INDEX idx_payments_order_id ON payments(order_id);
Step 4: Transaction — Placing an Order
BEGIN TRANSACTION;

-- 1. Lock and check stock (FOR UPDATE prevents race conditions)
SELECT stock FROM products WHERE id = 42 FOR UPDATE;
-- stock = 5, need 2 → proceed

-- 2. Reduce stock
UPDATE products SET stock = stock - 2 WHERE id = 42;

-- 3. Create order
INSERT INTO orders (user_id, total, status)
VALUES (7, 199.98, 'pending')
RETURNING id;  -- returns order_id = 501

-- 4. Create order items
INSERT INTO order_items (order_id, product_id, quantity, unit_price)
VALUES (501, 42, 2, 99.99);

-- 5. Create payment record
INSERT INTO payments (order_id, amount, method, status)
VALUES (501, 199.98, 'credit_card', 'pending');

-- 6. Update denormalized count
UPDATE users SET order_count = order_count + 1 WHERE id = 7;

COMMIT;
-- If ANY step fails → ROLLBACK (stock restored, no orphan records)
Step 5: Query Optimization
-- Product listing (uses composite index on category_id, price)
SELECT id, name, price, avg_rating
FROM products
WHERE category_id = 5
ORDER BY price ASC
LIMIT 20 OFFSET 0;

-- Order history (uses index on user_id)
SELECT o.id, o.total, o.status, o.created_at
FROM orders o
WHERE o.user_id = 7
ORDER BY o.created_at DESC
LIMIT 10;

-- Order detail (uses indexes on order_id for both tables)
SELECT oi.quantity, oi.unit_price, p.name, p.price
FROM order_items oi
JOIN products p ON p.id = oi.product_id
WHERE oi.order_id = 501;

-- Dashboard: top customers (uses denormalized order_count)
SELECT name, order_count FROM users ORDER BY order_count DESC LIMIT 10;
-- Without denormalization, this would need a JOIN + GROUP BY on millions of rows
💡 This Is What Interviewers Want
Walk through this exact process in an interview: (1) identify entities and relationships, (2) normalize to 3NF, (3) identify read-heavy patterns and denormalize strategically, (4) add indexes based on query patterns, (5) wrap critical operations in transactions. This shows structured thinking.
Trade-offs & Decision Making
Normalization vs Denormalization
| Dimension | Normalized | Denormalized |
|---|---|---|
| Data integrity | High (single source of truth) | Lower (redundant copies can drift) |
| Write performance | Better (update one place) | Worse (update multiple places) |
| Read performance | Slower (JOINs needed) | Faster (precomputed, no JOINs) |
| Storage | Less (no redundancy) | More (duplicated data) |
| Complexity | Simpler schema | Complex update logic (keep copies in sync) |
| Best for | Write-heavy, data integrity critical | Read-heavy, dashboards, analytics |
Indexing Trade-offs
| Scenario | More Indexes | Fewer Indexes |
|---|---|---|
| Read performance | Faster (index lookups) | Slower (more full scans) |
| Write performance | Slower (update all indexes) | Faster (less overhead) |
| Storage | More disk space | Less disk space |
| Read-heavy workload (90% reads) | Add indexes aggressively | — |
| Write-heavy workload (90% writes) | — | Minimize indexes |
ACID Strictness vs Performance
| Scenario | Strict ACID (Serializable) | Relaxed (Read Committed) |
|---|---|---|
| Correctness | Maximum (no anomalies) | Good (dirty reads prevented) |
| Concurrency | Low (transactions block each other) | High (minimal blocking) |
| Throughput | Lower | Higher |
| Use case | Financial transactions, inventory | General web apps, dashboards |
🎯 Decision Framework
Start normalized, start with Read Committed isolation, and add indexes on your most common query patterns. Then measure. Denormalize only when JOINs become a measured bottleneck. Add stricter isolation only for critical transactions. Let data drive your optimization decisions, not assumptions.
Interview Questions
Conceptual, scenario-based, and optimization questions you're likely to encounter.
Q: When would you denormalize a database?
A: When read performance is critical and the data is read far more often than it's written. Example: a product listing page that shows avg_rating and review_count. Computing these with a JOIN + GROUP BY on every page load is expensive. Instead, store them directly on the products table and update them when a new review is added. The trade-off: you must keep the denormalized data in sync — if you forget to update it, users see stale data.
Q: How do indexes improve performance?
A: Indexes create a sorted data structure (usually a B-tree) that maps column values to row locations. Without an index, the database scans every row (O(n)). With an index, it navigates the tree to find matching rows in O(log n). For a table with 10 million rows, that's the difference between checking 10,000,000 rows and checking ~23 rows. The cost: indexes slow down writes (every INSERT/UPDATE must update the index) and use additional storage.
Q: What causes slow SQL queries?
A: The most common causes: (1) Missing indexes — full table scans on large tables. (2) N+1 queries — making a separate query per row instead of using JOINs or batch queries. (3) SELECT * — fetching all columns when you need 3. (4) No LIMIT — fetching 100K rows when the user sees 20. (5) Expensive JOINs without indexes on join columns. (6) Functions in WHERE clauses that prevent index usage (WHERE YEAR(created_at) = 2024 instead of WHERE created_at >= '2024-01-01'). Use EXPLAIN ANALYZE to diagnose.
You're designing a database for a booking system (hotels)
How do you prevent double-booking the same room?
Answer: Use a transaction with SELECT ... FOR UPDATE to lock the room row before checking availability. Within the transaction: lock the room, check whether it's available for the requested dates, insert the booking if it is, commit. If another transaction tries to book the same room simultaneously, it waits for the lock. Also add a UNIQUE constraint on (room_id, date) in the bookings table as a safety net. Note that Repeatable Read alone does not prevent phantom inserts, so rely on the row lock plus the unique constraint, or use Serializable isolation if you want the database itself to detect the conflict.
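A minimal sketch of that approach with a hypothetical rooms/bookings schema (one row per room per night keeps the UNIQUE constraint simple):

-- Safety net: the database itself refuses a second booking for the same room/night
CREATE TABLE bookings (
  id SERIAL PRIMARY KEY,
  room_id INTEGER NOT NULL REFERENCES rooms(id),
  night DATE NOT NULL,
  user_id INTEGER NOT NULL REFERENCES users(id),
  UNIQUE (room_id, night)
);

BEGIN TRANSACTION;

-- Lock the room row so concurrent bookings for it queue up behind us
SELECT id FROM rooms WHERE id = 7 FOR UPDATE;

-- Check availability for the requested nights
SELECT COUNT(*) FROM bookings
WHERE room_id = 7 AND night BETWEEN '2024-06-01' AND '2024-06-03';
-- count = 0 → proceed

INSERT INTO bookings (room_id, night, user_id)
VALUES (7, '2024-06-01', 42), (7, '2024-06-02', 42), (7, '2024-06-03', 42);

COMMIT;
-- Even if the lock is bypassed somehow, the UNIQUE constraint makes
-- a conflicting INSERT fail instead of double-booking the room.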
Your product search page takes 5 seconds to load
How would you diagnose and fix this?
Answer: Run EXPLAIN ANALYZE on the search query. Likely causes: (1) No index on the search columns — add a composite index on (category_id, price) or whatever the WHERE/ORDER BY uses. (2) SELECT * — change to SELECT id, name, price, image_url (only what the listing needs). (3) No LIMIT — add pagination. (4) If full-text search: the database isn't optimized for it — consider Elasticsearch for search, keep PostgreSQL for transactional data. After adding the index, the query should drop from 5s to <50ms.
A developer adds 15 indexes to a table to 'make everything fast'
What's wrong with this approach?
Answer: Every index slows down writes. Each INSERT must update all 15 indexes. Each UPDATE on an indexed column must update the affected indexes. For a write-heavy table (e.g., event logs, analytics), this can make writes 10-50x slower. Also, the query planner might choose the wrong index, actually making some queries slower. Fix: index only the columns used in frequent WHERE, JOIN, and ORDER BY clauses. Remove unused indexes (check pg_stat_user_indexes for usage stats).
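A sketch of that usage check in PostgreSQL (pg_stat_user_indexes is a built-in statistics view; an idx_scan of 0 since the last stats reset suggests the index is never used):

-- Indexes that have never been scanned (candidates for removal)
SELECT relname AS table_name,
       indexrelname AS index_name,
       idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;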
Common Mistakes
These mistakes cause real production outages and performance nightmares.
Over-normalizing
Splitting data into so many tables that every query requires 5+ JOINs. A user profile page that JOINs users, addresses, preferences, settings, and avatars — each in a separate table — is over-normalized. The query is complex and slow.
✅ Normalize to 3NF as a starting point, then evaluate. If a group of fields is always read together and belongs to the same entity, it's fine to keep them in one table. Normalization is a tool, not a religion.
Missing indexes on foreign keys
Foreign keys create relationships but DON'T automatically create indexes (in PostgreSQL). A JOIN on orders.user_id without an index on user_id means a full table scan of orders for every user lookup. This is the #1 missed optimization.
✅ Always create an index on every foreign key column. It's almost never wrong to do so. The write overhead is minimal compared to the read performance gain.
Misusing transactions
Wrapping too much work in a single transaction — like a transaction that sends an email, calls an external API, and updates the database. If the email service is slow, the transaction holds locks for seconds, blocking other queries.
✅ Keep transactions short and focused on database operations only. Do external calls (emails, APIs, file uploads) outside the transaction. If you need coordination, use the outbox pattern: write to a local outbox table inside the transaction, then process the outbox asynchronously.
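A minimal sketch of that outbox pattern (the outbox table and its columns are illustrative): the email intent is committed atomically with the order, and a separate worker sends it later, outside any database transaction.

CREATE TABLE outbox (
  id SERIAL PRIMARY KEY,
  event_type VARCHAR(50) NOT NULL,
  payload JSONB NOT NULL,
  processed BOOLEAN NOT NULL DEFAULT false,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Inside the order transaction: record the intent, don't send the email
BEGIN TRANSACTION;

INSERT INTO orders (user_id, total, status) VALUES (7, 199.98, 'confirmed');

INSERT INTO outbox (event_type, payload)
VALUES ('order_confirmation_email', '{"user_id": 7, "order_total": 199.98}');

COMMIT;

-- A background worker polls the outbox and does the slow external work:
SELECT id, event_type, payload FROM outbox WHERE processed = false ORDER BY id LIMIT 100;
-- ...send the email, then mark the row done:
UPDATE outbox SET processed = true WHERE id = 123;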
Writing inefficient JOINs
Joining large tables without indexes, using subqueries where JOINs would be faster, or joining tables you don't actually need. A common pattern: LEFT JOIN a table just to check if a row exists — use EXISTS instead.
✅ Always index JOIN columns. Prefer JOINs over correlated subqueries. Use EXISTS instead of LEFT JOIN + IS NOT NULL for existence checks. Use EXPLAIN ANALYZE to verify the query plan.
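A quick sketch of that existence-check rewrite, using the users and orders tables from earlier:

-- ❌ LEFT JOIN just to check existence: produces duplicates for users with many orders
SELECT DISTINCT u.id, u.name
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.id IS NOT NULL;

-- ✅ EXISTS: stops at the first matching order per user, no duplication
SELECT u.id, u.name
FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);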
Ignoring query performance until production
Queries that work fine with 1,000 test rows become catastrophically slow with 10 million production rows. The O(n) full scan that took 1ms in dev takes 10 seconds in prod.
✅ Test with realistic data volumes. Use EXPLAIN ANALYZE during development. Set up slow query logging (log_min_duration_statement in PostgreSQL). Monitor query performance from day one, not after users complain.