Relational Data Modeling
Master relational database design — schema normalization, ACID transactions, indexing strategies, and query optimization for production systems.
The Big Picture — What Are Relational Databases?
A relational database stores data in tables with rows and columns, and lets you define relationships between those tables. It's the most battle-tested way to store structured data — powering everything from banking systems to e-commerce platforms for over 40 years.
The Spreadsheet Analogy
Think of a relational database as a collection of interconnected spreadsheets. Each table is a sheet — 'Users', 'Orders', 'Products'. Each row is a record — one specific user, one specific order. Each column is an attribute — name, email, price. The magic: sheets can reference each other. The 'Orders' sheet has a 'user_id' column that points to a row in the 'Users' sheet. That's a relationship. Now imagine these spreadsheets enforce rules: 'email must be unique', 'every order must belong to a real user', 'price can't be negative'. That's what makes it a relational database — not just storage, but structured, validated, interconnected storage.
Why are relational databases so dominant? Because most real-world data is inherently relational. A user has orders. An order has items. Items belong to products. Products have categories. These relationships are natural, and relational databases model them directly.
🔥 Key Insight
Relational databases aren't just about storing data — they're about enforcing correctness. Constraints, foreign keys, and transactions guarantee that your data is always in a valid state. This is why banks, hospitals, and financial systems rely on them.
Data Modeling Overview
Data modeling is the process of deciding how to structure your data — what tables to create, what columns they have, and how they relate to each other. A good data model makes your system fast, maintainable, and correct. A bad one creates bugs, performance nightmares, and technical debt that compounds over years.
Core Building Blocks
Entities & Tables
Each real-world concept becomes a table: Users, Products, Orders. Each row is one instance — one user, one product. Columns define the attributes — name, email, price.
Primary Keys
Every table needs a unique identifier for each row. Usually an auto-incrementing integer (id) or a UUID. This is how you reference a specific row.
Foreign Keys
A column that references a primary key in another table. orders.user_id → users.id. This creates the relationship and enforces referential integrity — you can't create an order for a user that doesn't exist.
Constraints
Rules the database enforces: NOT NULL (field required), UNIQUE (no duplicates), CHECK (price > 0), FOREIGN KEY (must reference a real row). Constraints prevent bad data at the database level.
Relationship Types
| Type | Example | Implementation |
|---|---|---|
| One-to-One | User → Profile | Foreign key with a UNIQUE constraint on one of the two tables (e.g., profiles.user_id → users.id) |
| One-to-Many | User → Orders | Foreign key on the 'many' side (orders.user_id → users.id) |
| Many-to-Many | Students ↔ Courses | Junction table: student_courses(student_id, course_id) |
-- Users table
CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  email VARCHAR(255) NOT NULL UNIQUE,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Products table
CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  name VARCHAR(200) NOT NULL,
  price DECIMAL(10,2) NOT NULL CHECK (price > 0),
  stock INTEGER NOT NULL DEFAULT 0
);

-- Orders table (one-to-many: user → orders)
CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  user_id INTEGER NOT NULL REFERENCES users(id),
  status VARCHAR(20) DEFAULT 'pending',
  total DECIMAL(10,2) NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Order items (junction table, many-to-many: orders ↔ products)
CREATE TABLE order_items (
  id SERIAL PRIMARY KEY,
  order_id INTEGER NOT NULL REFERENCES orders(id),
  product_id INTEGER NOT NULL REFERENCES products(id),
  quantity INTEGER NOT NULL CHECK (quantity > 0),
  unit_price DECIMAL(10,2) NOT NULL
);
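The schema above covers the one-to-many and many-to-many cases from the table. For completeness, here is a minimal sketch of a one-to-one relationship using a hypothetical profiles table (not part of the schema above): the UNIQUE constraint on the foreign key is what limits each user to a single profile.

-- One-to-one: each user has at most one profile
CREATE TABLE profiles (
  id SERIAL PRIMARY KEY,
  user_id INTEGER NOT NULL UNIQUE REFERENCES users(id),  -- UNIQUE makes it one-to-one
  bio TEXT,
  avatar_url VARCHAR(500)
);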
💡 Why Schema Design Matters
A poorly designed schema is like a house built on a bad foundation. You can paint the walls and add furniture, but the cracks will keep growing. Schema changes in production are expensive and risky — get it right early.
Schema Design & Normalization
Normalization is the process of organizing your tables to eliminate redundancy and prevent anomalies. The goal: every piece of data lives in exactly one place. If you need to update a user's email, you change it in one row, not in 50 rows across 5 tables.
The Problem: Unnormalized Data
orders_flat table:

┌──────────┬───────────┬─────────────────┬──────────────┬───────┬─────┐
│ order_id │ user_name │ user_email      │ product_name │ price │ qty │
├──────────┼───────────┼─────────────────┼──────────────┼───────┼─────┤
│ 1        │ Alice     │ alice@mail.com  │ Laptop       │ 999   │ 1   │
│ 1        │ Alice     │ alice@mail.com  │ Mouse        │ 29    │ 2   │
│ 2        │ Alice     │ alice@mail.com  │ Keyboard     │ 79    │ 1   │
│ 3        │ Bob       │ bob@mail.com    │ Laptop       │ 999   │ 1   │
└──────────┴───────────┴─────────────────┴──────────────┴───────┴─────┘

Problems:
❌ Alice's email is stored 3 times (redundancy)
❌ "Laptop" price is stored 2 times (redundancy)
❌ If Alice changes email → update 3 rows (update anomaly)
❌ If we delete order #3 → we lose Bob's info (delete anomaly)
❌ Can't add a product without an order (insert anomaly)
Normal Forms — Step by Step
1NF — First Normal Form
Each column holds a single atomic value (no arrays, no comma-separated lists). Each row is unique (has a primary key). Example violation: a 'phone_numbers' column containing '555-1234, 555-5678'. Fix: create a separate phone_numbers table.
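A rough sketch of that fix (table and column names are illustrative): each phone number becomes its own row in a child table instead of part of a comma-separated string.

-- 1NF violation: users.phone_numbers = '555-1234, 555-5678'
-- Fix: one row per phone number in a separate table
CREATE TABLE phone_numbers (
  id SERIAL PRIMARY KEY,
  user_id INTEGER NOT NULL REFERENCES users(id),
  phone VARCHAR(20) NOT NULL
);

-- Alice's two numbers become two rows:
INSERT INTO phone_numbers (user_id, phone) VALUES (1, '555-1234'), (1, '555-5678');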
2NF — Second Normal Form
Must be in 1NF + every non-key column depends on the ENTIRE primary key (not just part of it). Only matters for composite keys. Example violation: in order_items(order_id, product_id, product_name), product_name depends only on product_id, not on the full key. Fix: move product_name to the products table.
3NF — Third Normal Form
Must be in 2NF + no transitive dependencies. Every non-key column depends ONLY on the primary key, not on other non-key columns. Example violation: orders(id, user_id, user_email) — user_email depends on user_id, not on order id. Fix: keep user_email only in the users table.
-- Each entity in its own table, no redundancy
users:       id | name | email
products:    id | name | price
orders:      id | user_id (FK) | status | total | created_at
order_items: id | order_id (FK) | product_id (FK) | quantity | unit_price

✅ Alice's email stored once (in users)
✅ Laptop price stored once (in products)
✅ Update email → change 1 row
✅ Delete an order → user data is safe
✅ Add a product without needing an order
When to Denormalize
Normalization optimizes for data integrity. But sometimes you need to optimize for read performance. Denormalization intentionally adds redundancy to avoid expensive JOINs.
Normalize when
- ✅ Data integrity is critical (financial systems, healthcare)
- ✅ Write-heavy workloads (fewer places to update)
- ✅ Data changes frequently (avoid stale copies)
- ✅ Storage is a concern (no redundant data)
- ✅ The system is still evolving (flexible schema)
Denormalize when
- ✅ Read performance is critical (dashboards, analytics)
- ✅ JOINs across many tables are too slow
- ✅ Data is read 100x more than it's written
- ✅ You need precomputed aggregates (total_orders on user)
- ✅ Caching isn't sufficient for your read patterns
-- Normalized: to show "Alice — 47 orders" you need a JOIN + COUNT
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.id = 42
GROUP BY u.name;

-- Denormalized: add order_count directly to users table
ALTER TABLE users ADD COLUMN order_count INTEGER DEFAULT 0;

-- Now the query is instant:
SELECT name, order_count FROM users WHERE id = 42;

-- Trade-off: you must update order_count every time an order
-- is created or deleted. If you forget → stale data.
🎯 Interview Insight
In interviews, always start with a normalized design and then explain where you'd denormalize for performance. Say: "I'd normalize to 3NF first, then denormalize the user's order count because the dashboard reads it 1000x/sec but orders are created 10x/sec." This shows you understand both sides of the trade-off.
ACID Transactions
ACID is the set of guarantees that relational databases provide for transactions. A transaction is a group of operations that must succeed or fail as a unit — there's no in-between.
The Bank Transfer Analogy
You transfer $500 from Account A to Account B. Two operations: deduct $500 from A, add $500 to B. What if the system crashes after deducting from A but before adding to B? Without ACID, $500 vanishes. With ACID: either BOTH operations complete, or NEITHER does. The money is never lost, never duplicated, never in limbo.
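A minimal sketch of that transfer as a single transaction, assuming a simple accounts table with id and balance columns: either both UPDATEs take effect, or neither does.

BEGIN TRANSACTION;

-- Deduct $500 from Account A
UPDATE accounts SET balance = balance - 500 WHERE id = 'A';

-- Add $500 to Account B
UPDATE accounts SET balance = balance + 500 WHERE id = 'B';

COMMIT;

-- If the server crashes between the two UPDATEs, the uncommitted
-- transaction rolls back on recovery: the $500 is never lost,
-- never duplicated, never in limbo.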
Atomicity
All or nothing. A transaction either completes entirely or rolls back entirely. If step 3 of 5 fails, steps 1 and 2 are undone. No partial state.
Consistency
A transaction moves the database from one valid state to another. All constraints, foreign keys, and rules are satisfied after the transaction. Invalid data is rejected.
Isolation
Concurrent transactions don't interfere with each other. Transaction A doesn't see Transaction B's uncommitted changes. It's as if transactions run one at a time.
Durability
Once a transaction is committed, it's permanent — even if the server crashes immediately after. Data is written to disk (WAL — Write-Ahead Log) before the commit is acknowledged.
-- Place an order: multiple operations that MUST succeed together
BEGIN TRANSACTION;

-- 1. Check stock
SELECT stock FROM products WHERE id = 42 FOR UPDATE;
-- stock = 5, we need 2 → OK

-- 2. Reduce stock
UPDATE products SET stock = stock - 2 WHERE id = 42;

-- 3. Create order
INSERT INTO orders (user_id, total, status)
VALUES (7, 199.98, 'confirmed');

-- 4. Create order items
INSERT INTO order_items (order_id, product_id, quantity, unit_price)
VALUES (currval('orders_id_seq'), 42, 2, 99.99);

-- 5. Charge payment (if this fails, EVERYTHING rolls back)
-- ... payment logic ...

COMMIT;

-- If ANY step fails → ROLLBACK
-- Stock is restored, order is not created, payment is not charged
-- The database is back to its original state
Isolation Levels
Isolation is the most nuanced ACID property. Stricter isolation means more correctness but less concurrency (slower). The database lets you choose the trade-off.
| Level | Dirty Reads | Non-Repeatable Reads | Phantom Reads | Performance |
|---|---|---|---|---|
| Read Uncommitted | ✅ Possible | ✅ Possible | ✅ Possible | Fastest |
| Read Committed | ❌ Prevented | ✅ Possible | ✅ Possible | Fast (PostgreSQL default) |
| Repeatable Read | ❌ Prevented | ❌ Prevented | ✅ Possible | Moderate (MySQL default) |
| Serializable | ❌ Prevented | ❌ Prevented | ❌ Prevented | Slowest (full isolation) |
👻 Dirty Read
Transaction A reads data that Transaction B has modified but not yet committed. If B rolls back, A read data that never existed.
🔄 Non-Repeatable Read
Transaction A reads a row, Transaction B updates it and commits, Transaction A reads the same row again and gets a different value.
👤 Phantom Read
Transaction A queries rows matching a condition, Transaction B inserts a new row matching that condition, Transaction A re-queries and sees a new "phantom" row.
Each Isolation Level — Explained
Read Uncommitted — No Isolation
A transaction can read data that another transaction has modified but not yet committed. If that other transaction rolls back, you've read data that never actually existed (a 'dirty read'). This is the fastest level because there's zero locking or coordination, but it's almost never used in practice, because the risk of acting on uncommitted data that may later be rolled back is too high. Use case: rough analytics where approximate numbers are acceptable.
Read Committed — See Only Committed Data
A transaction only sees data that has been committed by other transactions. No dirty reads. However, if you read the same row twice within your transaction, you might get different values — because another transaction committed a change between your two reads (a 'non-repeatable read'). This is PostgreSQL's default and the most common level in production. It's a good balance of safety and performance.
Repeatable Read — Snapshot of the World
When your transaction starts, it takes a snapshot of the database. Every read within the transaction sees data as it was at that snapshot — even if other transactions commit changes in the meantime. No dirty reads, no non-repeatable reads. However, new rows inserted by other transactions can still appear ('phantom reads'). This is MySQL/InnoDB's default. Use case: reports that need a consistent view of data across multiple queries.
Serializable — Full Isolation
Transactions execute as if they ran one after another, in serial order. No dirty reads, no non-repeatable reads, no phantom reads. The database achieves this through heavy locking or optimistic concurrency control (abort and retry on conflict). This is the slowest level — concurrent transactions block each other or get aborted frequently. Use case: critical financial operations where any anomaly is unacceptable, like transferring money between accounts.
Table: products (id=42, stock=10, price=99.99)
Two concurrent transactions:

READ UNCOMMITTED:
  Tx A: UPDATE products SET price = 79.99 WHERE id = 42;   (not committed yet)
  Tx B: SELECT price FROM products WHERE id = 42;
        Tx B sees: 79.99  ← dirty read! Tx A might roll back.
  Tx A: ROLLBACK;
        Tx B used a price that never existed.

READ COMMITTED:
  Tx A: UPDATE products SET price = 79.99 WHERE id = 42;   (not committed yet)
  Tx B: SELECT price FROM products WHERE id = 42;
        Tx B sees: 99.99  ← only sees committed data ✅
  Tx A: COMMIT;
  Tx B: SELECT price FROM products WHERE id = 42;
        Tx B sees: 79.99  ← different value on second read (non-repeatable)

REPEATABLE READ:
  Tx B: SELECT price FROM products WHERE id = 42;
        Tx B sees: 99.99  ← snapshot taken
  Tx A: UPDATE products SET price = 79.99 WHERE id = 42; COMMIT;
  Tx B: SELECT price FROM products WHERE id = 42;
        Tx B sees: 99.99  ← still sees snapshot value ✅

SERIALIZABLE:
  Tx A and Tx B both try to update stock:
  Tx A: UPDATE products SET stock = stock - 1 WHERE id = 42;
  Tx B: UPDATE products SET stock = stock - 1 WHERE id = 42;
  → One succeeds, the other is ABORTED and must retry.
  → Guarantees no lost updates, no anomalies.
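You choose the level per transaction. A minimal sketch using PostgreSQL-style syntax (MySQL also accepts SET TRANSACTION, but there it must be issued before the transaction starts):

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- All reads in this transaction now see a single consistent snapshot
SELECT price FROM products WHERE id = 42;
-- ... more queries ...

COMMIT;

-- Or set a default for the whole session:
SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL SERIALIZABLE;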
🎯 Interview Insight
Most production systems use Read Committed (PostgreSQL default) or Repeatable Read (MySQL default). Serializable is rarely used because the performance cost is too high. Know when to use stricter isolation: financial transactions, inventory management, seat booking — anywhere double-spending or double-booking is catastrophic.
Indexing Strategies
An index is a data structure that makes queries faster by avoiding full table scans. Without an index, the database reads every single row to find what you're looking for. With an index, it jumps directly to the matching rows.
The Book Index Analogy
Imagine a 500-page textbook with no index. To find 'B-tree', you'd flip through every page. With an index at the back, you look up 'B-tree → page 247' and jump directly there. A database index works the same way — it's a sorted lookup structure that maps column values to row locations. The trade-off: the index takes up space (extra pages at the back of the book) and must be updated every time content changes (new edition = rebuild the index).
How Indexes Work — B-Tree Intuition
Most database indexes use a B-tree (balanced tree). Think of it as a sorted, hierarchical structure that narrows down the search at each level.
Table: users (1,000,000 rows)
Query: SELECT * FROM users WHERE email = 'alice@mail.com'

Without index (full table scan):
  → Read all 1,000,000 rows
  → Check each row's email
  → Time: O(n) = ~1,000,000 comparisons

With B-tree index on email:
  Level 1 (root):  "Is 'alice' < 'm'?"  → Go left
  Level 2:         "Is 'alice' < 'f'?"  → Go left
  Level 3:         "Is 'alice' < 'c'?"  → Go left
  Level 4 (leaf):  Found! Row pointer → disk location
  → Time: O(log n) = ~20 comparisons

That's 1,000,000 vs 20. This is why indexes matter.
Types of Indexes
Primary Index
Automatically created on the primary key. In databases with clustered primary keys (e.g., MySQL/InnoDB, SQL Server), the table data is physically ordered by this index; in PostgreSQL the primary key is a regular unique B-tree index over a heap table. Every table has one, and you don't need to create it manually.
Secondary Index
Created on any non-primary column you frequently query. Example: CREATE INDEX idx_email ON users(email). The index is a separate structure that points back to the table rows. You can have many secondary indexes per table.
Composite Index
An index on multiple columns: CREATE INDEX idx_user_status ON orders(user_id, status). The order matters — this index helps queries that filter by user_id, or by user_id AND status, but NOT by status alone. Think of it like a phone book sorted by last name, then first name.
CREATE INDEX idx_user_status ON orders(user_id, status);

-- ✅ Uses the index (leftmost prefix)
SELECT * FROM orders WHERE user_id = 42;

-- ✅ Uses the index (both columns)
SELECT * FROM orders WHERE user_id = 42 AND status = 'shipped';

-- ❌ Cannot use this index (skips user_id)
SELECT * FROM orders WHERE status = 'shipped';
-- For this query, you'd need a separate index on (status)

-- Think of it like a phone book:
-- Sorted by (last_name, first_name)
-- You can look up "Smith" → ✅
-- You can look up "Smith, John" → ✅
-- You can't look up just "John" efficiently → ❌
Index Trade-offs
Benefits
- ✅ Dramatically faster reads (O(log n) vs O(n))
- ✅ Enables efficient sorting (ORDER BY uses the index)
- ✅ Speeds up JOINs (index on foreign keys)
- ✅ Supports unique constraints (UNIQUE index)
- ✅ Enables range queries (BETWEEN, >, <)
Costs
- ❌ Slower writes — every INSERT/UPDATE/DELETE must update the index
- ❌ Storage overhead — indexes take disk space (can be 10-30% of table size)
- ❌ Too many indexes → write performance degrades significantly
- ❌ Unused indexes waste space and slow writes for no benefit
- ❌ Wrong index order (composite) → index is useless for your queries
🎯 How to Choose the Right Index
Index the columns you filter on (WHERE), join on (JOIN ... ON), and sort by (ORDER BY). Check your slow query log — the queries that run most often and take the longest are your indexing priorities. Don't index everything — every index slows down writes.
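One way to find those priorities in PostgreSQL is the pg_stat_statements extension, a minimal sketch assuming the extension is installed and enabled (column names such as total_exec_time apply to recent versions; older versions use total_time):

-- Top 10 queries by total time spent (PostgreSQL, pg_stat_statements)
SELECT query,
       calls,
       total_exec_time,   -- cumulative ms across all calls
       mean_exec_time     -- average ms per call
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;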
Query Optimization
A query that works in development with 100 rows can grind to a halt in production with 10 million rows. Query optimization is about understanding why queries are slow and how to fix them.
Why Queries Become Slow
Full Table Scans
No index on the filtered column → database reads every row. Fine for 1,000 rows, catastrophic for 10,000,000. The #1 cause of slow queries.
N+1 Query Problem
Fetching a list of users, then making a separate query for each user's orders. 1 query for users + N queries for orders = N+1 total. Should be 1 query with a JOIN.
Expensive JOINs
Joining large tables without indexes on the join columns. The database must compare every row in table A with every row in table B (nested loop). Index the foreign keys.
SELECT *
Fetching all columns when you only need 3. More data to read from disk, more data to transfer over the network, more memory used. Select only what you need.
The Query Planner — EXPLAIN
Every database has a query planner that decides how to execute your query. The EXPLAIN command shows you the plan — whether it's using an index, doing a full scan, or choosing a specific join strategy.
-- Slow query: "Find all orders for user 42"
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;

-- BAD output (no index):
--   Seq Scan on orders  (cost=0.00..25000.00 rows=47 width=64)
--     Filter: (user_id = 42)
--     Rows Removed by Filter: 999953
--   Execution Time: 450.23 ms
-- ↑ Sequential scan = reading ALL 1M rows, filtering out 999,953

-- Fix: add an index
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- GOOD output (with index):
--   Index Scan using idx_orders_user_id on orders
--     (cost=0.42..8.44 rows=47 width=64)
--     Index Cond: (user_id = 42)
--   Execution Time: 0.12 ms
-- ↑ Index scan = jumped directly to the 47 matching rows
The N+1 Problem — In Detail
-- ❌ N+1 Problem (application code makes N+1 queries):
--   Query 1:  SELECT * FROM users LIMIT 50;
--   Query 2:  SELECT * FROM orders WHERE user_id = 1;
--   Query 3:  SELECT * FROM orders WHERE user_id = 2;
--   ...
--   Query 51: SELECT * FROM orders WHERE user_id = 50;
-- Total: 51 queries! Each has network round-trip overhead.

-- ✅ Fix: Single query with JOIN
SELECT u.id, u.name, o.id AS order_id, o.total
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.id IN (1, 2, 3, ..., 50);
-- Total: 1 query. Database handles the join efficiently.

-- ✅ Alternative fix: Batch query
SELECT * FROM users LIMIT 50;                          -- 1 query
SELECT * FROM orders WHERE user_id IN (1,2,3,...,50);  -- 1 query
-- Total: 2 queries. Application joins the results in memory.
Optimization Checklist
Add indexes on WHERE, JOIN, and ORDER BY columns
This is the single biggest performance improvement. Check EXPLAIN output — if you see 'Seq Scan' on a large table, you're missing an index.
Select only the columns you need
SELECT id, name, email instead of SELECT *. Less data read from disk, less data transferred, less memory used.
Use LIMIT for paginated queries
Don't fetch 100,000 rows when the user sees 20. Use LIMIT 20 OFFSET 0, or better, cursor-based (keyset) pagination with WHERE id > last_id LIMIT 20 (see the sketch after this checklist).
Avoid N+1 — use JOINs or batch queries
If your ORM makes a query per row, switch to eager loading (JOIN) or batch loading (WHERE id IN (...)). Check your query log.
Precompute expensive aggregations
If a dashboard shows 'total revenue per month' and it's slow, precompute it into a summary table updated by a background job. Don't re-aggregate 10M rows on every page load.
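For the pagination item above, here is a minimal sketch of cursor-based (keyset) pagination versus OFFSET, using the products table from earlier: OFFSET still reads and discards every skipped row, while the keyset version uses the primary-key index to start exactly where the previous page ended.

-- OFFSET pagination: page 500 still reads and throws away 9,980 rows
SELECT id, name, price FROM products ORDER BY id LIMIT 20 OFFSET 9980;

-- Keyset pagination: the client remembers the last id it saw (e.g. 9980)
-- and the index jumps straight to the next 20 rows
SELECT id, name, price FROM products WHERE id > 9980 ORDER BY id LIMIT 20;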
💡 Real-World Example
A dashboard loads slowly because it runs SELECT status, COUNT(*) FROM orders WHERE user_id = ? GROUP BY status for every user on the page. Fix: add an index on (user_id, status), or precompute order counts per status into a user_order_stats table.
End-to-End Scenario
Let's design a relational database for an e-commerce system end-to-end — applying everything from this page.
Step 1: Schema Design (3NF)
users:       id | name | email (UNIQUE) | created_at
products:    id | name | price | stock | category_id (FK)
categories:  id | name
orders:      id | user_id (FK) | status | total | created_at
order_items: id | order_id (FK) | product_id (FK) | quantity | unit_price
payments:    id | order_id (FK) | amount | method | status | paid_at
reviews:     id | user_id (FK) | product_id (FK) | rating | comment
Step 2: Denormalization Decisions
-- Denormalize: add order_count to users (dashboard reads it 1000x/sec)
ALTER TABLE users ADD COLUMN order_count INTEGER DEFAULT 0;

-- Denormalize: add avg_rating to products (product listing page)
ALTER TABLE products ADD COLUMN avg_rating DECIMAL(2,1) DEFAULT 0;
ALTER TABLE products ADD COLUMN review_count INTEGER DEFAULT 0;

-- Keep updated via triggers or application logic:
--   On new order  → UPDATE users SET order_count = order_count + 1
--   On new review → UPDATE products SET avg_rating = ..., review_count = ...
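A rough sketch of the trigger approach in PostgreSQL (the function and trigger names are illustrative, and the same logic can live in application code instead):

-- Keep users.order_count in sync automatically on every new order
CREATE FUNCTION bump_order_count() RETURNS trigger AS $$
BEGIN
  UPDATE users SET order_count = order_count + 1 WHERE id = NEW.user_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_after_insert
AFTER INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION bump_order_count();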
Step 3: Indexing Strategy
-- Users: lookup by email (login)
CREATE UNIQUE INDEX idx_users_email ON users(email);

-- Orders: filter by user (order history page)
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Orders: filter by status (admin dashboard)
CREATE INDEX idx_orders_status ON orders(status);

-- Order items: lookup by order (order detail page)
CREATE INDEX idx_order_items_order_id ON order_items(order_id);

-- Products: filter by category + sort by price (product listing)
CREATE INDEX idx_products_category_price ON products(category_id, price);

-- Reviews: filter by product (product detail page)
CREATE INDEX idx_reviews_product_id ON reviews(product_id);

-- Payments: lookup by order (payment status check)
CREATE INDEX idx_payments_order_id ON payments(order_id);
Step 4: Transaction — Placing an Order
BEGIN TRANSACTION;

-- 1. Lock and check stock (FOR UPDATE prevents race conditions)
SELECT stock FROM products WHERE id = 42 FOR UPDATE;
-- stock = 5, need 2 → proceed

-- 2. Reduce stock
UPDATE products SET stock = stock - 2 WHERE id = 42;

-- 3. Create order
INSERT INTO orders (user_id, total, status)
VALUES (7, 199.98, 'pending')
RETURNING id;  -- returns order_id = 501

-- 4. Create order items
INSERT INTO order_items (order_id, product_id, quantity, unit_price)
VALUES (501, 42, 2, 99.99);

-- 5. Create payment record
INSERT INTO payments (order_id, amount, method, status)
VALUES (501, 199.98, 'credit_card', 'pending');

-- 6. Update denormalized count
UPDATE users SET order_count = order_count + 1 WHERE id = 7;

COMMIT;
-- If ANY step fails → ROLLBACK (stock restored, no orphan records)
Step 5: Query Optimization
-- Product listing (uses composite index on category_id, price)
SELECT id, name, price, avg_rating
FROM products
WHERE category_id = 5
ORDER BY price ASC
LIMIT 20 OFFSET 0;

-- Order history (uses index on user_id)
SELECT o.id, o.total, o.status, o.created_at
FROM orders o
WHERE o.user_id = 7
ORDER BY o.created_at DESC
LIMIT 10;

-- Order detail (uses indexes on order_id for both tables)
SELECT oi.quantity, oi.unit_price, p.name, p.price
FROM order_items oi
JOIN products p ON p.id = oi.product_id
WHERE oi.order_id = 501;

-- Dashboard: top customers (uses denormalized order_count)
SELECT name, order_count FROM users ORDER BY order_count DESC LIMIT 10;
-- Without denormalization, this would need a JOIN + GROUP BY on millions of rows
💡 This Is What Interviewers Want
Walk through this exact process in an interview: (1) identify entities and relationships, (2) normalize to 3NF, (3) identify read-heavy patterns and denormalize strategically, (4) add indexes based on query patterns, (5) wrap critical operations in transactions. This shows structured thinking.
Trade-offs & Decision Making
Normalization vs Denormalization
| Dimension | Normalized | Denormalized |
|---|---|---|
| Data integrity | High (single source of truth) | Lower (redundant copies can drift) |
| Write performance | Better (update one place) | Worse (update multiple places) |
| Read performance | Slower (JOINs needed) | Faster (precomputed, no JOINs) |
| Storage | Less (no redundancy) | More (duplicated data) |
| Complexity | Simpler schema | Complex update logic (keep copies in sync) |
| Best for | Write-heavy, data integrity critical | Read-heavy, dashboards, analytics |
Indexing Trade-offs
| Scenario | More Indexes | Fewer Indexes |
|---|---|---|
| Read performance | Faster (index lookups) | Slower (more full scans) |
| Write performance | Slower (update all indexes) | Faster (less overhead) |
| Storage | More disk space | Less disk space |
| Read-heavy workload (90% reads) | Add indexes aggressively | — |
| Write-heavy workload (90% writes) | — | Minimize indexes |
ACID Strictness vs Performance
| Scenario | Strict ACID (Serializable) | Relaxed (Read Committed) |
|---|---|---|
| Correctness | Maximum (no anomalies) | Good (dirty reads prevented) |
| Concurrency | Low (transactions block each other) | High (minimal blocking) |
| Throughput | Lower | Higher |
| Use case | Financial transactions, inventory | General web apps, dashboards |
🎯 Decision Framework
Start normalized, start with Read Committed isolation, and add indexes on your most common query patterns. Then measure. Denormalize only when JOINs become a measured bottleneck. Add stricter isolation only for critical transactions. Let data drive your optimization decisions, not assumptions.
Interview Questions
Conceptual, scenario-based, and optimization questions you're likely to encounter.
Q: When would you denormalize a database?
A: When read performance is critical and the data is read far more often than it's written. Example: a product listing page that shows avg_rating and review_count. Computing these with a JOIN + GROUP BY on every page load is expensive. Instead, store them directly on the products table and update them when a new review is added. The trade-off: you must keep the denormalized data in sync — if you forget to update it, users see stale data.
Q: How do indexes improve performance?
A: Indexes create a sorted data structure (usually a B-tree) that maps column values to row locations. Without an index, the database scans every row (O(n)). With an index, it navigates the tree to find matching rows in O(log n). For a table with 10 million rows, that's the difference between checking 10,000,000 rows and checking ~23 rows. The cost: indexes slow down writes (every INSERT/UPDATE must update the index) and use additional storage.
Q: What causes slow SQL queries?
A: The most common causes: (1) Missing indexes — full table scans on large tables. (2) N+1 queries — making a separate query per row instead of using JOINs or batch queries. (3) SELECT * — fetching all columns when you need 3. (4) No LIMIT — fetching 100K rows when the user sees 20. (5) Expensive JOINs without indexes on join columns. (6) Functions in WHERE clauses that prevent index usage (WHERE YEAR(created_at) = 2024 instead of WHERE created_at >= '2024-01-01'). Use EXPLAIN ANALYZE to diagnose.
You're designing a database for a booking system (hotels)
How do you prevent double-booking the same room?
Answer: Use a transaction with SELECT ... FOR UPDATE to lock the room row before checking availability. Within the transaction: lock the room, check whether it's available for the requested dates, insert the booking if it is, commit. If another transaction tries to book the same room simultaneously, it waits for the lock. Also add a UNIQUE constraint on (room_id, date) in the bookings table as a safety net. Note that Repeatable Read alone does not prevent phantom inserts, so rely on the row lock plus the unique constraint, or use Serializable isolation if you want the database itself to detect the conflict.
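A minimal sketch of that approach with a hypothetical rooms/bookings schema (one row per room per night keeps the UNIQUE constraint simple):

-- Safety net: the database itself refuses a second booking for the same room/night
CREATE TABLE bookings (
  id SERIAL PRIMARY KEY,
  room_id INTEGER NOT NULL REFERENCES rooms(id),
  night DATE NOT NULL,
  user_id INTEGER NOT NULL REFERENCES users(id),
  UNIQUE (room_id, night)
);

BEGIN TRANSACTION;

-- Lock the room row so concurrent bookings for it queue up behind us
SELECT id FROM rooms WHERE id = 7 FOR UPDATE;

-- Check availability for the requested nights
SELECT COUNT(*) FROM bookings
WHERE room_id = 7 AND night BETWEEN '2024-06-01' AND '2024-06-03';
-- count = 0 → proceed

INSERT INTO bookings (room_id, night, user_id)
VALUES (7, '2024-06-01', 42), (7, '2024-06-02', 42), (7, '2024-06-03', 42);

COMMIT;
-- Even if the lock is bypassed somehow, the UNIQUE constraint makes
-- a conflicting INSERT fail instead of double-booking the room.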
Your product search page takes 5 seconds to load
How would you diagnose and fix this?
Answer: Run EXPLAIN ANALYZE on the search query. Likely causes: (1) No index on the search columns — add a composite index on (category_id, price) or whatever the WHERE/ORDER BY uses. (2) SELECT * — change to SELECT id, name, price, image_url (only what the listing needs). (3) No LIMIT — add pagination. (4) If full-text search: the database isn't optimized for it — consider Elasticsearch for search, keep PostgreSQL for transactional data. After adding the index, the query should drop from 5s to <50ms.
A developer adds 15 indexes to a table to 'make everything fast'
What's wrong with this approach?
Answer: Every index slows down writes. Each INSERT must update all 15 indexes. Each UPDATE on an indexed column must update the affected indexes. For a write-heavy table (e.g., event logs, analytics), this can make writes 10-50x slower. Also, the query planner might choose the wrong index, actually making some queries slower. Fix: index only the columns used in frequent WHERE, JOIN, and ORDER BY clauses. Remove unused indexes (check pg_stat_user_indexes for usage stats).
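A sketch of that usage check in PostgreSQL (pg_stat_user_indexes is a built-in statistics view; an idx_scan of 0 since the last stats reset suggests the index is never used):

-- Indexes that have never been scanned (candidates for removal)
SELECT relname AS table_name,
       indexrelname AS index_name,
       idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;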
Common Mistakes
These mistakes cause real production outages and performance nightmares.
Over-normalizing
Splitting data into so many tables that every query requires 5+ JOINs. A user profile page that JOINs users, addresses, preferences, settings, and avatars — each in a separate table — is over-normalized. The query is complex and slow.
✅ Normalize to 3NF as a starting point, then evaluate. If a group of fields is always read together and belongs to the same entity, it's fine to keep them in one table. Normalization is a tool, not a religion.
Missing indexes on foreign keys
Foreign keys create relationships but DON'T automatically create indexes (in PostgreSQL). A JOIN on orders.user_id without an index on user_id means a full table scan of orders for every user lookup. This is the #1 missed optimization.
✅ Always create an index on every foreign key column. It's almost never wrong to do so. The write overhead is minimal compared to the read performance gain.
Misusing transactions
Wrapping too much work in a single transaction — like a transaction that sends an email, calls an external API, and updates the database. If the email service is slow, the transaction holds locks for seconds, blocking other queries.
✅ Keep transactions short and focused on database operations only. Do external calls (emails, APIs, file uploads) outside the transaction. If you need coordination, use the outbox pattern: write to a local outbox table inside the transaction, then process the outbox asynchronously.
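A minimal sketch of that outbox pattern (the outbox table and its columns are illustrative): the email intent is committed atomically with the order, and a separate worker sends it later, outside any database transaction.

CREATE TABLE outbox (
  id SERIAL PRIMARY KEY,
  event_type VARCHAR(50) NOT NULL,
  payload JSONB NOT NULL,
  processed BOOLEAN NOT NULL DEFAULT false,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Inside the order transaction: record the intent, don't send the email
BEGIN TRANSACTION;

INSERT INTO orders (user_id, total, status) VALUES (7, 199.98, 'confirmed');

INSERT INTO outbox (event_type, payload)
VALUES ('order_confirmation_email', '{"user_id": 7, "order_total": 199.98}');

COMMIT;

-- A background worker polls the outbox and does the slow external work:
SELECT id, event_type, payload FROM outbox WHERE processed = false ORDER BY id LIMIT 100;
-- ...send the email, then mark the row done:
UPDATE outbox SET processed = true WHERE id = 123;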
Writing inefficient JOINs
Joining large tables without indexes, using subqueries where JOINs would be faster, or joining tables you don't actually need. A common pattern: LEFT JOIN a table just to check if a row exists — use EXISTS instead.
✅ Always index JOIN columns. Prefer JOINs over correlated subqueries. Use EXISTS instead of LEFT JOIN + IS NOT NULL for existence checks. Use EXPLAIN ANALYZE to verify the query plan.
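A quick sketch of that existence-check rewrite, using the users and orders tables from earlier:

-- ❌ LEFT JOIN just to check existence: produces duplicates for users with many orders
SELECT DISTINCT u.id, u.name
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.id IS NOT NULL;

-- ✅ EXISTS: stops at the first matching order per user, no duplication
SELECT u.id, u.name
FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);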
Ignoring query performance until production
Queries that work fine with 1,000 test rows become catastrophically slow with 10 million production rows. The O(n) full scan that took 1ms in dev takes 10 seconds in prod.
✅ Test with realistic data volumes. Use EXPLAIN ANALYZE during development. Set up slow query logging (log_min_duration_statement in PostgreSQL). Monitor query performance from day one, not after users complain.