
Query Planner & Advanced SQL

Understanding how PostgreSQL executes queries is essential for writing fast SQL. Master EXPLAIN, join strategies, CTEs, window functions, and lateral joins.

50 min read · 9 sections
01. Query Lifecycle & Statistics

Every SQL statement goes through five stages: Parse (text → parse tree), Analyze (resolve names, check types), Rewrite (apply rules/views), Plan (choose execution strategy), Execute (run the plan). The planner is where performance is won or lost.

Statistic | What It Tells the Planner | Updated By
null_frac | Fraction of NULL values in the column | ANALYZE
n_distinct | Number of distinct values (or a ratio, if negative) | ANALYZE
most_common_vals | Most frequent values and their frequencies | ANALYZE
histogram_bounds | Distribution of non-MCV values in equal-frequency buckets | ANALYZE
correlation | Physical ordering correlation (-1 to 1) | ANALYZE

Stale Statistics = Bad Plans

The planner makes decisions based on statistics collected by ANALYZE. If statistics are stale (table changed significantly since last ANALYZE), the planner may choose catastrophically wrong plans — like a nested loop on a million-row table because it thinks there are only 10 rows. Autovacuum runs ANALYZE automatically, but after bulk loads, run ANALYZE manually.
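To check whether statistics are stale, you can query `pg_stat_user_tables`. The sketch below assumes the example `orders` table used throughout this guide:

```sql
-- When were statistics last refreshed, and how much has changed since?
SELECT relname,
       last_analyze,            -- last manual ANALYZE
       last_autoanalyze,        -- last autovacuum-triggered ANALYZE
       n_mod_since_analyze      -- rows inserted/updated/deleted since then
FROM pg_stat_user_tables
WHERE relname = 'orders';
```

A large `n_mod_since_analyze` relative to the table's row count is a strong signal to run ANALYZE manually rather than wait for autovacuum.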

Viewing and updating statistics:
-- View column statistics
SELECT attname, null_frac, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- Manually update statistics
ANALYZE orders;

-- Increase statistics target for skewed columns (default: 100)
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
02. EXPLAIN & EXPLAIN ANALYZE

Command | What It Does | When to Use
EXPLAIN | Shows the planned execution without running it | Quick check — safe for production
EXPLAIN ANALYZE | Executes the query and shows actual vs estimated | Debugging — shows real timings
EXPLAIN (ANALYZE, BUFFERS) | Adds buffer hit/read counts | I/O analysis — find disk-bound queries
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) | Machine-readable full detail | Tools like pgMustard, explain.dalibo.com
Reading an EXPLAIN plan:
EXPLAIN ANALYZE
SELECT o.id, o.total, u.email
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.status = 'pending' AND o.created_at > '2024-01-01';

-- Output (simplified):
-- Nested Loop (cost=0.87..156.23 rows=42 width=52) (actual time=0.05..1.2 rows=38 loops=1)
--   -> Index Scan using idx_orders_status_date on orders o
--        (cost=0.43..89.12 rows=42 width=28) (actual rows=38)
--        Filter: status = 'pending' AND created_at > '2024-01-01'
--   -> Index Scan using users_pkey on users u
--        (cost=0.43..1.59 rows=1 width=24) (actual rows=1 loops=38)
--        Index Cond: (id = o.user_id)
-- Planning Time: 0.3 ms
-- Execution Time: 1.5 ms

Spotting Bad Plans

Signs of Bad Query Plans

  • Row estimate mismatch: estimated rows=1, actual rows=100000 — stale statistics
  • Sequential scan on large table when few rows needed — missing index
  • Nested loop with large outer table — should be hash join
  • Sort on large result set with high work_mem usage — consider index for ordering
  • Bitmap heap scan with many recheck conditions — index not selective enough

Cost Units

Cost numbers (cost=0.00..156.23) are in arbitrary units relative to seq_page_cost (default 1.0). The first number is startup cost (incurred before the first row is returned), the second is total cost. These are estimates — EXPLAIN ANALYZE shows actual execution time in milliseconds.
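To see whether a query is disk-bound rather than CPU-bound, add the BUFFERS option. The table and the counts in the output fragment below are illustrative:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE status = 'pending';

-- Illustrative output fragment:
--   Buffers: shared hit=120 read=480
-- hit  = pages served from shared_buffers (cache)
-- read = pages fetched from the OS page cache or disk
```

A high read-to-hit ratio on a frequently run query suggests the working set does not fit in shared_buffers, or that the scan touches far more pages than it needs to.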

03. Scan Types

Scan Type | How It Works | Best When
Sequential Scan | Reads every page of the table in order | Returning a large % of rows; small tables
Index Scan | Follows index pointers to the heap — random I/O | Highly selective queries (< 5% of rows)
Index-Only Scan | Answers from the index alone, no heap access | All needed columns are in the index + recent VACUUM
Bitmap Index Scan | Collects TIDs from the index, sorts them by page, then fetches | Medium selectivity, or combining multiple indexes
Bitmap Heap Scan | Fetches heap pages in physical order after a bitmap scan | Reduces random I/O for medium-selectivity queries
📚

Library Analogy for Scan Types

Sequential scan: walk through every shelf in order. Index scan: look up the book in the catalog, walk directly to that shelf (but if you need 100 books, you're walking back and forth). Bitmap scan: look up all 100 books in the catalog, sort them by shelf location, then walk through shelves in order — one trip. Index-only scan: the catalog itself has all the info you need (author, title, year) — you never visit the shelves.
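The index-only scan described above can be demonstrated with a covering index; `INCLUDE` (PostgreSQL 11+) carries non-key columns in the index. Table and index names below are illustrative:

```sql
-- Covering index: user_id is the key, total is carried along as payload
CREATE INDEX idx_orders_user_incl_total ON orders (user_id) INCLUDE (total);

EXPLAIN ANALYZE
SELECT user_id, total FROM orders WHERE user_id = 42;
-- Index Only Scan using idx_orders_user_incl_total on orders
--   Heap Fetches: 0
```

Heap Fetches only drops to 0 once VACUUM has set the visibility-map bits; on a heavily updated table, an "index-only" scan may still visit the heap, which shows up as a nonzero Heap Fetches count.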

Bitmap scan combining multiple indexes:
-- PostgreSQL can combine multiple indexes with BitmapAnd/BitmapOr
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 42 AND status = 'pending';

-- If separate indexes exist on user_id and status:
-- BitmapAnd
--   -> Bitmap Index Scan on idx_orders_user_id (rows matching user_id = 42)
--   -> Bitmap Index Scan on idx_orders_status (rows matching status = 'pending')
-- -> Bitmap Heap Scan (fetch rows in both bitmaps)

-- A composite index (user_id, status) would be more efficient here
04. Join Strategies

Strategy | How It Works | Best When
Nested Loop | For each row in the outer relation, scan the inner (often via index) | Small outer table, or an index on the inner join column
Hash Join | Build a hash table on the smaller side, probe with the larger | No useful index; medium-to-large tables; equality joins
Merge Join | Sort both sides, then merge in order | Both sides already sorted (e.g., index scan), or very large tables
Join strategy examples:
-- Nested Loop: small orders result joined to users via PK index
EXPLAIN ANALYZE
SELECT o.*, u.email FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.id = 42;
-- Nested Loop: 1 order row drives an index lookup on users_pkey

-- Hash Join: joining two large tables without useful index
EXPLAIN ANALYZE
SELECT o.*, p.name FROM orders o
JOIN products p ON p.id = o.product_id
WHERE o.created_at > '2024-01-01';
-- Hash Join: build hash on products, probe with filtered orders

-- Merge Join: both sides have sorted index
EXPLAIN ANALYZE
SELECT a.*, b.* FROM table_a a
JOIN table_b b ON a.sort_key = b.sort_key
ORDER BY a.sort_key;
-- Merge Join: both already sorted by sort_key via index
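When debugging, you can steer the planner away from a strategy with the session-local `enable_*` settings; this is useful for comparing plans, not for production. The query below is a sketch reusing the example tables:

```sql
-- Compare plans by disabling nested loops for this session only
SET enable_nestloop = off;
EXPLAIN ANALYZE
SELECT o.id, u.email
FROM orders o
JOIN users u ON u.id = o.user_id;
RESET enable_nestloop;

-- Related knobs: enable_hashjoin, enable_mergejoin, enable_seqscan
```

If the plan with a strategy disabled is faster, that points to a costing problem (stale statistics, or a random_page_cost tuned for spinning disks on SSD storage), not a reason to leave the setting off.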
05. CTEs & Recursive Queries

Common Table Expressions (WITH clause) are named subqueries. Since PostgreSQL 12, CTEs can be inlined by the planner (previously they were always materialized as optimization fences).

CTE basics:
-- Simple CTE (PG 12+ can inline this)
WITH active_users AS (
  SELECT id, email FROM users WHERE status = 'active'
)
SELECT au.email, count(o.id) AS order_count
FROM active_users au
JOIN orders o ON o.user_id = au.id
GROUP BY au.email;

-- Force materialization (optimization fence)
WITH active_users AS MATERIALIZED (
  SELECT id, email FROM users WHERE status = 'active'
)
SELECT * FROM active_users WHERE email LIKE '%@company.com';

-- Force inlining (let planner push predicates into CTE)
WITH active_users AS NOT MATERIALIZED (
  SELECT id, email FROM users WHERE status = 'active'
)
SELECT * FROM active_users WHERE email LIKE '%@company.com';

Recursive CTEs

Recursive CTE — tree traversal:
-- Organizational hierarchy: find all reports under a manager
WITH RECURSIVE org_tree AS (
  -- Base case: the manager
  SELECT id, name, manager_id, 1 AS depth
  FROM employees
  WHERE id = 1  -- CEO

  UNION ALL

  -- Recursive case: find direct reports of current level
  SELECT e.id, e.name, e.manager_id, ot.depth + 1
  FROM employees e
  JOIN org_tree ot ON e.manager_id = ot.id
  WHERE ot.depth < 10  -- prevent infinite recursion
)
SELECT * FROM org_tree ORDER BY depth, name;

-- Category tree with path
WITH RECURSIVE cat_tree AS (
  SELECT id, name, parent_id, name::text AS path
  FROM categories WHERE parent_id IS NULL

  UNION ALL

  SELECT c.id, c.name, c.parent_id, ct.path || ' > ' || c.name
  FROM categories c
  JOIN cat_tree ct ON c.parent_id = ct.id
)
SELECT * FROM cat_tree;
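If the category data can contain cycles (bad parent_id links), a depth guard like the one in the org-tree example is the portable fix; PostgreSQL 14+ also offers a declarative CYCLE clause. The sketch below assumes the same `categories` table:

```sql
WITH RECURSIVE cat_tree AS (
  SELECT id, name, parent_id
  FROM categories WHERE parent_id IS NULL

  UNION ALL

  SELECT c.id, c.name, c.parent_id
  FROM categories c
  JOIN cat_tree ct ON c.parent_id = ct.id
) CYCLE id SET is_cycle USING path   -- PG 14+: stop when an id repeats
SELECT * FROM cat_tree WHERE NOT is_cycle;
```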
06. Window Functions

Window functions operate over a set of rows related to the current row without collapsing them into a single output row (unlike GROUP BY). They're essential for rankings, running totals, and gap detection.

Window function examples:
-- ROW_NUMBER: assign sequential number within partition
SELECT
  user_id,
  order_id,
  total,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
FROM orders;
-- rn = 1 is the most recent order per user

-- Running total
SELECT
  date,
  revenue,
  SUM(revenue) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS running_total
FROM daily_revenue;

-- LAG/LEAD: compare with previous/next row
SELECT
  date,
  revenue,
  revenue - LAG(revenue) OVER (ORDER BY date) AS day_over_day_change
FROM daily_revenue;

-- RANK vs DENSE_RANK
SELECT
  name, score,
  RANK() OVER (ORDER BY score DESC) AS rank,        -- 1, 2, 2, 4 (gaps)
  DENSE_RANK() OVER (ORDER BY score DESC) AS dense  -- 1, 2, 2, 3 (no gaps)
FROM leaderboard;

-- Top 3 products per category
SELECT * FROM (
  SELECT
    category_id, product_name, revenue,
    ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY revenue DESC) AS rn
  FROM products
) ranked
WHERE rn <= 3;

ROWS vs RANGE Frame

ROWS BETWEEN counts physical rows. RANGE BETWEEN includes all rows with the same ORDER BY value (peers). For running totals over data with duplicate dates, RANGE gives the correct result because it includes all ties; for most other cases, ROWS is more predictable and performant.
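The difference is easiest to see side by side on data with duplicate ORDER BY values — here, the earlier `daily_revenue` table, assuming it can contain two rows for the same date:

```sql
SELECT date, revenue,
  SUM(revenue) OVER (ORDER BY date
    ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_total,
  SUM(revenue) OVER (ORDER BY date
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_total
FROM daily_revenue;
-- On a duplicate date, range_total includes ALL rows with that date (peers),
-- while rows_total grows row by row in an arbitrary order within the tie.
```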

07. Lateral Joins & Keyset Pagination

LATERAL Joins

LATERAL allows a subquery to reference columns from preceding FROM items. It's like a correlated subquery but in the FROM clause — enabling "for each row, apply this function" patterns.

LATERAL join — top N per group:
-- Top 3 most recent orders per user (efficient with LATERAL)
SELECT u.id, u.email, recent_orders.*
FROM users u
CROSS JOIN LATERAL (
  SELECT o.id AS order_id, o.total, o.created_at
  FROM orders o
  WHERE o.user_id = u.id
  ORDER BY o.created_at DESC
  LIMIT 3
) recent_orders
WHERE u.status = 'active';

-- This is often far more efficient than a window-function approach for top-N
-- per group, because it can use an index on orders(user_id, created_at DESC)

Keyset Pagination

OFFSET-based pagination degrades linearly — page 1000 must scan and discard 999 pages of results. Keyset pagination uses a WHERE clause to skip directly to the next page via an index, so the cost stays essentially constant (one index descent, O(log N)) regardless of page depth.

Keyset pagination:
-- ❌ OFFSET pagination: gets slower with depth
SELECT * FROM orders ORDER BY created_at DESC, id DESC
LIMIT 20 OFFSET 10000;  -- must scan 10,000 rows to discard them

-- ✅ Keyset pagination: constant performance at any depth
-- First page:
SELECT * FROM orders
ORDER BY created_at DESC, id DESC
LIMIT 20;

-- Next page (use last row's values as cursor):
SELECT * FROM orders
WHERE (created_at, id) < ('2024-03-15 10:30:00+00', 98765)
ORDER BY created_at DESC, id DESC
LIMIT 20;

-- Requires: composite index on (created_at DESC, id DESC)
-- Performance: O(log N) regardless of page depth
-- Trade-off: no "jump to page 50"; only next/previous
Aspect | OFFSET | Keyset
Page 1 performance | Fast | Fast
Page 1000 performance | Slow (scans 20,000 rows) | Fast (same as page 1)
Jump to arbitrary page | ✅ Supported | ❌ Not supported
Consistent results during inserts | ❌ Rows shift between pages | ✅ Stable cursor
Index requirement | Optional (but helps) | Required on sort columns
08. Interview Questions

Q:How do you identify a slow query in PostgreSQL?

A: (1) Enable pg_stat_statements extension — shows top queries by total_time, calls, mean_time. (2) Set log_min_duration_statement = 100 to log queries over 100ms. (3) Use EXPLAIN (ANALYZE, BUFFERS) on suspect queries to see actual execution plan, row estimates vs actuals, and buffer I/O. Look for: row estimate mismatches, sequential scans on large tables, nested loops on large result sets.
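A typical pg_stat_statements query for step (1) looks like the sketch below. It requires the extension in shared_preload_libraries, and note the timing columns were renamed in PG 13 (total_time/mean_time became total_exec_time/mean_exec_time):

```sql
-- Top 5 statements by cumulative execution time (PG 13+ column names)
SELECT query,
       calls,
       total_exec_time,
       mean_exec_time,
       rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```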

Q:When does PostgreSQL choose a sequential scan over an index scan?

A: When the planner estimates the query will return a large fraction of the table (typically > 5-10%). Sequential I/O is faster than random I/O — reading 50% of a table sequentially is cheaper than 50% random index lookups. Also: very small tables (a few pages), queries with no suitable index, or when random_page_cost is set too high for SSD storage.

Q:Explain the difference between Hash Join and Merge Join.

A: Hash Join: builds a hash table on the smaller relation, then probes it with each row from the larger relation. O(N+M) but needs memory for the hash table. Best for equality joins without useful indexes. Merge Join: sorts both inputs on the join key, then merges them in order. O(N log N + M log M) for sorting, but if inputs are already sorted (via index), it's very efficient. Best when both sides are pre-sorted.

Q:Why is OFFSET pagination problematic and what's the alternative?

A: OFFSET N requires scanning and discarding N rows before returning results. Page 1000 with 20 rows/page scans 20,000 rows. Performance degrades linearly with page depth. Alternative: keyset pagination — use WHERE (sort_col, id) < (last_seen_value, last_seen_id) ORDER BY sort_col DESC, id DESC LIMIT 20. This uses an index to jump directly to the next page in O(log N) regardless of depth.

Q:What changed with CTEs in PostgreSQL 12?

A: Before PG 12, CTEs were always materialized — they acted as optimization fences. The planner couldn't push predicates into them or inline them. Since PG 12, non-recursive CTEs without side effects are automatically inlined (treated as subqueries). You can force old behavior with AS MATERIALIZED or force inlining with AS NOT MATERIALIZED.

09. Common Mistakes

⚠️

Not running ANALYZE after bulk loads

Loading millions of rows and immediately running queries — planner uses stale statistics from before the load.

Run ANALYZE table_name immediately after bulk INSERT or COPY operations. Autovacuum will eventually catch up, but for immediate query performance, manual ANALYZE is essential.

⚠️

Using OFFSET for deep pagination

SELECT * FROM products ORDER BY created_at OFFSET 100000 LIMIT 20 — scans 100,000 rows to return 20.

Use keyset pagination: WHERE (created_at, id) < (last_cursor_date, last_cursor_id) ORDER BY created_at DESC, id DESC LIMIT 20. Requires a composite index but gives constant performance at any depth.

⚠️

SELECT * in production queries

Using SELECT * when you only need 3 columns — fetches all columns including large text/jsonb, prevents index-only scans.

Select only needed columns. This enables index-only scans (if all columns are in the index), reduces I/O (skips TOAST decompression for large columns), and reduces network transfer.

⚠️

Using COUNT(*) > 0 instead of EXISTS

SELECT CASE WHEN COUNT(*) > 0 THEN true ELSE false END FROM orders WHERE user_id = 42 — counts ALL matching rows.

Use EXISTS: SELECT EXISTS(SELECT 1 FROM orders WHERE user_id = 42). EXISTS stops at the first matching row — with a suitable index, that's a single index probe. COUNT(*) must scan all matching rows even though you only care about 'at least one'.