Multi-Process · Shared Buffers · WAL · MVCC · TOAST · Checkpoints · Page Cache · Background Workers

Architecture & Internals

How PostgreSQL actually works under the hood — the process model, memory architecture, storage engine, and the WAL that makes durability possible.

45 min read · 9 sections
01

PostgreSQL's Origin & Philosophy

PostgreSQL traces back to 1986 at UC Berkeley. Michael Stonebraker's POSTGRES project (Post-Ingres) aimed to overcome the limitations of existing relational systems by adding extensibility — custom types, operators, and access methods. The "object-relational" in PostgreSQL means you can define your own data types and teach the database how to index them.

The core philosophy: correctness first, performance second. PostgreSQL will never sacrifice data integrity for speed. This is why it's the default choice when you need to trust your data — financial systems, healthcare, anything where "eventually consistent" isn't acceptable.

🏛️

The Bank Vault vs The Cash Register

MySQL is like a fast cash register — optimized for quick transactions, sometimes at the cost of strict guarantees. PostgreSQL is like a bank vault — every operation is verified, every constraint enforced, every transaction truly atomic. You pay slightly more in overhead, but you never lose money to a race condition.

The 'Just Use Postgres' Default

For most new systems, PostgreSQL is the correct starting database. It handles OLTP, has decent full-text search, supports JSON documents via jsonb, does geospatial with PostGIS, and scales reads via streaming replication. You only need a specialized database when PostgreSQL demonstrably cannot handle your specific access pattern at your specific scale.
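
As a quick taste of the jsonb point, here is a minimal sketch — the events table, payload column, and index name are hypothetical:

jsonb quick taste (SQL)
-- Document-style storage inside a relational table
CREATE TABLE events (
  id      bigserial PRIMARY KEY,
  payload jsonb NOT NULL
);

-- A GIN index accelerates containment queries over the whole document
CREATE INDEX events_payload_gin ON events USING gin (payload);

-- All signup events, pulling a field out of the document
SELECT id, payload->>'user_id' AS user_id
FROM events
WHERE payload @> '{"type": "signup"}';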

When PostgreSQL is the Wrong Choice

  • Extreme write throughput (millions/sec) — Cassandra or ScyllaDB territory
  • Simple key-value at massive scale with single-digit-millisecond latency — DynamoDB territory
  • Full-text search as the primary feature at huge scale — Elasticsearch territory
  • Time-series with billions of data points — TimescaleDB extension or InfluxDB
  • Graph traversal as the primary query pattern — Neo4j or Neptune
02

Process Architecture

PostgreSQL uses a multi-process model — one OS process per client connection. This is fundamentally different from thread-based databases like MySQL. Each backend process has its own memory space, which provides isolation but costs ~5–10 MB per connection.

| Process | Role | Count |
|---|---|---|
| Postmaster | Parent process — listens for connections, forks backends | 1 |
| Backend | Handles one client connection — parses, plans, executes queries | 1 per connection |
| WAL Writer | Flushes WAL buffers to disk periodically | 1 |
| Checkpointer | Writes dirty pages to disk at checkpoint intervals | 1 |
| Background Writer | Gradually writes dirty buffers to reduce checkpoint spikes | 1 |
| Autovacuum Launcher | Spawns autovacuum workers to reclaim dead tuples | 1 |
| Stats Collector | Collects table/index usage statistics (process removed in PG 15; stats now live in shared memory) | 1 |
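
You can see this process-per-role layout from SQL — since PG 10, pg_stat_activity lists background processes via the backend_type column:

Process inventory (SQL)
-- One row per server process; backend_type distinguishes background workers
SELECT pid, backend_type, state
FROM pg_stat_activity
ORDER BY backend_type;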

The Connection Problem

Because each connection is a full OS process, PostgreSQL degrades badly beyond ~200–300 connections. At 1,000 connections you're burning 5–10 GB of RAM on process overhead alone. This is why connection pooling (PgBouncer) is mandatory in production, not a nice-to-have.

Check current connections (SQL)
-- See all active connections
SELECT pid, usename, application_name, state, query_start
FROM pg_stat_activity
WHERE state = 'active';

-- Check max connections setting
SHOW max_connections;  -- default: 100

-- Connection count by state
SELECT state, COUNT(*)
FROM pg_stat_activity
GROUP BY state;
03

Memory Architecture

PostgreSQL's memory is split into shared memory (accessible by all processes) and per-process memory (private to each backend). Understanding this split is essential for tuning.

| Parameter | What It Does | Recommended |
|---|---|---|
| shared_buffers | PostgreSQL's internal page cache — hot data lives here | 25% of RAM |
| effective_cache_size | Hint to the planner about total cache (shared_buffers + OS cache) | 75% of RAM |
| work_mem | Memory per sort/hash operation per query | 4–64 MB (careful!) |
| maintenance_work_mem | Memory for VACUUM, CREATE INDEX, ALTER TABLE | 256 MB – 1 GB |
| wal_buffers | In-memory buffer before WAL hits disk | 64 MB |
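
To make the table concrete, here is a sketch applying those rules of thumb to a hypothetical 64 GB server via ALTER SYSTEM (which writes postgresql.auto.conf). shared_buffers needs a restart; the rest take effect on reload:

Rule-of-thumb tuning (SQL)
-- Hypothetical 64 GB server; values follow the table above
ALTER SYSTEM SET shared_buffers = '16GB';         -- 25% of RAM (restart required)
ALTER SYSTEM SET effective_cache_size = '48GB';   -- 75% of RAM
ALTER SYSTEM SET work_mem = '16MB';               -- keep low globally (see below)
ALTER SYSTEM SET maintenance_work_mem = '1GB';
SELECT pg_reload_conf();                          -- applies everything but shared_buffers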

The work_mem Trap

work_mem is per operation, not per query. A complex query with 5 sorts and 3 hash joins can use up to 8× work_mem. With 100 concurrent connections, that's potentially 100 × 8 × work_mem. Setting work_mem = 256 MB sounds reasonable until you realize it could consume 200 GB across concurrent queries. Start low (4–16 MB) and increase only for specific workloads.
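
The practical pattern is to diagnose disk spills with EXPLAIN and raise work_mem per session only — a sketch, assuming a large hypothetical orders table:

Diagnosing a spilled sort (SQL)
-- "Sort Method: external merge  Disk: ..." in the output means work_mem was too small
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, sum(total)
FROM orders
GROUP BY customer_id
ORDER BY sum(total) DESC;

-- Raise it for this session only, then re-run; the plan should switch
-- to an in-memory quicksort
SET work_mem = '256MB';

RESET work_mem;  -- back to the server default when done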

🏠

Two Layers of Cache

PostgreSQL has a double-buffering architecture. shared_buffers is PostgreSQL's own cache (the house's pantry). The OS page cache is the second layer (the neighborhood store). When PostgreSQL reads a page not in shared_buffers, it goes to the OS — which might have it cached. This is why effective_cache_size should be 75% of RAM: it tells the planner 'between my cache and the OS cache, this much data is probably already in memory.'

Check buffer usage (SQL)
-- See shared buffer hit ratio (should be > 99%)
SELECT
  sum(blks_hit) * 100.0 / sum(blks_hit + blks_read) AS hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

-- Check current memory settings
SHOW shared_buffers;
SHOW work_mem;
SHOW effective_cache_size;
04

Storage Architecture

Tables in PostgreSQL are stored as heap files — unordered collections of 8 KB pages. Each page contains a header, item pointers (line pointers), free space, and the actual tuple data. There is no clustered index — new rows go wherever free space allows, typically in insertion order (the CLUSTER command reorders a table once but does not maintain that order).

| Concept | Description |
|---|---|
| Page (Block) | 8 KB unit of I/O — everything is read/written in pages |
| Tuple | A single row version — header (23 bytes) + null bitmap + data |
| Item Pointer | 4-byte entry in the page's line pointer array, giving a tuple's offset within the page |
| CTID | Physical location of a tuple: (page_number, item_number) |
| Fill Factor | How full to pack pages (default 100%) — lower it to leave room for HOT updates |
| Tablespace | Maps tables/indexes to specific storage devices |
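
You can see this layout directly. ctid works on any table; pageinspect is a contrib module that ships with PostgreSQL but must be installed per database. The orders table is hypothetical:

Peek at physical layout (SQL)
-- ctid = (page_number, item_number), the tuple's physical address
SELECT ctid, * FROM orders LIMIT 5;

-- Raw page header: lower/upper bound the free space between
-- the line pointer array and the tuple data
CREATE EXTENSION IF NOT EXISTS pageinspect;  -- contrib module, superuser
SELECT lower, upper, special, pagesize
FROM page_header(get_raw_page('orders', 0));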

TOAST — The Oversized Attribute Storage Technique

When a row exceeds ~2 KB, PostgreSQL automatically moves large column values to a separate TOAST table. This keeps the main table's pages efficient for scanning. TOAST compresses values (pglz by default; LZ4 is available since PG 14) and stores them out-of-line.

TOAST Storage Strategies

  • PLAIN — no compression, no out-of-line storage (fixed-length types)
  • EXTENDED — compress first, then store out-of-line if still too large (default for text/jsonb)
  • EXTERNAL — store out-of-line without compression (useful for pre-compressed data)
  • MAIN — try compression, avoid out-of-line unless absolutely necessary
Inspect table storage (SQL)
-- See physical size of a table (including TOAST and indexes)
SELECT
  pg_size_pretty(pg_total_relation_size('orders')) AS total,
  pg_size_pretty(pg_relation_size('orders')) AS table_only,
  pg_size_pretty(pg_indexes_size('orders')) AS indexes;

-- See TOAST table for a relation
SELECT relname, reltoastrelid::regclass
FROM pg_class
WHERE relname = 'orders';
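
Strategies are set per column. A sketch, assuming a hypothetical documents table whose blob_data column holds already-compressed bytes:

Per-column TOAST strategy (SQL)
-- attstorage codes: p=PLAIN, x=EXTENDED, e=EXTERNAL, m=MAIN
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'documents'::regclass
  AND attnum > 0 AND NOT attisdropped;

-- Skip pointless recompression of pre-compressed data
ALTER TABLE documents ALTER COLUMN blob_data SET STORAGE EXTERNAL;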
05

Write-Ahead Log (WAL)

The WAL is PostgreSQL's fundamental durability mechanism. The rule is simple: write the log before writing the data. Before any change to a data page is flushed to disk, the corresponding WAL record must be on stable storage. This guarantees that after a crash, PostgreSQL can replay the WAL to recover any committed transaction.

📓

The Captain's Log

Imagine a ship's captain who writes every order in a logbook before the crew executes it. If the ship sinks and is recovered, you can replay the logbook to know exactly what happened. The WAL is that logbook — sequential, append-only, and written before any action is taken. Even if the data files are corrupted, the WAL can reconstruct the correct state.

| WAL Concept | Description |
|---|---|
| WAL Segment | 16 MB file (default) — the WAL is split into segments for management |
| LSN (Log Sequence Number) | Monotonically increasing byte position in the WAL stream |
| WAL Writer | Background process that flushes WAL buffers to disk |
| Checkpoint | Flush all dirty pages to disk + write a checkpoint record |
| checkpoint_completion_target | Spread checkpoint I/O over time (default 0.9 = 90% of the interval) |
| WAL Archiving | Copy completed WAL segments to safe storage for PITR |

WAL is the Foundation of Everything

The WAL isn't just for crash recovery. It's the foundation for: streaming replication (standbys replay WAL from primary), point-in-time recovery (replay WAL to any timestamp), and logical decoding (extract row-level changes from WAL for CDC). Understanding WAL is understanding how PostgreSQL achieves durability, replication, and backup — all three.
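
For a concrete taste of logical decoding, here is a sketch using the built-in test_decoding output plugin. It assumes wal_level = logical and a throwaway orders table, and is for experimentation, not production CDC:

Logical decoding demo (SQL)
-- Create a slot with the built-in test output plugin
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- Any committed change now appears in the slot as a decoded event
INSERT INTO orders (id, total) VALUES (42, 99.50);
SELECT lsn, data
FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

-- Drop it when done — an abandoned slot pins WAL segments forever
SELECT pg_drop_replication_slot('demo_slot');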

WAL monitoring (SQL)
-- Current WAL position
SELECT pg_current_wal_lsn();

-- Total WAL written since cluster init (sample twice and diff for a rate)
SELECT pg_wal_lsn_diff(
  pg_current_wal_lsn(),
  '0/0'
) / 1024 / 1024 AS total_wal_mb;

-- Check checkpoint frequency
SELECT checkpoints_timed, checkpoints_req,
       checkpoint_write_time, checkpoint_sync_time
FROM pg_stat_bgwriter;
06

MVCC — The Core Mechanism

MVCC (Multi-Version Concurrency Control) is how PostgreSQL achieves non-blocking reads. Instead of locking rows when they're being modified, PostgreSQL keeps multiple versions of each row. Readers see a consistent snapshot without waiting for writers. Writers create new row versions without blocking readers.

| Field | Meaning |
|---|---|
| xmin | Transaction ID that created this row version (INSERT or UPDATE) |
| xmax | Transaction ID that deleted/updated this row version (0 if still live) |
| Visibility Rule | Row visible if: xmin committed before my snapshot AND xmax not committed before my snapshot |
| Dead Tuple | Old row version no longer visible to any active transaction |
| VACUUM | Reclaims space from dead tuples — the cost of MVCC |
📸

Snapshot Photography

Each transaction gets a 'photograph' of the database at the moment it starts (or at each statement, depending on isolation level). Other transactions can modify data freely — your photograph doesn't change. When you commit your own changes, you create a new version that future photographs will see. Old versions (that no photograph needs anymore) are eventually cleaned up by VACUUM.
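
A minimal two-session sketch of the photograph idea, assuming a hypothetical accounts table and REPEATABLE READ (so the snapshot stays fixed for the whole transaction):

Snapshot isolation walkthrough (SQL)
-- Session A
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;  -- returns, say, 100

-- Session B (separate connection, autocommit)
UPDATE accounts SET balance = 50 WHERE id = 1;  -- new row version; old one stays

-- Session A again: its photograph predates B's commit
SELECT balance FROM accounts WHERE id = 1;  -- still 100
COMMIT;
-- A fresh transaction in session A now sees 50; the old version is a dead tuple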

The Cost of MVCC

MVCC is not free. Every UPDATE creates a new row version and leaves the old one as a dead tuple. Every DELETE marks a row as dead but doesn't reclaim space. Without VACUUM, tables bloat indefinitely. This is PostgreSQL's fundamental operational trade-off: non-blocking reads in exchange for mandatory background maintenance.

See MVCC in action (SQL)
-- See xmin/xmax on rows
SELECT xmin, xmax, * FROM orders LIMIT 5;

-- See dead tuples accumulating
SELECT relname, n_live_tup, n_dead_tup,
       round(n_dead_tup * 100.0 / nullif(n_live_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Check transaction ID age (wraparound risk)
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
07

PostgreSQL vs MySQL

This comparison matters because it's the most common interview question about database choice. The differences are architectural, not superficial.

| Aspect | PostgreSQL | MySQL (InnoDB) |
|---|---|---|
| MVCC Implementation | Stores old versions in main heap — requires VACUUM | Stores old versions in undo log — auto-purged |
| Storage Engine | Single engine, heap-based | Pluggable engines (InnoDB default, clustered index) |
| Clustered Index | No clustered index — heap storage | InnoDB always has clustered index on PK |
| Standards Compliance | Strict SQL standard compliance | Historically loose, improved over time |
| Extension System | Rich: custom types, operators, index methods, FDWs | Limited: plugins, UDFs |
| Replication | Streaming (physical) + logical | Binary log (logical) + Group Replication |
| JSON Support | jsonb — binary, indexable, full operator set | JSON — binary format since 5.7; indexing via generated columns |
| Full-text Search | Built-in tsvector/tsquery with GIN | Built-in but less flexible |
| Partitioning | Declarative (range, list since PG 10; hash since PG 11) | Range, list, hash, key |
| Connection Model | Process per connection (needs pooler) | Thread per connection (lighter) |

When MySQL Wins

MySQL's thread-per-connection model handles more connections natively. Its clustered index means primary key lookups are faster (data is co-located with the index). For simple read-heavy workloads with straightforward schemas, MySQL can be simpler to operate. But for complex queries, advanced types, and strict correctness guarantees, PostgreSQL is the stronger choice.

08

Interview Questions

Q:Why does PostgreSQL need VACUUM but MySQL doesn't?

A: PostgreSQL stores old row versions in the main heap table. When a row is updated, the old version remains in-place as a dead tuple until VACUUM reclaims it. MySQL (InnoDB) stores old versions in a separate undo log that is automatically purged. PostgreSQL's approach enables simpler crash recovery and replication but requires active maintenance.

Q:Why is connection pooling mandatory for PostgreSQL in production?

A: PostgreSQL forks a new OS process for each connection (~5-10 MB RAM each). At 500+ connections, you're burning gigabytes on process overhead alone, and context switching between processes degrades performance. PgBouncer in transaction pooling mode lets thousands of application threads share a small pool of actual database connections (typically 20-50).

Q:What is the WAL and why does PostgreSQL write it before data pages?

A: The Write-Ahead Log is a sequential, append-only record of all changes. By writing WAL before modifying data pages, PostgreSQL guarantees that committed transactions survive crashes — even if data pages weren't flushed to disk. After a crash, PostgreSQL replays the WAL from the last checkpoint to restore consistency. WAL also enables replication and point-in-time recovery.

Q:Explain MVCC visibility rules in PostgreSQL.

A: Each row has xmin (creating transaction) and xmax (deleting transaction). A row is visible to transaction T if: (1) xmin is committed and started before T's snapshot, AND (2) xmax is either 0, not yet committed, or committed after T's snapshot. This means each transaction sees a consistent snapshot without acquiring read locks.

Q:What happens if VACUUM never runs?

A: Three problems compound: (1) Table bloat — dead tuples waste disk and slow sequential scans. (2) Index bloat — indexes point to dead tuples, growing larger. (3) Transaction ID wraparound — PostgreSQL uses 32-bit transaction IDs (~2 billion). Without VACUUM freezing old rows, the counter wraps around and PostgreSQL stops accepting writes to prevent data loss. This is the 'PostgreSQL doomsday scenario.'

09

Common Mistakes

⚠️

Setting max_connections too high

Setting max_connections = 1000 thinking more connections = more throughput.

Keep max_connections low (100-200) and use PgBouncer for connection pooling. More connections means more process overhead, more lock contention, and worse performance — not better.

⚠️

Ignoring shared_buffers tuning

Running with the default shared_buffers = 128 MB on a server with 64 GB RAM.

Set shared_buffers to 25% of RAM (16 GB in this case). The default is absurdly low because PostgreSQL ships configured for a laptop, not a production server.

⚠️

Setting work_mem too high globally

Setting work_mem = 1 GB globally because queries are slow.

work_mem is per-operation per-connection. With 100 connections and complex queries, 1 GB work_mem could consume 100+ GB. Set it low globally (16-64 MB) and increase per-session for specific analytical queries: SET work_mem = '512MB';

⚠️

Not monitoring autovacuum

Assuming autovacuum 'just works' and never checking if it's keeping up.

Monitor n_dead_tup in pg_stat_user_tables. If dead tuples grow faster than autovacuum cleans them, tune per-table: ALTER TABLE hot_table SET (autovacuum_vacuum_scale_factor = 0.01). Watch for transaction ID age approaching 2 billion.
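
A monitoring sketch using the standard statistics views (pg_stat_progress_vacuum is available since PG 9.6):

Autovacuum health check (SQL)
-- Which tables are falling behind, and when were they last vacuumed?
SELECT relname, n_dead_tup, last_autovacuum, autovacuum_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Vacuums running right now, with progress
SELECT pid, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;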