Architecture & Internals
How PostgreSQL actually works under the hood — the process model, memory architecture, storage engine, and the WAL that makes durability possible.
PostgreSQL's Origin & Philosophy
PostgreSQL traces back to 1986 at UC Berkeley. Michael Stonebraker's POSTGRES project (Post-Ingres) aimed to solve the limitations of the relational model by adding extensibility — custom types, operators, and access methods. The "object-relational" in PostgreSQL means you can define your own data types and teach the database how to index them.
The core philosophy: correctness first, performance second. PostgreSQL will never sacrifice data integrity for speed. This is why it's the default choice when you need to trust your data — financial systems, healthcare, anything where "eventually consistent" isn't acceptable.
The Bank Vault vs The Cash Register
MySQL is like a fast cash register — optimized for quick transactions, sometimes at the cost of strict guarantees. PostgreSQL is like a bank vault — every operation is verified, every constraint enforced, every transaction truly atomic. You pay slightly more in overhead, but you never lose money to a race condition.
The 'Just Use Postgres' Default
For most new systems, PostgreSQL is the correct starting database. It handles OLTP, has decent full-text search, supports JSON documents via jsonb, does geospatial with PostGIS, and scales reads via streaming replication. You only need a specialized database when PostgreSQL demonstrably cannot handle your specific access pattern at your specific scale.
When PostgreSQL is the Wrong Choice
- ❌ Extreme write throughput (millions/sec) — Cassandra or ScyllaDB territory
- ❌ Simple key-value at massive scale with single-digit ms latency — DynamoDB territory
- ❌ Full-text search as the primary feature at huge scale — Elasticsearch territory
- ❌ Time-series with billions of data points — the TimescaleDB extension (which keeps you in PostgreSQL) or InfluxDB
- ❌ Graph traversal as the primary query pattern — Neo4j or Neptune
Process Architecture
PostgreSQL uses a multi-process model — one OS process per client connection. This is fundamentally different from thread-based databases like MySQL. Each backend process has its own memory space, which provides isolation but costs ~5–10 MB per connection.
| Process | Role | Count |
|---|---|---|
| Postmaster | Parent process — listens for connections, forks backends | 1 |
| Backend | Handles one client connection — parses, plans, executes queries | 1 per connection |
| WAL Writer | Flushes WAL buffers to disk periodically | 1 |
| Checkpointer | Writes dirty pages to disk at checkpoint intervals | 1 |
| Background Writer | Gradually writes dirty buffers to reduce checkpoint spikes | 1 |
| Autovacuum Launcher | Spawns autovacuum workers to reclaim dead tuples | 1 |
| Stats Collector | Collects table/index usage statistics (replaced by shared-memory stats in PostgreSQL 15) | 1 |
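Since PostgreSQL 10, pg_stat_activity labels every process with its type, so the whole process tree is visible from SQL:

```sql
-- List every PostgreSQL process and its role (PostgreSQL 10+)
SELECT backend_type, count(*)
FROM pg_stat_activity
GROUP BY backend_type
ORDER BY count(*) DESC;
```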
The Connection Problem
Because each connection is a full OS process, PostgreSQL degrades badly beyond ~200–300 connections. At 1,000 connections you are burning 5–10 GB of RAM on process overhead alone, before any query runs. This is why connection pooling (PgBouncer) is effectively mandatory in production.
```sql
-- See all active connections
SELECT pid, usename, application_name, state, query_start
FROM pg_stat_activity
WHERE state = 'active';

-- Check max connections setting
SHOW max_connections; -- default: 100

-- Connection count by state
SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state;
```
Memory Architecture
PostgreSQL's memory is split into shared memory (accessible by all processes) and per-process memory (private to each backend). Understanding this split is essential for tuning.
| Parameter | What It Does | Recommended |
|---|---|---|
| shared_buffers | PostgreSQL's internal page cache — hot data lives here | 25% of RAM |
| effective_cache_size | Hint to planner about total cache (shared_buffers + OS cache) | 75% of RAM |
| work_mem | Memory per sort/hash operation per query | 4–64 MB (careful!) |
| maintenance_work_mem | Memory for VACUUM, CREATE INDEX, ALTER TABLE | 256 MB – 1 GB |
| wal_buffers | In-memory buffer before WAL hits disk | 64 MB |
The work_mem Trap
work_mem is per operation, not per query. A complex query with 5 sorts and 3 hash joins uses 8× work_mem. With 100 concurrent connections, that's potentially 100 × 8 × work_mem. Setting work_mem = 256 MB sounds reasonable until you realize it could consume 200 GB across concurrent queries. Start low (4–16 MB), increase only for specific workloads.
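Rather than raising work_mem globally, the usual pattern is to raise it only for the session or transaction that actually needs a big sort. A sketch, assuming a hypothetical `events` table:

```sql
-- Raise work_mem only for this transaction, not globally.
-- 'events' is a hypothetical table used for illustration.
BEGIN;
SET LOCAL work_mem = '256MB';   -- reverts automatically at COMMIT/ROLLBACK
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, count(*)
FROM events
GROUP BY user_id
ORDER BY count(*) DESC;
COMMIT;
```

SET LOCAL scopes the change to the transaction, so a connection returned to a pool cannot leak the inflated setting to the next query.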
Two Layers of Cache
PostgreSQL has a double-buffering architecture. shared_buffers is PostgreSQL's own cache (the house's pantry). The OS page cache is the second layer (the neighborhood store). When PostgreSQL reads a page not in shared_buffers, it goes to the OS — which might have it cached. This is why effective_cache_size should be 75% of RAM: it tells the planner 'between my cache and the OS cache, this much data is probably already in memory.'
```sql
-- See shared buffer hit ratio (should be > 99%)
SELECT sum(blks_hit) * 100.0 / sum(blks_hit + blks_read) AS hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

-- Check current memory settings
SHOW shared_buffers;
SHOW work_mem;
SHOW effective_cache_size;
```
Storage Architecture
Tables in PostgreSQL are stored as heap files — unordered collections of 8 KB pages. Each page contains a header, item pointers (line pointers), free space, and the actual tuple data. There is no clustered index by default — rows are stored in insertion order.
| Concept | Description |
|---|---|
| Page (Block) | 8 KB unit of I/O — everything is read/written in pages |
| Tuple | A single row version — header (23 bytes) + null bitmap + data |
| Item Pointer | 4-byte pointer from page header to tuple location within page |
| CTID | Physical location of a tuple: (page_number, item_number) |
| Fill Factor | How full to pack pages (default 100%) — lower it to leave room for HOT updates |
| Tablespace | Map tables/indexes to specific storage devices |
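The CTID column makes the heap layout visible: each row reports its physical (page, item) address. A quick demonstration on a throwaway table:

```sql
-- CTID = (page_number, item_number), the row's physical address in the heap
CREATE TEMP TABLE heap_demo (id int, note text);
INSERT INTO heap_demo SELECT g, repeat('x', 100) FROM generate_series(1, 1000) g;

SELECT ctid, id FROM heap_demo LIMIT 3;    -- first rows land on page 0: (0,1), (0,2), ...
SELECT ctid, id FROM heap_demo OFFSET 997; -- later rows spill onto later pages
```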
TOAST — The Oversized Attribute Storage Technique
When a row exceeds ~2 KB, PostgreSQL automatically moves large column values to a separate TOAST table. This keeps the main table's pages efficient for scanning. TOAST compresses values (pglz by default; lz4 is available since PostgreSQL 14) and stores them out-of-line.
TOAST Storage Strategies
- ✅ PLAIN — no compression, no out-of-line storage (fixed-length types)
- ✅ EXTENDED — compress first, then store out-of-line if still too large (default for text/jsonb)
- ✅ EXTERNAL — store out-of-line without compression (useful for pre-compressed data)
- ✅ MAIN — try compression, avoid out-of-line unless absolutely necessary
```sql
-- See physical size of a table (including TOAST and indexes)
SELECT pg_size_pretty(pg_total_relation_size('orders')) AS total,
       pg_size_pretty(pg_relation_size('orders')) AS table_only,
       pg_size_pretty(pg_indexes_size('orders')) AS indexes;

-- See TOAST table for a relation
SELECT relname, reltoastrelid::regclass
FROM pg_class
WHERE relname = 'orders';
```
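Storage strategy is set per column and can be changed. If a column holds data that is already compressed, for example, EXTERNAL skips the pointless pglz pass. The `documents` table and `payload` column here are hypothetical:

```sql
-- Hypothetical table: skip TOAST compression for pre-compressed payloads
ALTER TABLE documents ALTER COLUMN payload SET STORAGE EXTERNAL;

-- Inspect strategies: p = PLAIN, m = MAIN, e = EXTERNAL, x = EXTENDED
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'documents'::regclass AND attnum > 0;
```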
Write-Ahead Log (WAL)
The WAL is PostgreSQL's fundamental durability mechanism. The rule is simple: write the log before writing the data. Before any change to a data page is flushed to disk, the corresponding WAL record must be on stable storage. This guarantees that after a crash, PostgreSQL can replay the WAL to recover any committed transaction.
The Captain's Log
Imagine a ship's captain who writes every order in a logbook before the crew executes it. If the ship sinks and is recovered, you can replay the logbook to know exactly what happened. The WAL is that logbook — sequential, append-only, and written before any action is taken. Even if the data files are corrupted, the WAL can reconstruct the correct state.
| WAL Concept | Description |
|---|---|
| WAL Segment | 16 MB file — WAL is split into segments for management |
| LSN (Log Sequence Number) | Monotonically increasing pointer into the WAL stream |
| WAL Writer | Background process that flushes WAL buffers to disk |
| Checkpoint | Flush all dirty pages to disk + write checkpoint record |
| checkpoint_completion_target | Spread checkpoint I/O over time (default 0.9 = 90% of interval) |
| WAL Archiving | Copy completed WAL segments to safe storage for PITR |
WAL is the Foundation of Everything
The WAL isn't just for crash recovery. It's the foundation for: streaming replication (standbys replay WAL from primary), point-in-time recovery (replay WAL to any timestamp), and logical decoding (extract row-level changes from WAL for CDC). Understanding WAL is understanding how PostgreSQL achieves durability, replication, and backup — all three.
```sql
-- Current WAL position
SELECT pg_current_wal_lsn();

-- WAL generation rate
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') / 1024 / 1024 AS total_wal_mb;

-- Check checkpoint frequency
SELECT checkpoints_timed, checkpoints_req,
       checkpoint_write_time, checkpoint_sync_time
FROM pg_stat_bgwriter;
```
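Because an LSN is a byte offset into the WAL stream, subtracting two LSNs gives the bytes of WAL written between them. This is useful for measuring WAL volume or replication lag:

```sql
-- LSN arithmetic: the difference between two LSNs is a byte count
SELECT pg_size_pretty(pg_wal_lsn_diff('0/3000060', '0/3000000'));

-- Replication lag in bytes for each standby (run on the primary)
SELECT application_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag
FROM pg_stat_replication;
```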
MVCC — The Core Mechanism
MVCC (Multi-Version Concurrency Control) is how PostgreSQL achieves non-blocking reads. Instead of locking rows when they're being modified, PostgreSQL keeps multiple versions of each row. Readers see a consistent snapshot without waiting for writers. Writers create new row versions without blocking readers.
| Field | Meaning |
|---|---|
| xmin | Transaction ID that created this row version (INSERT or UPDATE) |
| xmax | Transaction ID that deleted/updated this row version (0 if still live) |
| Visibility Rule | Row visible if: xmin committed before my snapshot AND xmax not committed before my snapshot |
| Dead Tuple | Old row version no longer visible to any active transaction |
| VACUUM | Reclaims space from dead tuples — the cost of MVCC |
Snapshot Photography
Each transaction gets a 'photograph' of the database at the moment it starts (or at each statement, depending on isolation level). Other transactions can modify data freely — your photograph doesn't change. When you commit your own changes, you create a new version that future photographs will see. Old versions (that no photograph needs anymore) are eventually cleaned up by VACUUM.
The Cost of MVCC
MVCC is not free. Every UPDATE creates a new row version and leaves the old one as a dead tuple. Every DELETE marks a row as dead but doesn't reclaim space. Without VACUUM, tables bloat indefinitely. This is PostgreSQL's fundamental operational trade-off: non-blocking reads in exchange for mandatory background maintenance.
```sql
-- See xmin/xmax on rows
SELECT xmin, xmax, * FROM orders LIMIT 5;

-- See dead tuples accumulating
SELECT relname, n_live_tup, n_dead_tup,
       round(n_dead_tup * 100.0 / nullif(n_live_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Check transaction ID age (wraparound risk)
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```
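The version mechanics can be observed directly on a throwaway table: after an UPDATE, the visible row has a new xmin and a new ctid, proof that a brand-new version was written rather than the old one modified in place:

```sql
-- Watch MVCC create a new row version on UPDATE (throwaway table)
CREATE TEMP TABLE mvcc_demo (id int, v text);
INSERT INTO mvcc_demo VALUES (1, 'original');
SELECT ctid, xmin, xmax, v FROM mvcc_demo;  -- note ctid and xmin

UPDATE mvcc_demo SET v = 'updated' WHERE id = 1;
SELECT ctid, xmin, xmax, v FROM mvcc_demo;  -- new ctid, new xmin: a new version;
                                            -- the old one is now a dead tuple
```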
PostgreSQL vs MySQL
This comparison matters because it's the most common interview question about database choice. The differences are architectural, not superficial.
| Aspect | PostgreSQL | MySQL (InnoDB) |
|---|---|---|
| MVCC Implementation | Stores old versions in main heap — requires VACUUM | Stores old versions in undo log — auto-purged |
| Storage Engine | Single engine, heap-based | Pluggable engines (InnoDB default, clustered index) |
| Clustered Index | No clustered index — heap storage | InnoDB always has clustered index on PK |
| Standards Compliance | Strict SQL standard compliance | Historically loose, improved over time |
| Extension System | Rich: custom types, operators, index methods, FDW | Limited: plugins, UDFs |
| Replication | Streaming (physical) + Logical | Binary log (logical) + Group Replication |
| JSON Support | jsonb — binary, indexed, full operator set | JSON — text-based, limited indexing |
| Full-text Search | Built-in tsvector/tsquery with GIN | Built-in but less flexible |
| Partitioning | Declarative (range, list, hash) since PG 10 | Range, list, hash, key |
| Connection Model | Process per connection (needs pooler) | Thread per connection (lighter) |
When MySQL Wins
MySQL's thread-per-connection model handles more connections natively. Its clustered index means primary key lookups are faster (data is co-located with the index). For simple read-heavy workloads with straightforward schemas, MySQL can be simpler to operate. But for complex queries, advanced types, and strict correctness guarantees, PostgreSQL is the stronger choice.
Interview Questions
Q: Why does PostgreSQL need VACUUM but MySQL doesn't?
A: PostgreSQL stores old row versions in the main heap table. When a row is updated, the old version remains in-place as a dead tuple until VACUUM reclaims it. MySQL (InnoDB) stores old versions in a separate undo log that is automatically purged. PostgreSQL's approach enables simpler crash recovery and replication but requires active maintenance.
Q: Why is connection pooling mandatory for PostgreSQL in production?
A: PostgreSQL forks a new OS process for each connection (~5-10 MB RAM each). At 500+ connections, you're burning gigabytes on process overhead alone, and context switching between processes degrades performance. PgBouncer in transaction pooling mode lets thousands of application threads share a small pool of actual database connections (typically 20-50).
Q: What is the WAL and why does PostgreSQL write it before data pages?
A: The Write-Ahead Log is a sequential, append-only record of all changes. By writing WAL before modifying data pages, PostgreSQL guarantees that committed transactions survive crashes — even if data pages weren't flushed to disk. After a crash, PostgreSQL replays the WAL from the last checkpoint to restore consistency. WAL also enables replication and point-in-time recovery.
Q: Explain MVCC visibility rules in PostgreSQL.
A: Each row has xmin (creating transaction) and xmax (deleting transaction). A row is visible to transaction T if: (1) xmin is committed and started before T's snapshot, AND (2) xmax is either 0, not yet committed, or committed after T's snapshot. This means each transaction sees a consistent snapshot without acquiring read locks.
Q: What happens if VACUUM never runs?
A: Three problems compound: (1) Table bloat — dead tuples waste disk and slow sequential scans. (2) Index bloat — indexes point to dead tuples, growing larger. (3) Transaction ID wraparound — PostgreSQL uses 32-bit transaction IDs (~2 billion). Without VACUUM freezing old rows, the counter wraps around and PostgreSQL stops accepting writes to prevent data loss. This is the 'PostgreSQL doomsday scenario.'
Common Mistakes
Setting max_connections too high
❌ Setting max_connections = 1000 thinking more connections = more throughput.
✅ Keep max_connections low (100–200) and use PgBouncer for connection pooling. More connections means more process overhead, more lock contention, and worse performance — not better.
Ignoring shared_buffers tuning
❌ Running with the default shared_buffers = 128 MB on a server with 64 GB RAM.
✅ Set shared_buffers to 25% of RAM (16 GB in this case). The default is absurdly low because PostgreSQL ships configured for a laptop, not a production server.
Setting work_mem too high globally
❌ Setting work_mem = 1 GB globally because queries are slow.
✅ work_mem is per-operation per-connection. With 100 connections and complex queries, 1 GB work_mem could consume 100+ GB. Set it low globally (16–64 MB) and increase per-session for specific analytical queries: SET work_mem = '512MB';
Not monitoring autovacuum
❌ Assuming autovacuum 'just works' and never checking if it's keeping up.
✅ Monitor n_dead_tup in pg_stat_user_tables. If dead tuples grow faster than autovacuum cleans them, tune per-table: ALTER TABLE hot_table SET (autovacuum_vacuum_scale_factor = 0.01). Watch for transaction ID age approaching 2 billion.
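A starting point for that monitoring: when autovacuum last touched each table, and how much dead-tuple debt remains:

```sql
-- Is autovacuum keeping up? Last run time and dead-tuple backlog per table
SELECT relname,
       last_autovacuum,
       autovacuum_count,
       n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```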