Architecture & Internals
How PostgreSQL actually works under the hood — the process model, memory architecture, storage engine, and the WAL that makes durability possible.
PostgreSQL's Origin & Philosophy
PostgreSQL traces back to 1986 at UC Berkeley. Michael Stonebraker's POSTGRES project (Post-Ingres) aimed to solve the limitations of the relational model by adding extensibility — custom types, operators, and access methods. The "object-relational" in PostgreSQL means you can define your own data types and teach the database how to index them.
The core philosophy: correctness first, performance second. PostgreSQL will never sacrifice data integrity for speed. This is why it's the default choice when you need to trust your data — financial systems, healthcare, anything where "eventually consistent" isn't acceptable.
The Bank Vault vs The Cash Register
MySQL is like a fast cash register — optimized for quick transactions, sometimes at the cost of strict guarantees. PostgreSQL is like a bank vault — every operation is verified, every constraint enforced, every transaction truly atomic. You pay slightly more in overhead, but you never lose money to a race condition.
The 'Just Use Postgres' Default
For most new systems, PostgreSQL is the correct starting database. It handles OLTP, has decent full-text search, supports JSON documents via jsonb, does geospatial with PostGIS, and scales reads via streaming replication. You only need a specialized database when PostgreSQL demonstrably cannot handle your specific access pattern at your specific scale.
When PostgreSQL is the Wrong Choice
- ❌ Extreme write throughput (millions/sec) — Cassandra or ScyllaDB territory
- ❌ Simple key-value at massive scale with single-digit ms latency — DynamoDB territory
- ❌ Full-text search as the primary feature at huge scale — Elasticsearch territory
- ❌ Time-series with billions of data points — the TimescaleDB extension (which keeps you in PostgreSQL) or InfluxDB
- ❌ Graph traversal as the primary query pattern — Neo4j or Neptune
Process Architecture
PostgreSQL uses a multi-process model — one OS process per client connection. This is fundamentally different from thread-based databases like MySQL. Each backend process has its own memory space, which provides isolation but costs ~5–10 MB per connection.
| Process | Role | Count |
|---|---|---|
| Postmaster | Parent process — listens for connections, forks backends | 1 |
| Backend | Handles one client connection — parses, plans, executes queries | 1 per connection |
| WAL Writer | Flushes WAL buffers to disk periodically | 1 |
| Checkpointer | Writes dirty pages to disk at checkpoint intervals | 1 |
| Background Writer | Gradually writes dirty buffers to reduce checkpoint spikes | 1 |
| Autovacuum Launcher | Spawns autovacuum workers to reclaim dead tuples | 1 |
| Stats Collector | Collects table/index usage statistics (replaced by shared-memory stats in PostgreSQL 15) | 1 |
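Since PostgreSQL 10, pg_stat_activity labels every process with its type, so the whole process tree is visible from SQL:

```sql
-- List every PostgreSQL process and its role (PostgreSQL 10+)
SELECT backend_type, count(*)
FROM pg_stat_activity
GROUP BY backend_type
ORDER BY count(*) DESC;
```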
The Connection Problem
Because each connection is a full OS process, PostgreSQL degrades badly beyond ~200–300 connections. At 1,000 connections you are burning 5–10 GB of RAM on process overhead alone, before any query runs. This is why connection pooling (PgBouncer) is effectively mandatory in production.
```sql
-- See all active connections
SELECT pid, usename, application_name, state, query_start
FROM pg_stat_activity
WHERE state = 'active';

-- Check max connections setting
SHOW max_connections; -- default: 100

-- Connection count by state
SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state;
```
Memory Architecture
PostgreSQL's memory is split into shared memory (accessible by all processes) and per-process memory (private to each backend). Understanding this split is essential for tuning.
| Parameter | What It Does | Recommended |
|---|---|---|
| shared_buffers | PostgreSQL's internal page cache — hot data lives here | 25% of RAM |
| effective_cache_size | Hint to planner about total cache (shared_buffers + OS cache) | 75% of RAM |
| work_mem | Memory per sort/hash operation per query | 4–64 MB (careful!) |
| maintenance_work_mem | Memory for VACUUM, CREATE INDEX, ALTER TABLE | 256 MB – 1 GB |
| wal_buffers | In-memory buffer before WAL hits disk | 64 MB |
The work_mem Trap
work_mem is per operation, not per query. A complex query with 5 sorts and 3 hash joins uses 8× work_mem. With 100 concurrent connections, that's potentially 100 × 8 × work_mem. Setting work_mem = 256 MB sounds reasonable until you realize it could consume 200 GB across concurrent queries. Start low (4–16 MB), increase only for specific workloads.
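Rather than raising work_mem globally, the usual pattern is to raise it only for the session or transaction that actually needs a big sort. A sketch, assuming a hypothetical `events` table:

```sql
-- Raise work_mem only for this transaction, not globally.
-- 'events' is a hypothetical table used for illustration.
BEGIN;
SET LOCAL work_mem = '256MB';   -- reverts automatically at COMMIT/ROLLBACK
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, count(*)
FROM events
GROUP BY user_id
ORDER BY count(*) DESC;
COMMIT;
```

SET LOCAL scopes the change to the transaction, so a connection returned to a pool cannot leak the inflated setting to the next query.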
Two Layers of Cache
PostgreSQL has a double-buffering architecture. shared_buffers is PostgreSQL's own cache (the house's pantry). The OS page cache is the second layer (the neighborhood store). When PostgreSQL reads a page not in shared_buffers, it goes to the OS — which might have it cached. This is why effective_cache_size should be 75% of RAM: it tells the planner 'between my cache and the OS cache, this much data is probably already in memory.'
```sql
-- See shared buffer hit ratio (should be > 99%)
SELECT sum(blks_hit) * 100.0 / sum(blks_hit + blks_read) AS hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

-- Check current memory settings
SHOW shared_buffers;
SHOW work_mem;
SHOW effective_cache_size;
```
Storage Architecture
Tables in PostgreSQL are stored as heap files — unordered collections of 8 KB pages. Each page contains a header, item pointers (line pointers), free space, and the actual tuple data. There is no clustered index by default — rows are stored in insertion order.
| Concept | Description |
|---|---|
| Page (Block) | 8 KB unit of I/O — everything is read/written in pages |
| Tuple | A single row version — header (23 bytes) + null bitmap + data |
| Item Pointer | 4-byte pointer from page header to tuple location within page |
| CTID | Physical location of a tuple: (page_number, item_number) |
| Fill Factor | How full to pack pages (default 100%) — lower it to leave room for HOT updates |
| Tablespace | Map tables/indexes to specific storage devices |
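The CTID column makes the heap layout visible: each row reports its physical (page, item) address. A quick demonstration on a throwaway table:

```sql
-- CTID = (page_number, item_number), the row's physical address in the heap
CREATE TEMP TABLE heap_demo (id int, note text);
INSERT INTO heap_demo SELECT g, repeat('x', 100) FROM generate_series(1, 1000) g;

SELECT ctid, id FROM heap_demo LIMIT 3;    -- first rows land on page 0: (0,1), (0,2), ...
SELECT ctid, id FROM heap_demo OFFSET 997; -- later rows spill onto later pages
```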
TOAST — The Oversized Attribute Storage Technique
When a row exceeds ~2 KB, PostgreSQL automatically moves large column values to a separate TOAST table. This keeps the main table's pages efficient for scanning. TOAST compresses values (pglz by default; lz4 is available since PostgreSQL 14) and stores them out-of-line.
TOAST Storage Strategies
- ✅ PLAIN — no compression, no out-of-line storage (fixed-length types)
- ✅ EXTENDED — compress first, then store out-of-line if still too large (default for text/jsonb)
- ✅ EXTERNAL — store out-of-line without compression (useful for pre-compressed data)
- ✅ MAIN — try compression, avoid out-of-line unless absolutely necessary
```sql
-- See physical size of a table (including TOAST and indexes)
SELECT pg_size_pretty(pg_total_relation_size('orders')) AS total,
       pg_size_pretty(pg_relation_size('orders')) AS table_only,
       pg_size_pretty(pg_indexes_size('orders')) AS indexes;

-- See TOAST table for a relation
SELECT relname, reltoastrelid::regclass
FROM pg_class
WHERE relname = 'orders';
```
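Storage strategy is set per column and can be changed. If a column holds data that is already compressed, for example, EXTERNAL skips the pointless pglz pass. The `documents` table and `payload` column here are hypothetical:

```sql
-- Hypothetical table: skip TOAST compression for pre-compressed payloads
ALTER TABLE documents ALTER COLUMN payload SET STORAGE EXTERNAL;

-- Inspect strategies: p = PLAIN, m = MAIN, e = EXTERNAL, x = EXTENDED
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'documents'::regclass AND attnum > 0;
```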
Write-Ahead Log (WAL)
The WAL is PostgreSQL's fundamental durability mechanism. The rule is simple: write the log before writing the data. Before any change to a data page is flushed to disk, the corresponding WAL record must be on stable storage. This guarantees that after a crash, PostgreSQL can replay the WAL to recover any committed transaction.
The Captain's Log
Imagine a ship's captain who writes every order in a logbook before the crew executes it. If the ship sinks and is recovered, you can replay the logbook to know exactly what happened. The WAL is that logbook — sequential, append-only, and written before any action is taken. Even if the data files are corrupted, the WAL can reconstruct the correct state.
| WAL Concept | Description |
|---|---|
| WAL Segment | 16 MB file — WAL is split into segments for management |
| LSN (Log Sequence Number) | Monotonically increasing pointer into the WAL stream |
| WAL Writer | Background process that flushes WAL buffers to disk |
| Checkpoint | Flush all dirty pages to disk + write checkpoint record |
| checkpoint_completion_target | Spread checkpoint I/O over time (default 0.9 = 90% of interval) |
| WAL Archiving | Copy completed WAL segments to safe storage for PITR |
WAL is the Foundation of Everything
The WAL isn't just for crash recovery. It's the foundation for: streaming replication (standbys replay WAL from primary), point-in-time recovery (replay WAL to any timestamp), and logical decoding (extract row-level changes from WAL for CDC). Understanding WAL is understanding how PostgreSQL achieves durability, replication, and backup — all three.
```sql
-- Current WAL position
SELECT pg_current_wal_lsn();

-- WAL generation rate
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') / 1024 / 1024 AS total_wal_mb;

-- Check checkpoint frequency
SELECT checkpoints_timed, checkpoints_req,
       checkpoint_write_time, checkpoint_sync_time
FROM pg_stat_bgwriter;
```
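Because an LSN is a byte offset into the WAL stream, subtracting two LSNs gives the bytes of WAL written between them. This is useful for measuring WAL volume or replication lag:

```sql
-- LSN arithmetic: the difference between two LSNs is a byte count
SELECT pg_size_pretty(pg_wal_lsn_diff('0/3000060', '0/3000000'));

-- Replication lag in bytes for each standby (run on the primary)
SELECT application_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag
FROM pg_stat_replication;
```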
MVCC — The Core Mechanism
MVCC (Multi-Version Concurrency Control) is how PostgreSQL achieves non-blocking reads. Instead of locking rows when they're being modified, PostgreSQL keeps multiple versions of each row. Readers see a consistent snapshot without waiting for writers. Writers create new row versions without blocking readers.
| Field | Meaning |
|---|---|
| xmin | Transaction ID that created this row version (INSERT or UPDATE) |
| xmax | Transaction ID that deleted/updated this row version (0 if still live) |
| Visibility Rule | Row visible if: xmin committed before my snapshot AND xmax not committed before my snapshot |
| Dead Tuple | Old row version no longer visible to any active transaction |
| VACUUM | Reclaims space from dead tuples — the cost of MVCC |
Snapshot Photography
Each transaction gets a 'photograph' of the database at the moment it starts (or at each statement, depending on isolation level). Other transactions can modify data freely — your photograph doesn't change. When you commit your own changes, you create a new version that future photographs will see. Old versions (that no photograph needs anymore) are eventually cleaned up by VACUUM.
The Cost of MVCC
MVCC is not free. Every UPDATE creates a new row version and leaves the old one as a dead tuple. Every DELETE marks a row as dead but doesn't reclaim space. Without VACUUM, tables bloat indefinitely. This is PostgreSQL's fundamental operational trade-off: non-blocking reads in exchange for mandatory background maintenance.
```sql
-- See xmin/xmax on rows
SELECT xmin, xmax, * FROM orders LIMIT 5;

-- See dead tuples accumulating
SELECT relname, n_live_tup, n_dead_tup,
       round(n_dead_tup * 100.0 / nullif(n_live_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Check transaction ID age (wraparound risk)
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```
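The version mechanics can be observed directly on a throwaway table: after an UPDATE, the visible row has a new xmin and a new ctid, proof that a brand-new version was written rather than the old one modified in place:

```sql
-- Watch MVCC create a new row version on UPDATE (throwaway table)
CREATE TEMP TABLE mvcc_demo (id int, v text);
INSERT INTO mvcc_demo VALUES (1, 'original');
SELECT ctid, xmin, xmax, v FROM mvcc_demo;  -- note ctid and xmin

UPDATE mvcc_demo SET v = 'updated' WHERE id = 1;
SELECT ctid, xmin, xmax, v FROM mvcc_demo;  -- new ctid, new xmin: a new version;
                                            -- the old one is now a dead tuple
```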
PostgreSQL vs MySQL
This comparison matters because it's the most common interview question about database choice. The differences are architectural, not superficial.
| Aspect | PostgreSQL | MySQL (InnoDB) |
|---|---|---|
| MVCC Implementation | Stores old versions in main heap — requires VACUUM | Stores old versions in undo log — auto-purged |
| Storage Engine | Single engine, heap-based | Pluggable engines (InnoDB default, clustered index) |
| Clustered Index | No clustered index — heap storage | InnoDB always has clustered index on PK |
| Standards Compliance | Strict SQL standard compliance | Historically loose, improved over time |
| Extension System | Rich: custom types, operators, index methods, FDW | Limited: plugins, UDFs |
| Replication | Streaming (physical) + Logical | Binary log (logical) + Group Replication |
| JSON Support | jsonb — binary, indexed, full operator set | JSON — text-based, limited indexing |
| Full-text Search | Built-in tsvector/tsquery with GIN | Built-in but less flexible |
| Partitioning | Declarative (range, list, hash) since PG 10 | Range, list, hash, key |
| Connection Model | Process per connection (needs pooler) | Thread per connection (lighter) |
When MySQL Wins
MySQL's thread-per-connection model handles more connections natively. Its clustered index means primary key lookups are faster (data is co-located with the index). For simple read-heavy workloads with straightforward schemas, MySQL can be simpler to operate. But for complex queries, advanced types, and strict correctness guarantees, PostgreSQL is the stronger choice.
Interview Questions
Q: Why does PostgreSQL need VACUUM but MySQL doesn't?
A: PostgreSQL stores old row versions in the main heap table. When a row is updated, the old version remains in-place as a dead tuple until VACUUM reclaims it. MySQL (InnoDB) stores old versions in a separate undo log that is automatically purged. PostgreSQL's approach enables simpler crash recovery and replication but requires active maintenance.
Q: Why is connection pooling mandatory for PostgreSQL in production?
A: PostgreSQL forks a new OS process for each connection (~5-10 MB RAM each). At 500+ connections, you're burning gigabytes on process overhead alone, and context switching between processes degrades performance. PgBouncer in transaction pooling mode lets thousands of application threads share a small pool of actual database connections (typically 20-50).
Q: What is the WAL and why does PostgreSQL write it before data pages?
A: The Write-Ahead Log is a sequential, append-only record of all changes. By writing WAL before modifying data pages, PostgreSQL guarantees that committed transactions survive crashes — even if data pages weren't flushed to disk. After a crash, PostgreSQL replays the WAL from the last checkpoint to restore consistency. WAL also enables replication and point-in-time recovery.
Q: Explain MVCC visibility rules in PostgreSQL.
A: Each row has xmin (creating transaction) and xmax (deleting transaction). A row is visible to transaction T if: (1) xmin is committed and started before T's snapshot, AND (2) xmax is either 0, not yet committed, or committed after T's snapshot. This means each transaction sees a consistent snapshot without acquiring read locks.
Q: What happens if VACUUM never runs?
A: Three problems compound: (1) Table bloat — dead tuples waste disk and slow sequential scans. (2) Index bloat — indexes point to dead tuples, growing larger. (3) Transaction ID wraparound — PostgreSQL uses 32-bit transaction IDs (~2 billion). Without VACUUM freezing old rows, the counter wraps around and PostgreSQL stops accepting writes to prevent data loss. This is the 'PostgreSQL doomsday scenario.'
Common Mistakes
Setting max_connections too high
❌ Setting max_connections = 1000 thinking more connections = more throughput.
✅ Keep max_connections low (100–200) and use PgBouncer for connection pooling. More connections means more process overhead, more lock contention, and worse performance — not better.
Ignoring shared_buffers tuning
❌ Running with the default shared_buffers = 128 MB on a server with 64 GB RAM.
✅ Set shared_buffers to 25% of RAM (16 GB in this case). The default is absurdly low because PostgreSQL ships configured for a laptop, not a production server.
Setting work_mem too high globally
❌ Setting work_mem = 1 GB globally because queries are slow.
✅ work_mem is per-operation per-connection. With 100 connections and complex queries, 1 GB work_mem could consume 100+ GB. Set it low globally (16–64 MB) and increase per-session for specific analytical queries: SET work_mem = '512MB';
Not monitoring autovacuum
❌ Assuming autovacuum 'just works' and never checking if it's keeping up.
✅ Monitor n_dead_tup in pg_stat_user_tables. If dead tuples grow faster than autovacuum cleans them, tune per-table: ALTER TABLE hot_table SET (autovacuum_vacuum_scale_factor = 0.01). Watch for transaction ID age approaching 2 billion.
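A starting point for that monitoring: when autovacuum last touched each table, and how much dead-tuple debt remains:

```sql
-- Is autovacuum keeping up? Last run time and dead-tuple backlog per table
SELECT relname,
       last_autovacuum,
       autovacuum_count,
       n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```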