Data Types & Schema Design
PostgreSQL's type system is one of its most underused strengths. Master jsonb, arrays, ranges, and the schema design decisions that are expensive to change after launch.
Table of Contents
Numeric & Character Types
Choosing the right numeric type matters for storage, performance, and correctness. The most common mistake is using numeric wheninteger suffices, or double precision for money.
| Type | Size | Range / Use Case |
|---|---|---|
| smallint | 2 bytes | -32,768 to 32,767 — status codes, small counters |
| integer | 4 bytes | -2.1B to 2.1B — the default choice for most IDs and counts |
| bigint | 8 bytes | -9.2×10¹⁸ to 9.2×10¹⁸ — large tables, snowflake IDs |
| numeric(p,s) | variable | Exact decimal — money, financial calculations |
| real | 4 bytes | 6 decimal digits — scientific data where inexactness is OK |
| double precision | 8 bytes | 15 decimal digits — never use for money |
| serial / bigserial | 4 / 8 bytes | Auto-increment shorthand (creates a sequence) |
Never Use float for Money
0.1 + 0.2 = 0.30000000000000004 in floating point. For any financial calculation, use numeric(precision, scale)or store amounts as integers in the smallest unit (cents, paise).
Character Types
| Type | Behavior | Recommendation |
|---|---|---|
| text | Variable length, no limit | ✅ The default choice — use this |
| varchar(n) | Variable length with limit | Rarely better than text + CHECK |
| char(n) | Fixed length, space-padded | ❌ Almost never the right choice |
In PostgreSQL, text and varchar have identical performance. The length check in varchar(n) adds a constraint but no storage benefit. Prefer text with a CHECKconstraint if you need length validation — it's easier to modify later.
Date, Time & Boolean
| Type | Stores | Use When |
|---|---|---|
| date | Date only (no time) | Birthdays, calendar dates |
| time | Time only (no date) | Rarely useful alone |
| timestamp | Date + time, NO timezone | ❌ Avoid — ambiguous |
| timestamptz | Date + time, stored as UTC | ✅ Always use this |
| interval | Duration: '3 hours', '2 days' | Date arithmetic, scheduling |
Always Use timestamptz
timestamp (without timezone) stores the value as-is — if your server moves to a different timezone, your data means something different.timestamptz converts to UTC on storage and back to the session's timezone on retrieval. It's always unambiguous. There is no valid reason to use timestamp without timezone in a new system.
-- timestamptz stores as UTC, displays in session timezone SET timezone = 'America/New_York'; SELECT now(); -- shows Eastern time SET timezone = 'Asia/Kolkata'; SELECT now(); -- same instant, displayed as IST -- Date arithmetic with intervals SELECT now() + interval '30 days' AS expires_at; SELECT age(now(), created_at) FROM users; -- human-readable duration
JSON & jsonb
PostgreSQL has two JSON types. json stores text as-is and re-parses on every access. jsonb stores decomposed binary — supports indexing, is faster to query, and is almost always the correct choice.
| Aspect | json | jsonb |
|---|---|---|
| Storage | Text (preserves whitespace, key order) | Binary (decomposed, no whitespace) |
| Parse cost | Re-parsed on every access | Parsed once on write |
| Indexing | ❌ Cannot be indexed | ✅ GIN index for containment queries |
| Operators | -> and ->> only | Full set: @>, <@, ?, ?|, ?&, jsonb_path_query |
| Duplicate keys | Preserved | Last value wins |
| When to use | Audit logs where exact input matters | Everything else |
-- Create table with jsonb CREATE TABLE events ( id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, payload jsonb NOT NULL, created_at timestamptz DEFAULT now() ); -- GIN index for containment queries CREATE INDEX idx_events_payload ON events USING GIN (payload); -- Query: find events where payload contains {"type": "purchase"} SELECT * FROM events WHERE payload @> '{"type": "purchase"}'; -- Extract nested value SELECT payload -> 'user' ->> 'email' AS email FROM events; -- jsonb_path_query (SQL/JSON path — PG 12+) SELECT * FROM events WHERE jsonb_path_exists(payload, '$.items[*] ? (@.price > 100)');
jsonb vs Proper Columns
Use jsonb for truly schema-less data — event payloads, user preferences, third-party webhook bodies. If you find yourself querying the same jsonb path in every query, that field should be a proper column with a proper index. jsonb is an escape hatch, not a replacement for schema design.
Arrays & Range Types
Arrays
Any PostgreSQL type can be an array: integer[], text[],jsonb[]. Arrays are useful for tags, permissions, and small collections that don't need their own table.
-- Table with array column CREATE TABLE articles ( id serial PRIMARY KEY, title text NOT NULL, tags text[] DEFAULT '{}' ); -- Insert with array INSERT INTO articles (title, tags) VALUES ('MVCC Explained', ARRAY['postgresql', 'internals', 'mvcc']); -- Query: articles with 'postgresql' tag SELECT * FROM articles WHERE 'postgresql' = ANY(tags); -- Query: articles with ALL of these tags SELECT * FROM articles WHERE tags @> ARRAY['postgresql', 'internals']; -- GIN index for fast array queries CREATE INDEX idx_articles_tags ON articles USING GIN (tags);
Range Types
Range types represent a range of values: int4range,tstzrange, daterange. The killer feature is exclusion constraints — preventing overlapping bookings at the database level.
-- Prevent overlapping room bookings CREATE TABLE bookings ( id serial PRIMARY KEY, room_id integer NOT NULL, during tstzrange NOT NULL, EXCLUDE USING GIST (room_id WITH =, during WITH &&) ); -- This succeeds INSERT INTO bookings (room_id, during) VALUES (1, tstzrange('2024-03-01 10:00', '2024-03-01 12:00')); -- This fails — overlaps with existing booking INSERT INTO bookings (room_id, during) VALUES (1, tstzrange('2024-03-01 11:00', '2024-03-01 13:00')); -- ERROR: conflicting key value violates exclusion constraint
Arrays vs Child Tables
Use arrays for small, fixed-size collections that you query as a whole (tags, permissions). Use a child table when: the collection is large, you need to query/update individual elements, or you need foreign keys on the elements. If you're doing unnest() in most queries, it should probably be a table.
UUID, Enum & Domain Types
UUID
| Aspect | bigserial | UUID v4 | UUID v7 |
|---|---|---|---|
| Size | 8 bytes | 16 bytes | 16 bytes |
| Generation | Sequential (database) | Random | Time-ordered + random |
| Index locality | ✅ Excellent (sequential) | ❌ Poor (random scatter) | ✅ Good (time-ordered) |
| Globally unique | ❌ No (per-table only) | ✅ Yes | ✅ Yes |
| Guessable | Yes (sequential) | No | Partially (time component) |
| Best for | Internal IDs, high-write tables | External-facing IDs | External IDs with good index perf |
-- Built-in UUID generation (PG 13+) CREATE TABLE users ( id uuid PRIMARY KEY DEFAULT gen_random_uuid(), email text UNIQUE NOT NULL ); -- For UUIDv7 (time-ordered), use pg_uuidv7 extension -- or generate in application layer
Enum Types
-- Create enum type CREATE TYPE order_status AS ENUM ( 'pending', 'confirmed', 'shipped', 'delivered', 'cancelled' ); -- Use in table CREATE TABLE orders ( id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, status order_status NOT NULL DEFAULT 'pending' ); -- Enums are sorted by definition order, stored as 4 bytes -- Adding values: only append (cannot reorder or remove) ALTER TYPE order_status ADD VALUE 'refunded' AFTER 'cancelled';
Enum Trade-offs
Enums are compact (4 bytes) and type-safe, but inflexible. You can only append values — never remove or reorder. For frequently changing sets, prefer a lookup table with a foreign key, or a text column with a CHECK constraint (easier to modify with ALTER TABLE).
Domain Types
-- Email domain with validation CREATE DOMAIN email_address AS text CHECK (value ~ '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'); -- Use like any type CREATE TABLE contacts ( id serial PRIMARY KEY, email email_address NOT NULL ); -- Positive money domain CREATE DOMAIN positive_amount AS numeric(12, 2) CHECK (value > 0);
Constraints & Foreign Keys
Constraints are PostgreSQL's way of enforcing data integrity at the database level — not in application code where bugs can bypass them.
| Constraint | Purpose | Example |
|---|---|---|
| NOT NULL | Column must have a value | email text NOT NULL |
| UNIQUE | No duplicate values (allows one NULL) | UNIQUE(email) |
| PRIMARY KEY | NOT NULL + UNIQUE, identifies the row | id bigint PRIMARY KEY |
| CHECK | Arbitrary boolean expression | CHECK (age >= 0 AND age < 150) |
| FOREIGN KEY | References another table's PK/unique | REFERENCES users(id) |
| EXCLUDE | No overlapping values (GiST) | EXCLUDE USING GIST (range WITH &&) |
| DEFAULT | Value when not specified | DEFAULT now() |
CREATE TABLE order_items ( id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, order_id bigint NOT NULL REFERENCES orders(id) ON DELETE CASCADE, product_id bigint NOT NULL REFERENCES products(id) ON DELETE RESTRICT, quantity integer NOT NULL CHECK (quantity > 0) ); -- ON DELETE actions: -- RESTRICT — block delete if children exist (safest) -- CASCADE — delete children too (use for owned entities) -- SET NULL — set FK to NULL (use for optional relationships) -- NO ACTION — like RESTRICT but deferrable -- CRITICAL: always index FK columns on the child table CREATE INDEX idx_order_items_order_id ON order_items(order_id); CREATE INDEX idx_order_items_product_id ON order_items(product_id);
Always Index Foreign Keys
PostgreSQL does NOT automatically create an index on foreign key columns. Without an index, every DELETE or UPDATE on the parent table triggers a sequential scan on the child table to check for references. This is the #1 cause of unexpected slow deletes.
Normalization & Partitioning
Normalization
Normal Forms
- ✅1NF — atomic values, no repeating groups (every column holds a single value)
- ✅2NF — no partial dependencies (every non-key column depends on the full PK)
- ✅3NF — no transitive dependencies (non-key columns don't depend on other non-key columns)
- ✅BCNF — every determinant is a candidate key (stricter than 3NF)
Normalize by default. Denormalize only when you have measured proof that joins are the bottleneck — and even then, consider materialized views first.
Declarative Partitioning
-- Parent table (no data stored here directly) CREATE TABLE events ( id bigint GENERATED ALWAYS AS IDENTITY, event_type text NOT NULL, payload jsonb, created_at timestamptz NOT NULL DEFAULT now() ) PARTITION BY RANGE (created_at); -- Create monthly partitions CREATE TABLE events_2024_01 PARTITION OF events FOR VALUES FROM ('2024-01-01') TO ('2024-02-01'); CREATE TABLE events_2024_02 PARTITION OF events FOR VALUES FROM ('2024-02-01') TO ('2024-03-01'); -- Queries automatically prune irrelevant partitions SELECT * FROM events WHERE created_at >= '2024-01-15' AND created_at < '2024-01-20'; -- Only scans events_2024_01
When Partitioning Helps
Partition when: tables exceed 100M+ rows, you have time-based data with retention policies (drop old partitions instead of DELETE), or you need partition-wise aggregation. Don't partition small tables — it adds planning overhead without benefit.
Interview Questions
Q:When would you use jsonb vs a proper relational column?
A: Use jsonb for truly schema-less data: event payloads, third-party webhook bodies, user preferences with unknown structure. Use proper columns when you query/filter on the field regularly, need foreign keys, or need type safety. The rule: if you're writing WHERE payload->>'status' = 'active' in most queries, 'status' should be a column.
Q:Why use timestamptz over timestamp?
A: timestamp stores the value as-is with no timezone context. If your server moves timezones, or clients are in different timezones, the data becomes ambiguous. timestamptz converts to UTC on storage and displays in the session's timezone — it's always unambiguous. There's no storage difference (both 8 bytes), so always use timestamptz.
Q:What are exclusion constraints and when would you use them?
A: Exclusion constraints prevent rows from having overlapping values using GiST indexes. Classic use case: preventing double-bookings. EXCLUDE USING GIST (room_id WITH =, time_range WITH &&) ensures no two bookings for the same room have overlapping time ranges. This is impossible to enforce reliably with application code due to race conditions.
Q:UUID v4 vs bigserial for primary keys — trade-offs?
A: bigserial: 8 bytes, sequential (great index locality), not globally unique, guessable. UUID v4: 16 bytes, random (poor index locality — random writes scatter across B-tree), globally unique, not guessable. UUID v7 is the best of both: time-ordered (good locality) + random suffix (globally unique). Use bigserial for internal high-write tables, UUID for external-facing IDs.
Common Mistakes
Using timestamp instead of timestamptz
CREATE TABLE events (created_at timestamp DEFAULT now()) — stores without timezone context.
✅Always use timestamptz. It stores as UTC internally and converts to the session timezone on display. There's no performance difference, only correctness.
Not indexing foreign key columns
Adding REFERENCES users(id) without creating an index on the FK column in the child table.
✅PostgreSQL doesn't auto-create FK indexes. Without them, parent table DELETEs trigger sequential scans on child tables. Always: CREATE INDEX idx_child_parent_id ON child(parent_id);
Using varchar(255) everywhere
Cargo-culting varchar(255) from MySQL — every text column is varchar(255).
✅In PostgreSQL, text and varchar have identical performance. Use text as the default. Add a CHECK constraint if you need length validation — it's easier to modify than altering a varchar limit.
Storing money as float
Using double precision or real for financial amounts.
✅Use numeric(12, 2) for exact decimal arithmetic, or store as integer cents (amount_cents integer). Floating point cannot represent 0.1 exactly — you will have rounding errors in financial calculations.