Data Model & CQL
Cassandra data modeling is query-first: you design your tables to serve specific access patterns, not to represent entities. One table per query pattern is the rule, not the exception.
Query-First Design Philosophy
In relational databases, you normalize entities first and write queries later. In Cassandra, you do the exact opposite: list every query your application needs, then create a table optimized for each query. This often means the same data is stored in multiple tables with different primary keys.
The Library Index Cards
A library has the same books indexed multiple ways: by author, by title, by subject, by ISBN. Each index card catalog is a different 'table' optimized for a different lookup. You don't normalize into one master catalog — you denormalize into multiple indexes so every lookup is O(1). Cassandra tables are like these index cards.
🔑 The Golden Rule
In Cassandra, you design tables to answer queries — not to model entities. If you have 5 different queries about users, you might have 5 different tables containing user data, each with a different primary key optimized for that specific query.
```
Step 1: List Access Patterns
═══════════════════════════════════════════════════════════════
Q1: Get user by user_id
Q2: Get all posts by a user, newest first
Q3: Get all posts in a category, newest first
Q4: Get a specific post by post_id
Q5: Get all comments on a post, oldest first

Step 2: Design One Table Per Query
═══════════════════════════════════════════════════════════════
Q1 → users_by_id        (PK: user_id)
Q2 → posts_by_user      (PK: user_id,  CK: created_at DESC)
Q3 → posts_by_category  (PK: category, CK: created_at DESC)
Q4 → posts_by_id        (PK: post_id)
Q5 → comments_by_post   (PK: post_id,  CK: created_at ASC)

Step 3: Accept Denormalization
═══════════════════════════════════════════════════════════════
Post data exists in posts_by_user AND posts_by_category AND
posts_by_id. This is correct. Disk is cheap. Joins are impossible.
Write to all tables on insert (or use materialized views/CDC).
```
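The fan-out write in Step 3 can be sketched in application code. This is a hypothetical in-memory illustration — the dicts stand in for the post tables from Step 2, not for real driver calls:

```python
from datetime import datetime, timezone
from uuid import uuid4

# In-memory stand-ins for three denormalized post tables (Q2-Q4).
posts_by_user = {}      # key: (user_id, created_at)
posts_by_category = {}  # key: (category, created_at)
posts_by_id = {}        # key: post_id

def insert_post(user_id, category, title, body):
    """Fan one logical insert out to every table that serves a post query."""
    post_id = uuid4()
    created_at = datetime.now(timezone.utc)
    row = {"post_id": post_id, "user_id": user_id,
           "category": category, "title": title, "body": body}
    posts_by_user[(user_id, created_at)] = row
    posts_by_category[(category, created_at)] = row
    posts_by_id[post_id] = row
    return post_id

pid = insert_post("alice", "tech", "Hello", "First post")
# The same post is now readable via all three access patterns.
```

In a real application the three writes would typically go into a single logged BATCH so they succeed or fail together.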
Query-First Principles
- ✅ One table per query pattern — each table serves exactly one access pattern
- ✅ Denormalization is correct — duplicate data across tables is expected
- ✅ No joins — if you need data from two entities, co-locate them or duplicate
- ✅ Partition key = equality filter in WHERE clause
- ✅ Clustering columns = range filters and sort order
Primary Key Design
The primary key in Cassandra has two parts: the partition key (determines which node stores the data) and optional clustering columns (determine sort order within the partition). The syntax is: PRIMARY KEY ((partition_key), clustering_col1, clustering_col2).
```sql
-- Anatomy of a Cassandra Primary Key:
-- PRIMARY KEY ((partition_key_cols), clustering_cols)

-- Simple partition key, one clustering column
CREATE TABLE posts_by_user (
    user_id    UUID,
    created_at TIMESTAMP,
    post_id    UUID,
    title      TEXT,
    body       TEXT,
    PRIMARY KEY (user_id, created_at)
    --           ^^^^^^^  ^^^^^^^^^^
    --           partition clustering
    --           key       column
) WITH CLUSTERING ORDER BY (created_at DESC);

-- Composite partition key (multiple columns)
CREATE TABLE sensor_data (
    sensor_id  TEXT,
    day        DATE,
    event_time TIMESTAMP,
    reading    DOUBLE,
    PRIMARY KEY ((sensor_id, day), event_time)
    --           ^^^^^^^^^^^^^^^^  ^^^^^^^^^^
    --           composite partition clustering
    --           key (both required) column
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Multiple clustering columns
CREATE TABLE messages (
    chat_id    UUID,
    bucket     INT,
    sent_at    TIMESTAMP,
    message_id UUID,
    sender_id  UUID,
    body       TEXT,
    PRIMARY KEY ((chat_id, bucket), sent_at, message_id)
    --           ^^^^^^^^^^^^^^^^^  ^^^^^^^  ^^^^^^^^^^
    --           composite partition CK1      CK2
) WITH CLUSTERING ORDER BY (sent_at DESC, message_id ASC);
```
| Component | Purpose | Query Restriction | Cardinality Goal |
|---|---|---|---|
| Partition Key | Data distribution (which node) | Must use = in WHERE | High (many distinct values) |
| Clustering Column 1 | Primary sort within partition | Can use =, <, >, <=, >=, range slices | Moderate |
| Clustering Column 2+ | Secondary sort | Can only filter if CK1 is equality-filtered | Any |
The Partition Size Rule
A single partition should not exceed 100MB (hard limit is 2GB but performance degrades well before that). If your partition key leads to unbounded growth (all events for a device, all messages in a chat), add a bucketing column to the partition key (day, month, or sequence number).
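Deriving the bucket value is the application's job at write time. A minimal sketch, assuming day-granularity buckets and an illustrative `day_bucket` helper:

```python
from datetime import datetime, timezone

def day_bucket(ts: datetime) -> str:
    """Derive the bucketing column value: one partition per sensor per day."""
    return ts.strftime("%Y-%m-%d")

ts = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
# The full partition key the application writes and queries with:
partition_key = ("temp-001", day_bucket(ts))
# → ('temp-001', '2024-01-15'): each partition is bounded to one day of readings
```

For higher write volumes the same idea applies with hour buckets (`%Y-%m-%d-%H`) or a modulo sequence number.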
Clustering Columns & Sort Order
Clustering columns determine how rows are physically sorted within a partition on disk. This sort order is defined at table creation and cannot be changed. Reads within a partition are sequential disk reads — extremely fast when the sort order matches your query.
```sql
-- Default: clustering columns sort ASC
CREATE TABLE events (
    device_id  TEXT,
    event_time TIMESTAMP,
    event_type TEXT,
    payload    TEXT,
    PRIMARY KEY (device_id, event_time)
);
-- Stored on disk: oldest first
-- Query: SELECT * FROM events WHERE device_id = 'sensor1'
--        ORDER BY event_time ASC   ← free (matches storage)

-- Explicit DESC for "newest first" queries
CREATE TABLE events_newest_first (
    device_id  TEXT,
    event_time TIMESTAMP,
    event_type TEXT,
    payload    TEXT,
    PRIMARY KEY (device_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
-- Stored on disk: newest first
-- Query: SELECT * FROM events_newest_first
--        WHERE device_id = 'sensor1' LIMIT 10
-- → returns 10 most recent events (single sequential read)

-- Multiple clustering columns
CREATE TABLE leaderboard (
    game_id TEXT,
    score   INT,
    user_id UUID,
    PRIMARY KEY (game_id, score, user_id)
) WITH CLUSTERING ORDER BY (score DESC, user_id ASC);
-- Sorted by score descending, then user_id ascending within same score
```
Sort Order Is Physical
Unlike SQL ORDER BY which sorts at query time, Cassandra's CLUSTERING ORDER BY determines how data is physically written to disk. Querying in the declared order is a sequential read (fast). Querying in reverse order is also supported but slightly less efficient. You cannot sort by a non-clustering column.
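The sorted-at-write-time idea can be modeled in a few lines. This is a toy sketch, not Cassandra's storage engine: a Python list kept sorted on insert plays the role of a partition stored under `CLUSTERING ORDER BY (score DESC, user_id ASC)`:

```python
import bisect

# One partition's rows, kept sorted at write time -- a toy model of how
# Cassandra lays rows out on disk for the leaderboard table above.
partition = []  # sorted list of (sort_key, score, user_id)

def write(score, user_id):
    # Negate score so a plain ascending list realises score DESC, user_id ASC.
    bisect.insort(partition, ((-score, user_id), score, user_id))

for s, u in [(120, "bob"), (300, "alice"), (120, "ann")]:
    write(s, u)

# "Top N" is just a prefix of the physical order -- no sorting at read time.
top2 = [user for _key, _score, user in partition[:2]]
# → ['alice', 'ann']
```

The point of the model: reads pay nothing for ordering because writes already paid for it, which is why `LIMIT 10` on a matching clustering order is a single sequential scan.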
Clustering Column Rules
- ✅ Clustering columns define physical sort order on disk
- ✅ You can only filter on clustering column N if columns 1..N-1 use equality
- ✅ LIMIT with matching sort order = efficient 'top N' queries
- ✅ Range queries (>, <, BETWEEN) only on the last filtered clustering column
- ✅ Cannot skip clustering columns in WHERE (must filter left to right)
```sql
-- Table: PRIMARY KEY (user_id, year, month, day)

-- ✅ Valid queries (left-to-right filtering):
SELECT * FROM t WHERE user_id = 'alice' AND year = 2024;
SELECT * FROM t WHERE user_id = 'alice' AND year = 2024 AND month = 1;
SELECT * FROM t WHERE user_id = 'alice' AND year = 2024 AND month >= 6;

-- ❌ Invalid queries (skipping clustering columns):
SELECT * FROM t WHERE user_id = 'alice' AND month = 6;  -- skipped year!
SELECT * FROM t WHERE user_id = 'alice' AND day = 15;   -- skipped year, month!

-- ❌ Invalid: range on non-last filtered column
SELECT * FROM t WHERE user_id = 'alice' AND year > 2020 AND month = 6;
-- year uses range, so month cannot be filtered
```
CQL vs SQL — Critical Differences
CQL (Cassandra Query Language) looks deceptively like SQL but has fundamental restrictions. These restrictions exist because Cassandra is distributed — operations that would require coordinating across all nodes are either forbidden or extremely expensive.
| Feature | SQL (PostgreSQL) | CQL (Cassandra) |
|---|---|---|
| JOINs | Full support | Not supported — denormalize instead |
| WHERE clause | Any column | Only partition key + clustering columns |
| GROUP BY | Any column | Only partition key and clustering columns, as a prefix (Cassandra 3.10+) |
| ORDER BY | Any column | Only clustering columns, must match table order |
| Aggregations | Full (SUM, AVG, COUNT) | Limited; efficient only within a single partition |
| Subqueries | Full support | Not supported |
| UPDATE semantics | Read-modify-write | Upsert (blind write, no read) |
| DELETE | Immediate removal | Writes a tombstone (removed later) |
| NULL handling | Stored explicitly | Absent = not stored (saves space) |
| Secondary indexes | Efficient B-tree | Expensive, avoid in production |
```sql
-- CQL looks like SQL but behaves very differently:

-- ❌ No JOINs
SELECT * FROM users u JOIN orders o ON u.id = o.user_id;  -- ERROR

-- ❌ WHERE must include full partition key
SELECT * FROM posts_by_user WHERE title = 'Hello';  -- ERROR
-- Must be: WHERE user_id = ? (partition key required)

-- ❌ No arbitrary ORDER BY
SELECT * FROM posts_by_user WHERE user_id = ? ORDER BY title;
-- ERROR (title is not a clustering column)

-- ✅ UPDATE is actually an UPSERT (no read-before-write)
UPDATE users SET name = 'Alice' WHERE user_id = 123;
-- If user 123 doesn't exist, this CREATES it!
-- There is no "row not found" error for UPDATE

-- ✅ INSERT is also an UPSERT
INSERT INTO users (user_id, name) VALUES (123, 'Alice');
-- If user 123 exists, this OVERWRITES it!
-- There is no "duplicate key" error for INSERT

-- ✅ TTL — automatic expiration
INSERT INTO sessions (session_id, user_id, data)
VALUES (uuid(), 123, 'session_data')
USING TTL 3600;  -- expires in 1 hour

-- ✅ Lightweight Transaction (conditional write)
INSERT INTO users (user_id, name) VALUES (123, 'Alice')
IF NOT EXISTS;  -- only inserts if truly new (uses Paxos)
```
UPDATE = INSERT in Cassandra
This is the most surprising difference for SQL developers. Cassandra does not read before writing. UPDATE creates the row if it doesn't exist. INSERT overwrites if it does. They are semantically identical operations. If you need true "insert-only" or "update-only" behavior, use Lightweight Transactions (IF NOT EXISTS / IF EXISTS).
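The blind-write semantics can be modeled as a plain key-to-cells merge. A minimal sketch (an in-memory toy, not the real storage engine):

```python
# Both INSERT and UPDATE are blind upserts: the engine merges cells into
# the row identified by the primary key, and never reads before writing.
table = {}  # primary key -> {column: value}

def upsert(pk, **cols):
    """INSERT and UPDATE are the same operation: merge cells into the row."""
    table.setdefault(pk, {}).update(cols)

upsert(123, name="Alice")            # "UPDATE" on a missing row creates it
upsert(123, email="a@example.com")   # "INSERT" on an existing row merges in
# table[123] == {"name": "Alice", "email": "a@example.com"}
```

Note that only the written cells are touched: the second call did not erase `name`, which matches Cassandra's cell-level write model.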
Data Modeling Patterns
Pattern 1: Time-Series with Bucketing
```sql
-- Problem: All events for a device in one partition → unbounded growth
-- Solution: Bucket by day (or hour for high-volume)
CREATE TABLE sensor_readings (
    sensor_id  TEXT,
    day        DATE,       -- bucketing column in partition key
    event_time TIMESTAMP,
    value      DOUBLE,
    PRIMARY KEY ((sensor_id, day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Query today's readings:
SELECT * FROM sensor_readings
WHERE sensor_id = 'temp-001' AND day = '2024-01-15'
LIMIT 100;

-- Query a range within a day:
SELECT * FROM sensor_readings
WHERE sensor_id = 'temp-001' AND day = '2024-01-15'
  AND event_time >= '2024-01-15 10:00:00'
  AND event_time <= '2024-01-15 12:00:00';
```
Pattern 2: Denormalized Lookup Tables
```sql
-- Same data, different access patterns, different tables:

-- Lookup by email (login flow)
CREATE TABLE users_by_email (
    email         TEXT PRIMARY KEY,
    user_id       UUID,
    name          TEXT,
    password_hash TEXT
);

-- Lookup by user_id (profile page)
CREATE TABLE users_by_id (
    user_id UUID PRIMARY KEY,
    email   TEXT,
    name    TEXT,
    bio     TEXT,
    avatar  TEXT
);

-- Application writes to BOTH tables on user creation:
-- BEGIN BATCH
--   INSERT INTO users_by_email ...
--   INSERT INTO users_by_id ...
-- APPLY BATCH;
```
Pattern 3: Wide Rows for Relationships
```sql
-- Followers/following relationship
CREATE TABLE followers (
    user_id     UUID,
    follower_id UUID,
    followed_at TIMESTAMP,
    PRIMARY KEY (user_id, follower_id)
);

-- Get all followers of a user:
SELECT * FROM followers WHERE user_id = ?;

-- Check if a specific user follows:
SELECT * FROM followers WHERE user_id = ? AND follower_id = ?;

-- Count followers (expensive — scans all rows in the partition):
SELECT COUNT(*) FROM followers WHERE user_id = ?;
-- Better: maintain a counter table separately
```
Pattern 4: Materialized Views (Use with Caution)
```sql
-- Base table
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    email   TEXT,
    country TEXT,
    name    TEXT
);

-- Materialized view: automatic denormalization
CREATE MATERIALIZED VIEW users_by_country AS
    SELECT * FROM users
    WHERE country IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (country, user_id);

-- Cassandra automatically keeps the view in sync
-- ⚠️ BUT: MVs have known bugs and performance issues
-- Many teams prefer manual denormalization with BATCH writes
```
Anti-Patterns to Avoid
Data Modeling Anti-Patterns
- ❌ Using secondary indexes for high-cardinality columns (user_id, email)
- ❌ Unbounded partition growth — no bucketing for time-series data
- ❌ Relying on ALLOW FILTERING in production queries
- ❌ Using Cassandra counters for critical business logic (they can lose updates)
- ❌ Storing large blobs (>1MB) directly in Cassandra columns
- ❌ Creating too many materialized views (each adds write amplification)
- ❌ Using IN clause with large value lists (creates coordinator pressure)
```sql
-- ❌ ALLOW FILTERING — full cluster scan
SELECT * FROM users WHERE name = 'Alice' ALLOW FILTERING;
-- Reads EVERY partition, filters in memory. O(n) on cluster size.
-- Never use in production. Create a table with name as partition key.

-- ❌ Secondary index on high-cardinality column
CREATE INDEX ON users (email);
SELECT * FROM users WHERE email = 'alice@example.com';
-- Queries ALL nodes (scatter-gather). Extremely slow at scale.
-- Solution: Create a users_by_email table.

-- ❌ Large IN clause
SELECT * FROM posts WHERE user_id IN (uuid1, uuid2, ..., uuid100);
-- Coordinator must query 100 different partitions and merge.
-- Solution: Query individually and merge in the application.

-- ❌ Unbounded partition
CREATE TABLE events (
    device_id  TEXT,
    event_time TIMESTAMP,
    data       TEXT,
    PRIMARY KEY (device_id, event_time)
);
-- A device running for years → partition grows to GB+
-- Solution: Add bucketing: PRIMARY KEY ((device_id, month), event_time)
```
ALLOW FILTERING = Table Scan
If your query requires ALLOW FILTERING, it means Cassandra cannot serve it efficiently. It will read every partition in the cluster and filter in memory. This is acceptable for development/debugging but must never appear in production code. The fix is always to create a new table with the correct primary key for that query.
Lightweight Transactions (LWT)
Lightweight Transactions (LWT) provide linearizable consistency using the Paxos consensus protocol. They enable conditional writes — "insert if not exists" or "update if current value matches." However, they are 4-10x slower than regular writes and should be used sparingly.
```sql
-- Conditional INSERT (create if not exists)
INSERT INTO users_by_email (email, user_id, name)
VALUES ('alice@example.com', uuid(), 'Alice')
IF NOT EXISTS;
-- Returns [applied]=true if inserted, false if email already taken
-- (the condition is checked against the primary key — here, email)

-- Conditional UPDATE (compare-and-swap)
UPDATE accounts SET balance = 900
WHERE account_id = 123
IF balance = 1000;
-- Only updates if current balance is exactly 1000
-- Returns [applied]=true/false, plus current values if false

-- Conditional DELETE
DELETE FROM locks WHERE lock_id = 'resource-1'
IF owner = 'worker-5';
-- Only deletes if the current owner matches
```
| Aspect | Regular Write | LWT (IF EXISTS/IF NOT EXISTS) |
|---|---|---|
| Consensus | None (last-write-wins) | Paxos (4 round trips) |
| Latency | ~2-5ms | ~15-50ms (4-10x slower) |
| Throughput | High | Low (serial execution per partition) |
| Consistency | Eventual or tunable | Linearizable (SERIAL) |
| Use case | Most writes | Uniqueness, CAS, distributed locks |
When to Use LWT
Use LWT only when you need true conditional semantics: unique email registration, distributed locks, compare-and-swap counters. For most writes, last-write-wins (regular writes) is sufficient. If you find yourself using LWT on every write, Cassandra may not be the right database for your use case.
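The `[applied]` result shape that LWT statements return can be modeled as a compare-and-set. A hypothetical in-memory sketch of the conditional UPDATE above, not real driver code:

```python
# Toy single-node model of an LWT compare-and-set on a balance column.
table = {"acct-1": {"balance": 1000}}

def cas_update(pk, expected, new):
    """Return (applied, current_values) -- the shape an LWT result row has."""
    row = table.get(pk)
    if row is not None and row["balance"] == expected:
        row["balance"] = new
        return (True, None)
    # Not applied: Cassandra returns the current values so the caller can retry.
    return (False, dict(row) if row else None)

ok, _ = cas_update("acct-1", expected=1000, new=900)         # applied
fail, current = cas_update("acct-1", expected=1000, new=800) # stale expectation
# ok is True; fail is False with current == {"balance": 900}
```

Real LWT adds what this sketch cannot: Paxos rounds so the compare and the set are atomic across replicas, which is exactly where the 4-10x latency cost comes from.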
LWT Best Practices
- ✅ Use sparingly — only for operations that truly need conditional logic
- ✅ Never mix LWT and non-LWT writes to the same partition (breaks linearizability guarantees)
- ✅ LWT operations on the same partition are serialized (bottleneck)
- ✅ Set a reasonable timeout — Paxos can be slow under contention
- ✅ Consider whether last-write-wins is actually acceptable before reaching for LWT
Interview Questions
Q: Why does Cassandra use a 'query-first' data modeling approach?
A: Because Cassandra has no joins and restricted WHERE clauses — you can only efficiently query by partition key and clustering columns. If you model entities first (like in RDBMS), you'll discover your queries can't be served. By listing queries first, you design tables where each query maps to a single partition read — O(1) regardless of cluster size.
Q: What is the difference between a partition key and clustering columns?
A: The partition key determines WHICH node stores the data (distribution via hash). Clustering columns determine the SORT ORDER within that partition (physical ordering on disk). You must always provide the full partition key in queries (equality only). Clustering columns support range queries but must be filtered left-to-right without gaps.
Q: Why is denormalization correct in Cassandra?
A: Cassandra has no joins — the only way to serve different access patterns on the same data is to store it in multiple tables with different primary keys. Disk is cheap; network round trips are expensive. One denormalized read from a single partition is always faster than multiple reads from different tables that would need application-level joining.
Q: What is a Lightweight Transaction and when would you use one?
A: LWT uses Paxos consensus to provide linearizable (conditional) writes: IF NOT EXISTS, IF column = value. Use cases: unique constraints (email registration), distributed locks, compare-and-swap. Trade-off: 4-10x slower than regular writes due to 4 Paxos round trips. Only use when last-write-wins semantics are unacceptable.
Q: How do you handle unbounded partition growth in Cassandra?
A: Add a bucketing column to the partition key. For time-series: PRIMARY KEY ((device_id, day), event_time) limits each partition to one day of data. For high-volume: use hour or even minute buckets. The application must know which bucket to query (trade-off: multi-bucket queries need multiple requests). Target: partitions under 100MB.
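The multi-bucket trade-off is concrete: a range query that crosses bucket boundaries must be fanned out to one partition read per bucket. A small sketch of the enumeration step, assuming day buckets:

```python
from datetime import date, timedelta

def day_buckets(start: date, end: date):
    """All day buckets (inclusive) the application must query for a range."""
    d, out = start, []
    while d <= end:
        out.append(d.isoformat())
        d += timedelta(days=1)
    return out

buckets = day_buckets(date(2024, 1, 14), date(2024, 1, 16))
# → ['2024-01-14', '2024-01-15', '2024-01-16']: three partition reads,
# one per (sensor_id, day) partition, issued in parallel by the client
```

Choosing bucket granularity is a balance: finer buckets keep partitions small but widen the fan-out for range queries.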
Common Mistakes
Trying to normalize data like a relational database
Creating separate tables for users, orders, and products then trying to join them in the application. This results in multiple network round trips and defeats Cassandra's single-partition-read performance model.
✅ Denormalize: embed the data you need in each table. If you need a user's name with their orders, store user_name in the orders table. Accept data duplication as the correct pattern.
Using ALLOW FILTERING in production
Adding ALLOW FILTERING to make a query 'work' without understanding it causes a full cluster scan. This query gets slower as the cluster grows — O(n) on data size.
✅ Create a new table with the correct primary key for that query pattern. Every production query should hit exactly one partition.
Ignoring partition size limits
Designing partitions that grow without bound (all messages in a chat room, all events for a device). Partitions over 100MB cause GC pressure, slow reads, and compaction issues.
✅ Add time-based or sequence-based bucketing to the partition key. Monitor partition sizes with nodetool tablehistograms.
Using secondary indexes as a primary access pattern
Creating secondary indexes on high-cardinality columns (user_id, email) and using them for primary lookups. Secondary indexes query ALL nodes (scatter-gather) — they don't scale.
✅ Secondary indexes are only appropriate for low-cardinality columns queried alongside the partition key. For primary lookups, create a dedicated table.
Overusing Lightweight Transactions
Using IF NOT EXISTS or IF conditions on every write 'for safety.' LWT uses Paxos (4 round trips), is 4-10x slower, and serializes writes to the same partition.
✅ Use regular writes with last-write-wins for most operations. Reserve LWT for true uniqueness constraints and compare-and-swap operations only.