Data Model & CQL
Cassandra data modeling is query-first: you design your tables to serve specific access patterns, not to represent entities. One table per query pattern is the rule, not the exception.
Query-First Design Philosophy
In relational databases, you normalize entities first and write queries later. In Cassandra, you do the exact opposite: list every query your application needs, then create a table optimized for each query. This often means the same data is stored in multiple tables with different primary keys.
The Library Index Cards
A library has the same books indexed multiple ways: by author, by title, by subject, by ISBN. Each index card catalog is a different 'table' optimized for a different lookup. You don't normalize into one master catalog — you denormalize into multiple indexes so every lookup is O(1). Cassandra tables are like these index cards.
🔑 The Golden Rule
In Cassandra, you design tables to answer queries — not to model entities. If you have 5 different queries about users, you might have 5 different tables containing user data, each with a different primary key optimized for that specific query.
```
Step 1: List Access Patterns
═══════════════════════════════════════════════════════════════
Q1: Get user by user_id
Q2: Get all posts by a user, newest first
Q3: Get all posts in a category, newest first
Q4: Get a specific post by post_id
Q5: Get all comments on a post, oldest first

Step 2: Design One Table Per Query
═══════════════════════════════════════════════════════════════
Q1 → users_by_id        (PK: user_id)
Q2 → posts_by_user      (PK: user_id,  CK: created_at DESC)
Q3 → posts_by_category  (PK: category, CK: created_at DESC)
Q4 → posts_by_id        (PK: post_id)
Q5 → comments_by_post   (PK: post_id,  CK: created_at ASC)

Step 3: Accept Denormalization
═══════════════════════════════════════════════════════════════
Post data exists in posts_by_user AND posts_by_category AND
posts_by_id. This is correct. Disk is cheap. Joins are impossible.
Write to all tables on insert (or use materialized views/CDC).
```
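The fan-out write in Step 3 can be sketched in application code. This is a hypothetical in-memory illustration — the dicts stand in for the post tables from Step 2, not for real driver calls:

```python
from datetime import datetime, timezone
from uuid import uuid4

# In-memory stand-ins for three denormalized post tables (Q2-Q4).
posts_by_user = {}      # key: (user_id, created_at)
posts_by_category = {}  # key: (category, created_at)
posts_by_id = {}        # key: post_id

def insert_post(user_id, category, title, body):
    """Fan one logical insert out to every table that serves a post query."""
    post_id = uuid4()
    created_at = datetime.now(timezone.utc)
    row = {"post_id": post_id, "user_id": user_id,
           "category": category, "title": title, "body": body}
    posts_by_user[(user_id, created_at)] = row
    posts_by_category[(category, created_at)] = row
    posts_by_id[post_id] = row
    return post_id

pid = insert_post("alice", "tech", "Hello", "First post")
# The same post is now readable via all three access patterns.
```

In a real application the three writes would typically go into a single logged BATCH so they succeed or fail together.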
Query-First Principles
- ✅ One table per query pattern — each table serves exactly one access pattern
- ✅ Denormalization is correct — duplicate data across tables is expected
- ✅ No joins — if you need data from two entities, co-locate them or duplicate
- ✅ Partition key = equality filter in WHERE clause
- ✅ Clustering columns = range filters and sort order
Primary Key Design
The primary key in Cassandra has two parts: the partition key (determines which node stores the data) and optional clustering columns (determine sort order within the partition). The syntax is: PRIMARY KEY ((partition_key), clustering_col1, clustering_col2).
```sql
-- Anatomy of a Cassandra Primary Key:
-- PRIMARY KEY ((partition_key_cols), clustering_cols)

-- Simple partition key, one clustering column
CREATE TABLE posts_by_user (
    user_id    UUID,
    created_at TIMESTAMP,
    post_id    UUID,
    title      TEXT,
    body       TEXT,
    PRIMARY KEY (user_id, created_at)
    --           ^^^^^^^  ^^^^^^^^^^
    --           partition clustering
    --           key       column
) WITH CLUSTERING ORDER BY (created_at DESC);

-- Composite partition key (multiple columns)
CREATE TABLE sensor_data (
    sensor_id  TEXT,
    day        DATE,
    event_time TIMESTAMP,
    reading    DOUBLE,
    PRIMARY KEY ((sensor_id, day), event_time)
    --           ^^^^^^^^^^^^^^^^  ^^^^^^^^^^
    --           composite partition clustering
    --           key (both required) column
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Multiple clustering columns
CREATE TABLE messages (
    chat_id    UUID,
    bucket     INT,
    sent_at    TIMESTAMP,
    message_id UUID,
    sender_id  UUID,
    body       TEXT,
    PRIMARY KEY ((chat_id, bucket), sent_at, message_id)
    --           ^^^^^^^^^^^^^^^^^  ^^^^^^^  ^^^^^^^^^^
    --           composite partition CK1      CK2
) WITH CLUSTERING ORDER BY (sent_at DESC, message_id ASC);
```
| Component | Purpose | Query Restriction | Cardinality Goal |
|---|---|---|---|
| Partition Key | Data distribution (which node) | Must use = in WHERE | High (many distinct values) |
| Clustering Column 1 | Primary sort within partition | Can use =, <, >, <=, >=, range slices | Moderate |
| Clustering Column 2+ | Secondary sort | Can only filter if CK1 is equality-filtered | Any |
The Partition Size Rule
A single partition should not exceed 100MB (hard limit is 2GB but performance degrades well before that). If your partition key leads to unbounded growth (all events for a device, all messages in a chat), add a bucketing column to the partition key (day, month, or sequence number).
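Deriving the bucket value is the application's job at write time. A minimal sketch, assuming day-granularity buckets and an illustrative `day_bucket` helper:

```python
from datetime import datetime, timezone

def day_bucket(ts: datetime) -> str:
    """Derive the bucketing column value: one partition per sensor per day."""
    return ts.strftime("%Y-%m-%d")

ts = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
# The full partition key the application writes and queries with:
partition_key = ("temp-001", day_bucket(ts))
# → ('temp-001', '2024-01-15'): each partition is bounded to one day of readings
```

For higher write volumes the same idea applies with hour buckets (`%Y-%m-%d-%H`) or a modulo sequence number.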
Clustering Columns & Sort Order
Clustering columns determine how rows are physically sorted within a partition on disk. This sort order is defined at table creation and cannot be changed. Reads within a partition are sequential disk reads — extremely fast when the sort order matches your query.
```sql
-- Default: clustering columns sort ASC
CREATE TABLE events (
    device_id  TEXT,
    event_time TIMESTAMP,
    event_type TEXT,
    payload    TEXT,
    PRIMARY KEY (device_id, event_time)
);
-- Stored on disk: oldest first
-- Query: SELECT * FROM events WHERE device_id = 'sensor1'
--        ORDER BY event_time ASC   ← free (matches storage)

-- Explicit DESC for "newest first" queries
CREATE TABLE events_newest_first (
    device_id  TEXT,
    event_time TIMESTAMP,
    event_type TEXT,
    payload    TEXT,
    PRIMARY KEY (device_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
-- Stored on disk: newest first
-- Query: SELECT * FROM events_newest_first
--        WHERE device_id = 'sensor1' LIMIT 10
-- → returns 10 most recent events (single sequential read)

-- Multiple clustering columns
CREATE TABLE leaderboard (
    game_id TEXT,
    score   INT,
    user_id UUID,
    PRIMARY KEY (game_id, score, user_id)
) WITH CLUSTERING ORDER BY (score DESC, user_id ASC);
-- Sorted by score descending, then user_id ascending within same score
```
Sort Order Is Physical
Unlike SQL ORDER BY which sorts at query time, Cassandra's CLUSTERING ORDER BY determines how data is physically written to disk. Querying in the declared order is a sequential read (fast). Querying in reverse order is also supported but slightly less efficient. You cannot sort by a non-clustering column.
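The sorted-at-write-time idea can be modeled in a few lines. This is a toy sketch, not Cassandra's storage engine: a Python list kept sorted on insert plays the role of a partition stored under `CLUSTERING ORDER BY (score DESC, user_id ASC)`:

```python
import bisect

# One partition's rows, kept sorted at write time -- a toy model of how
# Cassandra lays rows out on disk for the leaderboard table above.
partition = []  # sorted list of (sort_key, score, user_id)

def write(score, user_id):
    # Negate score so a plain ascending list realises score DESC, user_id ASC.
    bisect.insort(partition, ((-score, user_id), score, user_id))

for s, u in [(120, "bob"), (300, "alice"), (120, "ann")]:
    write(s, u)

# "Top N" is just a prefix of the physical order -- no sorting at read time.
top2 = [user for _key, _score, user in partition[:2]]
# → ['alice', 'ann']
```

The point of the model: reads pay nothing for ordering because writes already paid for it, which is why `LIMIT 10` on a matching clustering order is a single sequential scan.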
Clustering Column Rules
- ✅ Clustering columns define physical sort order on disk
- ✅ You can only filter on clustering column N if columns 1..N-1 use equality
- ✅ LIMIT with matching sort order = efficient 'top N' queries
- ✅ Range queries (>, <, BETWEEN) only on the last filtered clustering column
- ✅ Cannot skip clustering columns in WHERE (must filter left to right)
```sql
-- Table: PRIMARY KEY (user_id, year, month, day)

-- ✅ Valid queries (left-to-right filtering):
SELECT * FROM t WHERE user_id = 'alice' AND year = 2024;
SELECT * FROM t WHERE user_id = 'alice' AND year = 2024 AND month = 1;
SELECT * FROM t WHERE user_id = 'alice' AND year = 2024 AND month >= 6;

-- ❌ Invalid queries (skipping clustering columns):
SELECT * FROM t WHERE user_id = 'alice' AND month = 6;  -- skipped year!
SELECT * FROM t WHERE user_id = 'alice' AND day = 15;   -- skipped year, month!

-- ❌ Invalid: range on non-last filtered column
SELECT * FROM t WHERE user_id = 'alice' AND year > 2020 AND month = 6;
-- year uses range, so month cannot be filtered
```
CQL vs SQL — Critical Differences
CQL (Cassandra Query Language) looks deceptively like SQL but has fundamental restrictions. These restrictions exist because Cassandra is distributed — operations that would require coordinating across all nodes are either forbidden or extremely expensive.
| Feature | SQL (PostgreSQL) | CQL (Cassandra) |
|---|---|---|
| JOINs | Full support | Not supported — denormalize instead |
| WHERE clause | Any column | Only partition key + clustering columns |
| GROUP BY | Any column | Only partition key and clustering columns, as a prefix (Cassandra 3.10+) |
| ORDER BY | Any column | Only clustering columns, must match table order |
| Aggregations | Full (SUM, AVG, COUNT) | Limited; efficient only within a single partition |
| Subqueries | Full support | Not supported |
| UPDATE semantics | Read-modify-write | Upsert (blind write, no read) |
| DELETE | Immediate removal | Writes a tombstone (removed later) |
| NULL handling | Stored explicitly | Absent = not stored (saves space) |
| Secondary indexes | Efficient B-tree | Expensive, avoid in production |
```sql
-- CQL looks like SQL but behaves very differently:

-- ❌ No JOINs
SELECT * FROM users u JOIN orders o ON u.id = o.user_id;  -- ERROR

-- ❌ WHERE must include full partition key
SELECT * FROM posts_by_user WHERE title = 'Hello';  -- ERROR
-- Must be: WHERE user_id = ? (partition key required)

-- ❌ No arbitrary ORDER BY
SELECT * FROM posts_by_user WHERE user_id = ? ORDER BY title;
-- ERROR (title is not a clustering column)

-- ✅ UPDATE is actually an UPSERT (no read-before-write)
UPDATE users SET name = 'Alice' WHERE user_id = 123;
-- If user 123 doesn't exist, this CREATES it!
-- There is no "row not found" error for UPDATE

-- ✅ INSERT is also an UPSERT
INSERT INTO users (user_id, name) VALUES (123, 'Alice');
-- If user 123 exists, this OVERWRITES it!
-- There is no "duplicate key" error for INSERT

-- ✅ TTL — automatic expiration
INSERT INTO sessions (session_id, user_id, data)
VALUES (uuid(), 123, 'session_data')
USING TTL 3600;  -- expires in 1 hour

-- ✅ Lightweight Transaction (conditional write)
INSERT INTO users (user_id, name) VALUES (123, 'Alice')
IF NOT EXISTS;  -- only inserts if truly new (uses Paxos)
```
UPDATE = INSERT in Cassandra
This is the most surprising difference for SQL developers. Cassandra does not read before writing. UPDATE creates the row if it doesn't exist. INSERT overwrites if it does. They are semantically identical operations. If you need true "insert-only" or "update-only" behavior, use Lightweight Transactions (IF NOT EXISTS / IF EXISTS).
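The blind-write semantics can be modeled as a plain key-to-cells merge. A minimal sketch (an in-memory toy, not the real storage engine):

```python
# Both INSERT and UPDATE are blind upserts: the engine merges cells into
# the row identified by the primary key, and never reads before writing.
table = {}  # primary key -> {column: value}

def upsert(pk, **cols):
    """INSERT and UPDATE are the same operation: merge cells into the row."""
    table.setdefault(pk, {}).update(cols)

upsert(123, name="Alice")            # "UPDATE" on a missing row creates it
upsert(123, email="a@example.com")   # "INSERT" on an existing row merges in
# table[123] == {"name": "Alice", "email": "a@example.com"}
```

Note that only the written cells are touched: the second call did not erase `name`, which matches Cassandra's cell-level write model.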
Data Modeling Patterns
Pattern 1: Time-Series with Bucketing
```sql
-- Problem: All events for a device in one partition → unbounded growth
-- Solution: Bucket by day (or hour for high-volume)
CREATE TABLE sensor_readings (
    sensor_id  TEXT,
    day        DATE,       -- bucketing column in partition key
    event_time TIMESTAMP,
    value      DOUBLE,
    PRIMARY KEY ((sensor_id, day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Query today's readings:
SELECT * FROM sensor_readings
WHERE sensor_id = 'temp-001' AND day = '2024-01-15'
LIMIT 100;

-- Query a range within a day:
SELECT * FROM sensor_readings
WHERE sensor_id = 'temp-001' AND day = '2024-01-15'
  AND event_time >= '2024-01-15 10:00:00'
  AND event_time <= '2024-01-15 12:00:00';
```
Pattern 2: Denormalized Lookup Tables
```sql
-- Same data, different access patterns, different tables:

-- Lookup by email (login flow)
CREATE TABLE users_by_email (
    email         TEXT PRIMARY KEY,
    user_id       UUID,
    name          TEXT,
    password_hash TEXT
);

-- Lookup by user_id (profile page)
CREATE TABLE users_by_id (
    user_id UUID PRIMARY KEY,
    email   TEXT,
    name    TEXT,
    bio     TEXT,
    avatar  TEXT
);

-- Application writes to BOTH tables on user creation:
-- BEGIN BATCH
--   INSERT INTO users_by_email ...
--   INSERT INTO users_by_id ...
-- APPLY BATCH;
```
Pattern 3: Wide Rows for Relationships
```sql
-- Followers/following relationship
CREATE TABLE followers (
    user_id     UUID,
    follower_id UUID,
    followed_at TIMESTAMP,
    PRIMARY KEY (user_id, follower_id)
);

-- Get all followers of a user:
SELECT * FROM followers WHERE user_id = ?;

-- Check if a specific user follows:
SELECT * FROM followers WHERE user_id = ? AND follower_id = ?;

-- Count followers (expensive — scans all rows in the partition):
SELECT COUNT(*) FROM followers WHERE user_id = ?;
-- Better: maintain a counter table separately
```
Pattern 4: Materialized Views (Use with Caution)
```sql
-- Base table
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    email   TEXT,
    country TEXT,
    name    TEXT
);

-- Materialized view: automatic denormalization
CREATE MATERIALIZED VIEW users_by_country AS
    SELECT * FROM users
    WHERE country IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (country, user_id);

-- Cassandra automatically keeps the view in sync
-- ⚠️ BUT: MVs have known bugs and performance issues
-- Many teams prefer manual denormalization with BATCH writes
```
Anti-Patterns to Avoid
Data Modeling Anti-Patterns
- ❌ Using secondary indexes for high-cardinality columns (user_id, email)
- ❌ Unbounded partition growth — no bucketing for time-series data
- ❌ Relying on ALLOW FILTERING in production queries
- ❌ Using Cassandra counters for critical business logic (they can lose updates)
- ❌ Storing large blobs (>1MB) directly in Cassandra columns
- ❌ Creating too many materialized views (each adds write amplification)
- ❌ Using IN clause with large value lists (creates coordinator pressure)
```sql
-- ❌ ALLOW FILTERING — full cluster scan
SELECT * FROM users WHERE name = 'Alice' ALLOW FILTERING;
-- Reads EVERY partition, filters in memory. O(n) on cluster size.
-- Never use in production. Create a table with name as partition key.

-- ❌ Secondary index on high-cardinality column
CREATE INDEX ON users (email);
SELECT * FROM users WHERE email = 'alice@example.com';
-- Queries ALL nodes (scatter-gather). Extremely slow at scale.
-- Solution: Create a users_by_email table.

-- ❌ Large IN clause
SELECT * FROM posts WHERE user_id IN (uuid1, uuid2, ..., uuid100);
-- Coordinator must query 100 different partitions and merge.
-- Solution: Query individually and merge in the application.

-- ❌ Unbounded partition
CREATE TABLE events (
    device_id  TEXT,
    event_time TIMESTAMP,
    data       TEXT,
    PRIMARY KEY (device_id, event_time)
);
-- A device running for years → partition grows to GB+
-- Solution: Add bucketing: PRIMARY KEY ((device_id, month), event_time)
```
ALLOW FILTERING = Table Scan
If your query requires ALLOW FILTERING, it means Cassandra cannot serve it efficiently. It will read every partition in the cluster and filter in memory. This is acceptable for development/debugging but must never appear in production code. The fix is always to create a new table with the correct primary key for that query.
Lightweight Transactions (LWT)
Lightweight Transactions (LWT) provide linearizable consistency using the Paxos consensus protocol. They enable conditional writes — "insert if not exists" or "update if current value matches." However, they are 4-10x slower than regular writes and should be used sparingly.
```sql
-- Conditional INSERT (create if not exists)
INSERT INTO users_by_email (email, user_id, name)
VALUES ('alice@example.com', uuid(), 'Alice')
IF NOT EXISTS;
-- Returns [applied]=true if inserted, false if email already taken
-- (the condition is checked against the primary key — here, email)

-- Conditional UPDATE (compare-and-swap)
UPDATE accounts SET balance = 900
WHERE account_id = 123
IF balance = 1000;
-- Only updates if current balance is exactly 1000
-- Returns [applied]=true/false, plus current values if false

-- Conditional DELETE
DELETE FROM locks WHERE lock_id = 'resource-1'
IF owner = 'worker-5';
-- Only deletes if the current owner matches
```
| Aspect | Regular Write | LWT (IF EXISTS/IF NOT EXISTS) |
|---|---|---|
| Consensus | None (last-write-wins) | Paxos (4 round trips) |
| Latency | ~2-5ms | ~15-50ms (4-10x slower) |
| Throughput | High | Low (serial execution per partition) |
| Consistency | Eventual or tunable | Linearizable (SERIAL) |
| Use case | Most writes | Uniqueness, CAS, distributed locks |
When to Use LWT
Use LWT only when you need true conditional semantics: unique email registration, distributed locks, compare-and-swap counters. For most writes, last-write-wins (regular writes) is sufficient. If you find yourself using LWT on every write, Cassandra may not be the right database for your use case.
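The `[applied]` result shape that LWT statements return can be modeled as a compare-and-set. A hypothetical in-memory sketch of the conditional UPDATE above, not real driver code:

```python
# Toy single-node model of an LWT compare-and-set on a balance column.
table = {"acct-1": {"balance": 1000}}

def cas_update(pk, expected, new):
    """Return (applied, current_values) -- the shape an LWT result row has."""
    row = table.get(pk)
    if row is not None and row["balance"] == expected:
        row["balance"] = new
        return (True, None)
    # Not applied: Cassandra returns the current values so the caller can retry.
    return (False, dict(row) if row else None)

ok, _ = cas_update("acct-1", expected=1000, new=900)         # applied
fail, current = cas_update("acct-1", expected=1000, new=800) # stale expectation
# ok is True; fail is False with current == {"balance": 900}
```

Real LWT adds what this sketch cannot: Paxos rounds so the compare and the set are atomic across replicas, which is exactly where the 4-10x latency cost comes from.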
LWT Best Practices
- ✅ Use sparingly — only for operations that truly need conditional logic
- ✅ Never mix LWT and non-LWT writes to the same partition (breaks linearizability guarantees)
- ✅ LWT operations on the same partition are serialized (bottleneck)
- ✅ Set a reasonable timeout — Paxos can be slow under contention
- ✅ Consider whether last-write-wins is actually acceptable before reaching for LWT
Interview Questions
Q: Why does Cassandra use a 'query-first' data modeling approach?
A: Because Cassandra has no joins and restricted WHERE clauses — you can only efficiently query by partition key and clustering columns. If you model entities first (like in RDBMS), you'll discover your queries can't be served. By listing queries first, you design tables where each query maps to a single partition read — O(1) regardless of cluster size.
Q: What is the difference between a partition key and clustering columns?
A: The partition key determines WHICH node stores the data (distribution via hash). Clustering columns determine the SORT ORDER within that partition (physical ordering on disk). You must always provide the full partition key in queries (equality only). Clustering columns support range queries but must be filtered left-to-right without gaps.
Q: Why is denormalization correct in Cassandra?
A: Cassandra has no joins — the only way to serve different access patterns on the same data is to store it in multiple tables with different primary keys. Disk is cheap; network round trips are expensive. One denormalized read from a single partition is always faster than multiple reads from different tables that would need application-level joining.
Q: What is a Lightweight Transaction and when would you use one?
A: LWT uses Paxos consensus to provide linearizable (conditional) writes: IF NOT EXISTS, IF column = value. Use cases: unique constraints (email registration), distributed locks, compare-and-swap. Trade-off: 4-10x slower than regular writes due to 4 Paxos round trips. Only use when last-write-wins semantics are unacceptable.
Q: How do you handle unbounded partition growth in Cassandra?
A: Add a bucketing column to the partition key. For time-series: PRIMARY KEY ((device_id, day), event_time) limits each partition to one day of data. For high-volume: use hour or even minute buckets. The application must know which bucket to query (trade-off: multi-bucket queries need multiple requests). Target: partitions under 100MB.
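The multi-bucket trade-off is concrete: a range query that crosses bucket boundaries must be fanned out to one partition read per bucket. A small sketch of the enumeration step, assuming day buckets:

```python
from datetime import date, timedelta

def day_buckets(start: date, end: date):
    """All day buckets (inclusive) the application must query for a range."""
    d, out = start, []
    while d <= end:
        out.append(d.isoformat())
        d += timedelta(days=1)
    return out

buckets = day_buckets(date(2024, 1, 14), date(2024, 1, 16))
# → ['2024-01-14', '2024-01-15', '2024-01-16']: three partition reads,
# one per (sensor_id, day) partition, issued in parallel by the client
```

Choosing bucket granularity is a balance: finer buckets keep partitions small but widen the fan-out for range queries.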
Common Mistakes
Trying to normalize data like a relational database
Creating separate tables for users, orders, and products then trying to join them in the application. This results in multiple network round trips and defeats Cassandra's single-partition-read performance model.
✅ Denormalize: embed the data you need in each table. If you need a user's name with their orders, store user_name in the orders table. Accept data duplication as the correct pattern.
Using ALLOW FILTERING in production
Adding ALLOW FILTERING to make a query 'work' without understanding it causes a full cluster scan. This query gets slower as the cluster grows — O(n) on data size.
✅ Create a new table with the correct primary key for that query pattern. Every production query should hit exactly one partition.
Ignoring partition size limits
Designing partitions that grow without bound (all messages in a chat room, all events for a device). Partitions over 100MB cause GC pressure, slow reads, and compaction issues.
✅ Add time-based or sequence-based bucketing to the partition key. Monitor partition sizes with nodetool tablehistograms.
Using secondary indexes as a primary access pattern
Creating secondary indexes on high-cardinality columns (user_id, email) and using them for primary lookups. Secondary indexes query ALL nodes (scatter-gather) — they don't scale.
✅ Secondary indexes are only appropriate for low-cardinality columns queried alongside the partition key. For primary lookups, create a dedicated table.
Overusing Lightweight Transactions
Using IF NOT EXISTS or IF conditions on every write 'for safety.' LWT uses Paxos (4 round trips), is 4-10x slower, and serializes writes to the same partition.
✅ Use regular writes with last-write-wins for most operations. Reserve LWT for true uniqueness constraints and compare-and-swap operations only.