
Write & Read Path

Cassandra's write path is optimized for speed — sequential I/O only, no read-before-write. The read path absorbs the resulting complexity, using bloom filters and caches to minimize disk access.

01

Write Path Overview

Cassandra's write path is designed for maximum throughput. Every write goes to two places simultaneously: the commit log (for durability) and the memtable (for serving reads). There is no read-before-write, no lock acquisition, no index update on the write path. This is why Cassandra writes are so fast.

1

Client Sends Write

Client sends INSERT/UPDATE to coordinator node (or directly to replica via token-aware driver)

2

Coordinator Forwards

Coordinator identifies replica nodes from token ring and forwards write to all replicas simultaneously

3

Commit Log (Sequential Write)

Each replica appends the mutation to its commit log — a sequential, append-only file on disk. This is the durability guarantee.

4

Memtable (In-Memory Write)

The mutation is also written to the memtable — an in-memory sorted data structure (skip list or similar)

5

Acknowledge

Once commit log and memtable are written, the replica sends ACK to coordinator

6

Coordinator Responds

Once enough ACKs received (based on consistency level), coordinator returns success to client

šŸ“

The Accountant's Ledger

Imagine an accountant who records every transaction in two places: a permanent ledger (commit log) written sequentially page by page, and sticky notes on a whiteboard sorted by account (memtable). The ledger is for safety — if the office burns down, the ledger survives. The whiteboard is for speed — finding the latest balance is instant. Periodically, the whiteboard is photographed and filed (SSTable flush), and the sticky notes are cleared.

write-path.txt
Write Path on a Single Replica Node:

Client Request: INSERT INTO users (id, name) VALUES ('alice', 'Alice')
                                    │
                                    ā–¼
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                    Replica Node                          │
│                                                         │
│  ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”         ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”     │
│  │ Commit Log   │         │ Memtable (in-memory) │     │
│  │ (append-only)│         │ Sorted by partition   │     │
│  │              │         │ key + clustering cols │     │
│  │ ...          │         │                      │     │
│  │ {alice:Alice}│ ◄──┬──► │ alice → {name:Alice} │     │
│  │ ...          │    │    │                      │     │
│  ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜    │    ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜     │
│                      │               │ (when full)     │
│                  WRITE               ā–¼                  │
│                  (simultaneous)  ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”          │
│                                 │ SSTable  │          │
│                                 │ (on disk)│          │
│                                 ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜          │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                                    │
                                    ā–¼
                              ACK to Coordinator
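
To make the dual write concrete, here is a minimal Python sketch of what a single replica does per mutation. It is a toy model under loose assumptions (JSON lines as the commit log format, a plain dict as the memtable); the real implementation is Java and considerably more involved.

write-sketch.py
# Toy model of the per-replica write path: every mutation hits exactly
# two places, an append-only log on disk and a sorted structure in memory.
import json

class Replica:
    def __init__(self, commitlog_path):
        # Commit log: append-only file, sequential disk I/O (durability).
        self.commitlog = open(commitlog_path, "a", encoding="utf-8")
        # Memtable: recent writes in memory. Cassandra keeps it sorted in a
        # concurrent skip list; a plain dict is enough for this sketch.
        self.memtable = {}

    def write(self, partition_key, columns, timestamp):
        mutation = {"key": partition_key, "cols": columns, "ts": timestamp}
        # 1. Append to the commit log: no seek, no read-before-write.
        self.commitlog.write(json.dumps(mutation) + "\n")
        # 2. Apply to the memtable: last-write-wins per column by timestamp.
        row = self.memtable.setdefault(partition_key, {})
        for col, val in columns.items():
            if col not in row or row[col][1] < timestamp:
                row[col] = (val, timestamp)
        # 3. ACK. With commitlog_sync: batch, an fsync would precede this;
        #    with periodic, the fsync happens in the background.
        return "ACK"

replica = Replica("/tmp/commitlog-segment-1.log")
replica.write("alice", {"name": "Alice"}, timestamp=1)
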
02

Commit Log

The commit log is Cassandra's durability mechanism. It's an append-only file where every mutation is written sequentially. If a node crashes, the commit log is replayed on restart to recover any data that was in the memtable but not yet flushed to SSTables.

Commit Log Properties

  • āœ… Append-only — sequential writes only (fastest possible disk I/O)
  • āœ… Shared across all tables — one commit log for the entire node
  • āœ… Segmented — divided into segments (default 32MB) that are recycled
  • āœ… Synced to disk based on commitlog_sync setting (periodic or batch)
  • āœ… Deleted once all mutations in a segment are flushed to SSTables
cassandra.yaml
# Commit log sync modes:

# Periodic (default) — sync every N ms
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000  # 10 seconds
# Risk: up to 10s of data loss on crash
# Benefit: highest write throughput

# Batch — sync on every write (or batch of writes)
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2  # group writes within 2ms
# Risk: minimal data loss
# Benefit: stronger durability
# Cost: higher write latency (waits for fsync)

# Commit log location (separate disk recommended)
commitlog_directory: /var/lib/cassandra/commitlog
# Using a dedicated SSD for commit log avoids I/O contention
# with SSTable reads/compaction on the data disk

Separate Commit Log Disk

In production, place the commit log on a separate physical disk (or SSD) from the data directory. The commit log does sequential writes while data reads are random I/O. Mixing them on one disk causes contention. With SSDs this matters less, but dedicated commit log disks still improve p99 latency.
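
Recovery is then just a sequential re-read of the surviving segments, re-applying each mutation to an empty memtable. A hedged sketch, reusing the toy JSON-lines format from the write-path sketch above (Cassandra's actual commit log is a binary segment format with checksums):

replay-sketch.py
# Toy commit log replay: rebuild the memtable after a crash by re-applying
# every mutation in the order it was appended.
import json

def replay(commitlog_path):
    memtable = {}
    with open(commitlog_path, encoding="utf-8") as log:
        for line in log:                          # strictly sequential read
            m = json.loads(line)
            row = memtable.setdefault(m["key"], {})
            for col, val in m["cols"].items():
                if col not in row or row[col][1] < m["ts"]:
                    row[col] = (val, m["ts"])     # last-write-wins per column
    return memtable                               # node can serve reads again

print(replay("/tmp/commitlog-segment-1.log"))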

03

Memtable & SSTable Flush

The memtable is an in-memory data structure (one per table) that holds recent writes sorted by partition key and clustering columns. When the memtable reaches a size threshold, it's flushed to disk as an immutable SSTable (Sorted String Table).

Component   | Location           | Mutability                  | Purpose
------------|--------------------|-----------------------------|---------------------------
Memtable    | Memory (heap)      | Mutable (accepts writes)    | Fast reads of recent data
SSTable     | Disk               | Immutable (never modified)  | Persistent sorted data
Commit Log  | Disk (sequential)  | Append-only                 | Crash recovery
sstable-structure.txt
SSTable Components (on disk):
═══════════════════════════════════════════════════════════════
File                  | Purpose
═══════════════════════════════════════════════════════════════
*-Data.db             | Actual row data (sorted by token)
*-Index.db            | Partition index (token → offset in Data.db)
*-Summary.db          | Sampled index (every Nth entry from Index.db)
*-Filter.db           | Bloom filter (probabilistic membership test)
*-Statistics.db       | Metadata (min/max timestamps, tombstone count)
*-CompressionInfo.db  | Compression offsets for random access
*-TOC.txt             | Table of contents listing all components
═══════════════════════════════════════════════════════════════

Flush triggers:
  - Memtable size exceeds memtable_cleanup_threshold
  - Commit log segment needs recycling
  - Manual: nodetool flush
  - Node shutdown (graceful)

Key property: SSTables are IMMUTABLE
  - Never modified after creation
  - Updates create new SSTables (old versions remain until compaction)
  - Deletes write tombstones (not actual deletion)

Immutability Is the Key Insight

SSTables are never modified after creation. An "update" to a row creates a new entry in a new SSTable with a newer timestamp. A "delete" writes a tombstone marker. The old data remains until compaction merges SSTables and discards obsolete versions. This immutability enables lock-free reads and simple backup (just copy files).
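
In the same toy model, a flush is nothing more than writing the memtable out in sorted key order to a file that is never touched again (hypothetical single-file format; real SSTables use the multi-component binary layout listed above):

flush-sketch.py
# Toy memtable flush: persist entries in sorted key order, then clear.
import json

def flush(memtable, sstable_path):
    with open(sstable_path, "w", encoding="utf-8") as sstable:
        for key in sorted(memtable):      # sorted, hence "Sorted String Table"
            sstable.write(json.dumps({"key": key, "row": memtable[key]}) + "\n")
    memtable.clear()                      # commit log segments now recyclable
    # The file is never modified again; later updates land in new SSTables.

flush({"bob": {"name": ["Bob", 2]}, "alice": {"name": ["Alice", 1]}},
      "/tmp/sstable-1-data.json")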

04

Why Writes Are Fast

Cassandra writes are consistently fast (sub-millisecond on the node) because the write path avoids every operation that makes traditional databases slow: no read-before-write, no random I/O, no lock contention, no index maintenance on the hot path.

Operation           | Traditional RDBMS                 | Cassandra
--------------------|-----------------------------------|-------------------------------------------
Disk I/O pattern    | Random (B-tree updates)           | Sequential only (append to commit log)
Read before write   | Yes (check constraints, indexes)  | No (blind write, last-write-wins)
Lock acquisition    | Row/page locks                    | None (append-only structures)
Index update        | Synchronous B-tree insert         | Deferred (handled during flush/compaction)
Fsync frequency     | Every commit                      | Configurable (periodic or batch)
Write amplification | High (WAL + B-tree + indexes)     | Low (commit log + memtable only)

Why Cassandra Writes Are O(1)

  • āœ… Commit log: sequential append — the fastest possible disk operation
  • āœ… Memtable: in-memory sorted insert — no disk I/O
  • āœ… No read-before-write: UPDATE doesn't check if row exists
  • āœ… No constraint checking: no unique constraints (except LWT), no foreign keys
  • āœ… No lock contention: append-only structures don't need locks
  • āœ… Constant time regardless of table size — 1 row or 1 billion rows, same speed
šŸ“®

The Mailbox vs The Filing Cabinet

A traditional database is like a filing cabinet — to add a document, you must find the right drawer, the right folder, move other documents aside, and file it in order. Cassandra is like a mailbox — you just drop the letter in the slot. It's always fast because you never open the box to organize. Organization happens later (compaction) in the background.

The Trade-off

The speed of writes comes at a cost: reads are more complex. Because updates don't modify existing data (they create new entries), a read might need to check multiple SSTables to find the latest version of a row. This is read amplification — the price paid for fast writes.

05

Read Path Overview

The read path is more complex than the write path. Cassandra must check multiple locations — the memtable and potentially many SSTables — to assemble the complete, most-recent version of the requested data. Multiple optimizations (bloom filters, caches, indexes) minimize the actual disk reads required.

1

Check Memtable

First check the in-memory memtable — if the data was recently written, it's here (fastest path)

2

Check Row Cache (if enabled)

Optional: check the row cache for frequently-read hot rows (stores deserialized row data)

3

Check Bloom Filters

For each SSTable, check its bloom filter: 'Could this partition exist in this SSTable?' If no → skip entirely

4

Check Partition Key Cache

If bloom filter says 'maybe', check the key cache for the exact byte offset of this partition in the SSTable

5

Read Partition Index

If not in key cache, read the partition index (Summary → Index → Data) to find the offset

6

Read SSTable Data

Seek to the offset in the Data.db file and read the partition data

7

Merge Results

Merge data from memtable + all relevant SSTables, resolve by timestamp (newest wins), apply tombstones

read-path.txt
Read Path Decision Tree:

SELECT * FROM users WHERE user_id = 'alice'
                          │
                          ā–¼
                   ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
                   │  Memtable   │──── Found? Include in merge
                   ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                          │
                          ā–¼
                   ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
                   │  Row Cache  │──── Hit? Return immediately
                   ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                          │
                          ā–¼
              ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
              │ For each SSTable:     │
              │                       │
              │  Bloom Filter ──No──► Skip this SSTable
              │       │               │
              │      Maybe            │
              │       │               │
              │  Key Cache ───Hit──► Read at offset
              │       │               │
              │      Miss             │
              │       │               │
              │  Summary Index        │
              │       │               │
              │  Partition Index      │
              │       │               │
              │  Read Data.db         │
              ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                          │
                          ā–¼
                   ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
                   │   Merge     │ Newest timestamp wins
                   │   Results   │ Apply tombstones
                   ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                          │
                          ā–¼
                   Return to client
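
The final merge step is where correctness lives: every surviving source contributes its columns and the newest timestamp wins per column. A sketch in the same toy model (None stands in for a tombstone; bloom-filter elimination of SSTables is assumed to have already happened):

merge-sketch.py
# Toy read merge across memtable + SSTables: newest timestamp wins.
def read(key, memtable, sstables):
    merged = {}
    for source in [memtable] + sstables:  # only bloom-filter survivors, in reality
        row = source.get(key)
        if row is None:
            continue                      # partition absent from this source
        for col, (val, ts) in row.items():
            if col not in merged or merged[col][1] < ts:
                merged[col] = (val, ts)
    # Apply tombstones: a column whose newest value is None is deleted.
    return {c: v for c, (v, ts) in merged.items() if v is not None}

memtable = {"alice": {"name": ("Alice Smith", 5)}}
sstables = [{"alice": {"name": ("Alice", 1), "age": (27, 4), "email": ("a@b.com", 3)}}]
print(read("alice", memtable, sstables))
# {'name': 'Alice Smith', 'age': 27, 'email': 'a@b.com'}
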
06

Bloom Filters & Caches

Bloom filters are the most important read optimization in Cassandra. They answer the question "does this partition key exist in this SSTable?" with zero disk I/O. A bloom filter can say "definitely not here" (100% accurate) or "maybe here" (small false positive rate).

Cache/Filter | What It Stores                          | Hit Effect                         | Memory Cost
-------------|-----------------------------------------|------------------------------------|--------------------------------------
Bloom Filter | Probabilistic set membership            | Eliminates SSTable from read path  | ~1.5 bytes per partition per SSTable
Key Cache    | Partition key → byte offset in SSTable  | Skips index lookup, direct seek    | Configurable (default 100MB)
Row Cache    | Full deserialized partition data        | Skips all disk I/O                 | High (stores full rows in heap)
Chunk Cache  | Decompressed SSTable chunks             | Avoids re-decompression            | Off-heap, configurable
bloom-filter.txt
Bloom Filter Example:

SSTable has partitions: [alice, bob, charlie, dave]
Bloom filter: bit array with multiple hash functions

Query: "Does 'eve' exist in this SSTable?"
  hash1('eve') → bit 7 → 0 (not set)
  → DEFINITELY NOT HERE. Skip this SSTable entirely.

Query: "Does 'alice' exist in this SSTable?"
  hash1('alice') → bit 3 → 1
  hash2('alice') → bit 9 → 1
  hash3('alice') → bit 2 → 1
  → MAYBE HERE. Must check the actual SSTable.

False positive rate (default): ~1%
  - 99% of unnecessary SSTable reads are eliminated
  - Configurable via bloom_filter_fp_chance (per table)
  - Lower fp_chance = more memory, fewer false positives

-- Tune bloom filter for read-heavy tables:
ALTER TABLE hot_table
  WITH bloom_filter_fp_chance = 0.001;  -- 0.1% false positive
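
A bloom filter itself is just a bit array and a few hash functions. This self-contained sketch (hypothetical sizing, salted SHA-256 as the hash family) shows why a "no" is definitive while a "yes" is only a "maybe":

bloom-sketch.py
# Minimal bloom filter: k salted hashes set k bits per key.
# Sizing rule of thumb: bits per key ā‰ˆ -ln(p) / (ln 2)^2,
# i.e. ~9.6 bits (~1.2 bytes) per key for a 1% false positive rate.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0                     # an int used as the bit array

    def _positions(self, key):
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # Any unset bit means a definite "no"; all set means only "maybe".
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
for key in ["alice", "bob", "charlie", "dave"]:
    bf.add(key)
print(bf.might_contain("alice"))  # True: must actually read the SSTable
print(bf.might_contain("eve"))    # False (usually): skip SSTable, no disk I/O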

Key Cache Is Almost Always Worth It

The key cache maps partition keys to their exact byte offset in SSTable data files. A key cache hit means one direct disk seek instead of reading through the partition index. Default size is 100MB — for most workloads, this caches the majority of active partitions. Monitor key_cache_hit_rate in nodetool info.

Cache Recommendations

  • āœ… Bloom filters: always enabled, tune fp_chance for read-heavy tables
  • āœ… Key cache: always enabled (default 100MB), monitor hit rate
  • āœ… Row cache: rarely useful — only for small, frequently-read, rarely-updated tables
  • āœ… Chunk cache: useful for compressed tables with repeated reads
  • āœ… Off-heap caches preferred to avoid GC pressure
07

Read Amplification

Read amplification is the number of SSTables that must be checked to serve a single read. Because Cassandra never modifies SSTables in place, a single partition's data may be spread across many SSTables — each containing a different version or subset of the row's columns.

šŸ“š

The Newspaper Archive

Imagine looking up a company's stock price history. Each day's newspaper is a separate SSTable. To get the full history, you must check every newspaper. If you only want today's price, you still might need to check several recent papers because corrections (updates) appear in later editions. Compaction is like publishing a consolidated annual report — merging all daily papers into one volume.

read-amplification.txt
Read Amplification Example:

Partition 'alice' has been updated 5 times:
  SSTable 1 (oldest): alice = {name: "Alice", age: 25}
  SSTable 2:          alice = {age: 26}  (only age updated)
  SSTable 3:          alice = {email: "a@b.com"}  (new column)
  SSTable 4:          alice = {age: 27}
  SSTable 5 (newest): alice = {name: "Alice Smith"}

To read alice's current state:
  → Must check ALL 5 SSTables
  → Merge: {name: "Alice Smith", age: 27, email: "a@b.com"}
  → Read amplification = 5

After compaction merges SSTables 1-5 into one:
  New SSTable: alice = {name: "Alice Smith", age: 27, email: "a@b.com"}
  → Read amplification = 1

Monitoring:
  nodetool tablestats my_keyspace.my_table
  → SSTable count: indicates read amplification potential
  → Read latency: p99 correlates with SSTable count
Factor              | Increases Read Amplification          | Decreases Read Amplification
--------------------|---------------------------------------|-------------------------------------
Write pattern       | Frequent updates to same partition    | Write-once data (time-series)
Compaction          | Falling behind / not running          | Aggressive compaction strategy
Table design        | Wide partitions with partial updates  | Narrow partitions, full overwrites
Bloom filters       | High false positive rate              | Low fp_chance setting
Compaction strategy | STCS (size-tiered)                    | LCS (leveled)

Compaction Is the Cure

Compaction merges multiple SSTables into fewer, larger ones — discarding obsolete versions and expired tombstones. It directly reduces read amplification. If your reads are slow, check SSTable count per table (nodetool tablestats). High counts mean compaction is falling behind.
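
Conceptually, compaction is the read-path merge run once over whole SSTables and written back out, so future reads pay the merge cost only once. A sketch in the same toy model (grossly simplified: real compaction streams token-ordered data and retains tombstones for gc_grace_seconds):

compaction-sketch.py
# Toy compaction: merge several SSTables into one, newest version wins.
def compact(sstables):
    merged = {}
    for sstable in sstables:          # input order irrelevant; timestamps decide
        for key, row in sstable.items():
            out = merged.setdefault(key, {})
            for col, (val, ts) in row.items():
                if col not in out or out[col][1] < ts:
                    out[col] = (val, ts)
    # Drop tombstones (None values). Simplified: real compaction keeps them
    # for gc_grace_seconds so deletes can propagate to every replica.
    return {key: {c: cv for c, cv in row.items() if cv[0] is not None}
            for key, row in merged.items()}

old   = {"alice": {"name": ("Alice", 1), "age": (25, 1)}}
newer = {"alice": {"age": (27, 4), "name": ("Alice Smith", 5)}}
print(compact([old, newer]))   # one SSTable; read amplification back to 1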

08

Interview Questions

Q:Describe the Cassandra write path and explain why writes are fast.

A: Write path: (1) append to commit log (sequential I/O), (2) write to memtable (in-memory). That's it — ACK returned. Writes are fast because: no read-before-write, no random I/O, no lock contention, no index updates on the hot path. The commit log is sequential (fastest disk operation), and the memtable is in-memory. Writes are O(1) regardless of table size.

Q:What is an SSTable and why is it immutable?

A: An SSTable (Sorted String Table) is an immutable on-disk file containing partition data sorted by token. Immutability enables: (1) lock-free concurrent reads, (2) simple backups (copy files), (3) efficient compaction (merge-sort without locks), (4) no write amplification from in-place updates. The trade-off: updates create new SSTables, causing read amplification until compaction merges them.

Q:How do bloom filters optimize the read path?

A: Each SSTable has a bloom filter — a probabilistic data structure that answers 'does this partition key exist in this SSTable?' It can say 'definitely no' (skip the SSTable entirely) or 'maybe yes' (must check). With default 1% false positive rate, 99% of unnecessary SSTable reads are eliminated. Bloom filters are checked in memory with zero disk I/O, making them extremely fast.

Q:What is read amplification and how do you reduce it?

A: Read amplification is the number of SSTables checked per read. It increases when the same partition is updated across many SSTables. Reduction strategies: (1) compaction merges SSTables (LCS gives lowest read amplification), (2) tune bloom filters to reduce false positives, (3) design tables for write-once patterns, (4) use key cache for direct offset lookups, (5) monitor with nodetool tablestats.

Q:What is the role of the commit log vs the memtable?

A: Commit log = durability (survives crashes). Memtable = performance (serves reads from memory). They serve different purposes: the commit log is append-only sequential I/O for crash recovery — it's replayed on restart to rebuild the memtable. The memtable is a sorted in-memory structure for fast reads of recent data. Once the memtable flushes to an SSTable, the corresponding commit log segments can be deleted.

09

Common Mistakes

šŸ’¾

Putting commit log and data on the same disk

The commit log does sequential writes while data reads are random I/O. Sharing a disk causes I/O contention — commit log writes wait behind random reads, increasing write latency.

āœ… Use a dedicated SSD for the commit log (commitlog_directory). Even a small, fast SSD is sufficient since commit log segments are recycled.

šŸ—‘ļø

Enabling row cache for frequently-updated tables

Row cache stores full deserialized rows in heap memory. Every update invalidates the cache entry, causing constant cache churn, GC pressure, and no actual benefit.

āœ… Row cache is only useful for small, rarely-updated, frequently-read tables (like configuration data). For most tables, rely on key cache and bloom filters instead.

šŸ“Š

Ignoring SSTable count as a performance indicator

Not monitoring SSTable count per table. High counts mean compaction is falling behind, causing read amplification — every read checks more files, increasing latency.

āœ… Monitor with nodetool tablestats. If SSTable count grows continuously, investigate compaction throughput, disk space, or switch compaction strategy (STCS → LCS for read-heavy tables).

āš™ļø

Using commitlog_sync: periodic without understanding the risk

Periodic sync (default) only fsyncs every 10 seconds. A crash within that window loses up to 10 seconds of acknowledged writes. Teams assume 'acknowledged = durable' which isn't true with periodic sync.

āœ… For critical data, use commitlog_sync: batch (fsyncs on every write batch). Accept the ~2ms latency increase. Or understand and document the data loss window for your use case.

šŸ”„

Frequent partial updates to wide partitions

Updating individual columns in a wide partition creates many small SSTable entries. Reading the full partition requires merging across all these SSTables — severe read amplification.

āœ… Design for full-row writes where possible. If partial updates are necessary, use LCS compaction to keep read amplification low, and monitor SSTable count.