
Write & Read Path

Cassandra's write path is optimized for speed — sequential I/O only, no read-before-write. The read path absorbs the resulting complexity, using bloom filters and caches to minimize disk access.

01

Write Path Overview

Cassandra's write path is designed for maximum throughput. Every write goes to two places simultaneously: the commit log (for durability) and the memtable (for serving reads). There is no read-before-write, no lock acquisition, no index update on the write path. This is why Cassandra writes are so fast.

1

Client Sends Write

Client sends INSERT/UPDATE to coordinator node (or directly to replica via token-aware driver)

2

Coordinator Forwards

Coordinator identifies replica nodes from token ring and forwards write to all replicas simultaneously

3

Commit Log (Sequential Write)

Each replica appends the mutation to its commit log — a sequential, append-only file on disk. This is the durability guarantee.

4

Memtable (In-Memory Write)

The mutation is also written to the memtable — an in-memory sorted data structure (skip list or similar)

5

Acknowledge

Once commit log and memtable are written, the replica sends ACK to coordinator

6

Coordinator Responds

Once enough ACKs received (based on consistency level), coordinator returns success to client

šŸ“

The Accountant's Ledger

Imagine an accountant who records every transaction in two places: a permanent ledger (commit log) written sequentially page by page, and sticky notes on a whiteboard sorted by account (memtable). The ledger is for safety — if the office burns down, the ledger survives. The whiteboard is for speed — finding the latest balance is instant. Periodically, the whiteboard is photographed and filed (SSTable flush), and the sticky notes are cleared.

write-path.txt
Write Path on a Single Replica Node:

Client Request: INSERT INTO users (id, name) VALUES ('alice', 'Alice')
                                    │
                                    ā–¼
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                    Replica Node                          │
│                                                         │
│  ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”         ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”     │
│  │ Commit Log   │         │ Memtable (in-memory) │     │
│  │ (append-only)│         │ Sorted by partition   │     │
│  │              │         │ key + clustering cols │     │
│  │ ...          │         │                      │     │
│  │ {alice:Alice}│ ◄──┬──► │ alice → {name:Alice} │     │
│  │ ...          │    │    │                      │     │
│  ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜    │    ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜     │
│                      │               │ (when full)     │
│                  WRITE               ā–¼                  │
│                  (simultaneous)  ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”          │
│                                 │ SSTable  │          │
│                                 │ (on disk)│          │
│                                 ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜          │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                                    │
                                    ā–¼
                              ACK to Coordinator
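
To make the dual write concrete, here is a minimal Python sketch of what a single replica does per mutation. It is a toy model under loose assumptions (JSON lines as the commit log format, a plain dict as the memtable); the real implementation is Java and considerably more involved.

write-sketch.py
# Toy model of the per-replica write path: every mutation hits exactly
# two places, an append-only log on disk and a sorted structure in memory.
import json

class Replica:
    def __init__(self, commitlog_path):
        # Commit log: append-only file, sequential disk I/O (durability).
        self.commitlog = open(commitlog_path, "a", encoding="utf-8")
        # Memtable: recent writes in memory. Cassandra keeps it sorted in a
        # concurrent skip list; a plain dict is enough for this sketch.
        self.memtable = {}

    def write(self, partition_key, columns, timestamp):
        mutation = {"key": partition_key, "cols": columns, "ts": timestamp}
        # 1. Append to the commit log: no seek, no read-before-write.
        self.commitlog.write(json.dumps(mutation) + "\n")
        # 2. Apply to the memtable: last-write-wins per column by timestamp.
        row = self.memtable.setdefault(partition_key, {})
        for col, val in columns.items():
            if col not in row or row[col][1] < timestamp:
                row[col] = (val, timestamp)
        # 3. ACK. With commitlog_sync: batch, an fsync would precede this;
        #    with periodic, the fsync happens in the background.
        return "ACK"

replica = Replica("/tmp/commitlog-segment-1.log")
replica.write("alice", {"name": "Alice"}, timestamp=1)
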
02

Commit Log

The commit log is Cassandra's durability mechanism. It's an append-only file where every mutation is written sequentially. If a node crashes, the commit log is replayed on restart to recover any data that was in the memtable but not yet flushed to SSTables.

Commit Log Properties

  • āœ… Append-only — sequential writes only (fastest possible disk I/O)
  • āœ… Shared across all tables — one commit log for the entire node
  • āœ… Segmented — divided into segments (default 32MB) that are recycled
  • āœ… Synced to disk based on commitlog_sync setting (periodic or batch)
  • āœ… Deleted once all mutations in a segment are flushed to SSTables
cassandra.yaml
# Commit log sync modes:

# Periodic (default) — sync every N ms
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000  # 10 seconds
# Risk: up to 10s of data loss on crash
# Benefit: highest write throughput

# Batch — sync on every write (or batch of writes)
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2  # group writes within 2ms
# Risk: minimal data loss
# Benefit: stronger durability
# Cost: higher write latency (waits for fsync)

# Commit log location (separate disk recommended)
commitlog_directory: /var/lib/cassandra/commitlog
# Using a dedicated SSD for commit log avoids I/O contention
# with SSTable reads/compaction on the data disk

Separate Commit Log Disk

In production, place the commit log on a separate physical disk (or SSD) from the data directory. The commit log does sequential writes while data reads are random I/O. Mixing them on one disk causes contention. With SSDs this matters less, but dedicated commit log disks still improve p99 latency.
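
Recovery is then just a sequential re-read of the surviving segments, re-applying each mutation to an empty memtable. A hedged sketch, reusing the toy JSON-lines format from the write-path sketch above (Cassandra's actual commit log is a binary segment format with checksums):

replay-sketch.py
# Toy commit log replay: rebuild the memtable after a crash by re-applying
# every mutation in the order it was appended.
import json

def replay(commitlog_path):
    memtable = {}
    with open(commitlog_path, encoding="utf-8") as log:
        for line in log:                          # strictly sequential read
            m = json.loads(line)
            row = memtable.setdefault(m["key"], {})
            for col, val in m["cols"].items():
                if col not in row or row[col][1] < m["ts"]:
                    row[col] = (val, m["ts"])     # last-write-wins per column
    return memtable                               # node can serve reads again

print(replay("/tmp/commitlog-segment-1.log"))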

03

Memtable & SSTable Flush

The memtable is an in-memory data structure (one per table) that holds recent writes sorted by partition key and clustering columns. When the memtable reaches a size threshold, it's flushed to disk as an immutable SSTable (Sorted String Table).

Component   | Location           | Mutability                  | Purpose
------------|--------------------|-----------------------------|---------------------------
Memtable    | Memory (heap)      | Mutable (accepts writes)    | Fast reads of recent data
SSTable     | Disk               | Immutable (never modified)  | Persistent sorted data
Commit Log  | Disk (sequential)  | Append-only                 | Crash recovery
sstable-structure.txt
SSTable Components (on disk):
═══════════════════════════════════════════════════════════════
File                  | Purpose
═══════════════════════════════════════════════════════════════
*-Data.db             | Actual row data (sorted by token)
*-Index.db            | Partition index (token → offset in Data.db)
*-Summary.db          | Sampled index (every Nth entry from Index.db)
*-Filter.db           | Bloom filter (probabilistic membership test)
*-Statistics.db       | Metadata (min/max timestamps, tombstone count)
*-CompressionInfo.db  | Compression offsets for random access
*-TOC.txt             | Table of contents listing all components
═══════════════════════════════════════════════════════════════

Flush triggers:
  - Memtable size exceeds memtable_cleanup_threshold
  - Commit log segment needs recycling
  - Manual: nodetool flush
  - Node shutdown (graceful)

Key property: SSTables are IMMUTABLE
  - Never modified after creation
  - Updates create new SSTables (old versions remain until compaction)
  - Deletes write tombstones (not actual deletion)

Immutability Is the Key Insight

SSTables are never modified after creation. An "update" to a row creates a new entry in a new SSTable with a newer timestamp. A "delete" writes a tombstone marker. The old data remains until compaction merges SSTables and discards obsolete versions. This immutability enables lock-free reads and simple backup (just copy files).
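
In the same toy model, a flush is nothing more than writing the memtable out in sorted key order to a file that is never touched again (hypothetical single-file format; real SSTables use the multi-component binary layout listed above):

flush-sketch.py
# Toy memtable flush: persist entries in sorted key order, then clear.
import json

def flush(memtable, sstable_path):
    with open(sstable_path, "w", encoding="utf-8") as sstable:
        for key in sorted(memtable):      # sorted, hence "Sorted String Table"
            sstable.write(json.dumps({"key": key, "row": memtable[key]}) + "\n")
    memtable.clear()                      # commit log segments now recyclable
    # The file is never modified again; later updates land in new SSTables.

flush({"bob": {"name": ["Bob", 2]}, "alice": {"name": ["Alice", 1]}},
      "/tmp/sstable-1-data.json")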

04

Why Writes Are Fast

Cassandra writes are consistently fast (sub-millisecond on the node) because the write path avoids every operation that makes traditional databases slow: no read-before-write, no random I/O, no lock contention, no index maintenance on the hot path.

Operation           | Traditional RDBMS                 | Cassandra
--------------------|-----------------------------------|-------------------------------------------
Disk I/O pattern    | Random (B-tree updates)           | Sequential only (append to commit log)
Read before write   | Yes (check constraints, indexes)  | No (blind write, last-write-wins)
Lock acquisition    | Row/page locks                    | None (append-only structures)
Index update        | Synchronous B-tree insert         | Deferred (handled during flush/compaction)
Fsync frequency     | Every commit                      | Configurable (periodic or batch)
Write amplification | High (WAL + B-tree + indexes)     | Low (commit log + memtable only)

Why Cassandra Writes Are O(1)

  • āœ… Commit log: sequential append — the fastest possible disk operation
  • āœ… Memtable: in-memory sorted insert — no disk I/O
  • āœ… No read-before-write: UPDATE doesn't check if row exists
  • āœ… No constraint checking: no unique constraints (except LWT), no foreign keys
  • āœ… No lock contention: append-only structures don't need locks
  • āœ… Constant time regardless of table size — 1 row or 1 billion rows, same speed
šŸ“®

The Mailbox vs The Filing Cabinet

A traditional database is like a filing cabinet — to add a document, you must find the right drawer, the right folder, move other documents aside, and file it in order. Cassandra is like a mailbox — you just drop the letter in the slot. It's always fast because you never open the box to organize. Organization happens later (compaction) in the background.

The Trade-off

The speed of writes comes at a cost: reads are more complex. Because updates don't modify existing data (they create new entries), a read might need to check multiple SSTables to find the latest version of a row. This is read amplification — the price paid for fast writes.

05

Read Path Overview

The read path is more complex than the write path. Cassandra must check multiple locations — the memtable and potentially many SSTables — to assemble the complete, most-recent version of the requested data. Multiple optimizations (bloom filters, caches, indexes) minimize the actual disk reads required.

1

Check Memtable

First check the in-memory memtable — if the data was recently written, it's here (fastest path)

2

Check Row Cache (if enabled)

Optional: check the row cache for frequently-read hot rows (stores deserialized row data)

3

Check Bloom Filters

For each SSTable, check its bloom filter: 'Could this partition exist in this SSTable?' If no → skip entirely

4

Check Partition Key Cache

If bloom filter says 'maybe', check the key cache for the exact byte offset of this partition in the SSTable

5

Read Partition Index

If not in key cache, read the partition index (Summary → Index → Data) to find the offset

6

Read SSTable Data

Seek to the offset in the Data.db file and read the partition data

7

Merge Results

Merge data from memtable + all relevant SSTables, resolve by timestamp (newest wins), apply tombstones

read-path.txt
Read Path Decision Tree:

SELECT * FROM users WHERE user_id = 'alice'
                          │
                          ā–¼
                   ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
                   │  Memtable   │──── Found? Include in merge
                   ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                          │
                          ā–¼
                   ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
                   │  Row Cache  │──── Hit? Return immediately
                   ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                          │
                          ā–¼
              ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
              │ For each SSTable:     │
              │                       │
              │  Bloom Filter ──No──► Skip this SSTable
              │       │               │
              │      Maybe            │
              │       │               │
              │  Key Cache ───Hit──► Read at offset
              │       │               │
              │      Miss             │
              │       │               │
              │  Summary Index        │
              │       │               │
              │  Partition Index      │
              │       │               │
              │  Read Data.db         │
              ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                          │
                          ā–¼
                   ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
                   │   Merge     │ Newest timestamp wins
                   │   Results   │ Apply tombstones
                   ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                          │
                          ā–¼
                   Return to client
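
The final merge step is where correctness lives: every surviving source contributes its columns and the newest timestamp wins per column. A sketch in the same toy model (None stands in for a tombstone; bloom-filter elimination of SSTables is assumed to have already happened):

merge-sketch.py
# Toy read merge across memtable + SSTables: newest timestamp wins.
def read(key, memtable, sstables):
    merged = {}
    for source in [memtable] + sstables:  # only bloom-filter survivors, in reality
        row = source.get(key)
        if row is None:
            continue                      # partition absent from this source
        for col, (val, ts) in row.items():
            if col not in merged or merged[col][1] < ts:
                merged[col] = (val, ts)
    # Apply tombstones: a column whose newest value is None is deleted.
    return {c: v for c, (v, ts) in merged.items() if v is not None}

memtable = {"alice": {"name": ("Alice Smith", 5)}}
sstables = [{"alice": {"name": ("Alice", 1), "age": (27, 4), "email": ("a@b.com", 3)}}]
print(read("alice", memtable, sstables))
# {'name': 'Alice Smith', 'age': 27, 'email': 'a@b.com'}
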
06

Bloom Filters & Caches

Bloom filters are the most important read optimization in Cassandra. They answer the question "does this partition key exist in this SSTable?" with zero disk I/O. A bloom filter can say "definitely not here" (100% accurate) or "maybe here" (small false positive rate).

Cache/Filter | What It Stores                          | Hit Effect                         | Memory Cost
-------------|-----------------------------------------|------------------------------------|--------------------------------------
Bloom Filter | Probabilistic set membership            | Eliminates SSTable from read path  | ~1.5 bytes per partition per SSTable
Key Cache    | Partition key → byte offset in SSTable  | Skips index lookup, direct seek    | Configurable (default 100MB)
Row Cache    | Full deserialized partition data        | Skips all disk I/O                 | High (stores full rows in heap)
Chunk Cache  | Decompressed SSTable chunks             | Avoids re-decompression            | Off-heap, configurable
bloom-filter.txt
Bloom Filter Example:

SSTable has partitions: [alice, bob, charlie, dave]
Bloom filter: bit array with multiple hash functions

Query: "Does 'eve' exist in this SSTable?"
  hash1('eve') → bit 7 → 0 (not set)
  → DEFINITELY NOT HERE. Skip this SSTable entirely.

Query: "Does 'alice' exist in this SSTable?"
  hash1('alice') → bit 3 → 1
  hash2('alice') → bit 9 → 1
  hash3('alice') → bit 2 → 1
  → MAYBE HERE. Must check the actual SSTable.

False positive rate (default): ~1%
  - 99% of unnecessary SSTable reads are eliminated
  - Configurable via bloom_filter_fp_chance (per table)
  - Lower fp_chance = more memory, fewer false positives

-- Tune bloom filter for read-heavy tables:
ALTER TABLE hot_table
  WITH bloom_filter_fp_chance = 0.001;  -- 0.1% false positive
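
A bloom filter itself is just a bit array and a few hash functions. This self-contained sketch (hypothetical sizing, salted SHA-256 as the hash family) shows why a "no" is definitive while a "yes" is only a "maybe":

bloom-sketch.py
# Minimal bloom filter: k salted hashes set k bits per key.
# Sizing rule of thumb: bits per key ā‰ˆ -ln(p) / (ln 2)^2,
# i.e. ~9.6 bits (~1.2 bytes) per key for a 1% false positive rate.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0                     # an int used as the bit array

    def _positions(self, key):
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # Any unset bit means a definite "no"; all set means only "maybe".
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
for key in ["alice", "bob", "charlie", "dave"]:
    bf.add(key)
print(bf.might_contain("alice"))  # True: must actually read the SSTable
print(bf.might_contain("eve"))    # False (usually): skip SSTable, no disk I/O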

Key Cache Is Almost Always Worth It

The key cache maps partition keys to their exact byte offset in SSTable data files. A key cache hit means one direct disk seek instead of reading through the partition index. Default size is 100MB — for most workloads, this caches the majority of active partitions. Monitor key_cache_hit_rate in nodetool info.

Cache Recommendations

  • āœ… Bloom filters: always enabled, tune fp_chance for read-heavy tables
  • āœ… Key cache: always enabled (default 100MB), monitor hit rate
  • āœ… Row cache: rarely useful — only for small, frequently-read, rarely-updated tables
  • āœ… Chunk cache: useful for compressed tables with repeated reads
  • āœ… Off-heap caches preferred to avoid GC pressure
07

Read Amplification

Read amplification is the number of SSTables that must be checked to serve a single read. Because Cassandra never modifies SSTables in place, a single partition's data may be spread across many SSTables — each containing a different version or subset of the row's columns.

šŸ“š

The Newspaper Archive

Imagine looking up a company's stock price history. Each day's newspaper is a separate SSTable. To get the full history, you must check every newspaper. If you only want today's price, you still might need to check several recent papers because corrections (updates) appear in later editions. Compaction is like publishing a consolidated annual report — merging all daily papers into one volume.

read-amplification.txt
Read Amplification Example:

Partition 'alice' has been updated 5 times:
  SSTable 1 (oldest): alice = {name: "Alice", age: 25}
  SSTable 2:          alice = {age: 26}  (only age updated)
  SSTable 3:          alice = {email: "a@b.com"}  (new column)
  SSTable 4:          alice = {age: 27}
  SSTable 5 (newest): alice = {name: "Alice Smith"}

To read alice's current state:
  → Must check ALL 5 SSTables
  → Merge: {name: "Alice Smith", age: 27, email: "a@b.com"}
  → Read amplification = 5

After compaction merges SSTables 1-5 into one:
  New SSTable: alice = {name: "Alice Smith", age: 27, email: "a@b.com"}
  → Read amplification = 1

Monitoring:
  nodetool tablestats my_keyspace.my_table
  → SSTable count: indicates read amplification potential
  → Read latency: p99 correlates with SSTable count
Factor              | Increases Read Amplification          | Decreases Read Amplification
--------------------|---------------------------------------|-------------------------------------
Write pattern       | Frequent updates to same partition    | Write-once data (time-series)
Compaction          | Falling behind / not running          | Aggressive compaction strategy
Table design        | Wide partitions with partial updates  | Narrow partitions, full overwrites
Bloom filters       | High false positive rate              | Low fp_chance setting
Compaction strategy | STCS (size-tiered)                    | LCS (leveled)

Compaction Is the Cure

Compaction merges multiple SSTables into fewer, larger ones — discarding obsolete versions and expired tombstones. It directly reduces read amplification. If your reads are slow, check SSTable count per table (nodetool tablestats). High counts mean compaction is falling behind.
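
Conceptually, compaction is the read-path merge run once over whole SSTables and written back out, so future reads pay the merge cost only once. A sketch in the same toy model (grossly simplified: real compaction streams token-ordered data and retains tombstones for gc_grace_seconds):

compaction-sketch.py
# Toy compaction: merge several SSTables into one, newest version wins.
def compact(sstables):
    merged = {}
    for sstable in sstables:          # input order irrelevant; timestamps decide
        for key, row in sstable.items():
            out = merged.setdefault(key, {})
            for col, (val, ts) in row.items():
                if col not in out or out[col][1] < ts:
                    out[col] = (val, ts)
    # Drop tombstones (None values). Simplified: real compaction keeps them
    # for gc_grace_seconds so deletes can propagate to every replica.
    return {key: {c: cv for c, cv in row.items() if cv[0] is not None}
            for key, row in merged.items()}

old   = {"alice": {"name": ("Alice", 1), "age": (25, 1)}}
newer = {"alice": {"age": (27, 4), "name": ("Alice Smith", 5)}}
print(compact([old, newer]))   # one SSTable; read amplification back to 1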

08

Interview Questions

Q:Describe the Cassandra write path and explain why writes are fast.

A: Write path: (1) append to commit log (sequential I/O), (2) write to memtable (in-memory). That's it — ACK returned. Writes are fast because: no read-before-write, no random I/O, no lock contention, no index updates on the hot path. The commit log is sequential (fastest disk operation), and the memtable is in-memory. Writes are O(1) regardless of table size.

Q:What is an SSTable and why is it immutable?

A: An SSTable (Sorted String Table) is an immutable on-disk file containing partition data sorted by token. Immutability enables: (1) lock-free concurrent reads, (2) simple backups (copy files), (3) efficient compaction (merge-sort without locks), (4) no write amplification from in-place updates. The trade-off: updates create new SSTables, causing read amplification until compaction merges them.

Q:How do bloom filters optimize the read path?

A: Each SSTable has a bloom filter — a probabilistic data structure that answers 'does this partition key exist in this SSTable?' It can say 'definitely no' (skip the SSTable entirely) or 'maybe yes' (must check). With default 1% false positive rate, 99% of unnecessary SSTable reads are eliminated. Bloom filters are checked in memory with zero disk I/O, making them extremely fast.

Q:What is read amplification and how do you reduce it?

A: Read amplification is the number of SSTables checked per read. It increases when the same partition is updated across many SSTables. Reduction strategies: (1) compaction merges SSTables (LCS gives lowest read amplification), (2) tune bloom filters to reduce false positives, (3) design tables for write-once patterns, (4) use key cache for direct offset lookups, (5) monitor with nodetool tablestats.

Q:What is the role of the commit log vs the memtable?

A: Commit log = durability (survives crashes). Memtable = performance (serves reads from memory). They serve different purposes: the commit log is append-only sequential I/O for crash recovery — it's replayed on restart to rebuild the memtable. The memtable is a sorted in-memory structure for fast reads of recent data. Once the memtable flushes to an SSTable, the corresponding commit log segments can be deleted.

09

Common Mistakes

šŸ’¾

Putting commit log and data on the same disk

The commit log does sequential writes while data reads are random I/O. Sharing a disk causes I/O contention — commit log writes wait behind random reads, increasing write latency.

āœ… Use a dedicated SSD for the commit log (commitlog_directory). Even a small, fast SSD is sufficient since commit log segments are recycled.

šŸ—‘ļø

Enabling row cache for frequently-updated tables

Row cache stores full deserialized rows in heap memory. Every update invalidates the cache entry, causing constant cache churn, GC pressure, and no actual benefit.

āœ… Row cache is only useful for small, rarely-updated, frequently-read tables (like configuration data). For most tables, rely on key cache and bloom filters instead.

šŸ“Š

Ignoring SSTable count as a performance indicator

Not monitoring SSTable count per table. High counts mean compaction is falling behind, causing read amplification — every read checks more files, increasing latency.

āœ… Monitor with nodetool tablestats. If SSTable count grows continuously, investigate compaction throughput, disk space, or switch compaction strategy (STCS → LCS for read-heavy tables).

āš™ļø

Using commitlog_sync: periodic without understanding the risk

Periodic sync (default) only fsyncs every 10 seconds. A crash within that window loses up to 10 seconds of acknowledged writes. Teams assume 'acknowledged = durable' which isn't true with periodic sync.

āœ… For critical data, use commitlog_sync: batch (fsyncs on every write batch). Accept the ~2ms latency increase. Or understand and document the data loss window for your use case.

šŸ”„

Frequent partial updates to wide partitions

Updating individual columns in a wide partition creates many small SSTable entries. Reading the full partition requires merging across all these SSTables — severe read amplification.

āœ… Design for full-row writes where possible. If partial updates are necessary, use LCS compaction to keep read amplification low, and monitor SSTable count.