Write & Read Path
Cassandra's write path is optimized for speed: sequential I/O only, no read-before-write. The read path trades complexity for flexibility, using bloom filters and caches to minimize disk access.
Write Path Overview
Cassandra's write path is designed for maximum throughput. Every write goes to two places simultaneously: the commit log (for durability) and the memtable (for serving reads). There is no read-before-write, no lock acquisition, no index update on the write path. This is why Cassandra writes are so fast.
Client Sends Write
Client sends INSERT/UPDATE to coordinator node (or directly to replica via token-aware driver)
Coordinator Forwards
Coordinator identifies replica nodes from token ring and forwards write to all replicas simultaneously
Commit Log (Sequential Write)
Each replica appends the mutation to its commit log, a sequential, append-only file on disk. This is the durability guarantee.
Memtable (In-Memory Write)
The mutation is also written to the memtable, an in-memory sorted data structure (a skip list or similar)
Acknowledge
Once commit log and memtable are written, the replica sends ACK to coordinator
Coordinator Responds
Once enough ACKs received (based on consistency level), coordinator returns success to client
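The steps above, as seen by a single replica, can be sketched in plain Python (a hypothetical `Replica` class for illustration, not Cassandra's actual code):

```python
import json

class Replica:
    """Illustrative sketch of a single replica's write path (not Cassandra's real classes)."""

    def __init__(self, commitlog_path):
        # Append-only commit log: sequential disk I/O, opened once and only appended to.
        self.commitlog = open(commitlog_path, "a")
        # Memtable: in-memory map of partition key -> {column: (value, timestamp)}.
        self.memtable = {}

    def write(self, partition_key, columns, timestamp):
        # 1. Append the mutation to the commit log (durability).
        record = {"key": partition_key, "cols": columns, "ts": timestamp}
        self.commitlog.write(json.dumps(record) + "\n")
        self.commitlog.flush()  # real sync behavior depends on commitlog_sync
        # 2. Apply the mutation to the memtable (serves reads); newest timestamp wins.
        row = self.memtable.setdefault(partition_key, {})
        for col, value in columns.items():
            if col not in row or timestamp >= row[col][1]:
                row[col] = (value, timestamp)
        # 3. ACK: no read-before-write, no locks, no index maintenance.
        return "ACK"
```

Note that nothing on this path ever looks up existing data: an UPDATE and an INSERT take the identical two steps.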
The Accountant's Ledger
Imagine an accountant who records every transaction in two places: a permanent ledger (the commit log) written sequentially page by page, and sticky notes on a whiteboard sorted by account (the memtable). The ledger is for safety: if the office burns down, the ledger survives. The whiteboard is for speed: finding the latest balance is instant. Periodically, the whiteboard is photographed and filed (SSTable flush), and the sticky notes are cleared.
```
Write Path on a Single Replica Node:

Client Request: INSERT INTO users (id, name) VALUES ('alice', 'Alice')
        |
        v
+----------------------------------------------------------+
|                       Replica Node                        |
|                                                           |
|  +---------------+        +----------------------------+  |
|  | Commit Log    |        | Memtable (in-memory)       |  |
|  | (append-only) |        | Sorted by partition key    |  |
|  |  ...          |        | + clustering columns       |  |
|  | {alice:Alice} |        |                            |  |
|  |  ...          |        | alice -> {name: Alice}     |  |
|  +---------------+        +-------------+--------------+  |
|        ^                                |                 |
|        |  WRITE (simultaneous)          | (when full)     |
|        |                                v                 |
|        |                          +-----------+           |
|        +                          | SSTable   |           |
|                                   | (on disk) |           |
|                                   +-----------+           |
+----------------------------------------------------------+
        |
        v
  ACK to Coordinator
```
Commit Log
The commit log is Cassandra's durability mechanism. It's an append-only file where every mutation is written sequentially. If a node crashes, the commit log is replayed on restart to recover any data that was in the memtable but not yet flushed to SSTables.
Commit Log Properties
- ✓ Append-only: sequential writes only (the fastest possible disk I/O)
- ✓ Shared across all tables: one commit log for the entire node
- ✓ Segmented: divided into segments (default 32 MB) that are recycled
- ✓ Synced to disk based on the commitlog_sync setting (periodic or batch)
- ✓ Deleted once all mutations in a segment are flushed to SSTables
```yaml
# Commit log sync modes:

# Periodic (default) - sync every N ms
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000    # 10 seconds
# Risk: up to 10s of data loss on crash
# Benefit: highest write throughput

# Batch - sync on every write (or batch of writes)
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2  # group writes within 2ms
# Risk: minimal data loss
# Benefit: stronger durability
# Cost: higher write latency (waits for fsync)

# Commit log location (separate disk recommended)
commitlog_directory: /var/lib/cassandra/commitlog
# Using a dedicated SSD for the commit log avoids I/O contention
# with SSTable reads/compaction on the data disk
```
Separate Commit Log Disk
In production, place the commit log on a separate physical disk (or SSD) from the data directory. The commit log does sequential writes while data reads are random I/O. Mixing them on one disk causes contention. With SSDs this matters less, but dedicated commit log disks still improve p99 latency.
Memtable & SSTable Flush
The memtable is an in-memory data structure (one per table) that holds recent writes sorted by partition key and clustering columns. When the memtable reaches a size threshold, it's flushed to disk as an immutable SSTable (Sorted String Table).
| Component | Location | Mutability | Purpose |
|---|---|---|---|
| Memtable | Memory (heap) | Mutable (accepts writes) | Fast reads of recent data |
| SSTable | Disk | Immutable (never modified) | Persistent sorted data |
| Commit Log | Disk (sequential) | Append-only | Crash recovery |
SSTable Components (on disk):

| File | Purpose |
|---|---|
| *-Data.db | Actual row data (sorted by token) |
| *-Index.db | Partition index (token → offset in Data.db) |
| *-Summary.db | Sampled index (every Nth entry from Index.db) |
| *-Filter.db | Bloom filter (probabilistic membership test) |
| *-Statistics.db | Metadata (min/max timestamps, tombstone count) |
| *-CompressionInfo.db | Compression offsets for random access |
| *-TOC.txt | Table of contents listing all components |

Flush triggers:
- Memtable size exceeds memtable_cleanup_threshold
- Commit log segment needs recycling
- Manual: nodetool flush
- Node shutdown (graceful)

Key property: SSTables are IMMUTABLE
- Never modified after creation
- Updates create new SSTables (old versions remain until compaction)
- Deletes write tombstones (not actual deletion)
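The flush itself can be sketched in a few lines (a hypothetical helper, assuming a memtable shaped as `{partition_key: row}`): the memtable's contents are written out in sorted order, and the resulting structure is never touched again.

```python
def flush_memtable(memtable):
    """Flush a memtable to an 'SSTable': an immutable, sorted sequence of rows.
    Illustrative only; real SSTables sort by token and write the component files above."""
    sstable = tuple(sorted(memtable.items()))  # rows sorted by partition key; tuple = immutable
    memtable.clear()                           # memtable is emptied after a successful flush
    return sstable
```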
Immutability Is the Key Insight
SSTables are never modified after creation. An "update" to a row creates a new entry in a new SSTable with a newer timestamp. A "delete" writes a tombstone marker. The old data remains until compaction merges SSTables and discards obsolete versions. This immutability enables lock-free reads and simple backup (just copy files).
Why Writes Are Fast
Cassandra writes are consistently fast (sub-millisecond on the node) because the write path avoids every operation that makes traditional databases slow: no read-before-write, no random I/O, no lock contention, no index maintenance on the hot path.
| Operation | Traditional RDBMS | Cassandra |
|---|---|---|
| Disk I/O pattern | Random (B-tree updates) | Sequential only (append to commit log) |
| Read before write | Yes (check constraints, indexes) | No (blind write, last-write-wins) |
| Lock acquisition | Row/page locks | None (append-only structures) |
| Index update | Synchronous B-tree insert | Deferred (handled during flush/compaction) |
| Fsync frequency | Every commit | Configurable (periodic or batch) |
| Write amplification | High (WAL + B-tree + indexes) | Low (commit log + memtable only) |
Why Cassandra Writes Are O(1)
- ✓ Commit log: sequential append, the fastest possible disk operation
- ✓ Memtable: in-memory sorted insert, no disk I/O
- ✓ No read-before-write: UPDATE doesn't check if the row exists
- ✓ No constraint checking: no unique constraints (except LWT), no foreign keys
- ✓ No lock contention: append-only structures don't need locks
- ✓ Constant time regardless of table size: 1 row or 1 billion rows, same speed
The Mailbox vs The Filing Cabinet
A traditional database is like a filing cabinet: to add a document, you must find the right drawer, the right folder, move other documents aside, and file it in order. Cassandra is like a mailbox: you just drop the letter in the slot. It's always fast because you never open the box to organize. Organization happens later, in the background, during compaction.
The Trade-off
The speed of writes comes at a cost: reads are more complex. Because updates don't modify existing data (they create new entries), a read might need to check multiple SSTables to find the latest version of a row. This is read amplification: the price paid for fast writes.
Read Path Overview
The read path is more complex than the write path. Cassandra must check multiple locations (the memtable and potentially many SSTables) to assemble the complete, most-recent version of the requested data. Multiple optimizations (bloom filters, caches, indexes) minimize the actual disk reads required.
Check Memtable
First check the in-memory memtable: if the data was recently written, it's here (fastest path)
Check Row Cache (if enabled)
Optional: check the row cache for frequently-read hot rows (stores deserialized row data)
Check Bloom Filters
For each SSTable, check its bloom filter: 'Could this partition exist in this SSTable?' If the answer is no, skip it entirely
Check Partition Key Cache
If bloom filter says 'maybe', check the key cache for the exact byte offset of this partition in the SSTable
Read Partition Index
If not in the key cache, read the partition index (Summary → Index → Data) to find the offset
Read SSTable Data
Seek to the offset in the Data.db file and read the partition data
Merge Results
Merge data from memtable + all relevant SSTables, resolve by timestamp (newest wins), apply tombstones
```
Read Path Decision Tree:

SELECT * FROM users WHERE user_id = 'alice'
        |
        v
  +------------+
  | Memtable   |---> Found? Include in merge
  +------------+
        |
        v
  +------------+
  | Row Cache  |---> Hit? Return immediately
  +------------+
        |
        v
  +--------------------------------------------+
  | For each SSTable:                          |
  |                                            |
  |   Bloom Filter ---No---> Skip this SSTable |
  |        | Maybe                             |
  |        v                                   |
  |   Key Cache ---Hit---> Read at offset      |
  |        | Miss                              |
  |        v                                   |
  |   Summary Index                            |
  |        v                                   |
  |   Partition Index                          |
  |        v                                   |
  |   Read Data.db                             |
  +--------------------------------------------+
        |
        v
  +------------+
  | Merge      |  Newest timestamp wins
  | Results    |  Apply tombstones
  +------------+
        |
        v
  Return to client
```
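The final merge step can be sketched as follows (a hypothetical function for illustration; each source maps `partition_key -> {column: (value, timestamp)}`, with a sentinel standing in for a tombstone):

```python
TOMBSTONE = object()  # sentinel marking a deleted column (a tombstone)

def read_partition(key, memtable, sstables):
    """Merge one partition across the memtable and all relevant SSTables.
    Illustrative sketch: newest timestamp wins, then tombstones are applied."""
    merged = {}
    # Source order doesn't matter: the timestamp decides which version wins.
    for source in [memtable] + sstables:
        for col, (value, ts) in source.get(key, {}).items():
            if col not in merged or ts > merged[col][1]:
                merged[col] = (value, ts)
    # Apply tombstones: a column whose newest version is a delete is omitted.
    return {col: value for col, (value, ts) in merged.items() if value is not TOMBSTONE}
```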
Bloom Filters & Caches
Bloom filters are the most important read optimization in Cassandra. They answer the question "does this partition key exist in this SSTable?" with zero disk I/O. A bloom filter can say "definitely not here" (100% accurate) or "maybe here" (small false positive rate).
| Cache/Filter | What It Stores | Hit Effect | Memory Cost |
|---|---|---|---|
| Bloom Filter | Probabilistic set membership | Eliminates SSTable from read path | ~1.5 bytes per partition per SSTable |
| Key Cache | Partition key ā byte offset in SSTable | Skips index lookup, direct seek | Configurable (default 100MB) |
| Row Cache | Full deserialized partition data | Skips all disk I/O | High (stores full rows in heap) |
| Chunk Cache | Decompressed SSTable chunks | Avoids re-decompression | Off-heap, configurable |
```
Bloom Filter Example:

SSTable has partitions: [alice, bob, charlie, dave]
Bloom filter: bit array with multiple hash functions

Query: "Does 'eve' exist in this SSTable?"
  hash1('eve') -> bit 7 -> 0 (not set)
  => DEFINITELY NOT HERE. Skip this SSTable entirely.

Query: "Does 'alice' exist in this SSTable?"
  hash1('alice') -> bit 3 -> 1
  hash2('alice') -> bit 9 -> 1
  hash3('alice') -> bit 2 -> 1
  => MAYBE HERE. Must check the actual SSTable.

False positive rate (default): ~1%
- 99% of unnecessary SSTable reads are eliminated
- Configurable via bloom_filter_fp_chance (per table)
- Lower fp_chance = more memory, fewer false positives
```

```sql
-- Tune the bloom filter for read-heavy tables:
ALTER TABLE hot_table WITH bloom_filter_fp_chance = 0.001;  -- 0.1% false positives
```
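A toy bloom filter makes the mechanics concrete (illustrative only; Cassandra's real filters use murmur-style hashing and size the bit array from `bloom_filter_fp_chance`):

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: no false negatives, small chance of false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = [0] * num_bits

    def _positions(self, key):
        # Derive k bit positions per key by salting a hash (sketch; not murmur).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False => definitely absent (skip the SSTable);
        # True  => maybe present (must check the SSTable).
        return all(self.bits[pos] for pos in self._positions(key))
```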
Key Cache Is Almost Always Worth It
The key cache maps partition keys to their exact byte offset in SSTable data files. A key cache hit means one direct disk seek instead of reading through the partition index. The default size is 100 MB; for most workloads, this caches the majority of active partitions. Monitor key_cache_hit_rate in nodetool info.
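Conceptually, the key cache is just a bounded LRU map from partition key to byte offset; a hypothetical sketch:

```python
from collections import OrderedDict

class KeyCache:
    """Bounded LRU cache: partition key -> byte offset in an SSTable (illustrative sketch)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion/access order tracks recency

    def put(self, key, offset):
        self.entries[key] = offset
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least-recently-used entry

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)     # refresh recency on hit
            return self.entries[key]          # hit: seek directly to this offset
        return None                           # miss: fall back to the partition index
```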
Cache Recommendations
- ✓ Bloom filters: always enabled; tune fp_chance for read-heavy tables
- ✓ Key cache: always enabled (default 100 MB); monitor the hit rate
- ✓ Row cache: enable only for small, frequently-read, rarely-updated tables
- ✓ Chunk cache: useful for compressed tables with repeated reads
- ✓ Off-heap caches preferred, to avoid GC pressure
Read Amplification
Read amplification is the number of SSTables that must be checked to serve a single read. Because Cassandra never modifies SSTables in place, a single partition's data may be spread across many SSTables, each containing a different version or subset of the row's columns.
The Newspaper Archive
Imagine looking up a company's stock price history. Each day's newspaper is a separate SSTable. To get the full history, you must check every newspaper. If you only want today's price, you still might need to check several recent papers because corrections (updates) appear in later editions. Compaction is like publishing a consolidated annual report, merging all the daily papers into one volume.
```
Read Amplification Example:

Partition 'alice' has been updated 5 times:

SSTable 1 (oldest): alice = {name: "Alice", age: 25}
SSTable 2:          alice = {age: 26}           (only age updated)
SSTable 3:          alice = {email: "a@b.com"}  (new column)
SSTable 4:          alice = {age: 27}
SSTable 5 (newest): alice = {name: "Alice Smith"}

To read alice's current state:
-> Must check ALL 5 SSTables
-> Merge: {name: "Alice Smith", age: 27, email: "a@b.com"}
-> Read amplification = 5

After compaction merges SSTables 1-5 into one:
New SSTable: alice = {name: "Alice Smith", age: 27, email: "a@b.com"}
-> Read amplification = 1

Monitoring: nodetool tablestats my_keyspace.my_table
-> SSTable count: indicates read amplification potential
-> Read latency: p99 correlates with SSTable count
```
| Factor | Increases Read Amplification | Decreases Read Amplification |
|---|---|---|
| Write pattern | Frequent updates to same partition | Write-once data (time-series) |
| Compaction | Falling behind / not running | Aggressive compaction strategy |
| Table design | Wide partitions with partial updates | Narrow partitions, full overwrites |
| Bloom filters | High false positive rate | Low fp_chance setting |
| Compaction strategy | STCS (size-tiered) | LCS (leveled) |
Compaction Is the Cure
Compaction merges multiple SSTables into fewer, larger ones, discarding obsolete versions and expired tombstones. It directly reduces read amplification. If your reads are slow, check the SSTable count per table (nodetool tablestats). High counts mean compaction is falling behind.
Interview Questions
Q: Describe the Cassandra write path and explain why writes are fast.
A: Write path: (1) append to commit log (sequential I/O), (2) write to memtable (in-memory). That's it: an ACK is returned. Writes are fast because there is no read-before-write, no random I/O, no lock contention, and no index updates on the hot path. The commit log is sequential (the fastest disk operation), and the memtable is in-memory. Writes are O(1) regardless of table size.
Q: What is an SSTable and why is it immutable?
A: An SSTable (Sorted String Table) is an immutable on-disk file containing partition data sorted by token. Immutability enables: (1) lock-free concurrent reads, (2) simple backups (copy files), (3) efficient compaction (merge-sort without locks), (4) no write amplification from in-place updates. The trade-off: updates create new SSTables, causing read amplification until compaction merges them.
Q: How do bloom filters optimize the read path?
A: Each SSTable has a bloom filter ā a probabilistic data structure that answers 'does this partition key exist in this SSTable?' It can say 'definitely no' (skip the SSTable entirely) or 'maybe yes' (must check). With default 1% false positive rate, 99% of unnecessary SSTable reads are eliminated. Bloom filters are checked in memory with zero disk I/O, making them extremely fast.
Q: What is read amplification and how do you reduce it?
A: Read amplification is the number of SSTables checked per read. It increases when the same partition is updated across many SSTables. Reduction strategies: (1) compaction merges SSTables (LCS gives lowest read amplification), (2) tune bloom filters to reduce false positives, (3) design tables for write-once patterns, (4) use key cache for direct offset lookups, (5) monitor with nodetool tablestats.
Q: What is the role of the commit log vs the memtable?
A: Commit log = durability (survives crashes). Memtable = performance (serves reads from memory). They serve different purposes: the commit log is append-only sequential I/O for crash recovery; it's replayed on restart to rebuild the memtable. The memtable is a sorted in-memory structure for fast reads of recent data. Once the memtable flushes to an SSTable, the corresponding commit log segments can be deleted.
Common Mistakes
Putting commit log and data on the same disk
The commit log does sequential writes while data reads are random I/O. Sharing a disk causes I/O contention: commit log writes wait behind random reads, increasing write latency.
✓ Use a dedicated SSD for the commit log (commitlog_directory). Even a small, fast SSD is sufficient, since commit log segments are recycled.
Enabling row cache for frequently-updated tables
Row cache stores full deserialized rows in heap memory. Every update invalidates the cache entry, causing constant cache churn, GC pressure, and no actual benefit.
✓ Row cache is only useful for small, rarely-updated, frequently-read tables (such as configuration data). For most tables, rely on the key cache and bloom filters instead.
Ignoring SSTable count as a performance indicator
Not monitoring SSTable count per table. High counts mean compaction is falling behind, causing read amplification: every read checks more files, increasing latency.
✓ Monitor with nodetool tablestats. If the SSTable count grows continuously, investigate compaction throughput and disk space, or switch compaction strategy (STCS → LCS for read-heavy tables).
Using commitlog_sync: periodic without understanding the risk
Periodic sync (the default) only fsyncs every 10 seconds. A crash within that window loses up to 10 seconds of acknowledged writes. Teams assume 'acknowledged = durable', which isn't true with periodic sync.
✓ For critical data, use commitlog_sync: batch, which fsyncs on every write batch. Accept the ~2ms latency increase, or understand and document the data-loss window for your use case.
Frequent partial updates to wide partitions
Updating individual columns in a wide partition creates many small SSTable entries. Reading the full partition requires merging across all of these SSTables: severe read amplification.
✓ Design for full-row writes where possible. If partial updates are necessary, use LCS compaction to keep read amplification low, and monitor the SSTable count.