Compaction & Operations
Compaction is Cassandra's background maintenance process: merging SSTables, discarding obsolete data, and reclaiming disk space. Understanding compaction strategies is essential for production performance.
Why Compaction Exists
Because SSTables are immutable, every update and delete creates new data on disk without removing the old version. Over time, this leads to: (1) wasted disk space from obsolete data, (2) read amplification from checking many SSTables, and (3) tombstones accumulating without being purged. Compaction solves all three.
The Filing System
Imagine a filing system where you can never erase or modify a document; you can only add new versions. After a year, finding the latest version of a document means checking hundreds of folders. Compaction is like a clerk who periodically consolidates all versions into one folder, shredding outdated copies. Without the clerk, the filing room becomes unusable.
What Compaction Does
- ✓ Merges multiple SSTables into fewer, larger ones
- ✓ Discards obsolete versions (keeps only newest timestamp per cell)
- ✓ Purges expired tombstones (after gc_grace_seconds)
- ✓ Removes TTL-expired data
- ✓ Reduces read amplification (fewer SSTables to check)
- ✓ Reclaims disk space from deleted/overwritten data
Trigger
Compaction is triggered when SSTable count/size meets strategy-specific thresholds
Select SSTables
Strategy selects which SSTables to compact (varies by STCS/LCS/TWCS)
Merge-Sort
Selected SSTables are merge-sorted by partition key and clustering columns
Resolve Versions
For each cell, keep only the newest timestamp. Apply tombstones to delete old data.
Write New SSTable
Output a single new SSTable containing only the latest, live data
Delete Old SSTables
Once the new SSTable is complete, atomically swap and delete the input SSTables
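The merge and resolve steps above can be sketched as a toy model. This is illustrative only, not Cassandra's on-disk format: each "SSTable" here is a dict mapping (partition_key, column) to a (timestamp, value) pair, with value `None` standing in for a tombstone.

```python
def compact(sstables, gc_grace_expired=False):
    """Merge SSTables, keeping only the newest-timestamped version of each cell."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            # Last-write-wins: the cell with the highest timestamp survives.
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    if gc_grace_expired:
        # After gc_grace_seconds, tombstones (None values) and the data
        # they shadow can be dropped entirely, reclaiming space.
        merged = {k: v for k, v in merged.items() if v[1] is not None}
    return merged

old = {("alice", "email"): (100, "alice@example.com")}
new = {("alice", "email"): (200, None)}  # tombstone written by a DELETE
```

Merging `[old, new]` keeps only the tombstone (newest timestamp wins); once gc_grace has passed, a second compaction drops the cell entirely.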
Compaction Uses Temporary Disk Space
During compaction, both old and new SSTables exist simultaneously. This means you need enough free disk space to hold the compaction output before the old files are deleted. Rule of thumb: keep at least 50% disk free for STCS, 10-20% for LCS.
SizeTiered Compaction (STCS)
STCS is the default compaction strategy. It groups SSTables of similar size and compacts them together when enough accumulate (default: 4 similar-sized SSTables trigger compaction). It's optimized for write-heavy workloads.
```
SizeTiered Compaction Strategy (STCS):

  Tier 0 (small):  [4MB]  [4MB]  [5MB]  [4MB]   → compact → [17MB]
  Tier 1 (medium): [15MB] [17MB] [16MB] [18MB]  → compact → [66MB]
  Tier 2 (large):  [60MB] [66MB] [55MB] [70MB]  → compact → [251MB]

Trigger: min_threshold (default 4) SSTables of similar size.
"Similar" = within bucket_low (0.5) to bucket_high (1.5) ratio.

Characteristics:
  ✓ Low write amplification (each byte written ~4-5 times total)
  ✓ Good for write-heavy workloads
  ✗ High space amplification (50% overhead for temporary files)
  ✗ High read amplification (many SSTables per partition)
  ✗ Unpredictable compaction timing (large compactions are bursty)
```
| Setting | Default | Description |
|---|---|---|
| min_threshold | 4 | Minimum SSTables to trigger compaction |
| max_threshold | 32 | Maximum SSTables to compact at once |
| bucket_low | 0.5 | Lower bound for 'similar size' grouping |
| bucket_high | 1.5 | Upper bound for 'similar size' grouping |
| min_sstable_size | 50MB | SSTables below this are always in the smallest bucket |
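The bucketing rule in the table above can be sketched in Python. This is an illustrative model built from the published parameters (bucket_low, bucket_high, min_sstable_size), not Cassandra's actual bucketing code:

```python
def stcs_buckets(sizes_mb, bucket_low=0.5, bucket_high=1.5, min_sstable_size=50):
    """Group SSTables of 'similar' size, mimicking STCS bucketing."""
    buckets = []  # each bucket: [running average size, [member sizes]]
    small = []    # everything under min_sstable_size shares one bucket
    for size in sorted(sizes_mb):
        if size < min_sstable_size:
            small.append(size)
            continue
        for bucket in buckets:
            avg, members = bucket
            # "Similar" = within [bucket_low * avg, bucket_high * avg]
            if bucket_low * avg <= size <= bucket_high * avg:
                members.append(size)
                bucket[0] = sum(members) / len(members)  # refresh average
                break
        else:
            buckets.append([size, [size]])
    groups = [members for _, members in buckets]
    if small:
        groups.insert(0, small)
    return groups
```

A bucket becomes a compaction candidate once it holds min_threshold (default 4) SSTables.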
When to Use STCS
STCS is best for write-heavy workloads where you write data once and rarely update it (logs, events, time-series without TTL). Avoid STCS for read-heavy workloads or tables with frequent updates; the high SSTable count causes read amplification.
Leveled Compaction (LCS)
LCS organizes SSTables into levels (L0, L1, L2, ...) where each level is 10x the size of the previous. Within each level (except L0), SSTables have non-overlapping token ranges. This guarantees that a partition exists in at most one SSTable per level, dramatically reducing read amplification.
```
Leveled Compaction Strategy (LCS):

  L0: [memtable flushes land here, may overlap]
      [5MB] [5MB] [5MB] [5MB]  → promote to L1 when 4 accumulate
  L1: [50MB total, non-overlapping ranges]
      [A-F: 5MB] [G-M: 5MB] [N-S: 5MB] [T-Z: 5MB] ...
  L2: [500MB total, non-overlapping ranges]  (10x L1)
      [A-B: 5MB] [C-D: 5MB] [E-F: 5MB] ...
  L3: [5GB total, non-overlapping ranges]    (10x L2)
      ...

Compaction: when L(n) exceeds its size limit, one SSTable from L(n)
is merged with all overlapping SSTables in L(n+1).

Key guarantee: within each level, no two SSTables contain the same
partition. Read amplification = at most (number of levels + 1).

Characteristics:
  ✓ Low read amplification (predictable, bounded)
  ✓ Low space amplification (~10% overhead)
  ✓ Predictable performance (no large bursty compactions)
  ✗ High write amplification (each byte written ~10-30 times)
  ✗ More I/O bandwidth consumed by compaction
```
| Setting | Default | Description |
|---|---|---|
| sstable_size_in_mb | 160 | Target size for each SSTable |
| fanout_size | 10 | Size multiplier between levels |
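The sizing rule behind the table can be sketched as arithmetic: level capacities grow by the fanout factor, which is what bounds how many SSTables a read can touch. This is a simplified model of the rule described above; Cassandra's real level-targeting logic has more nuance.

```python
def max_level_mb(level, sstable_size_mb=160, fanout=10):
    """Rough target capacity of level n: sstable_size * fanout**n."""
    return sstable_size_mb * fanout ** level

def levels_needed(total_mb, sstable_size_mb=160, fanout=10):
    """Levels required to hold total_mb of data.

    A read touches at most (levels + 1) SSTables, so this also
    bounds read amplification.
    """
    level = 1
    while max_level_mb(level, sstable_size_mb, fanout) < total_mb:
        level += 1
    return level
```

With the defaults, roughly 1TB of data fits in 4 levels, so any read touches at most 5 SSTables (plus whatever sits in L0).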
When to Use LCS
LCS is ideal for read-heavy workloads and tables with frequent updates (user profiles, session data, counters). The bounded read amplification means consistent read latency. The trade-off is higher write amplification: each byte is rewritten many times as it moves through levels. Don't use LCS for write-heavy, append-only workloads.
TimeWindow Compaction (TWCS)
TWCS is designed specifically for time-series data with TTL. It groups SSTables by time window (e.g., 1 hour, 1 day) and only compacts within the same window. Once a window closes, its SSTables are compacted into one final SSTable that is never compacted again; it sits untouched until the TTL expires and the entire SSTable is dropped.
```
TimeWindow Compaction Strategy (TWCS):

Time windows (1-day example):
  Jan 15 (closed): [SSTable: all Jan 15 data, 2GB]  → never compacted again
  Jan 16 (closed): [SSTable: all Jan 16 data, 2GB]  → never compacted again
  Jan 17 (active): [5MB] [5MB] [5MB] [5MB]          → STCS within window

When TTL expires (e.g., 30 days):
  Jan 15 SSTable: all data expired → DROP entire file (instant!)
  No tombstones needed, no compaction needed.

Characteristics:
  ✓ Minimal write amplification (compact once per window)
  ✓ Efficient TTL expiration (drop whole SSTables, no tombstones)
  ✓ Predictable disk usage (old windows drop off)
  ✗ Only works for time-series / append-only data
  ✗ Out-of-order writes break the model (data lands in the wrong window)
  ✗ Updates/deletes to old windows cause problems
```
```sql
-- TWCS configuration for time-series with 30-day TTL
CREATE TABLE sensor_data (
    sensor_id  TEXT,
    day        DATE,
    event_time TIMESTAMP,
    value      DOUBLE,
    PRIMARY KEY ((sensor_id, day), event_time)
) WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
}
AND default_time_to_live = 2592000;  -- 30 days in seconds

-- Key rule: TTL should be a multiple of the window size.
-- Window = 1 day, TTL = 30 days → each window's SSTable lives 30 days, then drops.
```
TWCS + TTL = Zero-Cost Deletes
The magic of TWCS: when all data in a time window has the same TTL, the entire SSTable expires at once. Cassandra drops the file without reading it, without writing tombstones, without compaction. This is the most efficient way to handle time-series data with retention policies.
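The window mechanics can be sketched in a few lines. This is a model of the behavior described above (flooring timestamps into windows, dropping a window once its newest possible data has outlived the TTL), not Cassandra's implementation:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def window_start(ts, window_days=1):
    """Floor a write timestamp to the start of its TWCS time window."""
    window_s = window_days * 86_400
    elapsed = int((ts - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=(elapsed // window_s) * window_s)

def window_droppable(win_start, now, ttl_days=30, window_days=1):
    """A window's SSTable can be dropped whole once even its newest
    possible data (written right at the window's end) has outlived the TTL."""
    return now >= win_start + timedelta(days=window_days + ttl_days)
```

A write at any time on Jan 15 lands in the Jan 15 window; with a 30-day TTL that window's SSTable becomes droppable on Feb 15, when the last row written at the window's end expires.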
Tombstones & gc_grace_seconds
In Cassandra, a DELETE doesn't remove data β it writes a tombstone, a marker that says "this data is deleted as of timestamp T." Tombstones are necessary because SSTables are immutable and replicas might be out of sync. The tombstone must propagate to all replicas before the original data can be purged.
The Gravestone
Imagine a distributed phone book with copies in 3 cities. You can't just erase Alice's number from your copy; the other cities still have it and would 'resurrect' it during sync. Instead, you write a note: 'Alice's number deleted on Jan 15.' All cities see this note during sync and remove Alice. After 10 days (gc_grace_seconds), everyone has seen the note, so you can safely remove the note itself.
```
Tombstone Lifecycle:

1. DELETE FROM users WHERE user_id = 'alice';
   → Writes tombstone: {user_id: 'alice', deleted_at: 2024-01-15T10:00:00}

2. Tombstone propagates via:
   - Normal replication (write to all replicas)
   - Read repair (if a stale replica is read)
   - Anti-entropy repair (nodetool repair)

3. After gc_grace_seconds (default: 864000 = 10 days):
   - Tombstone is eligible for garbage collection
   - Next compaction purges the tombstone AND the original data

4. If repair hasn't run within gc_grace_seconds:
   ⚠ DANGER: the tombstone is purged, but a stale replica still has
   the original data. Read repair will "resurrect" the deleted data!

Types of tombstones:
- Cell tombstone:      single column deleted
- Row tombstone:       entire row deleted (DELETE FROM t WHERE pk=x AND ck=y)
- Range tombstone:     range of clustering keys deleted
- Partition tombstone: entire partition deleted (DELETE FROM t WHERE pk=x)
- TTL tombstone:       auto-generated when TTL expires
```
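The operational safety rule implied by step 4 can be written down explicitly. Note the asymmetry this sketch highlights: Cassandra itself only checks the gc_grace clock; ensuring a repair has run since the delete is the operator's job, and skipping it is exactly what creates zombies. (This is a model of the rule, not Cassandra's internal check.)

```python
def safe_to_purge(tombstone_age_s, seconds_since_last_repair, gc_grace_s=864_000):
    """Is it safe for compaction to drop this tombstone?

    Safe only if (a) gc_grace_seconds have elapsed since the delete,
    AND (b) a full repair has completed since the delete, meaning
    every replica has seen the tombstone.
    """
    past_grace = tombstone_age_s >= gc_grace_s
    repaired_since_delete = seconds_since_last_repair <= tombstone_age_s
    return past_grace and repaired_since_delete
```

With the 10-day default, a tombstone that is 11 days old is safe to purge if the last repair finished yesterday, but not if the last repair predates the delete.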
| Scenario | gc_grace_seconds | Repair Frequency | Risk |
|---|---|---|---|
| Default | 10 days | Must repair within 10 days | Safe if repair runs on schedule |
| Aggressive | 1 day | Must repair within 1 day | High risk: any delay causes zombies |
| Conservative | 30 days | Must repair within 30 days | More disk usage from tombstones |
| TWCS with TTL | 0 (sometimes) | N/A (whole SSTables drop) | Only safe with TWCS + uniform TTL |
The Zombie Data Problem
If gc_grace_seconds expires before repair runs on all replicas: (1) Tombstone is purged from repaired nodes. (2) Stale replica still has the original data. (3) Next read repair sees the data on the stale replica and no tombstone anywhere. (4) Data is "resurrected": the delete is lost. This is the most dangerous operational issue in Cassandra.
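The resurrection sequence can be demonstrated with a toy last-write-wins reconciler. This models the coordinator-side merge described above, not Cassandra's actual read path:

```python
def merge_read(replica_cells):
    """Toy coordinator reconciliation: the cell with the highest write
    timestamp wins; a replica with nothing for the key contributes None."""
    cells = [c for c in replica_cells if c is not None]
    return max(cells, key=lambda c: c["ts"]) if cells else None

tombstone = {"ts": 200, "live": False}               # written by the DELETE
stale = {"ts": 100, "live": True, "value": "alice"}  # replica that missed it
```

Before gc_grace, `merge_read([tombstone, stale])` correctly returns the tombstone (ts 200 beats ts 100), and read repair pushes the delete to the stale replica. After gc_grace without repair, the tombstone has been compacted away (`None`), so the stale replica's data is the only cell left and it wins: the delete is lost.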
nodetool Essential Commands
nodetool is the primary CLI for Cassandra operations. It communicates with the local node via JMX to perform administrative tasks, monitoring, and maintenance.
```shell
# ───────────────────────────────────────────────────────────
# CLUSTER STATUS & HEALTH
# ───────────────────────────────────────────────────────────

# Show cluster topology and node status
nodetool status
# UN = Up/Normal, DN = Down/Normal, UJ = Up/Joining, UL = Up/Leaving

# Show node info (heap, load, uptime, key cache hit rate)
nodetool info

# Show gossip state for all nodes
nodetool gossipinfo

# ───────────────────────────────────────────────────────────
# TABLE STATISTICS & DIAGNOSTICS
# ───────────────────────────────────────────────────────────

# Table-level stats (SSTable count, read/write latency, partition size)
nodetool tablestats my_keyspace.my_table

# Histogram of partition sizes and cell counts
nodetool tablehistograms my_keyspace.my_table

# Show compaction activity
nodetool compactionstats

# Show thread pool stats (pending/blocked tasks)
nodetool tpstats

# ───────────────────────────────────────────────────────────
# MAINTENANCE OPERATIONS
# ───────────────────────────────────────────────────────────

# Flush memtables to SSTables
nodetool flush my_keyspace

# Trigger compaction manually
nodetool compact my_keyspace my_table

# Run anti-entropy repair
nodetool repair my_keyspace

# Take a snapshot (backup)
nodetool snapshot my_keyspace -t my_snapshot_name

# Clear a snapshot
nodetool clearsnapshot -t my_snapshot_name

# ───────────────────────────────────────────────────────────
# NODE LIFECYCLE
# ───────────────────────────────────────────────────────────

# Gracefully remove a node (streams data to other nodes)
nodetool decommission

# Remove a dead node from the cluster
nodetool removenode <host_id>

# Drain (flush + stop accepting writes, for graceful shutdown)
nodetool drain
```
Daily Operations Checklist
- ✓ nodetool status – verify all nodes are UN (Up/Normal)
- ✓ nodetool tpstats – check for pending/blocked tasks (a sign of overload)
- ✓ nodetool compactionstats – verify compaction isn't falling behind
- ✓ nodetool tablestats – monitor SSTable count and read latency
- ✓ nodetool info – check heap usage and key cache hit rate
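For automated daily checks, the first item on the list can be scripted. A minimal sketch that assumes the familiar `nodetool status` layout, where each node line starts with a two-letter state code (UN, DN, UJ, UL); real output also includes datacenter headers and load columns, which this parser skips:

```python
def parse_nodetool_status(output):
    """Extract (state, address) pairs from `nodetool status` output."""
    states = {"UN", "DN", "UJ", "UL"}
    nodes = []
    for line in output.splitlines():
        parts = line.split()
        # Node lines begin with a status code; headers and rules do not.
        if parts and parts[0] in states:
            nodes.append((parts[0], parts[1]))
    return nodes

sample = """Datacenter: dc1
==============
--  Address    Load     Tokens  Owns  Host ID  Rack
UN  10.0.0.1   1.2 GiB  256     33%   aaaa     rack1
DN  10.0.0.2   1.1 GiB  256     33%   bbbb     rack1
"""
unhealthy = [addr for state, addr in parse_nodetool_status(sample) if state != "UN"]
```

In practice you would feed this the output of `subprocess.run(["nodetool", "status"], ...)` and alert whenever `unhealthy` is non-empty.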
Adding & Removing Nodes
Cassandra supports elastic scaling: adding and removing nodes without downtime. The cluster automatically rebalances data when topology changes. However, these operations must be done carefully to avoid data loss or performance degradation.
Adding a Node (Bootstrap)
Configure New Node
Set cassandra.yaml with cluster name, seeds, snitch, and DC/rack. Set auto_bootstrap: true (default).
Start the Node
Node contacts seeds, joins gossip, claims token ranges (vnodes assigned automatically)
Stream Data
Existing nodes stream data for the new node's token ranges. Status shows as UJ (Up/Joining).
Join Complete
Once streaming finishes, node status changes to UN (Up/Normal). It now serves reads and writes.
Removing a Node (Decommission)
Initiate Decommission
Run 'nodetool decommission' on the node to be removed. Status changes to UL (Up/Leaving).
Stream Data Out
Node streams all its data to the remaining nodes that will take over its token ranges
Leave Ring
Once streaming completes, node removes itself from the ring and shuts down
Never Remove More Than One Node at a Time
Removing multiple nodes simultaneously can cause data loss if their token ranges overlap with the same replicas. Always wait for one decommission to complete before starting another. For dead nodes that can't run decommission, use "nodetool removenode" from another node.
| Operation | Command | When to Use |
|---|---|---|
| Add node | Start with auto_bootstrap: true | Scaling up, replacing hardware |
| Remove live node | nodetool decommission | Scaling down, planned removal |
| Remove dead node | nodetool removenode <host_id> | Node permanently failed |
| Replace dead node | Start with -Dcassandra.replace_address=<old_ip> | Hardware replacement |
Interview Questions
Q: Compare STCS, LCS, and TWCS. When would you use each?
A: STCS: write-heavy, append-only workloads (logs, events). Low write amplification but high read amplification and 50% disk overhead. LCS: read-heavy workloads with updates (user profiles, sessions). Low read amplification but high write amplification (10-30x). TWCS: time-series with TTL. Groups by time window, drops entire SSTables on expiry: zero-cost deletes. Never use TWCS for data that gets updated or deleted individually.
Q: What are tombstones, and why can they cause problems?
A: Tombstones are delete markers (not actual deletions) because SSTables are immutable. Problems: (1) tombstones consume disk space until gc_grace_seconds expires, (2) reads must process tombstones (scanning past deleted data), (3) if repair doesn't run before gc_grace_seconds, deleted data can resurrect (zombie data). Excessive tombstones cause 'tombstone overwhelm': reads timing out from processing millions of tombstones.
Q: What happens if repair doesn't run within gc_grace_seconds?
A: Tombstones are garbage collected (purged) from nodes that have been repaired. But a stale replica that missed the original delete still has the data AND the tombstone is now gone from other replicas. On the next read, read repair sees data on the stale replica with no tombstone to contradict it, so it propagates the 'deleted' data back to all replicas. The delete is effectively lost. This is called zombie data resurrection.
Q: How does adding a node to a Cassandra cluster work?
A: (1) New node starts with auto_bootstrap=true and contacts seed nodes. (2) It joins gossip and learns the cluster topology. (3) With vnodes, it's assigned random token ranges. (4) Existing nodes stream data for those ranges to the new node (status: UJ/Joining). (5) Once streaming completes, status becomes UN/Normal and it serves traffic. The cluster remains fully available throughout: no downtime.
Q: Why does STCS require 50% free disk space?
A: STCS compacts similar-sized SSTables together. In the worst case, it compacts all SSTables in a tier into one new SSTable. During compaction, both the input SSTables and the output SSTable exist simultaneously. If the input is 50% of disk, the output needs another 50%. Only after the new SSTable is complete are the old ones deleted. LCS needs less (10-20%) because it compacts one SSTable at a time against a level.
Common Mistakes
Using DELETE heavily without understanding tombstones
Treating Cassandra like an RDBMS and issuing frequent DELETEs. Each delete creates a tombstone that persists for gc_grace_seconds (10 days). Tables with heavy deletes accumulate millions of tombstones, causing read timeouts.
→ Design for TTL-based expiration instead of explicit deletes. Use TWCS for time-series data. If deletes are necessary, monitor tombstone count with nodetool tablestats and ensure compaction runs regularly.
Running out of disk during compaction
Not reserving enough free disk space for compaction. STCS needs up to 50% free space temporarily. When disk fills, compaction stops, SSTables accumulate, reads get slower, and the node becomes unrecoverable.
→ Monitor disk usage and alert at 50% for STCS, 70% for LCS. Add nodes or increase storage before reaching these thresholds. Never let a Cassandra node exceed 80% disk usage.
Using STCS for read-heavy workloads
STCS allows many SSTables to accumulate (high read amplification). Read-heavy tables suffer because every read checks many SSTables, increasing latency unpredictably.
→ Switch to LCS for tables with frequent reads and updates. LCS guarantees bounded read amplification (one SSTable per level per partition). Accept the higher write amplification as the trade-off.
Using TWCS with out-of-order writes or updates
TWCS assumes data arrives in time order and is never updated. Out-of-order writes land in the wrong time window, and updates to old windows prevent those SSTables from being dropped cleanly.
→ Only use TWCS for pure append-only time-series with uniform TTL. If data arrives out of order or gets updated, use STCS or LCS instead.
Decommissioning multiple nodes simultaneously
Removing several nodes at once to scale down quickly. If their token ranges overlap, data may not have enough replicas during the transition, risking data loss.
→ Decommission one node at a time. Wait for the operation to complete (nodetool status shows the node gone) before starting the next. For large scale-downs, plan for days, not hours.