Compaction & Operations
Compaction is Cassandra's background maintenance process: merging SSTables, discarding obsolete data, and reclaiming disk space. Understanding compaction strategies is essential for production performance.
Why Compaction Exists
Because SSTables are immutable, every update and delete creates new data on disk without removing the old version. Over time, this leads to: (1) wasted disk space from obsolete data, (2) read amplification from checking many SSTables, and (3) tombstones accumulating without being purged. Compaction solves all three.
The Filing System
Imagine a filing system where you can never erase or modify a document; you can only add new versions. After a year, finding the latest version of a document means checking hundreds of folders. Compaction is like a clerk who periodically consolidates all versions into one folder, shredding outdated copies. Without the clerk, the filing room becomes unusable.
What Compaction Does
- ✓ Merges multiple SSTables into fewer, larger ones
- ✓ Discards obsolete versions (keeps only newest timestamp per cell)
- ✓ Purges expired tombstones (after gc_grace_seconds)
- ✓ Removes TTL-expired data
- ✓ Reduces read amplification (fewer SSTables to check)
- ✓ Reclaims disk space from deleted/overwritten data
Trigger
Compaction is triggered when SSTable count/size meets strategy-specific thresholds
Select SSTables
Strategy selects which SSTables to compact (varies by STCS/LCS/TWCS)
Merge-Sort
Selected SSTables are merge-sorted by partition key and clustering columns
Resolve Versions
For each cell, keep only the newest timestamp. Apply tombstones to delete old data.
Write New SSTable
Output a single new SSTable containing only the latest, live data
Delete Old SSTables
Once the new SSTable is complete, atomically swap and delete the input SSTables
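The merge and resolve steps above can be sketched as a toy model. This is illustrative only, not Cassandra's on-disk format: each "SSTable" here is a dict mapping (partition_key, column) to a (timestamp, value) pair, with value `None` standing in for a tombstone.

```python
def compact(sstables, gc_grace_expired=False):
    """Merge SSTables, keeping only the newest-timestamped version of each cell."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            # Last-write-wins: the cell with the highest timestamp survives.
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    if gc_grace_expired:
        # After gc_grace_seconds, tombstones (None values) and the data
        # they shadow can be dropped entirely, reclaiming space.
        merged = {k: v for k, v in merged.items() if v[1] is not None}
    return merged

old = {("alice", "email"): (100, "alice@example.com")}
new = {("alice", "email"): (200, None)}  # tombstone written by a DELETE
```

Merging `[old, new]` keeps only the tombstone (newest timestamp wins); once gc_grace has passed, a second compaction drops the cell entirely.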
Compaction Uses Temporary Disk Space
During compaction, both old and new SSTables exist simultaneously. This means you need enough free disk space to hold the compaction output before the old files are deleted. Rule of thumb: keep at least 50% disk free for STCS, 10-20% for LCS.
SizeTiered Compaction (STCS)
STCS is the default compaction strategy. It groups SSTables of similar size and compacts them together when enough accumulate (default: 4 similar-sized SSTables trigger compaction). It's optimized for write-heavy workloads.
```
SizeTiered Compaction Strategy (STCS):

  Tier 0 (small):  [4MB]  [4MB]  [5MB]  [4MB]   → compact → [17MB]
  Tier 1 (medium): [15MB] [17MB] [16MB] [18MB]  → compact → [66MB]
  Tier 2 (large):  [60MB] [66MB] [55MB] [70MB]  → compact → [251MB]

Trigger: min_threshold (default 4) SSTables of similar size.
"Similar" = within bucket_low (0.5) to bucket_high (1.5) ratio.

Characteristics:
  ✓ Low write amplification (each byte written ~4-5 times total)
  ✓ Good for write-heavy workloads
  ✗ High space amplification (50% overhead for temporary files)
  ✗ High read amplification (many SSTables per partition)
  ✗ Unpredictable compaction timing (large compactions are bursty)
```
| Setting | Default | Description |
|---|---|---|
| min_threshold | 4 | Minimum SSTables to trigger compaction |
| max_threshold | 32 | Maximum SSTables to compact at once |
| bucket_low | 0.5 | Lower bound for 'similar size' grouping |
| bucket_high | 1.5 | Upper bound for 'similar size' grouping |
| min_sstable_size | 50MB | SSTables below this are always in the smallest bucket |
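The bucketing rule in the table above can be sketched in Python. This is an illustrative model built from the published parameters (bucket_low, bucket_high, min_sstable_size), not Cassandra's actual bucketing code:

```python
def stcs_buckets(sizes_mb, bucket_low=0.5, bucket_high=1.5, min_sstable_size=50):
    """Group SSTables of 'similar' size, mimicking STCS bucketing."""
    buckets = []  # each bucket: [running average size, [member sizes]]
    small = []    # everything under min_sstable_size shares one bucket
    for size in sorted(sizes_mb):
        if size < min_sstable_size:
            small.append(size)
            continue
        for bucket in buckets:
            avg, members = bucket
            # "Similar" = within [bucket_low * avg, bucket_high * avg]
            if bucket_low * avg <= size <= bucket_high * avg:
                members.append(size)
                bucket[0] = sum(members) / len(members)  # refresh average
                break
        else:
            buckets.append([size, [size]])
    groups = [members for _, members in buckets]
    if small:
        groups.insert(0, small)
    return groups
```

A bucket becomes a compaction candidate once it holds min_threshold (default 4) SSTables.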
When to Use STCS
STCS is best for write-heavy workloads where you write data once and rarely update it (logs, events, time-series without TTL). Avoid STCS for read-heavy workloads or tables with frequent updates; the high SSTable count causes read amplification.
Leveled Compaction (LCS)
LCS organizes SSTables into levels (L0, L1, L2, ...) where each level is 10x the size of the previous. Within each level (except L0), SSTables have non-overlapping token ranges. This guarantees that a partition exists in at most one SSTable per level, dramatically reducing read amplification.
```
Leveled Compaction Strategy (LCS):

  L0: [memtable flushes land here, may overlap]
      [5MB] [5MB] [5MB] [5MB]  → promote to L1 when 4 accumulate
  L1: [50MB total, non-overlapping ranges]
      [A-F: 5MB] [G-M: 5MB] [N-S: 5MB] [T-Z: 5MB] ...
  L2: [500MB total, non-overlapping ranges]  (10x L1)
      [A-B: 5MB] [C-D: 5MB] [E-F: 5MB] ...
  L3: [5GB total, non-overlapping ranges]    (10x L2)
      ...

Compaction: when L(n) exceeds its size limit, one SSTable from L(n)
is merged with all overlapping SSTables in L(n+1).

Key guarantee: within each level, no two SSTables contain the same
partition. Read amplification = at most (number of levels + 1).

Characteristics:
  ✓ Low read amplification (predictable, bounded)
  ✓ Low space amplification (~10% overhead)
  ✓ Predictable performance (no large bursty compactions)
  ✗ High write amplification (each byte written ~10-30 times)
  ✗ More I/O bandwidth consumed by compaction
```
| Setting | Default | Description |
|---|---|---|
| sstable_size_in_mb | 160 | Target size for each SSTable |
| fanout_size | 10 | Size multiplier between levels |
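The sizing rule behind the table can be sketched as arithmetic: level capacities grow by the fanout factor, which is what bounds how many SSTables a read can touch. This is a simplified model of the rule described above; Cassandra's real level-targeting logic has more nuance.

```python
def max_level_mb(level, sstable_size_mb=160, fanout=10):
    """Rough target capacity of level n: sstable_size * fanout**n."""
    return sstable_size_mb * fanout ** level

def levels_needed(total_mb, sstable_size_mb=160, fanout=10):
    """Levels required to hold total_mb of data.

    A read touches at most (levels + 1) SSTables, so this also
    bounds read amplification.
    """
    level = 1
    while max_level_mb(level, sstable_size_mb, fanout) < total_mb:
        level += 1
    return level
```

With the defaults, roughly 1TB of data fits in 4 levels, so any read touches at most 5 SSTables (plus whatever sits in L0).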
When to Use LCS
LCS is ideal for read-heavy workloads and tables with frequent updates (user profiles, session data, counters). The bounded read amplification means consistent read latency. The trade-off is higher write amplification: each byte is rewritten many times as it moves through levels. Don't use LCS for write-heavy, append-only workloads.
TimeWindow Compaction (TWCS)
TWCS is designed specifically for time-series data with TTL. It groups SSTables by time window (e.g., 1 hour, 1 day) and only compacts within the same window. Once a window closes, its SSTables are compacted into one final SSTable that is never compacted again; it sits untouched until the TTL expires and the entire SSTable is dropped.
```
TimeWindow Compaction Strategy (TWCS):

Time windows (1-day example):
  Jan 15 (closed): [SSTable: all Jan 15 data, 2GB]  → never compacted again
  Jan 16 (closed): [SSTable: all Jan 16 data, 2GB]  → never compacted again
  Jan 17 (active): [5MB] [5MB] [5MB] [5MB]          → STCS within window

When TTL expires (e.g., 30 days):
  Jan 15 SSTable: all data expired → DROP entire file (instant!)
  No tombstones needed, no compaction needed.

Characteristics:
  ✓ Minimal write amplification (compact once per window)
  ✓ Efficient TTL expiration (drop whole SSTables, no tombstones)
  ✓ Predictable disk usage (old windows drop off)
  ✗ Only works for time-series / append-only data
  ✗ Out-of-order writes break the model (data lands in the wrong window)
  ✗ Updates/deletes to old windows cause problems
```
```sql
-- TWCS configuration for time-series with 30-day TTL
CREATE TABLE sensor_data (
    sensor_id  TEXT,
    day        DATE,
    event_time TIMESTAMP,
    value      DOUBLE,
    PRIMARY KEY ((sensor_id, day), event_time)
) WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
}
AND default_time_to_live = 2592000;  -- 30 days in seconds

-- Key rule: TTL should be a multiple of the window size.
-- Window = 1 day, TTL = 30 days → each window's SSTable lives 30 days, then drops.
```
TWCS + TTL = Zero-Cost Deletes
The magic of TWCS: when all data in a time window has the same TTL, the entire SSTable expires at once. Cassandra drops the file without reading it, without writing tombstones, without compaction. This is the most efficient way to handle time-series data with retention policies.
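The window mechanics can be sketched in a few lines. This is a model of the behavior described above (flooring timestamps into windows, dropping a window once its newest possible data has outlived the TTL), not Cassandra's implementation:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def window_start(ts, window_days=1):
    """Floor a write timestamp to the start of its TWCS time window."""
    window_s = window_days * 86_400
    elapsed = int((ts - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=(elapsed // window_s) * window_s)

def window_droppable(win_start, now, ttl_days=30, window_days=1):
    """A window's SSTable can be dropped whole once even its newest
    possible data (written right at the window's end) has outlived the TTL."""
    return now >= win_start + timedelta(days=window_days + ttl_days)
```

A write at any time on Jan 15 lands in the Jan 15 window; with a 30-day TTL that window's SSTable becomes droppable on Feb 15, when the last row written at the window's end expires.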
Tombstones & gc_grace_seconds
In Cassandra, a DELETE doesn't remove data β it writes a tombstone, a marker that says "this data is deleted as of timestamp T." Tombstones are necessary because SSTables are immutable and replicas might be out of sync. The tombstone must propagate to all replicas before the original data can be purged.
The Gravestone
Imagine a distributed phone book with copies in 3 cities. You can't just erase Alice's number from your copy; the other cities still have it and would 'resurrect' it during sync. Instead, you write a note: 'Alice's number deleted on Jan 15.' All cities see this note during sync and remove Alice. After 10 days (gc_grace_seconds), everyone has seen the note, so you can safely remove the note itself.
```
Tombstone Lifecycle:

1. DELETE FROM users WHERE user_id = 'alice';
   → Writes tombstone: {user_id: 'alice', deleted_at: 2024-01-15T10:00:00}

2. Tombstone propagates via:
   - Normal replication (write to all replicas)
   - Read repair (if a stale replica is read)
   - Anti-entropy repair (nodetool repair)

3. After gc_grace_seconds (default: 864000 = 10 days):
   - Tombstone is eligible for garbage collection
   - Next compaction purges the tombstone AND the original data

4. If repair hasn't run within gc_grace_seconds:
   ⚠ DANGER: the tombstone is purged, but a stale replica still has
   the original data. Read repair will "resurrect" the deleted data!

Types of tombstones:
- Cell tombstone:      single column deleted
- Row tombstone:       entire row deleted (DELETE FROM t WHERE pk=x AND ck=y)
- Range tombstone:     range of clustering keys deleted
- Partition tombstone: entire partition deleted (DELETE FROM t WHERE pk=x)
- TTL tombstone:       auto-generated when TTL expires
```
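The operational safety rule implied by step 4 can be written down explicitly. Note the asymmetry this sketch highlights: Cassandra itself only checks the gc_grace clock; ensuring a repair has run since the delete is the operator's job, and skipping it is exactly what creates zombies. (This is a model of the rule, not Cassandra's internal check.)

```python
def safe_to_purge(tombstone_age_s, seconds_since_last_repair, gc_grace_s=864_000):
    """Is it safe for compaction to drop this tombstone?

    Safe only if (a) gc_grace_seconds have elapsed since the delete,
    AND (b) a full repair has completed since the delete, meaning
    every replica has seen the tombstone.
    """
    past_grace = tombstone_age_s >= gc_grace_s
    repaired_since_delete = seconds_since_last_repair <= tombstone_age_s
    return past_grace and repaired_since_delete
```

With the 10-day default, a tombstone that is 11 days old is safe to purge if the last repair finished yesterday, but not if the last repair predates the delete.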
| Scenario | gc_grace_seconds | Repair Frequency | Risk |
|---|---|---|---|
| Default | 10 days | Must repair within 10 days | Safe if repair runs on schedule |
| Aggressive | 1 day | Must repair within 1 day | High risk: any delay causes zombies |
| Conservative | 30 days | Must repair within 30 days | More disk usage from tombstones |
| TWCS with TTL | 0 (sometimes) | N/A (whole SSTables drop) | Only safe with TWCS + uniform TTL |
The Zombie Data Problem
If gc_grace_seconds expires before repair runs on all replicas: (1) Tombstone is purged from repaired nodes. (2) Stale replica still has the original data. (3) Next read repair sees the data on the stale replica and no tombstone anywhere. (4) Data is "resurrected": the delete is lost. This is the most dangerous operational issue in Cassandra.
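The resurrection sequence can be demonstrated with a toy last-write-wins reconciler. This models the coordinator-side merge described above, not Cassandra's actual read path:

```python
def merge_read(replica_cells):
    """Toy coordinator reconciliation: the cell with the highest write
    timestamp wins; a replica with nothing for the key contributes None."""
    cells = [c for c in replica_cells if c is not None]
    return max(cells, key=lambda c: c["ts"]) if cells else None

tombstone = {"ts": 200, "live": False}               # written by the DELETE
stale = {"ts": 100, "live": True, "value": "alice"}  # replica that missed it
```

Before gc_grace, `merge_read([tombstone, stale])` correctly returns the tombstone (ts 200 beats ts 100), and read repair pushes the delete to the stale replica. After gc_grace without repair, the tombstone has been compacted away (`None`), so the stale replica's data is the only cell left and it wins: the delete is lost.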
nodetool Essential Commands
nodetool is the primary CLI for Cassandra operations. It communicates with the local node via JMX to perform administrative tasks, monitoring, and maintenance.
```shell
# ───────────────────────────────────────────────────────────
# CLUSTER STATUS & HEALTH
# ───────────────────────────────────────────────────────────

# Show cluster topology and node status
nodetool status
# UN = Up/Normal, DN = Down/Normal, UJ = Up/Joining, UL = Up/Leaving

# Show node info (heap, load, uptime, key cache hit rate)
nodetool info

# Show gossip state for all nodes
nodetool gossipinfo

# ───────────────────────────────────────────────────────────
# TABLE STATISTICS & DIAGNOSTICS
# ───────────────────────────────────────────────────────────

# Table-level stats (SSTable count, read/write latency, partition size)
nodetool tablestats my_keyspace.my_table

# Histogram of partition sizes and cell counts
nodetool tablehistograms my_keyspace.my_table

# Show compaction activity
nodetool compactionstats

# Show thread pool stats (pending/blocked tasks)
nodetool tpstats

# ───────────────────────────────────────────────────────────
# MAINTENANCE OPERATIONS
# ───────────────────────────────────────────────────────────

# Flush memtables to SSTables
nodetool flush my_keyspace

# Trigger compaction manually
nodetool compact my_keyspace my_table

# Run anti-entropy repair
nodetool repair my_keyspace

# Take a snapshot (backup)
nodetool snapshot my_keyspace -t my_snapshot_name

# Clear a snapshot
nodetool clearsnapshot -t my_snapshot_name

# ───────────────────────────────────────────────────────────
# NODE LIFECYCLE
# ───────────────────────────────────────────────────────────

# Gracefully remove a node (streams data to other nodes)
nodetool decommission

# Remove a dead node from the cluster
nodetool removenode <host_id>

# Drain (flush + stop accepting writes, for graceful shutdown)
nodetool drain
```
Daily Operations Checklist
- ✓ nodetool status – verify all nodes are UN (Up/Normal)
- ✓ nodetool tpstats – check for pending/blocked tasks (a sign of overload)
- ✓ nodetool compactionstats – verify compaction isn't falling behind
- ✓ nodetool tablestats – monitor SSTable count and read latency
- ✓ nodetool info – check heap usage and key cache hit rate
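For automated daily checks, the first item on the list can be scripted. A minimal sketch that assumes the familiar `nodetool status` layout, where each node line starts with a two-letter state code (UN, DN, UJ, UL); real output also includes datacenter headers and load columns, which this parser skips:

```python
def parse_nodetool_status(output):
    """Extract (state, address) pairs from `nodetool status` output."""
    states = {"UN", "DN", "UJ", "UL"}
    nodes = []
    for line in output.splitlines():
        parts = line.split()
        # Node lines begin with a status code; headers and rules do not.
        if parts and parts[0] in states:
            nodes.append((parts[0], parts[1]))
    return nodes

sample = """Datacenter: dc1
==============
--  Address    Load     Tokens  Owns  Host ID  Rack
UN  10.0.0.1   1.2 GiB  256     33%   aaaa     rack1
DN  10.0.0.2   1.1 GiB  256     33%   bbbb     rack1
"""
unhealthy = [addr for state, addr in parse_nodetool_status(sample) if state != "UN"]
```

In practice you would feed this the output of `subprocess.run(["nodetool", "status"], ...)` and alert whenever `unhealthy` is non-empty.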
Adding & Removing Nodes
Cassandra supports elastic scaling: adding and removing nodes without downtime. The cluster automatically rebalances data when topology changes. However, these operations must be done carefully to avoid data loss or performance degradation.
Adding a Node (Bootstrap)
Configure New Node
Set cassandra.yaml with cluster name, seeds, snitch, and DC/rack. Set auto_bootstrap: true (default).
Start the Node
Node contacts seeds, joins gossip, claims token ranges (vnodes assigned automatically)
Stream Data
Existing nodes stream data for the new node's token ranges. Status shows as UJ (Up/Joining).
Join Complete
Once streaming finishes, node status changes to UN (Up/Normal). It now serves reads and writes.
Removing a Node (Decommission)
Initiate Decommission
Run 'nodetool decommission' on the node to be removed. Status changes to UL (Up/Leaving).
Stream Data Out
Node streams all its data to the remaining nodes that will take over its token ranges
Leave Ring
Once streaming completes, node removes itself from the ring and shuts down
Never Remove More Than One Node at a Time
Removing multiple nodes simultaneously can cause data loss if their token ranges overlap with the same replicas. Always wait for one decommission to complete before starting another. For dead nodes that can't run decommission, use "nodetool removenode" from another node.
| Operation | Command | When to Use |
|---|---|---|
| Add node | Start with auto_bootstrap: true | Scaling up, replacing hardware |
| Remove live node | nodetool decommission | Scaling down, planned removal |
| Remove dead node | nodetool removenode <host_id> | Node permanently failed |
| Replace dead node | Start with -Dcassandra.replace_address=<old_ip> | Hardware replacement |
Interview Questions
Q: Compare STCS, LCS, and TWCS. When would you use each?
A: STCS: write-heavy, append-only workloads (logs, events). Low write amplification but high read amplification and 50% disk overhead. LCS: read-heavy workloads with updates (user profiles, sessions). Low read amplification but high write amplification (10-30x). TWCS: time-series with TTL. Groups by time window, drops entire SSTables on expiry: zero-cost deletes. Never use TWCS for data that gets updated or deleted individually.
Q: What are tombstones, and why can they cause problems?
A: Tombstones are delete markers (not actual deletions) because SSTables are immutable. Problems: (1) tombstones consume disk space until gc_grace_seconds expires, (2) reads must process tombstones (scanning past deleted data), (3) if repair doesn't run before gc_grace_seconds, deleted data can resurrect (zombie data). Excessive tombstones cause 'tombstone overwhelm': reads timing out from processing millions of tombstones.
Q: What happens if repair doesn't run within gc_grace_seconds?
A: Tombstones are garbage collected (purged) from nodes that have been repaired. But a stale replica that missed the original delete still has the data AND the tombstone is now gone from other replicas. On the next read, read repair sees data on the stale replica with no tombstone to contradict it, so it propagates the 'deleted' data back to all replicas. The delete is effectively lost. This is called zombie data resurrection.
Q: How does adding a node to a Cassandra cluster work?
A: (1) New node starts with auto_bootstrap=true and contacts seed nodes. (2) It joins gossip and learns the cluster topology. (3) With vnodes, it's assigned random token ranges. (4) Existing nodes stream data for those ranges to the new node (status: UJ/Joining). (5) Once streaming completes, status becomes UN/Normal and it serves traffic. The cluster remains fully available throughout: no downtime.
Q: Why does STCS require 50% free disk space?
A: STCS compacts similar-sized SSTables together. In the worst case, it compacts all SSTables in a tier into one new SSTable. During compaction, both the input SSTables and the output SSTable exist simultaneously. If the input is 50% of disk, the output needs another 50%. Only after the new SSTable is complete are the old ones deleted. LCS needs less (10-20%) because it compacts one SSTable at a time against a level.
Common Mistakes
Using DELETE heavily without understanding tombstones
Treating Cassandra like an RDBMS and issuing frequent DELETEs. Each delete creates a tombstone that persists for gc_grace_seconds (10 days). Tables with heavy deletes accumulate millions of tombstones, causing read timeouts.
→ Design for TTL-based expiration instead of explicit deletes. Use TWCS for time-series data. If deletes are necessary, monitor tombstone count with nodetool tablestats and ensure compaction runs regularly.
Running out of disk during compaction
Not reserving enough free disk space for compaction. STCS needs up to 50% free space temporarily. When disk fills, compaction stops, SSTables accumulate, reads get slower, and the node becomes unrecoverable.
→ Monitor disk usage and alert at 50% for STCS, 70% for LCS. Add nodes or increase storage before reaching these thresholds. Never let a Cassandra node exceed 80% disk usage.
Using STCS for read-heavy workloads
STCS allows many SSTables to accumulate (high read amplification). Read-heavy tables suffer because every read checks many SSTables, increasing latency unpredictably.
→ Switch to LCS for tables with frequent reads and updates. LCS guarantees bounded read amplification (one SSTable per level per partition). Accept the higher write amplification as the trade-off.
Using TWCS with out-of-order writes or updates
TWCS assumes data arrives in time order and is never updated. Out-of-order writes land in the wrong time window, and updates to old windows prevent those SSTables from being dropped cleanly.
→ Only use TWCS for pure append-only time-series with uniform TTL. If data arrives out of order or gets updated, use STCS or LCS instead.
Decommissioning multiple nodes simultaneously
Removing several nodes at once to scale down quickly. If their token ranges overlap, data may not have enough replicas during the transition, risking data loss.
→ Decommission one node at a time. Wait for the operation to complete (nodetool status shows the node gone) before starting the next. For large scale-downs, plan for days, not hours.