Multi-DC & Performance
Cassandra was built for multi-datacenter deployments. Active-active writes across continents, automatic conflict resolution, and zero-downtime failover are native capabilities, not bolt-on features.
Why Multi-Datacenter
Multi-datacenter deployment in Cassandra serves three primary purposes: geographic distribution (low latency for global users), disaster recovery (survive entire DC failures), and workload isolation (analytics in one DC, production in another).
Multi-DC Benefits
- ✅ Geographic proximity: users read from the nearest DC (lower latency)
- ✅ Disaster recovery: an entire DC failure doesn't lose data or availability
- ✅ Active-active: writes accepted in any DC, no failover needed
- ✅ Workload isolation: run analytics/repair in a separate DC without impacting production
- ✅ Regulatory compliance: keep data in specific geographic regions
The International Bank
Imagine a bank with branches in New York, London, and Tokyo. Each branch has a complete copy of all account records. A customer in Tokyo can deposit money at the Tokyo branch (fast, local). The deposit is replicated to New York and London within milliseconds. If the London branch burns down, New York and Tokyo continue operating without interruption. That's Cassandra multi-DC.
| Deployment | DCs | Use Case | Consistency Model |
|---|---|---|---|
| Single DC | 1 | Development, small apps | QUORUM within one DC |
| Multi-DC (2) | 2 | DR + geo-distribution | LOCAL_QUORUM per DC |
| Multi-DC (3+) | 3+ | Global presence | LOCAL_QUORUM per DC |
| DC + Analytics DC | 2 | Workload isolation | LOCAL_ONE for analytics reads |
NetworkTopologyStrategy Configuration
NetworkTopologyStrategy (NTS) is the only replication strategy suitable for production. It allows you to specify the replication factor independently for each datacenter and ensures replicas are placed on different racks within each DC.
-- Standard multi-DC keyspace (3 replicas per DC)
CREATE KEYSPACE global_app WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3,
  'eu-west-1': 3,
  'ap-southeast-1': 3
};

-- Asymmetric replication (analytics DC gets fewer replicas)
CREATE KEYSPACE analytics_app WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3,     -- production: full replication
  'us-west-2': 3,     -- DR: full replication
  'analytics-dc': 1   -- analytics: single copy (saves storage)
};

-- Adding a new DC to an existing keyspace
ALTER KEYSPACE global_app WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3,
  'eu-west-1': 3,
  'ap-southeast-1': 3,
  'sa-east-1': 3      -- new South America DC
};
-- Then run: nodetool rebuild -- us-east-1
-- on each node in the new DC to stream data from us-east-1
Rack-Aware Replica Placement
NetworkTopologyStrategy with RF=3 in a DC with 3 racks:

DC: us-east-1 (RF=3)
  Rack A: [Node1, Node2, Node3]
  Rack B: [Node4, Node5, Node6]
  Rack C: [Node7, Node8, Node9]

For partition P (token hashes to Node2):
  Replica 1: Node2 (Rack A) ← primary replica
  Replica 2: Node4 (Rack B) ← first node in a DIFFERENT rack
  Replica 3: Node7 (Rack C) ← first node in yet ANOTHER rack

Result: Partition P survives any single rack failure.
        All 3 replicas are in different racks.

⚠️ If you only have 2 racks with RF=3:
   Rack A gets 2 replicas, Rack B gets 1 (or vice versa).
   A single rack failure can lose 2 of 3 replicas!
   Best practice: racks ≥ RF
Racks Should Equal or Exceed RF
For optimal fault tolerance, have at least as many racks as your replication factor. With RF=3 and 3 racks, each rack holds exactly one replica. A rack failure loses only 1 of 3 replicas β QUORUM still achievable.
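Rack and DC names come from the snitch. A minimal sketch, assuming GossipingPropertyFileSnitch (a common production choice), where each node declares its own location in cassandra-rackdc.properties; the names below are illustrative and must match the keyspace's replication map:

# cassandra-rackdc.properties on each node (names are illustrative)
dc=us-east-1
rack=rack-a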
LOCAL_QUORUM: The Production Standard
LOCAL_QUORUM is the consistency level used by virtually every production Cassandra deployment. It provides strong consistency within a datacenter while avoiding cross-DC latency on the critical path.
Why LOCAL_QUORUM is the standard:

Setup: RF = {'us-east-1': 3, 'eu-west-1': 3}

With QUORUM (global):
  Required ACKs: floor(6/2) + 1 = 4
  A write in us-east-1 must wait for at least 1 ACK from eu-west-1
  Cross-Atlantic latency: 80-120ms added to EVERY write
  If eu-west-1 is unreachable: ALL writes fail

With LOCAL_QUORUM:
  Required ACKs: floor(3/2) + 1 = 2 (local DC only)
  A write in us-east-1 only waits for 2 local replicas
  Latency: 1-5ms (local network only)
  If eu-west-1 is unreachable: us-east-1 continues normally
  Data replicates to eu-west-1 asynchronously (typically < 100ms)

The trade-off:
  ✅ Strong consistency within each DC
  ✅ Survives remote DC failures
  ✅ Low latency (no cross-DC wait)
  ⚠️ Brief window of inconsistency across DCs (~100ms)
  ⚠️ A write in us-east-1 may not be immediately visible in eu-west-1
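To try this out, a minimal cqlsh sketch is shown below; the global_app.users table is an assumed example, and application drivers expose the same consistency levels per session or per statement:

-- cqlsh: set the consistency level for this session
CONSISTENCY LOCAL_QUORUM;

-- This write ACKs once 2 of the 3 replicas in the local DC respond
-- (global_app.users is an assumed example table)
INSERT INTO global_app.users (id, email) VALUES (uuid(), 'a@example.com');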
| Scenario | Recommended CL | Why |
|---|---|---|
| Standard reads/writes | LOCAL_QUORUM | Strong local consistency, survives DC failure |
| Analytics queries | LOCAL_ONE | Speed over consistency, stale data acceptable |
| Critical financial writes | EACH_QUORUM | Must confirm in all DCs before ACK |
| Session/cache data | LOCAL_ONE | Speed matters, data is ephemeral |
| Schema changes | ALL (implicit) | Must propagate to all nodes |
Active-Active Writes & Conflict Resolution
Cassandra supports active-active writes: both DCs accept writes simultaneously for the same data. When conflicts occur (the same partition updated in both DCs before replication completes), Cassandra resolves them using Last-Write-Wins (LWW) based on the write timestamp.
Conflict Resolution: Last-Write-Wins (LWW)

Timeline:
  t=0ms    User updates email to "a@new.com" in us-east-1
  t=5ms    Same user updates email to "b@new.com" in eu-west-1
  t=100ms  Replication delivers both writes to both DCs

Resolution (per-cell, not per-row):
  us-east-1 write timestamp: 1705312800000 (t=0ms)
  eu-west-1 write timestamp: 1705312800005 (t=5ms)
  Winner: eu-west-1 (higher timestamp) → email = "b@new.com"

This resolution happens identically on ALL replicas.

Key properties:
- Resolution is deterministic (same result on every node)
- Resolution is per-CELL (column), not per-row
- No data loss detection: the "losing" write is silently discarded
- Depends on synchronized clocks (NTP is critical!)

Per-cell resolution example:
  us-east-1: UPDATE users SET name='Alice', age=30 WHERE id=1;         (ts=100)
  eu-west-1: UPDATE users SET name='Bob', email='b@x.com' WHERE id=1;  (ts=105)

  Result: name='Bob' (ts=105 wins), age=30 (only write), email='b@x.com' (ts=105 wins)
  Each column is resolved independently!
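You can replay this race yourself with explicit write timestamps in CQL; a minimal sketch, assuming an illustrative users table with an integer id and using the timestamp values from the timeline above:

-- Simulate the conflicting writes with explicit timestamps (illustrative table)
UPDATE users USING TIMESTAMP 1705312800000 SET email = 'a@new.com' WHERE id = 1;
UPDATE users USING TIMESTAMP 1705312800005 SET email = 'b@new.com' WHERE id = 1;

-- The higher timestamp wins regardless of the order the writes arrive in
SELECT email, WRITETIME(email) FROM users WHERE id = 1;
-- email = 'b@new.com', writetime(email) = 1705312800005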
NTP Is Critical for Multi-DC
Last-Write-Wins depends on timestamps. If clocks are skewed between DCs, the "wrong" write can win. All Cassandra nodes must run NTP (or chrony) with tight synchronization. Clock skew > 1 second between DCs is dangerous. Monitor clock drift as a critical metric.
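A quick shell sketch for spot-checking drift on a node, assuming chrony is the time daemon (ntpd's tooling works similarly):

# Check this node's current offset from its time sources (chrony assumed)
chronyc tracking | grep -E 'System time|Last offset'
# Export the offset to your monitoring system and alert above ~100ms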
Active-Active Considerations
- ✅ LWW is automatic: no application code needed for conflict resolution
- ✅ Resolution is per-cell (column), not per-row: partial updates merge correctly
- ⚠️ NTP synchronization is mandatory: clock skew causes the wrong write to win
- ⚠️ No conflict detection: the application cannot know a conflict occurred
- ⚠️ For conflict-sensitive data, use LWT or route writes to a single DC (see the sketch after this list)
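For the LWT route mentioned above, a minimal CQL sketch against the same illustrative users table; note that lightweight transactions run a Paxos round at SERIAL/LOCAL_SERIAL consistency, so they cost several extra round trips:

-- Compare-and-set instead of a blind LWW overwrite (illustrative table)
UPDATE users SET email = 'b@new.com' WHERE id = 1 IF email = 'a@new.com';
-- Returns [applied] = false plus the current value if the condition no longer holds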
Hardware & JVM Tuning
Cassandra's performance is heavily influenced by hardware choices and JVM configuration. The wrong settings can cause GC pauses, I/O bottlenecks, and unpredictable latency spikes.
| Component | Recommendation | Why |
|---|---|---|
| Storage | SSDs (NVMe preferred) | Random reads for SSTable lookups; HDDs cause 10-100x latency |
| RAM | 32-64 GB | 8-16 GB for JVM heap, rest for OS page cache (caches SSTables) |
| CPU | 8-16 cores | Compaction, compression, and request handling are CPU-bound |
| Network | 10 Gbps+ | Streaming during repair/bootstrap saturates network |
| Commit log disk | Separate SSD | Avoids I/O contention with data reads |
| Disk count | Multiple data disks (JBOD) | Cassandra manages multiple data_file_directories |
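A cassandra.yaml sketch matching the disk layout above; the mount points are illustrative:

# cassandra.yaml: JBOD data directories on SSDs, commit log on its own SSD
# (mount points are illustrative)
data_file_directories:
    - /mnt/ssd1/cassandra/data
    - /mnt/ssd2/cassandra/data
commitlog_directory: /mnt/commitlog/cassandra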
JVM Heap Configuration (jvm.options or jvm11-server.options):

# Heap size: 8 GB is the sweet spot for most workloads
-Xms8G
-Xmx8G
# Never exceed 16 GB: GC pauses become unacceptable
# Rule: heap = min(8GB, 1/4 of RAM, 16GB max)

# Garbage Collector (Cassandra 4.0+):

# ZGC (recommended for Java 17+):
-XX:+UseZGC
-XX:+ZGenerational        # Java 21+

# G1GC (default for Cassandra 4.x on Java 11):
-XX:+UseG1GC
-XX:MaxGCPauseMillis=500
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:InitiatingHeapOccupancyPercent=70

# Off-heap memory (not counted in heap):
# - Bloom filters
# - Compression metadata
# - Key cache (configurable)
# - Chunk cache (Cassandra 4.0+)
# These use native memory: account for them in total RAM planning

# Total RAM budget example (64 GB machine):
#   JVM Heap:      8 GB
#   Off-heap:      4 GB (bloom filters, caches)
#   OS page cache: 48 GB (caches SSTable data files)
#   OS/other:      4 GB
Never Exceed 8-16 GB Heap
Larger heaps mean longer GC pauses. Cassandra is designed to use off-heap memory and OS page cache for data. The JVM heap holds memtables, internal structures, and request state. 8 GB handles most workloads. If you need more, add nodes (scale horizontally) rather than increasing heap.
Hardware Anti-Patterns
- ❌ HDDs for data: random read latency makes Cassandra unusable at scale
- ❌ Heap > 16 GB: GC pauses cause request timeouts and node flapping
- ❌ Shared commit log disk: I/O contention spikes write latency
- ❌ Fewer large nodes instead of more small nodes: blast radius too large
- ❌ RAID for data disks: Cassandra handles replication; RAID adds overhead
Monitoring Key Metrics
Effective Cassandra monitoring focuses on a small set of critical metrics that indicate cluster health, performance degradation, and impending problems before they cause outages.
| Metric | Healthy Range | Alert Threshold | Indicates |
|---|---|---|---|
| Read latency (p99) | < 10ms | > 50ms | Read amplification, compaction behind |
| Write latency (p99) | < 5ms | > 20ms | Commit log contention, overload |
| SSTable count per table | < 20 (LCS), < 50 (STCS) | Growing continuously | Compaction falling behind |
| Pending compactions | < 10 | > 50 | Compaction can't keep up with writes |
| GC pause time | < 200ms | > 500ms | Heap too large or memory pressure |
| Disk usage | < 50% (STCS), < 70% (LCS) | > threshold | Risk of compaction failure |
| Dropped mutations | 0 | > 0 | Nodes overloaded, writes being dropped |
| Key cache hit rate | > 85% | < 50% | Cache too small or access pattern changed |
# Key metrics from nodetool:

# Read/write latency per table
nodetool tablestats my_keyspace.my_table | grep -E "latency|SSTable count"

# GC stats
nodetool gcstats

# Thread pool status (look for pending/blocked)
nodetool tpstats | grep -E "Pool|Pending|Blocked"
# Critical pools: ReadStage, MutationStage, CompactionExecutor

# Compaction progress
nodetool compactionstats

# Network streaming (during repair/bootstrap)
nodetool netstats

# JMX metrics for Prometheus/Grafana:
#   org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency
#   org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
#   org.apache.cassandra.metrics:type=Storage,name=Load
#   org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped
The Three Signals of Trouble
Watch for these three patterns that predict outages: (1) Pending compactions growing over time: compaction can't keep up and reads will degrade. (2) GC pauses increasing: heap pressure that eventually causes the node to be marked down. (3) Dropped mutations > 0: the node is overloaded and silently losing writes.
Cassandra vs ScyllaDB
ScyllaDB is a C++ rewrite of Cassandra that eliminates the JVM entirely. It uses a shard-per-core architecture (one thread per CPU core, no shared state) and achieves 5-10x higher throughput with lower, more predictable latency. It's wire-compatible with Cassandra: same CQL, same drivers, same tools.
| Aspect | Apache Cassandra | ScyllaDB |
|---|---|---|
| Language | Java (JVM) | C++ (no GC) |
| Architecture | Thread-per-request | Shard-per-core (Seastar framework) |
| GC pauses | Yes (G1GC/ZGC, 50-500ms) | None (no garbage collector) |
| Throughput | Baseline | 5-10x higher per node |
| Tail latency (p99) | Variable (GC spikes) | Consistent (no GC) |
| Nodes needed | More (lower per-node throughput) | Fewer (higher per-node throughput) |
| CQL compatibility | Native | 100% compatible (same protocol) |
| Driver compatibility | Native | Uses same Cassandra drivers |
| Compaction | Single-threaded per table | Per-shard (parallel, no locks) |
| Cost | Open source (free) | Open source + Enterprise edition |
| Community | Large, mature (Apache) | Growing, backed by ScyllaDB Inc |
ScyllaDB Shard-Per-Core Architecture:

Traditional Cassandra (thread-per-request):
+------------------------------------------+
| JVM Heap (shared memory, GC managed)     |
|                                          |
| Thread 1 --+                             |
| Thread 2 --+-- Shared data structures    |
| Thread 3 --+   (locks, contention)       |
| Thread N --+                             |
|                                          |
| GC pauses affect ALL threads             |
+------------------------------------------+

ScyllaDB (shard-per-core, Seastar framework):
+----------+  +----------+  +----------+  +----------+
| Core 0   |  | Core 1   |  | Core 2   |  | Core N   |
|          |  |          |  |          |  |          |
| Own data |  | Own data |  | Own data |  | Own data |
| Own mem  |  | Own mem  |  | Own mem  |  | Own mem  |
| Own I/O  |  | Own I/O  |  | Own I/O  |  | Own I/O  |
|          |  |          |  |          |  |          |
| No locks |  | No locks |  | No locks |  | No locks |
| No GC    |  | No GC    |  | No GC    |  | No GC    |
+----------+  +----------+  +----------+  +----------+

Each core is independent: no shared state, no locks, no garbage collection.
Cross-core communication via message passing (like Erlang actors).
When to Choose ScyllaDB
Choose ScyllaDB when: you need lower tail latency (no GC pauses), higher throughput per node (fewer nodes = lower cost), or you're starting a new project with no existing Cassandra investment. Stick with Cassandra when: you have existing Cassandra expertise, need the larger community/ecosystem, or prefer Apache governance.
Interview Questions
Q: How does Cassandra handle conflicts in a multi-DC active-active setup?
A: Last-Write-Wins (LWW) based on write timestamp, resolved per-cell (column), not per-row. If the same column is updated in two DCs simultaneously, the write with the higher timestamp wins. Resolution is deterministic: all replicas reach the same result independently. This requires synchronized clocks (NTP). There's no conflict detection or notification: the 'losing' write is silently discarded.
Q: Why is LOCAL_QUORUM preferred over QUORUM in multi-DC deployments?
A: QUORUM counts replicas across ALL DCs: with RF={'dc1':3,'dc2':3}, QUORUM=4, requiring cross-DC acknowledgment (80-120ms latency). LOCAL_QUORUM counts only the local DC: floor(3/2)+1=2 local replicas. Benefits: (1) no cross-DC latency on critical path, (2) survives remote DC failure, (3) strong consistency within each DC. Trade-off: brief cross-DC inconsistency window (~100ms).
Q: What hardware recommendations would you give for a production Cassandra cluster?
A: SSDs (NVMe preferred) for data, because the read path is dominated by random reads. 32-64 GB RAM (8-16 GB heap, rest for OS page cache). 8-16 CPU cores. Separate SSD for the commit log. 10 Gbps network. No RAID (Cassandra handles replication). Prefer more smaller nodes over fewer large nodes (smaller blast radius). Never exceed 16 GB JVM heap, or GC pauses become unacceptable.
Q: What is ScyllaDB and how does it compare to Cassandra?
A: ScyllaDB is a C++ rewrite of Cassandra using the Seastar framework (shard-per-core architecture). No JVM means no GC pauses. Each CPU core owns its data independently: no locks, no shared state. Result: 5-10x throughput per node, consistent tail latency. It's wire-compatible (same CQL, same drivers). Trade-off: smaller community, commercial backing vs Apache governance.
Q: How do you add a new datacenter to an existing Cassandra cluster?
A: (1) ALTER KEYSPACE to add the new DC to NetworkTopologyStrategy. (2) Start nodes in the new DC with auto_bootstrap: false (they join empty). (3) Run 'nodetool rebuild -- <existing_dc>' on each new node to stream data from the existing DC. (4) Update client configuration to include new DC contact points. (5) The cluster remains fully available throughout: zero downtime.
Common Mistakes
Not synchronizing clocks across DCs
Last-Write-Wins depends on timestamps. Clock skew between DCs means the 'wrong' write can win conflicts. A 1-second skew means a write from 1 second ago can override a write from now.
✅ Run NTP (or chrony) on all nodes with tight synchronization. Monitor clock drift as a critical metric. Alert on skew > 100ms between any two nodes.
Using global QUORUM in multi-DC deployments
QUORUM across DCs adds cross-DC latency (80-120ms) to every request and fails if the remote DC is unreachable. This defeats the purpose of multi-DC (availability and low latency).
✅ Use LOCAL_QUORUM for both reads and writes. Accept eventual consistency across DCs (typically < 100ms). Use EACH_QUORUM only for critical operations that truly need global confirmation.
Using HDDs instead of SSDs
Cassandra's read path involves random I/O (seeking to specific offsets in SSTables). HDDs have 5-10ms seek time vs SSDs at 0.1ms. This makes p99 read latency 50-100x worse on HDDs.
✅ Always use SSDs for Cassandra data directories. NVMe SSDs are preferred for high-throughput workloads. HDDs are only acceptable for cold archive data that's rarely read.
Running too few large nodes instead of more small nodes
Running a 3-node cluster with 64 cores and 256 GB RAM per node. When one node goes down, 33% of capacity is lost. Repair and streaming take hours because each node holds so much data.
✅ Prefer more smaller nodes (e.g., 9 nodes with 16 cores and 64 GB RAM each). Smaller blast radius on failure, faster repairs, better load distribution. Target 1-2 TB of data per node maximum.
Setting JVM heap too large
Giving Cassandra 32 GB or 64 GB heap thinking 'more is better.' Large heaps cause long GC pauses (seconds), which cause the node to be marked down by other nodes, triggering unnecessary streaming and instability.
✅ Keep heap at 8 GB (max 16 GB). Let the OS page cache use the remaining RAM to cache SSTable data. If you need more capacity, add nodes instead of increasing heap.