Multi-DC & Performance
Cassandra was built for multi-datacenter deployments. Active-active writes across continents, automatic conflict resolution, and zero-downtime failover are native capabilities, not bolt-on features.
Why Multi-Datacenter
Multi-datacenter deployment in Cassandra serves three primary purposes: geographic distribution (low latency for global users), disaster recovery (survive entire DC failures), and workload isolation (analytics in one DC, production in another).
Multi-DC Benefits
- ✅ Geographic proximity: users read from the nearest DC (lower latency)
- ✅ Disaster recovery: an entire DC failure doesn't lose data or availability
- ✅ Active-active: writes accepted in any DC, no failover needed
- ✅ Workload isolation: run analytics/repair in a separate DC without impacting production
- ✅ Regulatory compliance: keep data in specific geographic regions
The International Bank
Imagine a bank with branches in New York, London, and Tokyo. Each branch has a complete copy of all account records. A customer in Tokyo can deposit money at the Tokyo branch (fast, local). The deposit is replicated to New York and London within milliseconds. If the London branch burns down, New York and Tokyo continue operating without interruption. That's Cassandra multi-DC.
| Deployment | DCs | Use Case | Consistency Model |
|---|---|---|---|
| Single DC | 1 | Development, small apps | QUORUM within one DC |
| Multi-DC (2) | 2 | DR + geo-distribution | LOCAL_QUORUM per DC |
| Multi-DC (3+) | 3+ | Global presence | LOCAL_QUORUM per DC |
| DC + Analytics DC | 2 | Workload isolation | LOCAL_ONE for analytics reads |
NetworkTopologyStrategy Configuration
NetworkTopologyStrategy (NTS) is the only replication strategy suitable for production. It allows you to specify the replication factor independently for each datacenter and ensures replicas are placed on different racks within each DC.
-- Standard multi-DC keyspace (3 replicas per DC)
CREATE KEYSPACE global_app WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3,
  'eu-west-1': 3,
  'ap-southeast-1': 3
};

-- Asymmetric replication (analytics DC gets fewer replicas)
CREATE KEYSPACE analytics_app WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3,     -- production: full replication
  'us-west-2': 3,     -- DR: full replication
  'analytics-dc': 1   -- analytics: single copy (saves storage)
};

-- Adding a new DC to an existing keyspace
ALTER KEYSPACE global_app WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3,
  'eu-west-1': 3,
  'ap-southeast-1': 3,
  'sa-east-1': 3      -- new South America DC
};
-- Then run: nodetool rebuild -- us-east-1
-- on each node in the new DC to stream data from us-east-1
Rack-Aware Replica Placement
NetworkTopologyStrategy with RF=3 in a DC with 3 racks:

DC: us-east-1 (RF=3)
  Rack A: [Node1, Node2, Node3]
  Rack B: [Node4, Node5, Node6]
  Rack C: [Node7, Node8, Node9]

For partition P (token hashes to Node2):
  Replica 1: Node2 (Rack A) ← primary replica
  Replica 2: Node4 (Rack B) ← first node in a DIFFERENT rack
  Replica 3: Node7 (Rack C) ← first node in yet ANOTHER rack

Result: Partition P survives any single rack failure.
        All 3 replicas are in different racks.

⚠️ If you only have 2 racks with RF=3:
   Rack A gets 2 replicas, Rack B gets 1 (or vice versa).
   A single rack failure can lose 2 of 3 replicas!
   Best practice: racks ≥ RF
Racks Should Equal or Exceed RF
For optimal fault tolerance, have at least as many racks as your replication factor. With RF=3 and 3 racks, each rack holds exactly one replica. A rack failure loses only 1 of 3 replicas β QUORUM still achievable.
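Rack and DC names come from the snitch. A minimal sketch, assuming GossipingPropertyFileSnitch (a common production choice), where each node declares its own location in cassandra-rackdc.properties; the names below are illustrative and must match the keyspace's replication map:

# cassandra-rackdc.properties on each node (names are illustrative)
dc=us-east-1
rack=rack-a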
LOCAL_QUORUM: The Production Standard
LOCAL_QUORUM is the consistency level used by virtually every production Cassandra deployment. It provides strong consistency within a datacenter while avoiding cross-DC latency on the critical path.
Why LOCAL_QUORUM is the standard:

Setup: RF = {'us-east-1': 3, 'eu-west-1': 3}

With QUORUM (global):
  Required ACKs: floor(6/2) + 1 = 4
  A write in us-east-1 must wait for at least 1 ACK from eu-west-1
  Cross-Atlantic latency: 80-120ms added to EVERY write
  If eu-west-1 is unreachable: ALL writes fail

With LOCAL_QUORUM:
  Required ACKs: floor(3/2) + 1 = 2 (local DC only)
  A write in us-east-1 only waits for 2 local replicas
  Latency: 1-5ms (local network only)
  If eu-west-1 is unreachable: us-east-1 continues normally
  Data replicates to eu-west-1 asynchronously (typically < 100ms)

The trade-off:
  ✅ Strong consistency within each DC
  ✅ Survives remote DC failures
  ✅ Low latency (no cross-DC wait)
  ⚠️ Brief window of inconsistency across DCs (~100ms)
  ⚠️ A write in us-east-1 may not be immediately visible in eu-west-1
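To try this out, a minimal cqlsh sketch is shown below; the global_app.users table is an assumed example, and application drivers expose the same consistency levels per session or per statement:

-- cqlsh: set the consistency level for this session
CONSISTENCY LOCAL_QUORUM;

-- This write ACKs once 2 of the 3 replicas in the local DC respond
-- (global_app.users is an assumed example table)
INSERT INTO global_app.users (id, email) VALUES (uuid(), 'a@example.com');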
| Scenario | Recommended CL | Why |
|---|---|---|
| Standard reads/writes | LOCAL_QUORUM | Strong local consistency, survives DC failure |
| Analytics queries | LOCAL_ONE | Speed over consistency, stale data acceptable |
| Critical financial writes | EACH_QUORUM | Must confirm in all DCs before ACK |
| Session/cache data | LOCAL_ONE | Speed matters, data is ephemeral |
| Schema changes | ALL (implicit) | Must propagate to all nodes |
Active-Active Writes & Conflict Resolution
Cassandra supports active-active writes: both DCs accept writes simultaneously for the same data. When conflicts occur (the same partition updated in both DCs before replication completes), Cassandra resolves them using Last-Write-Wins (LWW) based on the write timestamp.
Conflict Resolution: Last-Write-Wins (LWW)

Timeline:
  t=0ms    User updates email to "a@new.com" in us-east-1
  t=5ms    Same user updates email to "b@new.com" in eu-west-1
  t=100ms  Replication delivers both writes to both DCs

Resolution (per-cell, not per-row):
  us-east-1 write timestamp: 1705312800000 (t=0ms)
  eu-west-1 write timestamp: 1705312800005 (t=5ms)
  Winner: eu-west-1 (higher timestamp) → email = "b@new.com"

This resolution happens identically on ALL replicas.

Key properties:
- Resolution is deterministic (same result on every node)
- Resolution is per-CELL (column), not per-row
- No data loss detection: the "losing" write is silently discarded
- Depends on synchronized clocks (NTP is critical!)

Per-cell resolution example:
  us-east-1: UPDATE users SET name='Alice', age=30 WHERE id=1;         (ts=100)
  eu-west-1: UPDATE users SET name='Bob', email='b@x.com' WHERE id=1;  (ts=105)

  Result: name='Bob' (ts=105 wins), age=30 (only write), email='b@x.com' (ts=105 wins)
  Each column is resolved independently!
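You can replay this race yourself with explicit write timestamps in CQL; a minimal sketch, assuming an illustrative users table with an integer id and using the timestamp values from the timeline above:

-- Simulate the conflicting writes with explicit timestamps (illustrative table)
UPDATE users USING TIMESTAMP 1705312800000 SET email = 'a@new.com' WHERE id = 1;
UPDATE users USING TIMESTAMP 1705312800005 SET email = 'b@new.com' WHERE id = 1;

-- The higher timestamp wins regardless of the order the writes arrive in
SELECT email, WRITETIME(email) FROM users WHERE id = 1;
-- email = 'b@new.com', writetime(email) = 1705312800005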
NTP Is Critical for Multi-DC
Last-Write-Wins depends on timestamps. If clocks are skewed between DCs, the "wrong" write can win. All Cassandra nodes must run NTP (or chrony) with tight synchronization. Clock skew > 1 second between DCs is dangerous. Monitor clock drift as a critical metric.
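A quick shell sketch for spot-checking drift on a node, assuming chrony is the time daemon (ntpd's tooling works similarly):

# Check this node's current offset from its time sources (chrony assumed)
chronyc tracking | grep -E 'System time|Last offset'
# Export the offset to your monitoring system and alert above ~100ms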
Active-Active Considerations
- ✅ LWW is automatic: no application code needed for conflict resolution
- ✅ Resolution is per-cell (column), not per-row: partial updates merge correctly
- ⚠️ NTP synchronization is mandatory: clock skew causes the wrong write to win
- ⚠️ No conflict detection: the application cannot know a conflict occurred
- ⚠️ For conflict-sensitive data, use LWT or route writes to a single DC (see the sketch after this list)
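For the LWT route mentioned above, a minimal CQL sketch against the same illustrative users table; note that lightweight transactions run a Paxos round at SERIAL/LOCAL_SERIAL consistency, so they cost several extra round trips:

-- Compare-and-set instead of a blind LWW overwrite (illustrative table)
UPDATE users SET email = 'b@new.com' WHERE id = 1 IF email = 'a@new.com';
-- Returns [applied] = false plus the current value if the condition no longer holds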
Hardware & JVM Tuning
Cassandra's performance is heavily influenced by hardware choices and JVM configuration. The wrong settings can cause GC pauses, I/O bottlenecks, and unpredictable latency spikes.
| Component | Recommendation | Why |
|---|---|---|
| Storage | SSDs (NVMe preferred) | Random reads for SSTable lookups; HDDs cause 10-100x latency |
| RAM | 32-64 GB | 8-16 GB for JVM heap, rest for OS page cache (caches SSTables) |
| CPU | 8-16 cores | Compaction, compression, and request handling are CPU-bound |
| Network | 10 Gbps+ | Streaming during repair/bootstrap saturates network |
| Commit log disk | Separate SSD | Avoids I/O contention with data reads |
| Disk count | Multiple data disks (JBOD) | Cassandra manages multiple data_file_directories |
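A cassandra.yaml sketch matching the disk layout above; the mount points are illustrative:

# cassandra.yaml: JBOD data directories on SSDs, commit log on its own SSD
# (mount points are illustrative)
data_file_directories:
    - /mnt/ssd1/cassandra/data
    - /mnt/ssd2/cassandra/data
commitlog_directory: /mnt/commitlog/cassandra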
JVM Heap Configuration (jvm.options or jvm11-server.options):

# Heap size: 8 GB is the sweet spot for most workloads
-Xms8G
-Xmx8G
# Never exceed 16 GB: GC pauses become unacceptable
# Rule: heap = min(8GB, 1/4 of RAM, 16GB max)

# Garbage Collector (Cassandra 4.0+):

# ZGC (recommended for Java 17+):
-XX:+UseZGC
-XX:+ZGenerational        # Java 21+

# G1GC (default for Cassandra 4.x on Java 11):
-XX:+UseG1GC
-XX:MaxGCPauseMillis=500
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:InitiatingHeapOccupancyPercent=70

# Off-heap memory (not counted in heap):
# - Bloom filters
# - Compression metadata
# - Key cache (configurable)
# - Chunk cache (Cassandra 4.0+)
# These use native memory: account for them in total RAM planning

# Total RAM budget example (64 GB machine):
#   JVM Heap:      8 GB
#   Off-heap:      4 GB (bloom filters, caches)
#   OS page cache: 48 GB (caches SSTable data files)
#   OS/other:      4 GB
Never Exceed 8-16 GB Heap
Larger heaps mean longer GC pauses. Cassandra is designed to use off-heap memory and OS page cache for data. The JVM heap holds memtables, internal structures, and request state. 8 GB handles most workloads. If you need more, add nodes (scale horizontally) rather than increasing heap.
Hardware Anti-Patterns
- ❌ HDDs for data: random read latency makes Cassandra unusable at scale
- ❌ Heap > 16 GB: GC pauses cause request timeouts and node flapping
- ❌ Shared commit log disk: I/O contention spikes write latency
- ❌ Fewer large nodes instead of more small nodes: blast radius too large
- ❌ RAID for data disks: Cassandra handles replication; RAID adds overhead
Monitoring Key Metrics
Effective Cassandra monitoring focuses on a small set of critical metrics that indicate cluster health, performance degradation, and impending problems before they cause outages.
| Metric | Healthy Range | Alert Threshold | Indicates |
|---|---|---|---|
| Read latency (p99) | < 10ms | > 50ms | Read amplification, compaction behind |
| Write latency (p99) | < 5ms | > 20ms | Commit log contention, overload |
| SSTable count per table | < 20 (LCS), < 50 (STCS) | Growing continuously | Compaction falling behind |
| Pending compactions | < 10 | > 50 | Compaction can't keep up with writes |
| GC pause time | < 200ms | > 500ms | Heap too large or memory pressure |
| Disk usage | < 50% (STCS), < 70% (LCS) | > threshold | Risk of compaction failure |
| Dropped mutations | 0 | > 0 | Nodes overloaded, writes being dropped |
| Key cache hit rate | > 85% | < 50% | Cache too small or access pattern changed |
# Key metrics from nodetool:

# Read/write latency per table
nodetool tablestats my_keyspace.my_table | grep -E "latency|SSTable count"

# GC stats
nodetool gcstats

# Thread pool status (look for pending/blocked)
nodetool tpstats | grep -E "Pool|Pending|Blocked"
# Critical pools: ReadStage, MutationStage, CompactionExecutor

# Compaction progress
nodetool compactionstats

# Network streaming (during repair/bootstrap)
nodetool netstats

# JMX metrics for Prometheus/Grafana:
#   org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency
#   org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
#   org.apache.cassandra.metrics:type=Storage,name=Load
#   org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped
The Three Signals of Trouble
Watch for these three patterns that predict outages: (1) Pending compactions growing over time: compaction can't keep up and reads will degrade. (2) GC pauses increasing: heap pressure that eventually causes the node to be marked down. (3) Dropped mutations > 0: the node is overloaded and silently losing writes.
Cassandra vs ScyllaDB
ScyllaDB is a C++ rewrite of Cassandra that eliminates the JVM entirely. It uses a shard-per-core architecture (one thread per CPU core, no shared state) and achieves 5-10x higher throughput with lower, more predictable latency. It's wire-compatible with Cassandra: same CQL, same drivers, same tools.
| Aspect | Apache Cassandra | ScyllaDB |
|---|---|---|
| Language | Java (JVM) | C++ (no GC) |
| Architecture | Thread-per-request | Shard-per-core (Seastar framework) |
| GC pauses | Yes (G1GC/ZGC, 50-500ms) | None (no garbage collector) |
| Throughput | Baseline | 5-10x higher per node |
| Tail latency (p99) | Variable (GC spikes) | Consistent (no GC) |
| Nodes needed | More (lower per-node throughput) | Fewer (higher per-node throughput) |
| CQL compatibility | Native | 100% compatible (same protocol) |
| Driver compatibility | Native | Uses same Cassandra drivers |
| Compaction | Single-threaded per table | Per-shard (parallel, no locks) |
| Cost | Open source (free) | Open source + Enterprise edition |
| Community | Large, mature (Apache) | Growing, backed by ScyllaDB Inc |
ScyllaDB Shard-Per-Core Architecture:

Traditional Cassandra (thread-per-request):
+------------------------------------------+
| JVM Heap (shared memory, GC managed)     |
|                                          |
| Thread 1 --+                             |
| Thread 2 --+-- Shared data structures    |
| Thread 3 --+   (locks, contention)       |
| Thread N --+                             |
|                                          |
| GC pauses affect ALL threads             |
+------------------------------------------+

ScyllaDB (shard-per-core, Seastar framework):
+----------+  +----------+  +----------+  +----------+
| Core 0   |  | Core 1   |  | Core 2   |  | Core N   |
|          |  |          |  |          |  |          |
| Own data |  | Own data |  | Own data |  | Own data |
| Own mem  |  | Own mem  |  | Own mem  |  | Own mem  |
| Own I/O  |  | Own I/O  |  | Own I/O  |  | Own I/O  |
|          |  |          |  |          |  |          |
| No locks |  | No locks |  | No locks |  | No locks |
| No GC    |  | No GC    |  | No GC    |  | No GC    |
+----------+  +----------+  +----------+  +----------+

Each core is independent: no shared state, no locks, no garbage collection.
Cross-core communication via message passing (like Erlang actors).
When to Choose ScyllaDB
Choose ScyllaDB when: you need lower tail latency (no GC pauses), higher throughput per node (fewer nodes = lower cost), or you're starting a new project with no existing Cassandra investment. Stick with Cassandra when: you have existing Cassandra expertise, need the larger community/ecosystem, or prefer Apache governance.
Interview Questions
Q: How does Cassandra handle conflicts in a multi-DC active-active setup?
A: Last-Write-Wins (LWW) based on write timestamp, resolved per-cell (column), not per-row. If the same column is updated in two DCs simultaneously, the write with the higher timestamp wins. Resolution is deterministic: all replicas reach the same result independently. This requires synchronized clocks (NTP). There's no conflict detection or notification: the 'losing' write is silently discarded.
Q: Why is LOCAL_QUORUM preferred over QUORUM in multi-DC deployments?
A: QUORUM counts replicas across ALL DCs: with RF={'dc1':3,'dc2':3}, QUORUM=4, requiring cross-DC acknowledgment (80-120ms latency). LOCAL_QUORUM counts only the local DC: floor(3/2)+1=2 local replicas. Benefits: (1) no cross-DC latency on critical path, (2) survives remote DC failure, (3) strong consistency within each DC. Trade-off: brief cross-DC inconsistency window (~100ms).
Q: What hardware recommendations would you give for a production Cassandra cluster?
A: SSDs (NVMe preferred) for data, because the read path is dominated by random reads. 32-64 GB RAM (8-16 GB heap, rest for OS page cache). 8-16 CPU cores. Separate SSD for the commit log. 10 Gbps network. No RAID (Cassandra handles replication). Prefer more smaller nodes over fewer large nodes (smaller blast radius). Never exceed 16 GB JVM heap, or GC pauses become unacceptable.
Q: What is ScyllaDB and how does it compare to Cassandra?
A: ScyllaDB is a C++ rewrite of Cassandra using the Seastar framework (shard-per-core architecture). No JVM means no GC pauses. Each CPU core owns its data independently: no locks, no shared state. Result: 5-10x throughput per node, consistent tail latency. It's wire-compatible (same CQL, same drivers). Trade-off: smaller community, commercial backing vs Apache governance.
Q: How do you add a new datacenter to an existing Cassandra cluster?
A: (1) ALTER KEYSPACE to add the new DC to NetworkTopologyStrategy. (2) Start nodes in the new DC with auto_bootstrap: false (they join empty). (3) Run 'nodetool rebuild -- <existing_dc>' on each new node to stream data from the existing DC. (4) Update client configuration to include new DC contact points. (5) The cluster remains fully available throughout: zero downtime.
Common Mistakes
Not synchronizing clocks across DCs
Last-Write-Wins depends on timestamps. Clock skew between DCs means the 'wrong' write can win conflicts. A 1-second skew means a write from 1 second ago can override a write from now.
✅ Run NTP (or chrony) on all nodes with tight synchronization. Monitor clock drift as a critical metric. Alert on skew > 100ms between any two nodes.
Using global QUORUM in multi-DC deployments
QUORUM across DCs adds cross-DC latency (80-120ms) to every request and fails if the remote DC is unreachable. This defeats the purpose of multi-DC (availability and low latency).
✅ Use LOCAL_QUORUM for both reads and writes. Accept eventual consistency across DCs (typically < 100ms). Use EACH_QUORUM only for critical operations that truly need global confirmation.
Using HDDs instead of SSDs
Cassandra's read path involves random I/O (seeking to specific offsets in SSTables). HDDs have 5-10ms seek time vs SSDs at 0.1ms. This makes p99 read latency 50-100x worse on HDDs.
✅ Always use SSDs for Cassandra data directories. NVMe SSDs are preferred for high-throughput workloads. HDDs are only acceptable for cold archive data that's rarely read.
Running too few large nodes instead of more small nodes
Running a 3-node cluster with 64 cores and 256 GB RAM per node. When one node goes down, 33% of capacity is lost. Repair and streaming take hours because each node holds so much data.
✅ Prefer more smaller nodes (e.g., 9 nodes with 16 cores and 64 GB RAM each). Smaller blast radius on failure, faster repairs, better load distribution. Target 1-2 TB of data per node maximum.
Setting JVM heap too large
Giving Cassandra 32 GB or 64 GB heap thinking 'more is better.' Large heaps cause long GC pauses (seconds), which cause the node to be marked down by other nodes, triggering unnecessary streaming and instability.
✅ Keep heap at 8 GB (max 16 GB). Let the OS page cache use the remaining RAM to cache SSTable data. If you need more capacity, add nodes instead of increasing heap.