Replication & Consistency
Cassandra's tunable consistency lets you choose the trade-off between availability and correctness on a per-query basis. Understanding R + W > RF is the key to getting it right.
Replication Factor & Strategies
The Replication Factor (RF) determines how many copies of each piece of data exist in the cluster. RF=3 means every partition is stored on three different nodes. This is the industry standard for production: it survives one node failure while still achieving QUORUM.
The Important Document
You have a critical contract. RF=1 means one copy in one drawer; if that drawer catches fire, it's gone. RF=3 means three copies in three different buildings. Even if one building burns down, you still have two copies. QUORUM means you need to check at least two copies to be sure you have the latest version.
| Strategy | Configuration | Replica Placement | Use Case |
|---|---|---|---|
| SimpleStrategy | RF=3 | Next N nodes clockwise on ring | Single DC only (dev/test) |
| NetworkTopologyStrategy | {'dc1': 3, 'dc2': 3} | RF per DC, rack-aware placement | Production (always) |
```sql
-- SimpleStrategy: NEVER use in production
CREATE KEYSPACE dev_keyspace
WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};

-- NetworkTopologyStrategy: ALWAYS use in production
CREATE KEYSPACE prod_keyspace
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'dc1': 3,  -- 3 replicas in datacenter 1
  'dc2': 3   -- 3 replicas in datacenter 2
};

-- Why NetworkTopologyStrategy even with 1 DC?
-- Because SimpleStrategy ignores rack placement.
-- NTS ensures replicas are on different racks.
CREATE KEYSPACE single_dc_prod
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3
};
```
Always Use NetworkTopologyStrategy
Even with a single datacenter, use NetworkTopologyStrategy. It respects rack placement (replicas on different racks) and makes future multi-DC expansion seamless. SimpleStrategy places replicas on the next N nodes clockwise regardless of rack, so a single rack failure could lose all replicas.
Replication Factor Guidelines
- RF=3 is the standard for production workloads
- RF must be ≤ the number of nodes in the DC (you can't have 3 replicas with only 2 nodes)
- Higher RF = more durability, but also more storage and write amplification
- RF=1 is only acceptable for ephemeral/cacheable data
- Odd RF values work best with QUORUM (RF=3: QUORUM=2, RF=5: QUORUM=3); see the sketch below
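A quick way to see why odd replication factors pair well with QUORUM is to compute the quorum size directly. A minimal sketch in plain Python (no Cassandra dependency assumed):

```python
# Quorum size for a given replication factor: floor(RF/2) + 1
def quorum(rf: int) -> int:
    return rf // 2 + 1

for rf in (1, 2, 3, 4, 5, 6):
    q = quorum(rf)
    # Node failures tolerated while still reaching QUORUM = RF - quorum(RF)
    print(f"RF={rf}: QUORUM={q}, tolerates {rf - q} node failure(s)")

# RF=3 -> QUORUM=2, tolerates 1 failure
# RF=4 -> QUORUM=3, tolerates 1 failure (the extra replica buys no extra tolerance)
# RF=5 -> QUORUM=3, tolerates 2 failures
```

Even RF values raise the quorum size without increasing the number of failures you can tolerate, which is why RF=3 and RF=5 are the common choices.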
Write Consistency Levels
The write consistency level (CL) determines how many replica nodes must acknowledge a write before the coordinator returns success to the client. Higher CL = more durable but higher latency and lower availability.
| Consistency Level | Replicas Required | Behavior | Trade-off |
|---|---|---|---|
| ANY | 1 (including hints) | Write succeeds even if stored only as a hint | Highest availability, risk of data loss |
| ONE | 1 replica | One replica confirms write to disk | Fast, but data on only 1 node confirmed |
| TWO | 2 replicas | Two replicas confirm | Rarely used directly |
| THREE | 3 replicas | Three replicas confirm | Rarely used directly |
| QUORUM | floor(RF/2) + 1 | Majority of all replicas across all DCs | Strong consistency with reads at QUORUM |
| LOCAL_QUORUM | floor(local_RF/2) + 1 | Majority in coordinator's DC only | Production standard for multi-DC |
| EACH_QUORUM | Quorum in each DC | Majority in every DC | Strongest multi-DC guarantee |
| ALL | All replicas | Every replica must confirm | Zero fault tolerance, avoid |
```text
Example: RF=3, Write at QUORUM

Client → Coordinator → Replica 1 (ACK ✓)
                     → Replica 2 (ACK ✓)  → QUORUM met (2/3), return success
                     → Replica 3 (ACK ✓)  → arrives later, still written

QUORUM = floor(3/2) + 1 = 2 replicas must acknowledge

Timeline:
  t=0ms   Coordinator sends to all 3 replicas
  t=2ms   Replica 1 acknowledges
  t=3ms   Replica 2 acknowledges → SUCCESS returned to client
  t=15ms  Replica 3 acknowledges (slow but still writes)

If Replica 3 is down:
  - Write still succeeds (2/3 = QUORUM)
  - Hinted handoff stores the write for Replica 3
  - When Replica 3 comes back, hint is replayed
```
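In application code, the write consistency level is chosen per statement (or per session default). A minimal sketch using the DataStax Python driver; the contact point, keyspace, and users table are illustrative assumptions, not part of the original example:

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

# Hypothetical contact point and keyspace, for illustration only
cluster = Cluster(['10.0.0.1'])
session = cluster.connect('prod_keyspace')

# Write at QUORUM: the coordinator waits for floor(RF/2)+1 replica acknowledgments
insert = SimpleStatement(
    "INSERT INTO users (user_id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, (42, 'Alice'))
```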
CL=ANY β The Dangerous One
ANY means the write succeeds even if it's only stored as a "hint" on the coordinator (no replica confirmed). If the coordinator crashes before replaying the hint, the data is lost. Use ANY only for non-critical data like analytics events where availability matters more than durability.
Read Consistency Levels
The read consistency level determines how many replicas the coordinator queries and how many must respond before returning data to the client. The coordinator always returns the most recent value (by timestamp) among the responses received.
| Consistency Level | Replicas Queried | Behavior |
|---|---|---|
| ONE | 1 (closest) | Fastest read, may return stale data |
| TWO | 2 | Compares 2 replicas, returns newest |
| QUORUM | floor(RF/2) + 1 | Majority guarantees seeing latest QUORUM write |
| LOCAL_QUORUM | floor(local_RF/2) + 1 | Majority in local DC only |
| LOCAL_ONE | 1 in local DC | Fast local read, may be stale |
| ALL | All replicas | Guaranteed latest, but one down node = failure |
```text
Read at QUORUM (RF=3):

Coordinator sends:
  → Replica 1: full data request
  → Replica 2: digest request (hash of data only)

Replica 1 returns: {name: "Alice", timestamp: 1000}
Replica 2 returns: digest(name: "Alice", timestamp: 1000)

If digests match       → return data immediately
If digests DON'T match → full read from all replicas
                       → return newest by timestamp
                       → trigger read repair on stale replicas

The "digest" optimization avoids transferring full data from all
replicas when they agree.
```
Digest Optimization
For reads at CL > ONE, Cassandra sends a full data request to one replica and digest (hash) requests to the others. If all digests match, only one full response was transferred. This reduces network bandwidth significantly for reads that find consistent data.
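Since a digest is just a hash of the row content, the idea can be modeled in a few lines. This is a toy illustration of the concept only, not Cassandra's actual digest format or wire protocol:

```python
import hashlib

def digest(row: dict) -> str:
    """Toy stand-in for a replica's digest response: a hash of the row content."""
    return hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()

data_replica = {"name": "Alice", "timestamp": 1000}             # full response
digest_replica = digest({"name": "Alice", "timestamp": 1000})   # digest-only response

if digest(data_replica) == digest_replica:
    result = data_replica  # replicas agree: return the single full copy immediately
else:
    # Mismatch: the coordinator would fetch full rows from all replicas,
    # return the newest by timestamp, and read-repair the stale ones.
    result = None
print(result)
```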
The Consistency Formula (R + W > RF)
The formula R + W > RF guarantees strong consistency β every read will see the most recent write. R is the read consistency level (number of replicas read), W is the write consistency level, and RF is the replication factor.
The Core Formula
If R + W > RF, at least one replica in every read was also part of the write quorum. This guarantees overlap β you always read from at least one node that has the latest data. The coordinator picks the newest value by timestamp.
| Configuration (RF=3) | R + W | Strong? | Use Case |
|---|---|---|---|
| W=QUORUM(2), R=QUORUM(2) | 4 > 3 | Yes | Standard production setup |
| W=ONE(1), R=ALL(3) | 4 > 3 | Yes | Fast writes, slow reads |
| W=ALL(3), R=ONE(1) | 4 > 3 | Yes | Slow writes, fast reads |
| W=ONE(1), R=ONE(1) | 2 < 3 | No | Eventual consistency (fast) |
| W=QUORUM(2), R=ONE(1) | 3 = 3 | No | Not guaranteed (edge case) |
| W=ONE(1), R=QUORUM(2) | 3 = 3 | No | Not guaranteed (edge case) |
```text
Why R + W > RF works (RF=3, W=QUORUM=2, R=QUORUM=2):

  Replicas: [A, B, C]
  Write "Alice" at QUORUM → written to A and B (2 of 3)
  Read at QUORUM          → reads from B and C (2 of 3)

  Overlap: Node B has the latest write AND is in the read set.
  Result: Read returns "Alice" ✓

Why R + W = RF fails (RF=3, W=ONE=1, R=QUORUM=2):

  Write "Alice" at ONE → written to A only (1 of 3)
  Read at QUORUM       → reads from B and C (2 of 3)

  No overlap! Neither B nor C has "Alice" yet.
  Result: Read returns stale data ✗

  (Eventually B and C get the data via anti-entropy,
   but there's a window of inconsistency)
```
The Venn Diagram
Imagine two circles: one for 'nodes that received the write' (W nodes) and one for 'nodes that are read' (R nodes). If R + W > RF, these circles MUST overlap: at least one node is in both circles. That overlapping node guarantees you see the latest write. If R + W ≤ RF, the circles CAN be completely separate, and you might read only from nodes that missed the write.
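The overlap argument can be checked by brute force: for every possible write set of size W and read set of size R over RF replicas, test whether they must intersect. A small sketch in plain Python (not part of any Cassandra API):

```python
from itertools import combinations

def always_overlaps(rf: int, w: int, r: int) -> bool:
    """True if every size-W write set intersects every size-R read set."""
    replicas = range(rf)
    return all(
        set(ws) & set(rs)
        for ws in combinations(replicas, w)
        for rs in combinations(replicas, r)
    )

print(always_overlaps(rf=3, w=2, r=2))  # True : R + W = 4 > 3, overlap is guaranteed
print(always_overlaps(rf=3, w=1, r=2))  # False: R + W = 3, the sets can miss each other
```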
LOCAL vs Non-LOCAL Consistency
In multi-datacenter deployments, the difference between QUORUM and LOCAL_QUORUM is critical. QUORUM counts replicas across ALL datacenters, while LOCAL_QUORUM only counts replicas in the coordinator's datacenter.
| Level | Scope | Latency Impact | Availability |
|---|---|---|---|
| QUORUM | All DCs | Waits for cross-DC responses (100ms+) | Fails if remote DC is down |
| LOCAL_QUORUM | Local DC only | Local network only (1-5ms) | Survives remote DC failure |
| EACH_QUORUM | Each DC independently | Waits for all DCs | Strongest guarantee, lowest availability |
| LOCAL_ONE | Local DC only | Single node (fastest) | May read stale data |
```text
Setup: RF={'dc1': 3, 'dc2': 3}, Total RF=6

QUORUM (global): floor(6/2) + 1 = 4 replicas across both DCs
  → Must wait for cross-DC network (100-200ms latency)
  → If dc2 is unreachable, writes fail

LOCAL_QUORUM: floor(3/2) + 1 = 2 replicas in local DC only
  → Only local network latency (1-5ms)
  → dc2 being unreachable doesn't affect dc1 operations
  → Data still replicates to dc2 asynchronously

Production recommendation:
  Writes: LOCAL_QUORUM (fast, survives remote DC failure)
  Reads:  LOCAL_QUORUM (strong consistency within DC)

This gives you:
  - Strong consistency within each DC
  - Eventual consistency across DCs (typically < 100ms)
  - Full availability during DC failures
```
LOCAL_QUORUM Is the Production Standard
Almost every production Cassandra deployment uses LOCAL_QUORUM for both reads and writes. It provides strong consistency within a datacenter, survives remote DC failures, and avoids cross-DC latency on the critical path. Cross-DC replication happens asynchronously.
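With the DataStax Python driver, the usual approach is to make LOCAL_QUORUM the default through an execution profile rather than setting it on every statement. A sketch assuming a two-DC cluster; the contact points, DC name, and keyspace are placeholders:

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra import ConsistencyLevel

profile = ExecutionProfile(
    # Route requests to the local DC and read/write at LOCAL_QUORUM
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc='dc1')),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)

cluster = Cluster(
    contact_points=['10.0.0.1', '10.0.0.2'],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect('prod_keyspace')
# All statements now default to LOCAL_QUORUM unless overridden per query.
```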
Hinted Handoff & Read Repair
Cassandra has multiple mechanisms to keep replicas in sync when nodes are temporarily unavailable. Hinted handoff handles short outages, and read repair fixes inconsistencies detected during reads.
Hinted Handoff
Node Goes Down
Replica node C becomes unreachable during a write
Hint Stored
Coordinator stores a 'hint' (the write intended for C) in a local hints directory
Write Succeeds
If enough other replicas acknowledge (meeting CL), the write succeeds
Node Returns
When C comes back online, gossip detects it
Hint Replayed
The coordinator replays stored hints to C, bringing it up to date
```yaml
# Hinted handoff configuration
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000   # 3 hours (default)
# Hints older than this window are dropped; the node then needs a full repair

# Hint storage
hints_directory: /var/lib/cassandra/hints
max_hints_delivery_threads: 2
hints_flush_period_in_ms: 10000
```
Read Repair
When a read at CL > ONE detects that replicas have different versions of data (digest mismatch), the coordinator triggers a read repair: it reads full data from all replicas, determines the newest version by timestamp, and writes the newest version back to any stale replicas.
```text
Read Repair Example (RF=3, CL=QUORUM):

1. Coordinator reads from Replica A (full) and Replica B (digest)
2. Digests don't match!
3. Coordinator reads full data from all 3 replicas:
     - Replica A: {name: "Alice",  ts: 1000}
     - Replica B: {name: "Alice",  ts: 1000}
     - Replica C: {name: "Alicia", ts: 800}   ← STALE
4. Coordinator returns newest (ts=1000) to client
5. Coordinator writes {name: "Alice", ts: 1000} to Replica C
6. All replicas now consistent

Read repair is opportunistic: it only fixes data that is read.
Data that is never read remains inconsistent until anti-entropy repair.
```
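The reconciliation step is plain last-write-wins by timestamp. A toy sketch of what the coordinator does once it has full responses from all replicas (illustrative only, not driver or server code):

```python
# Full responses from all replicas: (replica_id, row, write timestamp)
responses = [
    ("A", {"name": "Alice"}, 1000),
    ("B", {"name": "Alice"}, 1000),
    ("C", {"name": "Alicia"}, 800),   # stale replica
]

# Last-write-wins: the newest timestamp decides the winning value
winner_replica, winner_row, winner_ts = max(responses, key=lambda r: r[2])

# Read repair: write the winning value back to every replica that is behind
stale = [rid for rid, _, ts in responses if ts < winner_ts]
print(f"return {winner_row} to client; repair replicas {stale}")   # repair ['C']
```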
Read Repair Limitations
Read repair only fixes data that is actively read. If a partition is never queried, stale replicas remain stale forever. This is why anti-entropy repair (nodetool repair) is mandatory: it fixes ALL data, not just data that happens to be read.
Anti-Entropy Repair
Anti-entropy repair (nodetool repair) is the process of comparing all replicas of a token range and synchronizing any differences. It uses Merkle trees to efficiently identify which partitions differ without comparing every row.
Build Merkle Trees
Each replica builds a hash tree of its data for the requested token ranges
Compare Trees
Coordinator compares Merkle trees from all replicas β differences identified at the partition level
Stream Differences
Only differing partitions are streamed between replicas β not the entire dataset
Apply Mutations
Stale replicas apply the newer data, resolving by timestamp (last-write-wins)
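A Merkle tree lets two replicas find differing ranges by comparing a handful of hashes instead of every row. A toy sketch of the comparison idea (real repair hashes token ranges on SSTable data, not Python strings):

```python
import hashlib

def h(*parts: str) -> str:
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

def merkle_root(leaves: list[str]) -> str:
    """Fold leaf hashes pairwise up to a single root hash."""
    level = leaves
    while len(level) > 1:
        if len(level) % 2:                                   # duplicate last leaf on odd levels
            level = level + [level[-1]]
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Leaf hashes for four token sub-ranges on two replicas
replica_a = [h("range0:v1"), h("range1:v1"), h("range2:v1"), h("range3:v1")]
replica_b = [h("range0:v1"), h("range1:v1"), h("range2:OLD"), h("range3:v1")]

if merkle_root(replica_a) != merkle_root(replica_b):
    # Roots differ: descend the trees (here, just compare leaves) to find what to stream
    diff = [i for i, (x, y) in enumerate(zip(replica_a, replica_b)) if x != y]
    print(f"stream only sub-range(s) {diff}")                # stream only sub-range(s) [2]
```

Only the differing sub-range is streamed, which is what keeps repair affordable on large datasets.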
```bash
# Full repair of a keyspace (repairs all token ranges this node owns)
nodetool repair my_keyspace

# Incremental repair (only repairs data written since last repair)
nodetool repair my_keyspace --incremental

# Repair a specific table
nodetool repair my_keyspace my_table

# Parallel repair (repairs multiple ranges simultaneously)
nodetool repair my_keyspace --parallel

# Sub-range repair (for large clusters, repair in chunks)
nodetool repair my_keyspace -st <start_token> -et <end_token>

# Check repair status
nodetool netstats
```
Repair Must Run Before gc_grace_seconds
Tombstones (delete markers) are kept for gc_grace_seconds (default 10 days). If repair doesn't run within this window, deleted data can resurrect: a stale replica that missed the delete still has the old data, and after the tombstone is garbage collected, read repair will treat the old data as "newer" and propagate it back. This is called zombie data.
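gc_grace_seconds is a per-table property, so it can be inspected and changed with plain CQL. A sketch executed through the Python driver; the contact point, keyspace, and table names are placeholders:

```python
from cassandra.cluster import Cluster

# Placeholder contact point for illustration
session = Cluster(['10.0.0.1']).connect()

# Inspect the current setting (system_schema stores per-table options)
row = session.execute(
    "SELECT gc_grace_seconds FROM system_schema.tables "
    "WHERE keyspace_name = 'prod_keyspace' AND table_name = 'users'"
).one()
print(row.gc_grace_seconds)   # default: 864000 seconds = 10 days

# Only lower this if a full repair cycle reliably completes within the new window
session.execute("ALTER TABLE prod_keyspace.users WITH gc_grace_seconds = 432000")  # 5 days
```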
| Repair Type | Scope | Performance Impact | Use Case |
|---|---|---|---|
| Full repair | All data on node | High (builds full Merkle trees) | First repair, after long outage |
| Incremental repair | Only unrepaired SSTables | Lower (less data to compare) | Regular scheduled repairs |
| Sub-range repair | Specific token range | Controlled | Large clusters, parallel execution |
| Primary range repair | Only ranges this node is primary for | Avoids redundant work | Orchestrated repair tools |
Interview Questions
Q: Explain the formula R + W > RF and why it guarantees strong consistency.
A: R (read replicas) + W (write replicas) > RF (total replicas) ensures that the set of nodes read from and the set of nodes written to MUST overlap. At least one node in every read participated in the most recent write, so the coordinator always sees the latest value. Example: RF=3, W=QUORUM(2), R=QUORUM(2): 2+2=4 > 3, so overlap is guaranteed.
Q: What is the difference between QUORUM and LOCAL_QUORUM in a multi-DC setup?
A: QUORUM counts replicas across ALL datacenters: floor(total_RF/2)+1. With RF={'dc1':3,'dc2':3}, QUORUM=4, requiring cross-DC acknowledgment (high latency). LOCAL_QUORUM counts only the coordinator's DC: floor(local_RF/2)+1=2. It avoids cross-DC latency and survives remote DC failures. Production standard is LOCAL_QUORUM for both reads and writes.
Q: How does hinted handoff work and what are its limitations?
A: When a replica is down, the coordinator stores the intended write as a 'hint' locally. When the node returns, hints are replayed. Limitations: (1) hints are kept only for max_hint_window (default 3 hours), so longer outages need full repair, (2) hints don't count toward consistency level (except CL=ANY), (3) if the coordinator crashes, hints are lost, (4) hints consume disk space on the coordinator.
Q: What is read repair and why isn't it sufficient alone?
A: Read repair detects stale replicas during reads (via digest comparison) and updates them with the latest data. It's insufficient because: (1) it only fixes data that is actively read, so unread partitions stay inconsistent, (2) it adds latency to reads when triggered, (3) it doesn't prevent tombstone resurrection. Anti-entropy repair (nodetool repair) is mandatory to fix ALL data.
Q: Why must repair run within gc_grace_seconds?
A: Tombstones (delete markers) are garbage collected after gc_grace_seconds (default 10 days). If a replica missed a delete and repair doesn't run before the tombstone is GC'd, the deleted data still exists on the stale replica. When that replica is read, it appears to have 'newer' data (the tombstone is gone), and read repair propagates the deleted data back, resurrecting it. This is called 'zombie data.'
Common Mistakes
Using QUORUM instead of LOCAL_QUORUM in multi-DC
QUORUM in a multi-DC setup waits for cross-DC acknowledgment, adding 100-200ms latency to every request and failing if the remote DC is unreachable.
Fix: Use LOCAL_QUORUM for both reads and writes in multi-DC deployments. You get strong consistency within each DC and eventual consistency across DCs.
Not running repair regularly
Skipping nodetool repair because 'hinted handoff and read repair handle it.' They don't β hinted handoff has a time window, and read repair only fixes read data. Unrepaired clusters accumulate inconsistencies and risk zombie data.
Fix: Schedule repair to complete a full cycle within gc_grace_seconds (10 days). Use a tool like Cassandra Reaper to orchestrate repairs across the cluster.
Using CL=ALL for 'safety'
ALL means every replica must respond, so one slow or down node fails the entire request. This gives you zero fault tolerance and defeats the purpose of a distributed database.
Fix: Use QUORUM or LOCAL_QUORUM. They provide strong consistency while tolerating node failures. ALL should only be used in very specific scenarios like schema migrations.
Using SimpleStrategy in production
SimpleStrategy ignores rack topology. All 3 replicas might end up on nodes in the same rack. A single rack failure (power, network switch) loses all copies.
Fix: Always use NetworkTopologyStrategy, even with a single DC. It ensures replicas are placed on different racks for true fault tolerance.
Setting gc_grace_seconds too low
Reducing gc_grace_seconds to reclaim disk space faster. If repair doesn't complete within this window, tombstones are GC'd before all replicas see the delete, causing zombie data resurrection.
Fix: Keep gc_grace_seconds at 10 days (default) and ensure repair completes within that window. Only reduce it if you have very frequent repair cycles AND understand the zombie data risk.