Replication & Consistency
Cassandra's tunable consistency lets you choose the trade-off between availability and correctness on a per-query basis. Understanding R + W > RF is the key to getting it right.
Replication Factor & Strategies
The Replication Factor (RF) determines how many copies of each piece of data exist in the cluster. RF=3 means every partition is stored on three different nodes. This is the industry standard for production: it survives one node failure while still achieving QUORUM.
The Important Document
You have a critical contract. RF=1 means one copy in one drawer; if that drawer catches fire, it's gone. RF=3 means three copies in three different buildings. Even if one building burns down, you still have two copies. QUORUM means you need to check at least two copies to be sure you have the latest version.
| Strategy | Configuration | Replica Placement | Use Case |
|---|---|---|---|
| SimpleStrategy | RF=3 | Next N nodes clockwise on ring | Single DC only (dev/test) |
| NetworkTopologyStrategy | {'dc1': 3, 'dc2': 3} | RF per DC, rack-aware placement | Production (always) |
```sql
-- SimpleStrategy: NEVER use in production
CREATE KEYSPACE dev_keyspace
WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};

-- NetworkTopologyStrategy: ALWAYS use in production
CREATE KEYSPACE prod_keyspace
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'dc1': 3,  -- 3 replicas in datacenter 1
  'dc2': 3   -- 3 replicas in datacenter 2
};

-- Why NetworkTopologyStrategy even with 1 DC?
-- Because SimpleStrategy ignores rack placement.
-- NTS ensures replicas are on different racks.
CREATE KEYSPACE single_dc_prod
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3
};
```
Always Use NetworkTopologyStrategy
Even with a single datacenter, use NetworkTopologyStrategy. It respects rack placement (replicas on different racks) and makes future multi-DC expansion seamless. SimpleStrategy places replicas on the next N nodes clockwise regardless of rack, so a single rack failure could lose all replicas.
Replication Factor Guidelines
- RF=3 is the standard for production workloads
- RF must be ≤ the number of nodes in the DC (you can't have 3 replicas with only 2 nodes)
- Higher RF = more durability, but also more storage and write amplification
- RF=1 is only acceptable for ephemeral/cacheable data
- Odd RF values work best with QUORUM (RF=3: QUORUM=2, RF=5: QUORUM=3); see the sketch below
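A quick way to see why odd replication factors pair well with QUORUM is to compute the quorum size directly. A minimal sketch in plain Python (no Cassandra dependency assumed):

```python
# Quorum size for a given replication factor: floor(RF/2) + 1
def quorum(rf: int) -> int:
    return rf // 2 + 1

for rf in (1, 2, 3, 4, 5, 6):
    q = quorum(rf)
    # Node failures tolerated while still reaching QUORUM = RF - quorum(RF)
    print(f"RF={rf}: QUORUM={q}, tolerates {rf - q} node failure(s)")

# RF=3 -> QUORUM=2, tolerates 1 failure
# RF=4 -> QUORUM=3, tolerates 1 failure (the extra replica buys no extra tolerance)
# RF=5 -> QUORUM=3, tolerates 2 failures
```

Even RF values raise the quorum size without increasing the number of failures you can tolerate, which is why RF=3 and RF=5 are the common choices.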
Write Consistency Levels
The write consistency level (CL) determines how many replica nodes must acknowledge a write before the coordinator returns success to the client. Higher CL = more durable but higher latency and lower availability.
| Consistency Level | Replicas Required | Behavior | Trade-off |
|---|---|---|---|
| ANY | 1 (including hints) | Write succeeds even if stored only as a hint | Highest availability, risk of data loss |
| ONE | 1 replica | One replica confirms write to disk | Fast, but data on only 1 node confirmed |
| TWO | 2 replicas | Two replicas confirm | Rarely used directly |
| THREE | 3 replicas | Three replicas confirm | Rarely used directly |
| QUORUM | floor(RF/2) + 1 | Majority of all replicas across all DCs | Strong consistency with reads at QUORUM |
| LOCAL_QUORUM | floor(local_RF/2) + 1 | Majority in coordinator's DC only | Production standard for multi-DC |
| EACH_QUORUM | Quorum in each DC | Majority in every DC | Strongest multi-DC guarantee |
| ALL | All replicas | Every replica must confirm | Zero fault tolerance, avoid |
```text
Example: RF=3, Write at QUORUM

Client → Coordinator → Replica 1 (ACK ✓)
                     → Replica 2 (ACK ✓)  → QUORUM met (2/3), return success
                     → Replica 3 (ACK ✓)  → arrives later, still written

QUORUM = floor(3/2) + 1 = 2 replicas must acknowledge

Timeline:
  t=0ms   Coordinator sends to all 3 replicas
  t=2ms   Replica 1 acknowledges
  t=3ms   Replica 2 acknowledges → SUCCESS returned to client
  t=15ms  Replica 3 acknowledges (slow but still writes)

If Replica 3 is down:
  - Write still succeeds (2/3 = QUORUM)
  - Hinted handoff stores the write for Replica 3
  - When Replica 3 comes back, hint is replayed
```
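In application code, the write consistency level is chosen per statement (or per session default). A minimal sketch using the DataStax Python driver; the contact point, keyspace, and users table are illustrative assumptions, not part of the original example:

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

# Hypothetical contact point and keyspace, for illustration only
cluster = Cluster(['10.0.0.1'])
session = cluster.connect('prod_keyspace')

# Write at QUORUM: the coordinator waits for floor(RF/2)+1 replica acknowledgments
insert = SimpleStatement(
    "INSERT INTO users (user_id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, (42, 'Alice'))
```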
CL=ANY β The Dangerous One
ANY means the write succeeds even if it's only stored as a "hint" on the coordinator (no replica confirmed). If the coordinator crashes before replaying the hint, the data is lost. Use ANY only for non-critical data like analytics events where availability matters more than durability.
Read Consistency Levels
The read consistency level determines how many replicas the coordinator queries and how many must respond before returning data to the client. The coordinator always returns the most recent value (by timestamp) among the responses received.
| Consistency Level | Replicas Queried | Behavior |
|---|---|---|
| ONE | 1 (closest) | Fastest read, may return stale data |
| TWO | 2 | Compares 2 replicas, returns newest |
| QUORUM | floor(RF/2) + 1 | Majority guarantees seeing latest QUORUM write |
| LOCAL_QUORUM | floor(local_RF/2) + 1 | Majority in local DC only |
| LOCAL_ONE | 1 in local DC | Fast local read, may be stale |
| ALL | All replicas | Guaranteed latest, but one down node = failure |
```text
Read at QUORUM (RF=3):

Coordinator sends:
  → Replica 1: full data request
  → Replica 2: digest request (hash of data only)

Replica 1 returns: {name: "Alice", timestamp: 1000}
Replica 2 returns: digest(name: "Alice", timestamp: 1000)

If digests match       → return data immediately
If digests DON'T match → full read from all replicas
                       → return newest by timestamp
                       → trigger read repair on stale replicas

The "digest" optimization avoids transferring full data from all
replicas when they agree.
```
Digest Optimization
For reads at CL > ONE, Cassandra sends a full data request to one replica and digest (hash) requests to the others. If all digests match, only one full response was transferred. This reduces network bandwidth significantly for reads that find consistent data.
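Since a digest is just a hash of the row content, the idea can be modeled in a few lines. This is a toy illustration of the concept only, not Cassandra's actual digest format or wire protocol:

```python
import hashlib

def digest(row: dict) -> str:
    """Toy stand-in for a replica's digest response: a hash of the row content."""
    return hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()

data_replica = {"name": "Alice", "timestamp": 1000}             # full response
digest_replica = digest({"name": "Alice", "timestamp": 1000})   # digest-only response

if digest(data_replica) == digest_replica:
    result = data_replica  # replicas agree: return the single full copy immediately
else:
    # Mismatch: the coordinator would fetch full rows from all replicas,
    # return the newest by timestamp, and read-repair the stale ones.
    result = None
print(result)
```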
The Consistency Formula (R + W > RF)
The formula R + W > RF guarantees strong consistency β every read will see the most recent write. R is the read consistency level (number of replicas read), W is the write consistency level, and RF is the replication factor.
The Core Formula
If R + W > RF, at least one replica in every read was also part of the write quorum. This guarantees overlap β you always read from at least one node that has the latest data. The coordinator picks the newest value by timestamp.
| Configuration (RF=3) | R + W | Strong? | Use Case |
|---|---|---|---|
| W=QUORUM(2), R=QUORUM(2) | 4 > 3 | Yes | Standard production setup |
| W=ONE(1), R=ALL(3) | 4 > 3 | Yes | Fast writes, slow reads |
| W=ALL(3), R=ONE(1) | 4 > 3 | Yes | Slow writes, fast reads |
| W=ONE(1), R=ONE(1) | 2 < 3 | No | Eventual consistency (fast) |
| W=QUORUM(2), R=ONE(1) | 3 = 3 | No | Not guaranteed (edge case) |
| W=ONE(1), R=QUORUM(2) | 3 = 3 | No | Not guaranteed (edge case) |
```text
Why R + W > RF works (RF=3, W=QUORUM=2, R=QUORUM=2):

  Replicas: [A, B, C]
  Write "Alice" at QUORUM → written to A and B (2 of 3)
  Read at QUORUM          → reads from B and C (2 of 3)

  Overlap: Node B has the latest write AND is in the read set.
  Result: Read returns "Alice" ✓

Why R + W = RF fails (RF=3, W=ONE=1, R=QUORUM=2):

  Write "Alice" at ONE → written to A only (1 of 3)
  Read at QUORUM       → reads from B and C (2 of 3)

  No overlap! Neither B nor C has "Alice" yet.
  Result: Read returns stale data ✗

  (Eventually B and C get the data via anti-entropy,
   but there's a window of inconsistency)
```
The Venn Diagram
Imagine two circles: one for 'nodes that received the write' (W nodes) and one for 'nodes that are read' (R nodes). If R + W > RF, these circles MUST overlap: at least one node is in both circles. That overlapping node guarantees you see the latest write. If R + W ≤ RF, the circles CAN be completely separate, and you might read only from nodes that missed the write.
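The overlap argument can be checked by brute force: for every possible write set of size W and read set of size R over RF replicas, test whether they must intersect. A small sketch in plain Python (not part of any Cassandra API):

```python
from itertools import combinations

def always_overlaps(rf: int, w: int, r: int) -> bool:
    """True if every size-W write set intersects every size-R read set."""
    replicas = range(rf)
    return all(
        set(ws) & set(rs)
        for ws in combinations(replicas, w)
        for rs in combinations(replicas, r)
    )

print(always_overlaps(rf=3, w=2, r=2))  # True : R + W = 4 > 3, overlap is guaranteed
print(always_overlaps(rf=3, w=1, r=2))  # False: R + W = 3, the sets can miss each other
```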
LOCAL vs Non-LOCAL Consistency
In multi-datacenter deployments, the difference between QUORUM and LOCAL_QUORUM is critical. QUORUM counts replicas across ALL datacenters, while LOCAL_QUORUM only counts replicas in the coordinator's datacenter.
| Level | Scope | Latency Impact | Availability |
|---|---|---|---|
| QUORUM | All DCs | Waits for cross-DC responses (100ms+) | Fails if remote DC is down |
| LOCAL_QUORUM | Local DC only | Local network only (1-5ms) | Survives remote DC failure |
| EACH_QUORUM | Each DC independently | Waits for all DCs | Strongest guarantee, lowest availability |
| LOCAL_ONE | Local DC only | Single node (fastest) | May read stale data |
```text
Setup: RF={'dc1': 3, 'dc2': 3}, Total RF=6

QUORUM (global): floor(6/2) + 1 = 4 replicas across both DCs
  → Must wait for cross-DC network (100-200ms latency)
  → If dc2 is unreachable, writes fail

LOCAL_QUORUM: floor(3/2) + 1 = 2 replicas in local DC only
  → Only local network latency (1-5ms)
  → dc2 being unreachable doesn't affect dc1 operations
  → Data still replicates to dc2 asynchronously

Production recommendation:
  Writes: LOCAL_QUORUM (fast, survives remote DC failure)
  Reads:  LOCAL_QUORUM (strong consistency within DC)

This gives you:
  - Strong consistency within each DC
  - Eventual consistency across DCs (typically < 100ms)
  - Full availability during DC failures
```
LOCAL_QUORUM Is the Production Standard
Almost every production Cassandra deployment uses LOCAL_QUORUM for both reads and writes. It provides strong consistency within a datacenter, survives remote DC failures, and avoids cross-DC latency on the critical path. Cross-DC replication happens asynchronously.
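With the DataStax Python driver, the usual approach is to make LOCAL_QUORUM the default through an execution profile rather than setting it on every statement. A sketch assuming a two-DC cluster; the contact points, DC name, and keyspace are placeholders:

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra import ConsistencyLevel

profile = ExecutionProfile(
    # Route requests to the local DC and read/write at LOCAL_QUORUM
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc='dc1')),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)

cluster = Cluster(
    contact_points=['10.0.0.1', '10.0.0.2'],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect('prod_keyspace')
# All statements now default to LOCAL_QUORUM unless overridden per query.
```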
Hinted Handoff & Read Repair
Cassandra has multiple mechanisms to keep replicas in sync when nodes are temporarily unavailable. Hinted handoff handles short outages, and read repair fixes inconsistencies detected during reads.
Hinted Handoff
Node Goes Down
Replica node C becomes unreachable during a write
Hint Stored
Coordinator stores a 'hint' (the write intended for C) in a local hints directory
Write Succeeds
If enough other replicas acknowledge (meeting CL), the write succeeds
Node Returns
When C comes back online, gossip detects it
Hint Replayed
The coordinator replays stored hints to C, bringing it up to date
```yaml
# Hinted handoff configuration
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000   # 3 hours (default)
# Hints older than this window are dropped; the node then needs a full repair

# Hint storage
hints_directory: /var/lib/cassandra/hints
max_hints_delivery_threads: 2
hints_flush_period_in_ms: 10000
```
Read Repair
When a read at CL > ONE detects that replicas have different versions of data (digest mismatch), the coordinator triggers a read repair: it reads full data from all replicas, determines the newest version by timestamp, and writes the newest version back to any stale replicas.
```text
Read Repair Example (RF=3, CL=QUORUM):

1. Coordinator reads from Replica A (full) and Replica B (digest)
2. Digests don't match!
3. Coordinator reads full data from all 3 replicas:
     - Replica A: {name: "Alice",  ts: 1000}
     - Replica B: {name: "Alice",  ts: 1000}
     - Replica C: {name: "Alicia", ts: 800}   ← STALE
4. Coordinator returns newest (ts=1000) to client
5. Coordinator writes {name: "Alice", ts: 1000} to Replica C
6. All replicas now consistent

Read repair is opportunistic: it only fixes data that is read.
Data that is never read remains inconsistent until anti-entropy repair.
```
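The reconciliation step is plain last-write-wins by timestamp. A toy sketch of what the coordinator does once it has full responses from all replicas (illustrative only, not driver or server code):

```python
# Full responses from all replicas: (replica_id, row, write timestamp)
responses = [
    ("A", {"name": "Alice"}, 1000),
    ("B", {"name": "Alice"}, 1000),
    ("C", {"name": "Alicia"}, 800),   # stale replica
]

# Last-write-wins: the newest timestamp decides the winning value
winner_replica, winner_row, winner_ts = max(responses, key=lambda r: r[2])

# Read repair: write the winning value back to every replica that is behind
stale = [rid for rid, _, ts in responses if ts < winner_ts]
print(f"return {winner_row} to client; repair replicas {stale}")   # repair ['C']
```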
Read Repair Limitations
Read repair only fixes data that is actively read. If a partition is never queried, stale replicas remain stale forever. This is why anti-entropy repair (nodetool repair) is mandatory: it fixes ALL data, not just data that happens to be read.
Anti-Entropy Repair
Anti-entropy repair (nodetool repair) is the process of comparing all replicas of a token range and synchronizing any differences. It uses Merkle trees to efficiently identify which partitions differ without comparing every row.
Build Merkle Trees
Each replica builds a hash tree of its data for the requested token ranges
Compare Trees
Coordinator compares Merkle trees from all replicas β differences identified at the partition level
Stream Differences
Only differing partitions are streamed between replicas β not the entire dataset
Apply Mutations
Stale replicas apply the newer data, resolving by timestamp (last-write-wins)
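A Merkle tree lets two replicas find differing ranges by comparing a handful of hashes instead of every row. A toy sketch of the comparison idea (real repair hashes token ranges on SSTable data, not Python strings):

```python
import hashlib

def h(*parts: str) -> str:
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

def merkle_root(leaves: list[str]) -> str:
    """Fold leaf hashes pairwise up to a single root hash."""
    level = leaves
    while len(level) > 1:
        if len(level) % 2:                                   # duplicate last leaf on odd levels
            level = level + [level[-1]]
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Leaf hashes for four token sub-ranges on two replicas
replica_a = [h("range0:v1"), h("range1:v1"), h("range2:v1"), h("range3:v1")]
replica_b = [h("range0:v1"), h("range1:v1"), h("range2:OLD"), h("range3:v1")]

if merkle_root(replica_a) != merkle_root(replica_b):
    # Roots differ: descend the trees (here, just compare leaves) to find what to stream
    diff = [i for i, (x, y) in enumerate(zip(replica_a, replica_b)) if x != y]
    print(f"stream only sub-range(s) {diff}")                # stream only sub-range(s) [2]
```

Only the differing sub-range is streamed, which is what keeps repair affordable on large datasets.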
```bash
# Full repair of a keyspace (repairs all token ranges this node owns)
nodetool repair my_keyspace

# Incremental repair (only repairs data written since last repair)
nodetool repair my_keyspace --incremental

# Repair a specific table
nodetool repair my_keyspace my_table

# Parallel repair (repairs multiple ranges simultaneously)
nodetool repair my_keyspace --parallel

# Sub-range repair (for large clusters, repair in chunks)
nodetool repair my_keyspace -st <start_token> -et <end_token>

# Check repair status
nodetool netstats
```
Repair Must Run Before gc_grace_seconds
Tombstones (delete markers) are kept for gc_grace_seconds (default 10 days). If repair doesn't run within this window, deleted data can resurrect: a stale replica that missed the delete still has the old data, and after the tombstone is garbage collected, read repair will treat the old data as "newer" and propagate it back. This is called zombie data.
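gc_grace_seconds is a per-table property, so it can be inspected and changed with plain CQL. A sketch executed through the Python driver; the contact point, keyspace, and table names are placeholders:

```python
from cassandra.cluster import Cluster

# Placeholder contact point for illustration
session = Cluster(['10.0.0.1']).connect()

# Inspect the current setting (system_schema stores per-table options)
row = session.execute(
    "SELECT gc_grace_seconds FROM system_schema.tables "
    "WHERE keyspace_name = 'prod_keyspace' AND table_name = 'users'"
).one()
print(row.gc_grace_seconds)   # default: 864000 seconds = 10 days

# Only lower this if a full repair cycle reliably completes within the new window
session.execute("ALTER TABLE prod_keyspace.users WITH gc_grace_seconds = 432000")  # 5 days
```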
| Repair Type | Scope | Performance Impact | Use Case |
|---|---|---|---|
| Full repair | All data on node | High (builds full Merkle trees) | First repair, after long outage |
| Incremental repair | Only unrepaired SSTables | Lower (less data to compare) | Regular scheduled repairs |
| Sub-range repair | Specific token range | Controlled | Large clusters, parallel execution |
| Primary range repair | Only ranges this node is primary for | Avoids redundant work | Orchestrated repair tools |
Interview Questions
Q: Explain the formula R + W > RF and why it guarantees strong consistency.
A: R (read replicas) + W (write replicas) > RF (total replicas) ensures that the set of nodes read from and the set of nodes written to MUST overlap. At least one node in every read participated in the most recent write, so the coordinator always sees the latest value. Example: RF=3, W=QUORUM(2), R=QUORUM(2): 2+2=4 > 3, so overlap is guaranteed.
Q: What is the difference between QUORUM and LOCAL_QUORUM in a multi-DC setup?
A: QUORUM counts replicas across ALL datacenters: floor(total_RF/2)+1. With RF={'dc1':3,'dc2':3}, QUORUM=4, requiring cross-DC acknowledgment (high latency). LOCAL_QUORUM counts only the coordinator's DC: floor(local_RF/2)+1=2. It avoids cross-DC latency and survives remote DC failures. Production standard is LOCAL_QUORUM for both reads and writes.
Q: How does hinted handoff work and what are its limitations?
A: When a replica is down, the coordinator stores the intended write as a 'hint' locally. When the node returns, hints are replayed. Limitations: (1) hints are kept only for max_hint_window (default 3 hours), so longer outages need full repair, (2) hints don't count toward consistency level (except CL=ANY), (3) if the coordinator crashes, hints are lost, (4) hints consume disk space on the coordinator.
Q: What is read repair and why isn't it sufficient alone?
A: Read repair detects stale replicas during reads (via digest comparison) and updates them with the latest data. It's insufficient because: (1) it only fixes data that is actively read, so unread partitions stay inconsistent, (2) it adds latency to reads when triggered, (3) it doesn't prevent tombstone resurrection. Anti-entropy repair (nodetool repair) is mandatory to fix ALL data.
Q: Why must repair run within gc_grace_seconds?
A: Tombstones (delete markers) are garbage collected after gc_grace_seconds (default 10 days). If a replica missed a delete and repair doesn't run before the tombstone is GC'd, the deleted data still exists on the stale replica. When that replica is read, it appears to have 'newer' data (the tombstone is gone), and read repair propagates the deleted data back, resurrecting it. This is called 'zombie data.'
Common Mistakes
Using QUORUM instead of LOCAL_QUORUM in multi-DC
QUORUM in a multi-DC setup waits for cross-DC acknowledgment, adding 100-200ms latency to every request and failing if the remote DC is unreachable.
Fix: Use LOCAL_QUORUM for both reads and writes in multi-DC deployments. You get strong consistency within each DC and eventual consistency across DCs.
Not running repair regularly
Skipping nodetool repair because 'hinted handoff and read repair handle it.' They don't β hinted handoff has a time window, and read repair only fixes read data. Unrepaired clusters accumulate inconsistencies and risk zombie data.
Fix: Schedule repair to complete a full cycle within gc_grace_seconds (10 days). Use a tool like Cassandra Reaper to orchestrate repairs across the cluster.
Using CL=ALL for 'safety'
ALL means every replica must respond, so one slow or down node fails the entire request. This gives you zero fault tolerance and defeats the purpose of a distributed database.
Fix: Use QUORUM or LOCAL_QUORUM. They provide strong consistency while tolerating node failures. ALL should only be used in very specific scenarios like schema migrations.
Using SimpleStrategy in production
SimpleStrategy ignores rack topology. All 3 replicas might end up on nodes in the same rack. A single rack failure (power, network switch) loses all copies.
Fix: Always use NetworkTopologyStrategy, even with a single DC. It ensures replicas are placed on different racks for true fault tolerance.
Setting gc_grace_seconds too low
Reducing gc_grace_seconds to reclaim disk space faster. If repair doesn't complete within this window, tombstones are GC'd before all replicas see the delete, causing zombie data resurrection.
Fix: Keep gc_grace_seconds at 10 days (default) and ensure repair completes within that window. Only reduce it if you have very frequent repair cycles AND understand the zombie data risk.