Leader-Follower · Redis Sentinel · Redis Cluster · Hash Slots · Automatic Failover · Read Scaling

Redis Replication & High Availability

Scale reads and survive failures — leader-follower replication, Redis Sentinel for automatic failover, and Redis Cluster for horizontal scaling across hash slots.

01

Why Replication Matters

A single Redis instance is a single point of failure. If that node goes down — hardware failure, network partition, OOM kill — every service that depends on it loses access to cached data, session state, rate-limit counters, and anything else stored in Redis. The blast radius is enormous.

Replication solves two problems at once: fault tolerance (if the primary dies, a replica takes over) and read scaling (replicas can serve read traffic, reducing load on the primary). In most production systems, Redis isn't optional — it's a critical dependency. Treating it as a single disposable node is a design flaw.

📋

The Head Chef Analogy

Imagine a restaurant with one head chef. Every order goes through them — they cook, plate, and serve. If the chef gets sick, the restaurant closes. Replication is like training sous chefs who watch the head chef and learn every recipe in real time. If the head chef is out, a sous chef steps up immediately. And on busy nights, sous chefs can handle read-only tasks (plating, garnishing) while the head chef focuses on cooking (writes).

🔑 The Two Axes of Replication

Replication gives you two things: (1) Availability — if the primary fails, a replica can be promoted. (2) Read throughput — replicas can serve GET commands, multiplying your read capacity. These are independent benefits. You might want replication purely for availability even if you don't need read scaling.

Single Instance vs Replicated — Risk Profile
Single Redis instance:
  Write throughput:  100,000 ops/sec
  Read throughput:   100,000 ops/sec
  Availability:      ONE node failure = TOTAL outage
  Data durability:   Lost if node dies (unless AOF/RDB)  ❌
  Recovery time:     Minutes (restart + reload from disk)  ❌

Redis with 2 replicas:
  Write throughput:  100,000 ops/sec (primary only)  ✅
  Read throughput:   300,000 ops/sec (primary + 2 replicas)  ✅
  Availability:      Survives 1-2 node failures
  Data durability:   Data exists on 3 nodes
  Recovery time:     Seconds (promote replica)  ✅
02

Leader-Follower Replication

Redis uses asynchronous leader-follower (master-replica) replication. The primary node accepts all writes. Replicas connect to the primary, receive a stream of write commands, and replay them locally. Reads can be served by any replica.

How Replication Works

1

Initial Sync (Full Resynchronization)

When a replica connects for the first time (or after a long disconnect), the primary triggers a BGSAVE — it forks and creates an RDB snapshot. The snapshot is sent to the replica, which loads it into memory. During the snapshot, the primary buffers new writes in a replication backlog.

2

Backlog Replay

After the replica loads the RDB snapshot, the primary sends all buffered writes from the replication backlog. The replica replays them to catch up to the primary's current state.

3

Continuous Streaming (PSYNC)

Once caught up, the primary streams every write command to the replica in real time. This is the steady state — the replica stays in sync by replaying the same commands the primary executes. This uses the PSYNC protocol.

4

Partial Resync After Disconnect

If a replica briefly disconnects (network blip), it doesn't need a full resync. It sends its replication offset to the primary. If the offset is still in the replication backlog, the primary sends only the missed commands — a partial resync. Much faster than a full sync.
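
Both full and partial resync hinge on the replication offset and the backlog. You can inspect them with INFO replication; a minimal check, assuming the primary from the configuration below is reachable at 192.168.1.100 (the host is a placeholder, the field names are standard INFO fields):

Inspecting Offsets and the Replication Backlog
# On the primary: current write-stream offset and backlog state
redis-cli -h 192.168.1.100 INFO replication | \
  grep -E 'master_repl_offset|repl_backlog_size|repl_backlog_histlen'

# master_repl_offset   : byte offset of the replication stream on the primary
# repl_backlog_size    : configured backlog capacity (see repl-backlog-size)
# repl_backlog_histlen : bytes of history currently held; a disconnected
#                        replica can do a partial resync only if its offset
#                        still falls inside this window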

redis.conf — Replica Configuration
# On the replica node
replicaof 192.168.1.100 6379

# Authentication (if primary requires a password)
masterauth your-secret-password

# Allow reads on the replica (default: yes)
replica-read-only yes

# Replication backlog size (for partial resync)
# Larger = survives longer disconnects without full resync
repl-backlog-size 64mb

# Diskless replication (primary sends RDB over socket, no disk I/O)
repl-diskless-sync yes
repl-diskless-sync-delay 5

Replica-of-Replica Chaining

Replicas can replicate from other replicas, forming a chain: Primary → Replica A → Replica B. This reduces load on the primary (it only sends data to Replica A, not to both). Useful when you have many replicas — the primary doesn't need to maintain connections to all of them.

Chained Replication Topology
Direct replication (primary sends to all):
  Primary ──→ Replica 1
         ──→ Replica 2
         ──→ Replica 3
  Primary handles 3 replication streams (high CPU/bandwidth)

Chained replication (reduces primary load):
  Primary ──→ Replica 1 ──→ Replica 3
                         ──→ Replica 4
         ──→ Replica 2 ──→ Replica 5
  Primary handles only 2 streams
  Trade-off: Replicas 3-5 have higher replication lag
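
Setting up a chained replica uses the same replicaof mechanism, just pointed at another replica instead of the primary. A minimal sketch; the hostnames are placeholders standing in for the nodes in the diagram above:

Chaining a Replica off Another Replica
# On Replica 3: replicate from Replica 1 instead of the primary
redis-cli -h replica-3 REPLICAOF replica-1 6379

# Or make it permanent in Replica 3's redis.conf:
#   replicaof replica-1 6379
# A read-only replica can still act as a replication source for nodes below it.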

Reading from Replicas — Use Cases and Risks

Replicas can serve read traffic, effectively multiplying your read throughput. But because replication is asynchronous, replicas may be slightly behind the primary. This means reads from replicas can return stale data.

Use Case                     | Read from Replica? | Why
Analytics dashboards         | ✅ Yes             | Stale data by a few seconds is acceptable
Product catalog browsing     | ✅ Yes             | Eventual consistency is fine for display
Session validation           | ⚠️ Careful         | User logs in on primary, reads session from replica — might not exist yet
Inventory count for purchase | ❌ No              | Stale count could oversell — read from primary
Rate limiting counters       | ❌ No              | Stale counter allows exceeding limits
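
The read-only behavior is easy to see from the command line. A quick sketch, assuming default settings (replica-read-only yes) and placeholder hostnames:

Reads Succeed on a Replica, Writes Do Not
# Reading from a replica works (but may be slightly stale)
redis-cli -h replica-1 GET user:1000
# "Alice"

# Writing to a replica is rejected while replica-read-only is yes
redis-cli -h replica-1 SET user:1000 "Bob"
# (error) READONLY You can't write against a read only replica.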

Monitoring Replication Lag

Checking Replication Lag
# On the primary: shows connected replicas and their lag
redis-cli INFO replication

# Key fields:
#   connected_slaves: 2
#   slave0: ip=10.0.0.2,port=6379,state=online,offset=1234567,lag=0
#   slave1: ip=10.0.0.3,port=6379,state=online,offset=1234560,lag=1
#
# "lag" = seconds since last ACK from replica
# "offset" difference = bytes of replication stream behind

# Alert if lag > 1 second for latency-sensitive reads
# Alert if offset difference > 10MB for data safety
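
A minimal alerting sketch built on that INFO output; the host, threshold, and parsing are assumptions to adapt to your own monitoring:

Simple Lag Check Script
#!/usr/bin/env bash
# Pull the worst "lag=" value reported for any connected replica
PRIMARY_HOST=primary-1        # placeholder
MAX_LAG_SECONDS=1             # placeholder threshold

worst_lag=$(redis-cli -h "$PRIMARY_HOST" INFO replication \
  | grep -o 'lag=[0-9]*' | cut -d= -f2 | sort -n | tail -1)

if [ "${worst_lag:-0}" -gt "$MAX_LAG_SECONDS" ]; then
  echo "ALERT: replication lag is ${worst_lag}s (threshold ${MAX_LAG_SECONDS}s)"
fi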

Diskless Replication

By default, full resync writes an RDB file to disk, then sends it to the replica. With diskless replication (repl-diskless-sync yes), the primary streams the RDB directly over the socket — no disk I/O. This is faster on systems with slow disks but fast networks. The trade-off: if the transfer fails, it must restart from scratch (no file to resume from).

🎯 Interview Insight

When discussing Redis replication, always mention that it's asynchronous. This means: (1) writes are acknowledged before replicas confirm, so data can be lost on primary failure, and (2) replicas may serve stale reads. These are fundamental trade-offs, not bugs.

03

Redis Sentinel

Leader-follower replication gives you replicas, but it doesn't give you automatic failover. If the primary dies, someone has to manually promote a replica and reconfigure clients. Redis Sentinel automates this — it monitors your Redis instances, detects failures, and promotes a replica to primary without human intervention.

🏥

The Hospital Shift Supervisor

Think of Sentinel as a team of shift supervisors in a hospital. They constantly check if the lead surgeon (primary) is available. If the lead surgeon doesn't respond, the supervisors vote: 'Is the surgeon really unavailable, or is it just my pager that's broken?' If a majority agree the surgeon is down, they promote the most experienced resident (replica) to lead surgeon and notify all the nurses (clients) about the change. You need at least 3 supervisors so a majority vote is meaningful.

What Sentinel Does

🔍 Monitoring

  • Continuously pings primary and replicas
  • Detects when a node stops responding
  • Distinguishes subjective vs objective down

📢 Notification

  • Publishes events via Pub/Sub
  • Notifies clients of topology changes
  • Can trigger alerts to ops teams

🔄 Automatic Failover

  • Promotes a replica to primary
  • Reconfigures other replicas to follow new primary
  • Updates clients with new primary address

⚙️ Configuration Provider

  • Clients connect to Sentinel, not directly to Redis
  • Sentinel tells clients the current primary address
  • Clients auto-reconnect after failover

Quorum — Why You Need at Least 3 Sentinels

A single Sentinel can't reliably detect failures — it might be the Sentinel that's having network issues, not the primary. The quorum is the minimum number of Sentinels that must agree the primary is down before failover begins. With 3 Sentinels and a quorum of 2, you can tolerate 1 Sentinel failure and still perform failover.

sentinel.conf — Minimal Configuration
# Each Sentinel node runs this configuration
port 26379

# Monitor the primary named "mymaster" at 192.168.1.100:6379
# Quorum of 2: at least 2 Sentinels must agree the primary is down
sentinel monitor mymaster 192.168.1.100 6379 2

# Primary is considered down after 5 seconds of no response
sentinel down-after-milliseconds mymaster 5000

# Only 1 replica syncs from new primary at a time during failover
# (prevents all replicas from being unavailable simultaneously)
sentinel parallel-syncs mymaster 1

# Failover times out after 60 seconds
sentinel failover-timeout mymaster 60000

# Authentication
sentinel auth-pass mymaster your-secret-password
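
Once the Sentinels are up, you can query them directly on port 26379. A few commands worth knowing (the master name matches the configuration above):

Querying Sentinel State
# Which primary is Sentinel currently pointing clients at?
redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster

# Full state of the monitored primary (flags, replica count, quorum, ...)
redis-cli -p 26379 SENTINEL master mymaster

# Replicas and other Sentinels this Sentinel knows about
redis-cli -p 26379 SENTINEL replicas mymaster    # "SENTINEL slaves" before Redis 5.0
redis-cli -p 26379 SENTINEL sentinels mymaster

# Check that enough Sentinels are reachable to authorize a failover
redis-cli -p 26379 SENTINEL ckquorum mymaster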

Failover Process — Step by Step

1

Subjective Down (SDOWN)

A single Sentinel detects the primary isn't responding to PING within the down-after-milliseconds threshold. It marks the primary as 'subjectively down' — this Sentinel thinks it's down, but it could be a local network issue.

2

Objective Down (ODOWN)

The Sentinel asks other Sentinels: 'Do you also think the primary is down?' If the quorum number of Sentinels agree, the primary is marked 'objectively down.' This prevents false positives from a single Sentinel's network issues.

3

Sentinel Leader Election

The Sentinels elect a leader among themselves using a Raft-like algorithm. The leader Sentinel will coordinate the failover. This prevents multiple Sentinels from trying to promote different replicas simultaneously.

4

Replica Selection

The leader Sentinel picks the best replica to promote based on: (1) replica priority (configurable), (2) replication offset (most up-to-date data), (3) run ID (tie-breaker). The replica with the most data and highest priority wins.

5

Promotion

The leader Sentinel sends REPLICAOF NO ONE to the chosen replica, making it a standalone primary. It then reconfigures all other replicas to replicate from the new primary using REPLICAOF <new-primary-ip> <port>.

6

Client Notification

Sentinel publishes a +switch-master event. Sentinel-aware clients (like Jedis, Lettuce, ioredis) subscribe to these events and automatically reconnect to the new primary. The old primary, if it comes back, is reconfigured as a replica.
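
You can rehearse this sequence without killing anything: SENTINEL failover forces a failover on demand, which is a convenient way to test that clients reconnect correctly:

Forcing a Failover for Testing
# Ask one Sentinel to start a failover of "mymaster" immediately
redis-cli -p 26379 SENTINEL failover mymaster

# Watch the advertised primary address change
redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster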

Client Connection via Sentinel (Node.js / ioredis)
// Clients connect to Sentinel, NOT directly to Redis
const redis = new Redis({
  sentinels: [
    { host: "sentinel-1", port: 26379 },
    { host: "sentinel-2", port: 26379 },
    { host: "sentinel-3", port: 26379 },
  ],
  name: "mymaster",  // The monitored master name
});

// ioredis automatically:
// 1. Asks Sentinel for the current primary address
// 2. Connects to the primary
// 3. Subscribes to failover events
// 4. Reconnects to the new primary after failover

Sentinel Limitations

⚠️ What Sentinel Does NOT Do

Sentinel provides high availability for a single dataset. It does NOT shard data across nodes — all data must fit on one primary. If your dataset is 100GB and your nodes have 64GB RAM, Sentinel can't help. You need Redis Cluster for horizontal scaling. Sentinel also doesn't improve write throughput — all writes still go to a single primary.

04

Redis Cluster

Redis Cluster is the built-in solution for horizontal scaling. When your dataset doesn't fit on a single node — or when you need more write throughput than one primary can handle — Cluster distributes data across multiple primaries, each responsible for a subset of the keyspace.

🏢

The Food Court Analogy

A single-store restaurant (Sentinel) can only serve so many customers — eventually you run out of kitchen space. Redis Cluster is like opening a food court with multiple restaurants, each specializing in a section of the menu. Customer orders are routed to the right restaurant based on what they ordered. Each restaurant has its own kitchen (primary) and backup chef (replica). If one restaurant closes, its backup chef takes over. The food court can serve far more customers than any single restaurant.

Hash Slots — How Data Is Distributed

Redis Cluster divides the keyspace into 16,384 hash slots. Every key is mapped to a slot using the formula CRC16(key) % 16384. Each primary node owns a range of slots. When you add or remove nodes, slots are redistributed — not individual keys.

Hash Slot Distribution — 3 Primary Nodes
Total hash slots: 16,384

Node A (primary): slots 0-5460      (5,461 slots)
Node B (primary): slots 5461-10922  (5,462 slots)
Node C (primary): slots 10923-16383 (5,461 slots)

Key routing examples:
  SET user:1000 "Alice"
    → CRC16("user:1000") = 7142
    → 7142 % 16384 = 7142
    → Slot 7142 → owned by Node B

  SET session:abc "data"
    → CRC16("session:abc") = 13104
    → 13104 % 16384 = 13104
    → Slot 13104 → owned by Node C

Each node only stores keys that map to its assigned slots.
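
You don't need to compute CRC16 by hand: any cluster node will tell you a key's slot via CLUSTER KEYSLOT. The slot numbers above are illustrative; whatever values your build returns, the mapping rule is the same:

Checking a Key's Slot
redis-cli -h 10.0.0.1 CLUSTER KEYSLOT user:1000
# (integer) <slot between 0 and 16383>

redis-cli -h 10.0.0.1 CLUSTER KEYSLOT session:abc
# (integer) <usually a different slot, owned by a different node>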

Hash Tags — Co-locating Related Keys

By default, different keys land on different nodes. But sometimes you need related keys on the same node — for multi-key operations like MGET or Lua scripts. Hash tags solve this: if a key contains {...}, only the content inside the braces is hashed.

Hash Tags — Forcing Keys to the Same Slot
Without hash tags (keys land on different nodes):
  SET user:1000:profile "..."    → CRC16("user:1000:profile")  → Slot 3921
  SET user:1000:settings "..."   → CRC16("user:1000:settings") → Slot 11842
  SET user:1000:cart "..."       → CRC16("user:1000:cart")     → Slot 7291
  MGET user:1000:profile user:1000:settings → CROSSSLOT error

With hash tags (all hash to the same substring):
  SET {user:1000}:profile "..."    → CRC16("user:1000") → Slot 7142
  SET {user:1000}:settings "..."   → CRC16("user:1000") → Slot 7142
  SET {user:1000}:cart "..."       → CRC16("user:1000") → Slot 7142
  MGET {user:1000}:profile {user:1000}:settings → works!

Rule: Redis hashes only the content between the first { and first }.
Use this to co-locate all data for a single entity on one node.
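
CLUSTER KEYSLOT also confirms the hash-tag rule directly, since keys sharing the same {...} content must return the same slot (braces are quoted so the shell leaves them alone):

Verifying Hash Tags with CLUSTER KEYSLOT
redis-cli CLUSTER KEYSLOT '{user:1000}:profile'
redis-cli CLUSTER KEYSLOT '{user:1000}:settings'
redis-cli CLUSTER KEYSLOT '{user:1000}:cart'
# All three return the same slot number, because only "user:1000" is hashed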

Cluster Topology — Primary + Replica Per Shard

Each primary in the cluster can have one or more replicas. If a primary fails, its replica is automatically promoted — no Sentinel needed. The cluster handles its own failover.

Typical 3-Shard Cluster Topology
Shard 1:  Primary A (slots 0-5460)      ←→  Replica A'
Shard 2:  Primary B (slots 5461-10922)  ←→  Replica B'
Shard 3:  Primary C (slots 10923-16383) ←→  Replica C'

Total: 6 nodes (3 primaries + 3 replicas)
  - Write capacity: 3x single node (writes distributed across primaries)
  - Read capacity: 6x single node (if reading from replicas)
  - Fault tolerance: survives 1 primary failure per shard

Minimum recommended: 6 nodes (3 primaries + 3 replicas)
Production: often 6-12+ nodes depending on data size and throughput

MOVED and ASK Redirections

When a client sends a command to the wrong node (the node doesn't own that key's slot), the node responds with a redirection. Smart clients cache the slot-to-node mapping and rarely need redirections after the initial discovery.

Redirection Examples
MOVED (permanent redirection; slot ownership changed):
  Client → Node A: GET user:1000
  Node A → Client: MOVED 7142 10.0.0.2:6379
  Meaning: "Slot 7142 permanently lives on 10.0.0.2. Update your routing table."
  Client updates its slot map and retries on Node B.

ASK (temporary redirection; slot is being migrated):
  Client → Node A: GET user:1000
  Node A → Client: ASK 7142 10.0.0.2:6379
  Meaning: "Slot 7142 is being moved to 10.0.0.2. Try there THIS TIME ONLY."
  Client sends ASKING + GET to Node B, but doesn't update its slot map.

Smart clients (ioredis, Jedis, Lettuce):
  - Cache the full slot map on startup (CLUSTER SLOTS command)
  - Route commands directly to the correct node
  - Handle MOVED by updating the cached map
  - Handle ASK with a one-time redirect
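
redis-cli makes the difference easy to see: without -c it simply returns the MOVED error, with -c it follows the redirection the way a smart client would (addresses and the exact output wording are illustrative):

Observing MOVED with redis-cli
# Plain mode: the redirection comes back as an error
redis-cli -h 10.0.0.1 GET user:1000
# (error) MOVED 7142 10.0.0.2:6379

# Cluster mode (-c): redis-cli follows the redirect automatically
redis-cli -c -h 10.0.0.1 GET user:1000
# -> Redirected to slot [7142] located at 10.0.0.2:6379
# "Alice"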

Resharding — Adding and Removing Nodes

1

Add the New Node to the Cluster

The new node joins the cluster with no slots assigned. It's part of the cluster but doesn't own any data yet.

2

Migrate Slots to the New Node

Use redis-cli --cluster reshard to move slots from existing nodes to the new one. During migration, keys in the moving slot are gradually transferred. Clients get ASK redirections for keys that have already moved.

3

Rebalance

After resharding, use redis-cli --cluster rebalance to evenly distribute slots across all nodes. This ensures no single node is a hotspot.

Resharding Commands
# Create a 6-node cluster (3 primaries + 3 replicas)
redis-cli --cluster create \
  10.0.0.1:6379 10.0.0.2:6379 10.0.0.3:6379 \
  10.0.0.4:6379 10.0.0.5:6379 10.0.0.6:6379 \
  --cluster-replicas 1

# Add a new node to an existing cluster
redis-cli --cluster add-node 10.0.0.7:6379 10.0.0.1:6379

# Reshard: move 1000 slots to the new node
redis-cli --cluster reshard 10.0.0.1:6379 \
  --cluster-from all \
  --cluster-to <new-node-id> \
  --cluster-slots 1000 \
  --cluster-yes

# Rebalance slots evenly across all primaries
redis-cli --cluster rebalance 10.0.0.1:6379

Cluster Limitations

⚠️ What Cluster Cannot Do

Redis Cluster has real constraints you must design around: (1) Multi-key operations (MGET, SUNION, etc.) only work if all keys are in the same slot — use hash tags to co-locate them. (2) Lua scripts can only access keys in a single slot. (3) Transactions (MULTI/EXEC) are limited to keys on the same node. (4) SELECT (multiple databases) is not supported — only database 0. (5) Large keys that are hot can create shard imbalance since a slot can't be split across nodes.

05

Cluster vs Sentinel vs Standalone

Choosing the right Redis topology depends on your data size, throughput requirements, and availability needs. Each option adds complexity — don't over-engineer.

                       | Standalone                | Sentinel                       | Cluster
Data size              | Fits on 1 node            | Fits on 1 node                 | Exceeds 1 node's memory
Write scaling          | Single primary            | Single primary                 | Multiple primaries (sharded)
Read scaling           | No (1 node)               | Yes (read from replicas)       | Yes (replicas per shard)
Automatic failover     | No                        | Yes (Sentinel manages it)      | Yes (built-in)
Multi-key operations   | Full support              | Full support                   | Same slot only (use hash tags)
Lua scripts            | Full support              | Full support                   | Single slot only
Minimum nodes          | 1                         | 3 Redis + 3 Sentinel           | 6 (3 primaries + 3 replicas)
Operational complexity | Low                       | Medium                         | High
Best for               | Dev, small apps, caching  | Production HA, single dataset  | Large datasets, high write throughput

🟢 Standalone

  • Development and staging
  • Small datasets (< 10GB)
  • Cache-only (data is rebuildable)
  • Low traffic applications
  • When downtime is acceptable

🟡 Sentinel

  • Production systems needing HA
  • Dataset fits on one node
  • Need automatic failover
  • Read scaling via replicas
  • Complex Lua scripts or transactions

🔴 Cluster

  • Dataset exceeds single node memory
  • Need horizontal write scaling
  • High throughput (> 100K ops/sec)
  • Can design around slot constraints
  • Team has operational expertise

🎯 Interview Insight

Interviewers love asking "When would you use Sentinel vs Cluster?" The key distinction: Sentinel = high availability for a single dataset (failover). Cluster = horizontal scaling across multiple nodes (sharding + failover). If the data fits on one node, Sentinel is simpler and has fewer constraints.

06

Scaling Decision Framework

Use these signals to decide when to scale your Redis topology. Start simple and add complexity only when the signals tell you to.

When to Add Read Replicas

  • Primary CPU is consistently above 60% and most operations are reads
  • Read latency is increasing due to load on the primary
  • You need geographic read distribution (replicas in different regions)
  • Analytics or reporting queries are competing with production reads
  • You want a hot standby for manual failover without Sentinel

When to Deploy Sentinel

  • Your application cannot tolerate Redis downtime (sessions, rate limits, queues)
  • You already have replicas and want automatic failover instead of manual promotion
  • On-call engineers are being paged to manually promote replicas during outages
  • Your SLA requires 99.9%+ availability for the Redis-dependent service
  • You need client auto-discovery of the current primary address

When to Move to Redis Cluster

  • Your dataset exceeds the memory of a single node (e.g., > 64GB)
  • Write throughput on the primary is saturated and you can't scale vertically
  • You need to partition data across nodes for compliance or data residency
  • You've already optimized key expiration, eviction, and data structures
  • Your access patterns can be designed around hash slot constraints

When Sentinel Is Enough (Don't Over-Engineer)

  • Dataset is under 25-50GB and fits comfortably on one node
  • Write throughput is well within a single primary's capacity
  • You rely heavily on multi-key operations, transactions, or Lua scripts
  • Your team doesn't have experience operating sharded systems
  • The added complexity of Cluster isn't justified by current or projected load

💡 The Scaling Ladder

Most teams follow this progression: Standalone → Standalone + Replicas → Sentinel + Replicas → Cluster. Each step adds complexity. Don't skip steps — if you haven't outgrown Sentinel, you don't need Cluster. Vertical scaling (bigger node) is often cheaper and simpler than horizontal scaling (more nodes).

07

Interview Questions

These questions test your understanding of Redis replication mechanics, failover behavior, and cluster design trade-offs.

Q: Redis replication is asynchronous. What are the consequences of this?

A: Because replication is async, the primary acknowledges writes before replicas confirm receipt. This has two consequences: (1) Data loss on failover — if the primary crashes before a write is replicated, that write is lost when a replica is promoted. The WAIT command can force synchronous replication for critical writes, but it adds latency. (2) Stale reads from replicas — replicas may be milliseconds behind the primary. A client that writes to the primary and immediately reads from a replica might not see its own write. Design for this: use read-from-primary for consistency-sensitive operations, and read-from-replica for eventually-consistent workloads like dashboards.
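
A small sketch of the WAIT escape hatch mentioned above, run inside a single redis-cli session (the key is made up; the arguments 1 and 500 mean "one replica, 500 ms timeout"):

Using WAIT After a Critical Write
127.0.0.1:6379> SET order:42:status "paid"
OK
127.0.0.1:6379> WAIT 1 500
(integer) 1
# 1 = one replica acknowledged within 500 ms; a return of 0 means the write
# is still only on the primary, so decide whether to retry or alert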

Q: Explain the Redis Sentinel failover process. What happens when the primary goes down?

A: Step 1: A Sentinel detects the primary isn't responding to PING within the configured timeout and marks it as Subjectively Down (SDOWN). Step 2: The Sentinel asks other Sentinels if they also see the primary as down. If the quorum number agrees, the primary is marked Objectively Down (ODOWN). Step 3: Sentinels elect a leader among themselves to coordinate the failover. Step 4: The leader selects the best replica based on priority, replication offset (most data), and run ID. Step 5: The leader sends REPLICAOF NO ONE to the chosen replica, promoting it to primary. Other replicas are reconfigured to follow the new primary. Step 6: Sentinel publishes a +switch-master event. Sentinel-aware clients detect this and reconnect to the new primary. The old primary, if it recovers, is automatically reconfigured as a replica.

Q: How does Redis Cluster distribute data across nodes? What happens when a client sends a command to the wrong node?

A: Redis Cluster divides the keyspace into 16,384 hash slots. Each key is assigned to a slot via CRC16(key) % 16384, and each primary node owns a range of slots. When a client sends a command to a node that doesn't own the key's slot, the node responds with a MOVED redirection containing the correct node's address. Smart clients (ioredis, Jedis) cache the slot-to-node mapping after initial discovery and route commands directly, so redirections are rare in steady state. During resharding, clients may receive ASK redirections for slots being migrated — these are temporary one-time redirects that don't update the cached mapping.

Q: When would you choose Redis Sentinel over Redis Cluster?

A: Choose Sentinel when: (1) your dataset fits on a single node — Sentinel provides HA without sharding complexity, (2) you need full support for multi-key operations, Lua scripts, and transactions — Cluster restricts these to same-slot keys, (3) your write throughput is within a single primary's capacity, and (4) operational simplicity matters — Sentinel is easier to deploy and debug. Choose Cluster when: your data exceeds one node's memory, you need horizontal write scaling, or you need to partition data. The key insight: Sentinel = HA for one dataset, Cluster = HA + sharding. Don't use Cluster just for HA if Sentinel suffices.

Q: What are hash tags in Redis Cluster and why are they important?

A: Hash tags let you control which slot a key maps to by wrapping part of the key in curly braces: {user:1000}:profile and {user:1000}:settings both hash on 'user:1000', so they land in the same slot on the same node. This is critical because Redis Cluster only supports multi-key operations (MGET, SUNION, pipeline), Lua scripts, and transactions when all involved keys are in the same slot. Without hash tags, related keys scatter across nodes and these operations fail with CROSSSLOT errors. The trade-off: over-using hash tags can create hot slots if one entity gets disproportionate traffic, since all its keys are on one node.

08

Common Mistakes

These mistakes cause real production outages and data loss in Redis deployments.

🔌

Running a single Redis instance in production

The primary Redis node is a single point of failure. When it goes down — and it will — every dependent service loses access to cached data, sessions, and rate limits simultaneously. Teams assume Redis 'never goes down' until it does, and there's no replica to fail over to.

Always run at least one replica in production, even if you don't use Sentinel. A replica gives you a hot standby for manual failover and a backup of your data. For critical workloads, deploy Sentinel with 3 nodes for automatic failover. The cost of 2 extra nodes is trivial compared to the cost of a full outage.

📖

Reading from replicas without understanding stale reads

Teams route all reads to replicas for performance, then get bug reports: 'I just updated my profile but the page shows old data.' Because replication is async, replicas can be milliseconds to seconds behind. For read-after-write scenarios, the replica hasn't received the write yet.

Classify your reads: eventually-consistent reads (dashboards, analytics, product listings) can safely go to replicas. Read-after-write scenarios (user updates profile, then views it) must read from the primary. Use a routing strategy: write to primary, read from primary for the same session/request, read from replica for everything else. Monitor replication lag and alert if it exceeds your tolerance.

🏷️

Ignoring hash tags in Redis Cluster and getting CROSSSLOT errors

Teams migrate from Standalone/Sentinel to Cluster and discover that MGET, pipelines, Lua scripts, and transactions break with CROSSSLOT errors. They didn't design their key naming scheme for Cluster's slot-based routing, and now related keys are scattered across nodes.

Design your key naming scheme before migrating to Cluster. Use hash tags to co-locate related keys: {user:1000}:profile, {user:1000}:cart, {user:1000}:session. All keys for the same entity share a hash tag and land on the same node. Audit all multi-key operations and Lua scripts before migration. If your application relies heavily on cross-key operations, Sentinel might be a better fit.

⚖️

Deploying Sentinel with only 2 nodes

Teams deploy 2 Sentinel nodes with a quorum of 2. If either Sentinel goes down, the remaining one can't reach quorum and failover is impossible. Dropping the quorum to 1 doesn't fix it: a single Sentinel's false positive is then enough to start a failover while the old primary is still alive and accepting writes on its side of a network partition, leaving two primaries and divergent data.

Always deploy an odd number of Sentinels — minimum 3. With 3 Sentinels and quorum of 2, you can lose 1 Sentinel and still perform failover. Place Sentinels on different physical machines or availability zones so a single failure doesn't take out multiple Sentinels. Never set quorum to 1 in production — it defeats the purpose of distributed consensus.