Redis Persistence & Durability
Redis is not purely in-memory — RDB snapshots, AOF logging, and hybrid persistence let you balance durability against performance.
The Persistence Mental Model
Most engineers think of Redis as a purely in-memory store — fast but volatile. Restart the server and everything is gone. That mental model is incomplete. Redis actually offers a full spectrum of durability, from zero persistence (pure cache) to near-zero data loss (AOF with fsync always).
The key insight is that persistence is a dial, not a switch. You choose where on the spectrum your application sits based on the trade-off between performance and durability. A session cache doesn't need the same guarantees as a job queue.
The Notebook Analogy
Imagine you're doing math in your head (pure in-memory). Fast, but if you get distracted, you lose everything. RDB snapshots are like taking a photo of your whiteboard every hour — if the building loses power, you lose at most one hour of work. AOF is like writing every calculation into a notebook as you go — you can replay the entire session. Hybrid is like taking periodic photos AND keeping the notebook — the photo gives you a fast starting point, and the notebook fills in the gap.
The Persistence Spectrum
- ✅ No persistence — pure cache, data lives only in memory, restart = empty Redis, maximum performance
- ✅ RDB only — periodic snapshots, you lose data between the last snapshot and the crash, good balance for most caches
- ✅ AOF only (everysec) — logs every write, fsyncs once per second, lose at most ~1 second of writes, good for queues and counters
- ✅ AOF only (always) — logs and fsyncs every write, near-zero data loss, significant performance cost
- ✅ RDB + AOF hybrid — best of both worlds, RDB for fast restarts, AOF for minimal data loss, the recommended production setup
🔑 Key Takeaway
Redis persistence is not about whether to persist — it's about how much data loss you can tolerate. A session cache can afford to lose everything on restart. A rate limiter or job queue cannot. Choose the persistence mode that matches your durability requirements.
RDB (Redis Database Snapshots)
RDB persistence produces point-in-time snapshots of the entire dataset. At configured intervals, Redis forks the process and the child writes the dataset to a compact binary file (dump.rdb). The parent continues serving requests with zero downtime.
How BGSAVE Works
Trigger
Redis decides it's time for a snapshot — either because a save rule matched (e.g., 300 seconds passed and at least 10 keys changed) or because you manually ran BGSAVE.
Fork
Redis calls fork() to create a child process. Thanks to copy-on-write (COW), the child gets a consistent snapshot of memory without actually copying it. The parent continues serving reads and writes.
Write
The child process iterates over the dataset and writes it to a temporary RDB file on disk. This is a compact binary format — much smaller than the in-memory representation.
Replace
Once the child finishes writing, it atomically replaces the old dump.rdb with the new one. The child exits. Redis logs the save time and dataset size.
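The four steps above can be sketched in plain Python on a POSIX system (a toy model, not Redis source — the `bgsave` helper, the JSON format, and the `dump.json` path are stand-ins; real Redis writes its own binary RDB format and reaps the child asynchronously):

```python
import json
import os
import tempfile

def bgsave(dataset: dict, path: str) -> None:
    """Toy BGSAVE: fork a child that writes a snapshot of `dataset`."""
    pid = os.fork()
    if pid == 0:
        # Child: copy-on-write gives it the dataset exactly as of fork time.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(dataset, f)
        os.replace(tmp, path)   # atomic swap, like replacing dump.rdb
        os._exit(0)
    # Parent: keeps serving writes; its mutations land on copied pages,
    # so they never leak into the child's snapshot.
    dataset["written-after-fork"] = True
    os.waitpid(pid, 0)

data = {"user:1": "Alice", "counter": 4}
bgsave(data, "dump.json")
with open("dump.json") as f:
    snapshot = json.load(f)
print("written-after-fork" in snapshot)  # False — snapshot predates the write
```

The key observation is the last line: the write the parent made after `fork()` is absent from the snapshot, which is exactly the point-in-time guarantee COW provides.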
```
# Save rules: save <seconds> <min-changes>

# Snapshot if at least 1 key changed in 900 seconds (15 min)
save 900 1

# Snapshot if at least 10 keys changed in 300 seconds (5 min)
save 300 10

# Snapshot if at least 10000 keys changed in 60 seconds (1 min)
save 60 10000

# The filename for the RDB snapshot
dbfilename dump.rdb

# Directory where the RDB file is stored
dir /var/lib/redis

# Compress the RDB file using LZF compression
rdbcompression yes

# Add a CRC64 checksum at the end of the file for integrity
rdbchecksum yes

# Stop accepting writes if the background save fails
# (prevents silent data loss)
stop-writes-on-bgsave-error yes
```
```
# Trigger a background save (non-blocking)
BGSAVE

# Trigger a foreground save (BLOCKS all clients — avoid in production)
SAVE

# Check when the last successful save happened
LASTSAVE   # Returns a Unix timestamp, e.g., 1700000000

# Check RDB save status
INFO persistence
# rdb_last_save_time:1700000000
# rdb_last_bgsave_status:ok
# rdb_last_bgsave_time_sec:2
```
✅ RDB Pros
- Compact single-file format — easy to backup and transfer
- Fast restart — loading an RDB file is much faster than replaying AOF
- Low runtime overhead — the child process does all the work
- Perfect for disaster recovery and point-in-time backups
- Great for replication — Redis sends RDB to replicas during full sync
❌ RDB Cons
- Data loss between snapshots — if Redis crashes, you lose all writes since the last save
- Fork can be slow on large datasets — a 20GB dataset may take 200ms+ to fork
- Copy-on-write memory spike — if many keys are modified during BGSAVE, memory usage can temporarily double
- Not suitable when you need near-zero data loss
🎯 Interview Insight
When asked about RDB, always mention the fork + copy-on-write mechanism. Explain that the parent process continues serving requests while the child writes the snapshot. This shows you understand why RDB has low runtime overhead but can cause memory spikes during the save.
AOF (Append Only File)
AOF persistence logs every write operation received by the server. Instead of snapshotting the dataset, Redis appends each command to a file. On restart, Redis replays the AOF to rebuild the dataset. This gives you much finer-grained durability than RDB — you can configure it to lose at most one second of writes, or even zero.
The Transaction Ledger
RDB is like taking a photo of your bank balance every hour. If the system crashes, you know your balance as of the last photo but lose recent transactions. AOF is like keeping a ledger of every transaction: 'deposited $100', 'withdrew $50', 'transferred $200'. Even if the system crashes, you replay the ledger from the last known balance to reconstruct the exact state. The ledger grows over time, so periodically you 'compact' it — replace the full history with a fresh starting balance plus only recent transactions.
How AOF Works
Command Received
Redis receives a write command (SET, HSET, LPUSH, etc.) and executes it in memory.
Append to Buffer
The command is appended to an in-memory AOF buffer in Redis protocol format (RESP). This is fast — just a memory write.
Write to Disk
Redis writes the buffer to the AOF file on disk. The timing depends on the fsync policy: always (every command), everysec (once per second), or no (let the OS decide).
Replay on Restart
When Redis restarts, it reads the AOF file and re-executes every command to rebuild the in-memory dataset. This is slower than loading an RDB file but provides better durability.
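The execute/append/replay cycle above can be sketched as a toy append-only log (illustration only — the command names are real Redis commands, but real Redis logs them in RESP wire format and supports far more than three operations):

```python
def apply(state: dict, cmd: list) -> None:
    """Execute one write command against the in-memory state."""
    op = cmd[0]
    if op == "SET":
        state[cmd[1]] = cmd[2]
    elif op == "INCR":
        state[cmd[1]] = int(state.get(cmd[1], 0)) + 1
    elif op == "DEL":
        state.pop(cmd[1], None)

def execute(state: dict, aof: list, cmd: list) -> None:
    apply(state, cmd)   # 1. execute in memory
    aof.append(cmd)     # 2. append to the AOF buffer (the fsync policy
                        #    decides when this actually reaches the disk)

def replay(aof: list) -> dict:
    """Restart path: re-run every logged command in order."""
    state = {}
    for cmd in aof:
        apply(state, cmd)
    return state

live, aof = {}, []
execute(live, aof, ["SET", "user:1", "Alice"])
execute(live, aof, ["INCR", "counter"])
execute(live, aof, ["INCR", "counter"])
print(replay(aof) == live)  # True — the log reconstructs the exact state
```

This also shows why AOF restarts are slower than RDB loads: every command must be re-executed, not just copied into memory.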
The Three fsync Policies
| Policy | Behavior | Durability | Performance |
|---|---|---|---|
| appendfsync always | fsync after every write command | Zero data loss — every command is on disk before acknowledgment | Slowest — each write waits for disk I/O. ~100-1000x slower than no fsync. |
| appendfsync everysec | fsync once per second in a background thread | Lose at most ~1 second of writes on crash | Good — nearly as fast as no persistence. The recommended default. |
| appendfsync no | Never explicitly fsync — let the OS flush when it wants (typically every 30s) | Could lose up to 30 seconds of writes | Fastest — Redis never waits for disk. OS handles flushing. |
```
# Enable AOF persistence
appendonly yes

# AOF filename
appendfilename "appendonly.aof"

# fsync policy — choose one:
# appendfsync always    # Zero data loss, slowest
appendfsync everysec    # ≤1 second data loss, recommended
# appendfsync no        # OS-controlled, fastest

# Directory for AOF files
dir /var/lib/redis

# Trigger AOF rewrite when the file is 100% larger than after last rewrite
auto-aof-rewrite-percentage 100

# Don't rewrite if the AOF file is smaller than 64MB
auto-aof-rewrite-min-size 64mb

# Don't fsync during AOF rewrite (reduces I/O contention)
no-appendfsync-on-rewrite no
```
AOF Rewriting (Compaction)
The AOF file grows indefinitely — every write is appended. If you SET the same key 1,000 times, the AOF contains 1,000 entries for that key, but only the last one matters. AOF rewriting compacts the file by generating the minimal set of commands needed to reconstruct the current dataset.
```
Before rewrite (appendonly.aof — 500MB):

  SET user:1 "Alice"
  SET user:1 "Alice Smith"
  SET user:1 "Alice Johnson"    ← only this one matters
  INCR counter
  INCR counter
  INCR counter
  INCR counter                  ← net result: counter = 4
  LPUSH queue "job1"
  LPUSH queue "job2"
  RPOP queue                    ← "job1" removed
  ... (millions of commands)

After rewrite (appendonly.aof — 50MB):

  SET user:1 "Alice Johnson"    ← single command, current value
  SET counter 4                 ← single command, current value
  LPUSH queue "job2"            ← only remaining item
  ... (only commands needed to rebuild current state)

Triggered automatically when:
  AOF size > auto-aof-rewrite-min-size (64MB)
  AND AOF size > last_rewrite_size * (1 + auto-aof-rewrite-percentage/100)

Or manually: BGREWRITEAOF
```
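The compaction idea can be sketched in a few lines (toy model with a made-up `rewrite` helper; real Redis generates the new file by scanning the live dataset in a forked child, not by replaying the old log):

```python
def rewrite(aof: list) -> list:
    """Toy AOF rewrite: compute the current state implied by the log,
    then emit the minimal command set that rebuilds it."""
    state = {}
    for op, key, *rest in aof:
        if op == "SET":
            state[key] = rest[0]
        elif op == "INCR":
            state[key] = int(state.get(key, 0)) + 1
        elif op == "DEL":
            state.pop(key, None)
    # One SET per surviving key is enough to reconstruct the state.
    return [["SET", k, str(v)] for k, v in state.items()]

old_aof = [
    ["SET", "user:1", "Alice"],
    ["SET", "user:1", "Alice Smith"],
    ["SET", "user:1", "Alice Johnson"],   # only this SET matters
    ["INCR", "counter"], ["INCR", "counter"],
    ["INCR", "counter"], ["INCR", "counter"],
]
new_aof = rewrite(old_aof)
print(new_aof)  # [['SET', 'user:1', 'Alice Johnson'], ['SET', 'counter', '4']]
```

Seven logged commands collapse into two — the same shrinkage the 500MB → 50MB example above illustrates.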
✅ AOF Pros
- Near-zero data loss with `everysec` — lose at most 1 second
- Human-readable format — you can inspect and even edit the AOF file
- Automatic rewriting keeps file size manageable
- More durable than RDB for write-heavy workloads
❌ AOF Cons
- Larger files than RDB — even after rewriting, AOF is bigger
- Slower restart — replaying commands is slower than loading a binary snapshot
- Write amplification — every command hits disk, increasing I/O
- Rewriting uses CPU and memory (fork + rebuild)
💡 everysec Is Almost Always the Right Choice
The appendfsync everysec policy gives you the best trade-off: you lose at most 1 second of data on a crash, and performance is nearly identical to having no persistence at all. Use always only when you truly cannot afford to lose a single write (rare). Use no only when you're treating Redis as a pure cache.
RDB + AOF Hybrid
Since Redis 4.0, you can combine RDB and AOF into a single hybrid persistence mode. When AOF rewriting is triggered, Redis writes an RDB snapshot as the base of the new AOF file, then appends only the write commands that arrived during the rewrite. On restart, Redis loads the RDB portion first (fast), then replays the small AOF tail (minimal commands). You get fast restarts AND near-zero data loss.
Photo + Sticky Notes
Imagine you take a photo of your whiteboard every morning (RDB base). Throughout the day, you jot changes on sticky notes (AOF tail). If you need to reconstruct the whiteboard, you start from the photo (fast) and apply the sticky notes (small). You don't need to replay an entire day's worth of notes from scratch — just the ones since the last photo.
How Hybrid Persistence Works
AOF Rewrite Triggered
Either automatically (file size threshold) or manually (BGREWRITEAOF). Redis forks a child process.
RDB Base Written
The child process writes the current dataset as an RDB-format preamble at the beginning of the new AOF file. This is the compact binary snapshot.
AOF Tail Appended
While the child was writing the RDB base, the parent buffered any new write commands. These are appended to the AOF file after the RDB preamble in standard AOF format.
Fast Restart
On restart, Redis detects the RDB preamble, loads it (fast binary load), then replays only the small AOF tail. Much faster than replaying a full AOF file.
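A toy model of the hybrid layout, with a one-line JSON blob standing in for the binary RDB preamble (the `hybrid.aof` filename and both helpers are made up for illustration):

```python
import json

def save_hybrid(path: str, snapshot: dict, tail: list) -> None:
    with open(path, "w") as f:
        f.write(json.dumps(snapshot) + "\n")  # "RDB" preamble: the full base
        for cmd in tail:                      # AOF tail: one command per line
            f.write(" ".join(cmd) + "\n")

def load_hybrid(path: str) -> dict:
    with open(path) as f:
        state = json.loads(f.readline())      # 1. fast bulk load of the base
        for line in f:                        # 2. replay only the short tail
            op, key, val = line.split()
            if op == "SET":
                state[key] = val
    return state

save_hybrid("hybrid.aof",
            {"user:1": "Alice", "counter": "4"},   # dataset at rewrite time
            [["SET", "user:2", "Bob"]])            # write during the rewrite
print(load_hybrid("hybrid.aof"))
```

The restart cost is one bulk load plus a handful of replayed commands — which is why hybrid restarts stay close to RDB-only speed.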
```
# Enable AOF
appendonly yes
appendfilename "appendonly.aof"

# Use everysec fsync — lose at most 1 second on crash
appendfsync everysec

# Enable hybrid RDB+AOF format
# When AOF rewrites, the new file starts with an RDB snapshot
# followed by AOF commands for changes during the rewrite
aof-use-rdb-preamble yes

# Also keep RDB snapshots as a backup safety net
save 900 1
save 300 10
save 60 10000
dbfilename dump.rdb

# Rewrite AOF when it doubles in size (and is at least 64MB)
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

# Directory for all persistence files
dir /var/lib/redis
```
```
appendonly.aof file layout:

┌─────────────────────────────────────────┐
│ RDB Preamble (binary snapshot)          │
│ - Full dataset as of rewrite time       │
│ - Compact binary format                 │
│ - Fast to load (~seconds for 10GB)      │
├─────────────────────────────────────────┤
│ AOF Tail (text commands)                │
│ - Only commands AFTER the RDB snapshot  │
│ - Typically very small                  │
│ - *3\r\n$3\r\nSET\r\n$5\r\n...          │
└─────────────────────────────────────────┘

Restart sequence:
1. Detect RDB preamble → load binary snapshot (fast)
2. Detect AOF tail → replay commands (small)
3. Ready to serve

Comparison — restart time for 10GB dataset:
  RDB only:  ~10 seconds (load binary)
  AOF only:  ~60 seconds (replay all commands)
  Hybrid:    ~12 seconds (load binary + replay small tail)
```
🏆 This Is the Recommended Production Setup
For most production Redis deployments, use hybrid persistence with aof-use-rdb-preamble yes and appendfsync everysec. You get fast restarts (RDB base), minimal data loss (AOF tail with 1-second fsync), and automatic compaction (AOF rewriting). Keep RDB snapshots enabled as an additional backup.
Expiration & Eviction
Persistence controls what happens when Redis restarts. Expiration and eviction control what happens while Redis is running — how keys are removed when they expire or when memory runs out. These are two different mechanisms that work together.
TTL Commands
```
# Set a key with a TTL (seconds)
SET session:abc123 "user_data" EX 3600     # expires in 1 hour

# Set TTL on an existing key
EXPIRE session:abc123 3600                 # 3600 seconds
PEXPIRE session:abc123 3600000             # 3600000 milliseconds

# Set expiration to a specific Unix timestamp
EXPIREAT session:abc123 1700000000         # seconds
PEXPIREAT session:abc123 1700000000000     # milliseconds

# Check remaining TTL
TTL session:abc123                         # returns seconds, e.g., 3542
PTTL session:abc123                        # returns milliseconds

# Remove TTL (make key persistent)
PERSIST session:abc123                     # key no longer expires

# TTL return values:
# -1 = key exists but has no TTL (persistent)
# -2 = key does not exist
#  N = seconds until expiration
```
How Redis Expires Keys
Redis uses two complementary strategies to remove expired keys. Neither strategy guarantees instant removal — an expired key may linger in memory briefly.
Lazy Expiration (Passive)
When a client tries to access a key, Redis checks if it's expired. If yes, Redis deletes it and returns nil. This means expired keys that are never accessed can sit in memory indefinitely — which is why active expiration exists.
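Lazy expiration amounts to a TTL check on the read path. A minimal sketch with toy dictionaries (not Redis internals; `lazy_get` is a made-up name):

```python
import time

def lazy_get(store: dict, expires: dict, key: str):
    """Toy lazy expiration: check the deadline only when the key is read."""
    deadline = expires.get(key)
    if deadline is not None and deadline <= time.time():
        store.pop(key, None)      # delete on access, then report a miss
        expires.pop(key, None)
        return None               # the caller sees nil, as in Redis
    return store.get(key)

store = {"session:abc": "user_data", "session:old": "stale"}
expires = {"session:abc": time.time() + 3600,   # live for another hour
           "session:old": time.time() - 1}      # already expired

print(lazy_get(store, expires, "session:abc"))  # user_data — still live
print(lazy_get(store, expires, "session:old"))  # None — deleted on access
```

Note that `session:old` occupied memory right up until someone read it — the gap that active expiration exists to close.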
Active Expiration (Background)
Redis runs a background task 10 times per second. Each cycle: (1) randomly sample 20 keys with TTLs, (2) delete any that are expired, (3) if more than 25% were expired, repeat immediately. This probabilistic approach ensures expired keys are cleaned up even if nobody accesses them, without scanning the entire keyspace.
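One cycle of that sampling loop can be sketched like this (a simplified model — the 20-key sample and 25% threshold mirror the description above, but the real cycle is also time-bounded per run and tracks keys with TTLs in a separate structure):

```python
import random
import time

def active_expire_cycle(store: dict, expires: dict, now: float,
                        sample_size: int = 20) -> None:
    """Toy active expiration: sample keys that carry a TTL, delete the
    expired ones, and loop again while >25% of the sample was expired."""
    while expires:
        sample = random.sample(list(expires), min(sample_size, len(expires)))
        expired = [k for k in sample if expires[k] <= now]
        for k in expired:
            store.pop(k, None)
            expires.pop(k, None)
        if len(expired) <= len(sample) // 4:  # ≤25% expired → stop this cycle
            break

random.seed(42)                               # fixed seed for reproducibility
store = {f"k{i}": i for i in range(100)}
# even-numbered keys are already expired; odd ones expire an hour from now
expires = {k: (0.0 if i % 2 == 0 else time.time() + 3600)
           for i, k in enumerate(store)}
active_expire_cycle(store, expires, now=time.time())
print(len(store))  # well below 100 — most expired keys were reclaimed
```

The "repeat while >25% expired" rule is what makes the approach adaptive: the dirtier the keyspace, the harder a single cycle works.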
⚠️ Expired ≠ Immediately Deleted
A key whose TTL has reached zero is logically expired but may still occupy memory until lazy or active expiration removes it. In practice, the active expiration cycle cleans up most expired keys within milliseconds. But under heavy load with millions of expiring keys, there can be a brief lag. This is normal and by design.
The 8 Eviction Policies
When Redis reaches its maxmemory limit, it must decide what to do with new writes. The eviction policy controls which keys get removed to make room. There are 8 policies, divided into two groups: allkeys (consider all keys) and volatile (only consider keys with a TTL set).
| Policy | Scope | Strategy | Best For |
|---|---|---|---|
| noeviction | N/A | Return errors on writes when memory is full. Reads still work. | When data loss is unacceptable — fail loudly rather than silently dropping data. |
| allkeys-lru | All keys | Evict the least recently used key across the entire keyspace. | General-purpose caching. The most common choice. |
| volatile-lru | Keys with TTL | Evict the least recently used key among keys that have a TTL set. | Mixed workloads — cache entries have TTLs, persistent data does not. |
| allkeys-lfu | All keys | Evict the least frequently used key across the entire keyspace. | When access frequency matters more than recency. Keeps popular keys longer. |
| volatile-lfu | Keys with TTL | Evict the least frequently used key among keys with a TTL. | Mixed workloads where frequency-based eviction is preferred. |
| allkeys-random | All keys | Evict a random key from the entire keyspace. | When all keys have roughly equal importance. |
| volatile-random | Keys with TTL | Evict a random key among keys with a TTL. | Simple eviction when you don't need smart selection. |
| volatile-ttl | Keys with TTL | Evict the key with the shortest remaining TTL. | When keys closest to expiration are least valuable. |
LRU vs LFU
LRU — Least Recently Used
- Evicts keys that haven't been accessed recently
- Good when recent access predicts future access
- Problem: a key accessed once recently beats a key accessed 1,000 times yesterday
- Redis approximates LRU by sampling N random keys and evicting the oldest
LFU — Least Frequently Used
- Evicts keys that are accessed least often
- Good when popular keys should stay cached regardless of recency
- Uses a logarithmic frequency counter that decays over time
- Better for workloads with stable hot keys (e.g., product catalog)
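Redis bumps the LFU counter probabilistically, which is what makes its growth logarithmic. A sketch of that increment rule (the formula approximates Redis's documented `lfu-log-factor` behavior; the `lfu_incr` name is made up):

```python
import random

LFU_INIT_VAL = 5    # new keys start here so one access doesn't look "hot"

def lfu_incr(counter: int, lfu_log_factor: int = 10) -> int:
    """Probabilistic bump: the higher the counter already is, the less
    likely an access is to increment it."""
    if counter >= 255:                      # counter is 8 bits, saturates
        return counter
    base = max(counter - LFU_INIT_VAL, 0)
    p = 1.0 / (base * lfu_log_factor + 1)   # p shrinks as the counter grows
    return counter + 1 if random.random() < p else counter

random.seed(1)                              # fixed seed for reproducibility
c = LFU_INIT_VAL
for hits in (100, 10_000, 100_000):
    for _ in range(hits):
        c = lfu_incr(c)
    print(hits, c)                          # counter climbs slowly, not linearly
```

An 8-bit counter can represent millions of accesses precisely because each additional increment costs exponentially more hits; the decay timer then lets stale hot keys cool off.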
```
# Set maximum memory limit
maxmemory 4gb

# Choose eviction policy
maxmemory-policy allkeys-lru

# Number of keys to sample for LRU/LFU approximation
# Higher = more accurate but slightly slower
# Default is 5, 10 is a good balance
maxmemory-samples 10

# LFU tuning (only relevant for *-lfu policies)
# lfu-log-factor: higher = slower frequency counter growth
# lfu-decay-time: minutes before frequency counter is halved
lfu-log-factor 10
lfu-decay-time 1
```
True LRU requires tracking access order for ALL keys — O(N) memory. Redis uses approximated LRU instead:

1. When eviction is needed, sample maxmemory-samples (e.g., 10) random keys
2. Among those 10 keys, evict the one with the oldest last-access time
3. Repeat until enough memory is freed

With maxmemory-samples=10, the approximation is very close to true LRU. With maxmemory-samples=5 (default), it's slightly less accurate but faster.

Why not true LRU?

- True LRU needs a doubly-linked list + hash map for every key
- For 10 million keys, that's ~160MB of overhead just for the LRU structure
- Redis's sampling approach uses zero extra memory
- At samples=10, the eviction quality is nearly identical to true LRU
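The sampling trick fits in a few lines (a toy model — real Redis also keeps a small pool of eviction candidates across calls and stores the access clock in 24 bits of existing key metadata):

```python
import random

def evict_approx_lru(last_access: dict, samples: int = 10) -> str:
    """Approximated LRU: sample `samples` random keys and evict the one
    with the oldest last-access time — no global ordering is maintained."""
    pool = random.sample(list(last_access), min(samples, len(last_access)))
    victim = min(pool, key=last_access.get)
    del last_access[victim]
    return victim

random.seed(7)
# key -> last-access timestamp; "cold" is by far the stalest key
last_access = {f"k{i}": 1_000 + i for i in range(50)}
last_access["cold"] = 1
victim = evict_approx_lru(last_access, samples=10)
print(victim)  # an old key from the sample; raising `samples` makes
               # catching the truly coldest key more likely
```

The victim is always the oldest key *within the sample*, so a larger `samples` value converges on true LRU at the cost of a bigger per-eviction scan.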
🎯 Choosing the Right Policy
For most caching use cases, allkeys-lru is the right default. Use allkeys-lfu if you have a stable set of hot keys that should survive occasional cold-key access spikes. Use noeviction for data stores where losing any key is unacceptable (queues, rate limiters). Use volatile-* variants when you mix cache entries (with TTL) and persistent data (without TTL) in the same Redis instance.
No Persistence
Sometimes the right persistence strategy is none at all. When Redis is used as a pure cache — where every key can be regenerated from the source of truth — persistence adds complexity and I/O overhead with no real benefit. If Redis restarts, the cache warms up naturally as requests come in.
When No Persistence Makes Sense
- ✅ Pure cache layer — every cached value can be re-fetched from the database, a cold cache is slow but not incorrect
- ✅ Ephemeral data — rate limiter counters, real-time analytics buffers, temporary computation results only valuable for seconds or minutes
- ✅ Derived data — materialized views, pre-computed aggregations, search indexes, anything that can be rebuilt from primary storage
- ✅ Development and testing — local Redis instances where persistence just slows down restarts and clutters the filesystem
```
# Disable RDB snapshots entirely
save ""

# Disable AOF
appendonly no

# Set eviction policy for cache behavior
maxmemory 4gb
maxmemory-policy allkeys-lru

# Optional: disable RDB file loading on startup
# (prevents accidentally loading a stale dump.rdb)
dbfilename ""

# Result: Redis is a pure in-memory cache
# - Maximum performance (no disk I/O)
# - Restart = empty Redis (cold cache)
# - All data must be re-fetchable from the source of truth
```
⚠️ Know What You're Giving Up
With no persistence, a Redis restart means a completely cold cache. If your system depends on Redis for rate limiting, session storage, or job queues, disabling persistence means those features break on restart. Only disable persistence when every key in Redis is truly expendable.
Decision Framework
Choosing the right persistence mode depends on your durability requirements, performance budget, and operational complexity tolerance. Here's how the four options compare across the dimensions that matter.
| Dimension | RDB Only | AOF Only | RDB + AOF Hybrid | No Persistence |
|---|---|---|---|---|
| Durability | Minutes of data loss (between snapshots) | ≤1 second (everysec) or zero (always) | ≤1 second (AOF tail with everysec) | Total data loss on restart |
| Restart Speed | Fast — binary load (~10s for 10GB) | Slow — replay all commands (~60s for 10GB) | Fast — RDB base + small AOF tail (~12s for 10GB) | Instant — nothing to load |
| Disk Usage | Low — compact binary snapshots | High — command log grows between rewrites | Medium — RDB base + small tail | None |
| Write Performance | Minimal impact — BGSAVE runs in background | Slight impact — fsync everysec adds ~1-2% overhead | Same as AOF — fsync everysec | Maximum — no disk I/O at all |
| Memory Overhead | COW spike during fork (up to 2x briefly) | COW spike during rewrite | COW spike during rewrite | None |
| Complexity | Low — simple to configure and monitor | Medium — need to monitor file size and rewrites | Medium — same as AOF | Lowest — nothing to configure |
| Best For | Backups, disaster recovery, non-critical caches | Queues, counters, anything needing strong durability | Production deployments needing both speed and durability | Pure caches, ephemeral data, dev environments |
Quick Decision Guide
🟢 Use RDB + AOF Hybrid When
- You need durability AND fast restarts
- Running Redis as a primary data store (not just cache)
- Production environment with SLA requirements
- You want the safest default — this is the recommended setup
🔵 Use RDB Only When
- Data can tolerate minutes of loss on crash
- You primarily need backups and disaster recovery
- Disk I/O budget is very tight
- Dataset is large and you want minimal write amplification
🟡 Use AOF Only When
- Maximum durability is the top priority
- You need `fsync always` for zero data loss
- Restart speed is not critical
- You want a human-readable persistence log
⚪ Use No Persistence When
- Redis is a pure cache — all data is re-fetchable
- Maximum performance is required
- Data is ephemeral (rate limits, temp counters)
- Development or testing environments
🎯 Interview Framework
When asked "how would you configure Redis persistence?" — don't jump to a single answer. Ask: "What's the durability requirement? Can we tolerate data loss on crash? Is this a cache or a primary store?" Then map the answer to the right mode. This shows you think in trade-offs, not defaults.
Interview Questions
These questions test whether you understand Redis persistence trade-offs and can make informed decisions for production systems.
Q: What's the difference between RDB and AOF persistence in Redis?
A: RDB takes point-in-time snapshots of the entire dataset at configured intervals using fork + BGSAVE. It produces compact binary files that are fast to load on restart, but you lose all writes between the last snapshot and a crash. AOF logs every write command to a file. With 'appendfsync everysec', you lose at most 1 second of data. AOF files are larger and slower to replay on restart, but provide much better durability. The recommended production setup is hybrid: AOF with an RDB preamble (aof-use-rdb-preamble yes). This gives you fast restarts (RDB base) and minimal data loss (AOF tail).
Q: How does Redis handle the fork() for BGSAVE without blocking clients?
A: When Redis triggers BGSAVE, it calls fork() to create a child process. The child gets a copy of the parent's memory space via the OS's copy-on-write (COW) mechanism. The parent continues serving clients normally. The child iterates over the dataset and writes it to the RDB file. COW means memory pages are only duplicated when the parent modifies them — so if the dataset is mostly read-heavy, the fork is cheap. However, on write-heavy workloads with large datasets, COW can cause significant memory spikes (up to 2x) because modified pages must be copied. This is why you should monitor memory during BGSAVE and ensure you have enough headroom.
Q: What eviction policy would you choose for a Redis cache, and why?
A: For a general-purpose cache, allkeys-lru is the best default. It evicts the least recently used key across the entire keyspace, which works well when recent access predicts future access. For workloads with stable hot keys (e.g., a product catalog where the top 1,000 products get 80% of traffic), allkeys-lfu is better — it keeps frequently accessed keys even if they weren't accessed in the last few seconds. Use volatile-lru when you mix cache entries (with TTL) and persistent data (without TTL) in the same instance — it only evicts keys that have a TTL set, protecting your persistent data. Use noeviction for data stores like job queues where losing any key is unacceptable — Redis returns errors on writes instead of silently dropping data.
Q: Your Redis instance restarts and takes 5 minutes to recover. How do you fix this?
A: A 5-minute restart means Redis is replaying a large AOF file. Three solutions: (1) Enable hybrid persistence (aof-use-rdb-preamble yes) — the AOF file starts with a compact RDB snapshot, so Redis loads the binary base quickly and only replays the small AOF tail. This typically cuts restart time by 80-90%. (2) Tune AOF rewrite thresholds — lower auto-aof-rewrite-percentage so rewrites happen more frequently, keeping the AOF file smaller. (3) If using AOF only, switch to hybrid mode. If restart speed is critical and durability is not, consider RDB only — binary loads are the fastest. Also check if the dataset size is appropriate for the instance — a 50GB dataset on a single node will always be slow to load.
Q: How does Redis approximate LRU, and why doesn't it use true LRU?
A: True LRU requires maintaining a doubly-linked list ordered by access time plus a hash map for O(1) lookups — this adds ~16 bytes per key of overhead. For 100 million keys, that's 1.6GB just for the LRU data structure. Redis instead uses approximated LRU: when eviction is needed, it samples maxmemory-samples random keys (default 5, recommended 10) and evicts the one with the oldest last-access timestamp. Each key stores its last access time in 24 bits (3 bytes) that are already part of the key metadata — zero additional memory. With 10 samples, the approximation is statistically very close to true LRU. The trade-off is worth it: near-identical eviction quality with zero memory overhead.
Common Mistakes
These mistakes cause real production incidents — from silent data loss to unexpected downtime.
Using Redis as a primary store with no persistence
Teams store critical data in Redis (job queues, rate limiter state, user sessions) but leave persistence disabled because 'Redis is a cache.' When Redis restarts — planned maintenance, OOM kill, or crash — all data is gone. Jobs are lost, rate limits reset, users are logged out. The system doesn't crash, but it silently loses data.
✅ If any data in Redis cannot be regenerated from another source, enable persistence. Use hybrid mode (aof-use-rdb-preamble yes, appendfsync everysec) as the default for any Redis instance that holds non-cache data. Audit every key namespace: can this be re-fetched from the database? If not, it needs persistence.
Not reserving memory for fork overhead
Redis is configured with maxmemory 8gb on an 8GB instance. During BGSAVE or AOF rewrite, Redis forks. Copy-on-write means modified pages are duplicated. On a write-heavy workload, memory usage spikes to 12-14GB. The OS kills Redis (OOM) or starts swapping, causing massive latency spikes. The background save fails, and if stop-writes-on-bgsave-error is enabled, Redis stops accepting writes entirely.
✅ Reserve at least 30-50% extra memory beyond maxmemory for fork overhead. On an 8GB instance, set maxmemory to 4-5GB. Monitor memory usage during BGSAVE with INFO memory (used_memory_rss). On Linux, set vm.overcommit_memory=1 to prevent fork failures, but still ensure physical memory is sufficient.
Using appendfsync always without understanding the cost
A team configures appendfsync always for 'maximum durability' on a high-throughput Redis instance. Every write now waits for a disk fsync before returning. Throughput drops from 100,000 ops/sec to 1,000 ops/sec. Latency jumps from sub-millisecond to 5-10ms. The application slows to a crawl, and the team doesn't connect it to the Redis config change.
✅ Use appendfsync everysec for nearly all production workloads. It provides ≤1 second of data loss on crash with negligible performance impact. Only use always when you genuinely cannot afford to lose a single write AND your throughput is low enough to absorb the cost (e.g., a financial ledger with 100 writes/sec, not a cache with 100,000 writes/sec).
Wrong eviction policy for the workload
A team uses the default noeviction policy for a cache. When Redis hits maxmemory, it starts returning OOM errors on every write. The application crashes because it doesn't handle Redis write failures. Or: a team uses allkeys-lru for a mixed workload where some keys are persistent config data without TTLs. LRU evicts the config keys because they haven't been accessed recently, breaking the application.
✅ Match the eviction policy to your workload. For pure caches: allkeys-lru or allkeys-lfu. For mixed cache + persistent data: volatile-lru (only evicts keys with TTLs). For data stores where no key should be evicted: noeviction — but then your application MUST handle write errors gracefully. Always test what happens when Redis hits maxmemory before it happens in production.