Redis Persistence & Durability
Redis is not purely in-memory — RDB snapshots, AOF logging, and hybrid persistence let you balance durability against performance.
The Persistence Mental Model
Most engineers think of Redis as a purely in-memory store — fast but volatile. Restart the server and everything is gone. That mental model is incomplete. Redis actually offers a full spectrum of durability, from zero persistence (pure cache) to near-zero data loss (AOF with fsync always).
The key insight is that persistence is a dial, not a switch. You choose where on the spectrum your application sits based on the trade-off between performance and durability. A session cache doesn't need the same guarantees as a job queue.
The Notebook Analogy
Imagine you're doing math in your head (pure in-memory). Fast, but if you get distracted, you lose everything. RDB snapshots are like taking a photo of your whiteboard every hour — if the building loses power, you lose at most one hour of work. AOF is like writing every calculation into a notebook as you go — you can replay the entire session. Hybrid is like taking periodic photos AND keeping the notebook — the photo gives you a fast starting point, and the notebook fills in the gap.
The Persistence Spectrum
- ✅ No persistence — pure cache, data lives only in memory, restart = empty Redis, maximum performance
- ✅ RDB only — periodic snapshots, you lose data between the last snapshot and the crash, good balance for most caches
- ✅ AOF only (everysec) — logs every write, fsyncs once per second, lose at most ~1 second of writes, good for queues and counters
- ✅ AOF only (always) — logs and fsyncs every write, near-zero data loss, significant performance cost
- ✅ RDB + AOF hybrid — best of both worlds, RDB for fast restarts, AOF for minimal data loss, the recommended production setup
🔑 Key Takeaway
Redis persistence is not about whether to persist — it's about how much data loss you can tolerate. A session cache can afford to lose everything on restart. A rate limiter or job queue cannot. Choose the persistence mode that matches your durability requirements.
RDB (Redis Database Snapshots)
RDB persistence produces point-in-time snapshots of the entire dataset. At configured intervals, Redis forks the process and the child writes the dataset to a compact binary file (dump.rdb). The parent continues serving requests with zero downtime.
How BGSAVE Works
Trigger
Redis decides it's time for a snapshot — either because a save rule matched (e.g., 300 seconds passed and at least 10 keys changed) or because you manually ran BGSAVE.
Fork
Redis calls fork() to create a child process. Thanks to copy-on-write (COW), the child gets a consistent snapshot of memory without actually copying it. The parent continues serving reads and writes.
Write
The child process iterates over the dataset and writes it to a temporary RDB file on disk. This is a compact binary format — much smaller than the in-memory representation.
Replace
Once the child finishes writing, it atomically replaces the old dump.rdb with the new one. The child exits. Redis logs the save time and dataset size.
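The four steps above can be sketched in plain Python on a POSIX system (a toy model, not Redis source — the `bgsave` helper, the JSON format, and the `dump.json` path are stand-ins; real Redis writes its own binary RDB format and reaps the child asynchronously):

```python
import json
import os
import tempfile

def bgsave(dataset: dict, path: str) -> None:
    """Toy BGSAVE: fork a child that writes a snapshot of `dataset`."""
    pid = os.fork()
    if pid == 0:
        # Child: copy-on-write gives it the dataset exactly as of fork time.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(dataset, f)
        os.replace(tmp, path)   # atomic swap, like replacing dump.rdb
        os._exit(0)
    # Parent: keeps serving writes; its mutations land on copied pages,
    # so they never leak into the child's snapshot.
    dataset["written-after-fork"] = True
    os.waitpid(pid, 0)

data = {"user:1": "Alice", "counter": 4}
bgsave(data, "dump.json")
with open("dump.json") as f:
    snapshot = json.load(f)
print("written-after-fork" in snapshot)  # False — snapshot predates the write
```

The key observation is the last line: the write the parent made after `fork()` is absent from the snapshot, which is exactly the point-in-time guarantee COW provides.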
```
# Save rules: save <seconds> <min-changes>

# Snapshot if at least 1 key changed in 900 seconds (15 min)
save 900 1

# Snapshot if at least 10 keys changed in 300 seconds (5 min)
save 300 10

# Snapshot if at least 10000 keys changed in 60 seconds (1 min)
save 60 10000

# The filename for the RDB snapshot
dbfilename dump.rdb

# Directory where the RDB file is stored
dir /var/lib/redis

# Compress the RDB file using LZF compression
rdbcompression yes

# Add a CRC64 checksum at the end of the file for integrity
rdbchecksum yes

# Stop accepting writes if the background save fails
# (prevents silent data loss)
stop-writes-on-bgsave-error yes
```
```
# Trigger a background save (non-blocking)
BGSAVE

# Trigger a foreground save (BLOCKS all clients — avoid in production)
SAVE

# Check when the last successful save happened
LASTSAVE   # Returns a Unix timestamp, e.g., 1700000000

# Check RDB save status
INFO persistence
# rdb_last_save_time:1700000000
# rdb_last_bgsave_status:ok
# rdb_last_bgsave_time_sec:2
```
✅ RDB Pros
- Compact single-file format — easy to backup and transfer
- Fast restart — loading an RDB file is much faster than replaying AOF
- Low runtime overhead — the child process does all the work
- Perfect for disaster recovery and point-in-time backups
- Great for replication — Redis sends RDB to replicas during full sync
❌ RDB Cons
- Data loss between snapshots — if Redis crashes, you lose all writes since the last save
- Fork can be slow on large datasets — a 20GB dataset may take 200ms+ to fork
- Copy-on-write memory spike — if many keys are modified during BGSAVE, memory usage can temporarily double
- Not suitable when you need near-zero data loss
🎯 Interview Insight
When asked about RDB, always mention the fork + copy-on-write mechanism. Explain that the parent process continues serving requests while the child writes the snapshot. This shows you understand why RDB has low runtime overhead but can cause memory spikes during the save.
AOF (Append Only File)
AOF persistence logs every write operation received by the server. Instead of snapshotting the dataset, Redis appends each command to a file. On restart, Redis replays the AOF to rebuild the dataset. This gives you much finer-grained durability than RDB — you can configure it to lose at most one second of writes, or even zero.
The Transaction Ledger
RDB is like taking a photo of your bank balance every hour. If the system crashes, you know your balance as of the last photo but lose recent transactions. AOF is like keeping a ledger of every transaction: 'deposited $100', 'withdrew $50', 'transferred $200'. Even if the system crashes, you replay the ledger from the last known balance to reconstruct the exact state. The ledger grows over time, so periodically you 'compact' it — replace the full history with a fresh starting balance plus only recent transactions.
How AOF Works
Command Received
Redis receives a write command (SET, HSET, LPUSH, etc.) and executes it in memory.
Append to Buffer
The command is appended to an in-memory AOF buffer in Redis protocol format (RESP). This is fast — just a memory write.
Write to Disk
Redis writes the buffer to the AOF file on disk. The timing depends on the fsync policy: always (every command), everysec (once per second), or no (let the OS decide).
Replay on Restart
When Redis restarts, it reads the AOF file and re-executes every command to rebuild the in-memory dataset. This is slower than loading an RDB file but provides better durability.
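The execute/append/replay cycle above can be sketched as a toy append-only log (illustration only — the command names are real Redis commands, but real Redis logs them in RESP wire format and supports far more than three operations):

```python
def apply(state: dict, cmd: list) -> None:
    """Execute one write command against the in-memory state."""
    op = cmd[0]
    if op == "SET":
        state[cmd[1]] = cmd[2]
    elif op == "INCR":
        state[cmd[1]] = int(state.get(cmd[1], 0)) + 1
    elif op == "DEL":
        state.pop(cmd[1], None)

def execute(state: dict, aof: list, cmd: list) -> None:
    apply(state, cmd)   # 1. execute in memory
    aof.append(cmd)     # 2. append to the AOF buffer (the fsync policy
                        #    decides when this actually reaches the disk)

def replay(aof: list) -> dict:
    """Restart path: re-run every logged command in order."""
    state = {}
    for cmd in aof:
        apply(state, cmd)
    return state

live, aof = {}, []
execute(live, aof, ["SET", "user:1", "Alice"])
execute(live, aof, ["INCR", "counter"])
execute(live, aof, ["INCR", "counter"])
print(replay(aof) == live)  # True — the log reconstructs the exact state
```

This also shows why AOF restarts are slower than RDB loads: every command must be re-executed, not just copied into memory.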
The Three fsync Policies
| Policy | Behavior | Durability | Performance |
|---|---|---|---|
| appendfsync always | fsync after every write command | Zero data loss — every command is on disk before acknowledgment | Slowest — each write waits for disk I/O. ~100-1000x slower than no fsync. |
| appendfsync everysec | fsync once per second in a background thread | Lose at most ~1 second of writes on crash | Good — nearly as fast as no persistence. The recommended default. |
| appendfsync no | Never explicitly fsync — let the OS flush when it wants (typically every 30s) | Could lose up to 30 seconds of writes | Fastest — Redis never waits for disk. OS handles flushing. |
```
# Enable AOF persistence
appendonly yes

# AOF filename
appendfilename "appendonly.aof"

# fsync policy — choose one:
# appendfsync always    # Zero data loss, slowest
appendfsync everysec    # ≤1 second data loss, recommended
# appendfsync no        # OS-controlled, fastest

# Directory for AOF files
dir /var/lib/redis

# Trigger AOF rewrite when the file is 100% larger than after last rewrite
auto-aof-rewrite-percentage 100

# Don't rewrite if the AOF file is smaller than 64MB
auto-aof-rewrite-min-size 64mb

# Don't fsync during AOF rewrite (reduces I/O contention)
no-appendfsync-on-rewrite no
```
AOF Rewriting (Compaction)
The AOF file grows indefinitely — every write is appended. If you SET the same key 1,000 times, the AOF contains 1,000 entries for that key, but only the last one matters. AOF rewriting compacts the file by generating the minimal set of commands needed to reconstruct the current dataset.
```
Before rewrite (appendonly.aof — 500MB):

  SET user:1 "Alice"
  SET user:1 "Alice Smith"
  SET user:1 "Alice Johnson"    ← only this one matters
  INCR counter
  INCR counter
  INCR counter
  INCR counter                  ← net result: counter = 4
  LPUSH queue "job1"
  LPUSH queue "job2"
  RPOP queue                    ← "job1" removed
  ... (millions of commands)

After rewrite (appendonly.aof — 50MB):

  SET user:1 "Alice Johnson"    ← single command, current value
  SET counter 4                 ← single command, current value
  LPUSH queue "job2"            ← only remaining item
  ... (only commands needed to rebuild current state)

Triggered automatically when:
  AOF size > auto-aof-rewrite-min-size (64MB)
  AND AOF size > last_rewrite_size * (1 + auto-aof-rewrite-percentage/100)

Or manually: BGREWRITEAOF
```
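The compaction idea can be sketched in a few lines (toy model with a made-up `rewrite` helper; real Redis generates the new file by scanning the live dataset in a forked child, not by replaying the old log):

```python
def rewrite(aof: list) -> list:
    """Toy AOF rewrite: compute the current state implied by the log,
    then emit the minimal command set that rebuilds it."""
    state = {}
    for op, key, *rest in aof:
        if op == "SET":
            state[key] = rest[0]
        elif op == "INCR":
            state[key] = int(state.get(key, 0)) + 1
        elif op == "DEL":
            state.pop(key, None)
    # One SET per surviving key is enough to reconstruct the state.
    return [["SET", k, str(v)] for k, v in state.items()]

old_aof = [
    ["SET", "user:1", "Alice"],
    ["SET", "user:1", "Alice Smith"],
    ["SET", "user:1", "Alice Johnson"],   # only this SET matters
    ["INCR", "counter"], ["INCR", "counter"],
    ["INCR", "counter"], ["INCR", "counter"],
]
new_aof = rewrite(old_aof)
print(new_aof)  # [['SET', 'user:1', 'Alice Johnson'], ['SET', 'counter', '4']]
```

Seven logged commands collapse into two — the same shrinkage the 500MB → 50MB example above illustrates.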
✅ AOF Pros
- Near-zero data loss with `everysec` — lose at most 1 second
- Human-readable format — you can inspect and even edit the AOF file
- Automatic rewriting keeps file size manageable
- More durable than RDB for write-heavy workloads
❌ AOF Cons
- Larger files than RDB — even after rewriting, AOF is bigger
- Slower restart — replaying commands is slower than loading a binary snapshot
- Write amplification — every command hits disk, increasing I/O
- Rewriting uses CPU and memory (fork + rebuild)
💡 everysec Is Almost Always the Right Choice
The appendfsync everysec policy gives you the best trade-off: you lose at most 1 second of data on a crash, and performance is nearly identical to having no persistence at all. Use always only when you truly cannot afford to lose a single write (rare). Use no only when you're treating Redis as a pure cache.
RDB + AOF Hybrid
Since Redis 4.0, you can combine RDB and AOF into a single hybrid persistence mode. When AOF rewriting is triggered, Redis writes an RDB snapshot as the base of the new AOF file, then appends only the write commands that arrived during the rewrite. On restart, Redis loads the RDB portion first (fast), then replays the small AOF tail (minimal commands). You get fast restarts AND near-zero data loss.
Photo + Sticky Notes
Imagine you take a photo of your whiteboard every morning (RDB base). Throughout the day, you jot changes on sticky notes (AOF tail). If you need to reconstruct the whiteboard, you start from the photo (fast) and apply the sticky notes (small). You don't need to replay an entire day's worth of notes from scratch — just the ones since the last photo.
How Hybrid Persistence Works
AOF Rewrite Triggered
Either automatically (file size threshold) or manually (BGREWRITEAOF). Redis forks a child process.
RDB Base Written
The child process writes the current dataset as an RDB-format preamble at the beginning of the new AOF file. This is the compact binary snapshot.
AOF Tail Appended
While the child was writing the RDB base, the parent buffered any new write commands. These are appended to the AOF file after the RDB preamble in standard AOF format.
Fast Restart
On restart, Redis detects the RDB preamble, loads it (fast binary load), then replays only the small AOF tail. Much faster than replaying a full AOF file.
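A toy model of the hybrid layout, with a one-line JSON blob standing in for the binary RDB preamble (the `hybrid.aof` filename and both helpers are made up for illustration):

```python
import json

def save_hybrid(path: str, snapshot: dict, tail: list) -> None:
    with open(path, "w") as f:
        f.write(json.dumps(snapshot) + "\n")  # "RDB" preamble: the full base
        for cmd in tail:                      # AOF tail: one command per line
            f.write(" ".join(cmd) + "\n")

def load_hybrid(path: str) -> dict:
    with open(path) as f:
        state = json.loads(f.readline())      # 1. fast bulk load of the base
        for line in f:                        # 2. replay only the short tail
            op, key, val = line.split()
            if op == "SET":
                state[key] = val
    return state

save_hybrid("hybrid.aof",
            {"user:1": "Alice", "counter": "4"},   # dataset at rewrite time
            [["SET", "user:2", "Bob"]])            # write during the rewrite
print(load_hybrid("hybrid.aof"))
```

The restart cost is one bulk load plus a handful of replayed commands — which is why hybrid restarts stay close to RDB-only speed.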
```
# Enable AOF
appendonly yes
appendfilename "appendonly.aof"

# Use everysec fsync — lose at most 1 second on crash
appendfsync everysec

# Enable hybrid RDB+AOF format
# When AOF rewrites, the new file starts with an RDB snapshot
# followed by AOF commands for changes during the rewrite
aof-use-rdb-preamble yes

# Also keep RDB snapshots as a backup safety net
save 900 1
save 300 10
save 60 10000
dbfilename dump.rdb

# Rewrite AOF when it doubles in size (and is at least 64MB)
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

# Directory for all persistence files
dir /var/lib/redis
```
```
appendonly.aof file layout:

┌─────────────────────────────────────────┐
│ RDB Preamble (binary snapshot)          │
│ - Full dataset as of rewrite time       │
│ - Compact binary format                 │
│ - Fast to load (~seconds for 10GB)      │
├─────────────────────────────────────────┤
│ AOF Tail (text commands)                │
│ - Only commands AFTER the RDB snapshot  │
│ - Typically very small                  │
│ - *3\r\n$3\r\nSET\r\n$5\r\n...          │
└─────────────────────────────────────────┘

Restart sequence:
1. Detect RDB preamble → load binary snapshot (fast)
2. Detect AOF tail → replay commands (small)
3. Ready to serve

Comparison — restart time for 10GB dataset:
  RDB only:  ~10 seconds (load binary)
  AOF only:  ~60 seconds (replay all commands)
  Hybrid:    ~12 seconds (load binary + replay small tail)
```
🏆 This Is the Recommended Production Setup
For most production Redis deployments, use hybrid persistence with aof-use-rdb-preamble yes and appendfsync everysec. You get fast restarts (RDB base), minimal data loss (AOF tail with 1-second fsync), and automatic compaction (AOF rewriting). Keep RDB snapshots enabled as an additional backup.
Expiration & Eviction
Persistence controls what happens when Redis restarts. Expiration and eviction control what happens while Redis is running — how keys are removed when they expire or when memory runs out. These are two different mechanisms that work together.
TTL Commands
```
# Set a key with a TTL (seconds)
SET session:abc123 "user_data" EX 3600     # expires in 1 hour

# Set TTL on an existing key
EXPIRE session:abc123 3600                 # 3600 seconds
PEXPIRE session:abc123 3600000             # 3600000 milliseconds

# Set expiration to a specific Unix timestamp
EXPIREAT session:abc123 1700000000         # seconds
PEXPIREAT session:abc123 1700000000000     # milliseconds

# Check remaining TTL
TTL session:abc123                         # returns seconds, e.g., 3542
PTTL session:abc123                        # returns milliseconds

# Remove TTL (make key persistent)
PERSIST session:abc123                     # key no longer expires

# TTL return values:
# -1 = key exists but has no TTL (persistent)
# -2 = key does not exist
#  N = seconds until expiration
```
How Redis Expires Keys
Redis uses two complementary strategies to remove expired keys. Neither strategy guarantees instant removal — an expired key may linger in memory briefly.
Lazy Expiration (Passive)
When a client tries to access a key, Redis checks if it's expired. If yes, Redis deletes it and returns nil. This means expired keys that are never accessed can sit in memory indefinitely — which is why active expiration exists.
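Lazy expiration amounts to a TTL check on the read path. A minimal sketch with toy dictionaries (not Redis internals; `lazy_get` is a made-up name):

```python
import time

def lazy_get(store: dict, expires: dict, key: str):
    """Toy lazy expiration: check the deadline only when the key is read."""
    deadline = expires.get(key)
    if deadline is not None and deadline <= time.time():
        store.pop(key, None)      # delete on access, then report a miss
        expires.pop(key, None)
        return None               # the caller sees nil, as in Redis
    return store.get(key)

store = {"session:abc": "user_data", "session:old": "stale"}
expires = {"session:abc": time.time() + 3600,   # live for another hour
           "session:old": time.time() - 1}      # already expired

print(lazy_get(store, expires, "session:abc"))  # user_data — still live
print(lazy_get(store, expires, "session:old"))  # None — deleted on access
```

Note that `session:old` occupied memory right up until someone read it — the gap that active expiration exists to close.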
Active Expiration (Background)
Redis runs a background task 10 times per second. Each cycle: (1) randomly sample 20 keys with TTLs, (2) delete any that are expired, (3) if more than 25% were expired, repeat immediately. This probabilistic approach ensures expired keys are cleaned up even if nobody accesses them, without scanning the entire keyspace.
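One cycle of that sampling loop can be sketched like this (a simplified model — the 20-key sample and 25% threshold mirror the description above, but the real cycle is also time-bounded per run and tracks keys with TTLs in a separate structure):

```python
import random
import time

def active_expire_cycle(store: dict, expires: dict, now: float,
                        sample_size: int = 20) -> None:
    """Toy active expiration: sample keys that carry a TTL, delete the
    expired ones, and loop again while >25% of the sample was expired."""
    while expires:
        sample = random.sample(list(expires), min(sample_size, len(expires)))
        expired = [k for k in sample if expires[k] <= now]
        for k in expired:
            store.pop(k, None)
            expires.pop(k, None)
        if len(expired) <= len(sample) // 4:  # ≤25% expired → stop this cycle
            break

random.seed(42)                               # fixed seed for reproducibility
store = {f"k{i}": i for i in range(100)}
# even-numbered keys are already expired; odd ones expire an hour from now
expires = {k: (0.0 if i % 2 == 0 else time.time() + 3600)
           for i, k in enumerate(store)}
active_expire_cycle(store, expires, now=time.time())
print(len(store))  # well below 100 — most expired keys were reclaimed
```

The "repeat while >25% expired" rule is what makes the approach adaptive: the dirtier the keyspace, the harder a single cycle works.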
⚠️ Expired ≠ Immediately Deleted
A key whose TTL has reached zero is logically expired but may still occupy memory until lazy or active expiration removes it. In practice, the active expiration cycle cleans up most expired keys within milliseconds. But under heavy load with millions of expiring keys, there can be a brief lag. This is normal and by design.
The 8 Eviction Policies
When Redis reaches its maxmemory limit, it must decide what to do with new writes. The eviction policy controls which keys get removed to make room. There are 8 policies, divided into two groups: allkeys (consider all keys) and volatile (only consider keys with a TTL set).
| Policy | Scope | Strategy | Best For |
|---|---|---|---|
| noeviction | N/A | Return errors on writes when memory is full. Reads still work. | When data loss is unacceptable — fail loudly rather than silently dropping data. |
| allkeys-lru | All keys | Evict the least recently used key across the entire keyspace. | General-purpose caching. The most common choice. |
| volatile-lru | Keys with TTL | Evict the least recently used key among keys that have a TTL set. | Mixed workloads — cache entries have TTLs, persistent data does not. |
| allkeys-lfu | All keys | Evict the least frequently used key across the entire keyspace. | When access frequency matters more than recency. Keeps popular keys longer. |
| volatile-lfu | Keys with TTL | Evict the least frequently used key among keys with a TTL. | Mixed workloads where frequency-based eviction is preferred. |
| allkeys-random | All keys | Evict a random key from the entire keyspace. | When all keys have roughly equal importance. |
| volatile-random | Keys with TTL | Evict a random key among keys with a TTL. | Simple eviction when you don't need smart selection. |
| volatile-ttl | Keys with TTL | Evict the key with the shortest remaining TTL. | When keys closest to expiration are least valuable. |
LRU vs LFU
LRU — Least Recently Used
- Evicts keys that haven't been accessed recently
- Good when recent access predicts future access
- Problem: a key accessed once recently beats a key accessed 1,000 times yesterday
- Redis approximates LRU by sampling N random keys and evicting the oldest
LFU — Least Frequently Used
- Evicts keys that are accessed least often
- Good when popular keys should stay cached regardless of recency
- Uses a logarithmic frequency counter that decays over time
- Better for workloads with stable hot keys (e.g., product catalog)
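Redis bumps the LFU counter probabilistically, which is what makes its growth logarithmic. A sketch of that increment rule (the formula approximates Redis's documented `lfu-log-factor` behavior; the `lfu_incr` name is made up):

```python
import random

LFU_INIT_VAL = 5    # new keys start here so one access doesn't look "hot"

def lfu_incr(counter: int, lfu_log_factor: int = 10) -> int:
    """Probabilistic bump: the higher the counter already is, the less
    likely an access is to increment it."""
    if counter >= 255:                      # counter is 8 bits, saturates
        return counter
    base = max(counter - LFU_INIT_VAL, 0)
    p = 1.0 / (base * lfu_log_factor + 1)   # p shrinks as the counter grows
    return counter + 1 if random.random() < p else counter

random.seed(1)                              # fixed seed for reproducibility
c = LFU_INIT_VAL
for hits in (100, 10_000, 100_000):
    for _ in range(hits):
        c = lfu_incr(c)
    print(hits, c)                          # counter climbs slowly, not linearly
```

An 8-bit counter can represent millions of accesses precisely because each additional increment costs exponentially more hits; the decay timer then lets stale hot keys cool off.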
```
# Set maximum memory limit
maxmemory 4gb

# Choose eviction policy
maxmemory-policy allkeys-lru

# Number of keys to sample for LRU/LFU approximation
# Higher = more accurate but slightly slower
# Default is 5, 10 is a good balance
maxmemory-samples 10

# LFU tuning (only relevant for *-lfu policies)
# lfu-log-factor: higher = slower frequency counter growth
# lfu-decay-time: minutes before frequency counter is halved
lfu-log-factor 10
lfu-decay-time 1
```
True LRU requires tracking access order for ALL keys — O(N) memory. Redis uses approximated LRU instead:

1. When eviction is needed, sample maxmemory-samples (e.g., 10) random keys
2. Among those 10 keys, evict the one with the oldest last-access time
3. Repeat until enough memory is freed

With maxmemory-samples=10, the approximation is very close to true LRU. With maxmemory-samples=5 (default), it's slightly less accurate but faster.

Why not true LRU?

- True LRU needs a doubly-linked list + hash map for every key
- For 10 million keys, that's ~160MB of overhead just for the LRU structure
- Redis's sampling approach uses zero extra memory
- At samples=10, the eviction quality is nearly identical to true LRU
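The sampling trick fits in a few lines (a toy model — real Redis also keeps a small pool of eviction candidates across calls and stores the access clock in 24 bits of existing key metadata):

```python
import random

def evict_approx_lru(last_access: dict, samples: int = 10) -> str:
    """Approximated LRU: sample `samples` random keys and evict the one
    with the oldest last-access time — no global ordering is maintained."""
    pool = random.sample(list(last_access), min(samples, len(last_access)))
    victim = min(pool, key=last_access.get)
    del last_access[victim]
    return victim

random.seed(7)
# key -> last-access timestamp; "cold" is by far the stalest key
last_access = {f"k{i}": 1_000 + i for i in range(50)}
last_access["cold"] = 1
victim = evict_approx_lru(last_access, samples=10)
print(victim)  # an old key from the sample; raising `samples` makes
               # catching the truly coldest key more likely
```

The victim is always the oldest key *within the sample*, so a larger `samples` value converges on true LRU at the cost of a bigger per-eviction scan.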
🎯 Choosing the Right Policy
For most caching use cases, allkeys-lru is the right default. Use allkeys-lfu if you have a stable set of hot keys that should survive occasional cold-key access spikes. Use noeviction for data stores where losing any key is unacceptable (queues, rate limiters). Use volatile-* variants when you mix cache entries (with TTL) and persistent data (without TTL) in the same Redis instance.
No Persistence
Sometimes the right persistence strategy is none at all. When Redis is used as a pure cache — where every key can be regenerated from the source of truth — persistence adds complexity and I/O overhead with no real benefit. If Redis restarts, the cache warms up naturally as requests come in.
When No Persistence Makes Sense
- ✅ Pure cache layer — every cached value can be re-fetched from the database, a cold cache is slow but not incorrect
- ✅ Ephemeral data — rate limiter counters, real-time analytics buffers, temporary computation results only valuable for seconds or minutes
- ✅ Derived data — materialized views, pre-computed aggregations, search indexes, anything that can be rebuilt from primary storage
- ✅ Development and testing — local Redis instances where persistence just slows down restarts and clutters the filesystem
```
# Disable RDB snapshots entirely
save ""

# Disable AOF
appendonly no

# Set eviction policy for cache behavior
maxmemory 4gb
maxmemory-policy allkeys-lru

# Optional: disable RDB file loading on startup
# (prevents accidentally loading a stale dump.rdb)
dbfilename ""

# Result: Redis is a pure in-memory cache
# - Maximum performance (no disk I/O)
# - Restart = empty Redis (cold cache)
# - All data must be re-fetchable from the source of truth
```
⚠️ Know What You're Giving Up
With no persistence, a Redis restart means a completely cold cache. If your system depends on Redis for rate limiting, session storage, or job queues, disabling persistence means those features break on restart. Only disable persistence when every key in Redis is truly expendable.
Decision Framework
Choosing the right persistence mode depends on your durability requirements, performance budget, and operational complexity tolerance. Here's how the four options compare across the dimensions that matter.
| Dimension | RDB Only | AOF Only | RDB + AOF Hybrid | No Persistence |
|---|---|---|---|---|
| Durability | Minutes of data loss (between snapshots) | ≤1 second (everysec) or zero (always) | ≤1 second (AOF tail with everysec) | Total data loss on restart |
| Restart Speed | Fast — binary load (~10s for 10GB) | Slow — replay all commands (~60s for 10GB) | Fast — RDB base + small AOF tail (~12s for 10GB) | Instant — nothing to load |
| Disk Usage | Low — compact binary snapshots | High — command log grows between rewrites | Medium — RDB base + small tail | None |
| Write Performance | Minimal impact — BGSAVE runs in background | Slight impact — fsync everysec adds ~1-2% overhead | Same as AOF — fsync everysec | Maximum — no disk I/O at all |
| Memory Overhead | COW spike during fork (up to 2x briefly) | COW spike during rewrite | COW spike during rewrite | None |
| Complexity | Low — simple to configure and monitor | Medium — need to monitor file size and rewrites | Medium — same as AOF | Lowest — nothing to configure |
| Best For | Backups, disaster recovery, non-critical caches | Queues, counters, anything needing strong durability | Production deployments needing both speed and durability | Pure caches, ephemeral data, dev environments |
Quick Decision Guide
🟢 Use RDB + AOF Hybrid When
- You need durability AND fast restarts
- Running Redis as a primary data store (not just cache)
- Production environment with SLA requirements
- You want the safest default — this is the recommended setup
🔵 Use RDB Only When
- Data can tolerate minutes of loss on crash
- You primarily need backups and disaster recovery
- Disk I/O budget is very tight
- Dataset is large and you want minimal write amplification
🟡 Use AOF Only When
- Maximum durability is the top priority
- You need `fsync always` for zero data loss
- Restart speed is not critical
- You want a human-readable persistence log
⚪ Use No Persistence When
- Redis is a pure cache — all data is re-fetchable
- Maximum performance is required
- Data is ephemeral (rate limits, temp counters)
- Development or testing environments
🎯 Interview Framework
When asked "how would you configure Redis persistence?" — don't jump to a single answer. Ask: "What's the durability requirement? Can we tolerate data loss on crash? Is this a cache or a primary store?" Then map the answer to the right mode. This shows you think in trade-offs, not defaults.
Interview Questions
These questions test whether you understand Redis persistence trade-offs and can make informed decisions for production systems.
Q: What's the difference between RDB and AOF persistence in Redis?
A: RDB takes point-in-time snapshots of the entire dataset at configured intervals using fork + BGSAVE. It produces compact binary files that are fast to load on restart, but you lose all writes between the last snapshot and a crash. AOF logs every write command to a file. With 'appendfsync everysec', you lose at most 1 second of data. AOF files are larger and slower to replay on restart, but provide much better durability. The recommended production setup is hybrid: AOF with an RDB preamble (aof-use-rdb-preamble yes). This gives you fast restarts (RDB base) and minimal data loss (AOF tail).
Q: How does Redis handle the fork() for BGSAVE without blocking clients?
A: When Redis triggers BGSAVE, it calls fork() to create a child process. The child gets a copy of the parent's memory space via the OS's copy-on-write (COW) mechanism. The parent continues serving clients normally. The child iterates over the dataset and writes it to the RDB file. COW means memory pages are only duplicated when the parent modifies them — so if the dataset is mostly read-heavy, the fork is cheap. However, on write-heavy workloads with large datasets, COW can cause significant memory spikes (up to 2x) because modified pages must be copied. This is why you should monitor memory during BGSAVE and ensure you have enough headroom.
Q: What eviction policy would you choose for a Redis cache, and why?
A: For a general-purpose cache, allkeys-lru is the best default. It evicts the least recently used key across the entire keyspace, which works well when recent access predicts future access. For workloads with stable hot keys (e.g., a product catalog where the top 1,000 products get 80% of traffic), allkeys-lfu is better — it keeps frequently accessed keys even if they weren't accessed in the last few seconds. Use volatile-lru when you mix cache entries (with TTL) and persistent data (without TTL) in the same instance — it only evicts keys that have a TTL set, protecting your persistent data. Use noeviction for data stores like job queues where losing any key is unacceptable — Redis returns errors on writes instead of silently dropping data.
Q: Your Redis instance restarts and takes 5 minutes to recover. How do you fix this?
A: A 5-minute restart means Redis is replaying a large AOF file. Three solutions: (1) Enable hybrid persistence (aof-use-rdb-preamble yes) — the AOF file starts with a compact RDB snapshot, so Redis loads the binary base quickly and only replays the small AOF tail. This typically cuts restart time by 80-90%. (2) Tune AOF rewrite thresholds — lower auto-aof-rewrite-percentage so rewrites happen more frequently, keeping the AOF file smaller. (3) If using AOF only, switch to hybrid mode. If restart speed is critical and durability is not, consider RDB only — binary loads are the fastest. Also check if the dataset size is appropriate for the instance — a 50GB dataset on a single node will always be slow to load.
Q: How does Redis approximate LRU, and why doesn't it use true LRU?
A: True LRU requires maintaining a doubly-linked list ordered by access time plus a hash map for O(1) lookups — this adds ~16 bytes per key of overhead. For 100 million keys, that's 1.6GB just for the LRU data structure. Redis instead uses approximated LRU: when eviction is needed, it samples maxmemory-samples random keys (default 5, recommended 10) and evicts the one with the oldest last-access timestamp. Each key stores its last access time in 24 bits (3 bytes) that are already part of the key metadata — zero additional memory. With 10 samples, the approximation is statistically very close to true LRU. The trade-off is worth it: near-identical eviction quality with zero memory overhead.
Common Mistakes
These mistakes cause real production incidents — from silent data loss to unexpected downtime.
Using Redis as a primary store with no persistence
Teams store critical data in Redis (job queues, rate limiter state, user sessions) but leave persistence disabled because 'Redis is a cache.' When Redis restarts — planned maintenance, OOM kill, or crash — all data is gone. Jobs are lost, rate limits reset, users are logged out. The system doesn't crash, but it silently loses data.
✅ If any data in Redis cannot be regenerated from another source, enable persistence. Use hybrid mode (aof-use-rdb-preamble yes, appendfsync everysec) as the default for any Redis instance that holds non-cache data. Audit every key namespace: can this be re-fetched from the database? If not, it needs persistence.
Not reserving memory for fork overhead
Redis is configured with maxmemory 8gb on an 8GB instance. During BGSAVE or AOF rewrite, Redis forks. Copy-on-write means modified pages are duplicated. On a write-heavy workload, memory usage spikes to 12-14GB. The OS kills Redis (OOM) or starts swapping, causing massive latency spikes. The background save fails, and if stop-writes-on-bgsave-error is enabled, Redis stops accepting writes entirely.
✅ Reserve at least 30-50% extra memory beyond maxmemory for fork overhead. On an 8GB instance, set maxmemory to 4-5GB. Monitor memory usage during BGSAVE with INFO memory (used_memory_rss). On Linux, set vm.overcommit_memory=1 to prevent fork failures, but still ensure physical memory is sufficient.
Using appendfsync always without understanding the cost
A team configures appendfsync always for 'maximum durability' on a high-throughput Redis instance. Every write now waits for a disk fsync before returning. Throughput drops from 100,000 ops/sec to 1,000 ops/sec. Latency jumps from sub-millisecond to 5-10ms. The application slows to a crawl, and the team doesn't connect it to the Redis config change.
✅ Use appendfsync everysec for nearly all production workloads. It provides ≤1 second of data loss on crash with negligible performance impact. Only use always when you genuinely cannot afford to lose a single write AND your throughput is low enough to absorb the cost (e.g., a financial ledger with 100 writes/sec, not a cache with 100,000 writes/sec).
Wrong eviction policy for the workload
A team uses the default noeviction policy for a cache. When Redis hits maxmemory, it starts returning OOM errors on every write. The application crashes because it doesn't handle Redis write failures. Or: a team uses allkeys-lru for a mixed workload where some keys are persistent config data without TTLs. LRU evicts the config keys because they haven't been accessed recently, breaking the application.
✅ Match the eviction policy to your workload. For pure caches: allkeys-lru or allkeys-lfu. For mixed cache + persistent data: volatile-lru (only evicts keys with TTLs). For data stores where no key should be evicted: noeviction — but then your application MUST handle write errors gracefully. Always test what happens when Redis hits maxmemory before it happens in production.