Module A-13 · 18 min read

RDB backup scheduling, AOF log shipping to cold storage, BGSAVE + BGREWRITEAOF interaction, DEBUG RELOAD for in-memory consistency checks, and a documented recovery runbook for the three most common failure scenarios.

A-13 — Disaster Recovery, Backup, and Point-in-Time Restore

Q: You are designing the backup strategy for a Redis instance that acts exclusively as an in-memory session store. A server crash resulting in the loss of the last 5 minutes of session data (forcing those users to log in again) is acceptable, but losing all sessions is not. Which backup strategy offers the best balance of performance and recovery capability for this specific workload?

Configure automated hourly RDB snapshots and ship the `.rdb` files to an S3 bucket. — The scenario defines a Recovery Point Objective (RPO) where losing minutes of data is acceptable. AOF with `appendfsync always` (A) or continuous AOF shipping (D) adds significant I/O overhead to achieve a near-zero RPO, which is unnecessary here. While replication (B) provides high availability, it is not a backup: an accidental `FLUSHALL` is replicated immediately and wipes the standby too. Periodic RDB snapshots provide point-in-time recovery with minimal performance impact, which suits a workload where minor data loss (re-logins) is a tolerable trade-off for performance. Note that the snapshot interval bounds the loss: an hourly schedule can lose up to an hour of sessions, so if the five-minute figure is a hard limit, shorten the interval to match (for example with a `save 300 1` save point).

Q: During a disaster recovery drill, you notice that your automated RDB backup script occasionally uploads corrupted or incomplete `.rdb` files to S3. Looking at the script, you see it runs `BGSAVE` and then immediately copies the `/var/lib/redis/dump.rdb` file. What is the fatal flaw in this script?

`BGSAVE` is asynchronous. The script is copying the file while the Redis background child process is still writing to it. — The `BGSAVE` command simply *starts* the background saving process and returns `Background saving started` immediately. If you copy the file right after issuing the command, you are likely copying a partially written file or the previous, outdated snapshot. A robust backup script must poll `INFO persistence` and check the `rdb_bgsave_in_progress` metric, waiting until it returns `0` before copying the file.

Q: A junior developer accidentally runs `FLUSHALL` on production Redis, instantly deleting all keys. You have an automated RDB backup from 1 hour ago stored in S3. What is the safest and most reliable recovery procedure?

Stop the application from writing to Redis, spin up an isolated test instance, load the S3 backup into the test instance to verify it, and then migrate the data from the test instance back to production. — When `FLUSHALL` occurs, urgency is high but caution is critical. If you simply restart the production node with the backup (A), any writes that occurred *after* you restarted but *before* you secured the system will be lost. Furthermore, if the backup is corrupt, you've now completely destroyed your production instance. The standard runbook dictates loading the backup into a sandboxed instance first. This allows you to verify the integrity of the backup and selectively migrate the recovered data back into production using tools like `redis-cli --pipe` or custom migration scripts, ensuring you don't inadvertently overwrite new, valid data that arrived post-`FLUSHALL`.

Who this module is for: Redis is holding data that matters — sessions, job queues, rate limit state, coordination locks. If the server dies, you need to recover. This module covers RDB backup scheduling, AOF log shipping, recovery runbooks for the three most common failure scenarios, and the one practice most teams skip: testing restore procedures before they need them.

The Backup Strategy Pyramid

Not all Redis data needs the same protection:

Data Type	Recovery Strategy	RTO	RPO
Pure cache	Regenerate from DB on miss	Seconds	Any (cache is disposable)
Session store	Restore from RDB + accept some logouts	Minutes	Last snapshot
Job queue	Restore from RDB + reprocess in-flight jobs	Minutes	Last snapshot
Rate limit state	Restore from RDB + accept burst on restart	Minutes	Last snapshot
Event log / audit trail	AOF log shipping + S3	< 30 min	< 1 minute

RPO (Recovery Point Objective): Maximum acceptable data loss (time).
RTO (Recovery Time Objective): Maximum acceptable downtime (time to recover).

Design your persistence configuration to meet the RPO; design your restore procedure to meet the RTO.
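
One way to make the RPO measurable is to compare the age of the newest snapshot against the target. The sketch below assumes the RPO is expressed in seconds; `rpo_met` and its arguments are hypothetical names introduced for illustration, not part of Redis.

```shell
#!/bin/bash
# Sketch: is the newest snapshot still inside the RPO window?
# rpo_met is a hypothetical helper, not a Redis command.

# Usage: rpo_met LAST_SAVE_EPOCH NOW_EPOCH RPO_SECONDS
# Returns 0 when the snapshot age is within the RPO, 1 otherwise.
rpo_met() {
  local last_save="$1" now="$2" rpo="$3"
  local age=$(( now - last_save ))
  [ "$age" -le "$rpo" ]
}

# Example with live data (assumes redis-cli can reach the instance;
# LASTSAVE returns the Unix time of the last successful save):
#   last=$(redis-cli LASTSAVE)
#   rpo_met "$last" "$(date +%s)" 3600 || echo "snapshot older than RPO" >&2
```

Run from monitoring, a check like this turns the RPO from a design intention into an alert.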

RDB Backup Scheduling

Automated Periodic Snapshots

bash

#!/bin/bash
# /opt/scripts/redis-backup.sh (run via cron)
set -euo pipefail

REDIS_CLI="/usr/bin/redis-cli"
REDIS_HOST="127.0.0.1"
REDIS_PORT="6379"
RDB_FILE="/var/lib/redis/dump.rdb"   # must match the Redis "dir" and "dbfilename" settings
BACKUP_DIR="/backups/redis"
S3_BUCKET="s3://my-company-backups/redis"
DATE=$(date +%Y%m%d-%H%M)

# Print one field from INFO persistence, without the trailing carriage return
persistence_field() {
  "$REDIS_CLI" -h "$REDIS_HOST" -p "$REDIS_PORT" INFO persistence \
    | grep "^$1:" | cut -d: -f2 | tr -d '\r\n '
}

# Trigger a background save (the command returns immediately)
"$REDIS_CLI" -h "$REDIS_HOST" -p "$REDIS_PORT" BGSAVE

# Wait for save to complete (poll rdb_bgsave_in_progress)
while [ "$(persistence_field rdb_bgsave_in_progress)" = "1" ]; do
  sleep 1
done

# Check if last save was successful
if [ "$(persistence_field rdb_last_bgsave_status)" != "ok" ]; then
  echo "BGSAVE failed!" >&2
  exit 1
fi

# Copy the RDB file
mkdir -p "$BACKUP_DIR"
cp "$RDB_FILE" "$BACKUP_DIR/dump-$DATE.rdb"

# Upload to S3 with server-side encryption
aws s3 cp "$BACKUP_DIR/dump-$DATE.rdb" "$S3_BUCKET/dump-$DATE.rdb" --sse aws:kms

# Verify upload (aws s3 ls exits non-zero when the object is missing)
if ! aws s3 ls "$S3_BUCKET/dump-$DATE.rdb" >/dev/null; then
  echo "Upload verification failed!" >&2
  exit 1
fi

# Cleanup local backups older than 7 days
find "$BACKUP_DIR" -name "dump-*.rdb" -mtime +7 -delete

echo "Backup completed: dump-$DATE.rdb"
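
The polling loop in the script waits indefinitely if the background save never finishes. A minimal sketch of a bounded wait follows; `wait_until` and `bgsave_idle` are hypothetical helper names, and the 600-second limit in the usage comment is an arbitrary assumption to tune for your dataset size.

```shell
#!/bin/bash
# Sketch: poll a condition with an upper time limit.

# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds (returns 0)
# or TIMEOUT_SECONDS have elapsed (returns 1).
wait_until() {
  local timeout="$1"; shift
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
}

# True when INFO persistence reports that no background save is running.
bgsave_idle() {
  redis-cli INFO persistence | grep '^rdb_bgsave_in_progress:0' >/dev/null
}

# Usage inside a backup script:
#   wait_until 600 bgsave_idle || { echo "BGSAVE timed out" >&2; exit 1; }
```

Failing loudly on a timeout is usually preferable to a cron job that hangs and silently blocks later runs.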

Add to cron:

text

# /etc/cron.d/redis-backup
# Every 6 hours
0 */6 * * * redis /opt/scripts/redis-backup.sh >> /var/log/redis-backup.log 2>&1

Backup Retention Policy

text

Hourly backups:  keep 24 (24 hours)
Daily backups:   keep 30 (30 days)
Weekly backups:  keep 52 (1 year)
Monthly backups: keep forever

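Implement with S3 Lifecycle policies. Expiration is specified in whole days, so the hourly rule uses 2 days as a safe upper bound for 24 hourly backups. The rules match on key prefix, so backups must be uploaded under `redis/hourly/` or `redis/daily/` for them to apply: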

json

{
  "Rules": [{
    "Filter": {"Prefix": "redis/hourly/"},
    "Expiration": {"Days": 2},
    "Status": "Enabled"
  }, {
    "Filter": {"Prefix": "redis/daily/"},
    "Expiration": {"Days": 30},
    "Status": "Enabled"
  }]
}
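
Because lifecycle rules match on prefix, the backup job has to decide which tier each snapshot belongs to. The sketch below shows one possible mapping, under the assumption that the midnight run on the 1st is the monthly backup, the midnight run on Sunday is the weekly one, and other midnight runs are daily. `backup_tier` is a hypothetical helper, and the `weekly/` and `monthly/` prefixes are assumptions that would need their own lifecycle rules.

```shell
#!/bin/bash
# Sketch: pick a retention tier for a backup from its timestamp.

# backup_tier HOUR DAY_OF_WEEK DAY_OF_MONTH
# Arguments use the formats of: date +%H, date +%u (1=Mon..7=Sun), date +%d
backup_tier() {
  local hour="$1" dow="$2" dom="$3"
  if [ "$hour" != "00" ]; then
    echo hourly
  elif [ "$dom" = "01" ]; then
    echo monthly
  elif [ "$dow" = "7" ]; then
    echo weekly
  else
    echo daily
  fi
}

# Usage ($file is a hypothetical local backup path):
#   tier=$(backup_tier "$(date +%H)" "$(date +%u)" "$(date +%d)")
#   aws s3 cp "$file" "s3://my-company-backups/redis/$tier/$(basename "$file")"
```

String comparison is used instead of arithmetic so that zero-padded values such as `08` are not misread as octal.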

AOF Log Shipping

For near-real-time backup (RPO < 1 minute), ship the AOF file to durable storage continuously:

bash

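#!/bin/bash
# Ship new AOF data to S3 roughly every 30 to 60 seconds
# Uses inotifywait to detect file changes

AOF_PATH="/var/lib/redis/appendonlydir"
S3_BUCKET="s3://my-company-backups/redis/aof"

while true; do
  # Redis keeps the AOF open and appends to it, so watch for modify events;
  # close_write alone only fires when a file is closed (e.g. after a rewrite).
  # -t 30 returns after 30 s even if nothing changed.
  inotifywait -qq -t 30 -e modify -e create -e moved_to "$AOF_PATH" || true
  # The last command in a file copied mid-write may be truncated;
  # redis-check-aof --fix handles that at restore time.
  aws s3 sync "$AOF_PATH" "$S3_BUCKET/" --exclude "*.tmp" \
    && echo "$(date): AOF shipped to S3"
  sleep 30   # rate-limit uploads on busy instances
done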

For Redis 7.0+ multi-part AOF (RDB base + incremental AOF files):

bash

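# Ship only the incremental AOF files (the base is the RDB snapshot).
# aws s3 sync includes every file by default, so exclude everything first
# and then re-include the pattern you want.
aws s3 sync /var/lib/redis/appendonlydir/ s3://backups/redis/aof/ \
  --exclude "*" --include "appendonly.aof.*.incr.aof"
# A restore also needs the matching base file (appendonly.aof.*.base.rdb)
# and appendonly.aof.manifest, so back those up as well.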
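
To check that a shipped set is complete, the manifest can be used as the list of files a restore needs. The sketch below assumes the Redis 7 manifest layout, where each line begins with `file <name>` followed by further fields; verify this against a manifest from your own instance. `manifest_files` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: list the files referenced by a multi-part AOF manifest.
# Assumes lines of the form: file <name> seq <n> type <b|i|h>

# manifest_files: read a manifest on stdin, print one file name per line.
manifest_files() {
  awk '$1 == "file" { print $2 }'
}

# Usage: confirm every referenced file exists next to the manifest
#   cd /var/lib/redis/appendonlydir
#   manifest_files < appendonly.aof.manifest | while IFS= read -r f; do
#     [ -e "$f" ] || echo "missing: $f" >&2
#   done
```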

Recovery Runbooks

Scenario 1: Single Node Crash (Data Loss ≤ Last Snapshot)

Symptoms: Redis process crashed or OOM killed. No failover configured.
Data loss: Everything since last successful BGSAVE.

bash

# 1. Restart Redis
systemctl start redis

# With AOF disabled, the instance loads dump.rdb automatically on startup
# (with appendonly yes it loads the AOF instead)
# Verify:
redis-cli INFO persistence | grep rdb_last_save_time

# 2. Verify key count matches expectations
redis-cli DBSIZE

# 3. Warm the cache if needed (re-populate from database)
# ... application-specific cache warming ...

echo "Recovery complete. Data restored to: $(date -d @$(redis-cli INFO persistence | grep rdb_last_save_time | cut -d: -f2 | tr -d '\r'))"

RTO: ~2–10 minutes (time to start Redis + load RDB).
RPO: Time since last BGSAVE (up to your snapshot interval).

Scenario 2: Data File Corruption

Symptoms: Redis refuses to start; logs show "Bad file format" or checksum error.

bash

# 1. Check what Redis says about the RDB file
redis-check-rdb /var/lib/redis/dump.rdb
# Output: corrupt? what position?

# 2. If RDB is corrupt: restore from backup
aws s3 cp s3://backups/redis/dump-20240315-1200.rdb /var/lib/redis/dump.rdb

# 3. Verify the backup file
redis-check-rdb /var/lib/redis/dump.rdb

# 4. Start Redis
systemctl start redis

# --- OR if using AOF ---

# Check and fix a corrupt AOF file
redis-check-aof --fix /var/lib/redis/appendonlydir/appendonly.aof.1.incr.aof
# This truncates the file at the corruption point (data after the corruption is lost)

# Start Redis
systemctl start redis
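
The RDB steps above can be combined so that a backup is only put in place after it passes the integrity check, and the corrupt file is kept for later analysis. This is a sketch under assumptions: `install_verified_rdb` and the `RDB_CHECK_CMD` override are hypothetical names, the default checker is `redis-check-rdb` as used above, and Redis is assumed to be stopped while the file is replaced.

```shell
#!/bin/bash
# Sketch: verify a backup before installing it as dump.rdb.

# install_verified_rdb SRC DEST
# Runs the checker on SRC. Only if it passes, moves any existing DEST aside
# (DEST.corrupt.<epoch>) and copies SRC into place.
install_verified_rdb() {
  local src="$1" dest="$2"
  local check="${RDB_CHECK_CMD:-redis-check-rdb}"
  if ! "$check" "$src" >/dev/null 2>&1; then
    echo "refusing to install $src: integrity check failed" >&2
    return 1
  fi
  if [ -e "$dest" ]; then
    mv "$dest" "$dest.corrupt.$(date +%s)"
  fi
  cp "$src" "$dest"
}

# Usage (paths are examples):
#   install_verified_rdb /tmp/dump-backup.rdb /var/lib/redis/dump.rdb \
#     && systemctl start redis
```

After copying, check that the file is owned by the user Redis runs as, since a file copied by root may not be readable by the service.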

Scenario 3: Accidental FLUSHALL (Data Loss: Complete)

Symptoms: Someone ran FLUSHALL. All keys are gone.
Urgency: Every write after the FLUSHALL makes recovery harder, because restored keys can collide with data written since.

bash

# 1. IMMEDIATELY stop the application from writing to Redis
# Disable Redis in your application config or block client traffic at the network layer.
# (replica-read-only only affects replicas; it cannot make a primary read-only.)
# Also disable automatic snapshots so the emptied dataset does not overwrite dump.rdb:
redis-cli CONFIG SET save ""

# 2. Find the most recent backup before the FLUSHALL
# Check S3 timestamps vs when FLUSHALL was detected

# 3. Prepare a restore instance (do NOT restore on the running primary)
# Put the backup in place BEFORE starting Redis: the RDB is loaded at startup,
# and DEBUG RELOAD would first save the empty dataset over the file.
mkdir -p /tmp/redis-restore
cp /tmp/dump-backup.rdb /tmp/redis-restore/restore.rdb

# 4. Start a separate Redis instance that loads the backup
redis-server --port 6380 --dbfilename restore.rdb --dir /tmp/redis-restore \
  --appendonly no --daemonize yes

# 5. Verify the restore instance has the data
redis-cli -p 6380 DBSIZE
redis-cli -p 6380 --scan --pattern "sample:*" | head -20

# 6. Migrate data from restore to production
# MIGRATE ... COPY leaves the key on the restore instance and preserves its TTL.
# Without REPLACE it refuses to overwrite keys that already exist on production.
redis-cli -p 6380 --scan | while IFS= read -r key; do
  redis-cli -p 6380 MIGRATE 127.0.0.1 6379 "$key" 0 5000 COPY
done

# This is slow for large datasets (one round trip per key, database 0 only).
# For a Redis Cluster target, redis-cli --cluster import can copy from a standalone instance.

# 7. Restore application writes

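Alternative (faster for large datasets): redis-cli --pipe streams a pre-generated file of commands in the Redis protocol (RESP) to the server in bulk. It does not read RDB files, so the commands must first be generated from the restore instance.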
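
Step 2 of the runbook, picking the newest backup taken before the FLUSHALL, can be sketched as a small filter. It assumes backups follow the `dump-YYYYMMDD-HHMM.rdb` naming from the backup script, which sorts chronologically as plain text. `latest_backup_before` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: choose the newest backup that predates an incident.

# latest_backup_before CUTOFF
# Reads backup names (dump-YYYYMMDD-HHMM.rdb) on stdin and prints the newest
# one whose timestamp is strictly earlier than CUTOFF (YYYYMMDD-HHMM).
# Prints nothing when no backup qualifies.
latest_backup_before() {
  local cutoff="dump-$1.rdb"
  LC_ALL=C sort | LC_ALL=C awk -v cutoff="$cutoff" '
    $0 < cutoff { last = $0 }
    END { if (last != "") print last }'
}

# Usage (the 4th column of "aws s3 ls" output is the object name):
#   aws s3 ls s3://my-company-backups/redis/ | awk '{ print $4 }' \
#     | latest_backup_before 20240315-1405
```

The comparison is strict on purpose: a backup stamped with the same minute as the incident may already contain the emptied dataset.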

Testing Your Restore Procedure

This is the step most teams skip. A backup you have never restored is not a backup — it is a file that might be a backup.

Monthly Restore Drill

bash

#!/bin/bash
# /opt/scripts/restore-drill.sh (run monthly)
set -euo pipefail

# Download the most recent backup
mkdir -p /tmp/redis-restore
aws s3 cp s3://backups/redis/dump-latest.rdb /tmp/redis-restore/dump.rdb

# Start a test Redis instance with the backup already in its data directory.
# Copying the file into a running container and restarting it is unsafe:
# Redis may save its empty dataset over dump.rdb on shutdown.
docker run -d --name redis-restore-test -p 6381:6379 \
  -v /tmp/redis-restore:/data redis:7-alpine

# Wait until Redis answers and has finished loading the RDB
until redis-cli -p 6381 INFO persistence 2>/dev/null | grep '^loading:0' >/dev/null; do
  sleep 1
done

# Verify key count
KEY_COUNT=$(redis-cli -p 6381 DBSIZE)
echo "Restore drill: $KEY_COUNT keys restored"

# Sample a few known keys
redis-cli -p 6381 GET "user:1001"
redis-cli -p 6381 HGETALL "config:global"

# Clean up
docker stop redis-restore-test
docker rm redis-restore-test

echo "Restore drill complete at $(date)"

What to Validate

RDB file integrity: redis-check-rdb dump.rdb returns clean
Key count: approximately matches production DBSIZE
Sample key existence: spot-check 10–20 known keys that should exist
Data correctness: verify a few known values match expected values
Restore time: measure end-to-end RTO — does it meet your SLA?
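
The "approximately matches" check on key count can be automated with a tolerance. A minimal sketch, assuming a percentage tolerance chosen by you (the 5% in the usage comment is an arbitrary example); `key_count_ok` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: compare restored key count against production within a tolerance.

# key_count_ok PROD_COUNT RESTORED_COUNT TOLERANCE_PERCENT
# Returns 0 when the counts differ by at most TOLERANCE_PERCENT of PROD_COUNT.
key_count_ok() {
  local prod="$1" restored="$2" pct="$3"
  local diff=$(( prod - restored ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  # Integer form of: diff / prod <= pct / 100
  [ $(( diff * 100 )) -le $(( prod * pct )) ]
}

# Usage in a restore drill:
#   prod=$(redis-cli -p 6379 DBSIZE)
#   restored=$(redis-cli -p 6381 DBSIZE)
#   key_count_ok "$prod" "$restored" 5 || { echo "key count mismatch" >&2; exit 1; }
```

Workloads with many short-lived keys may need a wider tolerance, since keys that expired after the snapshot are dropped when the RDB is loaded.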

Managed Redis Backup

For AWS ElastiCache, GCP Memorystore, and Redis Cloud:

AWS ElastiCache:

Enable automatic daily backups (snapshot window configuration)
Snapshots stored in S3 with configurable retention (1–35 days)
Manual snapshots: aws elasticache create-snapshot --cache-cluster-id my-cluster --snapshot-name manual-backup
Restore by creating a new cluster from a snapshot

GCP Memorystore:

Persistence (RDB) enabled per instance
Manual exports to GCS: gcloud redis instances export gs://my-bucket/dump.rdb my-instance --region=REGION
Restore by importing from GCS

Redis Cloud:

Automatic backup to S3/GCS/Azure Blob, configurable frequency (every hour to daily)
Point-in-time restore via the console or API

For managed services: use the platform's backup mechanisms rather than running custom scripts. They handle the coordination with the managed service's storage layer.

Summary

Match backup strategy to data criticality: pure caches need no backup; audit logs need AOF shipping
RDB backup script: BGSAVE → wait for completion → copy dump.rdb → upload to S3 → prune old backups
AOF log shipping: ship incremental AOF files to S3 continuously (RPO < 1 minute)
Recovery runbooks for three scenarios: node crash (restore RDB), corruption (restore from backup), FLUSHALL (restore to test instance, migrate keys)
Test your restore procedure monthly — a backup you have never restored is not a backup
For managed services (ElastiCache, Memorystore, Redis Cloud): use the platform's backup mechanisms
RTO measurement: include it in monthly restore drills to verify you can recover within your SLA

Next: A-14 — Performance Benchmarking and Production Tuning — redis-benchmark, OS-level tuning, slowlog analysis, and the configuration changes that meaningfully improve throughput.

