RDB backup scheduling, AOF log shipping to cold storage, BGSAVE + BGREWRITEAOF interaction, DEBUG RELOAD for in-memory consistency checks, and a documented recovery runbook for the three most common failure scenarios.
A-13 — Disaster Recovery, Backup, and Point-in-Time Restore
Who this module is for: Redis is holding data that matters — sessions, job queues, rate limit state, coordination locks. If the server dies, you need to recover. This module covers RDB backup scheduling, AOF log shipping, recovery runbooks for the three most common failure scenarios, and the one practice most teams skip: testing restore procedures before they need them.
The Backup Strategy Pyramid
Not all Redis data needs the same protection:
| Data Type | Recovery Strategy | RTO | RPO |
|---|---|---|---|
| Pure cache | Regenerate from DB on miss | Seconds | Any (cache is disposable) |
| Session store | Restore from RDB + accept some logouts | Minutes | Last snapshot |
| Job queue | Restore from RDB + reprocess in-flight jobs | Minutes | Last snapshot |
| Rate limit state | Restore from RDB + accept burst on restart | Minutes | Last snapshot |
| Event log / audit trail | AOF log shipping + S3 | < 30 min | < 1 minute |
RPO (Recovery Point Objective): Maximum acceptable data loss (time).
RTO (Recovery Time Objective): Maximum acceptable downtime (time to recover).
Design your persistence configuration to meet the RPO; design your restore procedure to meet the RTO.
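Whether a given snapshot still satisfies the RPO is simple arithmetic on timestamps. A minimal sketch, assuming you already have the epoch time of the last successful snapshot; the helper name rpo_met and its argument order are illustrative and not part of Redis:

```bash
#!/bin/bash
# rpo_met LAST_SAVE_EPOCH NOW_EPOCH RPO_SECONDS
# Succeeds (exit 0) when the newest snapshot is no older than the RPO.
# Hypothetical helper: the name and arguments are illustrative only.
rpo_met() {
  local last_save=$1 now=$2 rpo=$3
  local age=$(( now - last_save ))
  echo "snapshot age: ${age}s (RPO: ${rpo}s)"
  [ "$age" -le "$rpo" ]
}

# Example: snapshot taken 4 hours before "now", RPO of 6 hours
rpo_met 1700000000 1700014400 21600 && echo "RPO met"
```

A check like this fits naturally into monitoring: alert when it fails rather than discovering a stale snapshot during an outage.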
RDB Backup Scheduling
Automated Periodic Snapshots
```bash
#!/bin/bash
# /opt/scripts/redis-backup.sh (run via cron)
set -euo pipefail

REDIS_CLI="/usr/bin/redis-cli"
REDIS_HOST="127.0.0.1"
REDIS_PORT="6379"
BACKUP_DIR="/backups/redis"
S3_BUCKET="s3://my-company-backups/redis"
DATE=$(date +%Y%m%d-%H%M)

# Read one field from INFO persistence, stripping the trailing CR
info_field() {
  "$REDIS_CLI" -h "$REDIS_HOST" -p "$REDIS_PORT" INFO persistence \
    | grep "^$1:" | cut -d: -f2 | tr -d '\r\n '
}

# Trigger a background save
"$REDIS_CLI" -h "$REDIS_HOST" -p "$REDIS_PORT" BGSAVE

# Wait for save to complete (poll rdb_bgsave_in_progress)
while [ "$(info_field rdb_bgsave_in_progress)" = "1" ]; do
  sleep 1
done

# Check if last save was successful
BGSAVE_STATUS=$(info_field rdb_last_bgsave_status)
if [ "$BGSAVE_STATUS" != "ok" ]; then
  echo "BGSAVE failed!" >&2
  exit 1
fi

# Copy the RDB file
mkdir -p "$BACKUP_DIR"
cp /var/lib/redis/dump.rdb "$BACKUP_DIR/dump-$DATE.rdb"

# Upload to S3 with server-side encryption
aws s3 cp "$BACKUP_DIR/dump-$DATE.rdb" "$S3_BUCKET/dump-$DATE.rdb" --sse aws:kms

# Verify upload
aws s3 ls "$S3_BUCKET/dump-$DATE.rdb"

# Cleanup local backups older than 7 days
find "$BACKUP_DIR" -name "dump-*.rdb" -mtime +7 -delete

echo "Backup completed: dump-$DATE.rdb"
```
Add to cron:
```
# /etc/cron.d/redis-backup
# Every 6 hours
0 */6 * * * redis /opt/scripts/redis-backup.sh >> /var/log/redis-backup.log 2>&1
```
Backup Retention Policy
Hourly backups: keep 24 (24 hours)
Daily backups: keep 30 (30 days)
Weekly backups: keep 52 (1 year)
Monthly backups: keep forever
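Implement with S3 Lifecycle policies. The rules assume backups are uploaded under one prefix per tier (redis/hourly/, redis/daily/, and so on), and S3 expiration is expressed in whole days:

```json
{
  "Rules": [
    {
      "Filter": {"Prefix": "redis/hourly/"},
      "Expiration": {"Days": 2},
      "Status": "Enabled"
    },
    {
      "Filter": {"Prefix": "redis/daily/"},
      "Expiration": {"Days": 30},
      "Status": "Enabled"
    }
  ]
}
```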
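Prefix-based lifecycle rules only work if each backup is uploaded under the right tier prefix. The sketch below shows one way to derive that prefix from the backup timestamp. The promotion rules (midnight on the 1st is monthly, midnight on Sunday is weekly, any other midnight is daily) and the helper name backup_tier are assumptions for illustration, and it relies on GNU date:

```bash
#!/bin/bash
# backup_tier STAMP
# Maps a backup timestamp (YYYYMMDD-HHMM) to a retention tier.
# The promotion rules are an assumption for illustration:
#   00:00 on the 1st -> monthly, 00:00 on Sunday -> weekly,
#   any other 00:00  -> daily,   everything else -> hourly.
backup_tier() {
  local stamp=$1
  local day=${stamp:0:8} dom=${stamp:6:2} hhmm=${stamp:9:4}
  if [ "$hhmm" != "0000" ]; then echo hourly; return; fi
  if [ "$dom" = "01" ]; then echo monthly; return; fi
  # GNU date: %u prints the ISO weekday, where 7 is Sunday
  if [ "$(date -d "$day" +%u)" = "7" ]; then echo weekly; return; fi
  echo daily
}

# Example: build the S3 key for a backup taken on 2024-03-15 at 12:00
STAMP="20240315-1200"
echo "redis/$(backup_tier "$STAMP")/dump-$STAMP.rdb"
```

Tiers without an expiration rule, such as monthly, are simply never deleted by the lifecycle policy.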
AOF Log Shipping
For near-real-time backup (RPO < 1 minute), ship the AOF file to durable storage continuously:
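```bash
#!/bin/bash
# Ship new AOF data to S3, at most once per minute
# Uses inotifywait to detect file changes
AOF_PATH="/var/lib/redis/appendonlydir"
S3_BUCKET="s3://my-company-backups/redis/aof"
LAST_SYNC=0

# Redis keeps the active AOF file open while appending, so close_write alone
# rarely fires; watch modify events as well.
inotifywait -m -q -e modify -e close_write "$AOF_PATH" | while read -r path action file; do
  NOW=$(date +%s)
  # Skip events that arrive within 60 seconds of the previous sync
  if [ $((NOW - LAST_SYNC)) -lt 60 ]; then
    continue
  fi
  aws s3 sync "$AOF_PATH" "$S3_BUCKET/" --exclude "*.tmp"
  LAST_SYNC=$(date +%s)
  echo "$(date): AOF shipped to S3"
done
```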
For Redis 7.0+ multi-part AOF (RDB base + incremental AOF files):
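```bash
# Ship only the incremental AOF files (the base is an RDB-format snapshot
# stored in the same directory).
# --exclude "*" must come first: on its own, --include filters nothing.
# A restore also needs the matching base file (appendonly.aof.*.base.rdb)
# and appendonly.aof.manifest.
aws s3 sync /var/lib/redis/appendonlydir/ s3://backups/redis/aof/ \
  --exclude "*" --include "appendonly.aof.*.incr.aof"
```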
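With multi-part AOF, the manifest names the files that together make up a restorable set. A minimal sketch for listing them, assuming manifest lines of the form `file <name> seq <n> type <b|h|i>` and file names without spaces; the helper name aof_manifest_files is hypothetical:

```bash
#!/bin/bash
# aof_manifest_files MANIFEST_PATH
# Prints the file names listed in a Redis 7 AOF manifest, one per line.
# Assumes lines shaped like: file <name> seq <n> type <b|h|i>
aof_manifest_files() {
  awk '$1 == "file" { print $2 }' "$1"
}

# Example with a sample manifest written to a temp directory
TMP=$(mktemp -d)
printf 'file appendonly.aof.1.base.rdb seq 1 type b\nfile appendonly.aof.1.incr.aof seq 1 type i\n' \
  > "$TMP/appendonly.aof.manifest"
aof_manifest_files "$TMP/appendonly.aof.manifest"
rm -rf "$TMP"
```

Comparing this list against what is actually in the bucket is a cheap way to spot a shipped set that is missing its base file.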
Recovery Runbooks
Scenario 1: Single Node Crash (Data Loss ≤ Last Snapshot)
Symptoms: Redis process crashed or OOM killed. No failover configured.
Data loss: Everything since last successful BGSAVE.
```bash
# 1. Start a fresh Redis instance
# First record when the snapshot on disk was written (its mtime). After a
# restart, rdb_last_save_time is reset to the startup time, so it no longer
# tells you how old the loaded data is.
SNAPSHOT_TIME=$(date -r /var/lib/redis/dump.rdb)
systemctl start redis
# The instance loads dump.rdb automatically on startup
# Verify that loading has finished (loading:0):
redis-cli INFO persistence | grep '^loading:'

# 2. Verify key count matches expectations
redis-cli DBSIZE

# 3. Warm the cache if needed (re-populate from database)
# ... application-specific cache warming ...

echo "Recovery complete. Data restored to: $SNAPSHOT_TIME"
```
RTO: ~2–10 minutes (time to start Redis + load RDB).
RPO: Time since last BGSAVE (up to your snapshot interval).
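The runbooks repeatedly pull single fields out of INFO output. A small sketch of a reusable parser follows; it reads INFO text on stdin so it can be tried against captured output. The helper name info_field and the sample values are illustrative:

```bash
#!/bin/bash
# info_field NAME
# Reads INFO output on stdin and prints the value of one field.
# INFO lines end in CRLF, so the carriage return is stripped first.
info_field() {
  tr -d '\r' | awk -F: -v name="$1" '$1 == name { print $2 }'
}

# Example with captured output; in practice pipe in: redis-cli INFO persistence
SAMPLE=$'# Persistence\r\nloading:0\r\nrdb_last_save_time:1700000000\r\nrdb_last_bgsave_status:ok\r\n'
echo "last save: $(printf '%s' "$SAMPLE" | info_field rdb_last_save_time)"
echo "status:    $(printf '%s' "$SAMPLE" | info_field rdb_last_bgsave_status)"
```

Matching the field name exactly, rather than with a bare grep, avoids picking up other fields that merely contain the same substring.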
Scenario 2: Data File Corruption
Symptoms: Redis refuses to start; logs show "Bad file format" or checksum error.
```bash
# 1. Check what Redis says about the RDB file
redis-check-rdb /var/lib/redis/dump.rdb
# Output: corrupt? what position?

# 2. If RDB is corrupt: keep the bad file for analysis, then restore from backup
mv /var/lib/redis/dump.rdb /var/lib/redis/dump.rdb.corrupt
aws s3 cp s3://backups/redis/dump-20240315-1200.rdb /var/lib/redis/dump.rdb
chown redis:redis /var/lib/redis/dump.rdb   # assumes Redis runs as user "redis"

# 3. Verify the backup file
redis-check-rdb /var/lib/redis/dump.rdb

# 4. Start Redis
systemctl start redis

# --- OR if using AOF ---
# Check and fix a corrupt AOF file (the tool asks for confirmation before writing)
redis-check-aof --fix /var/lib/redis/appendonlydir/appendonly.aof.1.incr.aof
# This truncates the file at the corruption point (data after the corruption is lost)
# Start Redis
systemctl start redis
```
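Choosing which backup to restore can be scripted when file names follow the dump-YYYYMMDD-HHMM.rdb pattern used by the backup script. A minimal sketch, assuming bare file names arrive on stdin (for example after trimming the output of an S3 listing); the helper name latest_backup_before is hypothetical:

```bash
#!/bin/bash
# latest_backup_before CUTOFF
# Reads backup file names (dump-YYYYMMDD-HHMM.rdb) on stdin and prints the
# newest one taken strictly before CUTOFF (also YYYYMMDD-HHMM).
# Fixed-width timestamps sort correctly as plain strings.
latest_backup_before() {
  local cutoff="dump-$1.rdb"
  LC_ALL=C sort \
    | LC_ALL=C awk -v c="$cutoff" '$0 < c { last = $0 } END { if (last != "") print last }'
}

# Example: the incident was noticed at 14:30 on 2024-03-15
printf '%s\n' dump-20240315-0600.rdb dump-20240315-1200.rdb dump-20240315-1800.rdb \
  | latest_backup_before 20240315-1430
```

The same selection applies when the cutoff is the time of an accidental FLUSHALL rather than the time corruption was detected.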
Scenario 3: Accidental FLUSHALL (Data Loss: Complete)
Symptoms: Someone ran FLUSHALL. All keys are gone.
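Urgency: Every write after the FLUSHALL makes recovery harder, because restored keys then have to be merged with newer data instead of simply loaded. With save points configured, FLUSHALL also rewrites the local dump.rdb as an empty snapshot, so restore from a backup copy rather than the file on the server.

```bash
# 1. IMMEDIATELY stop the application from writing to Redis
# Update your application config to disable Redis. A primary has no read-only
# switch (replica-read-only only affects replicas). On Redis 6.2+ you can
# hold writes for a bounded time instead (here: 10 minutes):
redis-cli CLIENT PAUSE 600000 WRITE
# Best: take the Redis primary offline temporarily, start a replica from backup

# 2. Find the most recent backup before the FLUSHALL
# Check S3 timestamps vs when FLUSHALL was detected

# 3. Prepare a restore instance (do NOT restore on the running primary)
# Put the backup in place BEFORE the instance starts, so it is loaded at startup
mkdir -p /tmp/redis-restore
cp /tmp/dump-backup.rdb /tmp/redis-restore/restore.rdb

# 4. Start a separate Redis instance on the restored file
# Do not use DEBUG RELOAD for this: it first saves the current (empty) dataset
# over the file, destroying the backup copy
redis-server --port 6380 --dbfilename restore.rdb --dir /tmp/redis-restore --daemonize yes

# 5. Verify the restore instance has the data
redis-cli -p 6380 DBSIZE
redis-cli -p 6380 --scan --pattern "sample:*" | head -20

# 6. Copy keys from restore to production with DUMP/RESTORE
redis-cli -p 6380 --scan | while IFS= read -r key; do
  # --raw emits the binary payload plus a newline; head strips that newline,
  # and -x passes stdin as the last argument of RESTORE
  redis-cli -p 6380 --raw DUMP "$key" | head -c -1 \
    | redis-cli -p 6379 -x RESTORE "$key" 0
done
# Without REPLACE, RESTORE refuses keys that already exist, so newer writes are kept.
# A TTL of 0 restores every key without an expiry.
# This is slow for large datasets (several round trips and processes per key);
# redis-cli --cluster import covers the case where the target is a Redis Cluster

# 7. Restore application writes
redis-cli CLIENT UNPAUSE
```

Alternative (faster for large datasets): Use the --pipe mode of redis-cli, which streams pre-generated commands in the Redis protocol (RESP) over a single connection instead of paying a round trip per key.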
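The --pipe mode expects its input already encoded in the Redis protocol. The sketch below shows that encoding for a single command. It is only suitable for values a shell string can hold (text without NUL bytes), so it illustrates the format rather than a way to move DUMP payloads; the helper name resp_cmd is hypothetical:

```bash
#!/bin/bash
# resp_cmd ARG...
# Encodes one command in the Redis protocol (RESP) as an array of bulk strings:
#   *<argc>\r\n  then for each argument  $<byte length>\r\n<argument>\r\n
resp_cmd() {
  local LC_ALL=C   # make ${#arg} count bytes rather than characters
  local arg
  printf '*%d\r\n' "$#"
  for arg in "$@"; do
    printf '$%d\r\n%s\r\n' "${#arg}" "$arg"
  done
}

# Example: show the encoded bytes of one SET command
resp_cmd SET user:1001 alice | od -c

# Intended use (requires a running server, so left commented out):
#   resp_cmd SET user:1001 alice | redis-cli -p 6379 --pipe
```

Generating one such record per key and streaming the whole file through a single --pipe invocation is what removes the per-key round trip.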
Testing Your Restore Procedure
This is the step most teams skip. A backup you have never restored is not a backup — it is a file that might be a backup.
Monthly Restore Drill
```bash
#!/bin/bash
# /opt/scripts/restore-drill.sh (run monthly)
set -euo pipefail

# Download the most recent backup
mkdir -p /tmp/redis-restore
aws s3 cp s3://backups/redis/dump-latest.rdb /tmp/redis-restore/dump.rdb

# Start a test Redis instance with the backup already in its data directory.
# Copying the file into a running container and restarting it is unsafe:
# with the default save points, Redis saves its (empty) dataset over
# dump.rdb on shutdown.
docker run -d --name redis-restore-test -p 6381:6379 \
  -v /tmp/redis-restore:/data redis:7-alpine

# Wait until the RDB has finished loading
until [ "$(redis-cli -p 6381 INFO persistence 2>/dev/null | grep '^loading:' | tr -d '\r')" = "loading:0" ]; do
  sleep 1
done

# Verify key count
KEY_COUNT=$(redis-cli -p 6381 DBSIZE)
echo "Restore drill: $KEY_COUNT keys restored"

# Sample a few known keys
redis-cli -p 6381 GET "user:1001"
redis-cli -p 6381 HGETALL "config:global"

# Clean up
docker stop redis-restore-test
docker rm redis-restore-test

echo "Restore drill complete at $(date)"
```
What to Validate
- RDB file integrity: redis-check-rdb dump.rdb returns clean
- Key count: approximately matches production DBSIZE
- Sample key existence: spot-check 10–20 known keys that should exist
- Data correctness: verify a few known values match expected values
- Restore time: measure end-to-end RTO — does it meet your SLA?
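The "approximately matches" check on key count can be made explicit with a tolerance. A minimal sketch using integer arithmetic only; the helper name key_count_ok and the 5% threshold are arbitrary example choices:

```bash
#!/bin/bash
# key_count_ok RESTORED PRODUCTION TOLERANCE_PCT
# Succeeds when RESTORED is within TOLERANCE_PCT percent of PRODUCTION.
key_count_ok() {
  local restored=$1 prod=$2 pct=$3
  local diff=$(( restored > prod ? restored - prod : prod - restored ))
  # diff / prod <= pct / 100, rearranged to avoid division
  [ $(( diff * 100 )) -le $(( prod * pct )) ]
}

# Example: 98,500 keys restored against 100,000 in production, 5% tolerance
if key_count_ok 98500 100000 5; then
  echo "key count within tolerance"
else
  echo "key count drifted too far" >&2
fi
```

In a drill script the two counts would come from DBSIZE on the test instance and on production, and a failure should fail the drill.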
Managed Redis Backup
For AWS ElastiCache, GCP Memorystore, and Redis Cloud:
AWS ElastiCache:
- Enable automatic daily backups (snapshot window configuration)
- Snapshots stored in S3 with configurable retention (1–35 days)
- Manual snapshots: aws elasticache create-snapshot --cache-cluster-id my-cluster --snapshot-name manual-backup
- Restore by creating a new cluster from a snapshot
GCP Memorystore:
- Persistence (RDB) enabled per instance
- Manual exports to GCS: gcloud redis instances export my-instance gs://my-bucket/dump.rdb
- Restore by importing from GCS
Redis Cloud:
- Automatic backup to S3/GCS/Azure Blob, configurable frequency (every hour to daily)
- Point-in-time restore via the console or API
For managed services: use the platform's backup mechanisms rather than running custom scripts. They handle the coordination with the managed service's storage layer.
Summary
- Match backup strategy to data criticality: pure caches need no backup; audit logs need AOF shipping
- RDB backup script: BGSAVE → wait for completion → copy dump.rdb → upload to S3 → prune old backups
- AOF log shipping: ship incremental AOF files to S3 continuously (RPO < 1 minute)
- Recovery runbooks for three scenarios: node crash (restore RDB), corruption (restore from backup), FLUSHALL (restore to test instance, migrate keys)
- Test your restore procedure monthly — a backup you have never restored is not a backup
- For managed services (ElastiCache, Memorystore, Redis Cloud): use the platform's backup mechanisms
- RTO measurement: include it in monthly restore drills to verify you can recover within your SLA
Next: A-14 — Performance Benchmarking and Production Tuning — redis-benchmark, OS-level tuning, slowlog analysis, and the configuration changes that meaningfully improve throughput.