Module P-11 · 20 min read

The INFO command section by section (server, clients, memory, stats, replication, keyspace), SLOWLOG for identifying slow commands, LATENCY HISTORY, MONITOR for live command tracing, and the 10 metrics every Redis dashboard must have.

P-11 — Monitoring and Observability

Q: A Redis cache is exhibiting a 75% cache hit rate (`keyspace_hits / (keyspace_hits + keyspace_misses)`), and the monitoring dashboard shows `evicted_keys` consistently hovering above zero. Which of the following is the most likely root cause and the appropriate remediation?

Redis is under memory pressure and is actively discarding data to make room for new writes, prematurely evicting keys before their TTL expires. `maxmemory` should be increased, or the dataset size reduced. — The key metric here is `evicted_keys > 0`. Eviction is distinct from expiration. Expiration happens naturally when a TTL runs out. Eviction happens forcibly when Redis hits its `maxmemory` limit and must delete existing data to accept new data (according to the `maxmemory-policy`, like `allkeys-lru`). Forced eviction of active cache data is the primary reason cache hit rates drop below healthy levels (~90%+).

Q: An application experiences periodic latency spikes. An engineer runs the `MONITOR` command in the production Redis instance to debug the issue. Within seconds, the latency spikes become continuous, and the Redis CPU utilization hits 100%. What happened?

`MONITOR` streams every single executed command back to the client in real-time. On a high-throughput production instance, this generates massive internal processing and network I/O overhead, effectively choking the server. — `MONITOR` is incredibly useful for debugging, but it is dangerous. Redis must construct a string representation of every incoming command and its arguments, and write it to the socket of the monitoring client. This bypasses many internal optimizations and can easily cut a Redis instance's maximum throughput in half (or worse), immediately overloading a busy server.

Q: You want to find out if there are any specific, computationally expensive queries slowing down the Redis event loop. You execute `SLOWLOG GET 10`. The log shows multiple entries for `HGETALL` commands taking over 15,000 microseconds (15ms). Which of the following is the most direct way to resolve this specific bottleneck?

Redesign the application to use `HGET` for specific fields or `HSCAN` for iterative retrieval, rather than loading the entire massive Hash into memory at once with `HGETALL`. — `HGETALL` on a large Hash (especially one that has grown past the `listpack` threshold and is stored as a full hashtable) is O(N) where N is the size of the Hash. It blocks the single-threaded event loop for the duration of the retrieval and serialization. Pipelining or reading from replicas might take some load off the primary node, but neither fixes the fundamental inefficiency. The correct approach is to stop requesting the entire massive object when only a subset of data is needed (`HGET`/`HMGET`), or to paginate the read (`HSCAN`).

Who this module is for: You have a Redis instance in production but no visibility into what it is doing — what commands are slow, whether memory is healthy, how close you are to hitting limits. This module covers the full observability surface: INFO sections, SLOWLOG, LATENCY, MONITOR, and the 10 metrics every Redis dashboard must include.

The INFO Command

INFO is the primary observability tool. It returns a structured plaintext report across multiple sections. You can request all sections or a specific one:

text

INFO           → default sections (commandstats and latencystats need an explicit request or INFO all)
INFO server    → server metadata
INFO clients   → connected client counts
INFO memory    → memory usage and fragmentation
INFO stats     → command stats, hit/miss rates
INFO replication → primary/replica state
INFO cpu       → CPU time consumed
INFO keyspace  → per-database key counts and TTL stats
INFO persistence → RDB/AOF state
INFO commandstats → per-command call counts and latency
INFO latencystats → latency percentiles per command (Redis 7.0+)

INFO server

text

redis_version: 7.2.4
os: Linux 5.15.0-92 x86_64
arch_bits: 64
tcp_port: 6379
uptime_in_seconds: 864000    → 10 days of uptime
hz: 10                        → event loop frequency (affects expiry and other timers)
configured_hz: 10

uptime_in_seconds matters for fragmentation analysis — fragmentation grows over time and a very long uptime with high key churn warrants active defragmentation.

INFO clients

text

connected_clients: 48
blocked_clients: 2
tracking_clients: 0
clients_in_timeout_table: 0
maxclients: 10000
client_recent_max_input_buffer: 20480
client_recent_max_output_buffer: 0

Watch connected_clients approaching maxclients. Watch client_recent_max_output_buffer — a large output buffer means slow clients accumulating data faster than they read.

INFO stats — The Most Important Section

text

total_commands_processed: 948273841
total_connections_received: 1284723
rejected_connections: 0           → > 0 means you hit maxclients
expired_keys: 4829341             → total keys expired since start
evicted_keys: 0                   → should be 0; > 0 means memory pressure
keyspace_hits: 921847392          → commands that found their key
keyspace_misses: 26426449         → commands that returned nil
pubsub_channels: 3
pubsub_patterns: 1
instantaneous_ops_per_sec: 42841  → current throughput
instantaneous_input_kbps: 6284
instantaneous_output_kbps: 12847
total_net_input_bytes: 48293847192
total_net_output_bytes: 98473829384

Cache hit rate = keyspace_hits / (keyspace_hits + keyspace_misses)

For the example above: 921847392 / (921847392 + 26426449) = 97.2% — healthy.

Below 90%: investigate why. Causes: TTLs too short, maxmemory too small, cache warming not working, wrong key patterns.

evicted_keys > 0: Your cache is under memory pressure. Redis is actively deleting data to make room. Increase maxmemory or reduce your dataset.

rejected_connections > 0: You have hit maxclients. Increase the limit or fix connection leaks.
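The checks above can be sketched in a few lines of Python. This is a minimal illustration, not a production exporter: it assumes the INFO text has already been fetched (for example with redis-cli INFO stats), and the names parse_info and stats_findings are hypothetical helpers invented for this example.

```python
def parse_info(text):
    """Parse 'key:value' lines of INFO output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def stats_findings(info):
    """Return (hit_rate, findings) using the thresholds discussed above."""
    hits = int(info["keyspace_hits"])
    misses = int(info["keyspace_misses"])
    total = hits + misses
    hit_rate = hits / total if total else None  # undefined before any lookups
    findings = []
    if hit_rate is not None and hit_rate < 0.90:
        findings.append("hit rate below 90%")
    if int(info.get("evicted_keys", 0)) > 0:
        findings.append("evictions occurring (memory pressure)")
    if int(info.get("rejected_connections", 0)) > 0:
        findings.append("connections rejected (maxclients reached)")
    return hit_rate, findings


# Values copied from the INFO stats example above.
SAMPLE = """\
# Stats
rejected_connections:0
evicted_keys:0
keyspace_hits:921847392
keyspace_misses:26426449
"""

hit_rate, findings = stats_findings(parse_info(SAMPLE))
print(f"hit rate: {hit_rate:.1%}, findings: {findings}")
```

Because these counters are cumulative since server start, a long-running instance can hide a recent drop in hit rate; computing the ratio over deltas between two samples gives a more current picture.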

INFO replication

text

role: master
connected_slaves: 2
slave0: ip=10.0.1.50,port=6379,state=online,offset=84729384,lag=0
slave1: ip=10.0.1.51,port=6379,state=online,offset=84729382,lag=1
master_replid: a3f9c2d7e8b1...
master_repl_offset: 84729384
repl_backlog_active: 1
repl_backlog_size: 1048576    → 1MB replication backlog
repl_backlog_first_byte_offset: 83680808
repl_backlog_histlen: 1048576

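lag = seconds since the primary last received an acknowledgement (REPLCONF ACK) from that replica. Replicas acknowledge roughly once per second, so 0 or 1 is normal; a value that keeps growing means the replica is falling behind or unreachable.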

repl_backlog_size — if a replica disconnects and reconnects with an offset that is no longer in the backlog, it requires a full resync (expensive). Increase repl-backlog-size if replicas frequently reconnect: CONFIG SET repl-backlog-size 64mb.
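As a rough sketch, the slaveN lines can be turned into a per-replica report that shows how many bytes each replica trails the primary and whether that gap would still fit in the backlog. The helper names are hypothetical, and the within_backlog check is a simplification that compares the byte gap with repl_backlog_histlen; it does not reproduce the server's actual resync decision.

```python
def parse_replica_line(value):
    """Turn 'ip=...,port=...,offset=...,lag=...' into a dict of strings."""
    return dict(pair.split("=", 1) for pair in value.split(","))


def replica_report(master_offset, backlog_histlen, replica_lines):
    """For each replica, compute the byte gap to the primary and whether
    that gap is still covered by the replication backlog (approximation)."""
    report = []
    for name, value in replica_lines.items():
        fields = parse_replica_line(value)
        behind = master_offset - int(fields["offset"])
        report.append({
            "replica": name,
            "bytes_behind": behind,
            "lag_seconds": int(fields["lag"]),
            "within_backlog": behind <= backlog_histlen,
        })
    return report


# Values copied from the INFO replication example above.
REPLICAS = {
    "slave0": "ip=10.0.1.50,port=6379,state=online,offset=84729384,lag=0",
    "slave1": "ip=10.0.1.51,port=6379,state=online,offset=84729382,lag=1",
}

for row in replica_report(84729384, 1048576, REPLICAS):
    print(row)
```

A replica whose byte gap approaches the backlog size during normal operation is a hint that a brief disconnect would force a full resync.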

INFO keyspace

db0:keys=142883,expires=141204,avg_ttl=3591847

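expires vs keys ratio — if expires << keys, most of your keys have no TTL. For a cache, this is a problem: those keys never expire on their own, so memory fills until maxmemory forces eviction (or, under the noeviction policy, writes start failing).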

avg_ttl — average remaining TTL in milliseconds. If this is very short (< 60,000 = 60 seconds), keys are expiring rapidly and you may have high expiry overhead.
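A small sketch of both checks, using the db0 line shown above. The function name parse_keyspace_line is a hypothetical helper for this example.

```python
def parse_keyspace_line(line):
    """Split 'db0:keys=...,expires=...,avg_ttl=...' into (db, {field: int})."""
    db, _, rest = line.partition(":")
    fields = {k: int(v) for k, v in (p.split("=", 1) for p in rest.split(","))}
    return db, fields


db, fields = parse_keyspace_line("db0:keys=142883,expires=141204,avg_ttl=3591847")

# Share of keys that carry a TTL (guard against an empty database).
ttl_share = fields["expires"] / fields["keys"] if fields["keys"] else 0.0
# avg_ttl is reported in milliseconds.
avg_ttl_minutes = fields["avg_ttl"] / 60_000

print(f"{db}: {ttl_share:.1%} of keys have a TTL, average remaining TTL {avg_ttl_minutes:.1f} min")
```

For the example line, nearly all keys carry a TTL and the average remaining TTL is just under an hour, which is what a healthy cache typically looks like.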

INFO commandstats

text

cmdstat_get:calls=18492834,usec=92464170,usec_per_call=5.00
cmdstat_set:calls=4293847,usec=17175388,usec_per_call=4.00
cmdstat_hgetall:calls=293847,usec=29384700,usec_per_call=100.00
cmdstat_zadd:calls=1293847,usec=5175388,usec_per_call=4.00

usec_per_call — average microseconds per call (usec divided by calls). High values point to the commands that are slow per invocation. In the example, HGETALL averages 100µs against 5µs for GET, so these HGETALL calls are expensive (likely large Hashes).
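Two rankings are worth separating here: the slowest command per call and the command that consumes the most total server time. The sketch below parses the cmdstat lines from the example; parse_commandstats is a hypothetical helper, and it reads only the three fields shown above (real output carries additional fields, which are ignored).

```python
def parse_commandstats(text):
    """Parse cmdstat_* lines into {command: {calls, usec, usec_per_call}}."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("cmdstat_"):
            continue
        name, _, rest = line.partition(":")
        fields = dict(p.split("=", 1) for p in rest.split(","))
        stats[name[len("cmdstat_"):]] = {
            "calls": int(fields["calls"]),
            "usec": int(fields["usec"]),
            "usec_per_call": float(fields["usec_per_call"]),
        }
    return stats


# Lines copied from the INFO commandstats example above.
SAMPLE = """\
cmdstat_get:calls=18492834,usec=92464170,usec_per_call=5.00
cmdstat_set:calls=4293847,usec=17175388,usec_per_call=4.00
cmdstat_hgetall:calls=293847,usec=29384700,usec_per_call=100.00
cmdstat_zadd:calls=1293847,usec=5175388,usec_per_call=4.00
"""

stats = parse_commandstats(SAMPLE)
slowest_per_call = max(stats, key=lambda c: stats[c]["usec_per_call"])
most_total_time = max(stats, key=lambda c: stats[c]["usec"])
print("slowest per call:", slowest_per_call)
print("most total time: ", most_total_time)
```

In the example data HGETALL is the slowest per call, while GET accounts for the most total time simply because it is called far more often. Both views are useful when deciding what to optimise first.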

INFO latencystats (Redis 7.0+)

text

latency_percentiles_usec_get: p50=3,p99=12,p99.9=45
latency_percentiles_usec_hgetall: p50=8,p99=148,p99.9=2140

Per-command latency percentiles. p99.9 for HGETALL at 2,140µs (2ms) is a signal that some HGETALL calls are very expensive — likely on large Hashes that crossed the listpack→hashtable threshold.

SLOWLOG

SLOWLOG records commands that exceed a configurable latency threshold.

text

CONFIG SET slowlog-log-slower-than 10000   → log commands slower than 10ms (10,000µs)
CONFIG SET slowlog-max-len 128             → keep last 128 slow commands

text

SLOWLOG GET 10         → show last 10 slow commands
SLOWLOG LEN            → count of entries in the log
SLOWLOG RESET          → clear the log

text

127.0.0.1:6379> SLOWLOG GET 3
1) 1) (integer) 42          → log entry ID
   2) (integer) 1717000000  → Unix timestamp
   3) (integer) 14823       → execution time in microseconds (14.8ms)
   4) 1) "KEYS"             → the command
      2) "*"
   5) "10.0.1.100:52394"    → client address
   6) "myapp"               → client name (set with CLIENT SETNAME)

2) 1) (integer) 41
   2) (integer) 1717000000
   3) (integer) 12100
   4) 1) "HGETALL"
      2) "user:99999"       → this specific key is slow
   5) "10.0.1.100:52395"
   6) "myapp"

Common slow command findings:

KEYS * — scans all keys, blocks Redis. Replace with SCAN.
HGETALL large_hash — Hash in hashtable encoding with thousands of fields.
SMEMBERS large_set — returns all Set members at once. Use SSCAN.
SORT — sorts a List or Set; O(N+M log M). Computationally expensive.
LRANGE key 0 -1 — returns the entire List. Paginate long lists with bounded ranges (for example LRANGE key 0 99).

Set slowlog-log-slower-than 1000 (1ms) in development and testing to catch slow commands early. In production, use 10,000–20,000µs to avoid log noise.
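Once a slow log has accumulated entries, grouping them by command makes the worst offenders obvious. The sketch below assumes each entry has been converted to a tuple that mirrors the six fields of the raw reply shown above (ID, timestamp, duration in microseconds, argument list, client address, client name). That tuple shape and the function name summarize_slowlog are choices made for this example, not a client library's API.

```python
from collections import defaultdict


def summarize_slowlog(entries):
    """Group slow log entries by command name; slowest command first."""
    by_command = defaultdict(lambda: {"count": 0, "max_usec": 0, "first_args": set()})
    for _entry_id, _timestamp, duration_usec, args, _addr, _name in entries:
        summary = by_command[args[0].upper()]
        summary["count"] += 1
        summary["max_usec"] = max(summary["max_usec"], duration_usec)
        if len(args) > 1:
            summary["first_args"].add(args[1])  # usually the key or pattern
    return sorted(by_command.items(), key=lambda kv: kv[1]["max_usec"], reverse=True)


# Entries copied from the SLOWLOG GET example above.
ENTRIES = [
    (42, 1717000000, 14823, ["KEYS", "*"], "10.0.1.100:52394", "myapp"),
    (41, 1717000000, 12100, ["HGETALL", "user:99999"], "10.0.1.100:52395", "myapp"),
]

for command, summary in summarize_slowlog(ENTRIES):
    print(command, summary["count"], f'{summary["max_usec"] / 1000:.1f}ms', sorted(summary["first_args"]))
```

Keeping the first argument alongside the command name is what turns "HGETALL is slow" into "HGETALL on user:99999 is slow", which is usually the actionable finding.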

LATENCY Monitoring

Redis has a built-in latency monitoring system that tracks event-level latency — not per-command, but per internal event type (fork, AOF flush, RDB save, etc.).

text

CONFIG SET latency-monitor-threshold 100   → track events with latency > 100ms
LATENCY LATEST                             → most recent latency sample per event
LATENCY HISTORY event-name                 → historical latency for an event
LATENCY RESET [event-name]                 → clear latency history

text

127.0.0.1:6379> LATENCY LATEST
1) 1) "aof-stat"
   2) (integer) 1717000000   → timestamp
   3) (integer) 120          → latency in ms
   4) (integer) 350          → max latency seen

Event names to watch:

fork — BGSAVE/BGREWRITEAOF fork latency (high = large dataset or memory pressure)
aof-write, aof-fsync-always — AOF write and fsync latency (high = disk I/O bottleneck); aof-stat is the fstat call on the AOF file
rdb-* — RDB save events
command — command execution latency (aggregate)

MONITOR: Live Command Stream

MONITOR

MONITOR streams every command executed by every client in real time. It is invaluable for debugging unexpected behaviour ("what is sending KEYS * in production?"), but a single MONITOR client can cut throughput by more than 50% on a busy instance. Never leave MONITOR running in production.

text

127.0.0.1:6379> MONITOR
OK
1717000000.123456 [0 10.0.1.100:52394] "GET" "user:1001"
1717000000.124123 [0 10.0.1.101:52395] "HSET" "session:abc123" "lastSeen" "1717000000"
1717000000.124200 [0 10.0.1.100:52394] "SET" "cache:product:999" "..." "EX" "300"

Format: {unix_timestamp} [{db} {client_ip:port}] {command} {args...}

Use it briefly to identify which clients are issuing which commands, then disconnect immediately.
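A short capture is easiest to analyse offline. The sketch below parses lines in the format described above and counts commands per client. It assumes the capture was saved to text first; the regular expression and the use of shlex to unquote arguments are approximations that work for the simple lines shown here and may not handle every escaped binary argument.

```python
import re
import shlex
from collections import Counter

# {unix_timestamp} [{db} {client}] {quoted command and args}
MONITOR_LINE = re.compile(r"^(\d+\.\d+) \[(\d+) (\S+)\] (.+)$")


def parse_monitor_line(line):
    """Return a dict for one MONITOR line, or None if it does not match."""
    match = MONITOR_LINE.match(line.strip())
    if match is None:
        return None  # e.g. the initial "OK" line
    timestamp, db, client, rest = match.groups()
    args = shlex.split(rest)
    return {
        "ts": float(timestamp),
        "db": int(db),
        "client": client,
        "command": args[0].upper(),
        "args": args[1:],
    }


# Lines copied from the MONITOR example above.
CAPTURE = '''\
OK
1717000000.123456 [0 10.0.1.100:52394] "GET" "user:1001"
1717000000.124123 [0 10.0.1.101:52395] "HSET" "session:abc123" "lastSeen" "1717000000"
1717000000.124200 [0 10.0.1.100:52394] "SET" "cache:product:999" "..." "EX" "300"
'''

events = [e for e in map(parse_monitor_line, CAPTURE.splitlines()) if e]
per_client = Counter((e["client"], e["command"]) for e in events)
for (client, command), count in per_client.most_common():
    print(client, command, count)
```

Filtering the same events for a specific command (for example KEYS) answers the "who is sending this?" question directly from the client address.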

CLIENT LIST and CLIENT INFO

text

CLIENT LIST   → one line per connected client
CLIENT INFO   → info for the current client

text

id=42 addr=10.0.1.100:52394 laddr=10.0.0.10:6379 fd=23 name=myapp age=1234
cmd=get flags=N db=0 sub=0 psub=0 multi=-1 watch=0
qbuf=0 qbuf-free=32768 argv-mem=10 multi-mem=0
tot-mem=20512 rbs=16384 rbp=0 obl=0 oll=0 omem=0
events=r resp=2 uid=0 user=default library-name=ioredis library-ver=5.3.3

Key fields:

cmd — last command issued by this client
age — seconds since connection was established
sub — number of channels subscribed
omem — output buffer memory (large = slow client)
flags — b = blocked (e.g. BLPOP), P = Pub/Sub subscriber, S = replica connection, N = no flags

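Identify stuck clients: CLIENT LIST + filter for flags=b (or cmd=blpop) with high idle values. The idle field is the number of seconds since the client's last command; age alone is not a signal, because pooled connections are long-lived by design.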
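Filtering CLIENT LIST output by hand gets tedious with hundreds of connections, so a small parser helps. The sketch below assumes the output has been captured as text with one client per line. The thresholds, the helper names, and the second sample client are hypothetical values invented for illustration, and the idle field is added to the first line because real CLIENT LIST output includes it.

```python
def parse_client_list(text):
    """Parse CLIENT LIST output into a list of {field: value} dicts."""
    clients = []
    for line in text.splitlines():
        if line.strip():
            clients.append(dict(p.split("=", 1) for p in line.split() if "=" in p))
    return clients


def suspicious_clients(clients, max_omem=1_000_000, max_idle=300):
    """Flag clients with large output buffers or long-blocked connections."""
    flagged = []
    for client in clients:
        reasons = []
        if int(client.get("omem", 0)) > max_omem:
            reasons.append("large output buffer")
        # Lowercase 'b' means the client is waiting in a blocking operation.
        if "b" in client.get("flags", "") and int(client.get("idle", 0)) > max_idle:
            reasons.append("blocked and idle")
        if reasons:
            flagged.append((client["id"], client.get("name", ""), reasons))
    return flagged


# First line abbreviated from the example above; second line is hypothetical.
SAMPLE = """\
id=42 addr=10.0.1.100:52394 name=myapp age=1234 idle=0 flags=N cmd=get omem=0
id=57 addr=10.0.1.102:40112 name=worker age=86400 idle=900 flags=b cmd=blpop omem=0
"""

for client_id, name, reasons in suspicious_clients(parse_client_list(SAMPLE)):
    print(client_id, name, reasons)
```

A blocked worker waiting on an empty queue is often legitimate, so treat the result as a list of candidates to inspect rather than connections to kill.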

The 10 Metrics Every Redis Dashboard Must Include

#	Metric	Source	Alert Threshold
1	Cache hit rate	`keyspace_hits / (hits + misses)`	< 90%
2	Evicted keys/sec	`evicted_keys` delta	> 0
3	Memory fragmentation ratio	`mem_fragmentation_ratio`	> 1.5 or < 1.0
4	Memory used / maxmemory	`used_memory / maxmemory`	> 80%
5	Connected clients	`connected_clients`	> 80% of `maxclients`
6	Ops per second	`instantaneous_ops_per_sec`	Baseline ± 3σ
7	Replication lag	`lag` field of each `slaveN` line (INFO replication)	> 5 seconds
8	Slow commands	Newest `SLOWLOG GET` entry ID delta (`SLOWLOG LEN` stops growing at `slowlog-max-len`)	Any increase
9	Last BGSAVE status	`rdb_last_bgsave_status`	`err`
10	Rejected connections	`rejected_connections` delta	> 0

Export these metrics from INFO every 15–60 seconds to your monitoring system (Prometheus via redis_exporter, Datadog, CloudWatch, etc.).
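Several rows in the table are deltas, because the underlying INFO counters are cumulative since server start. A minimal sketch of turning two samples into per-second rates follows. The function name and the restart handling are assumptions made for this example; exporters such as redis_exporter leave this step to the monitoring backend.

```python
def counter_rates(prev, curr, interval_seconds,
                  counters=("evicted_keys", "rejected_connections", "expired_keys")):
    """Convert two cumulative INFO samples into per-second rates."""
    rates = {}
    for name in counters:
        delta = int(curr[name]) - int(prev[name])
        if delta < 0:
            # Counter went backwards: the server restarted or stats were reset,
            # so count only what has accumulated since then.
            delta = int(curr[name])
        rates[name] = delta / interval_seconds
    return rates


# Hypothetical samples taken 30 seconds apart.
previous = {"evicted_keys": "100", "rejected_connections": "0", "expired_keys": "4829341"}
current = {"evicted_keys": "160", "rejected_connections": "0", "expired_keys": "4829641"}

rates = counter_rates(previous, current, interval_seconds=30)
print(rates)
if rates["evicted_keys"] > 0:
    print("alert: keys are being evicted")
```

Alerting on the rate rather than the raw counter matters: evicted_keys stays above zero forever after a single past incident, while the rate returns to zero once the pressure is gone.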

redis-cli Monitoring Shortcuts

bash

# Live stats (refreshes every second)
redis-cli --stat

# Live latency monitoring
redis-cli --latency
redis-cli --latency-history -i 5   # print a new latency summary every 5 seconds

# Largest keys by memory
redis-cli --memkeys                 # samples keys via SCAN and reports the largest by memory usage

# Count keys matching a pattern
redis-cli --scan --pattern "session:*" | wc -l

# Big keys scan (largest key per type, by element count or string length)
redis-cli --bigkeys

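redis-cli --bigkeys walks the entire keyspace using SCAN and reports the largest key of each type, measured by element count (or string length), not bytes; use --memkeys when you need memory usage. It is generally safe to run on production because it uses cursor-based SCAN rather than the blocking KEYS *, though it still adds load on a large keyspace.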

Summary

INFO is the starting point — use INFO stats for throughput and hit rate, INFO memory for memory health, INFO replication for lag, INFO keyspace for key distribution
Cache hit rate (keyspace_hits / total) should be > 90% — below this, investigate TTLs, eviction, and cache warming
evicted_keys > 0 means memory pressure — increase maxmemory or reduce dataset
SLOWLOG GET reveals expensive commands — the most common findings: KEYS *, HGETALL on large hashes, SORT
LATENCY LATEST / LATENCY HISTORY tracks internal event latency (fork, AOF flush, RDB save)
MONITOR streams live commands — invaluable for debugging, catastrophic if left running in production
CLIENT LIST identifies slow/stuck clients by output buffer size (omem), flags, and idle time
Export INFO metrics every 15–60 seconds to your monitoring system; build dashboards around the 10 core metrics

Next: P-12 — Security: ACLs, TLS, and Network Hardening — per-user command restrictions, TLS for in-transit encryption, bind address configuration, and the most common Redis security misconfigurations.

