Module P-7 · 20 min read

Three distinct failure modes requiring different solutions — XFetch probabilistic expiry for stampede, TTL jitter for avalanche, Bloom filter pre-gating for penetration. Detection patterns and production mitigations.

P-7 — Cache Stampede, Avalanche, and Penetration

Q: A media streaming company launches a highly anticipated TV show episode. Exactly one hour after launch, their primary database CPU spikes to 100% for about 5 seconds, causing timeouts, before completely recovering. This exact pattern repeats exactly every hour. What is happening, and what is the appropriate mitigation?

The system is experiencing a Cache Stampede. The cache key for the new episode metadata has a 1-hour TTL. When it expires, thousands of concurrent requests miss the cache and simultaneously query the database before the first query can repopulate the cache. They should implement Probabilistic Early Expiry (XFetch) or a Mutex Lock. — This is the classic signature of a Cache Stampede (or Thundering Herd). It is characterized by a massive, sharp, and very brief spike in database load centered around a single popular key expiring. Because it only takes a few milliseconds for the database to return the data and for the application to cache it, the spike is short-lived. To prevent this, you must either ensure only *one* request is allowed to query the database when the key expires (Mutex Lock) or proactively recompute the key *before* it actually expires (XFetch).

Q: An application is subject to a sustained, malicious scraping attack where the attacker requests user profiles using sequentially generated, fake user IDs (e.g., `/users/9999001`, `/users/9999002`). The application uses a standard cache-aside pattern. Monitoring shows Redis operating normally, but the PostgreSQL database is overloaded with queries returning zero rows. Which mitigation strategy provides the most resilient defense with the lowest memory overhead?

Implement a Bloom Filter in front of the cache. The application checks the Bloom filter first; if it returns false, the application immediately returns a 404 without querying the cache or the database. — This scenario describes Cache Penetration. Because the requested IDs are continually unique, Null Caching (caching the "not found" state) is dangerous here—the attacker will rapidly fill your Redis memory with millions of useless "NULL" keys. A Bloom Filter is the mathematically correct solution: it is extremely memory efficient, can definitively tell you if an ID does *not* exist in the database, and allows you to reject the malicious requests with zero database load.

Q: Following a major deployment, all instances of a microservice crash and restart simultaneously. Upon restart, they run a script that bulk-loads the top 100,000 product pages from the database into Redis, assigning every key a strict TTL of `3600` seconds (1 hour). Exactly one hour later, the database CPU hits 100% and remains saturated for several minutes, causing widespread timeouts across the entire platform. What caused this, and how should the bulk-load script be modified?

Cache Avalanche. Modifying the script to assign a TTL of `jitteredTTL(3600, 0.1)` (e.g., randomizing the TTLs between 54 and 66 minutes) would spread the expirations over a 12-minute window, preventing the simultaneous database flood. — This is a Cache Avalanche. Unlike a stampede (which is focused on a *single* popular key), an avalanche occurs when a vast number of *different* keys expire at the exact same moment. This typically happens when data is bulk-loaded with identical TTLs. By adding simple mathematical jitter (randomness) to the TTL of each key during the bulk load, the expirations are smoothly distributed over time, eliminating the massive synchronized wave of database queries.

Who this module is for: Your cache is configured and working. Then, under load or at a specific moment, your database CPU spikes to 100%, response times collapse, and the system partially recovers when the cache warms up again. This is a cache failure — and there are three distinct failure modes, each requiring a different fix. Treating all three the same is why most mitigation attempts fail.

The Three Failure Modes

Failure	Trigger	Symptom	Solution
Cache stampede	One popular key expires	Sudden DB spike on one query	Probabilistic expiry, mutex lock
Cache avalanche	Many keys expire simultaneously	Sustained DB overload	TTL jitter, pre-warming
Cache penetration	Requests for non-existent keys	Sustained DB queries with empty results	Null caching, Bloom filter

Each has a different root cause, different detection signature, and different fix.

Cache Stampede (Thundering Herd)

What It Is

A single popular cache key expires. In the milliseconds before the first request can recompute and repopulate it, dozens or hundreds of concurrent requests see a miss and all race to recompute the same expensive query. The database receives N identical queries simultaneously.

Detection

text

# In your metrics:
database_query_count → spike to N×normal for ~duration_of_recompute
cache_miss_rate → spike on a specific key

The spike is narrow and short-lived — it resolves when the first request finishes recomputing and populates the cache. The next expiry cycle causes another spike.
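One way to surface the per-key signal is to count misses by key over a short window. The following is a minimal in-process sketch, not a production metrics pipeline; `MissTracker` and its method names are hypothetical, and in practice this would more likely be a labeled counter in your monitoring system.

```typescript
// Hypothetical helper: counts cache misses per key within one window.
class MissTracker {
  private counts = new Map<string, number>();

  // Call on every cache miss.
  record(key: string): void {
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  // Returns the keys with the most misses, highest first.
  top(n: number): Array<[string, number]> {
    return Array.from(this.counts.entries())
      .sort((a, b) => b[1] - a[1])
      .slice(0, n);
  }

  // Call at the end of each window (for example, every 10 seconds).
  reset(): void {
    this.counts.clear();
  }
}

const tracker = new MissTracker();
for (let i = 0; i < 500; i++) tracker.record('episode:1234'); // one hot key
tracker.record('user:1');
tracker.record('user:2');

// A single key dominating the miss count points at a stampede rather than an avalanche.
const [hotKey, hotCount] = tracker.top(1)[0];
```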

Fix 1: Probabilistic Early Expiry (XFetch Algorithm)

Instead of waiting for the key to expire, some requests recompute before expiry, with a probability that rises sharply as the key approaches expiry and scales with how long the recompute takes. This "warms" the key proactively, so the value is usually refreshed before the expiry can trigger a stampede.

typescript

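import Redis from 'ioredis';  // assumption: the examples in this module use an ioredis client

const redis = new Redis();    // connects to 127.0.0.1:6379 by default

async function getWithXFetch<T>(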
  key: string,
  ttl: number,  // desired TTL in seconds
  compute: () => Promise<T>,
  beta: number = 1.0  // higher = more aggressive early recompute
): Promise<T> {
  const data = await redis.get(key);

  if (data) {
    const { value, delta, expiry } = JSON.parse(data);
    const remainingTTL = expiry - Date.now() / 1000;

    // Recompute early with a probability that grows as remainingTTL shrinks and as delta (compute cost) grows
    // XFetch: recompute if -delta * beta * log(random()) > remainingTTL
    const shouldRecompute = -delta * beta * Math.log(Math.random()) > remainingTTL;

    if (!shouldRecompute) {
      return value as T;
    }
    // Fall through to recompute
  }

  const startTime = Date.now();
  const value = await compute();
  const delta = (Date.now() - startTime) / 1000;  // recompute time in seconds

  const payload = JSON.stringify({
    value,
    delta,
    expiry: Date.now() / 1000 + ttl,
  });

  await redis.set(key, payload, 'EX', ttl);
  return value;
}

The delta (recompute time) is stored alongside the value. Expensive queries get earlier preemptive recompute because they have a larger delta. The beta parameter controls aggressiveness — beta = 1 is the standard algorithm.
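To see how quickly the recompute probability climbs, the condition in the code above can be solved for the random draw. With U uniform on [0, 1), `-delta * beta * ln(U) > remainingTTL` holds exactly when `U < exp(-remainingTTL / (delta * beta))`, so that exponential is the per-request probability of an early recompute. The sketch below only evaluates that expression; the function name and the 0.5-second recompute time are illustrative assumptions.

```typescript
// Probability that a single request triggers an early recompute,
// derived from: -delta * beta * ln(U) > remainingTTL, with U uniform on [0, 1).
function earlyRecomputeProbability(
  delta: number,        // recompute time in seconds
  beta: number,         // aggressiveness factor
  remainingTTL: number  // seconds until expiry
): number {
  if (remainingTTL <= 0) return 1;        // at or past expiry: always recompute
  if (delta <= 0 || beta <= 0) return 0;  // no recorded cost: never recompute early
  return Math.exp(-remainingTTL / (delta * beta));
}

// Assumed example: a query that takes 0.5 s to recompute, beta = 1.
const delta = 0.5;
const probabilities = [10, 2, 1, 0.5, 0.1].map(remaining => ({
  remaining,
  p: earlyRecomputeProbability(delta, 1.0, remaining),
}));
// p is about 2e-9 at 10 s remaining, about 0.135 at 1 s, and about 0.82 at 0.1 s.
```

The probability stays negligible for most of the key's life and only becomes significant within a few multiples of `delta` before expiry, which is why a larger `delta` or `beta` moves the refresh earlier.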

Fix 2: Mutex Lock (Single Flight)

Only one request recomputes at a time. Others wait for the lock to be released, then serve from the (now-populated) cache.

typescript

async function getWithLock<T>(
  key: string,
  ttl: number,
  compute: () => Promise<T>
): Promise<T> {
  const lockKey = `lock:${key}`;

  // Try cache first
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  // Try to acquire lock (NX = only if not exists, EX = expire in 10s)
  const locked = await redis.set(lockKey, '1', 'NX', 'EX', 10);

  if (locked === 'OK') {
    try {
      // Double-check cache (another process may have populated it)
      const recheck = await redis.get(key);
      if (recheck) return JSON.parse(recheck);

      const value = await compute();
      await redis.set(key, JSON.stringify(value), 'EX', ttl);
      return value;
    } finally {
      await redis.del(lockKey);
    }
  } else {
    // Wait for the lock holder to finish, then serve from cache
    await new Promise(resolve => setTimeout(resolve, 50 + Math.random() * 50));
    return getWithLock(key, ttl, compute);  // recursive retry
  }
}

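Trade-off: A lock holder that crashes leaves the lock in place until EX expires, so waiters stall for up to the lock TTL. Set the lock TTL to exceed the maximum expected compute time: if compute() outlasts the TTL, the lock expires, a second request acquires it, and the first holder's redis.del(lockKey) then removes a lock it no longer owns. Storing a unique token as the lock value and deleting the lock only when the token still matches avoids that.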
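The Redis lock coordinates requests across processes. Within a single Node.js process, concurrent callers can additionally share one in-flight promise, so only one of them contends for the lock at all. The following is a minimal sketch of that in-process coalescing; `SingleFlight` and `run` are hypothetical names and are not part of any library used above.

```typescript
// Hypothetical in-process single-flight helper: concurrent calls for the
// same key share one promise instead of each starting their own computation.
class SingleFlight<T> {
  private inFlight = new Map<string, Promise<T>>();

  run(key: string, compute: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing;

    const promise = compute().finally(() => {
      // Clear the entry on success or failure so later calls start fresh.
      this.inFlight.delete(key);
    });
    this.inFlight.set(key, promise);
    return promise;
  }
}

let computeCalls = 0;
const flights = new SingleFlight<string>();
const slowQuery = async (): Promise<string> => {
  computeCalls++;
  return 'episode metadata';
};

// Two callers arrive before the first computation has finished.
const first = flights.run('episode:1234', slowQuery);
const second = flights.run('episode:1234', slowQuery);
// Both callers hold the same promise, and slowQuery has been invoked once.
```

In the context of this module, the `compute` argument could be a call such as `() => getWithLock(key, ttl, loader)`, so each process sends at most one lock attempt per key to Redis.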

Cache Avalanche

What It Is

Many cache keys expire at roughly the same time. If all your cached data was loaded at startup (cold start after deployment) with the same TTL, all keys expire together. The database receives a flood of queries across many different data types — not a spike on one query, but a sustained overload across all queries.

Detection

text

cache_miss_rate → sustained elevation across many different keys/endpoints
database_query_count → broad sustained spike (not narrow like stampede)
application_response_time → slow across all endpoints, not one

The signature: broad, sustained, across all endpoints simultaneously. Typically occurs ~TTL seconds after a cold start or major deployment.

Fix 1: TTL Jitter

Instead of setting all keys to the same TTL, add random jitter to spread expiry times:

typescript

function jitteredTTL(baseTTL: number, jitterFraction: number = 0.1): number {
  const jitter = baseTTL * jitterFraction;
  return Math.floor(baseTTL + (Math.random() * 2 - 1) * jitter);
}

// 5-minute TTL with ±30 seconds jitter
await redis.set(key, value, 'EX', jitteredTTL(300, 0.1));
// Sets TTL to anywhere from 270 to 330 seconds

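With 10% jitter on a 300-second base TTL, keys set at the same time expire across a 60-second spread instead of all at once. The spread scales with the base TTL (a 3600-second TTL gives a 12-minute spread), so the database load is distributed over time.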
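The effect on the database can be estimated with simple arithmetic. The sketch below computes the expiry window and the average expiration rate for a bulk load; `jitterWindow` is a hypothetical helper, and the 100,000-key figure is taken from the bulk-load scenario at the top of this module.

```typescript
// Expiry window for baseTTL ± (baseTTL * jitterFraction), and the average
// expiration rate when keyCount keys are written at the same instant.
function jitterWindow(baseTTL: number, jitterFraction: number, keyCount: number) {
  const jitter = baseTTL * jitterFraction;
  const min = baseTTL - jitter;
  const max = baseTTL + jitter;
  const spreadSeconds = max - min;
  return { min, max, spreadSeconds, avgExpiriesPerSecond: keyCount / spreadSeconds };
}

const w = jitterWindow(3600, 0.1, 100_000);
// w.min = 3240 s (54 min), w.max = 3960 s (66 min), w.spreadSeconds = 720 s
// w.avgExpiriesPerSecond is about 139, instead of 100,000 keys expiring together
```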

Fix 2: Staggered Pre-Warming

Before traffic arrives (post-deployment, post-restart), warm the cache in batches with delays between batches:

typescript

async function warmCache(userIds: string[]) {
  const batchSize = 100;
  for (let i = 0; i < userIds.length; i += batchSize) {
    const batch = userIds.slice(i, i + batchSize);

    await Promise.all(batch.map(async (userId) => {
      const data = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
      await redis.set(`user:${userId}`, JSON.stringify(data), 'EX', jitteredTTL(3600));
    }));

    // Delay between batches to avoid overwhelming the database
    if (i + batchSize < userIds.length) {
      await new Promise(resolve => setTimeout(resolve, 100));
    }
  }
}

Warm the most-accessed data first (top users, popular products), then less-popular data progressively.
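Ordering the warm-up by popularity can be as simple as sorting IDs by a recent hit count before batching. A minimal sketch, assuming hit counts are available from access logs or analytics; the `KeyStats` shape and `prioritizedBatches` name are hypothetical.

```typescript
interface KeyStats {
  id: string;
  hits: number; // assumed: recent access count from logs or analytics
}

// Orders IDs by hit count (highest first) and splits them into batches.
function prioritizedBatches(stats: KeyStats[], batchSize: number): string[][] {
  const ordered = [...stats].sort((a, b) => b.hits - a.hits).map(s => s.id);
  const batches: string[][] = [];
  for (let i = 0; i < ordered.length; i += batchSize) {
    batches.push(ordered.slice(i, i + batchSize));
  }
  return batches;
}

const batches = prioritizedBatches(
  [{ id: 'a', hits: 5 }, { id: 'b', hits: 50 }, { id: 'c', hits: 20 }],
  2
);
// The first batch holds the two most-accessed IDs; the remainder follows.
```

Each batch could then be passed to a loader like `warmCache` in turn, so the keys that absorb the most traffic are cached first.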

Fix 3: Never-Expiry + Background Refresh

For truly critical keys (homepage, global config), use no TTL and refresh in a background job:

typescript

// Background job (runs every 5 minutes)
async function refreshCriticalCache() {
  const data = await db.query('SELECT * FROM config WHERE active = true');
  await redis.set('config:global', JSON.stringify(data));
  // No EX — key never expires; refreshed proactively
}

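The key never expires, so a TTL expiry can never cause a miss. Data is at most one refresh interval stale. The key can still disappear through eviction (depending on the maxmemory policy) or a Redis restart without persistence, so readers still need a fallback path to the database.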

Cache Penetration

What It Is

Requests arrive for keys that do not exist — and will never exist. For example: requests for user IDs that are not in your database (invalid IDs, enumeration attacks, deleted users). Each request misses the cache and hits the database with a query that returns no rows.

Unlike stampede and avalanche, penetration does not resolve on its own once the cache warms: the non-existent keys never get cached (there is nothing to cache), so every request hits the database.

Detection

text

cache_miss_rate → elevated, but database queries return empty results
database_queries_with_empty_results → elevated
attack_pattern → often a sequence of incrementing or random IDs

The signature: high miss rate, but database is returning empty results (not data). Sustained, not time-bounded.

Fix 1: Null Caching

Cache the "not found" result with a short TTL:

typescript

async function getUser(userId: string) {
  const key = `user:${userId}`;

  const cached = await redis.get(key);
  if (cached !== null) {
    if (cached === 'NULL') return null;  // cached null result
    return JSON.parse(cached);
  }

  // Assumes db.query resolves to the matching row, or null when no row is found
  const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);

  if (user === null) {
    // Cache the null result for 60 seconds (limit repeated DB hits)
    await redis.set(key, 'NULL', 'EX', 60);
    return null;
  }

  await redis.set(key, JSON.stringify(user), 'EX', 3600);
  return user;
}

Limitation: An attacker can enumerate many different non-existent IDs, caching 'NULL' for each. This fills Redis with null entries. Mitigate with a short TTL (60 seconds) and rate limiting on the endpoint.
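The short TTL matters because it bounds how many null entries can be alive at once: in the worst case, where every request uses a new non-existent ID, the count is the unique-miss rate multiplied by the null TTL. A small sketch of that arithmetic; the 1,000 requests per second is an assumed attack rate for illustration, not a measured figure.

```typescript
// Upper bound on live null entries when every request uses a new, non-existent ID.
function maxLiveNullKeys(uniqueMissesPerSecond: number, nullTtlSeconds: number): number {
  return uniqueMissesPerSecond * nullTtlSeconds;
}

const withShortTtl = maxLiveNullKeys(1_000, 60);    // 60,000 keys
const withLongTtl = maxLiveNullKeys(1_000, 3_600);  // 3,600,000 keys
```

Rate limiting lowers the first factor and the short TTL lowers the second, which is why the two mitigations are used together.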

Fix 2: Bloom Filter

A Bloom filter answers "might this key exist?" with a configurable false-positive rate and zero false negatives. Check the Bloom filter before hitting the cache or database:

text

Request → Bloom filter: "does user:99999 exist?"
         → No → return 404 immediately (never hits cache or DB)
         → Maybe → try cache, then DB

Redis Stack bundles the RedisBloom module, which provides a Bloom filter implementation through the BF.* commands:

text

BF.ADD users:bloom 1001    → add user 1001 to the Bloom filter
BF.ADD users:bloom 1002
BF.EXISTS users:bloom 1001 → 1 (probably exists; false positives are possible)
BF.EXISTS users:bloom 9999 → 0 (definitely does not exist)

typescript

async function getUserWithBloom(userId: string) {
  // Check Bloom filter first (very fast, no DB hit on definite miss)
  const mightExist = await redis.call('BF.EXISTS', 'users:bloom', userId);
  if (!mightExist) return null;  // definitely not in DB

  // Cache-aside logic
  const key = `user:${userId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
  if (!user) return null;  // false positive from Bloom filter — rare

  await redis.set(key, JSON.stringify(user), 'EX', 3600);
  return user;
}

Bloom filter management: Pre-populate the filter on startup from the database, and add to it whenever a new user is created. A standard Bloom filter does not support removing items, so deleted users keep passing the filter and fall through to the cache and database. Rebuild the filter periodically if many users are deleted.

text

# Create Bloom filter with 1% false positive rate, expected 10M items
BF.RESERVE users:bloom 0.01 10000000
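For a sense of the memory involved, the textbook sizing formulas for a classic Bloom filter can be applied to those parameters. This is a sketch of the theoretical minimum only; the memory RedisBloom actually allocates may differ, and `bloomSizing` is a hypothetical helper.

```typescript
// Classic Bloom filter sizing:
//   bits per item  m/n = -ln(p) / (ln 2)^2
//   hash functions k   = (m/n) * ln 2
function bloomSizing(expectedItems: number, falsePositiveRate: number) {
  const bitsPerItem = -Math.log(falsePositiveRate) / (Math.LN2 * Math.LN2);
  const totalBits = Math.ceil(expectedItems * bitsPerItem);
  const hashFunctions = Math.round(bitsPerItem * Math.LN2);
  return { bitsPerItem, totalBits, megabytes: totalBits / 8 / 1_000_000, hashFunctions };
}

const sizing = bloomSizing(10_000_000, 0.01);
// Roughly 9.6 bits per item, about 12 MB in total, and 7 hash functions.
```

The size depends on the item count and the error rate, not on the length of the IDs, which is where the memory advantage over null caching comes from.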

Without Redis Stack, implement a Bloom filter using a Redis Bitmap (manually hash the key N times, set the corresponding bits, check all bits to test membership).
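The mechanics can be sketched in memory first. The class below keeps the bitmap in a `Uint8Array`; against Redis, each bit write would become a `SETBIT` and each bit read a `GETBIT` on a single key. The FNV-1a hash with double hashing is one common choice and an assumption here, as are all class and method names.

```typescript
// 32-bit FNV-1a over UTF-16 code units, with a seed mixed into the start value.
function fnv1a(input: string, seed: number): number {
  let hash = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

class BitmapBloomFilter {
  private bits: Uint8Array;
  private sizeInBits: number;
  private hashCount: number;

  constructor(sizeInBits: number, hashCount: number) {
    this.sizeInBits = sizeInBits;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(Math.ceil(sizeInBits / 8));
  }

  // Derives hashCount bit positions from two base hashes (double hashing).
  private positions(key: string): number[] {
    const h1 = fnv1a(key, 0);
    const h2 = (fnv1a(key, 0x9e3779b9) | 1) >>> 0; // forced odd, never zero
    const result: number[] = [];
    for (let i = 0; i < this.hashCount; i++) {
      result.push((h1 + i * h2) % this.sizeInBits);
    }
    return result;
  }

  add(key: string): void {
    for (const pos of this.positions(key)) {
      this.bits[Math.floor(pos / 8)] |= 1 << (pos % 8); // SETBIT equivalent
    }
  }

  // false means definitely absent; true means possibly present.
  mightContain(key: string): boolean {
    return this.positions(key).every(
      pos => (this.bits[Math.floor(pos / 8)] & (1 << (pos % 8))) !== 0 // GETBIT equivalent
    );
  }
}

const filter = new BitmapBloomFilter(100_000, 7);
filter.add('1001');
filter.add('1002');
const knownUser = filter.mightContain('1001'); // true: added keys always pass
```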

Summary

Cache Stampede:

Cause: one popular key expires while under high traffic
Detection: narrow spike on one query, short duration
Fix: XFetch probabilistic early expiry, or mutex lock with single-flight pattern

Cache Avalanche:

Cause: many keys expire simultaneously (same TTL set at same time)
Detection: broad sustained spike across all endpoints, typically post-deployment
Fix: TTL jitter (spread expiry times), staggered cache warming, never-expiry for critical keys

Cache Penetration:

Cause: requests for non-existent data (invalid IDs, deleted records, attacks)
Detection: high miss rate with empty DB results, sustained, not time-bounded
Fix: null caching with short TTL, Bloom filter pre-gating

Apply all three fixes proactively. Stampede and avalanche are near-certainties for any cache under serious load.

Next: P-8 — Keyspace Notifications and Event-Driven Architectures — using Redis's internal events (key expiry, deletion, write commands) as triggers for application logic.

