ReDoS mitigation, memory exhaustion payload attacks, event loop saturation runbooks, and cascade failure recovery for financial infrastructure.
Module 15 — Resiliency Runbooks & High-Load Security Defenses
What this module covers: A blockchain indexer processing 50,000 events/second is a valuable target. An attacker who can send one malicious payload that blocks the event loop for 30 seconds can effectively take the service offline. A ReDoS attack requires only an HTTP request. A memory exhaustion attack requires only a cleverly crafted JSON payload. This module covers the exact attack surfaces in high-throughput Node.js applications, the defensive patterns that prevent them, and the runbooks your team needs written down before the incident happens — because the worst time to write a runbook is during an outage.
ReDoS: Regular Expression Denial of Service
ReDoS exploits catastrophic backtracking in certain regular expression patterns. Given carefully crafted input, a backtracking regex engine can take time exponential in the input length before it concludes that there is no match.
The Vulnerable Pattern
```javascript
// This looks harmless, but the inner "+" nested inside the repeated group
// lets a run of letters be split in exponentially many ways:
const emailValidator = /^([a-zA-Z0-9])(([a-zA-Z0-9]+)|(\.))+@[a-zA-Z0-9]+\.[a-zA-Z]{2,4}$/;

// Input that causes catastrophic backtracking:
const maliciousInput = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaa@';
// The regex engine tries all possible ways to match the repeating groups
// before concluding there's no match: exponential time
```
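A rough way to see the growth is to time a failing match at a few input lengths. The sketch below is illustrative only: the helper name `timeFailedMatch` and the chosen lengths are assumptions, and absolute timings will vary by machine and V8 version.

```javascript
// Time one regex test against "aaa...a!" inputs of a given length.
// The trailing "!" guarantees the match fails, which forces the engine
// to exhaust every way of splitting the run of "a" characters.
function timeFailedMatch(pattern, length) {
  const input = 'a'.repeat(length) + '!';
  const start = process.hrtime.bigint();
  const matched = pattern.test(input);
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { length, matched, elapsedMs };
}

const nested = /^(a+)+$/; // nested quantifier
const flat = /^a+$/;      // accepts the same strings, without nesting

for (const length of [16, 20, 24]) {
  const slow = timeFailedMatch(nested, length);
  const fast = timeFailedMatch(flat, length);
  console.log(
    `n=${length} nested=${slow.elapsedMs.toFixed(2)}ms flat=${fast.elapsedMs.toFixed(2)}ms`
  );
}
```

Each extra character roughly doubles the work for the nested pattern, while the flat pattern stays linear. Keep the lengths small when experimenting, because the run blocks the process.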
Identifying Vulnerable Patterns
Vulnerable patterns share common characteristics:
- Nested quantifiers: (a+)+, (a|aa)+
- Alternation with common prefixes: (abc|abcd)+
- Overlapping quantifiers: (\w+\s*)+
```javascript
// VULNERABLE patterns (never use in hot paths with untrusted input):
/^(a+)+$/;                   // nested quantifier
/^([a-z]+)*$/;               // nested quantifier
/^(.*)(foo)(.*)(bar)(.*)$/;  // polynomial backtracking on large strings
/(\w+\s)+\w+/;               // matching whitespace-separated words

// JavaScript regexes have no possessive quantifiers or atomic groups:
// (?:pattern) is only a non-capturing group, and (?>pattern) is a syntax error.
// Lookaheads are atomic, so an atomic group can be emulated with a
// lookahead plus a backreference:
/^(?:(?=(a+))\1)+$/;         // no backtracking into the captured run
// Simpler still: remove the nesting when the pattern allows it
/^a+$/;
```
Safe Validation Patterns
```javascript
import safeRegex from 'safe-regex'; // the package exports a single default function

// For payment systems: validate wallet addresses with bounded patterns
const EVM_ADDRESS = /^0x[0-9a-fA-F]{40}$/; // exact length, no backtracking
const TX_HASH = /^0x[0-9a-fA-F]{64}$/;     // exact length, no backtracking
const AMOUNT = /^\d{1,20}$/;                // bounded digits, linear

// For UPI payment IDs:
const UPI_ID = /^[\w.\-]{1,64}@[\w.\-]{1,64}$/; // bounded, no nested quantifiers

// Test all validators against adversarial input before deploying.
// safe-regex is a heuristic (it checks star height), so treat it as a
// first-pass lint rather than a proof of safety.
console.log(safeRegex(/^(a+)+$/));   // false: flagged as unsafe
console.log(safeRegex(EVM_ADDRESS)); // true: passes the check
```
Runtime Protection: Execution Timeouts for Regex
For regexes you cannot replace, enforce time limits:
```javascript
import vm from 'node:vm';

// A setTimeout on the same thread cannot interrupt pattern.test(): the match
// runs synchronously, so the timer callback only fires after it finishes.
// vm's `timeout` option terminates the running script instead. The event loop
// is still blocked for up to timeoutMs, so keep the budget small, or move the
// match to a worker thread and terminate the worker on timeout.
function safeMatch(pattern, input, timeoutMs = 10) {
  return vm.runInNewContext('pattern.test(input)', { pattern, input }, { timeout: timeoutMs });
}

// Usage: a validation that exceeds its 10ms budget is treated as hostile input
try {
  const isValid = safeMatch(complexEmailRegex, userInput, 10);
} catch (err) {
  if (err.code !== 'ERR_SCRIPT_EXECUTION_TIMEOUT') throw err;
  logger.warn({ input: userInput.slice(0, 100) }, 'Potential ReDoS detected');
  return res.status(400).json({ error: 'Invalid input' });
}
```
JSON Payload Bombs: Memory Exhaustion via Deserialization
A crafted JSON payload can cost far more memory and CPU after parsing than its size on the wire suggests.
```javascript
// Deeply nested object: only a few KB on the wire, but any recursive code
// that walks it (validators, serializers, loggers) pays one stack frame per level
const bomb = '{"a":'.repeat(500) + 'true' + '}'.repeat(500); // 500 levels deep

// Repeated keys: JSON.parse keeps only the last value, but still parses ALL entries
const repeated = '{' + Array(100_000).fill('"key":"val"').join(',') + '}';

// Large arrays of objects: straightforward amplification
const explosion = JSON.stringify({
  data: Array(1_000_000).fill({ id: 1, value: 'x' }),
});
// About 21 MB of JSON text; parsing it allocates one million separate objects
```
Defense: Request Size Limits and Depth Limits
```javascript
import Fastify from 'fastify';

// Fastify: set request body size limit before accepting any data
const fastify = Fastify({
  bodyLimit: 1 * 1024 * 1024, // 1MB max body size
});

// For high-security endpoints: even tighter limits.
// bodyLimit is a route option in its own right, not a key under `config`.
fastify.post('/api/v2/payments', {
  bodyLimit: 10 * 1024, // 10KB max for payment endpoint
}, paymentHandler);

// Validate the depth of the already-parsed body before handlers use it
function validateDepth(obj, maxDepth = 10, currentDepth = 0) {
  if (currentDepth > maxDepth) throw new Error('JSON depth limit exceeded');
  if (obj !== null && typeof obj === 'object') {
    for (const value of Object.values(obj)) {
      validateDepth(value, maxDepth, currentDepth + 1);
    }
  }
}

fastify.addHook('preHandler', async (request) => {
  if (request.body && typeof request.body === 'object') {
    validateDepth(request.body, 10);
  }
});
```
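Depth is only one dimension of a payload's shape: a shallow but very wide body passes a depth check untouched. As a minimal sketch, assuming you also want to cap the total number of values, the walk can be made iterative and count nodes as it goes. The function name `assertJsonShape` and both default limits are illustrative, not recommendations.

```javascript
// Walk a parsed JSON value with an explicit stack (no recursion) and
// enforce caps on nesting depth and on the total number of values.
function assertJsonShape(root, { maxDepth = 10, maxNodes = 10_000 } = {}) {
  const stack = [{ value: root, depth: 0 }];
  let nodes = 0;
  while (stack.length > 0) {
    const { value, depth } = stack.pop();
    nodes += 1;
    if (nodes > maxNodes) throw new Error('JSON node limit exceeded');
    if (value !== null && typeof value === 'object') {
      // Containers are allowed at depths 0 .. maxDepth - 1
      if (depth >= maxDepth) throw new Error('JSON depth limit exceeded');
      for (const child of Object.values(value)) {
        stack.push({ value: child, depth: depth + 1 });
      }
    }
  }
  return nodes; // number of values visited
}

console.log(assertJsonShape({ amount: 100, meta: { tags: ['a', 'b'] } }));
```

Because the check runs on the parsed body, it complements bodyLimit rather than replacing it: the size limit bounds what gets parsed, and the shape check bounds what downstream code has to traverse.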
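Schema-First Validation: Reject Before Your Handler Runs
The most effective defense: let Fastify check the parsed body against a compiled ajv schema before your application code runs. Validation happens after parsing, so it complements the body size limit rather than replacing it.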
```javascript
// Fastify schema validation runs AFTER the body is parsed and BEFORE your handler
// Invalid payloads are rejected by the compiled ajv function
fastify.post('/api/v2/payments', {
  schema: {
    body: {
      type: 'object',
      maxProperties: 10,           // max 10 keys at root
      additionalProperties: false, // no unknown keys
      required: ['amount', 'senderId', 'recipientId'],
      properties: {
        amount: { type: 'integer', minimum: 1, maximum: 1_000_000_000 },
        senderId: { type: 'string', maxLength: 64 },
        recipientId: { type: 'string', maxLength: 64 },
      },
    },
  },
}, paymentHandler);

// Payloads with missing fields or out-of-range values are rejected with 400
// before the handler runs. With Fastify's default ajv options
// (removeAdditional, coerceTypes), unknown keys are stripped and compatible
// types are coerced rather than rejected; configure ajv if you need strict rejection.
// The schema bounds what reaches your handler; bodyLimit bounds what gets parsed.
```
Event Loop Blocking Attacks
Any synchronous operation on the event loop is an attack surface: if an attacker can cause your code to execute a long synchronous operation, the entire service is blocked.
JSON.parse on Large Payloads
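Even with size limits, a 1MB JSON payload takes roughly 8ms to parse synchronously. At 50K req/sec, if even 1% of requests carry 1MB payloads: 500 req/sec × 8ms = 4,000ms of blocking per second. That is four seconds of synchronous work for every second of wall-clock time, four times what a single event loop can supply (an ELU of 400% is impossible), so the loop is starved and requests queue without bound.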
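The arithmetic above generalizes to any synchronous cost. A small sketch, using the figures from the text as inputs (the function and parameter names are illustrative):

```javascript
// Seconds of synchronous parsing demanded per second of wall-clock time.
// A single event loop can supply at most 1.0.
function blockingLoad({ requestsPerSec, largeFraction, parseMs }) {
  const largePerSec = requestsPerSec * largeFraction;
  return (largePerSec * parseMs) / 1000;
}

// Highest rate of large payloads one loop could parse if it did nothing else.
function maxLargePayloadsPerSec(parseMs) {
  return 1000 / parseMs;
}

const load = blockingLoad({ requestsPerSec: 50_000, largeFraction: 0.01, parseMs: 8 });
console.log(`demand: ${load.toFixed(2)} loop-seconds per second`);
console.log(`ceiling at 8ms each: ${maxLargePayloadsPerSec(8)} payloads/sec`);
```

Any result above 1.0 means the work cannot fit on one event loop regardless of tuning; it has to be rejected, rate-limited, or moved off the main thread.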
```javascript
// Protection: move large payload parsing to worker threads
const LARGE_PAYLOAD_THRESHOLD = 50 * 1024; // 50KB

fastify.addContentTypeParser('application/json', { parseAs: 'buffer' }, async (req, body) => {
  if (body.length > LARGE_PAYLOAD_THRESHOLD) {
    // Offload to worker thread so the main event loop is not blocked by the parse
    // (parseJsonInWorker is an application helper, not a Fastify API)
    return await parseJsonInWorker(body);
  }
  return JSON.parse(body.toString('utf8'));
});
```
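The snippet above calls parseJsonInWorker without defining it. One possible shape for that helper, as a sketch only: it spawns a fresh worker per call for simplicity, whereas a production service would more likely reuse workers from a pool. Note also that the parsed result is copied back to the main thread by structured clone, which has its own main-thread cost, so offloading reduces blocking rather than eliminating it.

```javascript
const { Worker } = require('node:worker_threads');

// Worker body: decode the bytes, parse, and post the result (or the
// error message) back to the parent.
const WORKER_SOURCE = `
  const { parentPort, workerData } = require('node:worker_threads');
  try {
    const text = Buffer.from(workerData).toString('utf8');
    parentPort.postMessage({ ok: true, value: JSON.parse(text) });
  } catch (err) {
    parentPort.postMessage({ ok: false, message: err.message });
  }
`;

// Hypothetical helper: resolves with the parsed value, rejects on invalid JSON.
function parseJsonInWorker(buffer) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(WORKER_SOURCE, { eval: true, workerData: buffer });
    worker.once('message', (msg) => {
      if (msg.ok) resolve(msg.value);
      else reject(new SyntaxError(msg.message));
    });
    worker.once('error', reject);
  });
}

parseJsonInWorker(Buffer.from('{"amount":100}')).then((body) => {
  console.log(body.amount);
});
```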
Synchronous Crypto in Hot Paths
```javascript
import crypto from 'node:crypto';
import { promisify } from 'node:util';

// Both calls below run synchronously on the event loop; only the cost differs
const hash = crypto.createHash('sha256').update(data).digest('hex'); // fast (< 1ms) for small inputs
const blockingKey = crypto.scryptSync(password, salt, 64);           // DANGEROUS: slow by design

// SAFE: use the async crypto APIs for expensive operations
const scrypt = promisify(crypto.scrypt);
const key = await scrypt(password, salt, 64);
// This runs in the libuv thread pool, not on the event loop
```
Circuit Breakers: Preventing Cascade Failures
When an upstream service (database, external API, blockchain RPC node) becomes slow or unavailable, requests back up — each waiting for a timeout. This cascades: slow upstream → slow application → slow everything else → OOM.
A circuit breaker short-circuits failing calls immediately after a failure threshold, giving the upstream time to recover.
```javascript
class CircuitOpenError extends Error {
  constructor(message) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

class CircuitBreaker {
  #state = 'CLOSED'; // CLOSED = normal, OPEN = failing fast, HALF_OPEN = testing
  #failures = 0;
  #successes = 0;
  #lastFailureTime = 0;
  #FAILURE_THRESHOLD = 5;
  #SUCCESS_THRESHOLD = 2;
  #RECOVERY_TIMEOUT = 30_000; // 30s before trying again

  async execute(operation) {
    if (this.#state === 'OPEN') {
      // Check if recovery timeout has elapsed
      if (Date.now() - this.#lastFailureTime > this.#RECOVERY_TIMEOUT) {
        this.#state = 'HALF_OPEN';
        this.#successes = 0;
      } else {
        throw new CircuitOpenError('Circuit is OPEN, failing fast');
      }
    }
    try {
      const result = await operation();
      this.#onSuccess();
      return result;
    } catch (err) {
      this.#onFailure();
      throw err;
    }
  }

  #onSuccess() {
    this.#failures = 0;
    if (this.#state === 'HALF_OPEN') {
      this.#successes++;
      if (this.#successes >= this.#SUCCESS_THRESHOLD) {
        this.#state = 'CLOSED';
        logger.info('Circuit CLOSED, service recovered');
      }
    }
  }

  #onFailure() {
    this.#failures++;
    this.#lastFailureTime = Date.now();
    // A failed probe in HALF_OPEN reopens the circuit immediately
    if (this.#state === 'HALF_OPEN' || this.#failures >= this.#FAILURE_THRESHOLD) {
      this.#state = 'OPEN';
      logger.error({ failures: this.#failures }, 'Circuit OPENED, failing fast');
    }
  }

  // Manual trip for operators (see Runbook 5)
  forceOpen() {
    this.#state = 'OPEN';
    this.#lastFailureTime = Date.now();
  }

  get state() { return this.#state; }
}

// Wrap upstream calls with circuit breaker
const blockchainRpcBreaker = new CircuitBreaker();
const databaseBreaker = new CircuitBreaker();

async function getBlockFromRpc(height) {
  return blockchainRpcBreaker.execute(async () => {
    return await rpcClient.getBlockByHeight(height);
  });
}
```
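The breaker above only counts calls that reject. An upstream that hangs without failing never trips it, so each wrapped call also needs a deadline. A minimal sketch of such a wrapper follows; the names `withTimeout` and `TimeoutError` are illustrative, and passing an AbortSignal to the operation only helps if the client being called supports cancellation.

```javascript
class TimeoutError extends Error {
  constructor(ms) {
    super(`Operation timed out after ${ms}ms`);
    this.name = 'TimeoutError';
  }
}

// Reject if `operation` has not settled within `ms`. The AbortSignal lets
// clients that support cancellation (fetch, for example) release the socket.
async function withTimeout(operation, ms) {
  const controller = new AbortController();
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new TimeoutError(ms));
    }, ms);
  });
  try {
    return await Promise.race([operation(controller.signal), deadline]);
  } finally {
    clearTimeout(timer); // never leave a timer holding the process open
  }
}

// Hypothetical usage with the breaker (the 2000ms budget is an assumption):
// blockchainRpcBreaker.execute(() =>
//   withTimeout(() => rpcClient.getBlockByHeight(height), 2000));

withTimeout(async () => 'ok', 1000).then((value) => console.log(value));
```

With the deadline inside the breaker's operation, a slow upstream produces rejections that count toward the failure threshold instead of silently accumulating pending requests.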
Bulkhead Pattern: Isolating Failure Domains
If your service makes calls to multiple upstream services, a slow upstream should not exhaust the connection pool for all upstreams.
```javascript
import pg from 'pg';
import { Agent } from 'node:https'; // assumes the RPC endpoint is reached over HTTPS

// Separate connection pools per upstream (bulkhead isolation)
const pools = {
  postgresql: new pg.Pool({ max: 30 }),       // 30 connections for the main DB
  analyticsDb: new pg.Pool({ max: 10 }),      // 10 separate connections for analytics
  externalRpc: new Agent({ maxSockets: 20 }), // 20 sockets to blockchain RPC
};

// If analyticsDb is slow, it can exhaust its 10 connections
// but never affects the 30 PostgreSQL connections
// The main write path is isolated from analytics slowness
```
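Not every upstream comes with a pool to size. For those, the same isolation can be approximated with a concurrency limiter per upstream. The class below is a sketch under that assumption; the name `Bulkhead` and its option names are illustrative.

```javascript
// Caps concurrent calls into one upstream. Callers beyond the cap wait in a
// bounded queue; once the queue is full, new callers are rejected immediately.
class Bulkhead {
  #active = 0;
  #queue = [];

  constructor({ maxConcurrent, maxQueued }) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueued = maxQueued;
  }

  async run(operation) {
    if (this.#active >= this.maxConcurrent) {
      if (this.#queue.length >= this.maxQueued) throw new Error('Bulkhead full');
      // Wait until a finishing call hands its slot over
      await new Promise((resolve) => this.#queue.push(resolve));
    } else {
      this.#active += 1;
    }
    try {
      return await operation();
    } finally {
      const next = this.#queue.shift();
      if (next) next();       // transfer the slot to the oldest waiter
      else this.#active -= 1; // or free it
    }
  }

  get active() { return this.#active; }
}

// One limiter per upstream, sized independently (values are assumptions)
const rpcBulkhead = new Bulkhead({ maxConcurrent: 20, maxQueued: 100 });
rpcBulkhead.run(async () => 'block data').then((result) => console.log(result));
```

Rejecting when the queue is full is the important part: an unbounded wait queue just moves the memory pressure from the upstream's sockets into your own process.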
The Five Runbooks
Every team running Node.js in production needs these five runbooks written before they need them.
Runbook 1: Event Loop Saturation
Symptoms: ELU > 0.90, high latency, db_pool_waiting_count > 0, and one CPU core pinned by the Node.js process while overall host CPU looks idle.
```bash
# Step 1: Confirm ELU via metrics
# ELU metric: nodejs_event_loop_utilization > 0.90

# Step 2: Generate CPU profile to find the blocking function
# --on-port runs the load test once the server is listening, then stops the
# server cleanly so clinic can write the flamegraph (a plain kill may not)
clinic flame --on-port 'autocannon -c 50 -d 20 http://localhost:$PORT/health' -- node indexer.js
# Opens flamegraph: look for wide plateau in non-I/O functions

# Step 3: If production emergency, reduce load
# Scale up replicas immediately (buy time)
kubectl scale deployment indexer --replicas=16

# Step 4: Identify and fix
# Common causes: JSON.parse in hot path, sync crypto, regex on large input
# Fix: move to worker thread or replace with async equivalent
```
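Step 1 assumes an ELU metric already exists. If it does not, Node's perf_hooks module exposes the raw numbers. The sketch below samples utilization over one-second windows; the helper name `createEluSampler` and the console warning are illustrative stand-ins for whatever metrics client the service uses.

```javascript
const { performance } = require('node:perf_hooks');

// Returns a function that reports event loop utilization (0..1) for the
// interval since it was last called.
function createEluSampler() {
  let previous = performance.eventLoopUtilization();
  return function sample() {
    const current = performance.eventLoopUtilization();
    const { utilization } = performance.eventLoopUtilization(current, previous);
    previous = current;
    return utilization;
  };
}

// Example wiring: warn when a 1-second window exceeds the runbook threshold.
// unref() keeps the sampling timer from holding the process open.
const sampleElu = createEluSampler();
setInterval(() => {
  const elu = sampleElu();
  if (elu > 0.9) console.warn(`ELU ${elu.toFixed(2)} exceeds 0.90`);
}, 1000).unref();
```

Sampling over windows matters: the cumulative value since process start smooths out exactly the spikes this runbook is meant to catch.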
Runbook 2: Memory Leak
Symptoms: Heap memory growing continuously over hours, GC running but memory not dropping.
```bash
# Step 1: Confirm via heap metric
# nodejs_heap_used_bytes growing without leveling off

# Step 2: Take two heap snapshots
# (requires a signal handler configured at startup, for example
#  node --heapsnapshot-signal=SIGUSR2 indexer.js)
kill -USR2 $PID   # snapshot 1
sleep 300
kill -USR2 $PID   # snapshot 2

# Step 3: Load snapshots in Chrome DevTools
# Memory tab → Load heap snapshot → Switch to Comparison view
# Sort by Delta (objects that increased between snapshots)
# Common leak sources: EventEmitter listeners, Map/Set without eviction, closures

# Step 4: Emergency mitigation while fix is deployed
# PM2: trigger an automatic restart before OOM
# pm2 start indexer.js --max-memory-restart 2G
```
Runbook 3: Database Connection Pool Exhaustion
Symptoms: db_pool_waiting_count > 0, P99 latency spike to connectionTimeoutMillis.
```bash
# Step 1: Confirm via metrics
# db_pool_waiting_count > 0

# Step 2: Check what's holding connections
# In PostgreSQL:
psql -c "SELECT pid, state, query_start, left(query,100)
         FROM pg_stat_activity
         WHERE state IN ('active','idle in transaction')
         ORDER BY query_start;"

# Step 3: Terminate sessions stuck idle in transaction
# (state_change records when the session entered its current state)
psql -c "SELECT pg_terminate_backend(pid)
         FROM pg_stat_activity
         WHERE state = 'idle in transaction'
           AND state_change < NOW() - INTERVAL '30s';"

# Step 4: Temporary relief
# Increase pool size if database can handle more connections
# Raise max in the pool configuration (for example max: 60) and roll it out
# with a zero-downtime reload (pm2 reload); the new size only applies to
# freshly started processes

# Step 5: Find root cause
# Was there a spike in traffic? → Size pool for peak
# Did a query get slow? → Check EXPLAIN ANALYZE for plan regression
```
Runbook 4: Kafka Consumer Lag Spike
Symptoms: kafka_consumer_lag_total > 100_000, analytics/notifications delayed.
```bash
# Step 1: Confirm and measure
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group transaction-processor --describe
# Shows lag per partition

# Step 2: Identify the slow partition (hotspot?)
# If one partition has 80% of total lag: partition key imbalance
# If all partitions have equal lag: consumer is generally too slow

# Step 3: Scale consumers
kubectl scale deployment analytics-service --replicas=12
# Kafka will rebalance partitions to new consumers
# (consumers beyond the partition count sit idle)

# Step 4: Check consumer processing time
# If avg processing > 10ms per message at 50K msg/sec: can't keep up
# Fix: batch processing, optimize DB writes (use bulk INSERT)

# Step 5: If lag is from historical event replay (deployment with fromBeginning):
# Reset consumer offset to current. This skips the unprocessed backlog, and the
# consumer group must be stopped while the reset runs
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --reset-offsets --to-latest \
  --group transaction-processor --topic transactions --execute
```
Runbook 5: Service Cascade Failure
Symptoms: One upstream service goes down, your service starts failing, downstream services start failing.
```bash
# Step 1: Identify the failing upstream
# Check circuit breaker state metrics: circuit_state{service="blockchain-rpc"} = OPEN

# Step 2: Verify circuit breaker is isolating the failure
# If circuit is OPEN: failing fast → good, cascade is contained
# If circuit is CLOSED but service is slow: trigger manual circuit open
# In application code: blockchainRpcBreaker.forceOpen();
# (the breaker needs a forceOpen() method, exposed via an admin endpoint)

# Step 3: Enable degraded mode
# Return cached/stale data while upstream recovers
# Defer non-critical operations (analytics, notifications)
# Prioritize critical path (transaction validation, DB writes)

# Step 4: Monitor upstream recovery
# Watch circuit breaker half-open probes
# circuit_state will transition: OPEN → HALF_OPEN → CLOSED when upstream recovers

# Step 5: Gradually restore traffic
# Circuit breaker handles this automatically via HALF_OPEN state
# Verify: watch success rate of operations through the recovered circuit
```
Supply Chain Security: Protecting the Module Graph
Node.js applications import hundreds of transitive dependencies. Any one of them could be compromised.
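```bash
# Audit known vulnerabilities
npm audit
yarn npm audit   # Yarn 2+ equivalent

# Check for suspicious packages (typosquatting, malicious injections)
# socket.dev analyzes package behavior; confirm the current CLI package name
# and subcommand in Socket's documentation before relying on this line
npx @socket.dev/cli check

# Lock exact versions in production
# Never use ^ or ~ in production package.json
# Exact versions only pin direct dependencies: commit the lockfile and
# install with npm ci so transitive dependencies are pinned too
```

```json
{
  "dependencies": {
    "fastify": "4.28.1",
    "pg": "8.12.0"
  }
}
```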
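```javascript
// Restrict what the process can do at runtime using the Node.js Permission Model
// (experimental in v20 behind --experimental-permission; newer releases use --permission)
// node --experimental-permission --allow-fs-read=/app indexer.js
// Any attempt to read other files, write to disk, spawn child processes,
// or start workers → throws an error with code ERR_ACCESS_DENIED
// This limits what a compromised dependency can do:
// - Read secrets from files outside the allowed paths
// - Write files to disk (ransomware)
// - Spawn child processes
// The v20 model does not restrict network access or environment variables;
// enforce egress rules at the network layer (firewall, Kubernetes NetworkPolicy)
```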
Summary
| Concept | Key Takeaway |
|---|---|
| ReDoS | Catastrophic backtracking in nested quantifier regexes. Use safe-regex to audit. Enforce a 10ms budget on untrusted input with a timeout that can actually interrupt the match (a same-thread setTimeout cannot). |
| JSON bombs | Deep nesting or large arrays amplify memory. Set bodyLimit, validate depth, use schema-first ajv validation. |
| Sync crypto | crypto.scryptSync and similar block the event loop. Always use async variants for expensive operations. |
| Circuit breaker | CLOSED → OPEN after N failures. Fail fast. HALF_OPEN after recovery timeout. Prevents cascade. |
| Bulkhead | Separate connection pools per upstream. Slow analytics DB cannot exhaust main DB pool. |
| Event loop runbook | ELU > 0.90 → flame graph → move blocking code to worker_threads or async. |
| Memory leak runbook | Two heap snapshots → comparison view → identify delta object type → trace to root. |
| Pool exhaustion runbook | waiting_count > 0 → kill idle-in-transaction → increase pool size → find root cause. |
| Kafka lag runbook | Scale consumers or reset offsets if from historical replay. |
| Cascade runbook | Verify circuit open → enable degraded mode → monitor HALF_OPEN recovery → restore. |
| Supply chain | Exact versions plus a committed lockfile in production. npm audit. Node.js Permission Model to restrict filesystem, child-process, and worker access at runtime. |
The system is secure and resilient. The remaining modules cover the advanced Node.js features that eliminate entire categories of deployment, security, and performance problems: zero-trust runtime isolation, single executable deployment, native Rust integration, the Web Standards shift, and automated post-mortem diagnostics.
Next: Module 16 — Zero-Trust Runtime Architecture & The Node.js Permission Model →