PM2 graceful reload, pg-pool sizing mathematics, Redis Cluster client configuration, and Kubernetes liveness vs readiness probe implementation.
Module 13 — Advanced Connection Pooling & Process Management
What this module covers: Your Node.js service processes 50,000 events/second. Every event requires a database write. That means 50,000 PostgreSQL queries/second. A naive implementation opens a new connection for every query — connection establishment costs 5–10ms and PostgreSQL cannot handle 50,000 simultaneous connections. Connection pooling is the mechanism that multiplexes thousands of queries over dozens of connections. This module covers pg-pool sizing mathematics, Redis connection management, PM2 cluster lifecycle, graceful shutdown sequences, and the Kubernetes liveness vs readiness probe distinction that prevents traffic routing before your service is ready.
Why Connection Pooling Is Not Optional
Every database connection is expensive:
- PostgreSQL: spawns a new OS process per connection (~5MB RAM, ~10ms setup cost)
- Redis: opens a new TCP socket per connection (~0.5ms setup, persistent)
- MongoDB: opens a new socket, negotiates auth (~2ms)
Without pooling at 50,000 queries/second:
- 50,000 new PostgreSQL connections/second × 10ms = 500 seconds of connection overhead per second of operation — impossible
- At any instant, thousands of concurrent queries would require thousands of simultaneous PostgreSQL processes: 5,000 connections × 5MB = 25GB RAM just for PostgreSQL process overhead
Connection pooling solves this by reusing a small set of long-lived connections for many short-lived queries.
pg-pool: PostgreSQL Connection Pool
pg-pool (used internally by the pg package) maintains a pool of PostgreSQL connections.
The Mathematics of Pool Sizing
pool_size = (max_queries_per_second × avg_query_duration_seconds)
At 5,000 queries/second with 10ms average query duration:
pool_size = 5,000 × 0.010 = 50 connections
This is Little's Law applied to database connections. If you need more throughput or have longer queries, you need more connections — up to the database's limit.
javascriptimport pg from 'pg'; const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL, // Pool size: sized to your throughput target // = (target_queries_per_sec × avg_query_duration_sec) max: 50, // Connection idle timeout: close idle connections after 30s // Prevents accumulating connections that just hold resources idleTimeoutMillis: 30_000, // Connection acquisition timeout: fail fast if pool is exhausted // Without this, queries wait forever when pool is full connectionTimeoutMillis: 5_000, // Minimum connections to keep warm // Prevents cold-start latency for idle services min: 5, // Disconnect and reconnect after N transactions // Prevents long-lived connections accumulating server-side state maxUses: 7_500, // Connection health check on acquisition allowExitOnIdle: false, });
Monitoring Pool State
javascript// Monitor pool utilization in production const poolUtilizationGauge = new Gauge({ name: 'db_pool_utilization_ratio', help: 'DB pool utilization (active / max)', }); const poolWaitingGauge = new Gauge({ name: 'db_pool_waiting_count', help: 'Queries waiting for a connection', }); setInterval(() => { // pg-pool exposes pool state via these properties const total = pool.totalCount; // total connections (idle + active) const idle = pool.idleCount; // available connections const waiting = pool.waitingCount; // queries queued for a connection poolUtilizationGauge.set((total - idle) / pool.options.max); poolWaitingGauge.set(waiting); if (waiting > 0) { logger.warn({ waiting, total, idle }, 'DB pool exhausted — queries queuing'); } }, 5_000);
Alert thresholds:
db_pool_utilization_ratio > 0.8: pool approaching capacitydb_pool_waiting_count > 0: pool exhausted — immediate action neededdb_pool_waiting_count > 50: serious saturation — likely causing timeouts
Pool Exhaustion: The Silent Performance Killer
When all pool connections are busy, pool.connect() waits. At high throughput, this creates a latency cascade: query takes 15ms → pool slot held for 15ms → at 50 queries/pool × 15ms = 750ms of total query time per second → pool fully booked at 3,333 queries/second.
If your target is 5,000 queries/second:
required_pool_size = ceil(5,000 × 0.015) = 75 connections
Insufficient pool size manifests as:
- P99 latency spikes (queries waiting for a connection)
connectionTimeoutMilliserrors (pool exhausted for too long)- Not as high CPU or network errors — the service appears healthy from the outside while queries queue silently
Connection Lifecycle: maxUses and Health Checks
Long-lived database connections accumulate state: temporary tables (from past SET commands), prepared statements, and sometimes server-side session state.
maxUses: 7500 causes pg-pool to destroy and recreate a connection after 7,500 uses. This prevents state accumulation without the overhead of creating a new connection per query.
javascript// Verify pool connections are healthy before use const pool = new pg.Pool({ ...config, // Called on each connection before it is returned from the pool // If this throws, the connection is discarded and a new one created }); // Manual health check on startup (wait for pool to be ready before serving traffic) async function waitForDatabase(retries = 10) { for (let i = 0; i < retries; i++) { try { await pool.query('SELECT 1'); logger.info('Database connection pool ready'); return; } catch (err) { logger.warn({ attempt: i + 1, err: err.message }, 'Database not ready, retrying...'); await new Promise(r => setTimeout(r, 2_000)); } } throw new Error('Database connection failed after retries'); }
Redis: Connection Management with ioredis
Redis connections are cheaper than PostgreSQL but still need pooling for high-throughput usage.
javascriptimport Redis from 'ioredis'; // Single Redis connection (ioredis multiplexes commands over one connection) // ioredis handles pipelining automatically — multiple commands batched per RTT const redis = new Redis({ host: process.env.REDIS_HOST, port: 6379, password: process.env.REDIS_PASSWORD, // Auto-reconnect configuration retryStrategy: (times) => { if (times > 10) return null; // give up after 10 retries return Math.min(times * 100, 3_000); // exponential backoff, max 3s }, // Timeout for commands (prevents hanging on slow Redis) commandTimeout: 5_000, // Keep-alive to prevent idle connection drops keepAlive: 30_000, // Lazy connect: don't connect until first command lazyConnect: true, }); // Cluster mode: ioredis handles slot routing automatically const redisCluster = new Redis.Cluster([ { host: 'redis-node-1', port: 6379 }, { host: 'redis-node-2', port: 6379 }, { host: 'redis-node-3', port: 6379 }, ], { scaleReads: 'slave', // route read commands to replicas maxRedirections: 16, // follow up to 16 cluster redirections retryDelayOnClusterDown: 300, });
Pipelining: Batching Redis Commands
javascript// BAD: N network round trips for N commands for (const tx of transactions) { await redis.set(`tx:${tx.hash}`, JSON.stringify(tx), 'EX', 3600); } // N × RTT = high latency for large batches // GOOD: pipeline — all commands sent in one batch const pipeline = redis.pipeline(); for (const tx of transactions) { pipeline.set(`tx:${tx.hash}`, JSON.stringify(tx), 'EX', 3600); } await pipeline.exec(); // 1 RTT regardless of batch size
PM2: Process Management in Production
PM2 manages Node.js processes: starts them, restarts them on crash, monitors memory usage, and provides zero-downtime reload.
Ecosystem Configuration
javascript// ecosystem.config.js module.exports = { apps: [{ name: 'blockchain-indexer', script: 'dist/app.js', // Cluster mode: spawn N workers instances: 'max', // use all CPU cores exec_mode: 'cluster', // vs 'fork' for single instance // Memory limit: restart if worker exceeds this max_memory_restart: '2G', // Environment env_production: { NODE_ENV: 'production', UV_THREADPOOL_SIZE: '16', }, // Graceful shutdown kill_timeout: 5_000, // wait 5s for graceful shutdown before SIGKILL listen_timeout: 10_000, // wait 10s for app to start listening // Log management log_file: '/var/log/indexer/combined.log', error_file: '/var/log/indexer/error.log', log_date_format: 'YYYY-MM-DD HH:mm:ss', // Restart policy autorestart: true, watch: false, // don't watch files in production max_restarts: 10, // stop restarting after 10 crashes in a row min_uptime: '10s', // must be up for 10s to count as successful start }] };
Zero-Downtime Reload vs Restart
bash# restart: SIGKILL all workers immediately, start fresh # Causes downtime pm2 restart blockchain-indexer # reload: rolling restart — new worker starts, accepts traffic, # old worker receives SIGTERM and drains in-flight requests # Zero downtime pm2 reload blockchain-indexer
How reload works:
- PM2 starts a new worker instance
- New worker starts, begins accepting connections (via
SO_REUSEPORT) - PM2 sends
SIGTERMto old worker - Old worker: stops accepting new connections, finishes in-flight requests
- Old worker exits cleanly
- Repeat for each worker in the cluster
This is why implementing graceful shutdown correctly is essential for pm2 reload to work.
Graceful Shutdown: The Complete Sequence
A Node.js service has multiple resources that must be closed in the correct order on shutdown.
javascript// The complete graceful shutdown sequence async function gracefulShutdown(signal) { logger.info({ signal }, 'Shutdown signal received'); // Step 1: Stop accepting new connections // This prevents new requests from starting while we drain await new Promise((resolve, reject) => { httpServer.close((err) => err ? reject(err) : resolve()); }); logger.info('HTTP server stopped accepting connections'); // Step 2: Drain in-flight Kafka consumers // Wait for currently-processing messages to complete before closing await kafkaConsumer.disconnect(); logger.info('Kafka consumer disconnected'); // Step 3: Drain active database queries // Wait for currently-running queries to complete // pg-pool.end() waits for all connections to become idle, then closes them await pool.end(); logger.info('Database pool closed'); // Step 4: Close Redis connection await redis.quit(); logger.info('Redis connection closed'); // Step 5: Flush pending logs and metrics await logger.flush(); // Step 6: Exit logger.info('Graceful shutdown complete'); process.exit(0); } // Handle shutdown signals process.on('SIGTERM', () => gracefulShutdown('SIGTERM')); process.on('SIGINT', () => gracefulShutdown('SIGINT')); // Handle unhandled rejections — log and exit process.on('unhandledRejection', (reason, promise) => { logger.error({ reason, promise }, 'Unhandled Promise rejection'); gracefulShutdown('unhandledRejection').catch(() => process.exit(1)); }); // Hard timeout: if graceful shutdown takes too long, force exit const SHUTDOWN_TIMEOUT = 15_000; setTimeout(() => { logger.error('Graceful shutdown timed out — forcing exit'); process.exit(1); }, SHUTDOWN_TIMEOUT).unref(); // .unref() prevents this timer from keeping process alive
Kubernetes: Liveness vs Readiness Probes
Kubernetes uses two types of health probes with critically different semantics.
Liveness probe: "Is this pod still alive and should it be restarted?"
- If liveness fails: Kubernetes kills and restarts the pod
- Use for: detecting deadlocks, infinite loops, hung processes
- Be conservative: don't check dependencies (database) in liveness
Readiness probe: "Is this pod ready to receive traffic?"
- If readiness fails: Kubernetes removes the pod from load balancer rotation (but does NOT restart it)
- Use for: checking database connectivity, cache warmup, dependency availability
- A pod can be live but not ready (database is down → no traffic, but no restart needed)
javascript// Health check endpoints fastify.get('/healthz/live', async (req, reply) => { // Liveness: only check if the process is alive and not deadlocked // This should always return 200 unless the process itself is broken reply.code(200).send({ status: 'alive', pid: process.pid }); }); fastify.get('/healthz/ready', async (req, reply) => { // Readiness: check all dependencies const checks = await Promise.allSettled([ pool.query('SELECT 1'), // database redis.ping(), // cache kafkaProducer.isConnected() ? Promise.resolve() : Promise.reject(new Error('Kafka disconnected')), ]); const failed = checks .filter(c => c.status === 'rejected') .map(c => (c as PromiseRejectedResult).reason.message); if (failed.length > 0) { reply.code(503).send({ status: 'not_ready', failures: failed }); return; } reply.code(200).send({ status: 'ready' }); });
yaml# kubernetes/deployment.yaml spec: containers: - name: blockchain-indexer livenessProbe: httpGet: path: /healthz/live port: 3000 initialDelaySeconds: 10 # wait 10s before first check periodSeconds: 10 # check every 10s failureThreshold: 3 # restart after 3 consecutive failures readinessProbe: httpGet: path: /healthz/ready port: 3000 initialDelaySeconds: 5 # start checking readiness after 5s periodSeconds: 5 # check every 5s failureThreshold: 2 # remove from rotation after 2 failures successThreshold: 1 # add back after 1 success
The startup scenario:
t=0s: Pod starts, Node.js begins loading modules
t=5s: First readiness check: database pool connecting → 503 → not in rotation
t=10s: Database pool ready, Redis connected → 200 → added to rotation
t=10s: First liveness check: process alive → 200 → no restart
If database goes down at t=300s:
t=305s: Readiness fails → removed from rotation (no new traffic)
t=305s: Liveness still passes (process is alive, just can't reach DB)
t=310s: Liveness still passes → pod NOT restarted
t=350s: Database recovers → readiness passes → back in rotation
(Zero pod restarts for a transient database outage)
Production Incident: Pool Exhaustion Masking as Latency Spike
Context: A UPI payment service with max: 20 connections in the pool, targeting 3,000 queries/second at 8ms average query time.
The math check no one did:
required_pool = 3,000 × 0.008 = 24 connections needed
configured_pool = 20 connections → undersized by 17%
What happened: During normal operation, the service handled load because average query time was actually 6ms (not 8ms as designed for). Required pool: 3,000 × 0.006 = 18 connections. The pool of 20 had a 10% safety margin.
During a marketing push, traffic doubled to 6,000 queries/second. Required pool: 6,000 × 0.006 = 36 connections. Available: 20.
Pool exhaustion onset: when waiting_count > 0
At 6,000 qps with 20 connections at 6ms each:
Pool throughput: 20 / 0.006 = 3,333 queries/second max
Deficit: 6,000 - 3,333 = 2,667 queries/second queuing
After 1 second: 2,667 queries waiting
After 5 seconds: 13,335 queries waiting
connectionTimeoutMillis = 5,000ms → all queued queries start failing with timeout
Symptom: p99 latency jumped from 12ms to 5,200ms (the connection timeout). CPU was at 30%. Database was at 25% capacity. The bottleneck was invisible without monitoring db_pool_waiting_count.
The fix:
javascript// Correctly sized pool for the traffic target const pool = new pg.Pool({ max: 50, // sized for 6,000 qps at 8ms = 48 connections + buffer connectionTimeoutMillis: 2_000, // fail fast (2s not 5s) idleTimeoutMillis: 30_000, }); // Added alerting if (pool.waitingCount > 0) { alert(`DB pool exhausted: ${pool.waitingCount} queries waiting`); }
Additionally, connectionTimeoutMillis was reduced from 5,000ms to 2,000ms. The long timeout was hiding the exhaustion — queries waited 5 seconds before failing, making the P99 look catastrophic while masking the root cause (pool size).
Summary
| Concept | Key Takeaway |
|---|---|
| Pool sizing | pool_size = target_qps × avg_query_duration_sec. Recheck when traffic grows. |
db_pool_waiting_count | First metric to add. Any waiting = pool undersized for current load. |
connectionTimeoutMillis | Set to 2–3s. Long timeouts hide exhaustion and make P99 look catastrophic. |
maxUses | Recycle connections after N uses to prevent server-side state accumulation. |
| ioredis pipelining | Batch Redis commands in one RTT. Use pipeline() for multi-key operations. |
PM2 pm2 reload | Rolling restart with zero downtime. Requires correct graceful shutdown implementation. |
| Graceful shutdown | Stop accepting → drain Kafka → drain DB pool → close Redis → exit. Hard timeout as fallback. |
| Liveness probe | Is the process alive? Check: process responsiveness only. Never check dependencies. |
| Readiness probe | Is the service ready for traffic? Check: database, Redis, Kafka. Fail gracefully during startup. |
pool.end() | Waits for all active queries to complete before closing connections. Safe for shutdown. |
The connection layer determines capacity. Module 14 covers the other end of the scale — ultra-lightweight V8 isolates for edge deployments that need zero cold starts and globally distributed intake proxies.