Module A-14·25 min read

PM2 graceful reload, pg-pool sizing mathematics, Redis Cluster client configuration, and Kubernetes liveness vs readiness probe implementation.

Module 13 — Advanced Connection Pooling & Process Management

Q: When configuring a Kubernetes deployment for a Node.js service, what is the critical difference between a liveness probe and a readiness probe, and what should be checked in each?

Liveness checks if the process is alive (and restarts it if not) and should NEVER check dependencies. Readiness checks if the pod can accept traffic (including checking dependencies) and removes it from the load balancer if they fail. — Liveness probes determine if the pod is fundamentally broken and needs a hard restart (e.g., deadlocked). They should only check if the process itself is responsive; checking external dependencies like a database can cause a cascading failure where a transient DB issue restarts all your healthy application pods. Readiness probes determine if the pod can actively serve traffic, so they *must* check external dependencies. If readiness fails, Kubernetes simply stops sending traffic to that specific pod until the dependencies recover, without restarting the pod.

Q: A Node.js service has an average database query duration of 15ms. During a marketing event, traffic is expected to spike to 4,000 database queries per second. According to Little's Law applied to pool sizing, what is the mathematical minimum PostgreSQL connection pool size required to handle this throughput without queuing?

60 connections — The required pool size is calculated by multiplying the target queries per second by the average query duration in seconds. For 4,000 queries per second with a 15ms (0.015s) duration, the calculation is 4,000 × 0.015 = 60 connections. Any pool size smaller than this will result in queries queuing up and waiting for an available connection, drastically spiking P99 latency.

Q: Why is it crucial to set a relatively short `connectionTimeoutMillis` (e.g., 2,000ms) on the PostgreSQL connection pool rather than a long default (e.g., 30 seconds)?

Long timeouts hide pool exhaustion by letting queries queue silently, making P99 latency look catastrophic while masking the root cause of an undersized pool. — When the pool is undersized for the current traffic load, queries enter a waiting queue. If `connectionTimeoutMillis` is very long, these queries sit in memory for seconds before finally failing, making the service's P99 latency artificially massive (e.g., 5 seconds). A shorter timeout fails fast, accurately surfacing the pool exhaustion error immediately rather than masking it as an unexplainable latency spike.

What this module covers: Your Node.js service processes 50,000 events/second. Every event requires a database write. That means 50,000 PostgreSQL queries/second. A naive implementation opens a new connection for every query — connection establishment costs 5–10ms and PostgreSQL cannot handle 50,000 simultaneous connections. Connection pooling is the mechanism that multiplexes thousands of queries over dozens of connections. This module covers pg-pool sizing mathematics, Redis connection management, PM2 cluster lifecycle, graceful shutdown sequences, and the Kubernetes liveness vs readiness probe distinction that prevents traffic routing before your service is ready.

Why Connection Pooling Is Not Optional

Every database connection is expensive:

PostgreSQL: spawns a new OS process per connection (~5MB RAM, ~10ms setup cost)
Redis: opens a new TCP socket per connection (~0.5ms setup, persistent)
MongoDB: opens a new socket, negotiates auth (~2ms)

Without pooling at 50,000 queries/second:

50,000 new PostgreSQL connections/second × 10ms = 500 seconds of connection overhead per second of operation — impossible
At any instant, thousands of concurrent queries would require thousands of simultaneous PostgreSQL processes: 5,000 connections × 5MB = 25GB RAM just for PostgreSQL process overhead

Connection pooling solves this by reusing a small set of long-lived connections for many short-lived queries.

`pg-pool`: PostgreSQL Connection Pool

pg-pool (used internally by the pg package) maintains a pool of PostgreSQL connections.

The Mathematics of Pool Sizing

text

pool_size = (max_queries_per_second × avg_query_duration_seconds)

At 5,000 queries/second with 10ms average query duration:
pool_size = 5,000 × 0.010 = 50 connections

This is Little's Law applied to database connections. If you need more throughput or have longer queries, you need more connections — up to the database's limit.

javascript

import pg from 'pg';

const pool = new pg.Pool({
  connectionString: process.env.DATABASE_URL,

  // Pool size: sized to your throughput target
  // = (target_queries_per_sec × avg_query_duration_sec)
  max: 50,

  // Connection idle timeout: close idle connections after 30s
  // Prevents accumulating connections that just hold resources
  idleTimeoutMillis: 30_000,

  // Connection acquisition timeout: fail fast if pool is exhausted
  // Without this, queries wait forever when pool is full
  connectionTimeoutMillis: 5_000,

  // Minimum connections to keep warm
  // Prevents cold-start latency for idle services
  min: 5,

  // Disconnect and reconnect after N transactions
  // Prevents long-lived connections accumulating server-side state
  maxUses: 7_500,

  // Connection health check on acquisition
  allowExitOnIdle: false,
});

Monitoring Pool State

javascript

// Monitor pool utilization in production
const poolUtilizationGauge = new Gauge({
  name: 'db_pool_utilization_ratio',
  help: 'DB pool utilization (active / max)',
});

const poolWaitingGauge = new Gauge({
  name: 'db_pool_waiting_count',
  help: 'Queries waiting for a connection',
});

setInterval(() => {
  // pg-pool exposes pool state via these properties
  const total = pool.totalCount;      // total connections (idle + active)
  const idle = pool.idleCount;        // available connections
  const waiting = pool.waitingCount;  // queries queued for a connection

  poolUtilizationGauge.set((total - idle) / pool.options.max);
  poolWaitingGauge.set(waiting);

  if (waiting > 0) {
    logger.warn({ waiting, total, idle }, 'DB pool exhausted — queries queuing');
  }
}, 5_000);

Alert thresholds:

db_pool_utilization_ratio > 0.8: pool approaching capacity
db_pool_waiting_count > 0: pool exhausted — immediate action needed
db_pool_waiting_count > 50: serious saturation — likely causing timeouts

Pool Exhaustion: The Silent Performance Killer

When all pool connections are busy, pool.connect() waits. At high throughput, this creates a latency cascade: query takes 15ms → pool slot held for 15ms → at 50 queries/pool × 15ms = 750ms of total query time per second → pool fully booked at 3,333 queries/second.

If your target is 5,000 queries/second:

required_pool_size = ceil(5,000 × 0.015) = 75 connections

Insufficient pool size manifests as:

P99 latency spikes (queries waiting for a connection)
connectionTimeoutMillis errors (pool exhausted for too long)
Not as high CPU or network errors — the service appears healthy from the outside while queries queue silently

Connection Lifecycle: `maxUses` and Health Checks

Long-lived database connections accumulate state: temporary tables (from past SET commands), prepared statements, and sometimes server-side session state.

maxUses: 7500 causes pg-pool to destroy and recreate a connection after 7,500 uses. This prevents state accumulation without the overhead of creating a new connection per query.

javascript

// Verify pool connections are healthy before use
const pool = new pg.Pool({
  ...config,
  // Called on each connection before it is returned from the pool
  // If this throws, the connection is discarded and a new one created
});

// Manual health check on startup (wait for pool to be ready before serving traffic)
async function waitForDatabase(retries = 10) {
  for (let i = 0; i < retries; i++) {
    try {
      await pool.query('SELECT 1');
      logger.info('Database connection pool ready');
      return;
    } catch (err) {
      logger.warn({ attempt: i + 1, err: err.message }, 'Database not ready, retrying...');
      await new Promise(r => setTimeout(r, 2_000));
    }
  }
  throw new Error('Database connection failed after retries');
}

Redis: Connection Management with `ioredis`

Redis connections are cheaper than PostgreSQL but still need pooling for high-throughput usage.

javascript

import Redis from 'ioredis';

// Single Redis connection (ioredis multiplexes commands over one connection)
// ioredis handles pipelining automatically — multiple commands batched per RTT
const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
  password: process.env.REDIS_PASSWORD,

  // Auto-reconnect configuration
  retryStrategy: (times) => {
    if (times > 10) return null;  // give up after 10 retries
    return Math.min(times * 100, 3_000);  // exponential backoff, max 3s
  },

  // Timeout for commands (prevents hanging on slow Redis)
  commandTimeout: 5_000,

  // Keep-alive to prevent idle connection drops
  keepAlive: 30_000,

  // Lazy connect: don't connect until first command
  lazyConnect: true,
});

// Cluster mode: ioredis handles slot routing automatically
const redisCluster = new Redis.Cluster([
  { host: 'redis-node-1', port: 6379 },
  { host: 'redis-node-2', port: 6379 },
  { host: 'redis-node-3', port: 6379 },
], {
  scaleReads: 'slave',      // route read commands to replicas
  maxRedirections: 16,       // follow up to 16 cluster redirections
  retryDelayOnClusterDown: 300,
});

Pipelining: Batching Redis Commands

javascript

// BAD: N network round trips for N commands
for (const tx of transactions) {
  await redis.set(`tx:${tx.hash}`, JSON.stringify(tx), 'EX', 3600);
}
// N × RTT = high latency for large batches

// GOOD: pipeline — all commands sent in one batch
const pipeline = redis.pipeline();
for (const tx of transactions) {
  pipeline.set(`tx:${tx.hash}`, JSON.stringify(tx), 'EX', 3600);
}
await pipeline.exec();
// 1 RTT regardless of batch size

PM2: Process Management in Production

PM2 manages Node.js processes: starts them, restarts them on crash, monitors memory usage, and provides zero-downtime reload.

Ecosystem Configuration

javascript

// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'blockchain-indexer',
    script: 'dist/app.js',

    // Cluster mode: spawn N workers
    instances: 'max',          // use all CPU cores
    exec_mode: 'cluster',      // vs 'fork' for single instance

    // Memory limit: restart if worker exceeds this
    max_memory_restart: '2G',

    // Environment
    env_production: {
      NODE_ENV: 'production',
      UV_THREADPOOL_SIZE: '16',
    },

    // Graceful shutdown
    kill_timeout: 5_000,       // wait 5s for graceful shutdown before SIGKILL
    listen_timeout: 10_000,    // wait 10s for app to start listening

    // Log management
    log_file: '/var/log/indexer/combined.log',
    error_file: '/var/log/indexer/error.log',
    log_date_format: 'YYYY-MM-DD HH:mm:ss',

    // Restart policy
    autorestart: true,
    watch: false,               // don't watch files in production
    max_restarts: 10,           // stop restarting after 10 crashes in a row
    min_uptime: '10s',          // must be up for 10s to count as successful start
  }]
};

Zero-Downtime Reload vs Restart

bash

# restart: SIGKILL all workers immediately, start fresh
# Causes downtime
pm2 restart blockchain-indexer

# reload: rolling restart — new worker starts, accepts traffic,
# old worker receives SIGTERM and drains in-flight requests
# Zero downtime
pm2 reload blockchain-indexer

How reload works:

PM2 starts a new worker instance
New worker starts, begins accepting connections (via SO_REUSEPORT)
PM2 sends SIGTERM to old worker
Old worker: stops accepting new connections, finishes in-flight requests
Old worker exits cleanly
Repeat for each worker in the cluster

This is why implementing graceful shutdown correctly is essential for pm2 reload to work.

Graceful Shutdown: The Complete Sequence

A Node.js service has multiple resources that must be closed in the correct order on shutdown.

javascript

// The complete graceful shutdown sequence
async function gracefulShutdown(signal) {
  logger.info({ signal }, 'Shutdown signal received');

  // Step 1: Stop accepting new connections
  // This prevents new requests from starting while we drain
  await new Promise((resolve, reject) => {
    httpServer.close((err) => err ? reject(err) : resolve());
  });
  logger.info('HTTP server stopped accepting connections');

  // Step 2: Drain in-flight Kafka consumers
  // Wait for currently-processing messages to complete before closing
  await kafkaConsumer.disconnect();
  logger.info('Kafka consumer disconnected');

  // Step 3: Drain active database queries
  // Wait for currently-running queries to complete
  // pg-pool.end() waits for all connections to become idle, then closes them
  await pool.end();
  logger.info('Database pool closed');

  // Step 4: Close Redis connection
  await redis.quit();
  logger.info('Redis connection closed');

  // Step 5: Flush pending logs and metrics
  await logger.flush();

  // Step 6: Exit
  logger.info('Graceful shutdown complete');
  process.exit(0);
}

// Handle shutdown signals
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));

// Handle unhandled rejections — log and exit
process.on('unhandledRejection', (reason, promise) => {
  logger.error({ reason, promise }, 'Unhandled Promise rejection');
  gracefulShutdown('unhandledRejection').catch(() => process.exit(1));
});

// Hard timeout: if graceful shutdown takes too long, force exit
const SHUTDOWN_TIMEOUT = 15_000;
setTimeout(() => {
  logger.error('Graceful shutdown timed out — forcing exit');
  process.exit(1);
}, SHUTDOWN_TIMEOUT).unref();  // .unref() prevents this timer from keeping process alive

Kubernetes: Liveness vs Readiness Probes

Kubernetes uses two types of health probes with critically different semantics.

Liveness probe: "Is this pod still alive and should it be restarted?"

If liveness fails: Kubernetes kills and restarts the pod
Use for: detecting deadlocks, infinite loops, hung processes
Be conservative: don't check dependencies (database) in liveness

Readiness probe: "Is this pod ready to receive traffic?"

If readiness fails: Kubernetes removes the pod from load balancer rotation (but does NOT restart it)
Use for: checking database connectivity, cache warmup, dependency availability
A pod can be live but not ready (database is down → no traffic, but no restart needed)

javascript

// Health check endpoints
fastify.get('/healthz/live', async (req, reply) => {
  // Liveness: only check if the process is alive and not deadlocked
  // This should always return 200 unless the process itself is broken
  reply.code(200).send({ status: 'alive', pid: process.pid });
});

fastify.get('/healthz/ready', async (req, reply) => {
  // Readiness: check all dependencies
  const checks = await Promise.allSettled([
    pool.query('SELECT 1'),                    // database
    redis.ping(),                              // cache
    kafkaProducer.isConnected() ? Promise.resolve() : Promise.reject(new Error('Kafka disconnected')),
  ]);

  const failed = checks
    .filter(c => c.status === 'rejected')
    .map(c => (c as PromiseRejectedResult).reason.message);

  if (failed.length > 0) {
    reply.code(503).send({ status: 'not_ready', failures: failed });
    return;
  }

  reply.code(200).send({ status: 'ready' });
});

yaml

# kubernetes/deployment.yaml
spec:
  containers:
  - name: blockchain-indexer
    livenessProbe:
      httpGet:
        path: /healthz/live
        port: 3000
      initialDelaySeconds: 10    # wait 10s before first check
      periodSeconds: 10          # check every 10s
      failureThreshold: 3        # restart after 3 consecutive failures

    readinessProbe:
      httpGet:
        path: /healthz/ready
        port: 3000
      initialDelaySeconds: 5     # start checking readiness after 5s
      periodSeconds: 5           # check every 5s
      failureThreshold: 2        # remove from rotation after 2 failures
      successThreshold: 1        # add back after 1 success

The startup scenario:

text

t=0s:  Pod starts, Node.js begins loading modules
t=5s:  First readiness check: database pool connecting → 503 → not in rotation
t=10s: Database pool ready, Redis connected → 200 → added to rotation
t=10s: First liveness check: process alive → 200 → no restart

If database goes down at t=300s:
  t=305s: Readiness fails → removed from rotation (no new traffic)
  t=305s: Liveness still passes (process is alive, just can't reach DB)
  t=310s: Liveness still passes → pod NOT restarted
  t=350s: Database recovers → readiness passes → back in rotation
  (Zero pod restarts for a transient database outage)

Production Incident: Pool Exhaustion Masking as Latency Spike

Context: A UPI payment service with max: 20 connections in the pool, targeting 3,000 queries/second at 8ms average query time.

The math check no one did:

text

required_pool = 3,000 × 0.008 = 24 connections needed
configured_pool = 20 connections → undersized by 17%

What happened: During normal operation, the service handled load because average query time was actually 6ms (not 8ms as designed for). Required pool: 3,000 × 0.006 = 18 connections. The pool of 20 had a 10% safety margin.

During a marketing push, traffic doubled to 6,000 queries/second. Required pool: 6,000 × 0.006 = 36 connections. Available: 20.

text

Pool exhaustion onset: when waiting_count > 0
At 6,000 qps with 20 connections at 6ms each:
  Pool throughput: 20 / 0.006 = 3,333 queries/second max
  Deficit: 6,000 - 3,333 = 2,667 queries/second queuing

After 1 second: 2,667 queries waiting
After 5 seconds: 13,335 queries waiting
connectionTimeoutMillis = 5,000ms → all queued queries start failing with timeout

Symptom: p99 latency jumped from 12ms to 5,200ms (the connection timeout). CPU was at 30%. Database was at 25% capacity. The bottleneck was invisible without monitoring db_pool_waiting_count.

The fix:

javascript

// Correctly sized pool for the traffic target
const pool = new pg.Pool({
  max: 50,           // sized for 6,000 qps at 8ms = 48 connections + buffer
  connectionTimeoutMillis: 2_000,  // fail fast (2s not 5s)
  idleTimeoutMillis: 30_000,
});

// Added alerting
if (pool.waitingCount > 0) {
  alert(`DB pool exhausted: ${pool.waitingCount} queries waiting`);
}

Additionally, connectionTimeoutMillis was reduced from 5,000ms to 2,000ms. The long timeout was hiding the exhaustion — queries waited 5 seconds before failing, making the P99 look catastrophic while masking the root cause (pool size).

Summary

Concept	Key Takeaway
Pool sizing	`pool_size = target_qps × avg_query_duration_sec`. Recheck when traffic grows.
`db_pool_waiting_count`	First metric to add. Any waiting = pool undersized for current load.
`connectionTimeoutMillis`	Set to 2–3s. Long timeouts hide exhaustion and make P99 look catastrophic.
`maxUses`	Recycle connections after N uses to prevent server-side state accumulation.
ioredis pipelining	Batch Redis commands in one RTT. Use `pipeline()` for multi-key operations.
PM2 `pm2 reload`	Rolling restart with zero downtime. Requires correct graceful shutdown implementation.
Graceful shutdown	Stop accepting → drain Kafka → drain DB pool → close Redis → exit. Hard timeout as fallback.
Liveness probe	Is the process alive? Check: process responsiveness only. Never check dependencies.
Readiness probe	Is the service ready for traffic? Check: database, Redis, Kafka. Fail gracefully during startup.
`pool.end()`	Waits for all active queries to complete before closing connections. Safe for shutdown.

The connection layer determines capacity. Module 14 covers the other end of the scale — ultra-lightweight V8 isolates for edge deployments that need zero cold starts and globally distributed intake proxies.

Next: Module 14 — Edge Runtime Ingestion & V8 Isolates →

Knowledge Check

When configuring a Kubernetes deployment for a Node.js service, what is the critical difference between a liveness probe and a readiness probe, and what should be checked in each?

A Node.js service has an average database query duration of 15ms. During a marketing event, traffic is expected to spike to 4,000 database queries per second. According to Little's Law applied to pool sizing, what is the mathematical minimum PostgreSQL connection pool size required to handle this throughput without queuing?

Why is it crucial to set a relatively short connectionTimeoutMillis (e.g., 2,000ms) on the PostgreSQL connection pool rather than a long default (e.g., 30 seconds)?

Test your knowledge with more question sets

PreviousModule A-13: Production Observability, Performance Profiling & Flame Graphs Next Module A-15: Edge Runtime Ingestion & V8 Isolates

Discussion

Join the discussion

Loading comments...

Module 13 — Advanced Connection Pooling & Process Management

Why Connection Pooling Is Not Optional

pg-pool: PostgreSQL Connection Pool

The Mathematics of Pool Sizing

Monitoring Pool State

Pool Exhaustion: The Silent Performance Killer

Connection Lifecycle: maxUses and Health Checks

Redis: Connection Management with ioredis

Pipelining: Batching Redis Commands

PM2: Process Management in Production

Ecosystem Configuration

Zero-Downtime Reload vs Restart

Graceful Shutdown: The Complete Sequence

Kubernetes: Liveness vs Readiness Probes

Production Incident: Pool Exhaustion Masking as Latency Spike

Summary

Test your knowledge with more question sets

Discussion

`pg-pool`: PostgreSQL Connection Pool

Connection Lifecycle: `maxUses` and Health Checks

Redis: Connection Management with `ioredis`