Module A-7·27 min read

cluster vs worker_threads, SharedArrayBuffer ring buffers for zero-copy IPC, and parallelising transaction signature verification.

Module 6 — Core Scaling: Multi-Process Clustering & IPC Latency

What this module covers: A single Node.js process is single-threaded. On a 32-core cloud instance, that means 31 cores sit idle while your blockchain indexer saturates the 32nd. cluster and worker_threads are the two mechanisms for using those cores — but they have fundamentally different properties, different communication overhead, and different failure modes. Choosing the wrong one for your workload costs 3–10x throughput. This module covers the exact internal mechanics of both, the precise cost of IPC serialization at high message rates, and how SharedArrayBuffer + Atomics eliminates that cost for the right workloads.


Why a Single Node.js Process Cannot Saturate a Multi-Core Server

The event loop runs on one thread. JavaScript runs on one thread. V8 runs on one thread. A 32-core server with Node.js running on it has 31 cores available for your application — and by default, 31 of them are idle.

This is by design, not a limitation. The single-threaded model is what makes the event loop's performance guarantees possible: no shared state, no mutex contention, no thread-safety bugs. But for CPU-intensive workloads on large instances, you need to take deliberate action to use the hardware you're paying for.

Two mechanisms exist:

  • cluster: forks full Node.js processes. Each has its own V8 heap, event loop, and thread. Complete process isolation.
  • worker_threads: creates threads within a single process. Shared memory space. Communication via message passing or SharedArrayBuffer.

The right choice depends entirely on your workload characteristics.


cluster: Full Process Replication

cluster.fork() creates a complete copy of the Node.js process. The child process has its own V8 heap, its own event loop, its own module registry, and its own memory space.

javascript
import cluster from 'node:cluster'; import { cpus } from 'node:os'; import net from 'node:net'; const NUM_WORKERS = cpus().length; if (cluster.isPrimary) { console.log(`Primary ${process.pid} starting ${NUM_WORKERS} workers`); for (let i = 0; i < NUM_WORKERS; i++) { cluster.fork(); } // Restart workers that die (OOM, uncaught exception, etc.) cluster.on('exit', (worker, code, signal) => { console.warn(`Worker ${worker.process.pid} died (${signal || code})`); cluster.fork(); // replace immediately }); } else { // Each worker runs a complete copy of your server const server = net.createServer({ reusePort: true }, handleConnection); server.listen(3000); console.log(`Worker ${process.pid} listening`); }

Cluster Load Balancing: Round-Robin vs SO_REUSEPORT

Default (round-robin by master): The master process accepts all incoming connections and distributes them to workers via IPC. This creates a bottleneck at the master process and adds IPC overhead per connection.

SO_REUSEPORT (recommended): Each worker binds directly to the port. The kernel distributes connections between workers using a hash of the connection 4-tuple (src IP + src port + dst IP + dst port). No IPC for connection distribution. No master bottleneck.

javascript
// SO_REUSEPORT: each worker listens independently if (!cluster.isPrimary) { const server = net.createServer({ reusePort: true }, handleConnection); server.listen(3000); }

Why SO_REUSEPORT is better for blockchain indexers:

The master-round-robin model concentrates the accept syscall on one process. At 50K connections/second, the master spends all its time calling accept4 and sending socket handles to workers via IPC. With SO_REUSEPORT, 32 workers each accept 1,562 connections/second — well within the capacity of each worker's event loop.

The Memory Cost of cluster

Each forked worker is a separate process. On Linux, fork() uses copy-on-write, so immediately after forking, the child shares the parent's pages. As each worker modifies pages (loading modules, creating objects), those pages are copied — the shared advantage erodes over time.

For a typical Node.js application with a 200MB heap, 32 workers will consume:

  • Immediately after fork: ~200MB shared + ~50MB unique per worker = ~1.8GB
  • After warmup (1–2 min): ~200MB shared + ~150MB unique per worker = ~5GB

On a 32GB server, 32 workers at 5GB total leaves 27GB for OS page cache and application data — acceptable. On a 16GB server with a large application, cluster may not be viable.


child_process: Spawning External Processes

For integrating with non-JavaScript components (a Rust binary for signature verification, a Python analytics script, a Go RPC service), child_process provides three key APIs.

spawn: Streaming I/O

javascript
import { spawn } from 'node:child_process'; // Spawn a Rust binary that verifies transaction signatures // Input: raw transaction bytes on stdin // Output: verification result on stdout function verifySignatureBatch(transactions) { const proc = spawn('./verify-signatures', [], { stdio: ['pipe', 'pipe', 'pipe'] // stdin, stdout, stderr }); // Write transaction data as a stream for (const tx of transactions) { proc.stdin.write(serializeTransaction(tx)); } proc.stdin.end(); // Read results as a stream const results = []; return new Promise((resolve, reject) => { proc.stdout.on('data', (chunk) => { results.push(...parseVerificationResults(chunk)); }); proc.on('close', (code) => { if (code !== 0) reject(new Error(`Verifier exited with ${code}`)); else resolve(results); }); }); }

spawn is for processes that produce large output or need streaming I/O. stdout/stderr are Readable streams — backpressure applies.

fork: V8-to-V8 IPC

fork is spawn specialized for Node.js child processes. It creates a communication channel (IPC) that supports process.send() / process.on('message').

javascript
import { fork } from 'node:child_process'; // Fork a dedicated transaction validator process const validator = fork('./validator-worker.js'); // Send work validator.send({ type: 'VALIDATE', transactions: batch }); // Receive results validator.on('message', (msg) => { if (msg.type === 'RESULT') processResults(msg.results); });

The IPC cost: every process.send() call serializes the message using JSON.stringify (or structuredClone for transferables), sends it over a Unix socket, and deserializes on the other side. For small messages (< 1KB) at low frequency (< 1,000/sec), this is fine. At high frequency with large payloads, it becomes the bottleneck — see the IPC latency section below.


worker_threads: Threads with Shared Memory

Unlike cluster (separate processes), worker_threads creates threads within the same process. Threads share:

  • The same process address space
  • The same libuv thread pool
  • The ability to share memory via SharedArrayBuffer

Threads do NOT share:

  • The V8 heap (each thread has its own heap)
  • The event loop (each thread has its own loop)
  • Global JavaScript state
javascript
import { Worker, isMainThread, parentPort, workerData } from 'node:worker_threads'; if (isMainThread) { // Create a pool of 8 worker threads for CPU-bound transaction processing const workers = Array.from({ length: 8 }, () => new Worker(new URL(import.meta.url), { workerData: { config: loadConfig() } }) ); // Distribute work across workers let workerIndex = 0; function submitWork(transaction) { return new Promise((resolve, reject) => { const worker = workers[workerIndex++ % workers.length]; worker.postMessage(transaction); worker.once('message', resolve); worker.once('error', reject); }); } } else { // Worker thread: receives transactions, validates, responds parentPort.on('message', async (transaction) => { const result = await validateTransactionSignature(transaction); parentPort.postMessage(result); }); }

worker_threads vs cluster: The Decision Matrix

clusterworker_threads
Process isolationComplete — crash in worker doesn't affect othersPartial — uncaught exception in worker crashes worker thread
Memory sharingNone (copy-on-write after fork)SharedArrayBuffer for zero-copy
IPCUnix socket + JSON serializationpostMessage (structured clone) or shared memory
Use forHTTP server scaling, independent workloadsCPU-bound computation, shared state
Module reloadEach worker loads modules independentlyWorkers can share module instances
Failure isolationStrong — OOM in one worker doesn't affect othersWeaker — shared libuv pool, shared process memory

For a blockchain indexer:

  • HTTP request handling → cluster (each worker independently accepts connections)
  • Transaction signature verification → worker_threads (CPU-bound, benefits from shared memory for batch data)
  • Historical block replay → worker_threads (high computation, can share the block buffer via SharedArrayBuffer)

IPC Latency: The Hidden Cost of postMessage

worker_threads.postMessage() and process.send() both serialize data before transmission.

The default serialization uses the Structured Clone Algorithm — similar to JSON.stringify but handles more types (Map, Set, ArrayBuffer, etc.). The cost:

javascript
// Measuring actual IPC latency with worker_threads const worker = new Worker(` const { parentPort } = require('worker_threads'); parentPort.on('message', (msg) => parentPort.postMessage(msg)); `, { eval: true }); // Round-trip latency test const ITERATIONS = 10_000; const start = process.hrtime.bigint(); let count = 0; worker.on('message', () => { if (++count < ITERATIONS) worker.postMessage({ id: count, data: testPayload }); else { const elapsed = Number(process.hrtime.bigint() - start) / 1_000_000; console.log(`${ITERATIONS} round trips: ${elapsed.toFixed(0)}ms`); console.log(`Per message: ${(elapsed / ITERATIONS).toFixed(3)}ms`); } }); worker.postMessage({ id: 0, data: testPayload });

Typical results by payload size:

PayloadRound-trip latencyThroughput
100 bytes0.05ms20,000 msg/sec
1 KB0.12ms8,300 msg/sec
10 KB0.8ms1,250 msg/sec
100 KB7ms143 msg/sec
1 MB65ms15 msg/sec

For a blockchain indexer passing 5KB transaction payloads at 50K/sec: IPC throughput = 1,250 msg/sec × N workers. At 8 workers: 10,000 transactions/sec max via message passing. Insufficient for 50K/sec.

The solution: SharedArrayBuffer for bulk data.


SharedArrayBuffer + Atomics: Zero-Copy IPC

SharedArrayBuffer allocates memory that is simultaneously accessible from multiple threads. No serialization. No copying. Reads and writes are direct memory operations.

A Production Ring Buffer for Transaction Batching

javascript
// shared-ring-buffer.js // Shared between main thread and worker threads via SharedArrayBuffer const CAPACITY = 65536; // 64K slots const SLOT_SIZE = 256; // 256 bytes per slot const BUFFER_BYTES = CAPACITY * SLOT_SIZE; const CONTROL_BYTES = 4 * 4; // 4 Int32 control values export function createSharedRingBuffer() { const dataBuffer = new SharedArrayBuffer(BUFFER_BYTES); const controlBuffer = new SharedArrayBuffer(CONTROL_BYTES); return { data: new Uint8Array(dataBuffer), control: new Int32Array(controlBuffer), // control[0] = write index // control[1] = read index // control[2] = producer notification flag // control[3] = consumer notification flag }; } // Producer (main thread) — write transaction into next slot export function writeToRing(ring, transactionBytes) { const writeIdx = Atomics.load(ring.control, 0); const nextIdx = (writeIdx + 1) % CAPACITY; const readIdx = Atomics.load(ring.control, 1); if (nextIdx === readIdx) return false; // buffer full const offset = writeIdx * SLOT_SIZE; ring.data.set(transactionBytes.subarray(0, SLOT_SIZE), offset); Atomics.store(ring.control, 0, nextIdx); Atomics.notify(ring.control, 0, 1); // wake one waiting consumer return true; } // Consumer (worker thread) — read transaction from ring export function readFromRing(ring) { const readIdx = Atomics.load(ring.control, 1); const writeIdx = Atomics.load(ring.control, 0); if (readIdx === writeIdx) { // Buffer empty — wait for producer Atomics.wait(ring.control, 0, writeIdx); // blocks until notify return null; } const offset = readIdx * SLOT_SIZE; const slot = ring.data.slice(offset, offset + SLOT_SIZE); const nextRead = (readIdx + 1) % CAPACITY; Atomics.store(ring.control, 1, nextRead); return slot; }
javascript
// main.js — producer const ring = createSharedRingBuffer(); // Pass the SharedArrayBuffer to workers (zero-copy — same memory) const workers = Array.from({ length: 8 }, () => new Worker('./processor-worker.js', { workerData: { dataBuffer: ring.data.buffer, controlBuffer: ring.control.buffer, } }) ); // Write incoming transactions into the ring buffer socket.on('data', (chunk) => { const transactions = parse(chunk); for (const tx of transactions) { const bytes = serialize(tx); writeToRing(ring, bytes); // zero-copy — no serialization } });
javascript
// processor-worker.js — consumer import { workerData } from 'node:worker_threads'; const data = new Uint8Array(workerData.dataBuffer); const control = new Int32Array(workerData.controlBuffer); const ring = { data, control }; // Worker continuously reads from shared ring buffer while (true) { const slot = readFromRing(ring); if (slot) { const transaction = deserialize(slot); processTransaction(transaction); } }

Throughput comparison:

  • postMessage with 5KB payload: ~8,300 msg/sec
  • SharedArrayBuffer ring buffer: ~850,000 writes/sec (100x faster)

The ring buffer is particularly effective when the main thread is ingesting faster than workers can consume — the shared memory acts as a natural back-buffer without any allocation.


Structured Concurrency: Managing Worker Pools Gracefully

A worker pool without lifecycle management leaks workers, misses errors, and fails ungracefully on shutdown.

javascript
class WorkerPool { #workers = []; #queue = []; #size; #workerScript; constructor(workerScript, size = cpus().length) { this.#workerScript = workerScript; this.#size = size; for (let i = 0; i < size; i++) { this.#createWorker(); } } #createWorker() { const worker = new Worker(this.#workerScript); worker.on('message', ({ id, result, error }) => { const task = this.#queue.find(t => t.id === id); if (!task) return; this.#queue = this.#queue.filter(t => t.id !== id); if (error) task.reject(new Error(error)); else task.resolve(result); }); worker.on('error', (err) => { console.error(`Worker error: ${err.message}`); this.#workers = this.#workers.filter(w => w !== worker); this.#createWorker(); // replace failed worker }); this.#workers.push(worker); } async run(payload) { const id = Math.random().toString(36).slice(2); const worker = this.#workers[this.#queue.length % this.#workers.length]; return new Promise((resolve, reject) => { this.#queue.push({ id, resolve, reject }); worker.postMessage({ id, payload }); }); } async shutdown() { // Wait for in-flight tasks, then terminate await Promise.all(this.#queue.map(t => t.resolve(null))); await Promise.all(this.#workers.map(w => w.terminate())); } } // Usage for signature verification pool const sigPool = new WorkerPool('./sig-verifier.js', cpus().length); // Graceful shutdown on SIGTERM process.on('SIGTERM', async () => { await sigPool.shutdown(); process.exit(0); });

The Production Incident: IPC Bottleneck Hiding as a CPU Problem

Context: A blockchain indexer using cluster with 16 workers. Each worker receives transaction events via IPC from the primary process, processes them, and writes to PostgreSQL.

The symptom: Throughput plateaued at 4,200 transactions/second on a 16-core server. CPU utilization across all cores: 15%. Neither the database nor the network was the bottleneck. Adding more workers didn't help.

The diagnosis:

bash
# Measure time spent in IPC send/receive strace -p $(pgrep -f "node indexer" | head -1) -e trace=sendmsg,recvmsg -T 2>&1 | head -50 # sendmsg(5, ..., 0) = 8192 <0.000423> # recvmsg(5, ..., MSG_DONTWAIT) = 8192 <0.000512>

Each cluster.fork() worker communication was going through the master's IPC loop. The primary process was serializing incoming transaction data with JSON.stringify, sending it via process.send() to 16 workers, and receiving acknowledgements back. At 4,200 transactions/second, the primary was doing:

  • 4,200 JSON.stringify calls/sec (each ~5KB payload = 21MB/sec of serialization)
  • 4,200 sendmsg syscalls/sec
  • 4,200 recvmsg syscalls/sec
  • 4,200 JSON.parse calls on each worker = 16 × 4,200 = 67,200 parses/sec

The primary's event loop hit ELU 0.97. The primary was the bottleneck, not the workers.

The fix — switch to SO_REUSEPORT so workers receive connections directly:

javascript
// Before: primary distributes work via IPC (bottleneck) // After: each worker accepts connections directly from the OS if (cluster.isPrimary) { for (let i = 0; i < 16; i++) cluster.fork(); } else { // Worker accepts connections directly — no IPC for each transaction const server = net.createServer({ reusePort: true }, handleTransaction); server.listen(3000); }

After the fix: throughput jumped to 28,000 transactions/second. Primary ELU dropped to 0.02. CPU across workers: 65% (healthy, room to grow). The IPC overhead had consumed 85% of the primary's capacity, visible only via strace and ELU measurement.


Summary

ConceptKey Takeaway
clusterFull process fork. Complete isolation. Memory: ~150MB extra per worker after warmup.
SO_REUSEPORTWorkers bind port directly. Kernel distributes connections. Eliminates master bottleneck.
child_process.spawnStreaming I/O to external processes. Use for Go/Rust/Python integrations.
child_process.forkNode-to-Node IPC. JSON serialization over Unix socket. Avoid for high-frequency messages.
worker_threadsThreads with own V8 heap. postMessage for messages, SharedArrayBuffer for bulk data.
IPC postMessage cost~0.05ms for 100 bytes, ~7ms for 100KB. 8,300 msg/sec max for 1KB payloads.
SharedArrayBufferZero-copy shared memory. 850K+ writes/sec for 256-byte slots. 100x faster than postMessage for bulk data.
Atomics.wait/notifyThread synchronization on SharedArrayBuffer. Wait without spinning.
Worker pool lifecycleHandle worker errors, replace crashed workers, drain on SIGTERM.
IPC bottleneck patternELU 0.97 on primary, low ELU on workers, plateau in throughput → primary is the bottleneck.

Clustering gets you horizontal scale across cores. Module 7 covers the routing layer — what happens to every incoming request before it reaches your business logic, and why the router you choose can cost you 3x throughput before a single line of your code runs.

Next: Module 7 — Routing Engines at Scale: Vanilla HTTP vs Radix Tree Frameworks →

© 2026 Jatin Jain Saraf (JJS). All rights reserved.