cluster vs worker_threads, SharedArrayBuffer ring buffers for zero-copy IPC, and parallelising transaction signature verification.
Module 6 — Core Scaling: Multi-Process Clustering & IPC Latency
What this module covers: A single Node.js process is single-threaded. On a 32-core cloud instance, that means 31 cores sit idle while your blockchain indexer saturates the 32nd. cluster and worker_threads are the two mechanisms for using those cores — but they have fundamentally different properties, different communication overhead, and different failure modes. Choosing the wrong one for your workload costs 3–10x throughput. This module covers the exact internal mechanics of both, the precise cost of IPC serialization at high message rates, and how SharedArrayBuffer + Atomics eliminates that cost for the right workloads.
Why a Single Node.js Process Cannot Saturate a Multi-Core Server
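JavaScript execution and the event loop share a single thread. V8 and libuv do use a few helper threads (garbage collection, JIT compilation, the libuv thread pool), but your application code never runs on them. A 32-core server running one Node.js process has 32 cores available to your application, and by default 31 of them sit idle.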
This is by design, not a limitation. The single-threaded model is what makes the event loop's performance guarantees possible: no shared state, no mutex contention, no thread-safety bugs. But for CPU-intensive workloads on large instances, you need to take deliberate action to use the hardware you're paying for.
Two mechanisms exist:
- cluster: forks full Node.js processes. Each has its own V8 heap, event loop, and thread. Complete process isolation.
- worker_threads: creates threads within a single process. Shared memory space. Communication via message passing or SharedArrayBuffer.
The right choice depends entirely on your workload characteristics.
cluster: Full Process Replication
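cluster.fork() starts a new Node.js process that runs the same entry script. Despite the name, it does not clone the primary's memory the way fork(2) does: the worker boots from scratch with its own V8 heap, its own event loop, its own module registry, and its own memory space.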
```javascript
import cluster from 'node:cluster';
import { cpus } from 'node:os';
import net from 'node:net';

const NUM_WORKERS = cpus().length;

if (cluster.isPrimary) {
  console.log(`Primary ${process.pid} starting ${NUM_WORKERS} workers`);

  for (let i = 0; i < NUM_WORKERS; i++) {
    cluster.fork();
  }

  // Restart workers that die (OOM, uncaught exception, etc.)
  cluster.on('exit', (worker, code, signal) => {
    console.warn(`Worker ${worker.process.pid} died (${signal || code})`);
    cluster.fork(); // replace immediately
  });
} else {
  // Each worker runs a complete copy of your server.
  // handleConnection is your application's connection handler, defined elsewhere.
  const server = net.createServer(handleConnection);
  // reusePort is a listen() option, not a createServer() option
  // (Node.js 22.12+ / 23.1+, on platforms that support SO_REUSEPORT)
  server.listen({ port: 3000, reusePort: true });
  console.log(`Worker ${process.pid} listening`);
}
```
Cluster Load Balancing: Round-Robin vs SO_REUSEPORT
Default (round-robin by the primary): The primary process accepts all incoming connections and hands them to workers over IPC. This makes the primary a bottleneck and adds IPC overhead per connection. Round-robin is the default on every platform except Windows.
SO_REUSEPORT (recommended):
Each worker binds directly to the port. The kernel distributes connections between workers using a hash of the connection 4-tuple (src IP + src port + dst IP + dst port). No IPC for connection distribution. No master bottleneck.
```javascript
// SO_REUSEPORT: each worker listens independently
if (!cluster.isPrimary) {
  const server = net.createServer(handleConnection);
  // reusePort belongs to listen(); createServer() ignores it
  server.listen({ port: 3000, reusePort: true });
}
```
Why SO_REUSEPORT is better for blockchain indexers:
The master-round-robin model concentrates the accept syscall on one process. At 50K connections/second, the master spends all its time calling accept4 and sending socket handles to workers via IPC. With SO_REUSEPORT, 32 workers each accept 1,562 connections/second — well within the capacity of each worker's event loop.
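To make the kernel-side distribution concrete, here is a minimal simulation. It is not the kernel's actual algorithm: FNV-1a stands in for the kernel's hash purely for brevity, the addresses are synthetic, and the names `fnv1a` and `distribute` are illustrative.

```javascript
// Simulate SO_REUSEPORT-style distribution: hash(4-tuple) picks the listening socket.
// FNV-1a is an assumption for illustration; the kernel uses its own hash function.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash;
}

function distribute(connectionCount, workerCount) {
  const perWorker = new Array(workerCount).fill(0);
  for (let i = 0; i < connectionCount; i++) {
    // Synthetic client address and ephemeral port; the destination is fixed
    const srcIp = `10.${(i >> 16) & 255}.${(i >> 8) & 255}.${i & 255}`;
    const srcPort = 1024 + ((i * 7919) % 64000);
    const tuple = `${srcIp}:${srcPort}->192.0.2.1:3000`;
    perWorker[fnv1a(tuple) % workerCount]++;
  }
  return perWorker;
}

const counts = distribute(50_000, 32);
console.log(`min ${Math.min(...counts)}, max ${Math.max(...counts)} per worker`);
console.log(`average ${(50_000 / 32).toFixed(1)} per worker`);
```

Because the worker is chosen by a hash rather than by load, the split is only statistically even: a small number of very busy clients can still skew individual workers.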
The Memory Cost of cluster
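Each worker is a separate process with its own V8 heap. Because cluster.fork() launches a fresh Node.js process rather than cloning the primary the way fork(2) does, workers do not inherit the primary's heap copy-on-write. What the kernel can share between them is read-only, file-backed pages such as the node binary and shared libraries. Everything a worker allocates while loading modules and creating objects is private to it, and that private footprint grows during warmup.
As a rough illustration, assume ~200MB of pages shared across workers and a private footprint that grows from ~50MB to ~150MB per worker. 32 workers will then consume: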
- Immediately after fork: ~200MB shared + ~50MB unique per worker = ~1.8GB
- After warmup (1–2 min): ~200MB shared + ~150MB unique per worker = ~5GB
On a 32GB server, 32 workers at 5GB total leaves 27GB for OS page cache and application data — acceptable. On a 16GB server with a large application, cluster may not be viable.
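The estimate above is simple enough to turn into a helper for your own numbers. This is a back-of-the-envelope sketch; `clusterMemoryMB` is an illustrative name, and the inputs are the figures quoted above rather than measurements.

```javascript
// Estimate total memory for a cluster: shared pages are counted once,
// private pages once per worker. All inputs are in megabytes.
function clusterMemoryMB(workers, sharedMB, privatePerWorkerMB) {
  return sharedMB + workers * privatePerWorkerMB;
}

const afterFork = clusterMemoryMB(32, 200, 50);    // 1800 MB, about 1.8 GB
const afterWarmup = clusterMemoryMB(32, 200, 150); // 5000 MB = 5 GB

console.log(`After fork:   ${(afterFork / 1000).toFixed(1)} GB`);
console.log(`After warmup: ${(afterWarmup / 1000).toFixed(1)} GB`);
console.log(`Headroom on a 32 GB server: ${32 - afterWarmup / 1000} GB`);
```

For real sizing, replace the inputs with per-process figures measured on your own workers after warmup.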
child_process: Spawning External Processes
For integrating with non-JavaScript components (a Rust binary for signature verification, a Python analytics script, a Go RPC service), child_process provides three key APIs.
spawn: Streaming I/O
```javascript
import { spawn } from 'node:child_process';

// Spawn a Rust binary that verifies transaction signatures
// Input: raw transaction bytes on stdin
// Output: verification result on stdout
// serializeTransaction and parseVerificationResults are application helpers defined elsewhere
function verifySignatureBatch(transactions) {
  return new Promise((resolve, reject) => {
    const proc = spawn('./verify-signatures', [], {
      stdio: ['pipe', 'pipe', 'pipe'] // stdin, stdout, stderr
    });

    // Spawn failures (e.g. a missing binary) surface as 'error', not 'close'
    proc.on('error', reject);
    // A verifier that exits early would otherwise raise an unhandled EPIPE
    proc.stdin.on('error', reject);
    // Drain stderr so a chatty verifier cannot stall on a full pipe
    proc.stderr.resume();

    // Read results as a stream
    const results = [];
    proc.stdout.on('data', (chunk) => {
      results.push(...parseVerificationResults(chunk));
    });

    proc.on('close', (code) => {
      if (code !== 0) reject(new Error(`Verifier exited with ${code}`));
      else resolve(results);
    });

    // Write transaction data as a stream
    for (const tx of transactions) {
      proc.stdin.write(serializeTransaction(tx));
    }
    proc.stdin.end();
  });
}
```
spawn is for processes that produce large output or need streaming I/O. stdout/stderr are Readable streams — backpressure applies.
fork: V8-to-V8 IPC
fork is spawn specialized for Node.js child processes. It creates a communication channel (IPC) that supports process.send() / process.on('message').
```javascript
import { fork } from 'node:child_process';

// Fork a dedicated transaction validator process
const validator = fork('./validator-worker.js');

// Send work
validator.send({ type: 'VALIDATE', transactions: batch });

// Receive results
validator.on('message', (msg) => {
  if (msg.type === 'RESULT') processResults(msg.results);
});
```
The IPC cost: every process.send() call serializes the message with JSON.stringify by default (or with V8's structured-clone serializer when the child is forked with serialization: 'advanced'), sends it over the IPC channel (a Unix domain socket on POSIX systems), and deserializes it on the other side. For small messages (< 1KB) at low frequency (< 1,000/sec), this is fine. At high frequency with large payloads, it becomes the bottleneck; see the IPC latency section below.
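To get a feel for the serialization half of that cost without spawning a process, the sketch below times a JSON.stringify / JSON.parse round trip on a roughly 5KB message. The helper names and the transaction shape are hypothetical, and the measurement deliberately excludes the socket write and the wake-up of the receiving process, so real IPC latency will be higher.

```javascript
// Time the JSON round trip that process.send() performs by default.
// makeTransaction and its fields are hypothetical; only the payload size matters here.
function makeTransaction(approxBytes) {
  return {
    hash: '0x' + 'ab'.repeat(32),
    blockNumber: 19_000_000,
    input: '0x' + 'f'.repeat(approxBytes), // dominates the payload size
  };
}

function timeJsonRoundTrip(message, iterations) {
  const start = performance.now();
  let last;
  for (let i = 0; i < iterations; i++) {
    last = JSON.parse(JSON.stringify(message));
  }
  const totalMs = performance.now() - start;
  return { perMessageMs: totalMs / iterations, last };
}

const tx = makeTransaction(5 * 1024);
const { perMessageMs, last } = timeJsonRoundTrip(tx, 10_000);
console.log(`JSON payload: ${JSON.stringify(tx).length} bytes`);
console.log(`Round trip: ${perMessageMs.toFixed(4)} ms per message`);
```

Run it with payload sizes that match your own messages; the absolute numbers depend heavily on the machine and on how deeply nested the objects are.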
worker_threads: Threads with Shared Memory
Unlike cluster (separate processes), worker_threads creates threads within the same process. Threads share:
- The same process address space
- The same libuv thread pool
- The ability to share memory via SharedArrayBuffer
Threads do NOT share:
- The V8 heap (each thread has its own heap)
- The event loop (each thread has its own loop)
- Global JavaScript state
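```javascript
import { Worker, isMainThread, parentPort, workerData } from 'node:worker_threads';

if (isMainThread) {
  // Create a pool of 8 worker threads for CPU-bound transaction processing
  const pending = new Map(); // task id -> { resolve, reject }
  const workers = Array.from({ length: 8 }, () => {
    const worker = new Worker(new URL(import.meta.url), {
      workerData: { config: loadConfig() } // loadConfig is an application helper defined elsewhere
    });
    // One persistent listener per worker; replies are matched to callers by id.
    // (A per-call once('message') would resolve the wrong promise under concurrency.)
    worker.on('message', ({ id, result }) => {
      pending.get(id)?.resolve(result);
      pending.delete(id);
    });
    worker.on('error', (err) => {
      // Simplification: a crashed worker fails every outstanding task
      for (const { reject } of pending.values()) reject(err);
      pending.clear();
    });
    return worker;
  });

  // Distribute work across workers (round-robin)
  let nextId = 0;
  function submitWork(transaction) {
    return new Promise((resolve, reject) => {
      const id = nextId++;
      pending.set(id, { resolve, reject });
      workers[id % workers.length].postMessage({ id, transaction });
    });
  }
} else {
  // Worker thread: receives transactions, validates, responds
  parentPort.on('message', async ({ id, transaction }) => {
    const result = await validateTransactionSignature(transaction);
    parentPort.postMessage({ id, result });
  });
}
```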
worker_threads vs cluster: The Decision Matrix
| | cluster | worker_threads |
|---|---|---|
| Process isolation | Complete: a crash in one worker doesn't affect the others | Partial: an uncaught exception terminates only that worker thread, but a native crash or process-level OOM takes down every thread |
| Memory sharing | None (separate address spaces) | SharedArrayBuffer for zero-copy |
| IPC | Unix socket + JSON serialization | postMessage (structured clone) or shared memory |
| Use for | HTTP server scaling, independent workloads | CPU-bound computation, shared state |
| Module loading | Each worker loads modules independently | Each thread also loads its own module instances; only the process and its native code are shared |
| Failure isolation | Strong: OOM in one worker doesn't affect others | Weaker: shared libuv pool, shared process memory |
For a blockchain indexer:
- HTTP request handling → cluster (each worker independently accepts connections)
- Transaction signature verification → worker_threads (CPU-bound, benefits from shared memory for batch data)
- Historical block replay → worker_threads (high computation, can share the block buffer via SharedArrayBuffer)
IPC Latency: The Hidden Cost of postMessage
worker_threads.postMessage() and process.send() both serialize data before transmission.
postMessage() uses the Structured Clone Algorithm, which is similar to JSON.stringify but handles more types (Map, Set, ArrayBuffer, etc.); process.send() uses JSON by default. The cost:
```javascript
// Measuring actual IPC latency with worker_threads
import { Worker } from 'node:worker_threads';

// Example payload (~1 KB); vary the size to compare payload classes
const testPayload = 'x'.repeat(1024);

// With eval: true the worker source runs as CommonJS, so require() is available
const worker = new Worker(`
  const { parentPort } = require('worker_threads');
  parentPort.on('message', (msg) => parentPort.postMessage(msg));
`, { eval: true });

// Round-trip latency test
const ITERATIONS = 10_000;
const start = process.hrtime.bigint();
let count = 0;

worker.on('message', () => {
  if (++count < ITERATIONS) {
    worker.postMessage({ id: count, data: testPayload });
  } else {
    const elapsed = Number(process.hrtime.bigint() - start) / 1_000_000;
    console.log(`${ITERATIONS} round trips: ${elapsed.toFixed(0)}ms`);
    console.log(`Per message: ${(elapsed / ITERATIONS).toFixed(3)}ms`);
    worker.terminate(); // let the process exit
  }
});

worker.postMessage({ id: 0, data: testPayload });
```
Typical results by payload size:
| Payload | Round-trip latency | Throughput |
|---|---|---|
| 100 bytes | 0.05ms | 20,000 msg/sec |
| 1 KB | 0.12ms | 8,300 msg/sec |
| 10 KB | 0.8ms | 1,250 msg/sec |
| 100 KB | 7ms | 143 msg/sec |
| 1 MB | 65ms | 15 msg/sec |
For a blockchain indexer passing 5KB transaction payloads at 50K/sec, take the 10 KB row as a conservative bound: IPC throughput ≈ 1,250 msg/sec × N workers. At 8 workers that is at most 10,000 transactions/sec via message passing, which is insufficient for 50K/sec.
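The arithmetic behind that estimate can be sketched as follows. It assumes each worker handles one round trip at a time and that throughput scales linearly with the number of workers, which is optimistic because the main thread also pays serialization cost for every message. The function names are illustrative.

```javascript
// Derive per-worker throughput from round-trip latency, then size the pool.
function messagesPerSecond(roundTripMs) {
  return 1000 / roundTripMs;
}

function workersNeeded(targetPerSec, roundTripMs) {
  return Math.ceil(targetPerSec / messagesPerSecond(roundTripMs));
}

const perWorker = messagesPerSecond(0.8);  // 10 KB row: 1250 msg/sec
const withEight = 8 * perWorker;           // 10000 msg/sec
const needed = workersNeeded(50_000, 0.8); // 40 workers

console.log({ perWorker, withEight, needed });
```

Forty workers just to move data is the signal that the transport, not the computation, needs to change.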
The solution: SharedArrayBuffer for bulk data.
SharedArrayBuffer + Atomics: Zero-Copy IPC
SharedArrayBuffer allocates memory that is simultaneously accessible from multiple threads. No serialization. No copying. Reads and writes are direct memory operations.
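A minimal single-threaded sketch of the difference: two typed-array views over one SharedArrayBuffer see the same bytes, whereas a structured clone (the mechanism behind postMessage) produces an independent copy. In a real program the second view would be created inside a worker from a buffer passed via workerData; the variable names here are illustrative.

```javascript
// Two views over one SharedArrayBuffer observe the same bytes.
const shared = new SharedArrayBuffer(8);
const producerView = new Int32Array(shared);
const consumerView = new Int32Array(shared); // in practice, created in a worker thread

Atomics.store(producerView, 0, 42);
console.log(Atomics.load(consumerView, 0)); // 42: same memory, nothing was copied

// By contrast, structured clone produces an independent copy.
const original = new Int32Array([1, 2, 3]);
const cloned = structuredClone(original);
cloned[0] = 99;
console.log(original[0]); // 1: the clone has its own backing store
```

The price of sharing is that nothing coordinates access for you, which is why the index updates in the ring buffer below go through Atomics.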
A Production Ring Buffer for Transaction Batching
```javascript
// shared-ring-buffer.js
// Shared between main thread and worker threads via SharedArrayBuffer
const CAPACITY = 65536; // 64K slots
const SLOT_SIZE = 256; // 256 bytes per slot
const BUFFER_BYTES = CAPACITY * SLOT_SIZE;
const CONTROL_BYTES = 4 * 4; // 4 Int32 control values

export function createSharedRingBuffer() {
  const dataBuffer = new SharedArrayBuffer(BUFFER_BYTES);
  const controlBuffer = new SharedArrayBuffer(CONTROL_BYTES);
  return {
    data: new Uint8Array(dataBuffer),
    control: new Int32Array(controlBuffer),
    // control[0] = write index
    // control[1] = read index
    // control[2] = producer notification flag
    // control[3] = consumer notification flag
  };
}

// Producer (main thread): write transaction into next slot.
// Assumes a single producer; payloads longer than SLOT_SIZE are truncated.
export function writeToRing(ring, transactionBytes) {
  const writeIdx = Atomics.load(ring.control, 0);
  const nextIdx = (writeIdx + 1) % CAPACITY;
  const readIdx = Atomics.load(ring.control, 1);
  if (nextIdx === readIdx) return false; // buffer full

  const offset = writeIdx * SLOT_SIZE;
  ring.data.set(transactionBytes.subarray(0, SLOT_SIZE), offset);

  Atomics.store(ring.control, 0, nextIdx);
  Atomics.notify(ring.control, 0, 1); // wake one waiting consumer
  return true;
}

// Consumer (worker thread): read transaction from ring
export function readFromRing(ring) {
  const readIdx = Atomics.load(ring.control, 1);
  const writeIdx = Atomics.load(ring.control, 0);

  if (readIdx === writeIdx) {
    // Buffer empty: block until the write index changes, then let the caller retry
    Atomics.wait(ring.control, 0, writeIdx);
    return null;
  }

  const offset = readIdx * SLOT_SIZE;
  const slot = ring.data.slice(offset, offset + SLOT_SIZE); // private copy

  // Several workers read concurrently, so claim the slot with compare-and-swap.
  // If another consumer advanced the read index first, discard the copy and retry.
  const nextRead = (readIdx + 1) % CAPACITY;
  if (Atomics.compareExchange(ring.control, 1, readIdx, nextRead) !== readIdx) return null;
  return slot;
}
```
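```javascript
// main.js (producer)
import { Worker } from 'node:worker_threads';
import { createSharedRingBuffer, writeToRing } from './shared-ring-buffer.js';

const ring = createSharedRingBuffer();

// Pass the SharedArrayBuffer to workers (zero-copy: same memory)
const workers = Array.from({ length: 8 }, () =>
  new Worker('./processor-worker.js', {
    workerData: {
      dataBuffer: ring.data.buffer,
      controlBuffer: ring.control.buffer,
    }
  })
);

// Write incoming transactions into the ring buffer.
// socket, parse and serialize are application-level and defined elsewhere.
socket.on('data', (chunk) => {
  const transactions = parse(chunk);
  for (const tx of transactions) {
    const bytes = serialize(tx);
    // One copy into shared memory; no per-message postMessage serialization
    if (!writeToRing(ring, bytes)) {
      // Ring full: apply backpressure (e.g. socket.pause()) rather than dropping silently
      console.warn('Ring buffer full; transaction dropped');
    }
  }
});
```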
```javascript
// processor-worker.js (consumer)
import { workerData } from 'node:worker_threads';
import { readFromRing } from './shared-ring-buffer.js';

const data = new Uint8Array(workerData.dataBuffer);
const control = new Int32Array(workerData.controlBuffer);
const ring = { data, control };

// Worker continuously reads from shared ring buffer.
// deserialize and processTransaction are application helpers defined elsewhere.
while (true) {
  const slot = readFromRing(ring);
  if (slot) {
    const transaction = deserialize(slot);
    processTransaction(transaction);
  }
}
```
Throughput comparison:
- postMessage with 1KB payload: ~8,300 msg/sec
- SharedArrayBuffer ring buffer (256-byte slots): ~850,000 writes/sec (about 100x faster)
The ring buffer is particularly effective when the main thread is ingesting faster than workers can consume — the shared memory acts as a natural back-buffer without any allocation.
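Fixed-size slots leave one practical gap: a consumer cannot tell how many bytes of a slot are payload. One common way to close it, sketched here as an assumption rather than as part of the design above, is to reserve the first 4 bytes of each slot for the payload length. `writeSlot` and `readSlot` are hypothetical helpers that handle slot layout only; index management and synchronization stay with the ring buffer.

```javascript
// Length-prefixed slots: the first 4 bytes of each slot hold the payload length,
// so a consumer can recover exactly the bytes that were written.
const SLOT_SIZE = 256;
const HEADER_BYTES = 4;
const MAX_PAYLOAD = SLOT_SIZE - HEADER_BYTES;

function writeSlot(data, slotIndex, payload) {
  if (payload.length > MAX_PAYLOAD) return false; // reject instead of truncating
  const offset = slotIndex * SLOT_SIZE;
  new DataView(data.buffer, data.byteOffset + offset, HEADER_BYTES).setUint32(0, payload.length);
  data.set(payload, offset + HEADER_BYTES);
  return true;
}

function readSlot(data, slotIndex) {
  const offset = slotIndex * SLOT_SIZE;
  const length = new DataView(data.buffer, data.byteOffset + offset, HEADER_BYTES).getUint32(0);
  // slice() copies out of shared memory into a private buffer
  return data.slice(offset + HEADER_BYTES, offset + HEADER_BYTES + length);
}

const data = new Uint8Array(new SharedArrayBuffer(SLOT_SIZE * 4));
const payload = new TextEncoder().encode('tx:0xabc123');
writeSlot(data, 2, payload);
const roundTripped = new TextDecoder().decode(readSlot(data, 2));
console.log(roundTripped); // tx:0xabc123
```

Payloads larger than a slot (such as the 5KB transactions discussed earlier) need either bigger slots or a scheme that spans several slots; this sketch simply refuses them.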
Structured Concurrency: Managing Worker Pools Gracefully
A worker pool without lifecycle management leaks workers, misses errors, and fails ungracefully on shutdown.
```javascript
import { Worker } from 'node:worker_threads';
import { cpus } from 'node:os';

class WorkerPool {
  #workers = [];
  #tasks = new Map(); // id -> { worker, promise, resolve, reject }
  #nextId = 0;
  #closing = false;
  #workerScript;

  constructor(workerScript, size = cpus().length) {
    this.#workerScript = workerScript;
    for (let i = 0; i < size; i++) {
      this.#createWorker();
    }
  }

  #createWorker() {
    const worker = new Worker(this.#workerScript);

    worker.on('message', ({ id, result, error }) => {
      const task = this.#tasks.get(id);
      if (!task) return;
      this.#tasks.delete(id);
      if (error) task.reject(new Error(error));
      else task.resolve(result);
    });

    worker.on('error', (err) => {
      console.error(`Worker error: ${err.message}`);
      // Fail the tasks that were in flight on this worker so callers don't hang
      for (const [id, task] of this.#tasks) {
        if (task.worker !== worker) continue;
        this.#tasks.delete(id);
        task.reject(err);
      }
      this.#workers = this.#workers.filter(w => w !== worker);
      if (!this.#closing) this.#createWorker(); // replace failed worker
    });

    this.#workers.push(worker);
  }

  run(payload) {
    if (this.#closing) return Promise.reject(new Error('Pool is shutting down'));
    const id = this.#nextId++;
    const task = { worker: this.#workers[id % this.#workers.length] }; // round-robin
    task.promise = new Promise((resolve, reject) => {
      task.resolve = resolve;
      task.reject = reject;
    });
    this.#tasks.set(id, task);
    task.worker.postMessage({ id, payload });
    return task.promise;
  }

  async shutdown() {
    // Stop accepting work, wait for in-flight tasks to settle, then terminate
    this.#closing = true;
    await Promise.allSettled([...this.#tasks.values()].map(t => t.promise));
    await Promise.all(this.#workers.map(w => w.terminate()));
  }
}

// Usage for signature verification pool
const sigPool = new WorkerPool('./sig-verifier.js', cpus().length);

// Graceful shutdown on SIGTERM
process.on('SIGTERM', async () => {
  await sigPool.shutdown();
  process.exit(0);
});
```
The Production Incident: IPC Bottleneck Hiding as a CPU Problem
Context: A blockchain indexer using cluster with 16 workers. Each worker receives transaction events via IPC from the primary process, processes them, and writes to PostgreSQL.
The symptom: Throughput plateaued at 4,200 transactions/second on a 16-core server. CPU utilization across all cores: 15%. Neither the database nor the network was the bottleneck. Adding more workers didn't help.
The diagnosis:
```bash
# Measure time spent in IPC send/receive
strace -p $(pgrep -f "node indexer" | head -1) -e trace=sendmsg,recvmsg -T 2>&1 | head -50
# sendmsg(5, ..., 0) = 8192 <0.000423>
# recvmsg(5, ..., MSG_DONTWAIT) = 8192 <0.000512>
```
All worker communication was going through the primary's IPC loop. The primary process was serializing incoming transaction data with JSON.stringify, sending each transaction to one of the 16 workers via worker.send(), and receiving acknowledgements back. At 4,200 transactions/second, the primary was doing:
- 4,200 JSON.stringify calls/sec (each ~5KB payload = 21MB/sec of serialization)
- 4,200 sendmsg syscalls/sec
- 4,200 recvmsg syscalls/sec

The matching 4,200 JSON.parse calls/sec were spread across the 16 workers (roughly 260 per worker), which is why the workers sat mostly idle.
The primary's event loop utilization (ELU) hit 0.97. The primary was the bottleneck, not the workers.
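ELU is cheap to sample continuously, so this kind of saturation can be caught before it becomes an incident. The sketch below uses `performance.eventLoopUtilization()` from Node.js's perf_hooks API, available on the global `performance` object. The one-second window, the 0.9 threshold, and the helper name `createEluSampler` are assumptions to tune for your workload.

```javascript
// Sample event loop utilization (ELU) over fixed windows.
function createEluSampler() {
  let previous = performance.eventLoopUtilization();
  return function sample() {
    const current = performance.eventLoopUtilization();
    // Passing the earlier reading yields the utilization for just that window
    const delta = performance.eventLoopUtilization(current, previous);
    previous = current;
    // Guard against an empty window (0 / 0) right at startup
    return Number.isFinite(delta.utilization) ? delta.utilization : 0;
  };
}

const sampleElu = createEluSampler();
const WARN_THRESHOLD = 0.9; // assumed alert level

const timer = setInterval(() => {
  const elu = sampleElu(); // 0 = idle, 1 = fully busy
  if (elu > WARN_THRESHOLD) {
    console.warn(`ELU ${elu.toFixed(2)}: event loop is saturated`);
  }
}, 1000);
timer.unref(); // do not keep the process alive just for monitoring
```

Running the same sampler in the primary and in each worker makes the asymmetry from this incident visible at a glance: a high reading on the primary alongside low readings on the workers.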
The fix — switch to SO_REUSEPORT so workers receive connections directly:
```javascript
// Before: primary distributes work via IPC (bottleneck)
// After: each worker accepts connections directly from the OS
if (cluster.isPrimary) {
  for (let i = 0; i < 16; i++) cluster.fork();
} else {
  // Worker accepts connections directly: no IPC for each transaction
  const server = net.createServer(handleTransaction);
  // reusePort is a listen() option (Node.js 22.12+ / 23.1+)
  server.listen({ port: 3000, reusePort: true });
}
```
After the fix: throughput jumped to 28,000 transactions/second. Primary ELU dropped to 0.02. CPU across workers: 65% (healthy, room to grow). The IPC overhead had consumed 85% of the primary's capacity, visible only via strace and ELU measurement.
Summary
| Concept | Key Takeaway |
|---|---|
| cluster | Full process fork. Complete isolation. Memory: ~150MB extra per worker after warmup. |
| SO_REUSEPORT | Workers bind port directly. Kernel distributes connections. Eliminates master bottleneck. |
| child_process.spawn | Streaming I/O to external processes. Use for Go/Rust/Python integrations. |
| child_process.fork | Node-to-Node IPC. JSON serialization over Unix socket. Avoid for high-frequency messages. |
| worker_threads | Threads with own V8 heap. postMessage for messages, SharedArrayBuffer for bulk data. |
| IPC postMessage cost | ~0.05ms for 100 bytes, ~7ms for 100KB. 8,300 msg/sec max for 1KB payloads. |
| SharedArrayBuffer | Zero-copy shared memory. 850K+ writes/sec for 256-byte slots. 100x faster than postMessage for bulk data. |
| Atomics.wait/notify | Thread synchronization on SharedArrayBuffer. Wait without spinning. |
| Worker pool lifecycle | Handle worker errors, replace crashed workers, drain on SIGTERM. |
| IPC bottleneck pattern | ELU 0.97 on primary, low ELU on workers, plateau in throughput → primary is the bottleneck. |
Clustering gets you horizontal scale across cores. Module 7 covers the routing layer — what happens to every incoming request before it reaches your business logic, and why the router you choose can cost you 3x throughput before a single line of your code runs.