Module A-4·31 min read

The raw network stack: how a transaction payload travels from NIC to Node.js runtime boundary via epoll and libuv.

Module 3 — Kernel-Level I/O Multiplexing: epoll, kqueue, IOCP

Q: Why is Linux's `epoll` crucial for Node.js's ability to handle tens of thousands of concurrent connections, compared to the older `select` or `poll` interfaces?

`epoll_wait` operates in O(ready events) time by returning only the file descriptors that have activity, rather than scanning all registered descriptors. — The `epoll` mechanism maintains an internal ready-list in the kernel and returns only the file descriptors that are actually ready for I/O. This makes `epoll_wait` scale efficiently with the number of *active* events (O(ready)), whereas `select` and `poll` require the kernel and application to scan every registered descriptor on every call (O(N)).

Q: Which kernel-level parameter is most likely misconfigured if large TCP payloads arrive in bursty, inconsistent chunks because TCP flow control frequently forces the sender to pause?

`SO_RCVBUF` — `SO_RCVBUF` defines the kernel's receive buffer size for a socket. If a massive payload arrives and the buffer is too small (like the default 85KB), the buffer quickly fills up. This causes TCP flow control to advertise a zero window, forcing the sender to pause until Node.js drains the data, leading to bursty throughput.

Q: To mitigate the CPU and connection pool exhaustion caused by a "thundering herd" of simultaneous reconnects, which strategy is the most effective at the application layer?

Rate-limit connection processing in the `connection` event handler by deferring socket initialization logic, while relying on the kernel's SYN backlog. — During a reconnect storm, the kernel can quickly accept connections into the SYN backlog, but simultaneously executing heavy initialization logic (like DB authentication) for thousands of connections will crash the app. Deferring and rate-limiting the *application-level* processing of these sockets prevents resource exhaustion.

What this module covers: Node.js's non-blocking I/O is not magic — it is a precise coordination between libuv and OS-level kernel interfaces. epoll on Linux, kqueue on macOS/BSD, and IOCP on Windows are the actual mechanisms that make thousands of concurrent connections possible without thousands of threads. Understanding them at the system call level lets you predict Node.js network behavior under load, tune kernel parameters correctly, and diagnose failures that are invisible from the JavaScript layer. This module traces a transaction payload from the network interface card to your callback with no hand-waving.

Why Kernel I/O Interfaces Matter

When a Node.js blockchain indexer handles 50,000 concurrent TCP connections from blockchain full nodes, each pushing new block data, two questions matter:

How does the OS tell Node.js that one of those 50,000 connections has new data ready to read?
How does Node.js read that data without blocking on connections that have no data?

The answer to both questions is the I/O multiplexer — a kernel interface that watches many file descriptors simultaneously and notifies the application when any of them are ready.

The three implementations differ in important ways. Knowing which one you're on and how it works determines which kernel parameters you can tune and how Node.js behaves under connection-heavy loads.

The Evolution: `select` → `poll` → `epoll`

`select`: The Original (and Still Worst)

select is the POSIX I/O multiplexer available on every Unix system. It watches a set of file descriptors and returns when any of them are ready.

// select signature (simplified)
int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
           struct timeval *timeout);

Critical limitation: O(N) at both submission and return. The fd_set is a fixed-size bitmap. The kernel scans every bit up to nfds to find ready descriptors, and you scan the bitmap again to find which ones fired. At 50,000 file descriptors, that would be 50,000 bit checks twice on every call to select. In practice you cannot get that far: an fd_set holds FD_SETSIZE bits, which is 1,024 on Linux, so select cannot watch any descriptor numbered 1,024 or higher.

Node.js does not use select.

`poll`: Slightly Better

poll removes the 1,024 limit but keeps the O(N) scan:

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

You pass an array of pollfd structs. The kernel fills in revents for each one that's ready. You still scan the entire array to find which descriptors fired. At 50,000 connections, this is 50,000 struct scans on every call.

Node.js does not use poll for network sockets (though libuv falls back to it on some platforms for specific operations).

`epoll`: Linux's Answer (What Node.js Uses)

epoll was introduced in Linux 2.5.44 (2002) and is the foundation of Node.js's I/O on Linux. It solves the O(N) problem:

// Create an epoll instance
int epfd = epoll_create1(0);

// Register a file descriptor to watch
struct epoll_event ev;
ev.events = EPOLLIN;        // watch for readable data
ev.data.fd = socket_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, socket_fd, &ev);

// Wait for events — returns only READY descriptors
int n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
// n = number of ready events, events[] contains only the ready ones

The key difference: epoll_wait returns only the file descriptors that are actually ready. If 50,000 connections are registered but only 3 have new data, epoll_wait returns 3 — not 50,000. The kernel maintains an internal ready-list and adds to it when descriptors become ready via interrupt-driven notification.

The result: epoll_wait is O(ready events), not O(total registered). Registering 50,000 connections costs 50,000 epoll_ctl calls at setup time, but each subsequent epoll_wait scales with activity, not connection count.
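The difference in cost can be illustrated with a toy model. This is a minimal sketch, not how the kernel is implemented: `ToyMultiplexer`, `waitByScanning`, and `waitByReadyList` are hypothetical names, and the "ready list" is just a JavaScript `Set`.

```javascript
// Toy model only: real epoll lives in the kernel and is driven by interrupts.
class ToyMultiplexer {
  constructor() {
    this.registered = new Set(); // every watched descriptor
    this.ready = new Set();      // only descriptors with pending data
  }
  register(fd) { this.registered.add(fd); }
  markReady(fd) { if (this.registered.has(fd)) this.ready.add(fd); }

  // select/poll style: inspect every registered descriptor on each call.
  waitByScanning() {
    let inspected = 0;
    const fired = [];
    for (const fd of this.registered) {
      inspected++;
      if (this.ready.has(fd)) fired.push(fd);
    }
    this.ready.clear();
    return { fired, inspected };
  }

  // epoll style: hand back the ready list; cost depends on its length only.
  waitByReadyList() {
    const fired = [...this.ready];
    this.ready.clear();
    return { fired, inspected: fired.length };
  }
}

function demo(strategy) {
  const mux = new ToyMultiplexer();
  for (let fd = 0; fd < 50000; fd++) mux.register(fd);
  for (const fd of [7, 4242, 31337]) mux.markReady(fd);
  return strategy === 'scan' ? mux.waitByScanning() : mux.waitByReadyList();
}

console.log(demo('scan').inspected);      // 50000
console.log(demo('readyList').inspected); // 3
```

Both strategies report the same three descriptors; only the amount of work per call differs.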

The Three `epoll` System Calls

Understanding these three calls precisely demystifies libuv's I/O poll phase.

`epoll_create1`

int epfd = epoll_create1(EPOLL_CLOEXEC);

Creates an epoll instance — a kernel data structure that maintains the watched set and the ready list. Returns a file descriptor. EPOLL_CLOEXEC ensures the descriptor is closed if the process exec's (security hygiene).

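libuv calls this once per event loop, when the loop is initialized. That single epfd watches the loop's sockets and pipes, plus the internal descriptors libuv uses for signal delivery and cross-thread wakeups. Timers are not registered with epoll: libuv tracks them separately and uses the nearest expiry to compute the epoll_wait timeout.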

`epoll_ctl`

// Add a new file descriptor to watch
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);

// Modify the events to watch on an existing fd
epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);

// Remove a file descriptor
epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);

Called by libuv when:

A new TCP connection is accepted and reading starts → EPOLL_CTL_ADD with EPOLLIN (EPOLLOUT is added only while writes are pending)
A socket transitions from write-interested to read-only → EPOLL_CTL_MOD
A connection closes → EPOLL_CTL_DEL

For a blockchain indexer receiving 50,000 concurrent connections: 50,000 EPOLL_CTL_ADD calls at connection time. One epfd watching all 50,000.

`epoll_wait`

struct epoll_event events[MAX_EVENTS];
int n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
for (int i = 0; i < n; i++) {
  handle_event(events[i].data.fd, events[i].events);
}

This is called by libuv in the poll phase of the event loop. It blocks until:

At least one registered fd is ready, OR
The timeout expires (calculated by libuv based on pending timers)

When it returns, events[0..n-1] contains only the ready descriptors. libuv processes each one, dispatching the appropriate callback.
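The timeout calculation mentioned above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not libuv's code: `computePollTimeout` and its parameters are hypothetical, and the real logic (exposed in libuv as `uv_backend_timeout`) considers more conditions than these three.

```javascript
// Simplified sketch of how an event loop could pick the epoll_wait timeout.
// Returns milliseconds; -1 means "block until a descriptor is ready".
function computePollTimeout(nowMs, timerDueTimesMs, hasImmediateWork) {
  if (hasImmediateWork) return 0;              // more work is queued: do not block
  if (timerDueTimesMs.length === 0) return -1; // no timers: wait indefinitely
  const nextDue = Math.min(...timerDueTimesMs);
  return Math.max(0, nextDue - nowMs);         // overdue timers also mean "do not block"
}

console.log(computePollTimeout(1000, [1100, 5000], false)); // 100
console.log(computePollTimeout(1000, [], false));           // -1
console.log(computePollTimeout(1000, [900], false));        // 0
console.log(computePollTimeout(1000, [1100], true));        // 0
```

The first case corresponds to a timer due in 100 ms: the loop may sleep in epoll_wait for at most that long before it has to run the timer callback.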

The Complete Journey: NIC to JavaScript Callback

Let me trace a transaction payload from a blockchain full node through every layer to your callback. This path runs for every chunk of data a production indexer receives.

text

1. [Network] Full node sends TCP segment containing block data
              ↓
2. [NIC] Network interface card receives the packet via DMA
         NIC fires a hardware interrupt to notify the CPU
              ↓
3. [Kernel — Interrupt Handler]
         CPU handles the interrupt in kernel context
         Copies packet data from NIC buffer to kernel socket receive buffer
         (SO_RCVBUF — we'll tune this below)
              ↓
4. [Kernel — Socket State Machine]
         TCP stack processes the segment: acknowledgement, reassembly
         If the socket was watching EPOLLIN, marks it ready in epoll's ready list
              ↓
5. [epoll_wait returns]
         libuv's poll phase was blocked in epoll_wait
         epoll_wait returns with this socket in the events[] array
              ↓
6. [libuv — I/O Poll Phase]
         libuv calls the registered read callback for this socket handle
         The callback calls recv() / read() to copy data from kernel buffer
         to userspace (libuv's read buffer)
              ↓
7. [libuv → Node.js]
         libuv fires the 'data' event on the corresponding net.Socket
         Passes the data as a Buffer (or calls the registered stream handler)
              ↓
8. [JavaScript — Event Loop]
         Your 'data' callback runs
         You parse the Buffer, extract the transaction
         Issue async database write (non-blocking)
         Return from callback
              ↓
9. [Event loop continues]
         Next ready socket's data callback runs
         Or the loop blocks in epoll_wait again if nothing is ready

Every step from 1 to 7 happens in kernel or libuv C code, with zero JavaScript involvement. Step 8 is the only JavaScript. This is why a single Node.js process can handle 50,000 connections: steps 1–5 are handled by the kernel's interrupt-driven mechanisms, and steps 6–7 run in libuv on the event loop thread, so no per-connection application threads are needed.
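From the JavaScript side, the whole journey is visible only as a 'data' event. The sketch below sends a payload over a loopback connection to show where step 8 sits; `roundTrip` is a hypothetical helper, and the loopback interface skips the physical NIC, so only the socket and event loop layers are exercised.

```javascript
const net = require('node:net');

// Sends `payload` to a throwaway loopback server and resolves with the bytes
// the server's 'data' handler received. Port 0 lets the OS pick a free port.
function roundTrip(payload) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    const server = net.createServer((socket) => {
      // Step 8: the only JavaScript on the receive path.
      socket.on('data', (chunk) => chunks.push(chunk));
      socket.on('end', () => {
        socket.end();
        server.close();
        resolve(Buffer.concat(chunks));
      });
      socket.on('error', reject);
    });
    server.on('error', reject);
    server.listen(0, '127.0.0.1', () => {
      const { port } = server.address();
      const client = net.connect(port, '127.0.0.1', () => client.end(payload));
      client.on('error', reject);
    });
  });
}

roundTrip(Buffer.from('block #1 payload')).then((received) => {
  console.log(received.toString()); // bytes may arrive as one or several chunks
});
```

Collecting chunks until 'end' matters: TCP is a byte stream, so a single write on the sender can surface as several 'data' events on the receiver.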

`kqueue`: macOS and BSD

kqueue is the BSD equivalent of epoll. The API is different but the concept is identical: an event queue that you register file descriptors into and poll for readiness notifications.

// Create a kqueue instance
int kq = kqueue();

// Register a socket for read readiness
struct kevent change;
EV_SET(&change, socket_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
kevent(kq, &change, 1, NULL, 0, NULL);  // register (no wait)

// Wait for events
struct kevent events[MAX_EVENTS];
int n = kevent(kq, NULL, 0, events, MAX_EVENTS, NULL);  // wait

Node.js on macOS uses kqueue via libuv's kqueue backend. The behavior is identical to epoll from the application perspective.

Practical difference for development: macOS often has lower default file descriptor limits (ulimit -n). If you're testing a high-connection-count scenario locally on macOS, you may need:

bash

# Check current limit
ulimit -n

# Increase for current session (testing only)
ulimit -n 65536

# Permanent: use launchctl limit maxfiles or a LaunchDaemon plist
# (/etc/launchd.conf is ignored on recent macOS releases)

IOCP: Windows (Completion-Based)

Windows uses I/O Completion Ports (IOCP) — a fundamentally different model. Where epoll/kqueue are readiness-based (notify when an fd is ready to read/write), IOCP is completion-based (notify when an I/O operation has completed and the data is already in your buffer).

text

epoll/kqueue model:
  "Socket X has data available. Go read it."
  → Application calls recv() to copy data from kernel to userspace buffer

IOCP model:
  "I completed reading from socket X. The data is already in buffer Y at offset Z."
  → Application processes the data directly

IOCP can be more efficient on Windows because the read is posted in advance with an application-supplied buffer, so no separate recv() call is needed after the notification: the data is already in place when the completion arrives. libuv implements an IOCP backend for Windows, maintaining the same Node.js API surface across platforms.
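The two notification styles can be contrasted with a toy socket. This is only a sketch of the control flow: `ToySocket`, `watchReadable`, `postRead`, and `deliver` are hypothetical names, and no real epoll or IOCP calls are involved.

```javascript
// Toy "kernel side" of one socket, supporting both notification styles.
class ToySocket {
  constructor() {
    this.kernelBuffer = [];
    this.onReadable = null;
    this.pendingRead = null;
  }

  // Readiness model (epoll/kqueue): ask to be told when data is available.
  watchReadable(callback) { this.onReadable = callback; }
  recv() {
    const data = this.kernelBuffer.join('');
    this.kernelBuffer = [];
    return data;
  }

  // Completion model (IOCP): post a read with a destination buffer up front.
  postRead(destination, onComplete) {
    this.pendingRead = { destination, onComplete };
  }

  // Simulates a segment arriving from the network.
  deliver(data) {
    if (this.pendingRead) {
      const { destination, onComplete } = this.pendingRead;
      this.pendingRead = null;
      destination.push(data);   // the "kernel" fills the caller's buffer
      onComplete(data.length);  // then reports that the read completed
    } else {
      this.kernelBuffer.push(data);
      if (this.onReadable) this.onReadable(); // "socket is readable, go read it"
    }
  }
}

const log = [];

const readinessSocket = new ToySocket();
readinessSocket.watchReadable(() => log.push('readiness: ' + readinessSocket.recv()));
readinessSocket.deliver('tx-1');

const completionSocket = new ToySocket();
const appBuffer = [];
completionSocket.postRead(appBuffer, (n) => log.push(`completion: ${n} bytes already in buffer`));
completionSocket.deliver('tx-2');

console.log(log);
```

In the readiness case the application performs the copy itself inside the callback; in the completion case the copy has happened before the callback runs.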

Kernel Socket Buffer Tuning

The kernel maintains receive (SO_RCVBUF) and send (SO_SNDBUF) buffers for each TCP socket. These buffers sit between the NIC and the application.

`SO_RCVBUF`: The Receive Buffer

When a blockchain full node is pushing multi-MB block data, the kernel's receive buffer determines how much data can accumulate before the sender must stop (TCP flow control kicks in).

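Default: 87,380 bytes (about 85KB) on many Linux systems. For TCP this is the middle value of net.ipv4.tcp_rmem, and some newer kernels ship a larger value.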

For a blockchain indexer receiving large block payloads:

javascript

// Per-connection socket options in Node.js
import net from 'node:net';

const server = net.createServer((socket) => {
  // The connection is already established when this callback runs
  socket.setNoDelay(true);     // disable Nagle
  socket.setKeepAlive(true, 30000);

  // net.Socket has no API for SO_RCVBUF. To change the receive buffer size,
  // use a native addon or set system-wide defaults via sysctl

  socket.on('data', (chunk) => {
    // chunk.length is bounded by the size of each read (65,536 bytes in the
    // strace output later in this module), not by SO_RCVBUF
    processBlockData(chunk);   // application-defined handler
  });
});

bash

# Check current socket buffer defaults
sysctl net.core.rmem_default     # generic default receive buffer (TCP uses tcp_rmem instead)
sysctl net.core.rmem_max         # maximum receive buffer an application can request
sysctl net.core.wmem_default     # default send buffer
sysctl net.ipv4.tcp_rmem         # TCP receive buffer: min, default, max (bytes)

# For high-throughput blockchain data streams: increase receive buffer
sudo sysctl -w net.core.rmem_default=262144   # 256KB default
sudo sysctl -w net.core.rmem_max=16777216     # 16MB max (for large block payloads)
sudo sysctl -w net.ipv4.tcp_rmem="4096 262144 16777216"

# Persist across reboots (tee runs as root; a plain >> redirect under sudo would not)
echo "net.core.rmem_default=262144" | sudo tee -a /etc/sysctl.conf
echo "net.core.rmem_max=16777216" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_rmem=4096 262144 16777216" | sudo tee -a /etc/sysctl.conf

When Buffer Size Matters

Too small (default 85KB): A full node pushes a 2MB Ethereum block into a receive buffer that cannot grow past roughly 85KB. The buffer fills, TCP advertises a zero window, and the sender pauses until Node.js drains the buffer and the window reopens. This stop-and-go cycle produces bursty throughput instead of smooth streaming, and under extreme load it adds latency.

Too large: More memory consumed per connection. With 50,000 connections at 2MB buffer each, that's 100GB of kernel buffer space — untenable.

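Correct approach: Size to your expected payload size, within your memory budget. For a blockchain indexer where individual blocks are 1–5MB, a 4MB receive buffer is reasonable as a ceiling for connections that are actively streaming blocks, not as a fixed allocation for all 50,000 connections. Raise rmem_max to 16MB and let the OS auto-tune each socket within the tcp_rmem range, so buffers grow only where data actually queues up.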
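The memory trade-off is simple multiplication, sketched here with decimal units to match the round numbers in the text. `worstCaseReceiveMemoryGB` is a hypothetical helper, and the figure of 200 actively streaming connections in the last line is an assumption chosen for illustration.

```javascript
const MB = 1e6; // decimal megabytes, matching the round numbers in the text
const GB = 1e9;

// Upper bound: every socket's receive buffer completely full at the same time.
function worstCaseReceiveMemoryGB(connections, bufferBytesPerSocket) {
  return (connections * bufferBytesPerSocket) / GB;
}

console.log(worstCaseReceiveMemoryGB(50000, 2 * MB)); // 100
console.log(worstCaseReceiveMemoryGB(50000, 87380));  // 4.369
console.log(worstCaseReceiveMemoryGB(200, 4 * MB));   // 0.8
```

This is a worst-case bound. Buffers consume memory only while unread data sits in them, which is why a large ceiling combined with auto-tuning is usually cheaper than the bound suggests.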

`TCP_NODELAY` and Nagle's Algorithm

Nagle's Algorithm was designed to reduce the number of small TCP packets. It buffers small writes and coalesces them:

text

Nagle's rule: send immediately IF:
  - The buffer contains a full-size segment (MSS), OR
  - All previously sent data has been acknowledged

Otherwise: buffer and wait for ACK or MSS

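For a real-time payment gateway or WebSocket notification system, Nagle introduces visible latency. A 50-byte payment acknowledgement written while earlier data is still unacknowledged is held until that ACK arrives or a full 1460-byte segment accumulates. If the peer delays its ACKs, the wait can reach tens of milliseconds.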

javascript

// Disable Nagle for low-latency payment ACKs
socket.setNoDelay(true);   // sets TCP_NODELAY

// Where to set it:
const server = net.createServer((socket) => {
  socket.setNoDelay(true);  // as early as possible on new connections
});

When to keep Nagle enabled: bulk data transfers where throughput matters more than latency. A blockchain full node pushing bulk historical block data benefits from Nagle — fewer, larger packets use network bandwidth more efficiently.

The rule for a hybrid system: disable Nagle on connections that send real-time acknowledgements; enable (or leave default) on connections used for bulk data ingestion.
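The send decision described by Nagle's rule can be written as a small predicate. This sketch encodes only the two conditions listed above plus the TCP_NODELAY override; `nagleSendsImmediately` and its parameter names are hypothetical, and real TCP stacks apply additional conditions.

```javascript
// Decision rule as stated in this section; real TCP stacks add more conditions.
function nagleSendsImmediately({ queuedBytes, mss, unackedBytes, noDelay = false }) {
  if (noDelay) return true;            // TCP_NODELAY: never hold small writes
  if (queuedBytes >= mss) return true; // a full-size segment is ready
  return unackedBytes === 0;           // nothing in flight: send right away
}

// 50-byte ACK while an earlier write is still unacknowledged: held back.
console.log(nagleSendsImmediately({ queuedBytes: 50, mss: 1460, unackedBytes: 200 })); // false

// Same write with TCP_NODELAY set: sent immediately.
console.log(nagleSendsImmediately({ queuedBytes: 50, mss: 1460, unackedBytes: 200, noDelay: true })); // true

// Bulk transfer: full segments go out regardless of outstanding ACKs.
console.log(nagleSendsImmediately({ queuedBytes: 4096, mss: 1460, unackedBytes: 200 })); // true
```

The third case shows why bulk ingestion rarely suffers from Nagle: large writes always have a full segment ready to send.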

`SO_REUSEPORT`: Kernel-Level Load Distribution

By default, only one process can bind to a given port. SO_REUSEPORT allows multiple processes (or sockets) to bind to the same port, with the kernel distributing incoming connections between them using a hash of (source IP, source port, dest IP, dest port).

This is an effective way to run Node.js cluster for TCP workloads: instead of the primary process accepting every connection and handing it to a worker in userspace (round-robin), the kernel selects the listening socket itself by hashing the connection's 4-tuple.

javascript

// Using SO_REUSEPORT for multiple workers bound to the same port
// reusePort is an option of server.listen(), added in Node.js v23.1.0 / v22.12.0,
// and it only works on platforms that support SO_REUSEPORT (for example Linux 3.9+)

import cluster from 'node:cluster';
import net from 'node:net';
import { cpus } from 'node:os';

if (cluster.isPrimary) {
  for (let i = 0; i < cpus().length; i++) {
    cluster.fork();
  }
} else {
  // Each worker creates its OWN server on the same port
  // The kernel distributes connections between workers
  const server = net.createServer((socket) => {
    socket.setNoDelay(true);
    handleConnection(socket);   // application-defined handler
  });

  server.listen({ port: 3000, reusePort: true }, () => {
    console.log(`Worker ${process.pid} listening on :3000`);
  });
}

Why this beats standard cluster round-robin:

Standard cluster: master process accepts ALL connections, sends to workers via IPC
SO_REUSEPORT: kernel accepts connections directly into each worker's accept queue
No IPC overhead. No single-process bottleneck at the accept queue.
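Stable mapping: the 4-tuple hash always selects the same listening socket for a given connection, so all of its packets stay with one worker (important for stateful protocol handling). A reconnect from a new source port may land on a different worker.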
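The hash-based selection can be sketched in a few lines. The kernel's actual hash function is internal and is not reproduced here: FNV-1a is a stand-in chosen for illustration, and `pickListener` plus the sample addresses are hypothetical.

```javascript
// Stand-in hash (FNV-1a, 32-bit). The kernel uses its own internal hash; only
// the "same 4-tuple, same listener" property is being illustrated here.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

function pickListener({ srcIp, srcPort, dstIp, dstPort }, listenerCount) {
  return fnv1a(`${srcIp}:${srcPort}>${dstIp}:${dstPort}`) % listenerCount;
}

const conn = { srcIp: '10.0.0.7', srcPort: 51000, dstIp: '10.0.0.1', dstPort: 3000 };
console.log(pickListener(conn, 8) === pickListener(conn, 8)); // true

// Spread of 5,000 simulated clients (distinct source ports) across 8 listeners.
const perListener = new Array(8).fill(0);
for (let i = 0; i < 5000; i++) {
  perListener[pickListener({ ...conn, srcPort: 20000 + i }, 8)]++;
}
console.log(perListener);
```

Because the source port is part of the hash input, a client that reconnects from a new ephemeral port is treated as a new connection and may be assigned to a different listener.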

Observing Syscall Patterns Under Load

strace (Linux) and dtruss (macOS) let you observe the actual system calls Node.js makes, confirming the kernel interactions described above.

bash

# Trace I/O-related syscalls for a running Node.js process
strace -p $(pgrep -f "node indexer") \
       -e trace=epoll_wait,epoll_ctl,read,write,accept4,close \
       -T   # show time spent in each syscall

Example output during active ingestion:

text

epoll_wait(5, [{EPOLLIN, {u32=12, u64=12}}], 1024, 100) = 1 <0.000421>
read(12, "\x82\x00\x00\x03\xf4...", 65536) = 4096 <0.000031>
epoll_wait(5, [{EPOLLIN, {u32=12, u64=12}}], 1024, 100) = 1 <0.000012>
read(12, "\x82\x00\x00\x03\xf4...", 65536) = 3892 <0.000028>
epoll_wait(5, [], 1024, 100) = 0 <0.100041>  ← 100ms idle wait

Reading this:

epoll_wait returning 1 = one socket is ready
read with < 65536 bytes = end of available data in buffer
epoll_wait returning 0 after 100ms = no activity, loop is idle

High-load pattern:

epoll_wait(5, [{EPOLLIN, ...}, {EPOLLIN, ...}, {EPOLLIN, ...}], 1024, 0) = 48 <0.000003>

epoll_wait returning 48 events with timeout=0 means the ready list is continuously populated — the event loop never goes idle. This is a healthy high-throughput state, not starvation, as long as each callback is fast.
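Reading long traces by eye gets tedious, so a small script can summarize them. The sketch below assumes the `strace -T` line format shown above; `summarizeEpollWait` is a hypothetical helper, and its regular expression has only been fitted to lines of that shape.

```javascript
// Summarizes epoll_wait lines from `strace -T` output.
function summarizeEpollWait(straceText) {
  const pattern = /^epoll_wait\(.*\)\s*=\s*(\d+)\s*<([\d.]+)>/;
  const summary = { calls: 0, events: 0, idleCalls: 0, secondsBlocked: 0 };
  for (const line of straceText.split('\n')) {
    const match = pattern.exec(line.trim());
    if (!match) continue; // read(), accept4(), and other syscalls
    const readyCount = Number(match[1]);
    summary.calls++;
    summary.events += readyCount;
    if (readyCount === 0) summary.idleCalls++;
    summary.secondsBlocked += Number(match[2]);
  }
  return summary;
}

const sample = [
  'epoll_wait(5, [{EPOLLIN, {u32=12, u64=12}}], 1024, 100) = 1 <0.000421>',
  'read(12, "\\x82\\x00\\x00\\x03\\xf4...", 65536) = 4096 <0.000031>',
  'epoll_wait(5, [{EPOLLIN, {u32=12, u64=12}}], 1024, 100) = 1 <0.000012>',
  'epoll_wait(5, [], 1024, 100) = 0 <0.100041>',
].join('\n');

console.log(summarizeEpollWait(sample));
```

A high ratio of events to calls with few idle calls matches the high-load pattern; many idle calls that each block for the full timeout indicate a mostly idle loop.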

`/proc/net/tcp` and Socket State Monitoring

The kernel exposes all TCP socket state via /proc/net/tcp (IPv4) and /proc/net/tcp6 (IPv6).

bash

# Count sockets by state
awk 'NR>1 {print $4}' /proc/net/tcp | sort | uniq -c | sort -rn

# States (hex): 01=ESTABLISHED, 06=TIME_WAIT, 0A=LISTEN, 0B=CLOSING
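
The same count can be produced from Node.js, which is convenient when exporting it as a metric. This is a minimal sketch: `countTcpStates` is a hypothetical helper, and the sample rows are made up and truncated to the first few columns.

```javascript
const TCP_STATES = {
  '01': 'ESTABLISHED', '02': 'SYN_SENT', '03': 'SYN_RECV', '04': 'FIN_WAIT1',
  '05': 'FIN_WAIT2', '06': 'TIME_WAIT', '07': 'CLOSE', '08': 'CLOSE_WAIT',
  '09': 'LAST_ACK', '0A': 'LISTEN', '0B': 'CLOSING',
};

// Counts sockets per state from the text of /proc/net/tcp (or /proc/net/tcp6).
function countTcpStates(procNetTcpText) {
  const counts = {};
  const rows = procNetTcpText.trim().split('\n').slice(1); // skip the header row
  for (const row of rows) {
    const columns = row.trim().split(/\s+/);
    if (columns.length < 4) continue; // ignore malformed rows
    const code = columns[3].toUpperCase();
    const name = TCP_STATES[code] ?? `UNKNOWN(${code})`;
    counts[name] = (counts[name] ?? 0) + 1;
  }
  return counts;
}

// Made-up sample rows, truncated to the first columns for readability.
const sample = `
  sl  local_address rem_address   st tx_queue rx_queue
   0: 00000000:0BB8 00000000:0000 0A 00000000:00000000
   1: 0100007F:0BB8 0100007F:C350 01 00000000:00000000
   2: 0100007F:0BB8 0100007F:C351 01 00000000:00000000
   3: 0100007F:0BB8 0100007F:C352 06 00000000:00000000
`;

console.log(countTcpStates(sample)); // { LISTEN: 1, ESTABLISHED: 2, TIME_WAIT: 1 }
```

On a Linux host, the input would come from reading the file, for example `fs.readFileSync('/proc/net/tcp', 'utf8')`.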

The SYN backlog: detecting connection drops

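When a blockchain full node makes 10,000 simultaneous new connections to your indexer, the kernel has to queue them until the application calls accept(). Two queues are involved: the SYN backlog holds half-open connections that have not yet completed the three-way handshake, and the accept queue (sized by the listen() backlog and capped by net.core.somaxconn) holds established connections waiting for accept().

If the SYN backlog fills and SYN cookies are disabled, the kernel drops new SYN packets silently. If the accept queue overflows, newly completed handshakes are dropped as well. Either way, clients get no response and retry or eventually time out.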

bash

# Check current SYN backlog setting
sysctl net.ipv4.tcp_max_syn_backlog  # default varies with kernel version and available memory

# For high-connection-rate indexers
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=65536
sudo sysctl -w net.core.somaxconn=65536  # also raise the listen() backlog

javascript

// In Node.js: pass backlog to listen() (the kernel caps it at somaxconn)
server.listen(3000, '0.0.0.0', 65536);  // explicit backlog parameter

bash

# Monitor SYN drops in real time
watch -n 1 "netstat -s | grep -i 'syn\|overflow\|drop'"

The Production Incident: epoll Thundering Herd on Reconnect

Context: A blockchain indexer with 5,000 persistent connections from blockchain full nodes. The indexer deployed a new version. All 5,000 full nodes detected the connection drop simultaneously and attempted to reconnect at the same moment.

What happened:

All 5,000 connection attempts arrived within 200ms. The kernel completed the handshakes and queued the connections in the listen backlog, but Node.js could only pick them up as fast as libuv's accept4 calls ran on the single event loop thread.

The real problem came next. Accepting 5,000 connections meant 5,000 epoll_ctl(EPOLL_CTL_ADD) registrations, and every client sent a "hello" protocol message as soon as it connected. The following epoll_wait calls therefore reported nearly all 5,000 sockets as readable, and libuv dispatched their callbacks back to back. Each callback triggered a database lookup to authenticate the reconnecting node. The database connection pool (default 10 connections) was exhausted instantly, leaving 4,990 lookups queued. Authentication took 30–120 seconds per connection instead of < 10ms. The thundering herd had caused a database pool exhaustion cascade.

The fix: rate-limit handshake processing in the `connection` handler, after the TCP accept:

javascript

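// Limit simultaneous new connections being processed
let activeHandshakes = 0;
const MAX_CONCURRENT_HANDSHAKES = 50;
const pendingConnections = [];

server.on('connection', (socket) => {
  // Without an 'error' listener, a reset on a deferred socket would crash the process
  socket.on('error', () => socket.destroy());

  if (activeHandshakes >= MAX_CONCURRENT_HANDSHAKES) {
    // Defer processing: the socket is accepted at TCP level and won't be dropped
    pendingConnections.push(socket);
    return;
  }

  processConnection(socket);
});

function processConnection(socket) {
  activeHandshakes++;
  socket.setNoDelay(true);

  authenticateNode(socket)
    .then(() => registerNode(socket))
    .catch(() => socket.destroy())   // failed handshake: drop the connection
    .finally(() => {
      activeHandshakes--;
      drainPending();
    });
}

// Start deferred connections while there is capacity, skipping sockets that
// closed while they were waiting
function drainPending() {
  while (activeHandshakes < MAX_CONCURRENT_HANDSHAKES && pendingConnections.length > 0) {
    const next = pendingConnections.shift();
    if (!next.destroyed) processConnection(next);
  }
}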

Additionally: SO_REUSEPORT across 8 workers distributed the initial accept load — instead of one process handling all 5,000 accept calls, 8 workers each handled ~625. Reconnection storm time dropped from 120s to 8s.
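The effect of the handshake limit can be checked in isolation, without sockets or a database. The sketch below is a simulation under stated assumptions: `createGate` and `simulateStorm` are hypothetical helpers, and `fakeAuthenticate` stands in for the database-backed authentication call with a 1 ms timer.

```javascript
// Generic concurrency gate: at most `max` tasks run at once, the rest wait in FIFO order.
function createGate(max) {
  let active = 0;
  let peak = 0;
  const waiting = [];

  function drain() {
    while (active < max && waiting.length > 0) {
      const { task, resolve, reject } = waiting.shift();
      active++;
      peak = Math.max(peak, active);
      task().then(resolve, reject).finally(() => {
        active--;
        drain();
      });
    }
  }

  return {
    run(task) {
      return new Promise((resolve, reject) => {
        waiting.push({ task, resolve, reject });
        drain();
      });
    },
    get peak() { return peak; },
  };
}

// Hypothetical stand-in for a database-backed authentication call.
const fakeAuthenticate = () => new Promise((resolve) => setTimeout(resolve, 1));

async function simulateStorm(connections, maxConcurrent) {
  const gate = createGate(maxConcurrent);
  await Promise.all(Array.from({ length: connections }, () => gate.run(fakeAuthenticate)));
  return gate.peak;
}

simulateStorm(5000, 50).then((peak) => {
  console.log(`peak concurrent handshakes: ${peak}`); // peak concurrent handshakes: 50
});
```

With the gate in place, the number of authentication calls in flight never exceeds the limit, so a pool of 10 database connections sees a bounded queue instead of 4,990 waiting lookups.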

Summary

Concept	Key Takeaway
`select`/`poll`	O(N) scan of all watched fds. Do not use for high-connection-count workloads.
`epoll` (Linux)	O(ready events). The foundation of Node.js I/O on Linux.
`kqueue` (macOS)	Conceptually identical to epoll. Same behavior, different API.
IOCP (Windows)	Completion-based (not readiness-based). Handled transparently by libuv.
NIC → JavaScript	9-step path: NIC interrupt → kernel TCP stack → epoll ready list → libuv poll phase → your callback.
`SO_RCVBUF`	Kernel receive buffer per socket. Tune for expected payload size. Default 85KB is too small for multi-MB blockchain block data.
`TCP_NODELAY`	Disables Nagle. Required for low-latency payment ACKs. Avoid for bulk data transfers.
`SO_REUSEPORT`	Multiple workers bind the same port. Kernel distributes connections by 4-tuple hash. No IPC overhead.
SYN backlog	Set `tcp_max_syn_backlog=65536` for high reconnect-rate scenarios.
`strace`	Direct observation of system calls — confirms epoll interaction patterns.
Thundering herd	Simultaneous reconnects + database pool exhaustion = cascade failure. Rate-limit connection processing.

You now know what happens at every layer below Node.js. Module 4 moves back up to the application layer and covers the HTTP/TCP subsystem — backpressure mechanisms, high-water marks, the drain event contract, and how to prevent socket floods from taking down your ingestion service.


