The raw network stack: how a transaction payload travels from NIC to Node.js runtime boundary via epoll and libuv.
Module 3 — Kernel-Level I/O Multiplexing: epoll, kqueue, IOCP
What this module covers: Node.js's non-blocking I/O is not magic — it is a precise coordination between libuv and OS-level kernel interfaces. epoll on Linux, kqueue on macOS/BSD, and IOCP on Windows are the actual mechanisms that make thousands of concurrent connections possible without thousands of threads. Understanding them at the system call level lets you predict Node.js network behavior under load, tune kernel parameters correctly, and diagnose failures that are invisible from the JavaScript layer. This module traces a transaction payload from the network interface card to your callback with no hand-waving.
Why Kernel I/O Interfaces Matter
When a Node.js blockchain indexer handles 50,000 concurrent TCP connections from blockchain full nodes, each pushing new block data, two questions matter:
- How does the OS tell Node.js that one of those 50,000 connections has new data ready to read?
- How does Node.js read that data without blocking on connections that have no data?
The answer to both questions is the I/O multiplexer — a kernel interface that watches many file descriptors simultaneously and notifies the application when any of them are ready.
The three implementations differ in important ways. Knowing which one you're on and how it works determines which kernel parameters you can tune and how Node.js behaves under connection-heavy loads.
The Evolution: select → poll → epoll
select: The Original (and Still Worst)
select is the POSIX I/O multiplexer available on every Unix system. It watches a set of file descriptors and returns when any of them are ready.
```c
// select signature (simplified)
int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);
```
Critical limitation: O(N) at both submission and return. The fd_set is a fixed-size bitmap. The kernel scans every bit to find ready descriptors. You scan every bit again to find which ones fired. At 50,000 file descriptors, this would be 50,000 bit checks twice, every time you call select. In practice you cannot get that far: fd_set holds FD_SETSIZE bits, which is 1,024 with glibc on Linux, so select cannot watch any descriptor numbered 1,024 or higher.
Node.js does not use select.
poll: Slightly Better
poll removes the 1,024 limit but keeps the O(N) scan:
```c
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
```
You pass an array of pollfd structs. The kernel fills in revents for each one that's ready. You still scan the entire array to find which descriptors fired. At 50,000 connections, this is 50,000 struct scans on every call.
Node.js does not use poll for network sockets (though libuv falls back to it on some platforms for specific operations).
epoll: Linux's Answer (What Node.js Uses)
epoll was introduced in Linux 2.5.44 (2002) and is the foundation of Node.js's I/O on Linux. It solves the O(N) problem:
```c
// Create an epoll instance
int epfd = epoll_create1(0);

// Register a file descriptor to watch
struct epoll_event ev;
ev.events = EPOLLIN;        // watch for readable data
ev.data.fd = socket_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, socket_fd, &ev);

// Wait for events: returns only READY descriptors
struct epoll_event events[MAX_EVENTS];
int n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
// n = number of ready events, events[] contains only the ready ones
```
The key difference: epoll_wait returns only the file descriptors that are actually ready. If 50,000 connections are registered but only 3 have new data, epoll_wait returns 3 — not 50,000. The kernel maintains an internal ready-list and adds to it when descriptors become ready via interrupt-driven notification.
The result: epoll_wait is O(ready events), not O(total registered). Registering 50,000 connections costs 50,000 epoll_ctl calls at setup time, but each subsequent epoll_wait scales with activity, not connection count.
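The difference in work per call can be illustrated with a toy model. The sketch below is not how the kernel implements either interface; the names `pollStyleScan` and `ReadyListModel` are made up for illustration, and the only thing being measured is how many descriptors each approach has to examine.

```javascript
// Toy model only: counts how many descriptors each strategy examines.

// poll/select style: every call walks the full watched set.
function pollStyleScan(watched) {
  let examined = 0;
  const ready = [];
  for (const [fd, entry] of watched) {
    examined++;
    if (entry.readable) ready.push(fd);
  }
  return { ready, examined };
}

// epoll style: readiness is recorded when it happens, so a wait
// only drains the ready list.
class ReadyListModel {
  constructor() {
    this.watched = new Map();
    this.readyList = [];
  }
  add(fd) {
    this.watched.set(fd, { readable: false });
  }
  markReadable(fd) {
    const entry = this.watched.get(fd);
    if (entry && !entry.readable) {
      entry.readable = true;
      this.readyList.push(fd);
    }
  }
  wait() {
    const ready = this.readyList;
    this.readyList = [];
    for (const fd of ready) this.watched.get(fd).readable = false;
    return { ready, examined: ready.length };
  }
}

const model = new ReadyListModel();
for (let fd = 0; fd < 50000; fd++) model.add(fd);
for (const fd of [7, 4242, 49999]) model.markReadable(fd);

const scan = pollStyleScan(model.watched); // walks all 50,000 entries
const wait = model.wait();                 // touches only the 3 ready ones
console.log(scan.examined, wait.examined);
```

Both calls report the same three ready descriptors; only the amount of scanning differs.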
The Three epoll System Calls
Understanding these three calls precisely demystifies libuv's I/O poll phase.
epoll_create1
```c
int epfd = epoll_create1(EPOLL_CLOEXEC);
```
Creates an epoll instance — a kernel data structure that maintains the watched set and the ready list. Returns a file descriptor. EPOLL_CLOEXEC ensures the descriptor is closed if the process exec's (security hygiene).
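libuv calls this once per event loop, which for a typical Node.js process means once at startup. The single epfd watches the loop's sockets and pipes. Signals arrive through an internal pipe that is registered like any other descriptor. Timers are not registered with epoll at all: libuv keeps them in a heap and uses the earliest deadline as the epoll_wait timeout.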
epoll_ctl
```c
// Add a new file descriptor to watch
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);

// Modify the events to watch on an existing fd
epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);

// Remove a file descriptor
epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
```
Called by libuv when:
- A new TCP connection is accepted and reading starts → EPOLL_CTL_ADD with EPOLLIN (EPOLLOUT is added only while writes are pending)
- A socket transitions from write-interested to read-only → EPOLL_CTL_MOD
- A connection closes → EPOLL_CTL_DEL
For a blockchain indexer receiving 50,000 concurrent connections: 50,000 EPOLL_CTL_ADD calls at connection time. One epfd watching all 50,000.
epoll_wait
```c
struct epoll_event events[MAX_EVENTS];
int n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);

for (int i = 0; i < n; i++) {
    handle_event(events[i].data.fd, events[i].events);
}
```
This is called by libuv in the poll phase of the event loop. It blocks until:
- At least one registered fd is ready, OR
- The timeout expires (calculated by libuv based on pending timers)
When it returns, events[0..n-1] contains only the ready descriptors. libuv processes each one, dispatching the appropriate callback.
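The timeout calculation mentioned above can be sketched in a few lines. This is a simplified model loosely based on the idea behind libuv's uv_backend_timeout; the real function checks more conditions (closing handles, idle handles, a stopped loop), and the function name and parameter shape here are assumptions made for illustration.

```javascript
// Simplified sketch of choosing the epoll_wait timeout.
// Returns 0 (do not block), -1 (block indefinitely), or milliseconds to wait.
function computePollTimeout({ now, timerDueTimes, hasPendingImmediates }) {
  // Work is already queued for the next phase: poll without blocking.
  if (hasPendingImmediates) return 0;
  // No timers at all: block until some descriptor becomes ready.
  if (timerDueTimes.length === 0) return -1;
  // Otherwise sleep no longer than the earliest timer deadline.
  const nextDue = Math.min(...timerDueTimes);
  return Math.max(0, nextDue - now);
}

console.log(computePollTimeout({
  now: 1000,
  timerDueTimes: [1100, 1500],
  hasPendingImmediates: false,
}));
```

With a timer due 100 ms from now and nothing else pending, the loop would block for at most 100 ms, which matches the fourth argument seen in strace output for an idle process with a recurring timer.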
The Complete Journey: NIC to JavaScript Callback
Let me trace a transaction payload from a blockchain full node through every layer to your callback. This is the path that happens millions of times per second in a production indexer.
```text
1. [Network] Full node sends TCP segment containing block data
        ↓
2. [NIC] Network interface card receives the packet via DMA
         NIC fires a hardware interrupt to notify the CPU
        ↓
3. [Kernel: Interrupt Handler]
         CPU handles the interrupt in kernel context
         Copies packet data from NIC buffer to kernel socket receive buffer
         (SO_RCVBUF, tuned later in this module)
        ↓
4. [Kernel: Socket State Machine]
         TCP stack processes the segment: acknowledgement, reassembly
         If the socket was watching EPOLLIN, marks it ready in epoll's ready list
        ↓
5. [epoll_wait returns]
         libuv's poll phase was blocked in epoll_wait
         epoll_wait returns with this socket in the events[] array
        ↓
6. [libuv: I/O Poll Phase]
         libuv calls the registered read callback for this socket handle
         The callback calls recv() / read() to copy data from kernel buffer
         to userspace (libuv's read buffer)
        ↓
7. [libuv → Node.js]
         libuv fires the 'data' event on the corresponding net.Socket
         Passes the data as a Buffer (or calls the registered stream handler)
        ↓
8. [JavaScript: Event Loop]
         Your 'data' callback runs
         You parse the Buffer, extract the transaction
         Issue async database write (non-blocking)
         Return from callback
        ↓
9. [Event loop continues]
         Next ready socket's data callback runs
         Or the loop blocks in epoll_wait again if nothing is ready
```
Every step from 1 to 7 happens in kernel or libuv C code, with zero JavaScript involvement. Step 8 is the only JavaScript. This is why a single Node.js process can handle 50,000 connections: steps 1–7 are handled by the kernel's interrupt-driven mechanisms, requiring no application-level threads.
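One practical consequence of step 8: the Buffer handed to a 'data' callback is whatever read() returned, so it can contain half a message or several messages. A minimal sketch of reassembling messages follows. The wire format (a 4-byte big-endian length prefix) and the names `FrameDecoder` and `frame` are assumptions for illustration; a real full-node protocol defines its own framing.

```javascript
// Hypothetical framing: 4-byte big-endian length, then that many payload bytes.
class FrameDecoder {
  constructor() {
    this.buffered = Buffer.alloc(0);
  }

  // Feed one 'data' chunk; returns every complete payload now available.
  push(chunk) {
    this.buffered = Buffer.concat([this.buffered, chunk]);
    const frames = [];
    while (this.buffered.length >= 4) {
      const length = this.buffered.readUInt32BE(0);
      if (this.buffered.length < 4 + length) break; // wait for more data
      frames.push(this.buffered.subarray(4, 4 + length));
      this.buffered = this.buffered.subarray(4 + length);
    }
    return frames;
  }
}

// Helper that builds one frame for the demonstration below.
function frame(text) {
  const payload = Buffer.from(text);
  const header = Buffer.alloc(4);
  header.writeUInt32BE(payload.length, 0);
  return Buffer.concat([header, payload]);
}

// Two messages delivered as two chunks that split the first message in half.
const stream = Buffer.concat([frame('tx-1'), frame('tx-22')]);
const decoder = new FrameDecoder();
const firstPass = decoder.push(stream.subarray(0, 6));
const secondPass = decoder.push(stream.subarray(6));
console.log(firstPass.length, secondPass.map((f) => f.toString()));
```

In a server, `decoder.push(chunk)` would be called from the socket's 'data' handler, with one decoder instance per connection.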
kqueue: macOS and BSD
kqueue is the BSD equivalent of epoll. The API is different but the concept is identical: an event queue that you register file descriptors into and poll for readiness notifications.
```c
// Create a kqueue instance
int kq = kqueue();

// Register a socket for read readiness
struct kevent change;
EV_SET(&change, socket_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
kevent(kq, &change, 1, NULL, 0, NULL);  // register (no wait)

// Wait for events
struct kevent events[MAX_EVENTS];
int n = kevent(kq, NULL, 0, events, MAX_EVENTS, NULL);  // wait
```
Node.js on macOS uses kqueue via libuv's kqueue backend. The behavior is identical to epoll from the application perspective.
Practical difference for development: macOS often has lower default file descriptor limits (ulimit -n). If you're testing a high-connection-count scenario locally on macOS, you may need:
```bash
# Check current limit
ulimit -n

# Increase for current session (testing only)
ulimit -n 65536

# Permanent: use launchctl limit maxfiles
# (older macOS releases read /etc/launchd.conf instead)
```
IOCP: Windows (Completion-Based)
Windows uses I/O Completion Ports (IOCP) — a fundamentally different model. Where epoll/kqueue are readiness-based (notify when an fd is ready to read/write), IOCP is completion-based (notify when an I/O operation has completed and the data is already in your buffer).
epoll/kqueue model:
"Socket X has data available. Go read it."
→ Application calls recv() to copy data from kernel to userspace buffer
IOCP model:
"I completed reading from socket X. The data is already in buffer Y at offset Z."
→ Application processes the data directly
IOCP can be more efficient on Windows because the read is posted before the data arrives: the kernel fills the supplied buffer and then reports completion, so no separate recv() call follows the notification. libuv implements an IOCP backend for Windows, maintaining the same Node.js API surface across platforms.
Kernel Socket Buffer Tuning
The kernel maintains receive (SO_RCVBUF) and send (SO_SNDBUF) buffers for each TCP socket. These buffers sit between the NIC and the application.
SO_RCVBUF: The Receive Buffer
When a blockchain full node is pushing multi-MB block data, the kernel's receive buffer determines how much data can accumulate before the sender must stop (TCP flow control kicks in).
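Default: 87,380 bytes (85KB) on older Linux kernels; many newer kernels default to 131,072 bytes (128KB). The middle value of net.ipv4.tcp_rmem shows the default on your system.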
For a blockchain indexer receiving large block payloads:
```javascript
// Socket tuning that is available from JavaScript
import net from 'node:net';

const server = net.createServer((socket) => {
  // The connection is already established when this callback runs
  socket.setNoDelay(true);           // disable Nagle
  socket.setKeepAlive(true, 30000);  // first keepalive probe after 30 s idle

  // Node.js doesn't expose SO_RCVBUF for TCP sockets directly; if you need
  // more than the default, use a native addon or set system-wide defaults
  // via sysctl

  socket.on('data', (chunk) => {
    // chunk size is bounded by libuv's read buffer (typically 64KB),
    // not by SO_RCVBUF
    processBlockData(chunk); // application-defined handler
  });
});
```
```bash
# Check current socket buffer defaults
sysctl net.core.rmem_default   # default receive buffer (non-TCP sockets)
sysctl net.core.rmem_max       # maximum allowed via setsockopt(SO_RCVBUF)
sysctl net.core.wmem_default   # default send buffer
sysctl net.ipv4.tcp_rmem       # TCP: min, default, autotuning max

# For high-throughput blockchain data streams: increase receive buffer
sudo sysctl -w net.core.rmem_default=262144   # 256KB default
sudo sysctl -w net.core.rmem_max=16777216     # 16MB max (for large block payloads)
sudo sysctl -w net.ipv4.tcp_rmem="4096 262144 16777216"

# Persist across reboots (tee runs as root; a plain >> redirect would not)
echo "net.core.rmem_default=262144" | sudo tee -a /etc/sysctl.conf
echo "net.core.rmem_max=16777216" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_rmem=4096 262144 16777216" | sudo tee -a /etc/sysctl.conf
```
When Buffer Size Matters
Too small (default 85KB): Full node pushes a 2MB Ethereum block. The kernel receive buffer fills at 85KB. TCP advertises zero window to the sender. The sender pauses. Node.js drains the buffer. TCP opens the window again. This cycle creates bursty throughput instead of smooth streaming. Under extreme load, this back-and-forth adds latency.
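Too large: More memory consumed per connection. With 50,000 connections at 2MB buffer each, that's up to 100GB of kernel buffer space if every buffer fills, which is untenable.
Correct approach: Size to your expected payload size and your connection count. For a blockchain indexer where individual blocks are 1–5MB, a 4MB receive buffer per connection is reasonable when the connection count is modest. At tens of thousands of connections, keep the default small, scale rmem_max to 16MB, and let the OS auto-tune with tcp_rmem so that only busy sockets grow toward the ceiling.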
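The sizing trade-off is simple arithmetic, sketched below. The function names and the 8GB memory budget are illustrative assumptions; decimal megabytes are used so the result lines up with the 100GB figure quoted in this section.

```javascript
const MB = 1_000_000;
const GB = 1_000_000_000;

// Worst case: every socket's receive buffer is completely full at once.
function worstCaseBufferBytes(connections, bytesPerSocket) {
  return connections * bytesPerSocket;
}

// Largest per-socket buffer that keeps the worst case inside a memory budget.
function maxBufferForBudget(budgetBytes, connections) {
  return Math.floor(budgetBytes / connections);
}

const worstCaseGB = worstCaseBufferBytes(50_000, 2 * MB) / GB;
const perSocketBytes = maxBufferForBudget(8 * GB, 50_000); // hypothetical 8GB budget
console.log(worstCaseGB, perSocketBytes);
```

Under that hypothetical budget each of 50,000 sockets could hold only about 160KB in the worst case, which is why autotuning with a small default and a large ceiling tends to fit connection-heavy workloads better than a large fixed buffer.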
TCP_NODELAY and Nagle's Algorithm
Nagle's Algorithm was designed to reduce the number of small TCP packets. It buffers small writes and coalesces them:
Nagle's rule: send immediately IF:
- The buffer contains a full-size segment (MSS), OR
- All previously sent data has been acknowledged
Otherwise: buffer and wait for ACK or MSS
For a real-time payment gateway or WebSocket notification system, Nagle introduces visible latency. A 50-byte payment acknowledgement written while earlier data is still unacknowledged sits in the send buffer until that ACK arrives, and the peer's delayed-ACK timer can stretch the wait to tens of milliseconds.
```javascript
// Disable Nagle for low-latency payment ACKs
import net from 'node:net';

const server = net.createServer((socket) => {
  socket.setNoDelay(true); // sets TCP_NODELAY; call it as early as possible
});
```
When to keep Nagle enabled: bulk data transfers where throughput matters more than latency. A blockchain full node pushing bulk historical block data benefits from Nagle — fewer, larger packets use network bandwidth more efficiently.
The rule for a hybrid system: disable Nagle on connections that send real-time acknowledgements; enable (or leave default) on connections used for bulk data ingestion.
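Nagle's rule as stated above fits in a small predicate. This is a model of the decision only, not of a real TCP stack; the function name and parameter names are assumptions for illustration.

```javascript
// Returns true when a sender following Nagle's rule would transmit right away.
function nagleSendsImmediately({ queuedBytes, mss, unackedBytes, noDelay = false }) {
  if (noDelay) return true;            // TCP_NODELAY bypasses the rule
  if (queuedBytes >= mss) return true; // a full-size segment is ready
  return unackedBytes === 0;           // nothing in flight, so send the small write
}

// A 50-byte ACK written while 1,200 earlier bytes are still unacknowledged:
const withNagle = nagleSendsImmediately({ queuedBytes: 50, mss: 1460, unackedBytes: 1200 });
const withNoDelay = nagleSendsImmediately({ queuedBytes: 50, mss: 1460, unackedBytes: 1200, noDelay: true });
console.log(withNagle, withNoDelay);
```

The same small write goes out immediately on an idle connection, which is why Nagle's cost shows up mainly on connections that send several small messages in quick succession.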
SO_REUSEPORT: Kernel-Level Load Distribution
By default, only one process can bind to a given port. SO_REUSEPORT allows multiple processes (or sockets) to bind to the same port, with the kernel distributing incoming connections between them using a hash of (source IP, source port, dest IP, dest port).
This is an efficient way to scale a Node.js cluster for TCP workloads. Instead of the primary process accepting every connection and handing it to a worker in userspace (round-robin over IPC), the kernel picks a listening socket by hashing the connection's 4-tuple, so each worker accepts directly from its own queue.
```javascript
// Using SO_REUSEPORT for multiple workers bound to the same port.
// reusePort is an option of server.listen(), documented for recent Node.js
// releases (v22.12 / v23.1 and later) on platforms such as Linux 3.9+.
// Check the net documentation for the version you run.
import cluster from 'node:cluster';
import net from 'node:net';
import { cpus } from 'node:os';

if (cluster.isPrimary) {
  for (let i = 0; i < cpus().length; i++) {
    cluster.fork();
  }
} else {
  // Each worker creates its OWN server on the same port.
  // The kernel distributes connections between workers.
  const server = net.createServer((socket) => {
    socket.setNoDelay(true);
    handleConnection(socket); // application-defined handler
  });

  server.listen({ port: 3000, reusePort: true }, () => {
    console.log(`Worker ${process.pid} listening on :3000`);
  });
}
```
Why this beats standard cluster round-robin:
- Standard cluster: the primary process accepts ALL connections and sends them to workers via IPC.
- SO_REUSEPORT: the kernel places connections directly into each worker's accept queue. No IPC overhead. No single-process bottleneck at the accept queue.
- Deterministic placement: a given 4-tuple always hashes to the same listening socket, and every packet of a connection reaches the worker that accepted it. A reconnect from a new source port can land on a different worker, so state that must survive reconnects belongs outside the worker.
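The placement behaviour can be illustrated with a toy hash. The kernel uses its own hash function; FNV-1a and the names `fnv1a` and `pickListener` below are stand-ins chosen for illustration. The sketch only shows the shape of the mechanism: a 4-tuple in, a listener index out.

```javascript
// 32-bit FNV-1a, used here purely as a stand-in for the kernel's hash.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Map a connection's 4-tuple to one of the listening sockets.
function pickListener({ srcIp, srcPort, dstIp, dstPort }, listenerCount) {
  return fnv1a(`${srcIp}:${srcPort}>${dstIp}:${dstPort}`) % listenerCount;
}

const workers = 8;
const firstConnect = { srcIp: '10.0.0.1', srcPort: 40000, dstIp: '10.0.0.9', dstPort: 3000 };
const reconnect = { ...firstConnect, srcPort: 40001 }; // same client, new ephemeral port

console.log(pickListener(firstConnect, workers)); // stable for this 4-tuple
console.log(pickListener(reconnect, workers));    // may be a different worker
```

Because the source port is part of the hash input, the same client is not pinned to one worker across reconnects.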
Observing Syscall Patterns Under Load
strace (Linux) and dtruss (macOS) let you observe the actual system calls Node.js makes, confirming the kernel interactions described above.
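```bash
# Trace I/O-related syscalls for a running Node.js process
# (epoll_pwait is included because newer libuv versions call it instead of epoll_wait)
strace -p $(pgrep -f "node indexer") \
  -e trace=epoll_wait,epoll_pwait,epoll_ctl,read,write,accept4,close \
  -T   # show time spent in each syscall
```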
Example output during active ingestion:
epoll_wait(5, [{EPOLLIN, {u32=12, u64=12}}], 1024, 100) = 1 <0.000421>
read(12, "\x82\x00\x00\x03\xf4...", 65536) = 4096 <0.000031>
epoll_wait(5, [{EPOLLIN, {u32=12, u64=12}}], 1024, 100) = 1 <0.000012>
read(12, "\x82\x00\x00\x03\xf4...", 65536) = 3892 <0.000028>
epoll_wait(5, [], 1024, 100) = 0 <0.100041> ← 100ms idle wait
Reading this:
- epoll_wait returning 1 = one socket is ready
- read with < 65536 bytes = end of available data in buffer
- epoll_wait returning 0 after 100ms = no activity, loop is idle
High-load pattern:
epoll_wait(5, [{EPOLLIN, ...}, {EPOLLIN, ...}, {EPOLLIN, ...}], 1024, 0) = 48 <0.000003>
epoll_wait returning 48 events with timeout=0 means the ready list is continuously populated — the event loop never goes idle. This is a healthy high-throughput state, not starvation, as long as each callback is fast.
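When a trace runs to thousands of lines, a small parser helps separate idle waits from busy polling. The sketch below assumes the line shape shown in the example output above (strace with -T); the names `parseEpollWait` and `describeWait` are illustrative, and other strace versions or options may format lines differently.

```javascript
// Matches lines such as: epoll_wait(5, [], 1024, 100) = 0 <0.100041>
const EPOLL_WAIT_LINE =
  /^epoll_wait\((\d+), .*, (\d+), (-?\d+)\)\s+=\s+(-?\d+)\s+<([\d.]+)>/;

function parseEpollWait(line) {
  const match = EPOLL_WAIT_LINE.exec(line);
  if (!match) return null; // not an epoll_wait line
  return {
    epfd: Number(match[1]),
    maxEvents: Number(match[2]),
    timeoutMs: Number(match[3]),
    readyCount: Number(match[4]),
    seconds: Number(match[5]),
  };
}

// Rough classification following the reading guide above.
function describeWait(sample) {
  if (sample.readyCount === 0) return 'idle';
  if (sample.timeoutMs === 0) return 'busy';
  return 'active';
}

const idleSample = parseEpollWait('epoll_wait(5, [], 1024, 100) = 0 <0.100041>');
const activeSample = parseEpollWait(
  'epoll_wait(5, [{EPOLLIN, {u32=12, u64=12}}], 1024, 100) = 1 <0.000421>'
);
console.log(describeWait(idleSample), describeWait(activeSample));
```

Summing `seconds` over the idle samples gives a rough picture of how much of the traced window the loop spent with nothing to do.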
/proc/net/tcp and Socket State Monitoring
The kernel exposes all TCP socket state via /proc/net/tcp (IPv4) and /proc/net/tcp6 (IPv6).
```bash
# Count sockets by state
awk 'NR>1 {print $4}' /proc/net/tcp | sort | uniq -c | sort -rn

# States (hex): 01=ESTABLISHED, 06=TIME_WAIT, 0A=LISTEN, 0B=CLOSING
```
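The same count can be produced from inside a Node.js process, for example to export it as a metric. The sketch below assumes the /proc/net/tcp layout in which the fourth whitespace-separated column is the hex state; the function name is illustrative, only the four states listed above are named, and the sample rows are abbreviated rather than copied from a real system.

```javascript
// Names for the state codes listed above; anything else is reported as raw hex.
const TCP_STATES = { '01': 'ESTABLISHED', '06': 'TIME_WAIT', '0A': 'LISTEN', '0B': 'CLOSING' };

function countTcpStates(procNetTcpText) {
  const counts = {};
  const rows = procNetTcpText.split('\n').slice(1); // skip the header row
  for (const row of rows) {
    const fields = row.trim().split(/\s+/);
    if (fields.length < 4) continue; // blank or truncated line
    const name = TCP_STATES[fields[3]] ?? `0x${fields[3]}`;
    counts[name] = (counts[name] ?? 0) + 1;
  }
  return counts;
}

// Abbreviated sample rows (real rows carry more columns after the state).
const sampleTable = [
  '  sl  local_address rem_address   st tx_queue rx_queue',
  '   0: 00000000:0BB8 00000000:0000 0A 00000000:00000000',
  '   1: 0100007F:0BB8 0100007F:D2F0 01 00000000:00000000',
  '   2: 0100007F:0BB8 0100007F:D2F2 01 00000000:00000000',
  '   3: 0100007F:0BB8 0100007F:D2F4 06 00000000:00000000',
].join('\n');

console.log(countTcpStates(sampleTable));
// On Linux: countTcpStates(require('node:fs').readFileSync('/proc/net/tcp', 'utf8'))
```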
The SYN backlog: detecting connection drops
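When blockchain full nodes open 10,000 simultaneous new connections to your indexer, the kernel completes TCP handshakes faster than the application calls accept(). Two queues absorb the difference: the SYN queue holds half-open connections that are waiting for the final ACK (bounded by tcp_max_syn_backlog), and the accept queue holds fully established connections that are waiting for accept() (bounded by the smaller of the listen() backlog and somaxconn).
If either queue fills, the kernel by default drops the incoming SYN or final ACK silently: clients get no response, retransmit, and eventually time out.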
```bash
# Check current SYN backlog setting
sysctl net.ipv4.tcp_max_syn_backlog   # default depends on kernel and memory (commonly 128 or higher)

# For high-connection-rate indexers
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=65536
sudo sysctl -w net.core.somaxconn=65536   # also raise the listen() backlog cap
```
```javascript
// In Node.js: pass backlog to listen()
server.listen(3000, '0.0.0.0', 65536); // explicit backlog parameter
```
```bash
# Monitor SYN drops in real time
watch -n 1 "netstat -s | grep -i 'syn\|overflow\|drop'"
```
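Raising only one of the two limits has no effect, because on Linux the accept queue is capped at the smaller of the listen() backlog and net.core.somaxconn. A minimal sketch of that interaction follows; the function name is illustrative, and 511 is the backlog Node.js uses when none is passed to listen().

```javascript
const NODE_DEFAULT_BACKLOG = 511; // used by server.listen() when no backlog is given

// Effective accept-queue limit on Linux for a given somaxconn and listen() backlog.
function effectiveAcceptQueue(somaxconn, requestedBacklog = NODE_DEFAULT_BACKLOG) {
  return Math.min(requestedBacklog, somaxconn);
}

console.log(effectiveAcceptQueue(128));          // sysctl is the bottleneck
console.log(effectiveAcceptQueue(65536));        // Node.js default is the bottleneck
console.log(effectiveAcceptQueue(65536, 65536)); // both raised
```

This is why the tuning above changes the sysctl and the listen() argument together.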
The Production Incident: epoll Thundering Herd on Reconnect
Context: A blockchain indexer with 5,000 persistent connections from blockchain full nodes. The indexer deployed a new version. All 5,000 full nodes detected the connection drop simultaneously and attempted to reconnect at the same moment.
What happened:
All 5,000 connection attempts arrived within 200ms. The kernel completed the handshakes and queued the connections, but Node.js could only drain the accept queue as fast as libuv's accept4 loop ran on the single event-loop thread.
The real problem: the 5,000 newly established connections generated 5,000 epoll_ctl(EPOLL_CTL_ADD) registrations, and every peer immediately sent its "hello" protocol message. The next epoll_wait calls after reconnection returned thousands of ready events, and libuv dispatched their callbacks back to back. Each callback triggered a database lookup (authenticate the reconnecting node). The database connection pool (default 10 connections) was instantly exhausted, and 4,990 database lookups queued. Authentication took 30–120 seconds per connection instead of < 10ms. The thundering herd had caused a database pool exhaustion cascade.
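A back-of-envelope model shows why the pool, not the network, set the pace. It assumes lookups run in fixed waves of one per pool connection with a constant latency, which is a simplification; the function names are illustrative and the inputs are the figures quoted in this incident.

```javascript
// Time for the last lookup to finish if lookups run in waves of poolSize.
function queueDrainMs(lookups, poolSize, perLookupMs) {
  return Math.ceil(lookups / poolSize) * perLookupMs;
}

// Per-lookup latency implied by an observed wait for the last request.
function impliedLookupMs(observedTailMs, lookups, poolSize) {
  return observedTailMs / Math.ceil(lookups / poolSize);
}

const bestCaseMs = queueDrainMs(5000, 10, 10);          // healthy 10 ms lookups
const degradedMs = impliedLookupMs(120_000, 5000, 10);  // from the 120 s tail
console.log(bestCaseMs, degradedMs);
```

Even at the healthy 10 ms latency, the last node in the queue would wait about 5 seconds. Under this model, the observed 120-second tail corresponds to roughly 240 ms per lookup, which suggests the database itself slowed down under the burst.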
The fix — connection rate limiting at the TCP accept layer:
```javascript
// Limit simultaneous new connections being processed
let activeHandshakes = 0;
const MAX_CONCURRENT_HANDSHAKES = 50;
const pendingConnections = [];

server.on('connection', (socket) => {
  if (activeHandshakes >= MAX_CONCURRENT_HANDSHAKES) {
    // Defer processing: the socket is accepted at TCP level, won't be dropped
    pendingConnections.push(socket);
    return;
  }
  processConnection(socket);
});

function processConnection(socket) {
  activeHandshakes++;
  socket.setNoDelay(true);

  authenticateNode(socket)
    .then(() => registerNode(socket))
    // A failed handshake closes the socket instead of surfacing
    // as an unhandled rejection
    .catch(() => socket.destroy())
    .finally(() => {
      activeHandshakes--;
      // Process next deferred connection
      if (pendingConnections.length > 0) {
        const next = pendingConnections.shift();
        setImmediate(() => processConnection(next));
      }
    });
}
```
Additionally: SO_REUSEPORT across 8 workers distributed the initial accept load — instead of one process handling all 5,000 accept calls, 8 workers each handled ~625. Reconnection storm time dropped from 120s to 8s.
Summary
| Concept | Key Takeaway |
|---|---|
| select/poll | O(N) scan of all watched fds. Do not use for high-connection-count workloads. |
| epoll (Linux) | O(ready events). The foundation of Node.js I/O on Linux. |
| kqueue (macOS) | Conceptually identical to epoll. Same behavior, different API. |
| IOCP (Windows) | Completion-based (not readiness-based). Handled transparently by libuv. |
| NIC → JavaScript | 9-step path: NIC interrupt → kernel TCP stack → epoll ready list → libuv poll phase → your callback. |
| SO_RCVBUF | Kernel receive buffer per socket. Tune for expected payload size. Default 85KB is too small for multi-MB blockchain block data. |
| TCP_NODELAY | Disables Nagle. Required for low-latency payment ACKs. Avoid for bulk data transfers. |
| SO_REUSEPORT | Multiple workers bind the same port. Kernel hashes each connection's 4-tuple to pick a worker. No IPC overhead. |
| SYN backlog | Set tcp_max_syn_backlog=65536 for high reconnect-rate scenarios. |
| strace | Direct observation of system calls; confirms epoll interaction patterns. |
| Thundering herd | Simultaneous reconnects + database pool exhaustion = cascade failure. Rate-limit connection processing. |
You now know what happens at every layer below Node.js. Module 4 moves back up to the application layer and covers the HTTP/TCP subsystem — backpressure mechanisms, high-water marks, the drain event contract, and how to prevent socket floods from taking down your ingestion service.