Module A-6·24 min read

The real-world problem: 10+ multithreaded Node.js instances processing Kafka-delivered blockchain blocks at 2,000+ TPS without double-processing. Block-range partitioning via Redis locks, heartbeat extension, crash recovery, and the 6-hour replication lag incident.

A-6 — The SupraScan Architecture: Coordinating 10+ Concurrent Scanner Instances

Q: In the SupraScan architecture, how does the system maintain a global "frontier" (the highest contiguous block number fully indexed) when multiple workers might be processing blocks out of order simultaneously?

Workers add completed block numbers to a Redis Sorted Set. A function periodically checks the Sorted Set looking for an unbroken sequence of block numbers starting from the current frontier + 1, advancing the frontier only when a contiguous run is found. — Because 10+ workers are processing blocks in parallel, Block 105 might finish before Block 104. You cannot advance the "safe, fully processed" frontier to 105 until 104 is also done. Using a Redis Sorted Set allows the system to store out-of-order completions efficiently. The frontier advancement logic simply queries the sorted set starting from the current frontier and counts upward sequentially. If it finds 101, 102, 103, and 105 (but not 104), the frontier stops at 103. Once 104 is added, the next check will cleanly advance the frontier to 105.

Q: Why did the "6-Hour Replication Lag Incident" cause SupraScan's reclaimer process to erroneously requeue approximately 1 million blocks that had already been successfully indexed in PostgreSQL?

The reclaimer process was reading coordination state (like completion markers) from a Redis replica instead of the primary. Because the replica was 6 hours behind, it reported that the blocks were incomplete, causing the reclaimer to aggressively re-assign them. — This is the central lesson of the case study. Read replicas are fantastic for scaling read-heavy workloads like user dashboards or analytics. However, for strict distributed coordination—where decisions (like "should I re-process this data?") are made based on the read value—you must read from the primary. If you read from a lagging replica, your application will make decisions based on stale state, leading to catastrophic logic failures like massive unnecessary reprocessing.

Q: When SupraScan's background reclaimer scans for orphaned block ranges, how does it distinguish between a "healthy but slow" worker currently processing a large block range, and a "crashed" worker whose block range needs to be reclaimed?

It checks for the existence of a separate, short-TTL "heartbeat" key associated with the worker. If the lock is close to expiry but the heartbeat key is missing, the worker is presumed dead and the range is reclaimed. — A lock TTL alone is not enough to manage complex state recovery. A worker might be processing a massive block that takes a long time, and its lock might be getting close to expiry before the next watchdog extension cycle. If the reclaimer blindly reassigns the block based only on the lock TTL, it will cause double-processing. By requiring workers to constantly publish a short-lived heartbeat key, the reclaimer has an independent signal of the worker process's health. Missing heartbeat = dead worker, safe to reclaim.

This module is different. The previous modules covered Redis primitives in theory and controlled examples. This module is a case study from production: building a distributed blockchain indexer that processes Kafka-delivered blocks at 2,000+ transactions per second across 10+ concurrent Node.js instances. Every technique from this course is here — distributed locks, Redis coordination, crash recovery, memory management under load, and the failure modes that only appear at production scale.

The Problem

SupraScan is a blockchain indexer for the Supra L1 network. Its job: consume every block from Kafka, parse every transaction, and write structured data to PostgreSQL — with no missed blocks, no duplicate processing, and no gaps in the indexed history.

The constraints:

Throughput: 2,000+ transactions per second at peak
Concurrency: 10+ worker instances, each multi-threaded (Node.js cluster + worker threads)
Ordering: Blocks must be indexed in strict sequence (block N before block N+1)
Durability: The indexed data is the only source of historical state — unavailable via on-chain RPC. No data loss is acceptable.
Availability: Processing must resume automatically after any worker crash
Historical data: The indexer's complete historical dataset became foundational infrastructure consumed by multiple cross-functional teams across analytics, product, and research — this elevated the cost of any data loss or gap

The fundamental challenge: distributed coordination without a central coordinator. Every worker is equal. Any worker can crash at any time. No worker has special authority.

The Architecture

text

Kafka (block stream)
        │
        │ Multiple consumer groups
        ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Worker 1    │  │  Worker 2    │  │  Worker N    │
│  (Node.js)   │  │  (Node.js)   │  │  (Node.js)   │
│  Cluster     │  │  Cluster     │  │  Cluster     │
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       └────────────────┬┘                 │
                        │                 │
                   ┌────▼──────────────────▼───┐
                   │         Redis              │
                   │  (coordination layer)      │
                   └────────────────────────────┘
                                │
                           ┌────▼────┐
                           │ PostgreSQL│
                           └─────────┘

Redis is the coordination layer — it holds lock state, progress tracking, and health signals. PostgreSQL is the data store. Kafka is the event source.

Block-Range Partitioning via Redis Locks

Each Kafka partition delivers blocks in order. Multiple workers consume from multiple partitions simultaneously. The challenge: if two workers consume the same block (e.g., due to consumer group rebalancing), they must not both write it to the database.

Solution: Block-range locks.

Before processing a block range (e.g., blocks 100,000–100,099), a worker acquires a Redis lock for that range:

SET lock:block:range:100000 {worker_uuid} NX PX 60000

If the lock is acquired, the worker owns that range exclusively. If another worker also tries to process the same range (due to consumer group rebalancing, duplicate delivery), it sees the lock and skips.

typescript

async function processBlockRange(startBlock: number, endBlock: number, blocks: Block[]) {
  const rangeKey = `lock:block:range:${startBlock}`;
  const workerId = `${os.hostname()}:${process.pid}:${Date.now()}`;

  const acquired = await redis.set(rangeKey, workerId, 'NX', 'PX', 60000);
  if (acquired !== 'OK') {
    logger.info(`Block range ${startBlock}-${endBlock} already locked, skipping`);
    return;
  }

  // Start heartbeat to extend lock while processing
  const heartbeat = setInterval(async () => {
    await redis.pexpire(rangeKey, 60000);  // extend by 60 more seconds
  }, 20000);  // renew every 20s (1/3 of TTL)

  try {
    for (const block of blocks) {
      await indexBlock(block);
    }
    // Mark range as complete
    await redis.set(`complete:block:range:${startBlock}`, '1', 'EX', 86400);
  } finally {
    clearInterval(heartbeat);
    // Release lock only if we still hold it
    const current = await redis.get(rangeKey);
    if (current === workerId) {
      await redis.del(rangeKey);
    }
  }
}

Idempotency Check

Before acquiring the lock, check if this range was already completed (by a previous run or another worker that finished and released the lock):

typescript

const alreadyDone = await redis.exists(`complete:block:range:${startBlock}`);
if (alreadyDone) {
  logger.debug(`Block range ${startBlock} already indexed, skipping`);
  return;
}

The complete:* keys have a 24-hour TTL — sufficient to prevent reprocessing during normal operation while not accumulating indefinitely.

Progress Tracking: Sorted Set as a Processing Frontier

The indexer must maintain a global "frontier" — the highest contiguous block number that has been fully indexed. Blocks may be processed out of order (due to parallel consumers), so we need to track which blocks are complete and advance the frontier only when gaps are filled.

typescript

const COMPLETED_BLOCKS_KEY = 'indexer:completed:blocks';
const FRONTIER_KEY = 'indexer:frontier';

// After successfully indexing a block:
async function markBlockComplete(blockNumber: number) {
  // Add to the sorted set (score = blockNumber)
  await redis.zadd(COMPLETED_BLOCKS_KEY, blockNumber, String(blockNumber));

  // Try to advance the frontier
  await advanceFrontier();
}

async function advanceFrontier() {
  const frontier = parseInt(await redis.get(FRONTIER_KEY) ?? '0', 10);

  // Get all completed blocks from frontier+1 onwards
  const nextBlocks = await redis.zrangebyscore(
    COMPLETED_BLOCKS_KEY,
    frontier + 1,
    frontier + 10000,  // look ahead 10,000 blocks
    'LIMIT', 0, 10000
  );

  let newFrontier = frontier;
  for (const blockStr of nextBlocks) {
    const block = parseInt(blockStr, 10);
    if (block === newFrontier + 1) {
      newFrontier = block;
    } else {
      break;  // gap in sequence — stop advancing
    }
  }

  if (newFrontier > frontier) {
    await redis.set(FRONTIER_KEY, String(newFrontier));
    // Clean up completed blocks below the new frontier (memory management)
    await redis.zremrangebyscore(COMPLETED_BLOCKS_KEY, 0, newFrontier);
    logger.info(`Frontier advanced to block ${newFrontier}`);
  }
}

The Sorted Set holds in-flight and recently completed block numbers. The frontier advances when a contiguous run of completed blocks is detected. Completed blocks below the frontier are trimmed to prevent unbounded memory growth.

Crash Recovery: Reclaiming Orphaned Locks

When a worker crashes mid-processing, its lock expires (due to TTL). The block range is not marked complete. Another worker must pick it up.

The recovery mechanism: a background "reclaimer" process periodically scans for block ranges that have locks but are not in the completed set and not currently being processed by a healthy worker.

typescript

async function reclaimOrphanedRanges() {
  // Scan all block range locks
  const lockKeys: string[] = [];
  let cursor = '0';
  do {
    const [nextCursor, keys] = await redis.scan(cursor, 'MATCH', 'lock:block:range:*', 'COUNT', 100);
    lockKeys.push(...keys);
    cursor = nextCursor;
  } while (cursor !== '0');

  for (const lockKey of lockKeys) {
    const lockValue = await redis.get(lockKey);
    if (!lockValue) continue;  // already expired

    const ttl = await redis.pttl(lockKey);
    if (ttl > 30000) continue;  // lock is healthy (more than 30s remaining)

    // Lock is close to expiry — is the holder still alive?
    const [, workerId] = lockKey.split('lock:block:range:');
    const heartbeatKey = `heartbeat:${lockValue}`;
    const heartbeatExists = await redis.exists(heartbeatKey);

    if (!heartbeatExists) {
      // Worker is dead — reclaim this range for reprocessing
      logger.warn(`Reclaiming orphaned lock ${lockKey} (worker ${lockValue} is down)`);
      await redis.del(lockKey);
      // The range will be re-acquired by the next worker that processes it
    }
  }
}

Worker Heartbeats

Each worker publishes a heartbeat key with a short TTL. If the worker crashes, the heartbeat expires and the reclaimer can detect orphaned ranges:

typescript

// Each worker: publish heartbeat every 10 seconds
const HEARTBEAT_TTL = 30;  // 30 seconds
const heartbeatKey = `heartbeat:${workerId}`;

setInterval(async () => {
  await redis.set(heartbeatKey, Date.now().toString(), 'EX', HEARTBEAT_TTL);
}, 10000);

// On clean shutdown: remove heartbeat immediately
process.on('SIGTERM', async () => {
  await redis.del(heartbeatKey);
  process.exit(0);
});

The 6-Hour Replication Lag Incident

The most instructive production failure: a Redis replica fell 6 hours behind the primary due to network partition. After the partition healed and the replica caught up, approximately 6 hours of completed-block markers were replayed — causing the reclaimer to see blocks as "not completed" and requeue them for reprocessing.

Root cause: The reclaimer queried a replica (for read scaling), which had stale data. It saw block 500,000 as incomplete (because the completion marker had not yet replicated), even though it was fully indexed in PostgreSQL.

Consequences: Approximately 1M blocks were re-queued for reprocessing. The workers detected duplicate inserts via PostgreSQL's ON CONFLICT DO NOTHING and discarded them, but the extra processing load caused a 2-hour throughput degradation.

Fix:

Read coordination state from the primary only. Completion markers, frontier state, and lock state are read from the primary Redis connection. Replica reads are only used for non-critical data (metrics, display data).
Idempotent writes to PostgreSQL. All INSERT statements use ON CONFLICT DO NOTHING or ON CONFLICT DO UPDATE. Double processing produces the same result as single processing.
Replication lag monitoring. Alert when replica.lag > 5s. A 6-hour lag should have triggered alerts long before it caused data consistency issues.

typescript

// Separate client for coordination reads (always primary)
const primaryRedis = new Redis({ host: 'redis-primary.internal' });
const replicaRedis = new Redis({ host: 'redis-replica.internal' });

// All coordination reads from primary
const frontier = await primaryRedis.get(FRONTIER_KEY);
const lockExists = await primaryRedis.exists(rangeKey);

// Non-critical display data can use replica
const displayMetrics = await replicaRedis.hgetall('indexer:metrics');

Memory Management Under Load

At 2,000+ TPS with 10 workers, Redis memory pressure was a real concern. Key management practices:

Bounded Sorted Set for completed blocks:

typescript

// Keep only the last 100,000 completed block numbers in the sorted set
// (frontier advancement removes everything below the frontier anyway)
await redis.zremrangebyrank(COMPLETED_BLOCKS_KEY, 0, -100001);

TTL on all coordination keys:

Every key written by the indexer has a TTL — lock keys (60s), heartbeat keys (30s), completion markers (24h). No orphan keys accumulate indefinitely.

Monitoring key count:

typescript

setInterval(async () => {
  const info = await primaryRedis.info('keyspace');
  const dbMatch = info.match(/db0:keys=(\d+)/);
  if (dbMatch) {
    metrics.gauge('redis.key_count', parseInt(dbMatch[1]));
  }
}, 60000);

Lessons for Your Own Distributed Systems

Read coordination state from the primary. Replica lag causes phantom failures and false recoveries. The performance cost is acceptable for coordination operations.
Make all writes idempotent. Double-processing must produce the same result as single-processing. Design your database writes with ON CONFLICT from the start, not as an afterthought.
TTL everything. Every Redis key written by background processes must have a TTL. An exception is a memory leak waiting to manifest under production load.
Heartbeats are not optional. Without heartbeats, you cannot distinguish a slow worker from a crashed one. Without that distinction, your reclaimer either ignores crashed workers (gap in processing) or aggressively reclaims slow workers (double processing under load).
Monitor replication lag as a first-class metric. A large lag is a data consistency crisis in slow motion. Alert at seconds, not minutes.
The sorted set frontier pattern scales. Using a Sorted Set to track out-of-order completions and advance a contiguous frontier is a general pattern — applicable to any system processing an ordered stream with parallel workers.

Summary

SupraScan coordinated 10+ concurrent worker instances using Redis block-range locks with heartbeat extension and UUID-based holder identity
The completed blocks Sorted Set + frontier key pattern tracked out-of-order processing and maintained strict sequential guarantees
Crash recovery via lock expiry + heartbeat absence detection — a worker with a missing heartbeat is presumed dead; its ranges are reclaimed
The 6-hour replication lag incident taught the fundamental lesson: always read coordination state from the primary; replica reads are for display data only
Idempotent database writes are the safety net — double processing should produce the same result as single processing
TTL every coordination key, monitor key counts, alert on replication lag above 5 seconds

Next: A-7 — Master-Replica Replication: PSYNC, Replication Buffer, and Lag — how Redis propagates writes to replicas, what happens when a replica falls behind, and how to measure and manage replication lag.

Knowledge Check

In the SupraScan architecture, how does the system maintain a global "frontier" (the highest contiguous block number fully indexed) when multiple workers might be processing blocks out of order simultaneously?

Why did the "6-Hour Replication Lag Incident" cause SupraScan's reclaimer process to erroneously requeue approximately 1 million blocks that had already been successfully indexed in PostgreSQL?

When SupraScan's background reclaimer scans for orphaned block ranges, how does it distinguish between a "healthy but slow" worker currently processing a large block range, and a "crashed" worker whose block range needs to be reclaimed?

Test your knowledge with more question sets

PreviousModule A-5: Reentrant Locks, Hierarchies, and Deadlock Prevention Next Module A-7: Master-Replica Replication: PSYNC, Replication Buffer, and Lag

Discussion

Join the discussion

Loading comments...