Module A-10 · 23 min read

When to decouple — the outbox pattern, datastore split strategies, and transitioning a write-heavy module out of the monolith without data loss.

Module 9 — Pragmatic Microservice Deconstruction: Splitting Ingestion from Analytics

Q: When migrating from a Modulith to Microservices, what is the primary purpose of the "Outbox Pattern"?

To ensure that a domain event is reliably published to a message broker (like Kafka) by writing the event to an outbox table within the same atomic database transaction as the business data. — The Outbox Pattern guarantees atomicity between database writes and event publishing. If the business data is saved, the event is guaranteed to be recorded in the outbox table in the same transaction. A separate process or worker can then reliably publish these events to a message broker.

Q: Why is the `FOR UPDATE SKIP LOCKED` clause critical when multiple workers are processing an outbox table?

It allows concurrent workers to claim rows from the queue simultaneously without blocking each other or processing duplicate events. — `FOR UPDATE SKIP LOCKED` enables high-concurrency row locking. If Worker A locks the first 100 rows, Worker B will skip those locked rows and immediately grab the next 100, preventing queue contention and ensuring events aren't processed simultaneously by multiple workers.

Q: What role does an Anti-Corruption Layer (ACL) play when splitting the Analytics service from the Ingestion service?

It translates and maps the versioned events published by the Ingestion service into the unique domain model required by the Analytics service. — An ACL ensures that a downstream service (Analytics) is not forced to adopt the internal data model of an upstream service (Ingestion). By translating versioned ingestion events into analytics-specific models, the ACL isolates the analytics service from upstream schema changes.

What this module covers: The Modulith from Module 8 is the right starting architecture. But eventually a specific module genuinely needs independent scaling, geographic distribution, or a different runtime. Extracting a module incorrectly causes data loss, split-brain state, and deployment coupling that defeats the purpose of the split. This module covers how to identify the correct seam to split, the outbox pattern for atomically publishing events during extraction, datastore split strategies, and the operational reality of running two services where there was one.

When to Split: The Three Legitimate Signals

The decision to split a module out of the Modulith should be data-driven. There are exactly three legitimate reasons:

Signal 1: Independent scaling requirement

A module needs roughly an order of magnitude more compute than the others. For example, the analytics module needs 32 CPU cores for aggregation while ingestion needs 4. Scaling the whole Modulith to 32 cores leaves 28 of them idle as far as ingestion is concerned.

javascript

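// Measure per-module CPU cost.
// process.cpuUsage() reports CPU time for the whole process, so the delta is
// only attributable to analytics if no other work runs during the await.
const before = process.cpuUsage();
await analyticsModule.processTransaction(tx);
const usage = process.cpuUsage(before); // delta since `before`, in microseconds
console.log(`Analytics CPU: ${usage.user}μs user, ${usage.system}μs system`);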

If analytics consistently uses 15× more CPU than ingestion per transaction, that's a scaling signal.
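
A single measurement is noisy, so one way to turn the comparison into a signal is to collect several per-transaction samples for each module and compare medians. The sketch below assumes each sample is the `user + system` microsecond total from `process.cpuUsage(before)`. The function names `median` and `cpuRatio` and the sample values are illustrative and not part of the module's code.

```javascript
// Median of numeric samples (microseconds of CPU per transaction).
function median(samples) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Ratio of median CPU cost between two modules.
function cpuRatio(analyticsSamples, ingestionSamples) {
  return median(analyticsSamples) / median(ingestionSamples);
}

// Hypothetical samples; the 9000 outlier barely moves the median.
const ratio = cpuRatio([1400, 1500, 1600, 9000], [90, 100, 110]);
console.log(`analytics/ingestion CPU ratio: ${ratio.toFixed(1)}x`); // 15.5x
```

The median is used rather than the mean so that one slow transaction (a GC pause, a cold cache) does not dominate the ratio.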

Signal 2: Different availability requirements

Ingestion must be 99.99% available (data loss on downtime). Analytics can tolerate 99.9% (reports are slightly stale during an outage). Running them together means analytics bugs can take down ingestion — an unacceptable risk profile.
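
The gap between the two targets is easier to judge as a downtime budget. The arithmetic below is a minimal sketch; `downtimeBudgetMinutes` is a hypothetical helper, and the year is taken as 365 days.

```javascript
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600, ignoring leap years

// Downtime budget in minutes per year for an availability target such as 0.9999.
function downtimeBudgetMinutes(availability) {
  return (1 - availability) * MINUTES_PER_YEAR;
}

const ingestionBudget = downtimeBudgetMinutes(0.9999); // about 52.6 minutes
const analyticsBudget = downtimeBudgetMinutes(0.999);  // about 525.6 minutes

console.log(`ingestion: ${ingestionBudget.toFixed(1)} min/year`);
console.log(`analytics: ${analyticsBudget.toFixed(1)} min/year`);
```

Under these targets ingestion may be down for less than an hour per year while analytics has close to nine hours, so a shared process forces analytics to meet the stricter budget.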

Signal 3: Genuine team autonomy need

A separate team owns analytics and deploys 10 times per day. Coupling their deployment to ingestion (which deploys once a week) creates a deployment bottleneck. Conway's Law applies.

Do NOT split for:

"Microservices are the modern way" (cargo cult)
The module is large (size is not a reason for distribution)
You want to use a different language (use a native addon instead)
"It might need to scale later" (YAGNI — split when the signal is real)

Identifying the Split Seam

The correct seam for splitting is where write contention and read contention diverge.

For a blockchain indexer:

text

Write path:    incoming block → parse → validate → INSERT into transactions table
Read path:     API queries → SELECT aggregations, lookups, analytics

Both paths hit the same PostgreSQL primary.
At 50K writes/sec + 10K reads/sec, the primary's write throughput is the constraint.
Reads on the primary compete with writes for buffer pool, WAL, and I/O.

The seam: separate the write service from the read/analytics service. The write service owns the primary. The analytics service reads from a read replica or a separate projection store.

text

Before split:
  [Modulith] → writes transactions → PostgreSQL primary
  [Modulith] → reads analytics ← PostgreSQL primary (competing I/O)

After split:
  [Ingestion Service] → writes transactions → PostgreSQL primary
  [Ingestion Service] → publishes events → Kafka topic
  [Analytics Service] → consumes Kafka → builds projections → PostgreSQL analytics (separate)
  [Analytics Service] → serves read API

Now writes and reads never compete. The ingestion service can sustain 50K writes/sec at full I/O. Analytics has its own PostgreSQL instance sized for reads.

The Outbox Pattern: Atomic Write + Event Publish

The most dangerous moment in a service split is when a write to the database and a publish to Kafka must be treated as a unit. If you write to the database and then publish to Kafka, a crash between the two leaves the database updated but Kafka unnotified — downstream services miss the event.

javascript

// BROKEN: non-atomic write + publish
async function ingestTransaction(tx) {
  await db.write(tx);              // step 1: DB write succeeds
  // CRASH HERE → Kafka never gets the event → analytics never updates
  await kafka.publish('transactions', tx);  // step 2: may never execute
}

The outbox pattern solves this by writing the event to an outbox table in the same database transaction as the business data. Atomicity is guaranteed by the database. A separate process reads the outbox and publishes to Kafka.

javascript

// Step 1: Write transaction + outbox event in ONE database transaction
async function ingestTransaction(tx) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');

    // Write business data
    await client.query(
      'INSERT INTO transactions (hash, block_height, sender, amount) VALUES ($1, $2, $3, $4)',
      [tx.hash, tx.blockHeight, tx.sender, tx.amount]
    );

    // Write outbox event — same transaction, atomically
    await client.query(
      'INSERT INTO outbox (event_type, payload, created_at) VALUES ($1, $2, NOW())',
      ['TRANSACTION_INGESTED', JSON.stringify(tx)]
    );

    await client.query('COMMIT');
    // Either BOTH succeed or BOTH fail. Never one without the other.
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
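
The INSERT above presupposes an `outbox` table that the module does not show. A minimal sketch of one possible schema follows. The column types, the index name `outbox_unpublished_idx`, and the helper `createOutboxTable` are assumptions chosen to match the queries in this section, not the original schema.

```javascript
// Assumed schema: covers the columns used by the INSERT above and by the
// publisher's SELECT and UPDATE (id, event_type, payload, created_at, published_at).
const OUTBOX_DDL = `
  CREATE TABLE IF NOT EXISTS outbox (
    id           BIGSERIAL PRIMARY KEY,
    event_type   TEXT        NOT NULL,
    payload      TEXT        NOT NULL,  -- JSON string, as written by JSON.stringify
    created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ            -- NULL until the publisher marks the row
  );
  -- Partial index keeps the scan for unpublished rows small as the table grows.
  CREATE INDEX IF NOT EXISTS outbox_unpublished_idx
    ON outbox (id) WHERE published_at IS NULL;
`;

// `pool` is expected to be a pg Pool, or anything with a compatible query method.
async function createOutboxTable(pool) {
  await pool.query(OUTBOX_DDL);
}
```

Published rows are never deleted in this sketch, so a periodic cleanup of old rows with a non-NULL `published_at` would be needed to keep the table from growing without bound.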

javascript

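// Step 2: Outbox publisher. Claims unpublished events and publishes them to Kafka.
// `producer` is assumed to be a connected kafkajs producer (kafka.producer()).
async function outboxPublisher() {
  while (true) {
    const client = await pool.connect();
    let claimed = 0;
    try {
      // Row locks last only as long as the transaction that took them, so the
      // SELECT, the publish, and the UPDATE must share one transaction.
      await client.query('BEGIN');
      const { rows } = await client.query(
        'SELECT id, event_type, payload FROM outbox WHERE published_at IS NULL ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED'
      );
      claimed = rows.length;

      if (claimed > 0) {
        // Publish to Kafka (payload is assumed to be stored as a JSON string)
        await producer.send({
          topic: 'transactions',
          messages: rows.map(row => ({ key: row.id.toString(), value: row.payload })),
        });

        // Mark as published
        await client.query(
          'UPDATE outbox SET published_at = NOW() WHERE id = ANY($1)',
          [rows.map(r => r.id)]
        );
      }
      await client.query('COMMIT');
    } catch (err) {
      await client.query('ROLLBACK');  // rows stay unpublished and are retried
      throw err;
    } finally {
      client.release();
    }

    if (claimed === 0) await sleep(100);  // no events, wait briefly
  }
}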

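Why FOR UPDATE SKIP LOCKED: Multiple outbox publisher instances can run safely. Each claims a batch of rows that no other publisher currently holds, and skips locked rows instead of waiting on them. The row locks are released when the claiming transaction ends, so the SELECT must run in the same transaction as the UPDATE that marks the rows. This prevents two publishers from sending the same batch concurrently; it does not prevent a re-publish after a crash.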

The at-least-once guarantee: if the publisher crashes after publishing to Kafka but before marking the rows, it will re-publish on restart. The Kafka consumer must be idempotent — processing the same event twice must be safe.
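
The crash window can be made concrete with a small in-memory simulation. Nothing here touches PostgreSQL or Kafka: the `outbox` array, the `topic` array, and the `publishOnce` helper are stand-ins invented for illustration.

```javascript
// In-memory stand-ins for the outbox table and the Kafka topic.
const outbox = [
  { id: 1, payload: 'tx-a', publishedAt: null },
  { id: 2, payload: 'tx-b', publishedAt: null },
];
const topic = [];

// One publisher pass: send unpublished rows, then mark them.
// `crashBeforeMark` simulates the process dying between the two steps.
function publishOnce({ crashBeforeMark }) {
  const batch = outbox.filter(row => row.publishedAt === null);
  for (const row of batch) topic.push(row.payload); // "send to Kafka"
  if (crashBeforeMark) return;                       // crash: rows stay unmarked
  for (const row of batch) row.publishedAt = Date.now();
}

publishOnce({ crashBeforeMark: true });  // first attempt dies after sending
publishOnce({ crashBeforeMark: false }); // restart re-sends the same rows

// An idempotent consumer applies only the first occurrence of each event.
const seen = new Set();
const applied = [];
for (const payload of topic) {
  if (seen.has(payload)) continue;
  seen.add(payload);
  applied.push(payload);
}

console.log(topic);   // [ 'tx-a', 'tx-b', 'tx-a', 'tx-b' ]
console.log(applied); // [ 'tx-a', 'tx-b' ]
```

The topic holds four messages for two events, and only deduplication on the consumer side brings the applied set back to two.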

Kafka Consumer: Idempotent Processing

The analytics service consumes from the Kafka topic. Because the outbox guarantees at-least-once delivery, the consumer must handle duplicates:

javascript

import { Kafka } from 'kafkajs';

const kafka = new Kafka({ brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'analytics-service' });

await consumer.connect();
await consumer.subscribe({ topic: 'transactions', fromBeginning: false });

await consumer.run({
  autoCommit: false,  // manual offset commits: at-least-once delivery, made safe by idempotent processing
  eachMessage: async ({ topic, partition, message, heartbeat }) => {
    const tx = JSON.parse(message.value.toString());

    // Idempotent processing: INSERT ... ON CONFLICT DO NOTHING
    // If we've seen this transaction ID before, skip silently
    const result = await analyticsPool.query(
      `INSERT INTO transaction_projections (tx_hash, block_height, amount, processed_at)
       VALUES ($1, $2, $3, NOW())
       ON CONFLICT (tx_hash) DO NOTHING`,
      [tx.hash, tx.blockHeight, tx.amount]
    );

    if (result.rowCount === 0) {
      console.log(`Duplicate event skipped: ${tx.hash}`);
    }

    // Commit offset only after successful processing
    await consumer.commitOffsets([{
      topic, partition,
      offset: (BigInt(message.offset) + 1n).toString()  // offsets are 64-bit; BigInt avoids precision loss
    }]);

    // Heartbeat for long-running processing
    await heartbeat();
  }
});

Datastore Split Strategies

When splitting, you have several options for how to divide the data:

Strategy 1: Primary + Read Replica (simplest)

text

Ingestion Service → PostgreSQL primary (writes)
Analytics Service → PostgreSQL read replica (reads)

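The read replica replays WAL streamed from the primary and typically stays within about a second of it under normal load, though lag can grow during write bursts or while long-running queries on the replica delay WAL replay. Analytics queries run against the replica without competing with writes on the primary. No separate schema is required.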

Limitation: analytics read volume still competes with replication I/O on the replica. Works for moderate analytics load.
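
Replica lag is worth measuring rather than assuming. The sketch below uses PostgreSQL's `pg_last_xact_replay_timestamp()`, which returns the commit time of the last transaction replayed on a standby. The helper name `replicationLagSeconds` and the `replicaPool` parameter (assumed to be a `pg` Pool connected to the replica) are illustrative.

```javascript
// Returns how far the replica's replayed WAL is behind, in seconds, or null
// when the server is not a standby (the SQL function returns NULL there).
// `replicaPool` is assumed to be a pg Pool connected to the read replica.
async function replicationLagSeconds(replicaPool) {
  const { rows } = await replicaPool.query(
    'SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag'
  );
  const lag = rows[0].lag;
  return lag === null ? null : Number(lag); // numeric values may arrive as strings
}
```

One caveat with this measure: when the primary is idle, no new commits arrive, so the value keeps growing even though the replica is fully caught up. It is most meaningful under steady write load.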

Strategy 2: CQRS with separate projection store

text

Ingestion Service → PostgreSQL primary (normalized, OLTP)
Analytics Service → analytics PostgreSQL (denormalized projections, OLAP)
                  → Updated via Kafka consumer

The analytics database is optimized for read patterns: denormalized tables, partial indexes, materialized views. The ingestion database is optimized for write throughput: normalized, minimal indexes.

javascript

// Analytics projection table — denormalized for fast reads
// Updated by Kafka consumer, not by ingestion service
await analyticsPool.query(`
  CREATE TABLE IF NOT EXISTS daily_volume (
    date DATE,
    total_amount NUMERIC(38, 8),
    tx_count BIGINT,
    unique_senders BIGINT,
    PRIMARY KEY (date)
  )
`);

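// Consumer updates projection.
// Caveats: this additive upsert is not idempotent, so apply it only for events
// that passed the ON CONFLICT dedup check, ideally in the same DB transaction.
// unique_senders is not updated on conflict: a distinct count cannot be
// maintained by incrementing and needs its own per-day sender tracking.
// DATE($1) assumes tx.timestamp is a timestamp value, not epoch seconds.
await analyticsPool.query(`
  INSERT INTO daily_volume (date, total_amount, tx_count, unique_senders)
  VALUES (DATE($1), $2, 1, 1)
  ON CONFLICT (date) DO UPDATE SET
    total_amount = daily_volume.total_amount + $2,
    tx_count = daily_volume.tx_count + 1
`, [tx.timestamp, tx.amount]);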
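
Because the upsert increments counters, a redelivered event would be counted twice, and a distinct-sender count cannot be kept by incrementing at all. The in-memory model below sketches the logic a consumer needs: apply each transaction hash at most once and track senders as a set. The `DailyVolume` class and its field names are hypothetical, and amounts are assumed to be integer strings with timestamps in epoch seconds.

```javascript
class DailyVolume {
  constructor() {
    this.seenHashes = new Set(); // dedup key, like a UNIQUE (tx_hash) constraint
    this.days = new Map();       // 'YYYY-MM-DD' -> { total, txCount, senders }
  }

  // Returns true if the event changed the projection, false for a duplicate.
  apply(tx) {
    if (this.seenHashes.has(tx.hash)) return false;
    this.seenHashes.add(tx.hash);

    const day = new Date(tx.timestamp * 1000).toISOString().slice(0, 10); // UTC date
    const row = this.days.get(day) ?? { total: 0n, txCount: 0, senders: new Set() };
    row.total += BigInt(tx.amount); // integer strings summed exactly
    row.txCount += 1;
    row.senders.add(tx.sender);
    this.days.set(day, row);
    return true;
  }
}

const volume = new DailyVolume();
const event = { hash: '0xabc', sender: 'alice', amount: '250', timestamp: 1700000000 };
volume.apply(event);
volume.apply(event); // redelivery: ignored
volume.apply({ hash: '0xdef', sender: 'alice', amount: '50', timestamp: 1700000100 });

const day = volume.days.get('2023-11-14');
console.log(day.total, day.txCount, day.senders.size); // 300n 2 1
```

In PostgreSQL the equivalent would be to run the dedup insert and the aggregate update in one transaction, and to keep senders in a separate per-day table (or an approximate distinct-count structure) from which `unique_senders` is derived.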

Strategy 3: Dedicated analytics store (ClickHouse/TimescaleDB)

For high-cardinality time-series analytics (transaction volume by hour by sender by network), a columnar store like ClickHouse can answer large aggregation queries 10–100× faster than row-oriented PostgreSQL, depending on schema and query shape.

javascript

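// Kafka consumer writing to ClickHouse via HTTP interface.
// One row per request is shown for brevity; ClickHouse favors batched inserts.
const res = await fetch('http://clickhouse:8123/', {
  method: 'POST',
  body: `INSERT INTO transactions FORMAT JSONEachRow\n${JSON.stringify(tx)}\n`,
});
if (!res.ok) {
  // fetch() does not reject on HTTP errors, so a failed insert must be detected here
  throw new Error(`ClickHouse insert failed: ${res.status} ${await res.text()}`);
}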

This is the correct deployment for a production blockchain explorer where analytics queries aggregate billions of rows.
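
At that scale, inserts are usually buffered and sent in batches rather than one row per request. The sketch below shows one way to do that. `RowBatcher` is a hypothetical class, and the `send` callback is injected so the example does not depend on a particular HTTP client; in practice it could wrap a `fetch()` POST to the ClickHouse HTTP interface.

```javascript
// Buffers rows and flushes them as a single JSONEachRow INSERT body.
class RowBatcher {
  constructor({ table, maxRows, send }) {
    this.table = table;
    this.maxRows = maxRows;
    this.send = send; // async (body: string) => void, supplied by the caller
    this.rows = [];
  }

  // Adds a row and flushes once the buffer reaches maxRows.
  async add(row) {
    this.rows.push(row);
    if (this.rows.length >= this.maxRows) await this.flush();
  }

  // Sends all buffered rows in one request; restores them if the send fails.
  async flush() {
    if (this.rows.length === 0) return;
    const batch = this.rows;
    this.rows = [];
    const body =
      `INSERT INTO ${this.table} FORMAT JSONEachRow\n` +
      batch.map(row => JSON.stringify(row)).join('\n') + '\n';
    try {
      await this.send(body);
    } catch (err) {
      this.rows = batch.concat(this.rows); // keep rows for the next attempt
      throw err;
    }
  }
}
```

A real consumer would also call `flush()` on a timer so a quiet period does not leave rows sitting in the buffer, and would commit Kafka offsets only after the flush that contains them succeeds.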

Service Contract: The Anti-Corruption Layer

When splitting, the analytics service must not depend on the ingestion service's internal data model. If ingestion changes its schema, analytics should not break.

typescript

// modules/ingestion/events.ts: the public contract (changes only through a new eventVersion)
export interface TransactionIngestedEvent {
  eventType: 'TRANSACTION_INGESTED';
  eventVersion: 1;
  transactionHash: string;
  blockHeight: number;
  senderAddress: string;
  amount: string;  // string to avoid BigInt serialization issues
  timestamp: number;
}

// Anti-corruption layer in analytics: translate from ingestion's event format
// to analytics' domain model
function translateEvent(event: TransactionIngestedEvent): AnalyticsTransaction {
  return {
    hash: event.transactionHash,  // analytics uses different field names
    height: event.blockHeight,
    sender: event.senderAddress,
    value: BigInt(event.amount),  // analytics uses BigInt internally
    at: new Date(event.timestamp * 1000),
  };
}

Version the events (eventVersion: 1). When the ingestion event format changes, bump to eventVersion: 2 and have analytics handle both versions during the transition.
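
Handling both versions during a transition might look like the sketch below. The v2 shape is entirely hypothetical, invented here to have something to translate: it renames `amount` to `amountBaseUnits` and reports `timestamp` in milliseconds instead of seconds.

```javascript
// Anti-corruption layer that accepts v1 and a hypothetical v2 event and
// produces the same analytics model from either.
function translateEvent(event) {
  switch (event.eventVersion) {
    case 1:
      return {
        hash: event.transactionHash,
        height: event.blockHeight,
        sender: event.senderAddress,
        value: BigInt(event.amount),
        at: new Date(event.timestamp * 1000), // v1: epoch seconds
      };
    case 2:
      return {
        hash: event.transactionHash,
        height: event.blockHeight,
        sender: event.senderAddress,
        value: BigInt(event.amountBaseUnits), // hypothetical renamed field
        at: new Date(event.timestamp),        // hypothetical: epoch milliseconds
      };
    default:
      // Unknown versions fail loudly instead of being silently mis-parsed.
      throw new Error(`Unsupported eventVersion: ${event.eventVersion}`);
  }
}
```

Throwing on an unknown version means a consumer deployed before the producer's next bump stops at the first unfamiliar event rather than writing wrong projections.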

Operational Reality: Two Services

After the split, you have doubled the operational surface:

Concern	Before (Modulith)	After (Split)
Deployments	1 pipeline	2 pipelines
Health checks	1 endpoint	2 endpoints
Logs	1 log stream	2 log streams
Distributed tracing	Optional	Required
Schema migrations	1 database	2 databases
Kafka consumer lag	N/A	Must monitor
Outbox queue depth	N/A	Must monitor
Data consistency	Guaranteed (same TX)	Eventual (Kafka lag)

Monitoring additions for the split:

javascript

// Monitor Kafka consumer lag (analytics falling behind ingestion)
// Assumes kafkajs v2, where fetchOffsets returns [{ topic, partitions: [...] }]
const admin = kafka.admin();
await admin.connect();

setInterval(async () => {
  try {
    const [{ partitions }] = await admin.fetchOffsets({ groupId: 'analytics-service', topics: ['transactions'] });
    const endOffsets = await admin.fetchTopicOffsets('transactions');

    for (const { partition, offset } of partitions) {
      const end = endOffsets.find(e => e.partition === partition);
      if (!end || offset === '-1') continue;  // no end offset, or nothing committed yet
      const lag = Number(end.offset) - Number(offset);
      consumerLagGauge.labels(partition.toString()).set(lag);

      if (lag > 100_000) {
        logger.warn(`Analytics consumer lag: ${lag} messages on partition ${partition}`);
      }
    }
  } catch (err) {
    logger.warn(`Consumer lag check failed: ${err.message}`);
  }
}, 10_000);
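
The lag arithmetic itself can be isolated into a pure function that is easy to test. The sketch below assumes offsets arrive as decimal strings and that a committed offset of `'-1'` means the group has not committed anything yet; `partitionLag` and its parameter names are hypothetical.

```javascript
// Lag for one partition, computed from offsets given as decimal strings.
// With no committed offset ('-1'), lag is measured from the low watermark,
// i.e. everything still retained in the partition counts as unprocessed.
function partitionLag({ committed, high, low }) {
  const start = committed === '-1' ? BigInt(low) : BigInt(committed);
  const lag = BigInt(high) - start;
  return lag < 0n ? 0n : lag; // clamp in case the two reads raced
}

console.log(partitionLag({ committed: '90', high: '100', low: '0' }));  // 10n
console.log(partitionLag({ committed: '-1', high: '100', low: '40' })); // 60n
```

BigInt is used because Kafka offsets are 64-bit integers; for a metrics gauge the result can be converted with `Number(lag)` once it is known to be small.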

javascript

// Monitor outbox queue depth (publish delay)
setInterval(async () => {
  const { rows } = await pool.query(
    'SELECT COUNT(*) FROM outbox WHERE published_at IS NULL'
  );
  outboxDepthGauge.set(parseInt(rows[0].count));
}, 5_000);

Production Incident: Split-Brain During Extraction Migration

Context: A blockchain indexer mid-migration. Ingestion writes to PostgreSQL. The outbox publisher was deployed but the analytics Kafka consumer had not yet been deployed.

What happened:

The outbox table accumulated 12 million rows over 2 days. When the analytics consumer was finally deployed, it began processing 12 million events from the beginning of time. The analytics database received 12 million inserts in 4 hours — a 10× spike over normal write rate. The analytics PostgreSQL ran out of IOPS and fell behind. Queries against the analytics API returned stale data for 18 hours while the consumer caught up.

The fix:

javascript

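// Option 1: Seed the analytics database from the primary before enabling the consumer
// Run a one-time historical backfill query.
// This form assumes `transactions` is visible from the analytics database
// (for example as a postgres_fdw foreign table); otherwise run the SELECT on
// the primary and load the result into analytics.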
await analyticsPool.query(`
  INSERT INTO daily_volume (date, total_amount, tx_count)
  SELECT
    DATE(timestamp) as date,
    SUM(amount) as total_amount,
    COUNT(*) as tx_count
  FROM transactions  -- query the primary
  GROUP BY DATE(timestamp)
  ON CONFLICT (date) DO UPDATE SET
    total_amount = EXCLUDED.total_amount,
    tx_count = EXCLUDED.tx_count
`);

// Option 2: Start the Kafka consumer at the current offset (skip historical events)
// Only do this if the analytics DB was pre-seeded with historical data
await consumer.subscribe({ topic: 'transactions', fromBeginning: false });
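// fromBeginning: false = a group with no committed offsets starts at the latest offset;
// a group that has already committed offsets resumes from them regardless of this flag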

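The lesson: migrations that involve a new consumer processing a backlog must account for the resource cost of processing historical events. Pre-seed the downstream datastore before enabling the consumer, then start the consumer from the current offset. Events published between the seed snapshot and the consumer's starting offset are covered by neither, so record the snapshot boundary (for example, the last block height included) and re-run the backfill for that window once the consumer is live.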
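
When a backlog does have to be loaded, splitting it into bounded chunks with a pause between them keeps the write rate on the analytics database under control. The sketch below assumes the backfill can be partitioned by block height. `heightRanges`, `throttledBackfill`, and the `backfillChunk` callback are hypothetical names.

```javascript
// Splits an inclusive block-height range into consecutive inclusive chunks.
function heightRanges(minHeight, maxHeight, chunkSize) {
  if (chunkSize <= 0) throw new Error('chunkSize must be positive');
  const ranges = [];
  for (let from = minHeight; from <= maxHeight; from += chunkSize) {
    ranges.push({ from, to: Math.min(from + chunkSize - 1, maxHeight) });
  }
  return ranges;
}

// Runs `backfillChunk` for each range in order, pausing between chunks
// so the average write rate stays bounded.
async function throttledBackfill({ minHeight, maxHeight, chunkSize, pauseMs, backfillChunk }) {
  for (const range of heightRanges(minHeight, maxHeight, chunkSize)) {
    await backfillChunk(range); // e.g. an INSERT ... SELECT limited to this range
    await new Promise(resolve => setTimeout(resolve, pauseMs));
  }
}

console.log(heightRanges(1, 10, 4));
// [ { from: 1, to: 4 }, { from: 5, to: 8 }, { from: 9, to: 10 } ]
```

The chunk size and pause would be tuned against the analytics database's available IOPS; the values are deliberately left to the caller.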

Summary

Concept	Key Takeaway
Split signals	Independent scale requirement, different availability, team autonomy. Not "microservices are modern."
Correct seam	Where write and read contention diverge. Separate write-heavy from read-heavy on different datastores.
Outbox pattern	Write event to outbox table in same DB transaction as business data. Atomicity guaranteed.
At-least-once delivery	Publisher may re-publish after crash. Consumer must be idempotent.
`FOR UPDATE SKIP LOCKED`	Multiple publishers drain the outbox without claiming the same rows concurrently. Locks last only for the claiming transaction.
Idempotent consumer	`INSERT ... ON CONFLICT DO NOTHING`. Safe to process same event twice.
CQRS projections	Analytics DB is denormalized, read-optimized. Updated via Kafka, never by ingestion service.
Anti-corruption layer	Version your events. Translate between service domains explicitly.
Operational cost	2 pipelines, 2 health checks, consumer lag monitoring, outbox depth monitoring.
Historical backfill	Pre-seed downstream DB before enabling consumer. Never process 12M historical events on a live system.

You can now build and split distributed services. Module 10 covers how those services talk to each other at scale — gRPC vs REST, Kafka consumer group mechanics, and event sourcing for payment ledgers that need replay capability.

Next: Module 10 — High-Performance IPC: gRPC, Kafka & Event Streams →

