Module A-16·30 min read

Logical vs physical replication, replication slots and the slot bloat disaster, publication/subscription model, WAL sender process, Debezium + Kafka change data capture, zero-downtime major version upgrades using logical replication as a migration bridge, and replication identity modes.

Module 17 — Logical Replication and CDC Pipelines

Q: A critical production PostgreSQL instance is configured with a logical replication slot for a Debezium connector. Due to an unforeseen issue, the Debezium Kafka Connect worker goes offline for 48 hours. The PostgreSQL instance does *not* have `max_slot_wal_keep_size` configured. Which of the following is the most likely and severe outcome, and what is the primary architectural mitigation to prevent it?

The PostgreSQL instance's disk will eventually fill up with unconsumed WAL segments, leading to the database stopping accepting writes and causing an application outage. The primary mitigation is to configure `max_slot_wal_keep_size`. — Option B is correct. As described in the lesson, without `max_slot_wal_keep_size` configured, PostgreSQL will faithfully retain all WAL segments required by an inactive logical replication slot. If the subscriber (Debezium) is down for an extended period, these WAL segments will accumulate, eventually filling the disk. When the disk is full, PostgreSQL cannot write new WAL, which causes it to stop accepting new writes, leading to a complete application outage. The primary mitigation is to configure `max_slot_wal_keep_size` (e.g., '10GB') to automatically drop slots that fall too far behind, forcing the subscriber to resync but preventing a database outage. Option A is incorrect because the database itself will be severely impacted, not operate normally. Option C is incorrect; logical replication slots do not automatically discard WAL or become inactive in a way that cleans up WAL without `max_slot_wal_keep_size`. Option D is incorrect; PostgreSQL does not throttle writes in this scenario; it stops them entirely when the disk fills.

Q: A development team is performing a schema migration, adding a new non-nullable column with a default value to an `orders` table that is actively being logically replicated from a PostgreSQL 12 source to a PostgreSQL 16 destination as part of a zero-downtime major version upgrade. The team applies the `ALTER TABLE` statement to the PostgreSQL 12 source database first, then immediately attempts to apply it to the PostgreSQL 16 destination. What is the most likely immediate consequence, and what is the correct sequence of operations to prevent it?

The logical replication subscription on PostgreSQL 16 will encounter an `ERROR: column "new_column" of relation "orders" does not exist` when it attempts to apply changes from the source that include the new column, causing replication to stop. The correct sequence is to apply DDL to PG16 first, then to PG12. — Option B is correct. The lesson explicitly states that logical replication does NOT replicate DDL. If the source database (PG12) has a new column added and starts writing data for it, but the destination database (PG16) does not yet have that column, the logical replication subscription on PG16 will fail with an `ERROR: column "new_column" of relation "orders" does not exist` when it tries to apply the incoming changes. This will stop the replication. The correct procedure for DDL changes during an upgrade is to apply the DDL to the destination (PG16) first, ensuring the schema is ready, and then apply it to the source (PG12). Option A is incorrect; logical replication does not automatically lock tables to prevent DDL in this manner. Option C is incorrect; the issue is not just about data not being replicated, but replication *stopping* due to a schema mismatch. Option D is incorrect; the source database does not experience an outage due to DDL synchronization with the replication stream; the problem manifests on the destination when applying changes.

Q: A Senior SRE has configured a Debezium connector to stream changes from a PostgreSQL database to Kafka using logical replication. They have set up robust monitoring for replication slot lag using `pg_replication_slots`, with alerts configured to page on-call engineers if `lag_size` exceeds 5GB. However, they neglected to configure the Debezium heartbeat mechanism. What is the most likely operational challenge this oversight will cause, and why?

The `lag_size` metric will inaccurately show a constantly growing lag even when the Debezium connector is healthy and actively consuming changes, leading to frequent false positive alerts and alert fatigue. This happens because `restart_lsn` only advances with actual data changes. — Option B is correct. The lesson explicitly states that if monitored tables have no changes for hours, the replication slot's `restart_lsn` does not advance because there's nothing to consume. Consequently, the `lag_size` metric will continuously grow, even if the Debezium connector is perfectly healthy and caught up. This leads to frequent false positive alerts and alert fatigue, undermining the effectiveness of the monitoring system. The heartbeat mechanism is designed to periodically write a small change to a dedicated table, thereby advancing the LSN and keeping the lag metric accurate. Option A is incorrect; the absence of a heartbeat does not cause PostgreSQL to prematurely close the slot or disconnect the connector. Option C is incorrect; `wal_level` is a static configuration parameter and does not automatically revert based on replication activity. Option D is incorrect; the heartbeat mechanism is for advancing the LSN to keep lag metrics accurate, not for synchronizing DDL changes, which is a separate manual coordination process.

We needed to upgrade from PostgreSQL 12 to PostgreSQL 16. Physical streaming replication doesn't cross major versions. The naive approach: pg_dump, restore, brief downtime. Our database was 8TB. The dump alone took 4 hours. The restore took 6 hours. Then indexing. Total downtime: 14 hours. We told the team we could do it in 2 hours. We were wrong. Logical replication, done correctly, would have given us 0 seconds of downtime.

Physical Replication vs Logical Replication

	Physical (Streaming)	Logical
Unit of transfer	WAL bytes (block-level)	Logical row changes (decoded)
Cross-version support	Same major version only	Works across major versions
Selective tables	Entire cluster	Per-table granularity
DDL replicated automatically	Via WAL blocks only	Not automatically
Bi-directional	No	Yes, carefully
Primary use case	HA, hot standby, read replicas	CDC, version upgrades, ETL, Kafka

Physical replication copies WAL bytes verbatim — the standby replays the exact same block writes. It cannot cross major versions because the on-disk format changes between them. Logical replication decodes the WAL into row-level changes (INSERT/UPDATE/DELETE), which are version-agnostic. That version-agnosticism is the property that makes the zero-downtime upgrade possible.

Replication Slots — The Critical Operational Detail

Replication slots are PostgreSQL's mechanism for ensuring the WAL needed by a subscriber is not discarded before the subscriber reads it. This sounds helpful. It is helpful until the subscriber goes away.

The disaster scenario: You create a logical replication slot. The subscriber is a Debezium connector that reads changes and publishes to Kafka. The Debezium Kafka Connect worker goes down for a maintenance window. Three days pass. PostgreSQL has been faithfully retaining WAL since the slot was last consumed. WAL accumulates on disk. Your 500GB disk fills up. PostgreSQL cannot write new WAL. PostgreSQL stops accepting writes. Your entire application is down. Every application team is paging the on-call. This scenario is not hypothetical — it has killed production systems.

Monitoring slot lag — make this a production alert, not an afterthought:

sql

SELECT
  slot_name,
  slot_type,
  active,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  ) AS lag_size,
  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
ORDER BY lag_bytes DESC NULLS LAST;

Alert when any slot's lag_size exceeds 5GB. Page someone when it exceeds 20GB.

Cap WAL retention per slot using max_slot_wal_keep_size (PostgreSQL 13+):

ini

# postgresql.conf
max_slot_wal_keep_size = '10GB'
# Slots that fall more than 10GB behind are dropped automatically.
# The subscriber must resync from scratch — painful but preferable
# to the database going down.

When a slot is dropped due to the size limit, the slot's active column goes to false and pg_replication_slots shows the slot as invalidated. The subscriber gets an error on its next attempt to read, knows it needs to resync, and can do so. Your disk stays healthy. Set the alert lower than the limit so you have time to act before the slot is dropped.

Publication and Subscription

On the source database:

sql

-- Create a publication for specific tables
CREATE PUBLICATION my_publication
  FOR TABLE orders, products, users
  WITH (publish = 'insert, update, delete');

-- Or publish all tables in the schema
CREATE PUBLICATION all_tables FOR ALL TABLES;

-- Check what's published
SELECT * FROM pg_publication_tables;

On the destination database:

sql

-- Subscribe to the publication
-- PostgreSQL handles initial data sync automatically
CREATE SUBSCRIPTION my_subscription
  CONNECTION 'host=source-db.example.com port=5432 dbname=myapp user=replicator password=secret'
  PUBLICATION my_publication;

PostgreSQL handles the initial data sync (copying existing rows) and then streams ongoing changes. No application changes required during initial sync.

Monitoring subscription progress:

sql

-- On the destination: check sync state
SELECT
  subname,
  received_lsn,
  latest_end_lsn,
  pg_size_pretty(
    pg_wal_lsn_diff(latest_end_lsn, received_lsn)
  ) AS replication_lag
FROM pg_stat_subscription;
-- When received_lsn approaches latest_end_lsn, lag is near-zero.

sql

-- Initial table sync status
SELECT
  subname,
  relname,
  srsubstate,  -- 'r' = ready, 'i' = initial sync in progress, 'd' = data sync
  srsublsn
FROM pg_subscription_rel
JOIN pg_class ON relid = pg_class.oid
JOIN pg_subscription ON srsubid = pg_subscription.oid;

Replication Identity — The Row Identification Problem

Logical replication needs to identify which row to UPDATE or DELETE on the destination. Without a way to find the target row, UPDATE and DELETE cannot be applied.

sql

-- DEFAULT: uses the PRIMARY KEY (the right choice for most tables)
ALTER TABLE orders REPLICA IDENTITY DEFAULT;

-- FULL: includes all column values before the change in the WAL record.
-- Works for tables without a primary key, but makes WAL significantly larger.
-- Every UPDATE writes the entire old row to WAL.
ALTER TABLE orders REPLICA IDENTITY FULL;

-- USING INDEX: use a unique index instead of the PK
ALTER TABLE orders REPLICA IDENTITY USING INDEX idx_orders_external_id;

-- NOTHING: no old row data in WAL records.
-- UPDATEs and DELETEs cannot be applied to the destination.
-- Only use if you're replicating INSERT-only tables.
ALTER TABLE orders REPLICA IDENTITY NOTHING;

Tables without a primary key and without REPLICA IDENTITY FULL will fail to replicate UPDATEs and DELETEs:

text

ERROR: logical replication target relation "public.orders" has
replica identity "d" (default), but the primary key is not present

Before setting up logical replication, audit all tables in the publication for missing primary keys:

sql

SELECT
  t.tablename,
  pk.constraint_type
FROM pg_tables t
LEFT JOIN information_schema.table_constraints pk
  ON pk.table_name = t.tablename
  AND pk.constraint_type = 'PRIMARY KEY'
WHERE t.schemaname = 'public'
  AND pk.constraint_type IS NULL;
-- Tables without a PK will be problematic for UPDATE/DELETE replication

Zero-Downtime Major Version Upgrade

The playbook for upgrading PostgreSQL 12 to PostgreSQL 16 with under 5 minutes of write downtime.

Step 1: Prepare the source (PG12)

sql

-- Enable logical WAL on PG12
ALTER SYSTEM SET wal_level = 'logical';
SELECT pg_reload_conf();
-- Requires restart if wal_level was previously 'replica' or 'minimal'

-- Create a dedicated replication user
CREATE ROLE replicator REPLICATION LOGIN PASSWORD 'use-a-real-secret';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replicator;
GRANT USAGE ON SCHEMA public TO replicator;

-- Create the publication
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

Step 2: Copy the schema to PG16

bash

# Dump schema only (no data) from PG12
pg_dump --schema-only --no-owner --no-acl \
  -h pg12-host -U postgres myapp > schema.sql

# Restore schema on PG16
psql -h pg16-host -U postgres myapp < schema.sql

DDL must exist on the destination before you create the subscription — otherwise the initial sync fails when it tries to copy data into tables that don't exist.

Step 3: Create the subscription on PG16

sql

-- On PG16
CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=pg12-host port=5432 dbname=myapp user=replicator password=use-a-real-secret'
  PUBLICATION upgrade_pub;

PostgreSQL begins copying existing rows immediately. For an 8TB database, this takes hours. The source database continues serving traffic normally. All changes made during the sync are captured and will be applied after the initial copy completes.

Step 4: Monitor lag until near-zero

sql

-- Poll this on PG16 until lag is consistently below 1MB
SELECT
  subname,
  received_lsn,
  latest_end_lsn,
  pg_size_pretty(pg_wal_lsn_diff(latest_end_lsn, received_lsn)) AS lag
FROM pg_stat_subscription;

When lag is below 1MB, you are ready for the cutover window.

Step 5: The cutover (aim for under 5 minutes)

bash

# 1. Enable maintenance mode in your application (return 503, queue or reject writes)
# 2. Wait for replication lag to reach 0 bytes
# 3. Verify sequence values on PG16 match PG12 (sequences are NOT replicated)

sql

-- On PG12: check current sequence values
SELECT sequence_name, last_value FROM information_schema.sequences
WHERE sequence_schema = 'public';

-- On PG16: fast-forward sequences to match
-- (add headroom to avoid collision on first write)
SELECT setval('orders_id_seq', (SELECT last_value FROM orders_id_seq) + 10000);

bash

# 4. Disable the subscription on PG16 (it has caught up; no need to keep pulling)
psql -h pg16-host -c "ALTER SUBSCRIPTION upgrade_sub DISABLE;"

# 5. Update your application's DATABASE_URL to point to PG16
# 6. Take application out of maintenance mode
# 7. Monitor error rates for 10 minutes
# 8. Cleanup when confident
psql -h pg16-host -c "DROP SUBSCRIPTION upgrade_sub;"
psql -h pg12-host -c "DROP PUBLICATION upgrade_pub;"

The application-visible downtime is steps 1 through 6 — typically 1–3 minutes if you have done the preparation. The 14-hour downtime from the pg_dump approach is gone.

Debezium + Kafka CDC Architecture

Debezium reads PostgreSQL's logical replication stream and publishes row-level changes to Kafka topics. Every INSERT, UPDATE, DELETE becomes a Kafka message. Downstream services consume changes without polling the database.

json

// Kafka message for an ORDER UPDATE
{
  "before": { "id": 42, "status": "pending", "amount": 9900 },
  "after":  { "id": 42, "status": "shipped",  "amount": 9900 },
  "op": "u",
  "ts_ms": 1700000000000,
  "source": {
    "db": "myapp",
    "table": "orders",
    "lsn": 12345678,
    "txId": 98765
  }
}

op values: c (create/insert), u (update), d (delete), r (read during snapshot).

Topic design

text

One Kafka topic per source table is the standard pattern.

myapp.public.orders       — non-compacted, full history
myapp.public.products     — compacted (log.cleanup.policy=compact), latest value per product ID
myapp.public.users        — compacted, latest user record per user ID

Compacted topics retain only the latest message per key. Good for reference data (products, users) where you care about the current state, not the full change history. Non-compacted topics retain every change — necessary for event data (orders, transactions) where each change is meaningful.

Debezium connector configuration

sql

-- On PostgreSQL: create the replication slot Debezium will use
-- (Debezium can create this itself, but creating it manually gives you control)
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

-- Grant permissions
CREATE ROLE debezium REPLICATION LOGIN PASSWORD 'secret';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium;

json

{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres-host",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "myapp",
    "database.server.name": "myapp",
    "slot.name": "debezium_slot",
    "plugin.name": "pgoutput",
    "table.include.list": "public.orders,public.products,public.users",
    "heartbeat.interval.ms": "10000",
    "heartbeat.action.query": "INSERT INTO debezium_heartbeat (ts) VALUES (now()) ON CONFLICT (id) DO UPDATE SET ts = now()",
    "publication.name": "debezium_publication",
    "publication.autocreate.mode": "filtered",
    "snapshot.mode": "initial"
  }
}

The heartbeat is critical. If your monitored tables have no changes for hours, the replication slot's restart_lsn does not advance — because there's nothing to consume. The slot lag metric grows even though the connector is healthy and connected. The heartbeat writes a small change to a dedicated heartbeat table on a regular interval, advancing the LSN and keeping the lag metric accurate. Without it, your slot lag alert fires false positives constantly.

sql

-- Create the heartbeat table the connector writes to
CREATE TABLE debezium_heartbeat (
  id   INT PRIMARY KEY DEFAULT 1,
  ts   TIMESTAMPTZ NOT NULL DEFAULT now()
);
INSERT INTO debezium_heartbeat (id, ts) VALUES (1, now());

DDL Changes — The Logical Replication Blindspot

Logical replication does NOT replicate DDL. If you run ALTER TABLE orders ADD COLUMN shipped_at timestamptz on the source, the column does not appear on the destination. Ongoing INSERTs that include shipped_at will fail on the destination because the column doesn't exist there.

ERROR: column "shipped_at" of relation "orders" does not exist

The subscription stops. Debezium stops consuming. Your replication slot lag starts growing.

For the major version upgrade pattern

Apply all DDL to PG16 before enabling the subscription. The pg_dump --schema-only step handles the initial DDL. Schema changes after that point must be applied to PG16 manually before they are applied to PG12.

The workflow during the upgrade period:

bash

# For every schema migration:
# 1. Apply to PG16 first
psql -h pg16-host -d myapp -c "ALTER TABLE orders ADD COLUMN shipped_at timestamptz;"

# 2. Then apply to PG12
psql -h pg12-host -d myapp -c "ALTER TABLE orders ADD COLUMN shipped_at timestamptz;"

# Order matters: destination must have the column before the source starts writing it

For Debezium CDC

Schema changes require a controlled process:

bash

# 1. Pause the Debezium connector
curl -X PUT http://kafka-connect:8083/connectors/orders-connector/pause

# 2. Apply the DDL to the destination (your analytics DB, data warehouse, etc.)
psql -h destination-host -d analytics -c "ALTER TABLE orders ADD COLUMN shipped_at timestamptz;"

# 3. Apply the DDL to the source PostgreSQL
psql -h postgres-host -d myapp -c "ALTER TABLE orders ADD COLUMN shipped_at timestamptz;"

# 4. Resume the connector
curl -X PUT http://kafka-connect:8083/connectors/orders-connector/resume

Some CDC platforms (Confluent Schema Registry, Fivetran) handle schema evolution automatically. If you are managing Debezium manually, this coordination step is yours to own. Missing it means 3am pages from a stopped connector and growing slot lag.

Bi-Directional Replication — The Careful Part

Logical replication supports bi-directional setups (both databases can write, changes replicate in both directions). This enables active-active architectures, blue-green database deployments, and merging databases.

The catch: bi-directional replication requires conflict resolution. If two clients write to the same row on different nodes simultaneously, both changes get replicated to the other node and one of them will fail.

PostgreSQL's logical replication has no built-in conflict resolution. You must:

Design your application to avoid write conflicts (partition writes by region, user, or shard)
Or use a third-party tool that adds conflict resolution (BDR from EDB, Citus, etc.)

For the major version upgrade case, bi-directional is unnecessary — you are replicating in one direction until cutover, then switching completely. Don't add bi-directional complexity unless you have a specific operational reason.

The Replication Slot Disaster, Revisited

The slot lag disaster from Module 4's production incident is worth examining from the logical replication angle. The setup:

PostgreSQL with a logical replication slot for a Debezium connector
Debezium goes offline for 3 days (Kafka Connect cluster maintenance)
No max_slot_wal_keep_size configured
No slot lag alerting

What the monitoring dashboard should have shown:

sql

-- This query should run every 5 minutes and alert when lag_bytes > 5GB
SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size,
  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes,
  now() - pg_walfile_name_offset(restart_lsn)::text::interval AS approx_lag_age
FROM pg_replication_slots
WHERE active = false;  -- inactive slots are the dangerous ones

What the recovery looked like:

sql

-- Drop the inactive slot (after confirming Debezium will resync from scratch)
SELECT pg_drop_replication_slot('debezium_slot');

-- Verify WAL cleanup has started
SELECT pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
-- Should decrease over the next few minutes

Debezium resyncs by running a new snapshot (full table scan of all configured tables) and replaying from there. Depending on table sizes, this takes minutes to hours. During resync, downstream Kafka topics may have duplicate messages if Debezium uses at-least-once delivery — your consumers should be idempotent.

Production Checklist

Before enabling logical replication for CDC or a version upgrade:

text

[ ] wal_level = 'logical' on source (requires restart if not already set)
[ ] max_slot_wal_keep_size configured (10GB is a reasonable starting point)
[ ] Slot lag monitoring alert configured (alert at 5GB, page at 20GB)
[ ] Alert on inactive slots (active = false for > 1 hour)
[ ] All tables in the publication have primary keys or REPLICA IDENTITY FULL
[ ] Replication user has appropriate permissions (REPLICATION role + SELECT on tables)
[ ] Heartbeat table and query configured (for Debezium/CDC connectors)
[ ] Sequence values handled separately for version upgrades (sequences don't replicate)
[ ] DDL coordination process documented (who applies schema changes and in what order)
[ ] Rollback plan documented (how to revert to source if cutover fails)

Logical replication is not plug-and-play. The initial setup is straightforward. The operational discipline — slot monitoring, DDL coordination, sequence management — is where production incidents happen.

Summary

Concept	Key Takeaway
Physical vs. logical	Physical copies WAL bytes (same major version only). Logical decodes row changes (cross-version).
Replication slots	Retain WAL for subscribers. Inactive slots cause disk-filling disasters. Monitor lag.
`max_slot_wal_keep_size`	Cap WAL per slot. Slot gets dropped if it falls behind — subscriber resyncs, disk stays healthy.
Publication/subscription	Source defines what to publish. Destination subscribes and receives initial sync + ongoing changes.
Replica identity	Tables need a PK (or REPLICA IDENTITY FULL) for UPDATE/DELETE to replicate correctly.
Major version upgrade	Logical replication to new version + near-zero lag + brief maintenance window = minutes of downtime, not hours.
Debezium/CDC	Reads the replication stream, publishes to Kafka. Heartbeat keeps slot lag metric accurate.
DDL	Not replicated. Must be applied manually to destination before being applied to source.
Bi-directional	Possible but requires conflict resolution strategy. Avoid unless you have a specific requirement.

Next: P-10 — Zero-Downtime Schema Migrations →

Knowledge Check

A critical production PostgreSQL instance is configured with a logical replication slot for a Debezium connector. Due to an unforeseen issue, the Debezium Kafka Connect worker goes offline for 48 hours. The PostgreSQL instance does *not* have max_slot_wal_keep_size configured. Which of the following is the most likely and severe outcome, and what is the primary architectural mitigation to prevent it?

A development team is performing a schema migration, adding a new non-nullable column with a default value to an orders table that is actively being logically replicated from a PostgreSQL 12 source to a PostgreSQL 16 destination as part of a zero-downtime major version upgrade. The team applies the ALTER TABLE statement to the PostgreSQL 12 source database first, then immediately attempts to apply it to the PostgreSQL 16 destination. What is the most likely immediate consequence, and what is the correct sequence of operations to prevent it?

A Senior SRE has configured a Debezium connector to stream changes from a PostgreSQL database to Kafka using logical replication. They have set up robust monitoring for replication slot lag using pg_replication_slots, with alerts configured to page on-call engineers if lag_size exceeds 5GB. However, they neglected to configure the Debezium heartbeat mechanism. What is the most likely operational challenge this oversight will cause, and why?

Test your knowledge with more question sets

PreviousModule A-15: The Extensions Ecosystem: pg_cron, TimescaleDB, Citus, and pgvector

Discussion

Join the discussion

Loading comments...