Module A-16·30 min read

Logical vs physical replication, replication slots and the slot bloat disaster, publication/subscription model, WAL sender process, Debezium + Kafka change data capture, zero-downtime major version upgrades using logical replication as a migration bridge, and replication identity modes.

Module 17 — Logical Replication and CDC Pipelines

We needed to upgrade from PostgreSQL 12 to PostgreSQL 16. Physical streaming replication doesn't cross major versions. The naive approach: pg_dump, restore, brief downtime. Our database was 8TB. The dump alone took 4 hours. The restore took 6 hours. Then indexing. Total downtime: 14 hours. We told the team we could do it in 2 hours. We were wrong. Logical replication, done correctly, would have given us 0 seconds of downtime.


Physical Replication vs Logical Replication

Physical (Streaming)Logical
Unit of transferWAL bytes (block-level)Logical row changes (decoded)
Cross-version supportSame major version onlyWorks across major versions
Selective tablesEntire clusterPer-table granularity
DDL replicated automaticallyVia WAL blocks onlyNot automatically
Bi-directionalNoYes, carefully
Primary use caseHA, hot standby, read replicasCDC, version upgrades, ETL, Kafka

Physical replication copies WAL bytes verbatim — the standby replays the exact same block writes. It cannot cross major versions because the on-disk format changes between them. Logical replication decodes the WAL into row-level changes (INSERT/UPDATE/DELETE), which are version-agnostic. That version-agnosticism is the property that makes the zero-downtime upgrade possible.


Replication Slots — The Critical Operational Detail

Replication slots are PostgreSQL's mechanism for ensuring the WAL needed by a subscriber is not discarded before the subscriber reads it. This sounds helpful. It is helpful until the subscriber goes away.

The disaster scenario: You create a logical replication slot. The subscriber is a Debezium connector that reads changes and publishes to Kafka. The Debezium Kafka Connect worker goes down for a maintenance window. Three days pass. PostgreSQL has been faithfully retaining WAL since the slot was last consumed. WAL accumulates on disk. Your 500GB disk fills up. PostgreSQL cannot write new WAL. PostgreSQL stops accepting writes. Your entire application is down. Every application team is paging the on-call. This scenario is not hypothetical — it has killed production systems.

Monitoring slot lag — make this a production alert, not an afterthought:

sql
SELECT slot_name, slot_type, active, pg_size_pretty( pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) ) AS lag_size, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes FROM pg_replication_slots ORDER BY lag_bytes DESC NULLS LAST;

Alert when any slot's lag_size exceeds 5GB. Page someone when it exceeds 20GB.

Cap WAL retention per slot using max_slot_wal_keep_size (PostgreSQL 13+):

ini
# postgresql.conf max_slot_wal_keep_size = '10GB' # Slots that fall more than 10GB behind are dropped automatically. # The subscriber must resync from scratch — painful but preferable # to the database going down.

When a slot is dropped due to the size limit, the slot's active column goes to false and pg_replication_slots shows the slot as invalidated. The subscriber gets an error on its next attempt to read, knows it needs to resync, and can do so. Your disk stays healthy. Set the alert lower than the limit so you have time to act before the slot is dropped.


Publication and Subscription

On the source database:

sql
-- Create a publication for specific tables CREATE PUBLICATION my_publication FOR TABLE orders, products, users WITH (publish = 'insert, update, delete'); -- Or publish all tables in the schema CREATE PUBLICATION all_tables FOR ALL TABLES; -- Check what's published SELECT * FROM pg_publication_tables;

On the destination database:

sql
-- Subscribe to the publication -- PostgreSQL handles initial data sync automatically CREATE SUBSCRIPTION my_subscription CONNECTION 'host=source-db.example.com port=5432 dbname=myapp user=replicator password=secret' PUBLICATION my_publication;

PostgreSQL handles the initial data sync (copying existing rows) and then streams ongoing changes. No application changes required during initial sync.

Monitoring subscription progress:

sql
-- On the destination: check sync state SELECT subname, received_lsn, latest_end_lsn, pg_size_pretty( pg_wal_lsn_diff(latest_end_lsn, received_lsn) ) AS replication_lag FROM pg_stat_subscription; -- When received_lsn approaches latest_end_lsn, lag is near-zero.
sql
-- Initial table sync status SELECT subname, relname, srsubstate, -- 'r' = ready, 'i' = initial sync in progress, 'd' = data sync srsublsn FROM pg_subscription_rel JOIN pg_class ON relid = pg_class.oid JOIN pg_subscription ON srsubid = pg_subscription.oid;

Replication Identity — The Row Identification Problem

Logical replication needs to identify which row to UPDATE or DELETE on the destination. Without a way to find the target row, UPDATE and DELETE cannot be applied.

sql
-- DEFAULT: uses the PRIMARY KEY (the right choice for most tables) ALTER TABLE orders REPLICA IDENTITY DEFAULT; -- FULL: includes all column values before the change in the WAL record. -- Works for tables without a primary key, but makes WAL significantly larger. -- Every UPDATE writes the entire old row to WAL. ALTER TABLE orders REPLICA IDENTITY FULL; -- USING INDEX: use a unique index instead of the PK ALTER TABLE orders REPLICA IDENTITY USING INDEX idx_orders_external_id; -- NOTHING: no old row data in WAL records. -- UPDATEs and DELETEs cannot be applied to the destination. -- Only use if you're replicating INSERT-only tables. ALTER TABLE orders REPLICA IDENTITY NOTHING;

Tables without a primary key and without REPLICA IDENTITY FULL will fail to replicate UPDATEs and DELETEs:

ERROR: logical replication target relation "public.orders" has
replica identity "d" (default), but the primary key is not present

Before setting up logical replication, audit all tables in the publication for missing primary keys:

sql
SELECT t.tablename, pk.constraint_type FROM pg_tables t LEFT JOIN information_schema.table_constraints pk ON pk.table_name = t.tablename AND pk.constraint_type = 'PRIMARY KEY' WHERE t.schemaname = 'public' AND pk.constraint_type IS NULL; -- Tables without a PK will be problematic for UPDATE/DELETE replication

Zero-Downtime Major Version Upgrade

The playbook for upgrading PostgreSQL 12 to PostgreSQL 16 with under 5 minutes of write downtime.

Step 1: Prepare the source (PG12)

sql
-- Enable logical WAL on PG12 ALTER SYSTEM SET wal_level = 'logical'; SELECT pg_reload_conf(); -- Requires restart if wal_level was previously 'replica' or 'minimal' -- Create a dedicated replication user CREATE ROLE replicator REPLICATION LOGIN PASSWORD 'use-a-real-secret'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO replicator; GRANT USAGE ON SCHEMA public TO replicator; -- Create the publication CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

Step 2: Copy the schema to PG16

bash
# Dump schema only (no data) from PG12 pg_dump --schema-only --no-owner --no-acl \ -h pg12-host -U postgres myapp > schema.sql # Restore schema on PG16 psql -h pg16-host -U postgres myapp < schema.sql

DDL must exist on the destination before you create the subscription — otherwise the initial sync fails when it tries to copy data into tables that don't exist.

Step 3: Create the subscription on PG16

sql
-- On PG16 CREATE SUBSCRIPTION upgrade_sub CONNECTION 'host=pg12-host port=5432 dbname=myapp user=replicator password=use-a-real-secret' PUBLICATION upgrade_pub;

PostgreSQL begins copying existing rows immediately. For an 8TB database, this takes hours. The source database continues serving traffic normally. All changes made during the sync are captured and will be applied after the initial copy completes.

Step 4: Monitor lag until near-zero

sql
-- Poll this on PG16 until lag is consistently below 1MB SELECT subname, received_lsn, latest_end_lsn, pg_size_pretty(pg_wal_lsn_diff(latest_end_lsn, received_lsn)) AS lag FROM pg_stat_subscription;

When lag is below 1MB, you are ready for the cutover window.

Step 5: The cutover (aim for under 5 minutes)

bash
# 1. Enable maintenance mode in your application (return 503, queue or reject writes) # 2. Wait for replication lag to reach 0 bytes # 3. Verify sequence values on PG16 match PG12 (sequences are NOT replicated)
sql
-- On PG12: check current sequence values SELECT sequence_name, last_value FROM information_schema.sequences WHERE sequence_schema = 'public'; -- On PG16: fast-forward sequences to match -- (add headroom to avoid collision on first write) SELECT setval('orders_id_seq', (SELECT last_value FROM orders_id_seq) + 10000);
bash
# 4. Disable the subscription on PG16 (it has caught up; no need to keep pulling) psql -h pg16-host -c "ALTER SUBSCRIPTION upgrade_sub DISABLE;" # 5. Update your application's DATABASE_URL to point to PG16 # 6. Take application out of maintenance mode # 7. Monitor error rates for 10 minutes # 8. Cleanup when confident psql -h pg16-host -c "DROP SUBSCRIPTION upgrade_sub;" psql -h pg12-host -c "DROP PUBLICATION upgrade_pub;"

The application-visible downtime is steps 1 through 6 — typically 1–3 minutes if you have done the preparation. The 14-hour downtime from the pg_dump approach is gone.


Debezium + Kafka CDC Architecture

Debezium reads PostgreSQL's logical replication stream and publishes row-level changes to Kafka topics. Every INSERT, UPDATE, DELETE becomes a Kafka message. Downstream services consume changes without polling the database.

json
// Kafka message for an ORDER UPDATE { "before": { "id": 42, "status": "pending", "amount": 9900 }, "after": { "id": 42, "status": "shipped", "amount": 9900 }, "op": "u", "ts_ms": 1700000000000, "source": { "db": "myapp", "table": "orders", "lsn": 12345678, "txId": 98765 } }

op values: c (create/insert), u (update), d (delete), r (read during snapshot).

Topic design

One Kafka topic per source table is the standard pattern.

myapp.public.orders       — non-compacted, full history
myapp.public.products     — compacted (log.cleanup.policy=compact), latest value per product ID
myapp.public.users        — compacted, latest user record per user ID

Compacted topics retain only the latest message per key. Good for reference data (products, users) where you care about the current state, not the full change history. Non-compacted topics retain every change — necessary for event data (orders, transactions) where each change is meaningful.

Debezium connector configuration

sql
-- On PostgreSQL: create the replication slot Debezium will use -- (Debezium can create this itself, but creating it manually gives you control) SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput'); -- Grant permissions CREATE ROLE debezium REPLICATION LOGIN PASSWORD 'secret'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium;
json
{ "name": "orders-connector", "config": { "connector.class": "io.debezium.connector.postgresql.PostgresConnector", "database.hostname": "postgres-host", "database.port": "5432", "database.user": "debezium", "database.password": "secret", "database.dbname": "myapp", "database.server.name": "myapp", "slot.name": "debezium_slot", "plugin.name": "pgoutput", "table.include.list": "public.orders,public.products,public.users", "heartbeat.interval.ms": "10000", "heartbeat.action.query": "INSERT INTO debezium_heartbeat (ts) VALUES (now()) ON CONFLICT (id) DO UPDATE SET ts = now()", "publication.name": "debezium_publication", "publication.autocreate.mode": "filtered", "snapshot.mode": "initial" } }

The heartbeat is critical. If your monitored tables have no changes for hours, the replication slot's restart_lsn does not advance — because there's nothing to consume. The slot lag metric grows even though the connector is healthy and connected. The heartbeat writes a small change to a dedicated heartbeat table on a regular interval, advancing the LSN and keeping the lag metric accurate. Without it, your slot lag alert fires false positives constantly.

sql
-- Create the heartbeat table the connector writes to CREATE TABLE debezium_heartbeat ( id INT PRIMARY KEY DEFAULT 1, ts TIMESTAMPTZ NOT NULL DEFAULT now() ); INSERT INTO debezium_heartbeat (id, ts) VALUES (1, now());

DDL Changes — The Logical Replication Blindspot

Logical replication does NOT replicate DDL. If you run ALTER TABLE orders ADD COLUMN shipped_at timestamptz on the source, the column does not appear on the destination. Ongoing INSERTs that include shipped_at will fail on the destination because the column doesn't exist there.

ERROR: column "shipped_at" of relation "orders" does not exist

The subscription stops. Debezium stops consuming. Your replication slot lag starts growing.

For the major version upgrade pattern

Apply all DDL to PG16 before enabling the subscription. The pg_dump --schema-only step handles the initial DDL. Schema changes after that point must be applied to PG16 manually before they are applied to PG12.

The workflow during the upgrade period:

bash
# For every schema migration: # 1. Apply to PG16 first psql -h pg16-host -d myapp -c "ALTER TABLE orders ADD COLUMN shipped_at timestamptz;" # 2. Then apply to PG12 psql -h pg12-host -d myapp -c "ALTER TABLE orders ADD COLUMN shipped_at timestamptz;" # Order matters: destination must have the column before the source starts writing it

For Debezium CDC

Schema changes require a controlled process:

bash
# 1. Pause the Debezium connector curl -X PUT http://kafka-connect:8083/connectors/orders-connector/pause # 2. Apply the DDL to the destination (your analytics DB, data warehouse, etc.) psql -h destination-host -d analytics -c "ALTER TABLE orders ADD COLUMN shipped_at timestamptz;" # 3. Apply the DDL to the source PostgreSQL psql -h postgres-host -d myapp -c "ALTER TABLE orders ADD COLUMN shipped_at timestamptz;" # 4. Resume the connector curl -X PUT http://kafka-connect:8083/connectors/orders-connector/resume

Some CDC platforms (Confluent Schema Registry, Fivetran) handle schema evolution automatically. If you are managing Debezium manually, this coordination step is yours to own. Missing it means 3am pages from a stopped connector and growing slot lag.


Bi-Directional Replication — The Careful Part

Logical replication supports bi-directional setups (both databases can write, changes replicate in both directions). This enables active-active architectures, blue-green database deployments, and merging databases.

The catch: bi-directional replication requires conflict resolution. If two clients write to the same row on different nodes simultaneously, both changes get replicated to the other node and one of them will fail.

PostgreSQL's logical replication has no built-in conflict resolution. You must:

  1. Design your application to avoid write conflicts (partition writes by region, user, or shard)
  2. Or use a third-party tool that adds conflict resolution (BDR from EDB, Citus, etc.)

For the major version upgrade case, bi-directional is unnecessary — you are replicating in one direction until cutover, then switching completely. Don't add bi-directional complexity unless you have a specific operational reason.


The Replication Slot Disaster, Revisited

The slot lag disaster from Module 4's production incident is worth examining from the logical replication angle. The setup:

  • PostgreSQL with a logical replication slot for a Debezium connector
  • Debezium goes offline for 3 days (Kafka Connect cluster maintenance)
  • No max_slot_wal_keep_size configured
  • No slot lag alerting

What the monitoring dashboard should have shown:

sql
-- This query should run every 5 minutes and alert when lag_bytes > 5GB SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes, now() - pg_walfile_name_offset(restart_lsn)::text::interval AS approx_lag_age FROM pg_replication_slots WHERE active = false; -- inactive slots are the dangerous ones

What the recovery looked like:

sql
-- Drop the inactive slot (after confirming Debezium will resync from scratch) SELECT pg_drop_replication_slot('debezium_slot'); -- Verify WAL cleanup has started SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir(); -- Should decrease over the next few minutes

Debezium resyncs by running a new snapshot (full table scan of all configured tables) and replaying from there. Depending on table sizes, this takes minutes to hours. During resync, downstream Kafka topics may have duplicate messages if Debezium uses at-least-once delivery — your consumers should be idempotent.


Production Checklist

Before enabling logical replication for CDC or a version upgrade:

[ ] wal_level = 'logical' on source (requires restart if not already set)
[ ] max_slot_wal_keep_size configured (10GB is a reasonable starting point)
[ ] Slot lag monitoring alert configured (alert at 5GB, page at 20GB)
[ ] Alert on inactive slots (active = false for > 1 hour)
[ ] All tables in the publication have primary keys or REPLICA IDENTITY FULL
[ ] Replication user has appropriate permissions (REPLICATION role + SELECT on tables)
[ ] Heartbeat table and query configured (for Debezium/CDC connectors)
[ ] Sequence values handled separately for version upgrades (sequences don't replicate)
[ ] DDL coordination process documented (who applies schema changes and in what order)
[ ] Rollback plan documented (how to revert to source if cutover fails)

Logical replication is not plug-and-play. The initial setup is straightforward. The operational discipline — slot monitoring, DDL coordination, sequence management — is where production incidents happen.


Summary

ConceptKey Takeaway
Physical vs. logicalPhysical copies WAL bytes (same major version only). Logical decodes row changes (cross-version).
Replication slotsRetain WAL for subscribers. Inactive slots cause disk-filling disasters. Monitor lag.
max_slot_wal_keep_sizeCap WAL per slot. Slot gets dropped if it falls behind — subscriber resyncs, disk stays healthy.
Publication/subscriptionSource defines what to publish. Destination subscribes and receives initial sync + ongoing changes.
Replica identityTables need a PK (or REPLICA IDENTITY FULL) for UPDATE/DELETE to replicate correctly.
Major version upgradeLogical replication to new version + near-zero lag + brief maintenance window = minutes of downtime, not hours.
Debezium/CDCReads the replication stream, publishes to Kafka. Heartbeat keeps slot lag metric accurate.
DDLNot replicated. Must be applied manually to destination before being applied to source.
Bi-directionalPossible but requires conflict resolution strategy. Avoid unless you have a specific requirement.

Next: P-10 — Zero-Downtime Schema Migrations →

© 2026 Jatin Jain Saraf (JJS). All rights reserved.