PostgreSQL Logical Replication Architecture & Fundamentals

PostgreSQL logical replication operates at the row and transaction level, decoding committed changes from the Write-Ahead Log and replaying them as logical INSERT, UPDATE, and DELETE operations on a subscriber — unlike physical streaming replication, which ships opaque WAL blocks and mirrors the entire cluster byte-for-byte. That distinction is the whole reason this architecture exists: it enables selective table movement, column and row filtering, cross-major-version replication (PG 15 → 17), and delivery into heterogeneous targets such as a Python ETL consumer, Kafka, or a data warehouse. For database engineers, data platform teams, and DevOps practitioners running PostgreSQL 15, 16, or 17 in production, the components below — the decoding subsystem, publications and subscriptions, replication slots, and the LSN bookkeeping that ties them together — are the primitives every reliable change-data-capture pipeline is built from. Get the slot and retention mechanics wrong and an unbounded pg_wal directory will take the primary offline; get them right and the same machinery streams changes with sub-second lag for years.

Logical replication data flow — committed changes are decoded into row-level operations under a publication filter, retained by a replication slot, and applied by the subscriber, which confirms its flushed LSN back to the slot.

Core Architecture: WAL, Logical Decoding, and Slots

Every logical replication stream begins in the Write-Ahead Log. Each committed transaction writes WAL records that describe tuple-level modifications, and the logical decoding subsystem reassembles those physical records back into an ordered, per-transaction change set. Decoding only starts once wal_level = logical is set (a change that requires a restart), because that setting instructs PostgreSQL to log the additional relation and old-tuple metadata that reconstruction needs. The precise batching, spill-to-disk, and reorder-buffer behaviour that governs how these change sets are assembled is detailed in WAL stream mechanics; the summary that matters at the architecture level is that decoding is transactional and in commit order. By default a transaction is not sent until its COMMIT record is decoded, and even when large in-progress transactions are streamed early (streaming enabled, PG 14+) the subscriber does not make them visible until commit. Transactions that commit in the meantime still flow, but a long-running open transaction on the primary holds back restart_lsn, so WAL retention grows for as long as it stays open.
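The commit-order guarantee can be illustrated with a toy model. The sketch below is not PostgreSQL's reorder buffer; it is a minimal Python illustration, with hypothetical names (ToyReorderBuffer, add_change, commit, abort), of how interleaved changes might be grouped per transaction and released only at commit.

```python
from collections import defaultdict


class ToyReorderBuffer:
    """Groups changes by transaction id and releases them at commit.

    Illustrative model only; names and structure are assumptions,
    not PostgreSQL internals.
    """

    def __init__(self):
        self._pending = defaultdict(list)  # xid -> buffered changes
        self.emitted = []                  # (xid, changes) in commit order

    def add_change(self, xid, change):
        # Changes arrive interleaved, in WAL order, and are only buffered.
        self._pending[xid].append(change)

    def commit(self, xid):
        # A transaction is handed to the consumer only at its COMMIT.
        self.emitted.append((xid, self._pending.pop(xid, [])))

    def abort(self, xid):
        # Rolled-back work is discarded and never reaches the consumer.
        self._pending.pop(xid, None)


buf = ToyReorderBuffer()
buf.add_change(100, "INSERT orders id=1")   # long-running transaction starts first
buf.add_change(101, "UPDATE orders id=7")
buf.commit(101)                             # but 101 commits first
buf.add_change(102, "DELETE orders id=3")
buf.abort(102)
buf.add_change(100, "INSERT orders id=2")
buf.commit(100)

commit_order = [xid for xid, _ in buf.emitted]
```

Transaction 100 started first but is delivered last, as one unit, because ordering in this model follows commit rather than start.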

The default output plugin, pgoutput, serializes the decoded change set according to the PostgreSQL logical replication protocol, preserving transaction boundaries, commit LSNs, commit timestamps, and relation (schema) metadata. Third-party plugins such as wal2json and decoderbufs plug into the same decoding callbacks but emit JSON or Protobuf instead — the choice of plugin changes the wire format a consumer must parse, not the underlying decoding guarantees. A Debezium connector and a hand-written Python consumer both sit at this same boundary: they open a replication connection, name a slot, and receive an ordered stream of committed changes.

Three server-side objects define the topology:

The primary (publisher) decodes WAL and streams changes. It carries all decoding CPU and the reorder-buffer memory, bounded by logical_decoding_work_mem (default 64MB per walsender).
The replication slot is the durable position marker. It records how far a specific consumer has confirmed receipt and pins WAL retention accordingly.
The subscriber (or external consumer) applies changes and periodically confirms its flushed LSN back to the slot.

sql

-- Prerequisite GUCs on the publisher (postgresql.conf), then restart:
--   wal_level = logical
--   max_wal_senders = 10        -- one per active stream + headroom
--   max_replication_slots = 10  -- one per slot, plus temporaries

-- Confirm the running configuration before building anything on top of it.
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('wal_level', 'max_wal_senders',
               'max_replication_slots', 'logical_decoding_work_mem');

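Because decoding is per-slot and serial, one high-churn table can throttle everything sharing its slot. PG 14 added streaming of large in-progress transactions (streaming = on), so a transaction that exceeds logical_decoding_work_mem can be sent before it commits rather than spilling to disk on the publisher. PG 16 added streaming = parallel together with max_parallel_apply_workers_per_subscription on the apply side, letting the subscriber apply large streamed transactions in parallel workers instead of staging them in temporary files first. Neither change parallelises decoding inside a single slot; to isolate a hot table from low-churn datasets and avoid head-of-line blocking, give it its own publication and subscription, and therefore its own slot.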

Declarative Configuration Model: Publications & Subscriptions

Logical replication is configured declaratively. A publication on the primary names the set of tables (and, optionally, the columns and rows) exposed to consumers; a subscription on the downstream node names a publication and a connection string and drives an apply worker. Neither side hard-codes the other’s schema, which is what makes fan-out, partial replication, and cross-version topologies possible without cloning the whole database. The full DDL walkthrough lives under creating publications; the design decisions that belong at the architecture level are covered by the publication and subscription models reference.

sql

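-- On the publisher: expose a filtered slice of one table.
-- Because this publication publishes UPDATE and DELETE by default, the
-- columns used in the row filter (status) must be covered by the table's
-- replica identity, and the column list must include the identity columns.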
CREATE PUBLICATION sales_cdc
  FOR TABLE public.orders (id, customer_id, status, updated_at)  -- PG 15+: column list
  WHERE (status <> 'draft');                                     -- PG 15+: row filter

-- On the subscriber: consume it. This implicitly creates a slot named
-- after the subscription on the publisher unless you point it at an
-- existing one with create_slot = false / slot_name = '...'.
CREATE SUBSCRIPTION sales_sub
  CONNECTION 'host=primary.internal port=5432 dbname=app user=repl sslmode=verify-full'
  PUBLICATION sales_cdc
  WITH (copy_data = true, streaming = 'parallel');  -- streaming='parallel' is PG 16+

Two evaluation details drive capacity planning. First, row filters (WHERE) and column lists are evaluated on the publisher during decoding, so aggressive filtering trims network payload but adds decode-time CPU. Second, a publication’s publish option controls which operation types stream (insert, update, delete, truncate), and publish_via_partition_root (PG 13+) determines whether partitioned tables replicate as the root or as individual partitions — a mismatch here is a common cause of “changes silently not arriving.”
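To make the filtering step concrete, here is a small Python model of what a publisher-side filter does to a decoded row before it is sent. The function name publish_row and the dict-based row are assumptions for illustration; in PostgreSQL the real work happens inside the output plugin.

```python
def publish_row(row, column_list, row_filter):
    """Return the row as a consumer would see it, or None if filtered out.

    row:         dict of column name -> value (the decoded tuple)
    column_list: columns named in the publication's column list
    row_filter:  predicate standing in for the publication's WHERE clause
    """
    # In this model the filter is evaluated before columns are trimmed.
    if not row_filter(row):
        return None
    # Only the listed columns leave the publisher.
    return {col: row[col] for col in column_list}


columns = ["id", "customer_id", "status", "updated_at"]


def not_draft(row):
    return row["status"] != "draft"


full_row = {"id": 1, "customer_id": 42, "status": "paid",
            "updated_at": "2024-01-01", "internal_note": "do not ship"}
draft_row = dict(full_row, id=2, status="draft")

sent = publish_row(full_row, columns, not_draft)
skipped = publish_row(draft_row, columns, not_draft)
```

The draft row costs decode CPU but no network bytes, which is the trade-off described above.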

Every replicated UPDATE/DELETE also needs a replica identity so the subscriber can locate the target row. REPLICA IDENTITY DEFAULT uses the primary key. A table with no primary key must be switched to REPLICA IDENTITY USING INDEX (on a unique, non-partial, non-deferrable index over NOT NULL columns) or to REPLICA IDENTITY FULL (which logs the entire old row and inflates WAL volume); otherwise, once the table belongs to a publication that publishes updates or deletes, those statements fail on the publisher itself. External consumers such as a Debezium connector rely on the same identity metadata to build stable message keys.

sql

-- Fail fast: find replicated tables that will break on UPDATE/DELETE.
SELECT n.nspname,
       c.relname,
       c.relreplident   -- 'd'=default(pk), 'f'=full, 'n'=nothing, 'i'=index
FROM pg_publication_tables pt
JOIN pg_namespace n ON n.nspname = pt.schemaname
JOIN pg_class c ON c.relnamespace = n.oid      -- match schema as well as name
               AND c.relname = pt.tablename
WHERE pt.pubname = 'sales_cdc'
  AND (c.relreplident = 'n'                    -- NOTHING: no usable identity
       OR (c.relreplident = 'd'                -- DEFAULT but no primary key
           AND NOT EXISTS (
             SELECT 1 FROM pg_index i
             WHERE i.indrelid = c.oid AND i.indisprimary
           )));
-- Any row returned = an UPDATE/DELETE on that table will error until you
-- set REPLICA IDENTITY FULL / USING INDEX or add a primary key.

State Persistence & Lifecycle: Slots and LSN Bookkeeping

State persistence is the linchpin of reliable CDC, and it lives entirely in the replication slot. A logical slot stores two LSNs that every operator must be able to reason about:

restart_lsn — the oldest WAL position the slot still needs. WAL at or after this point cannot be recycled, no matter how old. This is the value that fills disks.
confirmed_flush_lsn — the position the consumer has durably applied. The gap between the primary’s current LSN and this value is the true end-to-end lag.
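Both values are LSNs, which PostgreSQL prints as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit byte position). A consumer that wants to compute these gaps itself can do the arithmetic as in the following sketch; the helper names and the sample values are illustrative, not taken from a real system.

```python
def parse_lsn(text):
    """Convert an LSN such as '16/B374D848' to an integer byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(newer, older):
    """Byte distance between two LSNs, like pg_wal_lsn_diff(newer, older)."""
    return parse_lsn(newer) - parse_lsn(older)


# Hypothetical values, as they might be read from pg_replication_slots.
current_lsn = "16/B374D848"
restart_lsn = "16/A0000000"
confirmed_flush_lsn = "16/B3000000"

retained_bytes = lsn_diff(current_lsn, restart_lsn)      # WAL pinned by the slot
lag_bytes = lsn_diff(current_lsn, confirmed_flush_lsn)   # not yet confirmed by the consumer
```

Storing LSNs as integers on the consumer side also makes checkpoint comparisons a plain numeric comparison.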

Choosing the right slot — a persistent logical slot for a durable pipeline, a temporary slot for an ephemeral consumer, or a physical slot for a standby — is an operational decision covered in depth under replication slot types, and the safe creation sequence (create slot → snapshot → initial copy → stream) is walked through in initializing replication slots. The failure mode to internalise is simple and severe: an inactive slot never advances restart_lsn, so WAL accumulates without bound until pg_wal fills the volume and the primary refuses to write. A slot for a consumer that has been down for an hour is indistinguishable, from the server’s perspective, from a slot that is simply behind — both pin WAL.

sql

-- Slot health at a glance: retained WAL and liveness in one query.
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS apply_lag
FROM pg_replication_slots
WHERE slot_type = 'logical'
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
-- Alert when retained_wal approaches your pg_wal budget, and page when
-- active = false persists beyond a consumer's expected restart window.

PG 13+ provides a guardrail: max_slot_wal_keep_size caps how much WAL a slot may pin, after which the slot is invalidated (the pipeline breaks and must be rebuilt) rather than letting the whole cluster fill up, a deliberate trade of one pipeline for cluster survival. PG 17 adds failover slots, so a logical slot's position can be synchronised to a physical standby and survive promotion, and it extends pg_replication_slots with invalidation_reason and inactive_since columns, making lifecycle state far easier to alert on. Downstream, consumers must checkpoint their applied LSN and perform idempotent upserts so that resuming from confirmed_flush_lsn after a restart, which re-delivers changes at least once, does not produce duplicate rows.
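The consumer-side half of that contract can be sketched as follows. This is a minimal illustration under stated assumptions: an in-memory SQLite database stands in for the real target, LSNs are plain integers, and the table names, the slot name, and the apply_batch helper are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the real target; names are hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE checkpoint (slot TEXT PRIMARY KEY, lsn INTEGER)")


def last_lsn(slot):
    row = db.execute("SELECT lsn FROM checkpoint WHERE slot = ?", (slot,)).fetchone()
    return row[0] if row else 0


def apply_batch(slot, changes):
    """Apply (lsn, id, status) changes and advance the checkpoint atomically."""
    applied_up_to = last_lsn(slot)
    with db:  # one transaction: rows and checkpoint commit or roll back together
        for lsn, order_id, status in changes:
            if lsn <= applied_up_to:
                continue  # already applied before the restart; skip the re-delivery
            # Keyed on the primary key, so replaying a change cannot add a row.
            db.execute("INSERT OR REPLACE INTO orders (id, status) VALUES (?, ?)",
                       (order_id, status))
            applied_up_to = lsn
        db.execute("INSERT OR REPLACE INTO checkpoint (slot, lsn) VALUES (?, ?)",
                   (slot, applied_up_to))


batch = [(100, 1, "new"), (110, 2, "new"), (120, 1, "paid")]
apply_batch("sales_sub", batch)
apply_batch("sales_sub", batch + [(130, 2, "paid")])  # re-delivery after a restart

rows = db.execute("SELECT id, status FROM orders ORDER BY id").fetchall()
```

Writing the checkpoint in the same transaction as the rows is the design choice that matters: a crash between the two can then never leave the target ahead of, or behind, its recorded LSN.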

Security & Privilege Boundaries

Logical replication has a smaller data-access footprint than physical streaming, which also requires the REPLICATION attribute but ships the entire cluster. Here, the role the subscriber connects as needs the REPLICATION attribute on the publisher plus SELECT on exactly the published tables (used by the initial copy), and nothing more. On the subscriber side, PG 16+ lets a non-superuser create subscriptions through membership in the predefined pg_create_subscription role, given CREATE privilege on the database. The full role model, pg_hba.conf requirements, and TLS posture are enumerated under security boundaries and permissions; the principles that shape the architecture are least privilege, transport encryption, and externalised secrets.

sql

-- Least-privilege replication role: can stream, can read published tables,
-- cannot write, cannot see unpublished schemas.
CREATE ROLE repl WITH LOGIN REPLICATION PASSWORD :'pw';
GRANT USAGE ON SCHEMA public TO repl;
GRANT SELECT ON public.orders TO repl;   -- only what the publication exposes

The replication connection itself must be pinned in pg_hba.conf and forced onto TLS — sslmode=verify-full on the subscriber’s connection string validates both the certificate chain and the hostname, defeating man-in-the-middle interception of the change stream. Prefer SCRAM-SHA-256 or certificate authentication over md5. Critically, credentials for a subscription are stored in the catalog and visible to superusers, so a Python ETL consumer or Debezium connector should never embed long-lived passwords inline; route through a connection proxy or a secrets manager (Vault, AWS/GCP secret stores) so rotation does not require editing catalog objects or restarting streams. In cloud-managed Postgres, layer VPC peering / private endpoints and IAM-integrated database roles on top so replication traffic never traverses a public path.
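For a hand-written consumer, externalising the secret can be as simple as assembling the connection string at start-up from injected environment variables. The sketch below assumes hypothetical variable names (REPL_HOST, REPL_PORT, REPL_DB, REPL_USER, REPL_PASSWORD) and only builds a libpq keyword/value string; it does not open a connection.

```python
import os


def quote_conninfo_value(value):
    """Quote a value for a libpq keyword/value connection string."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_conninfo(env=os.environ):
    """Assemble a conninfo string so the password never lives in source control.

    The REPL_* names are hypothetical; a secrets manager or orchestrator
    would inject them when the consumer starts.
    """
    params = {
        "host": env["REPL_HOST"],
        "port": env.get("REPL_PORT", "5432"),
        "dbname": env["REPL_DB"],
        "user": env["REPL_USER"],
        "password": env["REPL_PASSWORD"],
        "sslmode": "verify-full",   # validate the certificate chain and hostname
    }
    return " ".join(f"{k}={quote_conninfo_value(v)}" for k, v in params.items())


# Example with a fake environment; a missing variable raises KeyError early.
conninfo = build_conninfo({
    "REPL_HOST": "primary.internal",
    "REPL_DB": "app",
    "REPL_USER": "repl",
    "REPL_PASSWORD": "s3cr't pw",
})
```

Rotating the password then means updating the secret store and restarting the consumer, with no edit to code or catalog objects.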

Observability & Diagnostics

You cannot operate this architecture on faith; every component exposes a system view, and each has a threshold worth alerting on. The four that matter most:

System view	What it tells you	Alert threshold
`pg_replication_slots`	Retained WAL (`restart_lsn`), liveness (`active`)	`retained_wal` > 50% of `pg_wal` budget, or `active = false` beyond restart window
`pg_stat_replication`	Per-connection send/write/flush/replay LSN lag	`replay_lag` > 30 s sustained
`pg_stat_subscription`	Subscriber apply position and last message time	`last_msg_receipt_time` stale > 60 s
`pg_subscription_rel` (catalog)	Per-table initial sync state (`srsubstate`)	table not in state `r` (ready) beyond the expected copy window

sql

-- Send-to-replay lag per streaming client, in bytes and time.
SELECT client_addr,
       state,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
       ) AS replay_lag_bytes,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

Building this into a standing dashboard and alert set — Prometheus postgres_exporter scrapes, Grafana panels, and alert rules on the thresholds above — is the subject of asynchronous monitoring integration. The non-negotiable rule: restart_lsn growth and slot liveness must be alerted independently of apply lag, because a paused-but-alive slot and a dead slot both silently retain WAL while a lag-only dashboard shows nothing wrong until the disk fills.
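As a rough illustration of that rule, the sketch below evaluates retained WAL and liveness as two separate checks, using the thresholds from the table above. The SlotSample fields, the budget, and the restart window are assumptions; in practice the inputs would come from pg_replication_slots samples collected by the monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class SlotSample:
    slot_name: str
    active: bool
    retained_wal_bytes: int      # current LSN minus restart_lsn
    inactive_seconds: float      # how long active has been false (0 if active)


def classify(slot, wal_budget_bytes, restart_window_seconds):
    """Return the alerts a slot should raise; the two checks are independent."""
    alerts = []
    if slot.retained_wal_bytes > 0.5 * wal_budget_bytes:
        alerts.append("retained_wal")
    if not slot.active and slot.inactive_seconds > restart_window_seconds:
        alerts.append("slot_inactive")
    return alerts


GIB = 1024 ** 3
budget = 100 * GIB       # assumed pg_wal budget
window = 300             # assumed consumer restart window, in seconds

healthy = SlotSample("sales_sub", True, 2 * GIB, 0)
paused = SlotSample("etl_slot", True, 60 * GIB, 0)        # alive but not advancing
dead = SlotSample("old_consumer", False, 10 * GIB, 3600)  # consumer is gone

results = {s.slot_name: classify(s, budget, window) for s in (healthy, paused, dead)}
```

The paused slot trips only the retained-WAL check and the dead slot trips only the liveness check, which is why neither signal can substitute for the other.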

Resilience Patterns & Failure Modes

Production streams must survive network jitter, consumer restarts, primary failover, and schema evolution. The recurring failure signatures and their remedies:

WAL exhaustion from a stalled slot. A consumer dies, restart_lsn freezes, pg_wal grows until the volume is full and the primary stops accepting writes. Remedy: bound every slot with max_slot_wal_keep_size (PG 13+), alert on retained WAL early, and treat a slot that is active = false past its restart budget as a page, not a warning.
Schema drift. DDL is not replicated by logical replication. Add a nullable or defaulted column on the subscriber that the publisher lacks and applies proceed; add a column on the publisher without a matching one downstream and the apply worker errors and halts as soon as a change for that table arrives. Remedy: apply additive DDL downstream-first, detect drift by comparing pg_attribute/pg_class across nodes, and gate migrations in CI.
Replica-identity mismatch. An UPDATE/DELETE on a published table with REPLICA IDENTITY NOTHING (or DEFAULT with no PK) fails on the publisher when the statement runs. Remedy: the pre-flight query in the configuration section above; fix before enabling the publication.
Long-running publisher transactions. Because a transaction is decoded from its first WAL record and delivered at COMMIT, a multi-hour transaction pins restart_lsn while it is open and, once it commits, delays everything queued behind it on the slot. Remedy: monitor pg_stat_activity for old xact_start; keep publisher transactions short.
Failover. Logical slots are not carried over by physical replication, so promoting a standby historically left the new primary without them. Remedy: PG 17+ failover slots synchronise slot position to the standby; otherwise recreate the slot and reconcile with the consumer's last checkpoint on the new primary.
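The drift check mentioned under schema drift can be prototyped without touching the database layer. The sketch below assumes the column sets have already been collected from each node (for example from information_schema.columns); the function name find_drift and the sample tables are hypothetical.

```python
def find_drift(publisher_cols, subscriber_cols):
    """Compare per-table column sets from two nodes.

    Both arguments map 'schema.table' -> set of column names. Collecting
    them from each node is left out of this sketch.
    """
    report = {}
    for table, pub in publisher_cols.items():
        sub = subscriber_cols.get(table)
        if sub is None:
            report[table] = {"missing_table": True}
            continue
        missing = sorted(pub - sub)   # published but absent downstream: breaks apply
        extra = sorted(sub - pub)     # subscriber-only: fine if nullable or defaulted
        if missing or extra:
            report[table] = {"missing_on_subscriber": missing,
                             "extra_on_subscriber": extra}
    return report


publisher = {"public.orders": {"id", "customer_id", "status", "updated_at", "channel"}}
subscriber = {"public.orders": {"id", "customer_id", "status", "updated_at", "loaded_at"}}

drift = find_drift(publisher, subscriber)
```

For a publication with a column list, pass only the published columns on the publisher side, since unpublished columns need not exist downstream. A CI gate could fail the migration whenever missing_on_subscriber is non-empty.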

On the consumer side, resilience is code: exponential backoff on connection resets, transactional batch commits keyed to LSN, idempotent upserts on the target’s primary key, and a dead-letter path for un-parseable payloads. Initial synchronisation and re-sync after a break are their own procedure — see subscription synchronisation procedures — since the copy_data snapshot phase and the catch-up streaming phase have distinct failure characteristics.
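The backoff piece of that list might look like the following sketch. It uses capped exponential delays with full jitter; connect_and_stream is a hypothetical callable standing in for whatever opens the replication connection, and sleep is injected so the example can run without waiting.

```python
import random


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=random.random):
    """Yield sleep intervals: exponential growth, capped, with full jitter."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def run_with_retries(connect_and_stream, sleep, **backoff_args):
    """Call connect_and_stream until it returns or the attempts run out.

    connect_and_stream is a hypothetical callable that raises
    ConnectionError on a reset.
    """
    last_error = None
    for delay in backoff_delays(**backoff_args):
        try:
            return connect_and_stream()
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)
    raise last_error


calls = {"n": 0}


def flaky():
    # Fails twice with a reset, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection reset")
    return "streaming"


slept = []
outcome = run_with_retries(flaky, slept.append, rng=lambda: 1.0)
```

In a real consumer, sleep would be time.sleep and the default random jitter would stay in place so that many consumers restarting together do not reconnect in lockstep.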

Slot lifecycle — an inactive slot keeps pinning WAL (its restart_lsn frozen) exactly like a lagging one, so breaching max_slot_wal_keep_size invalidates it and forces a rebuild; alert on liveness and retained WAL, not just apply lag.

Conclusion

PostgreSQL logical replication is not merely a data-movement feature; it is a small distributed system whose correctness depends on operational discipline as much as on DDL. Align the decoding configuration (wal_level = logical, adequate max_wal_senders/max_replication_slots) with a declarative publication design that respects replica identity, treat every replication slot as a live claim on WAL that must be monitored on both restart_lsn and liveness, enforce least-privilege TLS-pinned access, and write consumers that checkpoint and upsert idempotently. Lean on the version-specific gains: max_slot_wal_keep_size (PG 13+) to protect the primary, parallel apply (PG 16+) to keep large transactions from stalling the apply side, failover slots (PG 17+) to survive promotion, and the extra slot-state columns in pg_replication_slots (PG 17) to close the observability gap. Do that, and the same architecture that powers a two-node database mirror scales cleanly into a fan-out CDC platform feeding Kafka, Avro, and Python ETL consumers.

Related guides

WAL stream mechanics — how decoding batches, buffers, and spills change sets.
Publication and subscription models — topology, filtering, and fan-out design.
Replication slot types — persistent, temporary, and physical slots compared.
Security boundaries & permissions — roles, TLS, and secrets in context.
Logical replication setup & management — the hands-on companion to this reference.
CDC pipeline implementation with Python & Debezium — building consumers on top of this architecture.
