Tuning synchronous_commit for logical replication

In production Change Data Capture (CDC) pipelines, synchronouscommit dictates the precise durability boundary at which the primary PostgreSQL instance…

In production Change Data Capture (CDC) pipelines, synchronous_commit dictates the precise durability boundary at which the primary PostgreSQL instance acknowledges a transaction commit. For logical replication workloads, this parameter directly governs commit latency, WAL generation velocity, and downstream consumer backpressure characteristics. Misalignment between synchronous_commit and logical decoding throughput routinely manifests as replication slot lag, WAL retention exhaustion, or Python ETL consumer timeout cascades. This document isolates the operational procedure for tuning synchronous_commit specifically for logical replication, providing exact parameter mappings, diagnostic queries, and safe deployment sequences validated against PostgreSQL 14–17.

Commit Semantics and Logical Decoding Throughput

The logical decoding subsystem reads WAL records asynchronously via the pgoutput plugin, but the client-facing acknowledgment path remains strictly bound by the primary’s synchronous_commit setting. The parameter does not alter how the logical worker decodes data; it alters when the producer application receives COMMIT OK.

Value Durability Guarantee Latency Impact Logical Replication Behavior
on (default) WAL flushed to persistent storage via fsync() before commit acknowledgment 1–5 ms/tx Zero committed data loss on primary crash. Optimal for financial/audit workloads.
local WAL written to OS page cache; fsync() deferred to background writer 0.1–0.5 ms/tx Accepts up to wal_writer_delay (default 200 ms) of data loss on unclean shutdown. Ideal for high-throughput CDC where downstream consumers implement idempotent replay.
remote_write Primary waits for WAL receiver on a synchronous standby to acknowledge network receipt 1–3 ms + RTT Guarantees at least one standby has WAL in OS cache. Rarely used for pure logical replication unless paired with physical streaming.
off WAL handed to OS; no explicit sync or network wait <0.1 ms/tx Strictly contraindicated for production logical replication. Unpredictable commit ordering and potential wal_writer_delay loss during crashes.

When configuring the initial topology, engineers must align publication scope with commit behavior. During Creating Publications, restrict FOR TABLE clauses to actively mutating relations. Publishing high-churn system catalogs or append-only logs under synchronous_commit = local will saturate the WAL writer, trigger wal_keep_size exhaustion, and cause consumer desynchronization.

Diagnostic Patterns and Slot Lag Thresholds

Improper tuning follows predictable diagnostic signatures. The most frequent failure mode is rapid WAL accumulation paired with slot inactivity, yielding ERROR: replication slot "cdc_slot" is not active or FATAL: too many replication slots. This occurs when the logical replication worker cannot decode and transmit WAL records faster than the primary generates them under lowered commit durability, causing restart_lsn to fall behind wal_level=logical retention boundaries.

Execute the following diagnostic sequence at 15-second intervals during peak load:

sql
SELECT
  slot_name,
  plugin,
  active,
  restart_lsn,
  confirmed_flush_lsn,
  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS wal_retention_gap_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS apply_lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';

Operational Thresholds:

  • apply_lag_bytes > 256 MB: Consumer serialization or network I/O bottleneck. Investigate Python ETL batch sizing or pgoutput filter overhead.
  • wal_retention_gap_bytes > 1 GB: WAL generation outpaces archive/consumer consumption. Increase max_wal_size or throttle producer write rate.
  • active = false for >300s: Worker crash or subscription disconnect. Verify Initializing Replication Slots state and network keepalives.

Safe Deployment Sequence

Never modify synchronous_commit globally without validating downstream idempotency. Follow this zero-downtime rollout:

1. Baseline measurement — Record current commit_latency_ms (via pg_stat_statements or APM) and apply_lag_bytes under synchronous_commit = on.

2. Session-level validation — Apply SET synchronous_commit = local; to a single ETL connection. Run a representative write batch. Verify downstream consumers process duplicate/missing records without error.

3. Cluster-wide application

sql
ALTER SYSTEM SET synchronous_commit = local;
SELECT pg_reload_conf();

The change takes effect immediately for new transactions. Existing connections retain their session value until reconnect.

4. Backpressure monitoring — Track pg_stat_replication.write_lag and consumer queue depth. If Python ETL connection pools exhaust or psycopg/asyncpg raise OperationalError: server closed the connection unexpectedly, revert immediately:

sql
ALTER SYSTEM RESET synchronous_commit;
SELECT pg_reload_conf();

Pipeline Integration: ETL, Retry Logic, and Failover

Logical replication operates independently of physical standby promotion, but synchronous_commit tuning directly impacts Subscription Sync Procedures and emergency recovery paths.

Python ETL and Error Handling

When synchronous_commit = local, the primary acknowledges commits before disk persistence. Python consumers must implement:

  • Idempotent Upserts: Use INSERT ... ON CONFLICT DO UPDATE or MERGE-equivalent patterns with deterministic primary keys.
  • Exponential Backoff Retry Logic: Wrap transaction fetches in a retry loop with jitter. On psycopg.errors.ConnectionFailure, reconnect and resume from the last acknowledged confirmed_flush_lsn.
  • Async Monitoring Integration: Export apply_lag_bytes and wal_retention_gap_bytes to Prometheus via pg_exporter. Alert when lag exceeds 500 MB for >2 minutes.

Emergency Failover Procedures

Logical replication slots are not automatically promoted during primary failover. If the primary crashes with synchronous_commit = local, unflushed WAL records are lost. The downstream consumer must:

  1. Query the new primary for SELECT pg_current_wal_lsn();
  2. Compare against the last successfully applied LSN in the ETL state table.
  3. If a gap exists, trigger a full snapshot sync or recreate the subscription with CREATE SUBSCRIPTION ... WITH (copy_data = true) to reconcile state. Document this procedure in your runbook to prevent silent data divergence.

For authoritative parameter behavior and plugin architecture, consult the official PostgreSQL WAL Configuration and Logical Replication Architecture documentation.