Tuning synchronous_commit for logical replication

synchronous_commit sets the durability point at which the primary acknowledges a COMMIT, and in a logical replication topology that single publisher-side knob decides how much commit latency your producers pay and how large a data-loss window every downstream Change Data Capture (CDC) consumer must be built to tolerate. Set it wrong and the failure is deferred, not avoided: it surfaces later as replication slot lag, WAL retention exhaustion in pg_wal, or, after a failover, a Python ETL consumer holding changes that the promoted server never received.

This page isolates the operational procedure for tuning synchronous_commit for logical decoding workloads specifically — the parameter-to-behavior mapping, the diagnostic queries that catch a bad setting before it fills the disk, and a zero-downtime rollout with an exact revert. It assumes you have already worked through creating publications and provisioned the durable WAL cursor described in initializing replication slots. All behavior is validated against PostgreSQL 14 through 17.

synchronous_commit only moves where the producer's COMMIT OK returns along the durability path; the pgoutput decoder reads already-committed WAL and advances the slot on its own schedule.

Commit Semantics and Logical Decoding Throughput

The logical decoding subsystem reads committed WAL asynchronously through the pgoutput plugin, and synchronous_commit does not change how it decodes — it changes when the producing application receives COMMIT OK. The decoder only ever sees records that already reached WAL, so the parameter’s real effect on a CDC pipeline is indirect: it governs producer commit latency, WAL generation velocity, and therefore the rate the slot’s restart_lsn must advance to keep up. Understanding the underlying cursor arithmetic here depends on the WAL stream mechanics that move restart_lsn and confirmed_flush_lsn forward.

Value	Durability guarantee	Latency impact	Logical-replication behavior
`on` (default)	Local WAL flushed to durable storage before commit is acknowledged; if `synchronous_standby_names` is set, also waits for the synchronous standby to flush	1–5 ms/tx	No loss of acknowledged commits on a primary crash. Correct default for financial and audit CDC streams where the consumer treats every emitted change as durable.
`local`	Local WAL flushed to durable storage before commit is acknowledged; never waits for a standby	Local flush only; no standby round trip	Acknowledged commits survive a primary crash but may be missing from a standby at failover. Identical to `on` when `synchronous_standby_names` is empty. Suitable for high-throughput CDC where consumers implement idempotent replay.
`remote_write`	Local flush, plus a synchronous standby has received the WAL and written it to its OS cache (not yet flushed)	1–3 ms + RTT	Survives a crash of the standby's PostgreSQL process but not of its operating system. Rarely used for pure logical replication unless paired with physical streaming.
`remote_apply`	Local flush, plus a synchronous standby has flushed and applied (replayed) the commit	RTT + apply time	Strongest guarantee; only relevant when a physical standby also fronts the logical publisher. Highest latency, so a poor fit for throughput-oriented write paths.
`off`	Commit acknowledged before the WAL is flushed; the WAL writer flushes it in the background	<0.1 ms/tx	Not for production logical replication. Up to three times `wal_writer_delay` (default 200 ms) of acknowledged commits can be lost on a crash. The database stays consistent, but the lost transactions never reach the decoder.

Two interactions are specific to logical replication and easy to miss. First, synchronous_commit can be set per session and per transaction: you can lower it for the high-churn write path with SET LOCAL synchronous_commit = local inside a transaction while leaving critical writes at on. Second, the standby-aware levels only mean something when this publisher also has synchronous physical standbys. If synchronous_standby_names is empty, remote_write and remote_apply silently degrade to a local flush, exactly like on and local, and off is the only value that still changes commit latency. Align the publication scope from creating publications with the durability you pick: publishing a high-churn append-only log at a lowered setting lets producers generate WAL faster than the decoder can drain it.
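The per-transaction scoping described above can be sketched from the application side. This is a minimal illustration, not a drop-in helper: `execute_with_commit_level` and `VALID_LEVELS` are hypothetical names, and the function assumes a psycopg2-style connection whose `with conn:` block commits on success and rolls back on an exception.

```python
# Hypothetical helper: scope a durability level to one transaction only.
VALID_LEVELS = {"on", "off", "local", "remote_write", "remote_apply"}


def execute_with_commit_level(conn, level, sql, params=None):
    """Run one statement in its own transaction at the given synchronous_commit level."""
    # SET does not take bind parameters, so validate against a whitelist
    # before interpolating the value into the statement.
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown synchronous_commit level: {level!r}")
    with conn:  # psycopg2-style: COMMIT on success, ROLLBACK on exception
        with conn.cursor() as cur:
            # SET LOCAL reverts automatically when the transaction ends.
            cur.execute(f"SET LOCAL synchronous_commit = '{level}'")
            cur.execute(sql, params)
```

Because SET LOCAL expires at commit or rollback, other transactions on the same pooled connection keep the session or server default.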

Diagnostic Patterns and Slot Lag Thresholds

A mistuned synchronous_commit produces a predictable signature: WAL accumulates faster than the slot can drain, so restart_lsn falls behind and pg_wal grows toward max_slot_wal_keep_size (or, when that limit is left at its default of -1, until the disk fills). Once the limit is crossed the slot is invalidated and pg_replication_slots.wal_status reports lost. Poll the following at 15-second intervals during peak write load; it is the single query that separates a consumer bottleneck from a producer-outpaces-consumer problem.

sql

-- Run on the publisher. Distinguishes apply-side lag from WAL-retention pressure.
SELECT
  slot_name,
  plugin,
  active,
  restart_lsn,
  confirmed_flush_lsn,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))        AS wal_retention_gap,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS apply_lag
FROM pg_replication_slots
WHERE slot_type = 'logical';

Interpret the two gaps against concrete thresholds:

apply_lag > 256 MB: the consumer cannot process decoded changes fast enough. The producer’s synchronous_commit is not the cause; investigate Python ETL batch sizing, pgoutput filter overhead, or network I/O to the subscriber.
wal_retention_gap > 1 GB: WAL generation is outpacing consumption, the classic symptom of dropping durability without adding consumer capacity. Add consumer parallelism or throttle the producer write rate, and confirm that max_slot_wal_keep_size and the free space under pg_wal cover the gap. If apply_lag is small while this gap is large, a long-running transaction is usually what holds restart_lsn back.
active = false for > 300 s: the apply worker crashed or the subscription disconnected. Confirm the slot state provisioned in initializing replication slots and check network keepalives; a lowered synchronous_commit makes an inactive slot bloat WAL far faster.
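The three thresholds above can be turned into a small classifier for a monitoring loop. This is a sketch under stated assumptions: `diagnose_slot` and its finding labels are hypothetical, the byte counts are expected as raw `pg_wal_lsn_diff` output (without `pg_size_pretty`), and `inactive_seconds` is something the poller tracks itself, since the query above does not return a duration.

```python
MB = 1024 * 1024
GB = 1024 * MB

# Thresholds taken from the interpretation list above.
APPLY_LAG_LIMIT = 256 * MB
WAL_RETENTION_LIMIT = 1 * GB
INACTIVE_LIMIT_S = 300


def diagnose_slot(apply_lag_bytes, wal_retention_gap_bytes, active, inactive_seconds=0.0):
    """Return a list of findings for one logical slot sample (empty list means healthy)."""
    findings = []
    if not active and inactive_seconds > INACTIVE_LIMIT_S:
        findings.append("slot_inactive")           # apply worker down or disconnected
    if apply_lag_bytes > APPLY_LAG_LIMIT:
        findings.append("consumer_bottleneck")     # consumer cannot keep up
    if wal_retention_gap_bytes > WAL_RETENTION_LIMIT:
        findings.append("wal_retention_pressure")  # WAL outpacing consumption
    return findings
```

Returning every matching finding rather than the first one keeps the consumer-side and producer-side signals separate, which is the distinction the query is meant to expose.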

Correlate producer-side latency separately so you can prove a durability change actually bought the throughput you traded for:

sql

-- PG 14+: mean commit latency per statement, before/after the change.
SELECT query, calls, round(mean_exec_time::numeric, 3) AS mean_ms
FROM pg_stat_statements
WHERE query ILIKE 'commit%' OR query ILIKE 'insert into orders%'
ORDER BY mean_exec_time DESC
LIMIT 10;

Safe Deployment Sequence

Never lower synchronous_commit globally without first proving downstream idempotency. Roll it out in four steps with the revert kept one command away.

1. Baseline. Record current commit latency (from pg_stat_statements or your APM) and apply_lag while still at synchronous_commit = on. Without this number you cannot tell whether the change helped.

2. Session-scoped validation. Apply the lowered value to a single ETL connection only, run a representative write batch, and confirm consumers tolerate any duplicate or missing rows:

sql

-- One connection only. Verify downstream replay is clean before going wider.
SET synchronous_commit = local;
-- ... run a representative write workload on this session ...

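3. Cluster-wide application. After the reload, sessions pick up the new value for their subsequent transactions without reconnecting. Sessions that set the parameter themselves with SET, or that inherit it from ALTER ROLE or ALTER DATABASE settings, keep their own value.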

sql

ALTER SYSTEM SET synchronous_commit = local;
SELECT pg_reload_conf();   -- no restart required

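4. Backpressure watch, with revert ready. Track pg_stat_replication.write_lag and consumer queue depth. If a connection pool exhausts or psycopg2 raises OperationalError: server closed the connection unexpectedly (asyncpg reports the same condition through its own connection-error exceptions), revert immediately. The reset removes the override from postgresql.auto.conf, so after the reload new transactions fall back to the postgresql.conf value, which is on unless you changed it: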

sql

ALTER SYSTEM RESET synchronous_commit;
SELECT pg_reload_conf();

Because the reload is non-disruptive in both directions, the effective blast radius of a bad change is bounded by how quickly your monitoring detects it — which is the argument for wiring the thresholds above into alerting before step 3, not after.
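One way to wire a threshold into alerting without paging on a single noisy sample is to require that the breach persist. The sketch below is an assumption-laden illustration: `sustained_breach` is a hypothetical name, samples are assumed to arrive as time-ordered `(unix_timestamp, value)` pairs from the 15-second poll, and the 120-second default mirrors the 2-minute hold used elsewhere on this page.

```python
def sustained_breach(samples, threshold, min_duration_s=120.0):
    """True if the most recent unbroken run of over-threshold samples
    spans more than min_duration_s seconds.

    samples: iterable of (unix_timestamp, value) pairs in time order.
    """
    run_start = None  # timestamp of the first sample in the current run
    last_ts = None    # timestamp of the latest sample in the current run
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts
            last_ts = ts
        else:
            # Any sample at or below the threshold breaks the run.
            run_start = None
            last_ts = None
    return run_start is not None and (last_ts - run_start) > min_duration_s
```

Feeding it the apply_lag or wal_retention_gap series gives a simple revert trigger: a `True` result means the breach has outlasted the hold window.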

Pipeline Integration: ETL, Retry Logic, and Failover

Under synchronous_commit = local the primary acknowledges a commit once its own WAL is flushed, without waiting for any synchronous standby, and under off it acknowledges before even the local flush. Either way, the durability contract your consumers were built against changes. This directly shapes both the native subscription sync apply path and any Python or Debezium connector consumer reading the same slot.

Python ETL patterns

Idempotent upserts. Every applied change must be replayable. Use INSERT ... ON CONFLICT DO UPDATE (or PG 15+ MERGE) keyed on a deterministic primary key so that a replay of already-applied changes after a crash or reconnect converges instead of duplicating.
Retry with backoff and jitter. Wrap the change-fetch loop so that on psycopg2.OperationalError the consumer reconnects and resumes from the last acknowledged confirmed_flush_lsn, never re-requesting from an earlier arbitrary LSN.
Monitoring export. Ship apply_lag and wal_retention_gap to Prometheus via postgres_exporter and alert when either exceeds its threshold for more than 2 minutes; the reusable dashboard and alert rules live in asynchronous monitoring integration.

python

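# Resume-from-last-ack pattern; safe under synchronous_commit = local.
import random
import time

import psycopg2
import psycopg2.extras  # not loaded by "import psycopg2" alone

def stream_changes(dsn, last_lsn, apply_idempotent):
    # apply_idempotent(payload) is your own handler (ON CONFLICT DO UPDATE).
    backoff = 0.5

    def consume(msg):
        nonlocal last_lsn
        apply_idempotent(msg.payload)
        last_lsn = msg.data_start
        msg.cursor.send_feedback(flush_lsn=msg.data_start)  # advances confirmed_flush_lsn

    while True:
        conn = None
        try:
            conn = psycopg2.connect(dsn, connection_factory=psycopg2.extras.LogicalReplicationConnection)
            cur = conn.cursor()
            # decode=True assumes a text plugin such as wal2json or test_decoding;
            # pgoutput is binary and needs decode=False plus its plugin options.
            cur.start_replication(slot_name="cdc_slot", decode=True, start_lsn=last_lsn)
            backoff = 0.5  # connected, so reset the backoff
            cur.consume_stream(consume)  # blocks, calling consume() for each message
        except psycopg2.OperationalError:
            time.sleep(backoff + random.random() * 0.25)  # jitter avoids thundering herd
            backoff = min(backoff * 2, 30)
        finally:
            if conn is not None:
                conn.close()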

Failover handling

Before PostgreSQL 17, logical replication slots do not survive a primary failover. On PG 17+, a slot created with failover = true can be synchronized to a standby, provided the standby is configured for slot synchronization (sync_replication_slots = on and its prerequisites). If the primary crashes under synchronous_commit = local, any WAL the standby had not yet received is gone, and the promoted standby’s pg_current_wal_lsn() can sit behind the last LSN your consumer already applied. Reconcile deterministically:

On the new primary, read SELECT pg_current_wal_lsn();.
Compare it against the last applied LSN in the ETL state table.
If the consumer applied changes the new primary never persisted, re-snapshot: recreate the subscription with CREATE SUBSCRIPTION ... WITH (copy_data = true), or on the Python side truncate-and-reload the affected relations. Record this branch in the runbook so a silent divergence never ships to analytics.
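The comparison in the steps above has to be numeric, because LSNs in their textual `X/Y` form do not sort correctly as strings. The following is a minimal sketch with hypothetical helper names (`lsn_to_int`, `consumer_is_ahead`); it assumes both values are in PostgreSQL's standard `hex/hex` notation and does not account for timeline changes after promotion.

```python
def lsn_to_int(lsn):
    """Convert a textual LSN such as '16/B374D848' to a 64-bit integer position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after is the lower 32.
    return (int(high, 16) << 32) | int(low, 16)


def consumer_is_ahead(new_primary_lsn, last_applied_lsn):
    """True if the consumer applied WAL beyond what the new primary holds,
    which is the case that calls for a re-snapshot."""
    return lsn_to_int(last_applied_lsn) > lsn_to_int(new_primary_lsn)
```

For example, `consumer_is_ahead("10/0", "A/0")` is false even though `"A/0"` sorts after `"10/0"` as a string, which is why the ETL state table should be compared through a conversion like this rather than lexically.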

Authoritative references

PostgreSQL: synchronous_commit (WAL configuration) — the canonical per-value durability definitions and standby interactions.
PostgreSQL: Logical Replication — decoding, slots, and apply-worker architecture.
PostgreSQL: ALTER SYSTEM — reload-vs-restart semantics for the rollout above.

Related guides

Creating publications — the parent workflow; the publish set and durability chosen here define the data-loss window this page trades against throughput.
Initializing replication slots — provision and recover the durable WAL cursor whose restart_lsn a mistuned commit setting can bloat.
Subscription sync procedures — the apply-side workflow whose durability boundary this setting governs.
Asynchronous monitoring integration — export apply_lag and wal_retention_gap with the alerting thresholds referenced above.
WAL stream mechanics — the LSN and cursor arithmetic underneath every threshold on this page.
