Configuring max_replication_slots

max_replication_slots sets the hard, shared-memory ceiling on how many logical and physical replication slots a PostgreSQL primary can hold at once, and getting the number wrong either rejects new consumers outright or lets idle slots quietly pin WAL until the primary runs out of disk. This page gives deterministic sizing, the diagnostic queries that expose slot pressure, and a zero-downtime procedure for changing the value, extending the slot-type decisions covered in the parent guide to replication slot types.

Unlike most tuning knobs, this one cannot be changed by a reload: the slot array is allocated in shared memory at startup, so a new value takes effect only after a full restart. That single fact reshapes every decision below: you size once, you size with headroom, and you treat the change as a scheduled maintenance event rather than a runtime adjustment. Undersizing surfaces as ERROR: all replication slots are in use the moment a Debezium connector or a native subscriber tries to create a slot; oversizing wastes shared memory and, worse, removes the natural backstop that would otherwise cap how many consumers can independently retain WAL.

Commit & Behavior Semantics

max_replication_slots does not itself change durability or commit latency — it caps how many independent WAL-retention anchors can exist. But it never operates alone: three neighbouring GUCs govern whether a correctly-sized slot pool actually behaves safely. The table below fixes the exact behaviour of each, targeting PostgreSQL 14–17.

| Parameter | Reload behaviour | Durability / retention guarantee | Logical-replication behaviour |
| --- | --- | --- | --- |
| `max_replication_slots` | Restart only; allocated in shared memory at startup | Caps concurrent slots (logical + physical). No effect on WAL volume itself, only on how many consumers can pin `restart_lsn` | Exceeding it returns `ERROR: all replication slots are in use` when a new slot is requested; existing slots are unaffected |
| `max_wal_senders` | Restart only | Caps concurrent walsender processes; each active streaming consumer needs one | Should be at least `max_replication_slots` plus headroom (base backups also consume senders). Too low blocks attach even when a slot exists |
| `wal_level` | Restart only | Determines what is logged, not retention | Must be `logical` for logical slots; `replica` supports physical slots only |
| `max_slot_wal_keep_size` (PG 13+) | Reload (SIGHUP) | Bounds WAL a slot may retain before it is invalidated | The safety valve: a slot that would retain more than this is marked `lost`, freeing WAL. Set it to protect the volume; unset (`-1`) means unlimited retention |

The operational contract: max_replication_slots decides how many consumers can hold WAL; max_slot_wal_keep_size decides how much any one of them may hold before PostgreSQL sacrifices it to save the disk. Size the first for your consumer inventory; set the second as a circuit breaker so a single stalled slot cannot fill pg_wal. The retention mechanics behind that trade-off are detailed in WAL stream mechanics.

Deterministic capacity planning

Size from an explicit inventory, never a round number. The baseline:

code

max_replication_slots = N_logical + N_physical + ceil(0.20 * N_logical)

N_logical is the count of concurrent active subscriptions across all publications — each subscription binds to exactly one logical slot. N_physical covers streaming standbys holding physical slots. The 20% buffer absorbs transient duplication during rolling connector restarts, Kubernetes pod rescheduling, or a network partition where the old slot has not yet been reaped when the replacement attaches. For a platform running 12 logical subscribers and 2 physical standbys: 12 + 2 + ceil(2.4) = 17. Set max_wal_senders at least one above that (18+) to leave room for pg_basebackup.
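
The sizing rule above can be sketched as a small helper. This is an illustration rather than a prescribed tool: the function names and the `buffer_percent` and `extra_senders` parameters are hypothetical, and integer arithmetic is used for the ceiling so the result does not depend on floating-point rounding.

```python
def recommended_slot_ceiling(n_logical: int, n_physical: int,
                             buffer_percent: int = 20) -> int:
    """N_logical + N_physical + ceil(buffer_percent% of N_logical)."""
    if n_logical < 0 or n_physical < 0:
        raise ValueError("slot counts must be non-negative")
    # Integer ceiling division: ceil(n_logical * buffer_percent / 100).
    buffer_slots = (n_logical * buffer_percent + 99) // 100
    return n_logical + n_physical + buffer_slots


def recommended_wal_senders(slot_ceiling: int, extra_senders: int = 1) -> int:
    """At least one sender above the slot ceiling, leaving room for pg_basebackup."""
    return slot_ceiling + extra_senders


# The worked example from the text: 12 logical subscribers, 2 physical standbys.
slots = recommended_slot_ceiling(12, 2)
print(slots, recommended_wal_senders(slots))  # prints: 17 18
```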

Diagnostic Patterns

Audit the current allocation before touching the ceiling, and monitor pressure continuously afterward. Every query below runs against pg_replication_slots on the primary.

Baseline inventory — what exists and whether it is being consumed:

sql

SELECT slot_name, slot_type, active,
       restart_lsn, confirmed_flush_lsn,
       wal_status                    -- reserved | extended | unreserved | lost (PG 13+)
FROM pg_replication_slots
ORDER BY restart_lsn;

Headroom against the ceiling — alert when utilisation crosses 80%:

sql

SELECT count(*)                             AS slots_used,
       current_setting('max_replication_slots')::int AS slot_ceiling,
       round(100.0 * count(*)
             / current_setting('max_replication_slots')::int, 1) AS pct_used
FROM pg_replication_slots;
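
If the alert is evaluated in application code rather than in the monitoring system, the same arithmetic as the query above might look like the sketch below. It assumes the two counts have already been fetched; the function names are hypothetical, and treating "exactly 80%" as alerting is an assumption.

```python
def slot_utilisation(slots_used: int, slot_ceiling: int) -> float:
    """Percentage of max_replication_slots in use, rounded to one decimal."""
    if slot_ceiling <= 0:
        # max_replication_slots = 0 disables slots, so there is nothing to measure.
        raise ValueError("slot_ceiling must be positive")
    return round(100.0 * slots_used / slot_ceiling, 1)


def headroom_alert(slots_used: int, slot_ceiling: int,
                   threshold_pct: float = 80.0) -> bool:
    """True once utilisation reaches the alert threshold."""
    return slot_utilisation(slots_used, slot_ceiling) >= threshold_pct


print(slot_utilisation(14, 17), headroom_alert(14, 17))  # prints: 82.4 True
```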

Per-slot retained-WAL bytes — the number that actually threatens the disk. Alert on > 1 GB sustained for 10 minutes on any single logical slot, and page on any slot whose restart_lsn has not advanced while active = false:

sql

SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical'
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

An inactive slot with growing retained_wal is the canonical pre-outage signature: the downstream consumer has died, restart_lsn is frozen, and pg_wal climbs until the volume fills and the primary refuses writes. Treat a slot that is active = false for more than a few minutes on a production pipeline as an incident, not a warning. wal_status = 'lost' means max_slot_wal_keep_size already fired and that consumer must be re-initialised from a fresh snapshot.
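
The triage rules in this section can be sketched as a pure function over one row of the retained-WAL query. This is a simplification under stated assumptions: the `SlotRow` type and the result labels are hypothetical, and the sketch looks at a single sample, so the "sustained for 10 minutes" and "more than a few minutes" conditions still have to be enforced by whatever schedules the check.

```python
from dataclasses import dataclass
from typing import Optional

GIB = 1024 ** 3


@dataclass
class SlotRow:
    slot_name: str
    active: bool
    wal_status: Optional[str]   # reserved | extended | unreserved | lost, or None
    retained_bytes: int         # pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)


def triage(slot: SlotRow, retained_limit: int = GIB) -> str:
    """Classify one slot sample into a hypothetical action label."""
    if slot.wal_status == "lost":
        # max_slot_wal_keep_size already fired: rebuild from a fresh snapshot.
        return "reinitialise"
    if not slot.active and slot.retained_bytes > 0:
        # Dead consumer still pinning WAL: the pre-outage signature.
        return "incident"
    if slot.retained_bytes > retained_limit:
        return "alert"
    return "ok"


print(triage(SlotRow("etl_orders", False, "extended", 5 * GIB)))  # prints: incident
```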

Safe Deployment Sequence

Changing max_replication_slots requires a restart, so it is executed as a controlled, reversible rollout — never during peak ingestion.

Pre-flight validation. Confirm the new ceiling and its dependency:
```sql
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('max_replication_slots', 'max_wal_senders', 'wal_level');
```
If max_wal_senders is at or below the target slot count, raise it in the same change so a single restart applies both.
Stage the new values. Apply via ALTER SYSTEM (or the managed-service parameter group / postgresql.conf):
```sql
ALTER SYSTEM SET max_replication_slots = 18;
ALTER SYSTEM SET max_wal_senders       = 20;
```
After a configuration reload (SELECT pg_reload_conf();), pending_restart reads true for both: staged but not live.
Drain and checkpoint. Pause new subscription/slot creation, let active consumers reach a stable confirmed_flush_lsn, and issue a manual CHECKPOINT so the shutdown checkpoint has little left to flush and the restart window stays short.
Rolling restart. Restart physical standbys first, then the primary. Front the primary with a pooler (PgBouncer) so replication connection attempts queue through the brief window rather than erroring. Never lower the value below the count of existing slots — the server refuses to start if it cannot re-register every persistent slot.
Verify. Confirm the live value and that every consumer reattached:
```sql
SHOW max_replication_slots;
SELECT slot_name, active FROM pg_replication_slots WHERE NOT active;
```
An empty second result means every slot reattached cleanly.

Revert. Because the change is restart-scoped, rollback is symmetric: ALTER SYSTEM SET max_replication_slots = <previous>; then restart — but only after dropping any slots created above the old ceiling (SELECT pg_drop_replication_slot('name');), or the lower value will block startup.
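
The two hard rules in this sequence (never go below the number of existing slots, and keep max_wal_senders above the slot ceiling) can be checked before anything is staged. A minimal sketch, assuming the three numbers have already been read from pg_settings and pg_replication_slots; the function name and message wording are hypothetical.

```python
from typing import List


def preflight_problems(target_slots: int, target_senders: int,
                       existing_slots: int) -> List[str]:
    """Return human-readable reasons not to apply the change; empty means go."""
    problems = []
    if target_slots < existing_slots:
        # The server will not start if it cannot restore every persistent slot.
        problems.append(
            f"max_replication_slots={target_slots} is below the "
            f"{existing_slots} slots that already exist; drop slots first")
    if target_senders <= target_slots:
        # Leave at least one sender free for pg_basebackup.
        problems.append(
            f"max_wal_senders={target_senders} must exceed "
            f"max_replication_slots={target_slots}")
    return problems


print(preflight_problems(18, 20, 14))       # prints: []
print(len(preflight_problems(10, 10, 14)))  # prints: 2
```

The same function covers the revert path: pass the previous ceiling as `target_slots` and the current slot count as `existing_slots`.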

Pipeline Integration

Python ETL workers must treat slot exhaustion and runaway lag as expected states, degrading gracefully instead of blocking. Consumers built on psycopg2 logical decoding — the same pattern used when creating a logical slot step by step — should self-limit before they ever hit the ceiling.

python

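import psycopg2
import psycopg2.extras   # must be imported explicitly; `import psycopg2` alone does not load it
from psycopg2 import errors

LAG_LIMIT_BYTES = 500 * 1024 * 1024   # 500 MB: shed load past this


class SlotCeilingExhausted(Exception):
    """No free replication slot on the server (max_replication_slots reached)."""


def attach_slot(dsn, slot_name, output_plugin):
    # output_plugin: a decoding plugin installed on the server; decode=True below
    # expects textual output. LogicalReplicationConnection opens the session with
    # replication=database, which logical decoding requires.
    conn = psycopg2.connect(
        dsn, connection_factory=psycopg2.extras.LogicalReplicationConnection)
    cur = conn.cursor()
    try:
        # The ceiling is enforced when a slot is created, not when streaming starts.
        cur.create_replication_slot(slot_name, output_plugin=output_plugin)
    except errors.DuplicateObject:
        pass  # slot already exists: reuse it
    except errors.ConfigurationLimitExceeded:
        # 'all replication slots are in use' (SQLSTATE 53400): do not busy-retry.
        conn.close()
        raise SlotCeilingExhausted(slot_name)
    cur.start_replication(slot_name=slot_name, decode=True)
    return conn, cur


def check_lag(admin_cur, slot_name):
    # admin_cur is a cursor on an ordinary (non-replication) connection.
    admin_cur.execute("""
        SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
        FROM pg_replication_slots WHERE slot_name = %s
    """, (slot_name,))
    row = admin_cur.fetchone()
    return row[0] if row else None   # None when the slot does not exist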

Wire three behaviours around that core:

Circuit breaker. Poll check_lag on the interval you already export metrics on. Past LAG_LIMIT_BYTES, route incoming change events to a dead-letter sink (a Kafka DLQ or S3 staging prefix) and alert, rather than letting the slot fall further behind and drag WAL retention with it.
Bounded retry, not busy-loop. all replication slots are in use means the pool is full, not that this consumer will get a slot by retrying tightly — that only burns connections. Back off exponentially, alert after N failures, and surface the ceiling as a capacity signal to whatever manages max_replication_slots.
Ownership handoff on migration. When replacing a connector or upgrading, create the new slot while the old consumer is still running, then move it forward to the old consumer's confirmed position with pg_replication_slot_advance('new_slot', '<lsn>') before dropping the legacy slot, so no WAL gap opens between the two. The function only moves a slot forward, so the new slot must have been created at or before that position. For critical tables where recovery is delayed, fall back to a full-table snapshot via the subscription sync copy phase, then resume streaming once lag normalises.
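
The first two behaviours can be sketched without a database connection. The code below is an illustration under stated assumptions: `SlotCeilingExhausted` is a local stand-in for the exception the attach routine raises, `attach` is any zero-argument callable that performs the attach, and the `on_exhausted` hook, the default attempt count, and the delay bounds are all hypothetical.

```python
import time
from typing import Callable, Iterator, Optional

LAG_LIMIT_BYTES = 500 * 1024 * 1024


class SlotCeilingExhausted(Exception):
    """Stand-in for the error raised when the slot pool is full."""


def should_shed_load(lag_bytes, limit: int = LAG_LIMIT_BYTES) -> bool:
    """Circuit breaker: True when events should go to the dead-letter sink."""
    return lag_bytes is not None and lag_bytes > limit


def backoff_delays(max_attempts: int, base: float = 1.0,
                   cap: float = 60.0) -> Iterator[float]:
    """Exponential delays: base, 2*base, 4*base, ... bounded by cap."""
    for attempt in range(max_attempts):
        yield min(cap, base * 2 ** attempt)


def attach_with_backoff(attach: Callable[[], object], max_attempts: int = 5,
                        base: float = 1.0, cap: float = 60.0,
                        sleep: Callable[[float], None] = time.sleep,
                        on_exhausted: Optional[Callable[[int], None]] = None):
    """Retry attach() a bounded number of times, then alert and re-raise."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt, delay in enumerate(backoff_delays(max_attempts, base, cap), 1):
        try:
            return attach()
        except SlotCeilingExhausted:
            if attempt == max_attempts:
                if on_exhausted is not None:
                    on_exhausted(attempt)   # e.g. raise a capacity alert
                raise
            sleep(delay)
```

Injecting `sleep` keeps the retry loop testable without real waiting; in production the default `time.sleep` applies.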

Slot creation is privileged: only roles with the REPLICATION attribute (or the pg_create_logical_replication_slot privilege on managed platforms) can allocate one, which is the correct place to stop unbounded slot growth in a multi-tenant platform. Scope those grants as described in security boundaries and permissions, and understand how the underlying logical decoding subsystem turns a slot into a decoded change stream before you decide how many to provision.

Authoritative References

PostgreSQL — max_replication_slots (runtime configuration for replication)
PostgreSQL — max_slot_wal_keep_size and the wal_status column in pg_replication_slots
PostgreSQL — replication management functions (pg_drop_replication_slot, pg_replication_slot_advance)
psycopg — logical replication support

Related

Replication Slot Types — physical vs. logical slots and the lifecycle the ceiling constrains.
pg_create_logical_replication_slot step by step — the hands-on slot creation this ceiling gates.
Automating slot creation with Ansible — provisioning slots reproducibly within the configured ceiling.
WAL Stream Mechanics — how a slot’s restart_lsn pins WAL retention on disk.
