Automating slot creation with Ansible

This page covers provisioning persistent logical replication slots declaratively with Ansible so that slot creation is idempotent, auditable, and repeatable across every environment. It matters because slot hygiene decides whether a Change Data Capture pipeline resumes cleanly: a slot created by hand, or created a second time under a different name and then abandoned, pins restart_lsn and lets pg_wal grow at the primary’s full write rate. This is the automation counterpart to the interactive procedure in Initializing Replication Slots.

Manual psql execution of pg_create_logical_replication_slot works once, on one host, when someone is watching. It does not survive scale: a fleet of Python ETL consumers, a multi-tenant data platform, or a blue/green cutover needs the same slot, with the same output plugin and the same naming convention, provisioned the same way every time. Ansible supplies that control plane — it queries the current slot state, converges only the drift, and exposes the captured restart_lsn to the rest of the deployment as a fact. The result is version-controlled slot lifecycle management instead of ad-hoc administrative commands.

Slot Argument Semantics

Every slot Ansible provisions is a call to pg_create_logical_replication_slot(slot_name, plugin, temporary, twophase[, failover]). Each argument after the name carries a distinct durability and streaming guarantee: get one wrong and the pipeline either loses its resume point or silently retains WAL forever. The table below is the contract the playbook must encode. All values target PostgreSQL 14–17; the twophase argument exists from PostgreSQL 14 and failover from PostgreSQL 17.

Argument	Value used in automation	Durability / WAL-retention guarantee	Latency / throughput impact	Logical-replication behavior
`plugin`	`pgoutput`	Independent of plugin; retention is driven by `restart_lsn`	Binary protocol, lowest CPU per change vs JSON decoders	Native output plugin; honors publication row filters and column lists (PG 15+)
`temporary`	`false`	WAL retained across consumer restarts, network partitions, and rolling deploys	None	Persistent slot; survives disconnect so offset tracking is not destroyed
`twophase` (shown as `two_phase` in `pg_replication_slots`)	`false` (set `true` only for distributed txns)	No change to retention	Slightly higher decode cost when `true`	`true` streams `PREPARE`/`COMMIT PREPARED` for atomic two-phase decoding
`failover` (PG 17+)	`false` (or `true` for HA)	`true` makes the slot eligible for synchronization to physical standbys (`sync_replication_slots = on` or `pg_sync_replication_slots()`)	Negligible	`true` lets a synchronized slot survive standby promotion; unavailable ≤ PG 16
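
The version constraints in the table can be made explicit in code. The sketch below is one possible way to assemble the call for a driver that uses `%s` placeholders (psycopg2 style); the helper name `build_create_slot_call` and the use of `server_version_num` as input are assumptions for illustration, not part of the playbook.

```python
from typing import Tuple


def build_create_slot_call(
    slot_name: str,
    server_version_num: int,
    plugin: str = "pgoutput",
    two_phase: bool = False,
    failover: bool = False,
) -> Tuple[str, tuple]:
    """Return (sql, params) for creating a persistent logical slot.

    Hypothetical helper: arguments are passed positionally, so the
    fifth (failover) argument is only added on PostgreSQL 17 or later.
    """
    if server_version_num < 140000:
        raise ValueError("the four-argument form needs PostgreSQL 14 or later")

    # temporary is always False: automation must create persistent slots.
    params = [slot_name, plugin, False, two_phase]

    if failover:
        if server_version_num < 170000:
            raise ValueError("failover slots require PostgreSQL 17 or later")
        params.append(True)

    placeholders = ", ".join(["%s"] * len(params))
    sql = (
        "SELECT slot_name, lsn "
        f"FROM pg_create_logical_replication_slot({placeholders})"
    )
    return sql, tuple(params)
```

Passing the values as parameters rather than interpolating them into the SQL string also avoids quoting problems if a slot name ever comes from untrusted inventory data.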

The persistent (temporary := false) flag is the load-bearing choice for automation: a temporary slot vanishes when the provisioning connection closes, so the very act of Ansible finishing the task would drop it. Encode a naming convention that carries environment, consumer group, and pipeline identifier (for example cdc_prod_etl_orders_v2) so parallel deployments never collide. Slot names must be unique across the whole cluster, not just within one database, and may contain only lowercase letters, digits, and underscores.
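
A naming convention is easier to enforce when it is generated rather than typed. The following is a minimal sketch, assuming the `cdc_<env>_<group>_<pipeline>_v<N>` layout suggested by the example name; the function `build_slot_name` is hypothetical. The character set and the 63-character limit reflect PostgreSQL's rules for slot names.

```python
import re

# Slot names: lowercase letters, digits, underscores; at most 63 characters.
_SLOT_NAME_RE = re.compile(r"[a-z0-9_]{1,63}")


def build_slot_name(env: str, consumer_group: str, pipeline: str, version: int) -> str:
    """Compose a slot name from its parts and validate it (illustrative helper)."""
    parts = ["cdc", env, consumer_group, pipeline, f"v{version}"]
    # Normalize each part: trim, lowercase, and map hyphens to underscores.
    name = "_".join(p.strip().lower().replace("-", "_") for p in parts)
    if not _SLOT_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid replication slot name: {name!r}")
    return name


print(build_slot_name("prod", "etl", "orders", 2))
```

Generating the name in one place means the inventory only has to carry the components, and an invalid name fails before any SQL is sent to the server.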

Idempotent playbook implementation

The tasks below use community.postgresql.postgresql_query (v2.3+). Credentials are injected via Ansible Vault or external inventory and never appear inline. Task one reads the desired-state guard; task two converges only when the slot is absent; task three publishes the restart_lsn anchor as a fact for the subscription sync step that follows.

yaml

- name: Query existing replication slots
  community.postgresql.postgresql_query:
    db: "{{ target_db }}"
    query: >
      SELECT slot_name, plugin, slot_type, active, restart_lsn
      FROM pg_replication_slots
      WHERE slot_name = '{{ slot_name }}'
  register: slot_state
  retries: 3
  delay: 2
  until: slot_state is not failed
  changed_when: false

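- name: Create persistent logical replication slot
  community.postgresql.postgresql_query:
    db: "{{ target_db }}"
    # The function returns (slot_name, lsn), where lsn is the slot's consistent
    # point. It is aliased so the fact below keeps the name used by the rest of the play.
    query: >
      SELECT slot_name, lsn AS restart_lsn
      FROM pg_create_logical_replication_slot('{{ slot_name }}', 'pgoutput', false, false)
  register: slot_creation
  when: slot_state.query_result | length == 0
  # A SELECT may not be reported as a change by the module, so flag it
  # explicitly whenever the function returned a row.
  changed_when: slot_creation.query_result | default([]) | length > 0
  retries: 3
  delay: 5
  until: slot_creation is not failed

- name: Register slot metadata for downstream consumers
  ansible.builtin.set_fact:
    cdc_slot_restart_lsn: "{{ slot_creation.query_result[0].restart_lsn }}"
    cdc_slot_active: false
  when: slot_creation is changed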

The changed_when: false on the guard keeps a read-only query from being reported as a change; the when: slot_state.query_result | length == 0 gate is what makes the second task idempotent, because a re-run against an existing slot skips the create task and reports no change. The executing role must be a superuser or hold the REPLICATION attribute, and it must connect directly to the target database; provisioning that role is covered under security boundaries and permissions.
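
The gate's decision can be written out as a small pure function, which also makes visible a third outcome the `when` condition alone does not handle: a slot that exists but with a different plugin or type. This is a sketch only; `plan_slot_action` and its return values are hypothetical names, and the input is assumed to be the list of row dictionaries returned by the guard query.

```python
from typing import Mapping, Sequence


def plan_slot_action(
    rows: Sequence[Mapping[str, object]],
    expected_plugin: str = "pgoutput",
    expected_type: str = "logical",
) -> str:
    """Decide what to do from the guard query's rows.

    Returns "create" when the slot is absent, "noop" when it already
    matches, and "mismatch" when a slot with that name exists but was
    created with a different plugin or slot type.
    """
    if not rows:
        return "create"
    row = rows[0]
    if row.get("plugin") == expected_plugin and row.get("slot_type") == expected_type:
        return "noop"
    return "mismatch"


print(plan_slot_action([]))
print(plan_slot_action([{"plugin": "pgoutput", "slot_type": "logical"}]))
print(plan_slot_action([{"plugin": "wal2json", "slot_type": "logical"}]))
```

A "mismatch" result is a reasonable place to fail the play rather than converge, since silently reusing a slot with the wrong output plugin would break the consumer's decoder.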

Diagnostic Patterns

Automation does not end at creation — the playbook (or a companion monitoring role) must assert that the slot it created is behaving. Run these against pg_replication_slots; each carries an operational threshold you can wire straight into an assert task or an alert rule. For the mechanics behind restart_lsn and confirmed_flush_lsn, see WAL stream mechanics.

sql

-- Post-provision assertion: slot exists with the expected plugin and type
SELECT slot_name, plugin, slot_type, active, restart_lsn, confirmed_flush_lsn, wal_status
FROM pg_replication_slots
WHERE slot_name = 'cdc_prod_etl_orders_v2';

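confirmed_flush_lsn is already populated immediately after creation (a logical slot records its consistent point there); it advances only once a consumer acknowledges changes.
wal_status = 'reserved': retained WAL is within max_wal_size; this is the normal state.
wal_status = 'extended': the slot is retaining WAL beyond max_wal_size, but the files are still being kept; treat as a warning that the consumer is lagging.
wal_status = 'unreserved': the slot has exceeded max_slot_wal_keep_size and the WAL it needs will be removed at the next checkpoint unless the consumer advances; treat as an immediate page.
wal_status = 'lost': the slot has already been invalidated; the pipeline must resnapshot.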
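
For an assert task or an exporter, the wal_status values can be mapped to a severity. The sketch below follows the meanings the PostgreSQL documentation gives for the four states (reserved, extended, unreserved, lost); the severity labels and the function name `classify_wal_status` are assumptions chosen for illustration.

```python
from typing import Optional

# Severity labels are illustrative; adapt them to your alerting vocabulary.
_SEVERITY = {
    "reserved": "ok",          # within max_wal_size
    "extended": "warn",        # beyond max_wal_size, files still retained
    "unreserved": "page",      # required WAL may be removed at next checkpoint
    "lost": "resnapshot",      # slot invalidated, no longer usable
}


def classify_wal_status(wal_status: Optional[str]) -> str:
    """Map a pg_replication_slots.wal_status value to a severity label."""
    if wal_status is None:
        # wal_status is NULL when the slot has no restart_lsn reserved.
        return "unknown"
    return _SEVERITY.get(wal_status, "unknown")


for status in ("reserved", "extended", "unreserved", "lost", None):
    print(status, "->", classify_wal_status(status))
```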

sql

-- Retained-WAL pressure: how many bytes the slot is pinning
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
WHERE slot_name = 'cdc_prod_etl_orders_v2';

Alert when retained_bytes exceeds a hard ceiling — 5 GB is a common first threshold — or when active = false persists longer than 300 s during ingestion hours, which means a consumer has disconnected while WAL keeps accumulating. Poll on a 15–30 s interval and export the delta to your metrics stack; the dashboards and alert rules for this live under async monitoring integration.
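
The two alert conditions above can be evaluated client-side from a polled row. This is a minimal sketch under stated assumptions: the 5 GB ceiling is taken as 5 GiB, `inactive_seconds` is tracked by the poller itself (the view does not expose it in every supported version), and the function names are hypothetical. `lsn_to_int` mirrors the arithmetic of pg_wal_lsn_diff for the textual `X/Y` LSN format.

```python
from typing import List


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/16B3748' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between the slot's restart_lsn and the current position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


def slot_alerts(
    retained: int,
    active: bool,
    inactive_seconds: float,
    max_retained: int = 5 * 1024**3,   # assumed: "5 GB" interpreted as 5 GiB
    max_inactive: float = 300.0,
) -> List[str]:
    """Return the names of the alert conditions that currently fire."""
    alerts = []
    if retained > max_retained:
        alerts.append("retained_wal")
    if not active and inactive_seconds > max_inactive:
        alerts.append("inactive")
    return alerts


print(retained_bytes("1/0", "0/FFFFFFFF"))
print(slot_alerts(6 * 1024**3, active=False, inactive_seconds=600))
```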

Server configuration the playbook must validate

Slot creation is only safe when the publisher is configured to retain and decode WAL. Assert these before the create task; changing wal_level requires a restart.

Parameter	Recommended value	Operational impact
`wal_level`	`logical`	Enables logical decoding; requires restart if changed from `replica`
`max_replication_slots`	`≥ 10`	Must exceed total active + standby slots — see configuring max_replication_slots safely
`max_wal_senders`	`≥ max_replication_slots + 2`	Reserves connection slots for streaming and logical decoders
`max_slot_wal_keep_size`	e.g. `10GB`	Hard cap; invalidates a stalled slot instead of letting `pg_wal` exhaust the disk
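
The table can be turned into a pre-flight check. The sketch below assumes the settings have already been read into a dictionary of strings (for example from `SHOW` or `pg_settings`); the function name `validate_publisher_settings` and the message wording are hypothetical, and the thresholds are simply the recommended values from the table.

```python
from typing import Dict, List


def validate_publisher_settings(settings: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the pre-flight passes."""
    problems = []

    if settings.get("wal_level") != "logical":
        problems.append("wal_level must be 'logical' (restart required)")

    slots = int(settings.get("max_replication_slots", "0"))
    senders = int(settings.get("max_wal_senders", "0"))
    if slots < 10:
        problems.append("max_replication_slots is below the recommended 10")
    if senders < slots + 2:
        problems.append("max_wal_senders should be >= max_replication_slots + 2")

    # -1 means unlimited retention, i.e. no cap to protect pg_wal.
    keep = str(settings.get("max_slot_wal_keep_size", "-1")).strip()
    if keep in ("-1", ""):
        problems.append("max_slot_wal_keep_size is unlimited; set a hard cap")

    return problems


print(validate_publisher_settings({
    "wal_level": "replica",
    "max_replication_slots": "4",
    "max_wal_senders": "4",
    "max_slot_wal_keep_size": "-1",
}))
```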

Safe Deployment Sequence

Provisioning a slot into a live primary is zero-downtime when ordered correctly. The sequence below never touches an active slot and has an explicit revert at each stage.

Pre-flight assert. Run the config-validation query (SHOW wal_level; and the parameter table above). Fail the play if wal_level <> 'logical'. Revert: none — nothing has changed.
Guarded query. Execute the read-only pg_replication_slots lookup. If a row exists with the correct plugin and slot_type, converge nothing and exit changed: false. Revert: none.
Create if absent. Run pg_create_logical_replication_slot('{{ slot_name }}', 'pgoutput', false, false) under the when gate. This reserves restart_lsn at the current WAL position but streams nothing yet. Revert: SELECT pg_drop_replication_slot('{{ slot_name }}'); — safe while active = false.
Register the anchor. Set the cdc_slot_restart_lsn fact so the subscription step binds to the exact position captured here. Revert: discard the fact.
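Attach the consumer. On the subscriber, bind the slot with CREATE SUBSCRIPTION ... WITH (slot_name = '{{ slot_name }}', create_slot = false, copy_data = false). create_slot = false prevents a duplicate slot; copy_data = false skips the initial table copy, so streaming starts from the position the slot reserved at creation with no redundant historical copy. Revert: a plain DROP SUBSCRIPTION also drops the associated slot on the publisher, so to keep the slot, first run ALTER SUBSCRIPTION ... DISABLE and ALTER SUBSCRIPTION ... SET (slot_name = NONE), then DROP SUBSCRIPTION, and drop the slot separately once it is inactive.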

Never drop an active = true slot during peak ingestion — you destroy the only cursor that lets the consumer resume without a full resnapshot. Schedule slot removal for a low-throughput window or after a graceful consumer shutdown.
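
The "explicit revert at each stage" idea can be sketched as a small step runner that unwinds completed stages in reverse order when a later stage fails. This is an illustration of the pattern only, not how Ansible itself executes plays (there, `block`/`rescue` plays a similar role); `run_with_reverts` and the step tuples are hypothetical.

```python
from typing import Callable, List, Tuple

# (name, apply, revert): each stage carries its own undo action.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_reverts(steps: List[Step]) -> List[str]:
    """Apply steps in order; on failure, revert completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, apply, _revert = step
        try:
            apply()
        except Exception as exc:
            log.append(f"failed:{name}:{exc}")
            # The failed step changed nothing we can rely on, so only
            # the previously completed steps are reverted.
            for done_name, _apply, revert in reversed(completed):
                revert()
                log.append(f"reverted:{done_name}")
            return log
        completed.append(step)
        log.append(f"applied:{name}")
    return log
```

In a real deployment the `apply` and `revert` callables would wrap the SQL from the sequence above, for example creating the slot and calling pg_drop_replication_slot respectively.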

Pipeline Integration

Once Ansible has published cdc_slot_restart_lsn, the Python ETL consumer or a Debezium connector attaches to the pre-created slot rather than creating its own. The consumer’s only job is to stream, apply idempotently, and advance the confirmed LSN — the slot already exists and its resume point is known.

python

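import psycopg2
from psycopg2.extras import LogicalReplicationConnection

DSN = "host=primary-db dbname=cdc_source user=replicator replication=database"


def apply_upsert(payload: bytes) -> None:
    """Placeholder for the sink write: replace with an upsert keyed on the primary key."""
    raise NotImplementedError("decode the pgoutput message and write it to the sink")


def on_change(msg):
    apply_upsert(msg.payload)                 # must be idempotent by primary key
    # Acknowledge only after the change is applied; this advances confirmed_flush_lsn.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)


conn = psycopg2.connect(DSN, connection_factory=LogicalReplicationConnection)
cur = conn.cursor()

cur.start_replication(
    slot_name="cdc_prod_etl_orders_v2",   # the slot Ansible provisioned
    options={"proto_version": "1", "publication_names": "cdc_pub"},
    decode=False,                          # pgoutput is binary; keep raw bytes
)

cur.consume_stream(on_change)                # blocks, calling on_change per message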

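Send feedback on every processed message (or on a bounded interval): a consumer that never ACKs stalls confirmed_flush_lsn, and the slot pins WAL exactly as an unconsumed slot would. Wrap start_replication in exponential backoff. If the server reports replication slot "..." does not exist, the slot was dropped or was never created on this node (for example after a failover), so re-run the same Ansible role. If the server instead reports that the slot has been invalidated (wal_status = 'lost' after exceeding max_slot_wal_keep_size), drop it, recreate it via the role, and trigger a full resnapshot.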
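
One way to wrap the connection attempt in exponential backoff is sketched below. The helper `connect_with_backoff`, its defaults, and the 10% jitter are assumptions for illustration; the `sleep` parameter is injectable so the logic can be exercised without waiting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    start: Callable[[], T],
    retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call start() until it succeeds, doubling the delay after each failure."""
    for attempt in range(retries):
        try:
            return start()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter so a fleet of consumers does not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("retries must be at least 1")
```

With psycopg2, `start` would be a function that opens the connection and calls start_replication, and `retry_on` would typically be narrowed to `(psycopg2.OperationalError,)` so that programming errors are not retried.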

Failover handling

Consumer crash. The slot drops to active = false and keeps retaining WAL. Detect via the active-duration alert, restart the consumer, and let it resume from the slot's confirmed_flush_lsn. No slot changes.
Stale/irrecoverable slot. Verify active = false, SELECT pg_drop_replication_slot('{{ slot_name }}');, re-run the Ansible role to re-provision with a fresh anchor, then re-attach the subscription.
Primary failover (PG 17+). If the slot was created with failover := true and the standby is configured to synchronize slots (sync_replication_slots = on, or regular calls to pg_sync_replication_slots()), a synchronized copy exists on the standby and survives promotion. On PG 16 and earlier, logical slots are not carried over to the promoted standby, so the role must recreate the slot on the new primary and coordinate a snapshot-based catch-up before restarting the consumer.

Authoritative references

PostgreSQL manual: pg_create_logical_replication_slot and the replication management functions.
PostgreSQL manual: pg_replication_slots view — column semantics for wal_status, restart_lsn, and confirmed_flush_lsn.
PostgreSQL manual: Logical Decoding — output plugin and streaming protocol.
Ansible: community.postgresql.postgresql_query module reference.

Related

pg_create_logical_replication_slot step-by-step — the interactive procedure this playbook automates, with per-argument detail.
Configuring max_replication_slots safely — sizing the slot ceiling before you provision a fleet.
Async Monitoring Integration — the lag dashboards and alert rules that watch the slot this role creates.
Subscription Sync Procedures — binding the provisioned slot to a publication with create_slot = false.
