Async Monitoring Integration

Asynchronous monitoring is the observability layer of Logical Replication Setup & Management: a dedicated collector that polls slot, sender, and subscription state on its own schedule and exports it to a time-series backend, so telemetry never rides the same connection or lock path as the change stream itself. This page covers how to build that collector against PostgreSQL 14–17, which system views to scrape, the exact byte and time thresholds to alert on, and the failure signatures that separate a healthy pipeline from a disk that fills at 3 a.m.

Get this wrong and the failure is silent by construction. A stalled consumer keeps a replication slot pinned, restart_lsn freezes, and the primary retains WAL indefinitely while still accepting writes. There is no error and no degraded query, nothing until pg_wal exhausts the filesystem and the primary halts. Synchronous health probes make it worse: a health check that opens a transaction against pg_stat_activity on every scrape adds backend churn, inflates idle in transaction counts, and can itself hold back cleanup, because a lingering snapshot delays vacuum and an open transaction that has written keeps a logical slot's restart_lsn from advancing. Decoupling the collector from the data path is what makes monitoring safe to run at 1–5 second cadence in production.

Prerequisites & Configuration Objects

The collector observes the same objects created during publication and slot setup, so those must exist and be streaming before monitoring adds value. Beyond that, the monitoring role and a handful of server parameters must be configured to keep scraping cheap and non-blocking.

The role needs read access to the statistics views and nothing else. On PostgreSQL 10+ the built-in pg_monitor role grants exactly the right set — pg_read_all_stats, pg_read_all_settings, and access to the replication views — without handing the collector REPLICATION or table privileges.

sql

-- Least-privilege monitoring role: read-only stats, no REPLICATION, no table access.
CREATE ROLE mon LOGIN PASSWORD 'use-a-secret-manager';
GRANT pg_monitor TO mon;                       -- PG 10+: pg_read_all_stats + settings + repl views

-- Fence every session this role opens so a hung scrape cannot pin WAL or a lock.
ALTER ROLE mon SET statement_timeout = '5s';
ALTER ROLE mon SET idle_in_transaction_session_timeout = '10s';
ALTER ROLE mon SET default_transaction_read_only = on;

Server-side, three settings determine whether monitoring is a safety net or a liability. track_commit_timestamp must be on if you want to derive apply latency in wall-clock time on the subscriber (it requires a restart). max_slot_wal_keep_size is not a monitoring parameter as such, but its value is the denominator for your most important alert — you page at a fraction of it, so it has to be a known, finite number. And the collector’s own connections should come from a small, capped pool so a scrape storm can never exhaust max_connections.

sql

-- Publisher: bound WAL a slot can pin, and enable commit timestamps for latency math.
ALTER SYSTEM SET max_slot_wal_keep_size = '20GB';   -- PG 13+: finite denominator for lag alerts; applied on reload
ALTER SYSTEM SET track_commit_timestamp = on;       -- takes effect only after a server restart
SELECT pg_reload_conf();                            -- applies max_slot_wal_keep_size; does not enable track_commit_timestamp

Object / setting	Where	Required value	Why the collector needs it
`pg_monitor` role grant	publisher + subscriber	granted to `mon`	Read `pg_stat_replication`, slot state, and settings with no write surface
`statement_timeout`	monitoring role	`5s`	A wedged scrape self-aborts instead of holding a snapshot open
`idle_in_transaction_session_timeout`	monitoring role	`10s`	Prevents idle scrape transactions from blocking WAL cleanup
`max_slot_wal_keep_size`	publisher	finite (e.g. `20GB`)	The value your slot-lag alert is measured against
`track_commit_timestamp`	publisher	`on`	Enables time-based apply lag on the subscriber
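
The "page at a fraction of max_slot_wal_keep_size" rule can be sketched as a pure function. This is a minimal illustration, not part of the collector above: the helper names `parse_pg_size` and `slot_alert_level` are hypothetical, and the 50% and 80% fractions are assumptions to tune for your own disk headroom.

```python
# Hypothetical helpers: the names and the 50% / 80% fractions are illustrative assumptions.
_UNITS = {"B": 1, "kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_pg_size(value: str) -> int:
    """Convert a size string as printed by SHOW (for example '20GB') into bytes."""
    text = value.strip()
    # Try two-letter units first so 'kB' is not mistaken for a bare 'B'.
    for unit in sorted(_UNITS, key=len, reverse=True):
        if text.endswith(unit):
            size = int(text[: -len(unit)]) * _UNITS[unit]
            if size <= 0:
                break
            return size
    # '-1' (unlimited) and zero land here: there is no usable denominator.
    raise ValueError(f"not a finite size setting: {value!r}")


def slot_alert_level(retained_bytes: int, keep_size: str,
                     warn_at: float = 0.5, page_at: float = 0.8) -> str:
    """Grade retained WAL as a fraction of max_slot_wal_keep_size."""
    ratio = retained_bytes / parse_pg_size(keep_size)
    if ratio >= page_at:
        return "page"
    if ratio >= warn_at:
        return "warn"
    return "ok"
```

Rejecting `-1` outright mirrors the point made above: with an unlimited setting there is nothing to measure the alert against.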

Step-by-Step Implementation

The collector is a long-lived async worker that opens a tiny connection pool, scrapes a fixed query set on an interval, and pushes gauges to your time-series backend. The pattern below uses asyncpg and prometheus_client, but the shape maps cleanly onto an OpenTelemetry exporter or a Datadog agent check.

1. Open a small, read-only pool — not one connection per scrape. Persistent connections through a bounded pool avoid the connect/auth cost on every interval and cap the collector’s footprint on the primary.

python

import asyncio
import asyncpg
from prometheus_client import Gauge, start_http_server

DSN = "postgresql://mon@pub.internal:5432/app?sslmode=verify-full"

async def make_pool() -> asyncpg.Pool:
    return await asyncpg.create_pool(
        dsn=DSN,
        min_size=1,
        max_size=3,                     # hard cap: collector never eats > 3 backends
        command_timeout=5,              # mirror statement_timeout server-side
        server_settings={"application_name": "lr_monitor"},
    )

2. Define the gauges once. Model each metric on the state that actually causes incidents: bytes of WAL pinned, unconfirmed bytes, apply lag in seconds, and a liveness flag per slot.

python

SLOT_RETAINED = Gauge("pg_repl_slot_retained_bytes", "WAL bytes pinned by slot", ["slot"])
SLOT_ACTIVE   = Gauge("pg_repl_slot_active",         "1 if slot has a live consumer", ["slot"])
SLOT_STATUS   = Gauge("pg_repl_slot_wal_status",     "0 reserved 1 extended 2 unreserved 3 lost", ["slot"])
APPLY_LAG_S   = Gauge("pg_repl_apply_lag_seconds",   "flush_lag on the publisher", ["client"])
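SUB_ERRORS    = Gauge("pg_repl_sub_apply_errors",    "apply_error_count", ["subscription"])
# SUB_ERRORS is fed from pg_stat_subscription_stats on the subscriber (PG 15+); the publisher scrape never sets it.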

3. Scrape in a READ ONLY transaction. Wrapping the read set in one short read-only transaction gives a consistent snapshot across the views and guarantees the collector can never issue a write, even by accident.

python

_STATUS = {"reserved": 0, "extended": 1, "unreserved": 2, "lost": 3}

async def scrape(pool: asyncpg.Pool) -> None:
    async with pool.acquire() as conn:
        async with conn.transaction(readonly=True):
            slots = await conn.fetch(
                """
                SELECT slot_name, active, wal_status,
                       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)        AS retained,
                       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS unconfirmed
                FROM pg_replication_slots
                WHERE slot_type = 'logical'
                """
            )
            senders = await conn.fetch(
                """
                SELECT application_name AS client,
                       EXTRACT(EPOCH FROM flush_lag) AS flush_lag_s
                FROM pg_stat_replication
                """
            )

    for r in slots:
        SLOT_RETAINED.labels(r["slot_name"]).set(r["retained"] or 0)
        SLOT_ACTIVE.labels(r["slot_name"]).set(1 if r["active"] else 0)
        SLOT_STATUS.labels(r["slot_name"]).set(_STATUS.get(r["wal_status"], 3))
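    for r in senders:
        # flush_lag is NULL when no traffic is flowing; export NaN rather than a misleading 0.
        lag = r["flush_lag_s"]
        APPLY_LAG_S.labels(r["client"]).set(float("nan") if lag is None else lag)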

4. Drive the loop with backoff and jitter. A fixed interval with exponential backoff on failure prevents a brief publisher blip from turning into a reconnect storm, and jitter stops every collector replica scraping on the same tick.

python

import random

async def run(interval: float = 5.0) -> None:
    start_http_server(9187)                       # Prometheus scrape target
    pool = await make_pool()
    backoff = interval
    while True:
        try:
            await scrape(pool)
            backoff = interval
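        # InterfaceError covers client-side connection faults; asyncio.TimeoutError covers
        # command_timeout, which is not an OSError subclass before Python 3.11.
        except (asyncpg.PostgresError, asyncpg.InterfaceError, OSError, asyncio.TimeoutError) as exc: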
            backoff = min(backoff * 2, 60)        # cap at 60 s
            print(f"scrape failed: {exc!r}; retry in {backoff:.1f}s")
        await asyncio.sleep(backoff + random.uniform(0, interval * 0.2))

if __name__ == "__main__":
    asyncio.run(run())

5. Recycle connections and fail loud, not silent. Configure the pool to retire backends periodically (or run a supervisor that restarts the worker), and route three consecutive scrape failures to your incident channel rather than swallowing them — a collector that dies quietly is worse than no collector, because dashboards go stale-green.
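
One way to implement the "three consecutive failures" rule is a small counter that escalates once when the threshold is crossed and resets on the next success. This is a sketch under assumptions: `FailureTracker` and its `notify` hook are hypothetical names, and the hook stands in for whatever pager or chat integration you use. For connection recycling, asyncpg's `create_pool` accepts `max_inactive_connection_lifetime` and `max_queries`, which are the options to look at first.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailureTracker:
    """Escalate once after `threshold` consecutive scrape failures."""
    notify: Callable[[str], None]     # hypothetical hook into the incident channel
    threshold: int = 3
    consecutive: int = 0

    def record_success(self) -> None:
        # Any successful scrape clears the streak.
        self.consecutive = 0

    def record_failure(self, exc: BaseException) -> None:
        self.consecutive += 1
        # Fire exactly when the streak reaches the threshold, not on every later failure.
        if self.consecutive == self.threshold:
            self.notify(f"{self.consecutive} consecutive scrape failures; last: {exc!r}")
```

In the loop from step 4, `record_success()` would follow a successful `scrape(pool)` and `record_failure(exc)` would sit in the `except` branch next to the backoff update.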

Parameter Reference Table

These are the knobs that decide whether the collector is cheap and safe or a source of load and false confidence. Defaults are the PostgreSQL or client-library defaults; the “monitoring behavior” column is what the value means specifically for a scraping workload.

Parameter	Layer	Default	Monitoring behavior
`scrape interval`	collector	—	1–5 s is safe with a bounded pool; sub-second adds load without new signal
`max_size` (pool)	collector	driver-specific	Hard ceiling on backends the collector consumes; keep at 2–3
`command_timeout`	collector	none	Client-side abort; set equal to server `statement_timeout` (5 s)
`statement_timeout`	monitoring role	`0` (off)	Server-side kill switch for a hung scrape; set `5s` at role level
`idle_in_transaction_session_timeout`	monitoring role	`0` (off)	Prevents an idle scrape txn from pinning WAL; set `10s`
`track_commit_timestamp`	server	`off`	Must be `on` for wall-clock apply latency; restart required
`max_slot_wal_keep_size`	server	`-1` (unlimited)	Finite value = the denominator for the retained-bytes alert (PG 13+)
`flush_lag`	`pg_stat_replication`	—	Time-based flush lag; primary latency SLI, NULL when no traffic is flowing
`apply_error_count`	`pg_stat_subscription_stats`	—	Durable error counter (PG 15+); survives worker restarts

Diagnostic Queries

These are copy-paste queries for the collector and for on-call. Each carries the threshold that should trip an alert. The mechanics of restart_lsn versus confirmed_flush_lsn — why a slot pins WAL and how the cursor advances — are detailed in WAL stream mechanics.

sql

-- Slot lag and health, per consumer.
-- ALERT: retained > 10 GB (or > 50% of max_slot_wal_keep_size).
-- WARN on wal_status = 'extended' (past max_wal_size); PAGE on 'unreserved' or 'lost'.
SELECT slot_name,
       active,
       wal_status,                                              -- reserved | extended | unreserved | lost
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))         AS retained,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS unconfirmed
FROM pg_replication_slots
WHERE slot_type = 'logical'
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

sql

-- Publisher-side send/flush/replay lag.
-- ALERT: write_lag or flush_lag > 60 s sustained; state <> 'streaming'.
SELECT application_name, client_addr, state,
       write_lag, flush_lag, replay_lag,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS apply_backlog_bytes
FROM pg_stat_replication;

sql

-- Subscriber-side worker liveness and durable error counters.
-- ALERT: pid IS NULL for an expected worker (down), or apply_error_count increasing.
SELECT s.subname,
       ss.pid,                                    -- NULL pid = apply worker not running
       st.apply_error_count,
       st.sync_error_count,
       st.stats_reset
FROM pg_subscription s
LEFT JOIN pg_stat_subscription      ss ON ss.subname = s.subname
LEFT JOIN pg_stat_subscription_stats st ON st.subname = s.subname;   -- stats view: PG 15+

sql

-- Catch monitoring sessions that overstayed and could themselves pin WAL.
-- ALERT: any lr_monitor session in 'idle in transaction' for > 10 s.
SELECT pid, state, wait_event_type, wait_event,
       now() - xact_start AS txn_age
FROM pg_stat_activity
WHERE application_name = 'lr_monitor'
  AND state = 'idle in transaction';

The single highest-priority signal on the whole system is a slot with active = false and retained climbing — it is the direct precursor to disk exhaustion. Alert on the transition into unreserved; by the time wal_status reads lost, the slot is invalidated and the consumer must be re-seeded from a fresh subscription sync.
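
Detecting "inactive and climbing" needs two consecutive samples, because a single reading cannot show growth. The sketch below assumes the collector keeps the previous scrape per slot; `SlotSample` and `classify_slot` are hypothetical names and the return labels are illustrative.

```python
from typing import NamedTuple, Optional


class SlotSample(NamedTuple):
    slot_name: str
    active: bool
    wal_status: Optional[str]     # reserved | extended | unreserved | lost | None
    retained_bytes: int


def classify_slot(prev: Optional[SlotSample], cur: SlotSample) -> str:
    """Grade one slot from the previous and the current scrape."""
    if cur.wal_status == "lost":
        return "reseed"           # slot invalidated; the consumer needs a fresh sync
    if cur.wal_status == "unreserved":
        return "page"             # required WAL may be removed at the next checkpoint
    if prev is not None and not cur.active and cur.retained_bytes > prev.retained_bytes:
        return "page"             # no consumer attached and the pinned WAL is growing
    return "ok"
```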

Failure Modes & Gotchas

The monitor becomes the stall. Signature: pg_wal grows, and the slot pinning it belongs to no CDC consumer; it is a leftover from the collector, or the collector left an idle in transaction session open. Root cause: a scrape opened a transaction and the process was killed mid-flight, or a health check accidentally created a temporary slot. Remediation: enforce idle_in_transaction_session_timeout and statement_timeout at the role level (not per query, so they cannot be forgotten), and never let a monitoring path call pg_create_logical_replication_slot or pg_logical_slot_get_changes. The latter consumes changes and advances the slot; pg_logical_slot_peek_changes leaves the slot position untouched but still runs a full decode, so it does not belong in a scrape either. Slots are created deliberately during initializing replication slots, never as a monitoring side effect.

Stale-green dashboards after the collector dies. Signature: every panel shows the last-known-good value, alerts stay quiet, but replication has actually been broken for an hour. Root cause: the collector crashed and its gauges froze at their last scrape. Remediation: emit a heartbeat/up metric and alert on its absence, track scrape age so a metric older than a few intervals reads as unknown, not healthy, and check stats_reset so a counter that was reset is not read as a recovery. Absence of data must page as loudly as bad data.
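
The scrape-age rule can be expressed as a small freshness check. This is a sketch with assumed names (`metric_state`, `max_missed`); in a Prometheus setup the same idea is usually a gauge holding the last successful scrape time plus an alert on its age, and `prometheus_client`'s `Gauge.set_to_current_time()` is one way to record it.

```python
import time
from typing import Optional


def metric_state(last_success_ts: Optional[float], interval_s: float = 5.0,
                 max_missed: int = 3, now: Optional[float] = None) -> str:
    """Return 'fresh' only if a scrape succeeded within the last few intervals."""
    if last_success_ts is None:
        return "unknown"          # the collector has never reported
    current = time.time() if now is None else now
    age = current - last_success_ts
    # Anything older than max_missed intervals is treated as missing data, not as healthy.
    return "fresh" if age <= interval_s * max_missed else "unknown"
```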

Time-based lag reads zero on an idle system. Signature: write_lag/flush_lag are NULL or 00:00:00 and someone concludes replication is instant. Root cause: pg_stat_replication lag columns are populated from feedback on actual traffic; with no writes flowing there is nothing to measure. Remediation: alert on byte-based backlog (pg_wal_lsn_diff) as the primary signal and treat time-based lag as a secondary, traffic-dependent view. Do not build the only alert on a column that is legitimately zero at 3 a.m.
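
A sketch of that ordering, with the byte backlog evaluated first and a missing time lag treated as "no measurement" rather than zero. The function name and return labels are hypothetical; the 10 GB and 60 s limits are the thresholds quoted with the diagnostic queries.

```python
from typing import Optional

BYTE_LIMIT = 10 * 1024**3     # 10 GB of backlog
LAG_LIMIT_S = 60.0            # 60 s of flush lag


def lag_signal(backlog_bytes: int, flush_lag_s: Optional[float]) -> str:
    """Combine byte-based and time-based lag without trusting a NULL as zero."""
    if backlog_bytes > BYTE_LIMIT:
        return "alert: byte backlog"
    if flush_lag_s is None:
        # Idle system: the view reports NULL, so only the byte signal is meaningful.
        return "ok: time lag unknown"
    if flush_lag_s > LAG_LIMIT_S:
        return "alert: time lag"
    return "ok"
```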

Wrong-node scraping in a failover topology. Signature: after a promotion the collector reports zero slots and green health while the new primary is silently retaining WAL. Root cause: the collector’s DSN pointed at a fixed host that is now a standby, or (pre-PG 17) the logical slots never existed on the promoted node. Remediation: point the collector at the same service endpoint the consumers use, and on PostgreSQL 17 pair slot failover (failover = true with sync_replication_slots) with a scrape that verifies expected slot names exist on whichever node is currently primary.
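
The post-failover check reduces to two facts the scrape already has access to: whether the node is in recovery (`pg_is_in_recovery()`) and which slot names it reports. A minimal sketch, assuming the expected slot names come from your own configuration; `topology_problems` is a hypothetical helper.

```python
from typing import Iterable, List


def topology_problems(in_recovery: bool,
                      expected_slots: Iterable[str],
                      observed_slots: Iterable[str]) -> List[str]:
    """List reasons the scraped node cannot be trusted as the current primary."""
    problems = []
    if in_recovery:
        problems.append("collector is connected to a standby")
    # Slots that should exist but were not returned by pg_replication_slots.
    missing = sorted(set(expected_slots) - set(observed_slots))
    if missing:
        problems.append("missing slots: " + ", ".join(missing))
    return problems
```

An empty list means the node is a primary and carries every expected slot; anything else should page rather than report green.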

Scrape amplification under many slots. Signature: the collector’s own load shows up in pg_stat_activity and interval drift grows as slot count rises. Root cause: per-slot subqueries or one connection per metric multiply backend churn. Remediation: keep the read set to a handful of set-based queries over the views (as above), scrape them in one transaction, and hold max_size at 2–3 — the view scans are cheap; the cost is connection and snapshot overhead, so minimize both.

Integration Touchpoints

Async monitoring closes the loop that the rest of the setup opens. The slot and lag metrics here are the operational readout of the objects defined when creating publications and reserving cursors during slot initialization; the byte thresholds you alert on are the practical enforcement of the retention policy chosen in replication slot types.

For downstream event-driven pipelines the observability responsibility splits cleanly. The SQL views on this page cover everything up to the WAL sender; once changes cross into the streaming layer, consumer lag moves to the Kafka event routing integration, where a Debezium connector exposes its own connector and task metrics that reuse the exact slot this collector watches. Correlating pg_replication_slots.confirmed_flush_lsn against the connector’s committed offset is how you localize a stall to the database, the connector, or the broker. The privilege model for the monitoring role — and why it must never share credentials with the replication role — follows the same least-privilege rules laid out in security boundaries and permissions, and the decoding internals the metrics reflect are documented across the logical replication architecture fundamentals.
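
Correlating the two positions means putting them on one scale. An LSN prints as two hexadecimal halves (`high/low`), which convert to a single 64-bit byte position. The sketch below assumes the connector's last committed position is already available as an integer LSN; how you obtain it depends on the connector, and `lag_breakdown` is a hypothetical helper that only reports the gaps and leaves interpretation to the operator.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_breakdown(current_wal_lsn: str, confirmed_flush_lsn: str,
                  connector_lsn: int) -> dict:
    """Report byte gaps between the WAL head, the slot, and the connector offset."""
    head = lsn_to_int(current_wal_lsn)
    confirmed = lsn_to_int(confirmed_flush_lsn)
    return {
        # How far the slot's confirmed position trails the publisher.
        "wal_head_minus_confirmed": head - confirmed,
        # How far the connector's own offset differs from what the slot has confirmed.
        "connector_minus_confirmed": connector_lsn - confirmed,
    }
```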

Frequently Asked Questions

How often should the collector scrape replication state?

A 1–5 second interval is safe when scrapes run through a bounded pool (max_size 2–3) and each is a single read-only transaction over the statistics views. The view scans are inexpensive; the real cost is connection and snapshot overhead, so favour fewer, set-based queries over frequent per-slot polling. Sub-second intervals add load without surfacing signal that a 1-second cadence misses.

Should I alert on byte-based lag or time-based lag?

Make byte-based lag (pg_wal_lsn_diff on restart_lsn) the primary alert, because it is always meaningful and maps directly to disk risk against max_slot_wal_keep_size. Time-based lag (write_lag/flush_lag) is a useful latency SLI but is populated only when traffic is flowing, so it legitimately reads zero on an idle system and must never be the sole signal.

Can the monitoring queries themselves stall replication?

Yes, if they open transactions that linger. An idle in transaction session holds back the xmin horizon, and if it has written anything it also keeps a logical slot's restart_lsn from advancing, which blocks WAL cleanup. Any code path that calls pg_logical_slot_get_changes consumes changes and advances the slot. Enforce statement_timeout and idle_in_transaction_session_timeout at the role level, keep every scrape READ ONLY, and never let a monitoring path create or advance a slot.

Related guides

Creating publications: the exposure boundary whose throughput and filter selectivity you monitor here.
Initializing replication slots: the durable cursors whose restart_lsn lag is the core metric on this page.
Subscription sync procedures: recover a consumer once monitoring shows a slot invalidated or worker down.
Kafka event routing integration: where consumer-lag observability continues past the WAL sender.
Monitoring & alerting: the off-the-shelf postgres_exporter, Grafana, and Alertmanager stack that complements this bespoke collector.
Logical Replication Setup & Management: the parent guide this monitoring layer completes.
