Python CDC Parser Development

Building a custom Change Data Capture parser in Python means reading PostgreSQL’s replication protocol directly — subscribing to a logical slot, decoding BEGIN/RELATION/INSERT/UPDATE/DELETE/COMMIT messages, and driving your own delivery guarantees — instead of delegating that work to the CDC pipeline built on Kafka Connect. This reference covers the exact server configuration, the step-by-step parser implementation, the parameter matrix, the diagnostic queries, and the failure signatures required to run a Python-native decoder safely on PostgreSQL 14 through 17.

Teams reach for a hand-written parser to strip out JVM and Kafka Connect overhead, to enforce a strict data contract in Python before anything leaves the process, or to land changes into a lightweight sink — Redis, SQLite, an HTTP webhook, a columnar file — where a full Debezium connector is disproportionate. The trade is ownership: the moment you bypass the managed connector, you own backpressure, slot lifecycle, and crash recovery. Get it wrong and the failure is not a dropped event — it is a stalled consumer that pins restart_lsn, so pg_wal grows unbounded until the primary exhausts disk and refuses new writes. A Python parser is a small amount of code wrapped around a large amount of operational discipline, and this page is about the discipline.

The parser as a state machine: the dispatcher routes each pgoutput message by its one-byte tag, and the COMMIT handler enforces the invariant that send_feedback fires only after the sink confirms a durable write — advancing restart_lsn so PostgreSQL can reclaim WAL.

Prerequisites & Configuration Objects

A Python parser consumes the same server-side objects as any other logical consumer, so the logical decoding subsystem must be enabled and provisioned before the client connects. Three areas have to be correct: server GUCs, the publication and slot, and the connecting role’s privileges.

Server parameters (primary, postgresql.conf). wal_level cannot be changed without a restart, so plan for it:

sql

-- Requires a restart. Everything downstream depends on this.
ALTER SYSTEM SET wal_level = 'logical';
-- One slot + one walsender per independent consumer, plus headroom.
ALTER SYSTEM SET max_replication_slots = '10';
ALTER SYSTEM SET max_wal_senders = '10';
-- Cap the reorder-buffer memory a single decoding session holds before spilling.
ALTER SYSTEM SET logical_decoding_work_mem = '256MB';
-- PG 13+: bound how much WAL is retained for a lagging slot. -1 (the default) = unlimited (dangerous).
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';

max_slot_wal_keep_size is the single most important guardrail for a hand-built parser. Left at -1 (the default), a stalled Python process retains WAL indefinitely and can take the primary down with it. Set to 10GB, PostgreSQL invalidates the slot once retained WAL exceeds the cap: you trade a resnapshot for a healthy primary, which is almost always the right trade in production.

Publication and slot. The publication defines which tables and operations reach the stream; see creating publications for the full object model. The parser reads through a logical replication slot, which is what guarantees WAL retention until you acknowledge an LSN:

sql

-- Narrow the change stream to exactly the tables you decode.
CREATE PUBLICATION cdc_py FOR TABLE public.orders, public.line_items;

-- Full-row images so UPDATE/DELETE carry every column, not just the key.
ALTER TABLE public.orders     REPLICA IDENTITY FULL;
ALTER TABLE public.line_items REPLICA IDENTITY FULL;

-- Create the slot with pgoutput. Provisioning is covered in depth under
-- replication slot initialization.
SELECT * FROM pg_create_logical_replication_slot('cdc_py_slot', 'pgoutput');

Slot provisioning, snapshot export, and the restart_lsn/confirmed_flush_lsn distinction are covered in detail under initializing replication slots; the parser inherits whatever consistent point the slot was created at.

Role privileges. The connecting role needs REPLICATION and must be able to read the published tables. Keep it least-privilege — the full model is in security boundaries and permissions:

sql

CREATE ROLE cdc_reader WITH LOGIN REPLICATION PASSWORD 'use-a-secret-manager';
GRANT SELECT ON public.orders, public.line_items TO cdc_reader;

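The client side needs psycopg2 built against libpq. psycopg2’s LogicalReplicationConnection handles the replication protocol framing so you decode messages rather than raw socket bytes. psycopg 3 does not, at the time of writing, appear to ship an equivalent replication cursor, so the examples on this page assume psycopg2.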

Step-by-Step Implementation

The parser is a state machine over the pgoutput message stream. The contract is absolute: buffer every change under an open transaction, and only after the sink confirms the whole batch is durable do you advance the slot with send_feedback. Advancing early is how you silently lose data on a crash.

1. Open a replication connection and start the stream. Pass pgoutput options explicitly — never rely on implicit defaults:

python

import psycopg2
from psycopg2.extras import LogicalReplicationConnection

conn = psycopg2.connect(
    "host=db port=5432 dbname=app user=cdc_reader password=... "
    "connect_timeout=10",
    connection_factory=LogicalReplicationConnection,
)
cur = conn.cursor()

cur.start_replication(
    slot_name="cdc_py_slot",
    decode=False,                       # pgoutput is binary; decode it yourself
    options={
        "proto_version": "1",          # "2" enables streaming of in-progress txns (PG 14+)
        "publication_names": "cdc_py",
    },
    status_interval=10,                 # heartbeat cadence, seconds
)

2. Route messages through a dispatcher. Every pgoutput message begins with a one-byte type tag. Buffer DML under the open transaction; treat RELATION as schema metadata, not data:

python

class Parser:
    def __init__(self):
        self.relations = {}   # relation_id -> column metadata (schema cache)
        self.buffer = []      # rows accumulated for the open transaction
        self.begin_lsn = None

    def dispatch(self, payload: memoryview):
        tag = chr(payload[0])
        if tag == "B":        # BEGIN
            self.buffer.clear()
            self.begin_lsn = read_begin_lsn(payload)
        elif tag == "R":      # RELATION (schema)
            rel = parse_relation(payload)
            self.relations[rel.id] = rel
        elif tag in ("I", "U", "D"):   # INSERT / UPDATE / DELETE
            row = parse_dml(tag, payload, self.relations)
            self.buffer.append(row)
        elif tag == "C":      # COMMIT
            return self.on_commit(payload)   # -> commit_lsn to acknowledge
        # 'O' (ORIGIN), 'T' (TRUNCATE), 'Y' (TYPE) handled as needed
        return None

The message wire format — column flags, tuple encoding, TOAST sentinels — is where most of the real work lives; the byte-level decoding of each tag is covered in parsing pgoutput format with psycopg2, and the surrounding plugin architecture in building a Python logical decoding plugin.

3. Persist on COMMIT, then acknowledge — in that order. This ordering is the entire correctness argument for the parser:

python

    def on_commit(self, payload) -> int:
        commit_lsn = read_commit_lsn(payload)
        # 1) Serialize + persist the whole transaction atomically.
        sink.write_batch(self.buffer, commit_lsn)   # raises on failure
        # 2) ONLY now is it safe to let PostgreSQL free the WAL.
        self.buffer.clear()
        return commit_lsn
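The on_commit handler relies on a sink whose write_batch either persists the whole transaction or raises. One way that contract could look is sketched below for SQLite, one of the lightweight sinks mentioned earlier. SqliteSink, its table names, and the JSON row encoding are all assumptions made for illustration; the point is only that the rows and the commit LSN land in the same local transaction.

```python
import json
import sqlite3


class SqliteSink:
    """Hypothetical sink: rows and checkpoint LSN commit together or not at all."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS changes (lsn INTEGER, body TEXT)")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoint "
            "(id INTEGER PRIMARY KEY CHECK (id = 1), lsn INTEGER)"
        )
        self.db.commit()

    def last_lsn(self) -> int:
        """Return the last durably stored commit LSN, or 0 if none yet."""
        row = self.db.execute("SELECT lsn FROM checkpoint WHERE id = 1").fetchone()
        return row[0] if row else 0

    def write_batch(self, rows, commit_lsn: int) -> bool:
        """Persist one transaction; return False if it was already stored."""
        if commit_lsn <= self.last_lsn():
            return False  # replay after a restart: skip, but still safe to ack
        # Serialize first so a bad row fails before any write begins.
        encoded = [(commit_lsn, json.dumps(row)) for row in rows]
        with self.db:  # commits on success, rolls back if anything raises
            self.db.executemany("INSERT INTO changes (lsn, body) VALUES (?, ?)", encoded)
            # SQLite INTEGER is signed 64-bit; LSNs above 2**63 - 1 would not fit.
            self.db.execute(
                "INSERT OR REPLACE INTO checkpoint (id, lsn) VALUES (1, ?)",
                (commit_lsn,),
            )
        return True
```

On startup the parser can read last_lsn() to learn where the sink actually is, which keeps the sink, not the slot, as the single source of truth for recovery.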

4. Drive the read loop and send feedback. read_message() is non-blocking, so when it returns nothing, wait on the socket with select() and a timeout so keepalives still fire on an idle slot. Acknowledge the LSN the sink has persisted, not the position of the message you happen to be reading, so the slot’s confirmed_flush_lsn, and with it restart_lsn, can advance:

python

import select

KEEPALIVE_SECONDS = 10   # keep well under wal_sender_timeout

parser = Parser()
while True:
    msg = cur.read_message()   # non-blocking; returns None when nothing is pending
    if msg:
        ack_lsn = parser.dispatch(msg.payload)
        if ack_lsn is not None:
            # The sink has confirmed the transaction. flush_lsn tells
            # PostgreSQL it may reclaim WAL up to here.
            cur.send_feedback(flush_lsn=ack_lsn, reply=True)
    else:
        # No data: block on the socket, but wake to send a keepalive.
        if not select.select([cur], [], [], KEEPALIVE_SECONDS)[0]:
            cur.send_feedback()   # keepalive; prevents idle-slot disconnects

5. Maintain the schema cache. PostgreSQL re-emits a RELATION message whenever a table’s shape changes and before the first row that uses the new shape. Trust the stream: overwrite self.relations[rel.id] on every RELATION, and never cache column metadata across a process restart — the first message after reconnect re-establishes it. A stale cache is the classic cause of “column count mismatch” decode errors after an ALTER TABLE.
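To make the cache concrete, here is a sketch of a parse_relation that could back the dispatcher, assuming the protocol-version-1 RELATION layout from the PostgreSQL documentation (relation OID, namespace, name, replica identity, column count, then flags, name, type OID, and type modifier per column). The Relation and Column dataclasses are hypothetical, and decoding names as UTF-8 assumes a UTF-8 client encoding.

```python
import struct
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Column:
    name: str
    type_oid: int
    type_mod: int
    is_key: bool


@dataclass
class Relation:
    id: int
    namespace: str
    name: str
    replica_identity: str
    columns: List[Column]


def _cstring(buf: bytes, pos: int) -> Tuple[str, int]:
    """Read a NUL-terminated string; return it and the offset just past it."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1


def parse_relation(payload) -> Relation:
    """Decode a RELATION ('R') message (non-streaming layout, no leading xid)."""
    buf = bytes(payload)
    if buf[:1] != b"R":
        raise ValueError(f"expected RELATION, got tag {buf[:1]!r}")
    (rel_id,) = struct.unpack_from("!I", buf, 1)
    namespace, pos = _cstring(buf, 5)
    name, pos = _cstring(buf, pos)
    identity, ncols = struct.unpack_from("!cH", buf, pos)
    pos += 3
    columns = []
    for _ in range(ncols):
        flags = buf[pos]                       # bit 0 set: column is part of the key
        col_name, pos = _cstring(buf, pos + 1)
        type_oid, type_mod = struct.unpack_from("!Ii", buf, pos)
        pos += 8
        columns.append(Column(col_name, type_oid, type_mod, bool(flags & 1)))
    return Relation(rel_id, namespace, name, identity.decode("ascii"), columns)
```

Because the result is keyed by relation OID, assigning it into the cache on every RELATION message naturally replaces the previous shape after an ALTER TABLE.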

Parameter Reference Table

Parameter / option	Where set	Default	Behavior for a Python parser
`wal_level`	server GUC	`replica`	Must be `logical`; requires a restart. No logical decoding otherwise.
`max_replication_slots`	server GUC	`10`	Each parser instance consumes one slot. Cap slots to bound WAL retention risk.
`max_wal_senders`	server GUC	`10`	One walsender per active replication connection.
`logical_decoding_work_mem`	server GUC	`64MB`	Reorder-buffer memory before spilling to `pg_replslot/<slot>/`. Raise to `256MB` for large transactions.
`max_slot_wal_keep_size`	server GUC (PG 13+)	`-1` (unlimited)	Hard cap on WAL retained for a lagging slot. Set to a finite value (e.g. `10GB`) so a stalled parser can’t fill the disk.
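`proto_version`	`start_replication` option	— (required)	pgoutput expects it to be passed explicitly. `1` is the baseline; `2` (PG 14+) additionally permits the `streaming` option, which delivers in-progress transactions incrementally and bounds memory on huge commits.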
`publication_names`	`start_replication` option	—	Comma-separated publications the slot decodes. Must match a `CREATE PUBLICATION`.
`status_interval`	`start_replication` arg	`10`	Seconds between automatic keepalive feedback. Keep ≤ `wal_sender_timeout / 3`.
`flush_lsn`	`send_feedback` arg	—	LSN up to which WAL may be reclaimed. Advance ONLY after the sink confirms durability.
`wal_sender_timeout`	server GUC	`60s`	Server drops the connection if no feedback arrives in this window. Heartbeat well under it.

Diagnostic Queries

Watch the slot from the server side; the client’s health is only half the picture. These map directly onto the dashboards in async monitoring integration.

Slot lag and liveness. active = false on a slot that should be streaming means your parser died or never reconnected:

sql

SELECT
  slot_name,
  active,
  restart_lsn,
  confirmed_flush_lsn,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_name = 'cdc_py_slot';
-- Alert when retained_wal climbs past ~25% of max_slot_wal_keep_size, or when
-- active = false for more than 60 s on a slot expected to be live.
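The alert rule in the comment above can be evaluated client-side once the query result is in hand. This is a sketch only: parse_lsn and slot_alerts are hypothetical helpers, the 25% and 60 s thresholds are taken from the comment, and the caller is assumed to track how long the slot has been inactive.

```python
from typing import List


def parse_lsn(text: str) -> int:
    """Convert PostgreSQL's 'X/Y' hexadecimal LSN text into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_alerts(
    active: bool,
    current_lsn: str,
    restart_lsn: str,
    keep_size_bytes: int,
    inactive_seconds: float,
    wal_fraction: float = 0.25,
    max_inactive_seconds: float = 60.0,
) -> List[str]:
    """Return the names of the alert conditions that currently hold."""
    alerts = []
    retained = parse_lsn(current_lsn) - parse_lsn(restart_lsn)
    if retained > keep_size_bytes * wal_fraction:
        alerts.append("retained_wal")
    if not active and inactive_seconds > max_inactive_seconds:
        alerts.append("slot_inactive")
    return alerts
```

For example, with max_slot_wal_keep_size at 10GB, a slot whose restart_lsn trails the current WAL position by 4 GiB is already past the 25% line and would raise retained_wal.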

Live sender throughput. Confirms the walsender is actually shipping bytes and shows how far behind the flushed position is:

sql

SELECT
  application_name,
  state,
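  pg_size_pretty(pg_wal_lsn_diff(sent_lsn, flush_lsn))    AS unflushed,
  -- The parser reports only flush_lsn in its feedback, so measure lag against it rather than replay_lsn.
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS total_lag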
FROM pg_stat_replication;

Inspect pending changes without consuming them. peek reads WAL without advancing the slot cursor — the safest way to isolate a parser bug from a slot problem:

sql

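-- Look at what the slot WOULD emit next, without acknowledging anything.
-- Stop the parser first: a slot can be read by only one session at a time.
-- data is bytea, so hex-encode it before truncating for display.
SELECT lsn, xid, left(encode(data, 'hex'), 120) AS preview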
FROM pg_logical_slot_peek_binary_changes(
       'cdc_py_slot', NULL, 10,
       'proto_version', '1', 'publication_names', 'cdc_py');

If peek returns clean, well-formed messages but your consumer chokes on them, the defect is in the Python decoder, not the slot.

Failure Modes & Gotchas

Stalled consumer pins restart_lsn. Signature: pg_wal grows steadily and retained_wal in the query above climbs without bound; active = false if the process died, while a parser that is merely blocked on its sink can still show active = true. Root cause: the parser crashed, or its sink is blocking, so no send_feedback advances the slot. Remediation: set a finite max_slot_wal_keep_size so the server protects itself, add downstream-depth backpressure that pauses reads before the sink saturates, and page on retained_wal well before disk pressure.
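The backpressure remediation can be as simple as a pair of watermarks on the number of changes waiting for the sink. The sketch below is one possible shape, with BackpressureGate and its thresholds being hypothetical names chosen for illustration; it uses hysteresis so the parser does not flap between reading and pausing.

```python
class BackpressureGate:
    """Pause reads at a high watermark; resume only once below a low one."""

    def __init__(self, high: int, low: int):
        if not 0 <= low < high:
            raise ValueError("require 0 <= low < high")
        self.high = high
        self.low = low
        self.paused = False

    def should_read(self, pending: int) -> bool:
        """Return True if the read loop may pull another message."""
        if self.paused and pending <= self.low:
            self.paused = False      # sink has drained enough; resume
        elif not self.paused and pending >= self.high:
            self.paused = True       # sink is saturating; stop pulling
        return not self.paused
```

While the gate is closed the loop should keep calling send_feedback() with no new flush position, so the connection survives wal_sender_timeout without acknowledging anything the sink has not stored.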

Feedback sent before the sink is durable. Signature: after a crash, rows are missing downstream even though the slot advanced past them. Root cause: send_feedback(flush_lsn=...) was called before sink.write_batch confirmed persistence, so PostgreSQL discarded WAL you never actually stored. Remediation: acknowledge strictly after durable write; persist the sink offset and the LSN in one atomic action so recovery resumes from a single source of truth.

Idle slot disconnects. Signature: the connection drops after ~60 s of quiet with terminating walsender process due to replication timeout. Root cause: no feedback reached the server within wal_sender_timeout because the read loop blocked without sending keepalives. Remediation: send send_feedback() on every idle wakeup and keep status_interval at roughly one-third of wal_sender_timeout.

Schema cache desync after DDL. Signature: decode errors (wrong column count, type mismatch) immediately after an ALTER TABLE on a published table. Root cause: a cached RELATION was reused past a schema change. Remediation: overwrite the relation cache on every RELATION message and rebuild it from scratch after any reconnect; never persist it across restarts.

Reorder buffer spills on a bulk transaction. Signature: a nightly bulk UPDATE of several million rows stalls the stream and floods pg_replslot/<slot>/ with spill files before the first change arrives. Root cause: the whole transaction is buffered in memory up to logical_decoding_work_mem, then spilled to disk, and nothing is sent until it commits. Remediation: raise logical_decoding_work_mem to 256MB, and on PG 14+ set proto_version to 2 with the streaming option enabled so in-progress transactions are delivered incrementally rather than buffered whole. A streaming consumer must then also handle the stream start, stop, commit, and abort messages, and must hold streamed changes until the matching stream commit arrives.

Frequently Asked Questions

When should I write a Python parser instead of running Debezium?

Choose a Python parser when you need to eliminate JVM/Kafka Connect footprint, when the sink is lightweight (Redis, SQLite, an HTTP endpoint), or when you want to enforce a data contract inside one process before anything is published. Stay on the Debezium connector when you need snapshotting, schema-registry integration, distributed scaling, and managed offset tracking with well-understood delivery guarantees out of the box; reimplementing those correctly is far more work than the parser itself.

Should I use pg_recvlogical or psycopg2 for consuming the slot?

pg_recvlogical is fine for edge ingestion, quick captures, and shell-driven prototypes because it needs no persistent client library. For production, psycopg2’s LogicalReplicationConnection is preferred: you control feedback timing, backpressure, and error handling in-process rather than shelling out, and you can persist the sink offset and LSN in the same transaction.

How do I recover when the slot is dropped or invalidated?

Treat it as a resnapshot, never as an automatic reconnect. If the slot is invalidated (for example by max_slot_wal_keep_size) or dropped during a failover, take a fresh consistent snapshot of the published tables, reconcile it against your last durably persisted LSN, recreate the slot, and only then resume streaming. Validate that the new slot’s restart_lsn aligns with your recorded offset before trusting the stream — the mechanics mirror subscription sync procedures for built-in logical replication.
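The "resnapshot, never auto-reconnect" rule can be encoded as a small guard that runs before the parser starts streaming. This sketch assumes the caller has fetched wal_status and restart_lsn for the slot from pg_replication_slots (wal_status exists on PG 13+) and passes them as a mapping, or None if the slot row is missing; recovery_action is a hypothetical helper.

```python
from typing import Mapping, Optional


def recovery_action(slot: Optional[Mapping[str, object]]) -> str:
    """Decide whether it is safe to resume streaming from an existing slot."""
    if slot is None:
        return "resnapshot"          # slot was dropped (e.g. during failover)
    if slot.get("wal_status") == "lost":
        return "resnapshot"          # slot was invalidated; required WAL is gone
    if slot.get("restart_lsn") is None:
        return "resnapshot"          # no retained position to resume from
    return "resume"
```

Returning a plain string keeps the decision visible in logs; a "resnapshot" result should stop the process and hand control to the snapshot procedure rather than fall through to start_replication.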

Integration Touchpoints

A Python parser is one stage in a longer flow and only behaves correctly when its neighbors do. Upstream it depends on the WAL stream mechanics that decode committed transactions and on the publication and subscription models that decide which columns ever reach the stream. The raw change envelope it produces is the same one normalized by JSON to Avro transformation before it becomes a contract-governed data product. Downstream, serialized events are partitioned, keyed, and dead-lettered by event routing and Kafka integration, and the health of the whole chain — slot lag, sender throughput, feedback cadence — is surfaced by async monitoring integration.

For the protocol itself, consult the PostgreSQL logical decoding documentation and the pgoutput message format, plus the psycopg2 replication support reference for the client API.

Related

Building a Python Logical Decoding Plugin — the plugin architecture that intercepts WAL changes at the database level.
Parsing pgoutput Format with psycopg2 — byte-level decoding of BEGIN/RELATION/INSERT/UPDATE/DELETE/COMMIT tuples.
JSON to Avro Transformation — turning the raw change envelope into a registry-validated, typed contract.
Event Routing & Kafka Integration — partitioning for ordering, dead-letter queues, and delivery guarantees.
Debezium Connector Configuration — the managed alternative and its type-handling parameters.
Async Monitoring & Integration — lag dashboards and alert rules that watch the slot feeding this parser.
