Debezium Connector Configuration

The Debezium PostgreSQL connector is the extraction stage of a CDC pipeline: it owns a replication slot, decodes the WAL into change events, and streams them into Kafka. This reference specifies the exact connector configuration that keeps that stage reproducible in production — database prerequisites, an idempotent deployment payload, the parameters that decide fault tolerance and data fidelity, the diagnostic queries that prove the connector is healthy, and the failure signatures you will actually meet. It assumes PostgreSQL 14–17 and Debezium 2.x with the pgoutput plugin.

A misconfigured connector does not fail loudly at deploy time; it fails days later. A slot name that changes between deployments leaves an orphaned slot behind each time, and every orphan pins pg_wal until the primary’s disk fills. A snapshot.mode that resnapshots on every start (always) re-reads a billion-row table and saturates the primary. Batch and queue sizes tuned for a demo OOM-kill the Kafka Connect worker under a bulk UPDATE. Every one of these is a configuration decision made once and paid for continuously, which is why the connector config is treated here as a versioned, reviewed artifact rather than a form filled in a UI.

Prerequisites & Configuration Objects

Before the connector can start, three things must exist on the primary: logical WAL, slot capacity, and a narrow publication. These are the logical decoding subsystem objects the connector binds to; getting them wrong surfaces as a connector that starts and then immediately fails its first streaming poll.

Set wal_level = logical and size the slot and sender pools for concurrent connectors plus failover headroom. wal_level, max_replication_slots, and max_wal_senders take effect only after a restart; logical_decoding_work_mem and max_slot_wal_keep_size apply on reload:

sql

-- Primary server settings. The first three need a restart; the last two apply on reload.
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = '10';   -- one per connector + spares
ALTER SYSTEM SET max_wal_senders = '10';         -- >= max_replication_slots
ALTER SYSTEM SET logical_decoding_work_mem = '256MB';
-- Bound the WAL a lagging slot can pin before PostgreSQL invalidates it. PG 13+.
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
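
A pre-deployment check can catch these settings before the connector ever starts. The following is a minimal sketch, assuming the values have already been read from the server (for example with SELECT name, setting FROM pg_settings) into a plain dictionary; the function name check_prerequisites and the connectors argument are hypothetical, not part of Debezium or PostgreSQL.

```python
def check_prerequisites(settings: dict, connectors: int) -> list:
    """Return human-readable problems; an empty list means the settings pass.

    `settings` maps a PostgreSQL parameter name to its value as text.
    `connectors` is the number of connectors that will each hold a slot.
    """
    problems = []
    if settings.get("wal_level") != "logical":
        problems.append("wal_level must be 'logical'")

    slots = int(settings.get("max_replication_slots", "0"))
    senders = int(settings.get("max_wal_senders", "0"))
    if slots < connectors:
        problems.append(
            f"max_replication_slots={slots} is below {connectors} connectors"
        )
    if senders < slots:
        problems.append(
            f"max_wal_senders={senders} is below max_replication_slots={slots}"
        )

    # -1 means a lagging slot may retain WAL without limit.
    if settings.get("max_slot_wal_keep_size", "-1") == "-1":
        problems.append("max_slot_wal_keep_size is unbounded (-1)")
    return problems
```

Running it in CI against a staging primary keeps the server settings under the same review as the connector config.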

Create a dedicated, least-privilege role for the connector rather than reusing a superuser. The connector needs REPLICATION to open a logical decoding session and SELECT on the captured tables to run its initial snapshot; the full host-based rules and TLS posture are covered in security boundaries and permissions:

sql

CREATE ROLE cdc_replicator WITH LOGIN REPLICATION PASSWORD 'from-secrets-manager';
GRANT USAGE ON SCHEMA public TO cdc_replicator;
GRANT SELECT ON public.orders, public.users, public.inventory TO cdc_replicator;

Create a narrow publication that enumerates exactly the CDC surface. A FOR ALL TABLES publication decodes churn from tables no consumer wants and inflates decoding CPU; scope it to the tables and, on PG 15+, the columns and rows you actually stream:

sql

CREATE PUBLICATION cdc_prod_publication
  FOR TABLE public.orders, public.users, public.inventory;

-- REPLICA IDENTITY decides what UPDATE/DELETE events carry. DEFAULT emits only
-- the primary key in "before"; FULL emits every prior column value.
ALTER TABLE public.orders REPLICA IDENTITY FULL;

REPLICA IDENTITY is a correctness decision, not a tuning knob. Under the default identity, an UPDATE that leaves a TOASTed column unchanged does not carry that column’s value (Debezium fills in its unavailable.value.placeholder, __debezium_unavailable_value by default), and a DELETE carries only the key. Both break any consumer that needs the prior row or builds strict Avro records. Provision the publication and slot with the same static names the connector will reference; letting Debezium create them implicitly works, but pre-creating them keeps the objects in the same infrastructure-as-code review as the connector.
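
For tables that stay on the default identity, a consumer has to avoid overwriting a stored value with a missing TOAST column. The sketch below shows one way to do that, assuming text columns and Debezium's default placeholder string for unavailable values; merge_after_image and the row shapes are hypothetical illustrations, not a Debezium API.

```python
# Debezium's default unavailable.value.placeholder for unchanged TOAST columns.
UNAVAILABLE = "__debezium_unavailable_value"


def merge_after_image(stored_row: dict, after: dict,
                      placeholder: str = UNAVAILABLE) -> dict:
    """Overlay an UPDATE's `after` image onto the row already stored downstream.

    Columns whose value equals the placeholder were not present in the WAL
    record, so the previously stored value is kept for them.
    """
    merged = dict(stored_row)
    for column, value in after.items():
        if value == placeholder:
            continue  # unchanged TOAST column: keep what we already have
        merged[column] = value
    return merged
```

Setting REPLICA IDENTITY FULL on the table removes the need for this fallback at the cost of larger WAL records.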

Step-by-Step Implementation

Deploy the connector through the Kafka Connect REST API with an idempotent PUT so repeated applies converge to one connector instead of spawning duplicates. Each step below is annotated with the reason it matters, not just the mechanics.

1. Write the connector config as a reviewed artifact. Reference secrets through a config provider — never inline the password. The ${file:...} and ${env:...} syntaxes resolve at runtime from a properties file or environment injected by your secrets manager:

json

{
  "name": "pg-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "pg-primary.internal",
    "database.port": "5432",
    "database.user": "cdc_replicator",
    "database.password": "${file:/opt/kafka/config/secrets/db-pass.properties:password}",
    "database.dbname": "analytics_prod",
    "database.sslmode": "verify-full",
    "topic.prefix": "pg-analytics",
    "plugin.name": "pgoutput",
    "publication.name": "cdc_prod_publication",
    "slot.name": "debezium_pg_analytics",
    "slot.drop.on.stop": "false",
    "snapshot.mode": "initial",
    "snapshot.locking.mode": "shared",
    "snapshot.isolation.mode": "repeatable_read",
    "snapshot.fetch.size": "10240",
    "heartbeat.interval.ms": "10000",
    "poll.interval.ms": "500",
    "max.batch.size": "2048",
    "max.queue.size": "8192",
    "tombstones.on.delete": "true",
    "decimal.handling.mode": "precise",
    "time.precision.mode": "connect",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false"
  }
}

Three settings here are load-bearing. plugin.name must be set explicitly to pgoutput: the connector’s default is still decoderbufs, which needs an extension installed on the server, and Debezium 2.x removed wal2json, so a config carried over from 1.x that names it fails to start. A static, unique slot.name binds every restart to the same slot and stops inactive slots accumulating. And slot.drop.on.stop: false keeps the slot across maintenance restarts so a bounce neither loses changes nor forces a resnapshot.
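
Because the config is a reviewed artifact, these rules can be enforced mechanically before it is applied. The following is a minimal sketch of such a lint, assuming the inner config map has been loaded as a dictionary of strings; lint_connector_config is a hypothetical helper, and the rules encode only the checks described in this section.

```python
import re


def lint_connector_config(config: dict) -> list:
    """Return findings for a Debezium PostgreSQL config map; empty means clean."""
    findings = []
    if config.get("plugin.name") != "pgoutput":
        findings.append("plugin.name should be set explicitly to pgoutput")
    if not config.get("slot.name"):
        findings.append("slot.name is unset; set a static, unique name")
    if config.get("slot.drop.on.stop", "false").lower() != "false":
        findings.append("slot.drop.on.stop should be false in production")

    # Accept only config-provider references such as ${file:...} or ${env:...}.
    password = config.get("database.password", "")
    if not re.fullmatch(r"\$\{[^}]+\}", password):
        findings.append("database.password should be a config-provider reference")

    batch = int(config.get("max.batch.size", "2048"))
    queue = int(config.get("max.queue.size", "8192"))
    if queue <= batch:
        findings.append("max.queue.size must exceed max.batch.size")
    return findings
```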

2. Apply idempotently. PUT /connectors/{name}/config creates the connector if absent and updates it in place if present, so re-running the same payload is safe:

bash

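# PUT .../config expects the flat config map, so strip the {"name", "config"} envelope.
jq '.config' connector-config.json \
  | curl -X PUT http://kafka-connect:8083/connectors/pg-cdc-connector/config \
         -H "Content-Type: application/json" \
         -d @-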
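
The same apply step can be driven from a deployment script. The sketch below only builds the request, assuming the artifact keeps the name and config envelope shown in step 1 and that the endpoint takes the inner config map as its body; build_put_request is a hypothetical helper, and sending the request (for example with urllib.request.urlopen) is left to the caller.

```python
import json
import urllib.request


def build_put_request(connect_url: str, artifact: dict) -> urllib.request.Request:
    """Build the idempotent PUT for a {"name": ..., "config": {...}} artifact."""
    name = artifact["name"]
    # The /config endpoint takes the flat config map, not the envelope.
    body = json.dumps(artifact["config"]).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url.rstrip('/')}/connectors/{name}/config",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

Keeping request construction separate from sending makes the payload easy to assert on in a unit test without a running Connect cluster.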

3. Confirm both the connector and its task reach RUNNING. A connector can be RUNNING while its task is FAILED; check the task state, not just the connector:

bash

curl -s http://kafka-connect:8083/connectors/pg-cdc-connector/status \
  | jq '{connector: .connector.state, tasks: [.tasks[].state]}'
# Expect: {"connector":"RUNNING","tasks":["RUNNING"]}
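
The same check is easy to automate in a deploy gate. This is a minimal sketch, assuming the status endpoint's JSON has already been parsed into a dictionary with a connector object and a tasks list; unhealthy_parts is a hypothetical name.

```python
def unhealthy_parts(status: dict) -> list:
    """List everything in a connector status document that is not RUNNING."""
    problems = []
    connector_state = status["connector"]["state"]
    if connector_state != "RUNNING":
        problems.append(f"connector is {connector_state}")

    tasks = status.get("tasks", [])
    if not tasks:
        problems.append("no tasks assigned")
    for task in tasks:
        # A RUNNING connector can still have a FAILED task, so check each one.
        if task["state"] != "RUNNING":
            problems.append(f"task {task['id']} is {task['state']}")
    return problems
```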

4. Verify the initial snapshot completes before trusting downstream counts. During snapshot.mode: initial the connector holds a consistent read at repeatable_read isolation and streams every existing row before switching to live WAL streaming. Watch the connector log for the Snapshot ended line, then cross-check that the slot has begun advancing (see Diagnostic Queries).

Figure: connector lifecycle. The initial snapshot freezes the slot’s restart_lsn until it completes, streaming advances it, and a restarted FAILED task resumes from the last committed offset rather than resnapshotting.

5. Tune the snapshot for large tables. snapshot.fetch.size caps the JDBC fetch batch during the initial read; a small value keeps worker heap flat on wide or billion-row tables at the cost of more round trips. Pair it with snapshot.locking.mode chosen for your consistency needs (see the reference table). For the raw-PostgreSQL equivalent of this snapshot-then-stream handoff — used when you drive subscribers natively instead of through Kafka — see subscription sync procedures.

Parameter Reference

The parameters below are the ones that change behavior under load or failure. Defaults are Debezium 2.x defaults; the “logical-replication behavior” column is the effect specific to the WAL/slot layer, not the generic description.

Parameter	Valid values	Default	Logical-replication behavior
`plugin.name`	`pgoutput`, `decoderbufs`	`decoderbufs`	Use `pgoutput`: the in-core plugin, no extension. `decoderbufs` needs a server-side extension; `wal2json` was removed in Debezium 2.x.
`slot.name`	identifier	`debezium`	The persistent logical slot. Static and unique per connector; frozen `restart_lsn` here pins WAL.
`slot.drop.on.stop`	`true`, `false`	`false`	`true` drops the slot on graceful stop — forces a full resnapshot on restart. Keep `false` in production.
`publication.name`	identifier	`dbz_publication`	Must match an existing publication, or `publication.autocreate.mode` governs creation.
`publication.autocreate.mode`	`all_tables`, `disabled`, `filtered`	`all_tables`	`filtered` scopes the auto-created publication to the connector’s table list; `all_tables` over-captures.
`snapshot.mode`	`initial`, `always`, `initial_only`, `no_data`, `never`, `when_needed`	`initial`	`initial` snapshots once then streams; `always` resnapshots on every start; `no_data` (formerly `never`) streams only new WAL; `when_needed` resnapshots if the slot’s offset is gone.
`snapshot.locking.mode`	`shared`, `none`, `custom`	`none`	Table-lock behavior while the snapshot reads schemas and metadata (details below).
`snapshot.isolation.mode`	`serializable`, `repeatable_read`, `read_committed`, `read_uncommitted`	`repeatable_read`	Isolation for the snapshot transaction; `repeatable_read` gives a consistent point-in-time without full serialization cost.
`snapshot.fetch.size`	integer rows	`10240`	JDBC fetch batch during snapshot; lower to cap worker heap on wide/large tables.
`heartbeat.interval.ms`	ms (`0` = off)	`0`	Emits heartbeats that advance `confirmed_flush_lsn` when captured tables are idle but the DB is busy — prevents an idle slot pinning unrelated WAL. Set to `10000`.
`max.batch.size`	integer	`2048`	Events per Connect batch; scales with worker heap.
`max.queue.size`	integer	`8192`	In-memory event buffer; must exceed `max.batch.size`. Backpressure ceiling before the connector blocks.
`tombstones.on.delete`	`true`, `false`	`true`	Emits a null-value tombstone after a delete so log-compacted Kafka topics reclaim the key.
`decimal.handling.mode`	`precise`, `double`, `string`	`precise`	`precise` preserves `numeric` scale as bytes; `double` risks floating-point drift on money columns.
`time.precision.mode`	`adaptive`, `adaptive_time_microseconds`, `connect`	`adaptive`	`connect` maps timestamps to millisecond logical types for portability across consumers.

snapshot.locking.mode deserves its own note because it trades production impact against consistency. The PostgreSQL connector accepts shared, none, and custom; minimal and extended are MySQL connector values and are not valid here:

shared: holds a table lock that blocks exclusive access only while schemas and other metadata are read, then releases it and relies on the snapshot transaction’s isolation for the row scan. Low production impact, and the safer choice when DDL can run during a snapshot.
none: takes no locks. Safe only when no schema change can occur during the snapshot; concurrent DDL can corrupt the captured schema.
custom: delegates locking to a custom implementation that you supply.

Diagnostic Queries

The connector’s health is ultimately measured on the PostgreSQL side: is its slot advancing, and how much WAL is it pinning? Run these against the primary and alert on the thresholds inline. These are the same primitives the dashboards under async monitoring integration are built on.

Slot health — retained WAL and inactivity are the two numbers that precede a full-disk outage:

sql

SELECT
  slot_name,
  active,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  ) AS retained_wal,
  pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS flush_lag_bytes
FROM pg_replication_slots
WHERE slot_name = 'debezium_pg_analytics';
-- ALERT: active = false on a production slot   -> connector down, WAL pinning
-- ALERT: retained_wal > 4 GB                   -> approaching max_slot_wal_keep_size
-- ALERT: flush_lag_bytes > 256 MB sustained    -> consumer falling behind
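
When the LSNs are exported as text rather than computed in SQL, the same arithmetic can be reproduced client-side. The sketch below mirrors pg_wal_lsn_diff for the textual X/Y form and applies the two byte thresholds from the query comments; lsn_to_int, lsn_diff, and slot_alerts are hypothetical helpers, and the check covers a single sample, so "sustained" still has to be judged across samples by the alerting system.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/16B3748' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(newer: str, older: str) -> int:
    """Byte distance between two LSNs, like pg_wal_lsn_diff(newer, older)."""
    return lsn_to_int(newer) - lsn_to_int(older)


def slot_alerts(active: bool, current_lsn: str, restart_lsn: str,
                confirmed_flush_lsn: str) -> list:
    """Evaluate one sample of a slot against the thresholds used above."""
    alerts = []
    if not active:
        alerts.append("slot inactive")
    if lsn_diff(current_lsn, restart_lsn) > 4 * GIB:
        alerts.append("retained WAL above 4 GB")
    if lsn_diff(current_lsn, confirmed_flush_lsn) > 256 * MIB:
        alerts.append("flush lag above 256 MB")
    return alerts
```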

Live streaming connection — pg_stat_replication shows whether the connector is actually streaming and where the lag sits (network vs. flush vs. replay):

sql

SELECT
  application_name,               -- matches the connector's slot/app name
  state,                          -- expect 'streaming'
  pg_wal_lsn_diff(sent_lsn, replay_lsn) AS total_lag_bytes,
  write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
-- ALERT: state <> 'streaming'    -> connection stalled or still snapshotting

Kafka Connect offsets: cross-reference the committed connector offset with the slot’s confirmed_flush_lsn to confirm they agree. The GET offsets endpoint requires Kafka Connect 3.5 or later:

bash

curl -s http://kafka-connect:8083/connectors/pg-cdc-connector/offsets | jq .

On the pipeline side, scrape Debezium’s JMX MilliSecondsBehindSource and records-sent rate, plus kafka.connect.source-record-poll-rate and kafka.connect.source-record-active-count — a rising active-count with a flat poll-rate is the classic backpressure signature.

Failure Modes & Gotchas

Connector FAILED after a transient error. Signature: task state is FAILED, trace shows a network timeout, schema mismatch, or WAL gap. Root cause: an exception the connector could not retry through. Remediation: do not delete the connector. Inspect the trace via the status endpoint, fix the root cause, and restart the failed task with POST /connectors/pg-cdc-connector/tasks/0/restart; re-PUT the config only if the fix changed it. The task resumes from the last committed offset. Deleting and recreating the connector fixes nothing and adds risk: under a different name it starts from empty offsets, and with a dropped slot that forces a full resnapshot.

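Orphaned slots on every restart. Signature: pg_replication_slots accumulates inactive slots with varying names, and retained WAL climbs. Root cause: slot.name changes between starts (for example, a suffix generated by the deployment template), so each start creates a fresh slot while the old ones stay active = false. Leaving slot.name unset does not randomize it; it falls back to the fixed default debezium, which instead collides when a second connector targets the same server. Remediation: set a static, unique slot.name, drop the orphans with pg_drop_replication_slot(), and alert on active = false.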

WAL exhaustion from an inactive slot. Signature: active = false, retained_wal climbing, disk alerts on the primary. Root cause: the connector died or the network partitioned and restart_lsn froze. Remediation: restore the connector to resume acknowledgment; if it is gone for good, SELECT pg_drop_replication_slot('debezium_pg_analytics'); to release WAL immediately, accepting a resnapshot on next start. Prevent with max_slot_wal_keep_size plus the active=false alert.

Worker OOM-kill during snapshot or bulk write. Signature: the Kafka Connect container restarts with an OOM exit; large transactions precede it. Root cause: max.batch.size/max.queue.size sized beyond the JVM heap, or snapshot.fetch.size too high on a wide table. Remediation: keep max.queue.size > max.batch.size, size both against KAFKA_HEAP_OPTS (a 4–8 GB heap comfortably runs max.batch.size=2048, max.queue.size=8192), and lower snapshot.fetch.size on wide tables. On the server side, logical_decoding_work_mem bounds decode memory by spilling multi-million-row transactions to disk once they exceed it.

Failover loses the slot. Signature: after promoting a replica, the connector cannot find its slot and triggers a full snapshot. Root cause: before PG 17, logical slots did not fail over; they lived only on the old primary. Remediation: on PG 17, create failover-enabled slots and enable slot synchronization to the standby; on PG 16 and earlier, use snapshot.mode=when_needed with an idempotent consumer that tolerates replayed rows. The slot durability options are compared in replication slot types.

Integration Touchpoints

The connector config is one object in a chain, and its parameters only make sense against the stages on either side. The slot it binds to is decoded from the WAL stream mechanics on the primary; the events it emits are consumed by Python code that must be idempotent and LSN-aware.

The key.converter/value.converter choice is the seam between this connector and downstream serialization. JSON converters suit development and human-readable debugging, but production pipelines switch to io.confluent.connect.avro.AvroConverter (or io.debezium.converters.CloudEventsConverter) to enforce schema-registry contracts and shrink payloads — the governance, compatibility policy, and TOAST hydration fallback live in JSON to Avro transformation. tombstones.on.delete=true propagates deletes explicitly so downstream consumers can distinguish soft from hard deletes.

On the consumption side, the connector’s at-least-once delivery means a restart can replay committed changes, so consumers must upsert on the primary key and drop any event whose source LSN or ts_ms is not newer than the stored high-water mark — the pattern implemented in Python CDC parser development. Partitioning for ordering, dead-letter routing for non-conforming records, and exactly-once delivery are the concern of event routing and Kafka integration.
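
The upsert-and-drop rule can be sketched in a few lines. The example below assumes each record has been deserialized into a key and a Debezium envelope with op, after, and source.lsn fields, and uses an in-memory dictionary where a real consumer would use its target store; apply_change is a hypothetical name, not the parser described in the linked article.

```python
def apply_change(state: dict, key, value: dict) -> bool:
    """Apply one change event to `state`; return False if it was a replay.

    `state` maps a primary key to (lsn, row); row is None after a delete.
    """
    lsn = value["source"]["lsn"]
    seen_lsn, _ = state.get(key, (-1, None))
    if lsn <= seen_lsn:
        return False  # not newer than the high-water mark: drop the replay

    # Deletes keep the watermark so a replayed older event cannot
    # resurrect the row.
    row = None if value["op"] == "d" else value["after"]
    state[key] = (lsn, row)
    return True
```

Tracking the watermark per key rather than globally lets partitions be processed independently, at the cost of keeping one LSN per row.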

Frequently Asked Questions

Should I let Debezium create the publication and slot, or pre-create them?

Pre-create both with static names and keep them in the same infrastructure-as-code review as the connector. Auto-creation works, but publication.autocreate.mode=all_tables over-captures, and the default slot name (debezium) collides as soon as a second connector targets the same server. Pre-creating keeps the WAL surface explicit and reviewable.

What happens if I change snapshot.mode on a running connector?

snapshot.mode is read at connector start. Changing it and re-PUT-ing takes effect on the next restart. Switching to never on a connector that has already snapshotted is safe; switching to initial does not re-trigger a snapshot unless the offset/slot is gone, which is what when_needed handles automatically.

Why is my slot pinning WAL even though the captured tables are idle?

Because confirmed_flush_lsn only advances when the connector acknowledges WAL, and an idle capture table produces no events to acknowledge while the rest of the database keeps generating WAL. Set heartbeat.interval.ms=10000 so Debezium emits heartbeats that advance the slot during quiet periods.

Is it safe to delete a FAILED connector and recreate it?

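Avoid it. Kafka Connect stores offsets under the connector name and does not remove them when the connector is deleted, so recreating under the same name resumes from the old position, while recreating under a new name starts empty and snapshots again according to snapshot.mode. If slot.drop.on.stop is false the slot survives either way. Fix the root cause and restart the failed task so the connector resumes from its last offset.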

Related

Python CDC Parser Development: building idempotent, LSN-aware consumers for the events this connector emits.
JSON to Avro Transformation: converter selection, schema-registry governance, and TOAST hydration.
Event Routing & Kafka Integration: partitioning for ordering, dead-letter queues, and exactly-once delivery.
Creating Publications: row filters, column lists, and REPLICA IDENTITY mechanics for the capture surface.
Replication Slot Types: persistent vs. temporary vs. failover slots and their durability semantics.
