JSON to Avro Transformation

Converting the JSON change events emitted by a CDC pipeline into Avro is the boundary at which a pipeline stops being a prototype and becomes a contract-governed data product. PostgreSQL logical replication slots emit row-level mutations, and whether they reach Kafka as JSON — via the Debezium connector or a raw wal2json reader — or as registry-validated Avro determines how the pipeline behaves under schema change, network pressure, and consumer failure. This reference specifies the serialization objects, the step-by-step transform, the parameter matrix, the diagnostic queries, and the failure signatures that govern a production JSON-to-Avro layer on PostgreSQL 14 through 17.

JSON is the right format for the first week of a pipeline: it is human-readable, schema-optional, and trivial to kcat into a terminal. It is the wrong format for the second year. A JSON change event repeats every field name on every row, so a 200-byte row commonly serializes to 600-900 bytes on the wire; it carries no type contract, so a numeric(20,4) and a float collapse to the same textual token and drift silently; and it lets a producer ship an incompatible field the instant someone runs ALTER TABLE, with the break surfacing hours later as a stack trace in a downstream consumer rather than as a rejected write at the source. Avro closes all three gaps — compact binary encoding, an enforced writer schema, and registry-mediated compatibility checks — but only if the transformation layer is built with the same operational discipline as the replication slot feeding it. Done carelessly, the Avro layer simply moves the failure from the consumer to the serializer and adds a schema registry as a new single point of failure.
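
The size gap comes mostly from field names: JSON repeats them on every row, while Avro writes only the values, in schema order. A rough sketch of that difference, hand-encoding one row with Avro's zigzag varint rules rather than a real Avro library. The row and its three-field schema (id, customer_id, status) are hypothetical.

```python
import json

def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer the way Avro encodes int and long."""
    z = (n << 1) ^ (n >> 63)              # zigzag: small magnitudes get small codes
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)     # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)

def avro_string(s: str) -> bytes:
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    raw = s.encode("utf-8")
    return zigzag_varint(len(raw)) + raw

# Hypothetical row; the assumed schema fixes the field order: id, customer_id, status.
row = {"id": 48213, "customer_id": 977, "status": "shipped"}

json_bytes = json.dumps(row).encode("utf-8")
avro_bytes = (
    zigzag_varint(row["id"])
    + zigzag_varint(row["customer_id"])
    + avro_string(row["status"])
)
print(f"json={len(json_bytes)} bytes, avro body={len(avro_bytes)} bytes")
```

A real record on the topic also carries the 5-byte Confluent prefix, which this sketch leaves out.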

The transform boundary: verbose, untyped JSON becomes a compact, registry-governed Avro record whose Confluent wire prefix carries the schema id that lets any consumer decode it.

Prerequisites & Configuration Objects

Before a single record is serialized, the source database, the connector, and the registry must agree on how PostgreSQL types map to Avro logical types. Misalignment here is not a runtime error; it is silent data corruption discovered weeks later during a reconciliation audit.

PostgreSQL-side configuration. Logical decoding must be enabled and the replicated tables must expose enough of each row for Avro’s non-null constraints to be satisfiable:

sql

-- postgresql.conf (changing any of these three settings requires a restart)
-- wal_level = logical
-- max_replication_slots = 10
-- max_wal_senders = 10

-- Force full-row images so UPDATE events carry every column, not just the key.
-- Critical for tables with TOASTed columns you intend to mark non-nullable in Avro.
ALTER TABLE public.orders REPLICA IDENTITY FULL;   -- PG 14-17

-- Narrow the change stream to exactly the tables you serialize.
CREATE PUBLICATION cdc_avro FOR TABLE public.orders, public.line_items;

The publication is the same object described under creating publications; the transform layer inherits whatever column set it enumerates, so an over-broad FOR ALL TABLES publication forces you to define Avro schemas for tables you never intended to stream.

Roles and privileges. The connector authenticates as a role holding REPLICATION plus SELECT on the published tables — the least-privilege boundary covered in the security and permissions reference. The transform layer additionally needs read-only credentials for the schema registry (to fetch and cache writer schemas) and write credentials only in the CI job that pre-registers schemas.

Converter and registry objects. On the connector, serialization is delegated by two converter properties. The registry itself must have a compatibility policy set before the first schema is registered:

properties

# Debezium / Kafka Connect worker or connector config
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://schema-registry:8081
value.converter.schema.registry.url=http://schema-registry:8081

# Preserve numeric scale and use connect-style int64 millisecond timestamps.
decimal.handling.mode=precise
time.precision.mode=connect

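Set the subject compatibility to BACKWARD_TRANSITIVE up front. The default global policy on a fresh Confluent registry is BACKWARD, which checks a new schema only against the latest registered version rather than the full history, and so does not protect consumers reading from the start of a compacted topic. The subject name must match the topic being written: under the default TopicNameStrategy it is <topic>-value, so orders-value below assumes a topic named orders.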

bash

curl -X PUT http://schema-registry:8081/config/orders-value \
  -H 'Content-Type: application/vnd.schemaregistry.v1+json' \
  -d '{"compatibility": "BACKWARD_TRANSITIVE"}'
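
The difference between the two policies is easiest to see on a three-version history. The sketch below is a deliberately simplified model, not the registry's actual algorithm: it only considers added and removed fields and ignores type changes. The function names and the v1/v2/v3 schemas are hypothetical.

```python
def can_read(reader: dict, writer: dict) -> bool:
    """True if `reader` can decode data written with `writer` (fields only).

    A reader field that the writer never wrote must have a default.
    """
    written = {f["name"] for f in writer["fields"]}
    return all(f["name"] in written or "default" in f for f in reader["fields"])

def backward(new: dict, history: list) -> bool:
    """BACKWARD: check the new schema against the latest version only."""
    return can_read(new, history[-1])

def backward_transitive(new: dict, history: list) -> bool:
    """BACKWARD_TRANSITIVE: check the new schema against every prior version."""
    return all(can_read(new, old) for old in history)

v1 = {"fields": [{"name": "id"}]}
v2 = {"fields": [{"name": "id"}, {"name": "region", "default": "unknown"}]}
v3 = {"fields": [{"name": "id"}, {"name": "region"}]}   # default dropped

history = [v1, v2]
print("BACKWARD accepts v3:           ", backward(v3, history))
print("BACKWARD_TRANSITIVE accepts v3:", backward_transitive(v3, history))
```

Here v3 can read everything written under v2, so BACKWARD lets it through, but a consumer replaying the topic from offset 0 would hit v1 records that have no region and no default to fall back on.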

Step-by-Step Implementation

The transform runs either inside the Debezium converter (zero custom code, least flexibility) or in a dedicated Python stage between the raw JSON topic and the Avro topic (full control over enrichment, filtering, and TOAST hydration). The steps below build the Python stage, which is the path most teams reach once they need field-level shaping. It builds directly on the envelope parsing described in Python CDC parser development.

1. Define the Avro schema as the contract, deriving it from the table. Treat the schema file as source-controlled truth, not a runtime-generated artifact. Every TOAST-prone or nullable column becomes a union with null:

json

{
  "type": "record",
  "name": "Order",
  "namespace": "org.example.cdc",
  "fields": [
    {"name": "id",         "type": "long"},
    {"name": "customer_id","type": "long"},
    {"name": "total",      "type": {"type": "bytes", "logicalType": "decimal", "precision": 20, "scale": 4}},
    {"name": "notes",      "type": ["null", "string"], "default": null},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "__lsn",      "type": ["null", "long"], "default": null}
  ]
}

Carrying the source __lsn in the payload is deliberate: it is the high-water mark the downstream consumer uses for idempotent, LSN-aware apply, mirroring the ordering guarantee provided by the WAL stream mechanics.
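
A minimal sketch of how a consumer might use that high-water mark, assuming an in-memory dict stands in for the sink and that events without an LSN are applied unconditionally. The apply_if_newer name and the sample events are hypothetical.

```python
def apply_if_newer(state: dict, event: dict) -> bool:
    """Apply `event` to `state` unless an equal or newer LSN was already applied."""
    key, lsn = event["id"], event.get("__lsn")
    current = state.get(key)
    if current is not None:
        applied_lsn = current.get("__lsn")
        if lsn is not None and applied_lsn is not None and lsn <= applied_lsn:
            return False                  # replay or stale duplicate: skip it
    state[key] = event
    return True

# Hypothetical in-memory sink keyed by primary key.
sink: dict = {}
first = {"id": 1, "notes": "created", "__lsn": 100}
update = {"id": 1, "notes": "edited", "__lsn": 120}

# Deliver both events, then redeliver them as a consumer restart would.
results = [apply_if_newer(sink, e) for e in (first, update, first, update)]
print(results, sink[1]["notes"])
```

Because the comparison is per key, redelivery after a rebalance or restart leaves the sink unchanged instead of rolling a row back to an older image.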

2. Parse and normalize the incoming JSON change event. Strip Debezium envelope metadata down to the business columns, and resolve PostgreSQL types to their Avro representations:

python

from decimal import Decimal
from datetime import datetime, timezone

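def normalize(change: dict) -> dict:
    """Map a Debezium 'after' image to an Avro-ready dict."""
    after = change["payload"]["after"]
    if after is None:                                # deletes carry no 'after' image
        raise ValueError("event has no 'after' image (delete); route it separately")
    source = change["payload"]["source"]
    return {
        "id":          after["id"],
        "customer_id": after["customer_id"],
        # Decimal() needs a decimal string, which decimal.handling.mode=string emits.
        # Under precise, JsonConverter base64-encodes the unscaled bytes by default;
        # decode those before this step.
        "total":       Decimal(after["total"]),
        "notes":       after.get("notes"),           # None when NULL or omitted
        # epoch milliseconds under time.precision.mode=connect
        "created_at":  datetime.fromtimestamp(after["created_at"] / 1000, tz=timezone.utc),
        "__lsn":       source.get("lsn"),
    }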
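
If the raw JSON topic is produced with decimal.handling.mode=precise and the JsonConverter defaults, a numeric value arrives as base64 text of the unscaled two's-complement integer, not as a decimal string. A minimal sketch of decoding that form, assuming the scale (4 for numeric(20,4)) is known from the table definition or from the event's schema block. Both helper names are hypothetical.

```python
import base64
from decimal import Decimal

def decode_connect_decimal(encoded: str, scale: int) -> Decimal:
    """Decode base64 unscaled big-endian two's-complement bytes into a Decimal."""
    unscaled = int.from_bytes(base64.b64decode(encoded), byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    return Decimal((sign, digits, -scale))        # exact, independent of context precision

def encode_connect_decimal(value: Decimal, scale: int) -> str:
    """Inverse of the decoder, used here only to build a sample payload."""
    unscaled = int(value.scaleb(scale))
    length = max(1, (unscaled.bit_length() + 8) // 8)   # room for the sign bit
    return base64.b64encode(unscaled.to_bytes(length, "big", signed=True)).decode("ascii")

sample = encode_connect_decimal(Decimal("12345.6789"), scale=4)
print(sample, "->", decode_connect_decimal(sample, scale=4))
```

The same unscaled-bytes layout is what the Avro decimal logical type stores, so the decoded Decimal can be handed to the serializer without passing through a float.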

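3. Hydrate omitted TOAST values before serialization. When REPLICA IDENTITY FULL is not set, an UPDATE that does not touch a large column omits its value: a raw decoder may leave the column out entirely, and Debezium substitutes the __debezium_unavailable_value placeholder. Backfill from the source by primary key so that a non-nullable Avro field is never fed a null or a placeholder:

python

# Debezium's default marker for an unchanged TOAST value (unavailable.value.placeholder).
TOAST_PLACEHOLDER = "__debezium_unavailable_value"

def hydrate_toast(row: dict, cur) -> dict:
    """Backfill an omitted TOAST column with the row's current value, by primary key."""
    if row["notes"] is None or row["notes"] == TOAST_PLACEHOLDER:   # candidate omission
        cur.execute("SELECT notes FROM public.orders WHERE id = %s", (row["id"],))
        fetched = cur.fetchone()
        row["notes"] = fetched[0] if fetched else None   # None if the row is gone
    return row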

Keep this path rare: every hydration is a synchronous round-trip to the primary, and it returns the column's current value rather than its value as of the event's LSN. REPLICA IDENTITY FULL at the source is the cheaper long-run control; reserve hydration for tables where the WAL-volume cost of full-row images is unacceptable.

4. Serialize with a cached schema id. Use confluent-kafka’s AvroSerializer, which fetches and caches the schema id on first use and prepends the Confluent wire prefix — magic byte 0x00 followed by the 4-byte big-endian schema id — automatically:

python

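from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import MessageField, SerializationContext

TOPIC = "orders.avro"

sr = SchemaRegistryClient({"url": "http://schema-registry:8081"})
with open("order.avsc") as f:
    serializer = AvroSerializer(sr, f.read())      # schema id is resolved once, then cached

# Tells the serializer which topic and field (key or value) it is encoding for.
ctx = SerializationContext(TOPIC, MessageField.VALUE)

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "enable.idempotence": True,                    # no duplicates from producer retries
    "acks": "all",
    "linger.ms": 20,                               # batch window; pair with batch.size
})

def on_delivery_for(row: dict):
    """Build a delivery callback that diverts a failed row to the DLQ."""
    def callback(err, msg):
        if err is not None:
            log_dlq(row, err)                      # log_dlq: the stage's DLQ writer (not shown)
    return callback

def emit(row: dict, ctx: SerializationContext = ctx):
    producer.produce(
        topic=TOPIC,
        key=str(row["id"]),
        value=serializer(row, ctx),
        on_delivery=on_delivery_for(row),
    )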

5. Route serialization failures to a dead-letter queue, never to a blocked pipeline. A record that cannot be serialized (schema mismatch, unregistered field) must be diverted, not retried in place; the DLQ and partitioning mechanics are detailed under event routing and Kafka integration. Call poll() on every iteration so delivery callbacks fire, and flush with a timeout at shutdown so records still waiting in a linger.ms batch are delivered rather than lost:

python

try:
    row = hydrate_toast(normalize(change), pg_cursor)
    emit(row, ctx)
except Exception as exc:                            # normalization, hydration, or serialization error
    log_dlq(change, exc)                            # divert; do not block the stream
producer.poll(0)                                    # serve delivery callbacks without blocking
# At shutdown, call producer.flush(timeout) once to drain in-flight batches.
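
The steps above call log_dlq without defining it. One possible shape for the record it writes, as a sketch only: the envelope field names are assumptions, and producing the bytes to an actual DLQ topic is left to the caller.

```python
import json
import time
from decimal import Decimal

def build_dlq_record(original: dict, error: Exception) -> bytes:
    """Wrap a failed event with enough context to triage and replay it later."""
    envelope = {
        "error_type": type(error).__name__,
        "error_message": str(error),
        "failed_at_ms": int(time.time() * 1000),
        "original": original,
    }
    # default=str keeps Decimal and datetime values readable instead of failing.
    return json.dumps(envelope, default=str).encode("utf-8")

# Hypothetical failed row and error.
record = build_dlq_record(
    {"id": 42, "total": Decimal("19.9900")},
    ValueError("field 'notes' is not nullable"),
)
print(record.decode("utf-8"))
```

Keeping the original payload intact in the envelope means a fixed schema can be replayed against the DLQ without going back to the source.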

Parameter Reference Table

Parameter	Layer	Valid values	Default	Behavior in a CDC transform
`decimal.handling.mode`	Debezium	`precise`, `double`, `string`	`precise`	`precise` maps `numeric` to Avro `bytes`/decimal preserving scale; `double` loses precision beyond about 15-17 significant digits; `string` is lossless but non-arithmetic downstream.
`time.precision.mode`	Debezium	`adaptive`, `adaptive_time_microseconds`, `connect`	`adaptive`	`connect` always emits Kafka Connect int64 millisecond timestamps (Avro `timestamp-millis`), truncating microseconds; `adaptive` sizes the type to the column, so a `timestamp(6)` arrives as `io.debezium.time.MicroTimestamp` (int64 microseconds), which a consumer expecting milliseconds misreads.
`schema.name.adjustment.mode`	Debezium	`none`, `avro`, `avro_unicode`	`none`	`avro` sanitizes field/record names to the Avro identifier grammar; required when columns contain characters Avro forbids.
compatibility	Schema Registry	`BACKWARD`, `BACKWARD_TRANSITIVE`, `FORWARD`, `FORWARD_TRANSITIVE`, `FULL`, `FULL_TRANSITIVE`, `NONE`	`BACKWARD`	`BACKWARD` checks a new schema against the latest version only; `BACKWARD_TRANSITIVE` guarantees a new schema reads all prior versions, which compacted topics replayed from offset 0 require.
`auto.register.schemas`	Converter/serializer	`true`, `false`	`true`	Set `false` in production so an unreviewed producer cannot register a breaking schema; register in CI instead.
`use.latest.version`	Converter/serializer	`true`, `false`	`false`	With auto-register off, `true` serializes against the registry’s latest registered schema rather than the local one.
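`linger.ms`	Producer	`0`–`n`	`0` in the Java client through Kafka 3.x; `5` in librdkafka (confluent-kafka)	Batch window; 10–20 ms gives the producer larger batches, which improves throughput and the `compression.type` ratio at a small latency cost.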
`REPLICA IDENTITY`	PostgreSQL	`DEFAULT`, `FULL`, `USING INDEX`, `NOTHING`	`DEFAULT`	`FULL` populates every column on `UPDATE`/`DELETE`, eliminating TOAST omissions at the cost of WAL volume.

Diagnostic Queries

The transform layer sits downstream of a slot, so most transform-visible symptoms (stalls, growing lag) actually originate at the source. Watch the slot first, then the registry.

Confirm the slot feeding the transform is active and not retaining WAL unboundedly:

sql

SELECT
  slot_name,
  active,                                             -- expect true
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  ) AS retained_wal,                                  -- ALERT: > 5 GB and climbing
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
  ) AS unconfirmed
FROM pg_replication_slots
WHERE slot_name = 'cdc_avro_slot';
-- active = false while retained_wal grows means the connector reading the slot has disconnected; the transform downstream simply sees no new events.
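
The slot columns report LSNs in PostgreSQL's text form (for example 16/B374D848), while the __lsn field carried in the payload is a plain integer. A small sketch of bridging the two so that slot positions and event positions can be compared in the transform's own monitoring. The helper names are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to its 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def retained_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between two positions, like pg_wal_lsn_diff(current, restart)."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)

# Sample values, not taken from a real server.
print(lsn_to_int("16/B374D848"))
print(retained_bytes("16/B374D848", "16/B374D000"))
```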

Check which tables lack full-row identity — the root cause of TOAST omissions that break non-nullable Avro fields:

sql

SELECT
  p.schemaname || '.' || p.tablename AS table_name,
  CASE c.relreplident
    WHEN 'd' THEN 'DEFAULT (key only)'
    WHEN 'f' THEN 'FULL'
    WHEN 'i' THEN 'USING INDEX'
    WHEN 'n' THEN 'NOTHING'
  END AS replica_identity
FROM pg_publication_tables p
JOIN pg_namespace n ON n.nspname = p.schemaname       -- match on schema as well as name
JOIN pg_class c     ON c.relnamespace = n.oid AND c.relname = p.tablename
WHERE p.pubname = 'cdc_avro'
  AND c.relkind IN ('r', 'p');                        -- ordinary and partitioned tables
-- Any large-column table showing 'DEFAULT (key only)' omits unchanged TOAST values on UPDATE.

Inspect the raw Avro wire prefix off the topic to verify the magic byte and schema id:

bash

# Hex-dump the start of one message value: byte 0 is the 0x00 magic, bytes 1-4 the big-endian schema id.
kcat -C -b kafka:9092 -t orders.avro -c 1 -f '%s' | xxd | head -1
# 00000000: 00 00 00 00 07 ...   -> magic 0x00, schema id 7

A non-zero first byte means the payload was not written by a Confluent-framed serializer; the consumer’s schema lookup will fail before Avro decoding even begins.
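
The same check can be done in code before handing a payload to an Avro decoder. A minimal sketch using only the standard library; parse_confluent_header is a hypothetical helper, and the sample bytes are made up.

```python
import struct

def parse_confluent_header(value: bytes) -> tuple[int, bytes]:
    """Split a Confluent-framed value into (schema_id, avro_body)."""
    if len(value) < 5:
        raise ValueError("value shorter than the 5-byte wire prefix")
    # '>' = big-endian, no padding; B = 1-byte magic, I = 4-byte unsigned schema id.
    magic, schema_id = struct.unpack(">BI", value[:5])
    if magic != 0:
        raise ValueError(f"unexpected magic byte 0x{magic:02x}; not Confluent-framed")
    return schema_id, value[5:]

schema_id, body = parse_confluent_header(b"\x00\x00\x00\x00\x07" + b"avro-body")
print(schema_id, body)
```

A JSON payload that lands on the Avro topic by mistake starts with 0x7b (the opening brace), so it is rejected here with a clear message instead of failing inside the decoder.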

Failure Modes & Gotchas

Schema id mismatch after a DDL change. Signature: after someone runs ALTER TABLE, the producer's registration is rejected with Schema being registered is incompatible with an earlier schema (HTTP 409), or, if the schema was auto-registered, consumers fail to resolve the new writer schema against their reader schema. Root cause: the source DDL shipped before the Avro schema was reviewed and registered, so the writer schema and the registry disagree. Remediation: register schemas in CI ahead of the DDL, set auto.register.schemas=false, and gate the migration on a POST .../compatibility/subjects/<subject>/versions/latest check returning is_compatible: true.
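
A sketch of that CI gate using only the standard library. It builds the compatibility request and interprets the response; the registry URL, the subject name, and both function names are placeholders, and the network call itself is left to the caller.

```python
import json
import urllib.parse
import urllib.request

def build_compat_request(registry_url: str, subject: str, schema_str: str) -> urllib.request.Request:
    """Build the POST that asks the registry whether schema_str fits the latest version."""
    path = f"/compatibility/subjects/{urllib.parse.quote(subject, safe='')}/versions/latest"
    return urllib.request.Request(
        url=registry_url.rstrip("/") + path,
        data=json.dumps({"schema": schema_str}).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    )

def is_compatible(response_body: bytes) -> bool:
    """Interpret the registry's JSON answer; anything but an explicit true fails the gate."""
    return json.loads(response_body).get("is_compatible") is True

# Hypothetical candidate schema and subject.
candidate = json.dumps({"type": "record", "name": "Order", "fields": [{"name": "id", "type": "long"}]})
req = build_compat_request("http://schema-registry:8081", "orders-value", candidate)
print(req.get_method(), req.full_url)
```

In a CI job the request would be sent with urllib.request.urlopen(req) and the migration step skipped whenever is_compatible returns False.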

Numeric precision drift. Signature: monetary totals differ by fractions of a cent between source and sink. Root cause: decimal.handling.mode=double (or a Python float cast in the transform) collapsed a numeric(20,4) into IEEE-754. Remediation: use decimal.handling.mode=precise, keep the value as Decimal end to end, and encode as Avro bytes with an explicit precision/scale.
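
The drift is easy to reproduce. The sketch below pushes one value that fits numeric(20,4) through a Python float, which is the same IEEE-754 double that decimal.handling.mode=double ships. The sample amount is arbitrary.

```python
from decimal import Decimal

source_total = Decimal("1234567890123456.7891")   # 20 digits, scale 4: fits numeric(20,4)

as_float = float(source_total)                    # the value after a double conversion
round_trip = Decimal(as_float)                    # exact value the double actually holds

drift = round_trip - source_total
print(f"source     = {source_total}")
print(f"via double = {round_trip}")
print(f"drift      = {drift}")
```

The error is small per row, but it is systematic, so sums over many rows diverge from the source during reconciliation.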

TOAST columns arrive null on update. Signature: large text/jsonb columns are populated on INSERT but arrive on UPDATE as null, or as Debezium's __debezium_unavailable_value placeholder, tripping a non-nullable Avro field or writing the placeholder into the sink. Root cause: REPLICA IDENTITY DEFAULT omits unchanged out-of-line values. Remediation, in priority order: ALTER TABLE ... REPLICA IDENTITY FULL, then the primary-key hydration SELECT from step 3, then marking the field nullable.

Registry becomes a single point of failure. Signature: the whole transform stalls when the schema registry is briefly unreachable, even though every schema is already cached locally. Root cause: a serializer configured to validate against the registry on every call rather than trusting its cache. Remediation: rely on the serializer’s schema-id cache, add exponential backoff on registry calls, and route to the DLQ rather than blocking when the registry is down — the pipeline must degrade, not halt.
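
The cache-first, back-off-on-miss behavior can be sketched independently of any client library. CachedSchemaLookup and its fetch callback are hypothetical; the real serializer keeps its own cache, so this only illustrates the pattern the remediation describes.

```python
import time

class CachedSchemaLookup:
    """Serve schemas from a local cache; retry registry misses with exponential backoff."""

    def __init__(self, fetch, max_attempts=4, base_delay=0.1, sleep=time.sleep):
        self._fetch = fetch                  # callable: schema_id -> schema
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep                  # injectable so tests need not wait
        self._cache = {}

    def get(self, schema_id):
        if schema_id in self._cache:         # cached ids never touch the registry
            return self._cache[schema_id]
        delay = self._base_delay
        for attempt in range(1, self._max_attempts + 1):
            try:
                schema = self._fetch(schema_id)
            except ConnectionError:
                if attempt == self._max_attempts:
                    raise                    # caller diverts the record to the DLQ
                self._sleep(delay)
                delay *= 2                   # 0.1s, 0.2s, 0.4s, ...
            else:
                self._cache[schema_id] = schema
                return schema

# Demo with a fake registry that fails twice before answering.
attempts = []
def flaky_fetch(schema_id):
    attempts.append(schema_id)
    if len(attempts) < 3:
        raise ConnectionError("registry unreachable")
    return {"id": schema_id, "name": "Order"}

lookup = CachedSchemaLookup(flaky_fetch, sleep=lambda seconds: None)
print(lookup.get(7), lookup.get(7), len(attempts))
```

Only a cache miss can block, and only for a bounded number of attempts, so records for already-known schemas keep flowing while the registry is down.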

Timestamp logical type the consumer cannot read. Signature: consumers fail on an unknown logical type or read timestamps off by orders of magnitude. Root cause: time.precision.mode=adaptive emitted MicroTimestamp while the consumer expected timestamp-millis. Remediation: standardize on time.precision.mode=connect and timestamp-millis across every table in the publication.
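
Where a microsecond value has already reached the transform, the conversion to timestamp-millis is a single integer division. A minimal sketch, assuming the input is epoch microseconds as MicroTimestamp encodes it; micros_to_millis is a hypothetical helper and the sample value is arbitrary.

```python
from datetime import datetime, timezone

def micros_to_millis(micros: int) -> int:
    """Truncate epoch microseconds to epoch milliseconds (Avro timestamp-millis)."""
    return micros // 1000

event_micros = 1_700_000_000_123_456          # epoch microseconds
event_millis = micros_to_millis(event_micros)

correct = datetime.fromtimestamp(event_millis / 1000, tz=timezone.utc)
print(event_millis, correct.isoformat())
```

Feeding event_micros directly into a field declared as timestamp-millis is what produces dates thousands of years in the future on the consumer side.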

Frequently Asked Questions

Should the transform run in the Debezium converter or in a separate Python stage?

Use the AvroConverter directly when the change event needs no reshaping — it is zero custom code and the fewest moving parts. Move to a dedicated Python stage the moment you need field filtering, enrichment, TOAST hydration, or per-table routing that the converter cannot express. Many teams run both: the converter for simple tables, a Python stage for the handful that need shaping.

Why Avro rather than Protobuf or JSON Schema through the registry?

All three are registry-governed and interchangeable at the framing level. Avro’s writer/reader-schema resolution is the most mature fit for CDC because the writer schema travels by id on every record, so a consumer can decode historical messages written under an older schema without redeploying. Protobuf is a reasonable choice where the same messages are also consumed by gRPC services; JSON Schema trades Avro’s compact binary for readability.

How do I change an Avro schema without breaking live consumers?

Only make backward-compatible changes under BACKWARD_TRANSITIVE: add fields with defaults, never remove a field a consumer requires or narrow a type. Register the new schema in CI, confirm the registry reports it compatible, deploy consumers that understand the new field, then ship the source DDL and let producers pick up the new writer schema. Reversing that order is the single most common cause of a stalled Avro pipeline.

Integration Touchpoints

The JSON-to-Avro layer is one stage in a longer flow and only behaves correctly when its neighbors do. Upstream, it depends on the slot and publication objects created during replication slot initialization and on the connector tuning documented in Debezium connector configuration — decimal.handling.mode and time.precision.mode are set there, not here. The envelope shape it consumes is produced by the parsing logic in Python CDC parser development. Downstream, its serialized records are partitioned, keyed, and dead-lettered by event routing and Kafka integration, and the health of the whole chain is surfaced by the dashboards and alert rules in async monitoring integration. The type contract it enforces ultimately traces back to the publication and subscription models that decide which columns ever reach the stream.

For the binary encoding rules themselves, consult the Apache Avro specification, and for the replication semantics underneath, the PostgreSQL logical replication documentation.

Related

Debezium Connector Configuration — converter properties, snapshot modes, and the type-handling parameters this layer depends on.
Python CDC Parser Development — parsing the change envelope and building idempotent, LSN-aware consumers.
Event Routing & Kafka Integration — partitioning for ordering, dead-letter queues, and exactly-once delivery.
Async Monitoring & Integration — lag dashboards and alert rules that watch the slot and registry feeding this transform.
Replication Slot Types — the slot mechanics that determine WAL retention behind the transform.
