The Problem
The client’s Confluent Platform deployment was four major versions behind, blocking adoption of exactly-once semantics their payment team needed. Consumer lag alerts were manual — an engineer checked a dashboard daily.
The Approach
- Performed a rolling upgrade across 9 brokers with no message loss using Confluent’s documented upgrade path.
- Introduced Schema Registry with Avro schemas for all new topics, with a compatibility gate in CI.
- Built a Prometheus + Grafana consumer-lag dashboard with alerting rules, replacing the manual check.
The Outcome
- Zero data loss across the upgrade window.
- Exactly-once semantics enabled for the payments pipeline within two sprints of the upgrade.
- Consumer lag incidents dropped from ~3/month to 0 in the three months following the new alerting.