Specialised · Senior & Automation

Event-Driven & Async Testing

Kafka and async messaging are the backbone of NZ fintech, healthtech, and government integration platforms. Unlike REST APIs, failures in async systems are invisible — messages go nowhere and nobody knows. This lesson teaches you to test the full pipeline, from producer to consumer to dead letter queue.

Senior Automation Engineer ISTQB CTAL-TA ~15 min read

1 The Hook

An Auckland KiwiSaver administration platform processes employer payroll contribution events through Apache Kafka. An employer submits payroll on the 20th. The Kafka producer fires successfully — the event lands in the topic and the producer logs a confirmation. The employer's system shows "submitted."

Fourteen days later, during the monthly fund reconciliation, the operations team notices 847 employee contribution records did not update. The Kafka consumer had failed silently on a malformed event caused by a schema version mismatch introduced in a hotfix three days earlier. The consumer threw a deserialization exception, moved the message to a dead letter queue, and continued without alerting anyone. The DLQ was not monitored.

847 employees missed a fortnight of KiwiSaver contributions. The fund administrator spent three weeks in manual remediation, two weeks in customer communications, and filed a Serious Incident Report with the Financial Markets Authority. The root cause was not a complex bug — it was a schema change that was never tested against existing consumers, and a DLQ that was never wired to an alert.

2 The Rule

"In synchronous systems, failure is visible — an API returns 500. In async systems, failure is invisible — the message disappears and nobody knows. Test the whole pipeline: producer, schema, consumer, DLQ, and monitoring."

A test that verifies the producer sent a message and stops there has tested the least important part of the system. The producer sending is table stakes. What matters is whether the right message arrived at the right consumer, was processed correctly, and produced the right side effects — and what happened when it did not.

3 The Analogy

Analogy

A Kafka pipeline is a postal system, not a phone call.

When you make a phone call, you know immediately if it connected. The conversation is synchronous: you speak, they respond, the interaction is complete. Posting a letter is asynchronous: you hand the letter to the courier (producer), it goes into the postal system (Kafka topic), a delivery person picks it up (consumer), and eventually (if nothing goes wrong) it arrives and is acted on. Testing the phone call means checking the conversation. Testing the postal system means checking every step: was the right address on the envelope? Could the courier read the address? Did the recipient know what to do with the letter? What happened if the letter was unreadable — was it returned, discarded, or silently left in a pile?

4 How Async Systems Differ from REST

Understanding the differences drives every testing decision you make with Kafka and async messaging.

DimensionREST / SynchronousKafka / Async
Failure visibility HTTP 4xx/5xx — immediate, visible Silent — consumer exception lands in DLQ, caller never notified
Timing Request → response in same thread, same millisecond Event produced → consumed seconds, minutes, or hours later
Coupling Producer and consumer must be running simultaneously Producer and consumer are decoupled — consumer can be offline
Ordering Call sequence is deterministic Only guaranteed within a partition; multi-partition = no global order guarantee
Delivery Exactly-once (one request, one response) At-least-once by default — consumers must be idempotent
Schema versioning Usually in URL or HTTP Accept header In Schema Registry (Confluent Avro or JSON Schema)

Each of these differences is a test dimension. At-least-once delivery means duplicates are possible — you must test that double-processing a KiwiSaver contribution doesn't credit twice. No global ordering means you must test out-of-order arrival. Silent failure means your DLQ monitoring is part of the system under test.

5 Schema Registry and Avro

Kafka messages are bytes. Without a contract, a producer can send anything and consumers silently break. The Confluent Schema Registry enforces a shared contract between producers and consumers, with versioning and compatibility rules.

Avro schema example: KiwiSaver contribution event

{ "type": "record", "name": "KiwiSaverContributionEvent", "namespace": "nz.co.kiwibank.kiwisaver", "fields": [ { "name": "memberId", "type": "string" }, { "name": "employerId", "type": "string" }, { "name": "amountNZD", "type": "double" }, { "name": "payPeriodEnd", "type": "string" }, { "name": "contributionType", "type": { "type": "enum", "name": "ContributionType", "symbols": ["EMPLOYEE", "EMPLOYER", "VOLUNTARY"] }} ] }

Breaking vs non-breaking changes

The Schema Registry enforces compatibility rules. Test both directions when changing a schema:

Change typeBreaking?Test scenario
Add optional field with defaultNo — backward compatibleOld consumer ignores new field; new consumer uses default if absent
Remove required fieldYesOld consumer fails to deserialize; test old consumer against new schema
Change field type (stringint)YesConsumer throws ClassCastException; verify DLQ receives the message
Add new enum valueDepends on consumerOld consumer may throw on unknown enum symbol; test explicitly
Rename fieldYesOld consumers read null for the renamed field; test data integrity

Schema compatibility policy: Set your Schema Registry to BACKWARD_TRANSITIVE compatibility for production topics. This means every new schema version must be readable by all previous consumer versions — the discipline that prevents the KiwiSaver DLQ incident in this lesson's hook.

6 Producer Testing

Producer tests verify that your application publishes the right events, with the right schema, to the right topic, under all conditions.

Using Testcontainers for real Kafka in CI

// Java (Spring Boot) — Testcontainers Kafka integration test @Testcontainers @SpringBootTest class KiwiSaverContributionProducerTest { @Container static KafkaContainer kafka = new KafkaContainer( DockerImageName.parse("confluentinc/cp-kafka:7.6.0") ); @Test void givenValidContribution_whenSubmitted_thenEventPublishedToTopic() { // Arrange var contribution = new ContributionRequest("MBR-001", "EMP-99", 450.00, "EMPLOYEE"); // Act contributionService.submit(contribution); // Assert — consume the event and verify contents var records = testConsumer.poll(Duration.ofSeconds(5)); assertThat(records).hasSize(1); var event = records.iterator().next().value(); assertThat(event.getMemberId()).isEqualTo("MBR-001"); assertThat(event.getAmountNZD()).isEqualTo(450.00); assertThat(event.getContributionType()).isEqualTo(ContributionType.EMPLOYEE); } @Test void givenNegativeAmount_whenSubmitted_thenNoEventPublished() { // Validation should reject before producer fires assertThrows(ValidationException.class, () -> contributionService.submit(new ContributionRequest("MBR-001", "EMP-99", -50.00, "EMPLOYEE")) ); assertThat(testConsumer.poll(Duration.ofSeconds(2))).isEmpty(); } }

What to test on the producer side

  • Happy path: valid input → correct event on correct topic with correct schema version
  • Validation rejection: invalid input → no event produced, appropriate exception thrown
  • Partition key: events for the same member always land on the same partition (ordering guarantee)
  • Schema registration: producer registers schema on first publish; subsequent publishes use cached schema ID
  • Producer failure handling: what happens if Kafka is unavailable — does the producer retry, queue locally, or surface an error to the caller?

7 Consumer Testing

Consumer tests verify that your application processes incoming events correctly — including the invariant that processing the same event twice does not corrupt data.

Idempotency: the most important consumer test

Kafka guarantees at-least-once delivery. This means a consumer may receive the same event more than once — during a rebalance, a consumer restart, or a broker failover. Your consumer must be idempotent: processing the same message twice must produce the same result as processing it once.

// Test: duplicate event does not double-credit KiwiSaver account @Test void givenContributionEvent_whenProcessedTwice_thenBalanceIncrementedOnlyOnce() { var event = buildEvent("MBR-001", 450.00, "EVENT-UUID-9812"); contributionConsumer.consume(event); contributionConsumer.consume(event); // simulates at-least-once duplicate var balance = memberRepository.getBalance("MBR-001"); // Idempotency key (EVENT-UUID-9812) prevents double processing assertThat(balance.getKiwiSaverContributions()).isEqualTo(450.00); }

Consumer test checklist

Test scenarioWhat you are verifying
Valid event → correct state changeCore business logic: event processed, database updated correctly
Same event twice → no duplicationIdempotency key or upsert logic prevents double-processing
Out-of-order eventsConsumer handles event N+1 arriving before event N (if partition key is consistent, this should not happen; if not, verify graceful handling)
Schema version mismatchConsumer receives old-version message → should either deserialize with defaults or route to DLQ
Malformed payload (poison pill)Consumer does not crash; message routed to DLQ; offset committed so processing continues
Consumer group rebalanceNo messages lost or double-processed during rebalance (requires integration test)

8 Dead Letter Queues and Poison Pill Testing

A dead letter queue (DLQ) is a Kafka topic where messages go when a consumer cannot process them. Without a DLQ, a poison pill message (one the consumer cannot deserialize or process) blocks the entire partition — the consumer cannot advance the offset, and all subsequent messages are stuck.

DLQ test scenarios

1
Poison pill routing: publish a message with an unrecognised schema version. Assert it lands in the kiwisaver.contributions.dlq topic, NOT that the consumer crashes or hangs.
2
Partition unblocking: after the poison pill is routed to DLQ, assert that the next valid message in the same partition is processed normally. The consumer must not stall.
3
DLQ message enrichment: many teams enrich DLQ messages with the original topic, partition, offset, and error cause. Verify the DLQ message contains this metadata — it makes manual remediation possible.
4
DLQ monitoring alert: this is the test most teams miss. Publish a message to the DLQ topic and verify your monitoring system (Datadog, PagerDuty, or similar) fires an alert within the SLA window. An unmonitored DLQ is not a safety net — it is a silent graveyard.
// Test: poison pill goes to DLQ, processing continues @Test void givenMalformedEvent_whenConsumed_thenRoutedToDLQAndNextEventProcessed() { // Publish unparseable message (raw bytes, not Avro) producer.send(new ProducerRecord<>("kiwisaver.contributions", "MBR-001", "not-valid-avro".getBytes())); // Publish a valid message after the poison pill producer.send(buildValidEvent("MBR-002", 300.00)); // Wait for consumer to process await().atMost(10, SECONDS).until(() -> dlqConsumer.poll(Duration.ZERO).count() == 1 ); // DLQ has the poison pill var dlqRecord = dlqConsumer.poll(Duration.ofSeconds(2)).iterator().next(); assertThat(new String(dlqRecord.value())).contains("not-valid-avro"); // Valid message after it was also processed assertThat(memberRepository.getBalance("MBR-002").getContributions()).isEqualTo(300.00); }

9 End-to-End Saga Testing

A saga is a business transaction that spans multiple services via events. Each step publishes an event that triggers the next service. If any step fails, compensating events roll back the preceding steps.

NZ example: IRD tax refund saga

When a taxpayer's return is assessed and a refund is due, a chain of events fires across three services:

  1. Assessment Service publishes ReturnAssessed event (refund: $2,340 NZD)
  2. Payment Service consumes it, initiates bank transfer, publishes PaymentInitiated
  3. Notification Service consumes it, sends SMS to taxpayer, publishes NotificationSent
  4. Ledger Service consumes PaymentInitiated, records the debit, publishes LedgerUpdated

Saga tests must cover both the happy path and compensation flows:

Test scenarioExpected outcome
Happy path: valid return, bank details on fileAll 4 events fire in sequence; refund arrives; SMS sent; ledger updated
Bank transfer fails (account closed)Payment Service publishes PaymentFailed; Assessment Service compensates, marks return as "refund-pending-manual-review"
Notification Service is down during sagaPayment saga completes; notification retried later; no duplication of payment
Ledger Service times outRetry with backoff; idempotency key prevents double-ledger entry
Duplicate ReturnAssessed eventPayment Service recognises duplicate (idempotency key on return ID), discards; no double refund

Tooling: Use Testcontainers with Kafka + a database container to spin up the full event pipeline in CI. For complex multi-service sagas, consider Conduktor or AKHQ for visual debugging during exploratory testing.

10 AsyncAPI: Event Contracts

OpenAPI documents REST APIs. AsyncAPI documents event-driven APIs — what topics exist, what schemas they carry, and which services publish and subscribe. It is the async equivalent of a Pact consumer contract.

# asyncapi.yml — KiwiSaver contribution events asyncapi: '2.6.0' info: title: KiwiSaver Contribution Events version: '3.0.0' description: Events published by the payroll contribution service channels: kiwisaver.contributions: publish: operationId: submitContribution summary: Employer submits payroll contribution message: $ref: '#/components/messages/ContributionEvent' bindings: kafka: partitions: 12 replicas: 3 kiwisaver.contributions.dlq: subscribe: operationId: receiveFailedContribution summary: Failed contribution events for manual remediation message: $ref: '#/components/messages/DLQEnvelope' components: messages: ContributionEvent: name: KiwiSaverContributionEvent schemaFormat: 'application/vnd.apache.avro+json;version=1.9.0' payload: $ref: './schemas/contribution-event-v3.avsc'

AsyncAPI enables:

  • Contract-first development: define the event schema before writing producer or consumer code
  • Consumer-driven contract tests: consumers publish their expected schema; producers must not introduce breaking changes
  • Documentation: auto-generated human-readable docs showing all topics, schemas, and publishers/subscribers
  • Schema drift detection: CI compares the AsyncAPI spec against the Schema Registry; fails if they diverge

11 Common Mistakes

Only testing the producer — not the consumer

The producer firing successfully is the least important verification. What matters is whether the consumer received, deserialised, and correctly processed the event. Many teams have 100% producer test coverage and 0% consumer test coverage. Always write consumer tests first.

Not testing for duplicate message processing

Kafka at-least-once delivery is not just a theoretical edge case. Consumer restarts, rebalances, and broker failovers all trigger duplicate delivery. Idempotency must be tested explicitly. A single test processing the same event twice will catch an entire class of production data corruption bugs.

Using an in-memory Kafka mock instead of a real broker

EmbeddedKafka and most Kafka mocks do not reproduce broker behaviours: partition rebalancing, consumer group coordination, schema registry enforcement, or DLQ routing. Use Testcontainers with a real Kafka image in CI. The extra 30 seconds of startup time catches an entire class of bugs that mocks hide.

Never testing what happens when the DLQ fills up

A DLQ that nobody monitors is a ticking clock. Test that your monitoring fires when message count in the DLQ topic crosses a threshold, and that the alert reaches the team within the SLA window. The absence of a DLQ alert is not safety — it is ignorance of failure.

Making breaking schema changes without testing backward compatibility

Renaming a field, removing a field, or changing a field type is a breaking change. Old consumers that have not been updated will fail to deserialise the new schema. Before deploying a schema change, test the new producer schema against the old consumer code — in CI, not in production.

12 Now You Try

Three AI-graded exercises. Apply event-driven testing concepts to NZ scenarios.

Exercise 1 — Design a DLQ test strategy

Design a DLQ test strategy for the KiwiSaver contribution Kafka pipeline described in the hook. Your strategy must cover:

  • What events should be routed to the DLQ, and under what conditions
  • What metadata the DLQ message should contain for remediation teams
  • How to test that the partition continues processing after a poison pill
  • How to test the monitoring alert (what triggers it, who is notified, within what SLA)
  • The manual remediation process for DLQ messages (and how to test it)
Assessed by a real LLM
Exercise 2 — Identify breaking schema changes

You are reviewing a pull request that modifies the Avro schema for Te Whatu Ora patient health events. Below is the current (v2) schema and the proposed (v3) schema. Identify every breaking change, explain why it is breaking, and propose a safe migration path for each.

// v2 (current)                    // v3 (proposed)
{ "name": "patientId",            { "name": "patientNHI",  ← renamed
  "type": "string" }                "type": "string" }
{ "name": "nhsNumber",            { "name": "nhiNumber",   ← renamed
  "type": ["null","string"] }       "type": "string" }     ← null removed
{ "name": "admissionDate",        { "name": "admissionDate",
  "type": "string" }                "type": "long" }       ← type changed
                                  { "name": "ward",        ← new field (no default)
                                    "type": "string" }
Assessed by a real LLM
Exercise 3 — Write idempotent consumer tests

You are writing tests for the Payment Service in the IRD tax refund saga (described in section 9). The service consumes ReturnAssessed events and initiates bank transfers. Write three test cases:

  1. Happy path: first-time receipt of a valid event
  2. Idempotency: same event received twice (simulate Kafka at-least-once delivery)
  3. Compensation: bank transfer API returns "account closed" — verify the rollback event is published

Use the Given/When/Then format. You do not need to write real code — describe what each step sets up, what action is taken, and what assertions are made.

Assessed by a real LLM

13 Self-Check

Click each question to reveal the answer.

Why must Kafka consumers be idempotent, and how do you implement idempotency?

Kafka guarantees at-least-once delivery — the same message may be delivered more than once during consumer restarts, rebalances, or broker failovers. If a consumer processes the same event twice without idempotency protection, it may corrupt data (e.g. credit a bank account twice). Implement idempotency using a unique event ID (idempotency key): before processing, check if the event ID has already been processed; if so, skip it. Alternatively, use upsert (INSERT ON CONFLICT DO NOTHING) database operations with the event ID as the primary key.

What is a poison pill message, and what should happen when a consumer receives one?

A poison pill is a message the consumer cannot deserialise or process — typically due to a schema mismatch, corrupt bytes, or an unexpected payload format. Without handling, the consumer cannot advance the partition offset and all subsequent messages are blocked. The consumer should catch the deserialisation exception, route the poison pill to a dead letter queue topic, commit the offset to unblock the partition, and continue processing the next message. The DLQ must be monitored for manual remediation.

What is the difference between at-least-once and exactly-once delivery in Kafka?

At-least-once delivery (the default) means every message is delivered at minimum once, but duplicates are possible. Exactly-once delivery (available with Kafka transactions and idempotent producers) means each message is processed precisely once, even in the event of retries or failures. Exactly-once has higher latency and complexity. Most NZ teams use at-least-once with idempotent consumers — the consumer handles duplicates — rather than paying the performance cost of exactly-once at the broker level.

What is an AsyncAPI document, and how does it relate to schema testing?

AsyncAPI is an open specification (similar to OpenAPI for REST) that documents event-driven APIs: which topics exist, what schema each message uses, and which services are publishers or subscribers. In a CI pipeline, AsyncAPI can be used to validate that the actual schema in the Schema Registry matches the AsyncAPI specification — catching schema drift before it reaches production. It also enables consumer-driven contract tests, where consumers publish their required schema and producers must not break it.

Name three Kafka test scenarios that a Testcontainers-based integration test covers that an in-memory Kafka mock cannot.

1. Consumer group rebalancing — only a real Kafka broker manages consumer group coordination; mocks do not reproduce partition rebalancing behaviour. 2. Schema Registry enforcement — a real Confluent Schema Registry validates Avro schema compatibility; most mocks bypass schema validation entirely. 3. DLQ routing configuration — the error handler that routes poison pills to the DLQ topic is a Kafka Streams or Spring Kafka configuration that only takes effect against a real broker. Mocks typically call the consumer directly, skipping this layer.

14 ISTQB Mapping

ISTQB CTAL-TA (Certified Tester Advanced Level — Test Analyst)

  • Chapter 3 — Structure-Based Testing: Integration testing patterns, interface testing, system integration
  • Chapter 4 — Defect-Based Testing: Fault injection, error handling verification

ISTQB CTAL-DevOps (Advanced DevOps Tester)

  • Chapter 2 — Continuous Testing: Async system test automation, integration testing in CI pipelines
  • Chapter 5 — Test Environments: Containerised test environments (Testcontainers), infrastructure-as-code for test brokers

ISTQB CT-TAS (Test Automation Strategies)

  • Test isolation patterns, test data management for event-driven systems, contract testing strategies