Specialised · Senior & Automation

Event-Driven & Async Testing

Kafka and async messaging are the backbone of NZ fintech, healthtech, and government integration platforms. Unlike REST APIs, failures in async systems are invisible — messages go nowhere and nobody knows. This lesson teaches you to test the full pipeline, from producer to consumer to dead letter queue.

Senior Automation Engineer ISTQB CTAL-TA ~15 min read

1 The Hook

An Auckland KiwiSaver administration platform processes employer payroll contribution events through Apache Kafka. An employer submits payroll on the 20th. The Kafka producer fires successfully — the event lands in the topic and the producer logs a confirmation. The employer's system shows "submitted."

Fourteen days later, during the monthly fund reconciliation, the operations team notices 847 employee contribution records did not update. The Kafka consumer had failed silently on a malformed event caused by a schema version mismatch introduced in a hotfix three days earlier. The consumer threw a deserialization exception, moved the message to a dead letter queue, and continued without alerting anyone. The DLQ was not monitored.

847 employees missed a fortnight of KiwiSaver contributions. The fund administrator spent three weeks in manual remediation, two weeks in customer communications, and filed a Serious Incident Report with the Financial Markets Authority. The root cause was not a complex bug — it was a schema change that was never tested against existing consumers, and a DLQ that was never wired to an alert.

2 The Rule

"In synchronous systems, failure is visible — an API returns 500. In async systems, failure is invisible — the message disappears and nobody knows. Test the whole pipeline: producer, schema, consumer, DLQ, and monitoring."

A test that verifies the producer sent a message and stops there has tested the least important part of the system. The producer sending is table stakes. What matters is whether the right message arrived at the right consumer, was processed correctly, and produced the right side effects — and what happened when it did not.

3 The Analogy

Analogy

A Kafka pipeline is a postal system, not a phone call.

When you make a phone call, you know immediately if it connected. The conversation is synchronous: you speak, they respond, the interaction is complete. Posting a letter is asynchronous: you hand the letter to the courier (producer), it goes into the postal system (Kafka topic), a delivery person picks it up (consumer), and eventually (if nothing goes wrong) it arrives and is acted on. Testing the phone call means checking the conversation. Testing the postal system means checking every step: was the right address on the envelope? Could the courier read the address? Did the recipient know what to do with the letter? What happened if the letter was unreadable — was it returned, discarded, or silently left in a pile?

4 How Async Systems Differ from REST

Understanding the differences drives every testing decision you make with Kafka and async messaging.

Dimension	REST / Synchronous	Kafka / Async
Failure visibility	HTTP 4xx/5xx — immediate, visible	Silent — consumer exception lands in DLQ, caller never notified
Timing	Request → response in same thread, same millisecond	Event produced → consumed seconds, minutes, or hours later
Coupling	Producer and consumer must be running simultaneously	Producer and consumer are decoupled — consumer can be offline
Ordering	Call sequence is deterministic	Only guaranteed within a partition; multi-partition = no global order guarantee
Delivery	Exactly-once (one request, one response)	At-least-once by default — consumers must be idempotent
Schema versioning	Usually in URL or HTTP Accept header	In Schema Registry (Confluent Avro or JSON Schema)

Each of these differences is a test dimension. At-least-once delivery means duplicates are possible — you must test that double-processing a KiwiSaver contribution doesn't credit twice. No global ordering means you must test out-of-order arrival. Silent failure means your DLQ monitoring is part of the system under test.

5 Schema Registry and Avro

Kafka messages are bytes. Without a contract, a producer can send anything and consumers silently break. The Confluent Schema Registry enforces a shared contract between producers and consumers, with versioning and compatibility rules.

Avro schema example: KiwiSaver contribution event

{
  "type": "record",
  "name": "KiwiSaverContributionEvent",
  "namespace": "nz.co.kiwibank.kiwisaver",
  "fields": [
    { "name": "memberId",     "type": "string" },
    { "name": "employerId",    "type": "string" },
    { "name": "amountNZD",     "type": "double" },
    { "name": "payPeriodEnd",   "type": "string" },
    { "name": "contributionType", "type": {
      "type": "enum",
      "name": "ContributionType",
      "symbols": ["EMPLOYEE", "EMPLOYER", "VOLUNTARY"]
    }}
  ]
}

Breaking vs non-breaking changes

The Schema Registry enforces compatibility rules. Test both directions when changing a schema:

Change type	Breaking?	Test scenario
Add optional field with default	No — backward compatible	Old consumer ignores new field; new consumer uses default if absent
Remove required field	Yes	Old consumer fails to deserialize; test old consumer against new schema
Change field type (`string` → `int`)	Yes	Consumer throws ClassCastException; verify DLQ receives the message
Add new enum value	Depends on consumer	Old consumer may throw on unknown enum symbol; test explicitly
Rename field	Yes	Old consumers read null for the renamed field; test data integrity

Schema compatibility policy: Set your Schema Registry to BACKWARD_TRANSITIVE compatibility for production topics. This means every new schema version must be readable by all previous consumer versions — the discipline that prevents the KiwiSaver DLQ incident in this lesson's hook.

6 Producer Testing

Producer tests verify that your application publishes the right events, with the right schema, to the right topic, under all conditions.

Using Testcontainers for real Kafka in CI

// Java (Spring Boot) — Testcontainers Kafka integration test
@Testcontainers
@SpringBootTest
class KiwiSaverContributionProducerTest {

  @Container
  static KafkaContainer kafka = new KafkaContainer(
    DockerImageName.parse("confluentinc/cp-kafka:7.6.0")
  );

  @Test
  void givenValidContribution_whenSubmitted_thenEventPublishedToTopic() {
    // Arrange
    var contribution = new ContributionRequest("MBR-001", "EMP-99", 450.00, "EMPLOYEE");

    // Act
    contributionService.submit(contribution);

    // Assert — consume the event and verify contents
    var records = testConsumer.poll(Duration.ofSeconds(5));
    assertThat(records).hasSize(1);
    var event = records.iterator().next().value();
    assertThat(event.getMemberId()).isEqualTo("MBR-001");
    assertThat(event.getAmountNZD()).isEqualTo(450.00);
    assertThat(event.getContributionType()).isEqualTo(ContributionType.EMPLOYEE);
  }

  @Test
  void givenNegativeAmount_whenSubmitted_thenNoEventPublished() {
    // Validation should reject before producer fires
    assertThrows(ValidationException.class, () ->
      contributionService.submit(new ContributionRequest("MBR-001", "EMP-99", -50.00, "EMPLOYEE"))
    );
    assertThat(testConsumer.poll(Duration.ofSeconds(2))).isEmpty();
  }
}

What to test on the producer side

Happy path: valid input → correct event on correct topic with correct schema version
Validation rejection: invalid input → no event produced, appropriate exception thrown
Partition key: events for the same member always land on the same partition (ordering guarantee)
Schema registration: producer registers schema on first publish; subsequent publishes use cached schema ID
Producer failure handling: what happens if Kafka is unavailable — does the producer retry, queue locally, or surface an error to the caller?

7 Consumer Testing

Consumer tests verify that your application processes incoming events correctly — including the invariant that processing the same event twice does not corrupt data.

Idempotency: the most important consumer test

Kafka guarantees at-least-once delivery. This means a consumer may receive the same event more than once — during a rebalance, a consumer restart, or a broker failover. Your consumer must be idempotent: processing the same message twice must produce the same result as processing it once.

// Test: duplicate event does not double-credit KiwiSaver account
@Test
void givenContributionEvent_whenProcessedTwice_thenBalanceIncrementedOnlyOnce() {
  var event = buildEvent("MBR-001", 450.00, "EVENT-UUID-9812");

  contributionConsumer.consume(event);
  contributionConsumer.consume(event); // simulates at-least-once duplicate

  var balance = memberRepository.getBalance("MBR-001");
  // Idempotency key (EVENT-UUID-9812) prevents double processing
  assertThat(balance.getKiwiSaverContributions()).isEqualTo(450.00);
}

Consumer test checklist

Test scenario	What you are verifying
Valid event → correct state change	Core business logic: event processed, database updated correctly
Same event twice → no duplication	Idempotency key or upsert logic prevents double-processing
Out-of-order events	Consumer handles event N+1 arriving before event N (if partition key is consistent, this should not happen; if not, verify graceful handling)
Schema version mismatch	Consumer receives old-version message → should either deserialize with defaults or route to DLQ
Malformed payload (poison pill)	Consumer does not crash; message routed to DLQ; offset committed so processing continues
Consumer group rebalance	No messages lost or double-processed during rebalance (requires integration test)

8 Dead Letter Queues and Poison Pill Testing

A dead letter queue (DLQ) is a Kafka topic where messages go when a consumer cannot process them. Without a DLQ, a poison pill message (one the consumer cannot deserialize or process) blocks the entire partition — the consumer cannot advance the offset, and all subsequent messages are stuck.

DLQ test scenarios

Poison pill routing: publish a message with an unrecognised schema version. Assert it lands in the kiwisaver.contributions.dlq topic, NOT that the consumer crashes or hangs.

Partition unblocking: after the poison pill is routed to DLQ, assert that the next valid message in the same partition is processed normally. The consumer must not stall.

DLQ message enrichment: many teams enrich DLQ messages with the original topic, partition, offset, and error cause. Verify the DLQ message contains this metadata — it makes manual remediation possible.

DLQ monitoring alert: this is the test most teams miss. Publish a message to the DLQ topic and verify your monitoring system (Datadog, PagerDuty, or similar) fires an alert within the SLA window. An unmonitored DLQ is not a safety net — it is a silent graveyard.

// Test: poison pill goes to DLQ, processing continues
@Test
void givenMalformedEvent_whenConsumed_thenRoutedToDLQAndNextEventProcessed() {
  // Publish unparseable message (raw bytes, not Avro)
  producer.send(new ProducerRecord<>("kiwisaver.contributions",
    "MBR-001", "not-valid-avro".getBytes()));

  // Publish a valid message after the poison pill
  producer.send(buildValidEvent("MBR-002", 300.00));

  // Wait for consumer to process
  await().atMost(10, SECONDS).until(() ->
    dlqConsumer.poll(Duration.ZERO).count() == 1
  );

  // DLQ has the poison pill
  var dlqRecord = dlqConsumer.poll(Duration.ofSeconds(2)).iterator().next();
  assertThat(new String(dlqRecord.value())).contains("not-valid-avro");

  // Valid message after it was also processed
  assertThat(memberRepository.getBalance("MBR-002").getContributions()).isEqualTo(300.00);
}

9 End-to-End Saga Testing

A saga is a business transaction that spans multiple services via events. Each step publishes an event that triggers the next service. If any step fails, compensating events roll back the preceding steps.

NZ example: IRD tax refund saga

When a taxpayer's return is assessed and a refund is due, a chain of events fires across three services:

Assessment Service publishes ReturnAssessed event (refund: $2,340 NZD)
Payment Service consumes it, initiates bank transfer, publishes PaymentInitiated
Notification Service consumes it, sends SMS to taxpayer, publishes NotificationSent
Ledger Service consumes PaymentInitiated, records the debit, publishes LedgerUpdated

Saga tests must cover both the happy path and compensation flows:

Test scenario	Expected outcome
Happy path: valid return, bank details on file	All 4 events fire in sequence; refund arrives; SMS sent; ledger updated
Bank transfer fails (account closed)	Payment Service publishes `PaymentFailed`; Assessment Service compensates, marks return as "refund-pending-manual-review"
Notification Service is down during saga	Payment saga completes; notification retried later; no duplication of payment
Ledger Service times out	Retry with backoff; idempotency key prevents double-ledger entry
Duplicate ReturnAssessed event	Payment Service recognises duplicate (idempotency key on return ID), discards; no double refund

Tooling: Use Testcontainers with Kafka + a database container to spin up the full event pipeline in CI. For complex multi-service sagas, consider Conduktor or AKHQ for visual debugging during exploratory testing.

10 AsyncAPI: Event Contracts

OpenAPI documents REST APIs. AsyncAPI documents event-driven APIs — what topics exist, what schemas they carry, and which services publish and subscribe. It is the async equivalent of a Pact consumer contract.

# asyncapi.yml — KiwiSaver contribution events
asyncapi: '2.6.0'
info:
  title: KiwiSaver Contribution Events
  version: '3.0.0'
  description: Events published by the payroll contribution service

channels:
  kiwisaver.contributions:
    publish:
      operationId: submitContribution
      summary: Employer submits payroll contribution
      message:
        $ref: '#/components/messages/ContributionEvent'
    bindings:
      kafka:
        partitions: 12
        replicas: 3

  kiwisaver.contributions.dlq:
    subscribe:
      operationId: receiveFailedContribution
      summary: Failed contribution events for manual remediation
      message:
        $ref: '#/components/messages/DLQEnvelope'

components:
  messages:
    ContributionEvent:
      name: KiwiSaverContributionEvent
      schemaFormat: 'application/vnd.apache.avro+json;version=1.9.0'
      payload:
        $ref: './schemas/contribution-event-v3.avsc'

AsyncAPI enables:

Contract-first development: define the event schema before writing producer or consumer code
Consumer-driven contract tests: consumers publish their expected schema; producers must not introduce breaking changes
Documentation: auto-generated human-readable docs showing all topics, schemas, and publishers/subscribers
Schema drift detection: CI compares the AsyncAPI spec against the Schema Registry; fails if they diverge

11 Common Mistakes

✗ Only testing the producer — not the consumer

The producer firing successfully is the least important verification. What matters is whether the consumer received, deserialised, and correctly processed the event. Many teams have 100% producer test coverage and 0% consumer test coverage. Always write consumer tests first.

✗ Not testing for duplicate message processing

Kafka at-least-once delivery is not just a theoretical edge case. Consumer restarts, rebalances, and broker failovers all trigger duplicate delivery. Idempotency must be tested explicitly. A single test processing the same event twice will catch an entire class of production data corruption bugs.

✗ Using an in-memory Kafka mock instead of a real broker

EmbeddedKafka and most Kafka mocks do not reproduce broker behaviours: partition rebalancing, consumer group coordination, schema registry enforcement, or DLQ routing. Use Testcontainers with a real Kafka image in CI. The extra 30 seconds of startup time catches an entire class of bugs that mocks hide.

✗ Never testing what happens when the DLQ fills up

A DLQ that nobody monitors is a ticking clock. Test that your monitoring fires when message count in the DLQ topic crosses a threshold, and that the alert reaches the team within the SLA window. The absence of a DLQ alert is not safety — it is ignorance of failure.

✗ Making breaking schema changes without testing backward compatibility

Renaming a field, removing a field, or changing a field type is a breaking change. Old consumers that have not been updated will fail to deserialise the new schema. Before deploying a schema change, test the new producer schema against the old consumer code — in CI, not in production.

12 Now You Try

Three AI-graded exercises. Apply event-driven testing concepts to NZ scenarios.

Exercise 1 — Design a DLQ test strategy

Design a DLQ test strategy for the KiwiSaver contribution Kafka pipeline described in the hook. Your strategy must cover:

What events should be routed to the DLQ, and under what conditions
What metadata the DLQ message should contain for remediation teams
How to test that the partition continues processing after a poison pill
How to test the monitoring alert (what triggers it, who is notified, within what SLA)
The manual remediation process for DLQ messages (and how to test it)

Assessed by a real LLM

Exercise 2 — Identify breaking schema changes

You are reviewing a pull request that modifies the Avro schema for Te Whatu Ora patient health events. Below is the current (v2) schema and the proposed (v3) schema. Identify every breaking change, explain why it is breaking, and propose a safe migration path for each.

// v2 (current)                    // v3 (proposed)
{ "name": "patientId",            { "name": "patientNHI",  ← renamed
  "type": "string" }                "type": "string" }
{ "name": "nhsNumber",            { "name": "nhiNumber",   ← renamed
  "type": ["null","string"] }       "type": "string" }     ← null removed
{ "name": "admissionDate",        { "name": "admissionDate",
  "type": "string" }                "type": "long" }       ← type changed
                                  { "name": "ward",        ← new field (no default)
                                    "type": "string" }

Assessed by a real LLM

Exercise 3 — Write idempotent consumer tests

You are writing tests for the Payment Service in the IRD tax refund saga (described in section 9). The service consumes ReturnAssessed events and initiates bank transfers. Write three test cases:

Happy path: first-time receipt of a valid event
Idempotency: same event received twice (simulate Kafka at-least-once delivery)
Compensation: bank transfer API returns "account closed" — verify the rollback event is published

Use the Given/When/Then format. You do not need to write real code — describe what each step sets up, what action is taken, and what assertions are made.

Assessed by a real LLM

13 Self-Check

Click each question to reveal the answer.

Why must Kafka consumers be idempotent, and how do you implement idempotency?

Kafka guarantees at-least-once delivery — the same message may be delivered more than once during consumer restarts, rebalances, or broker failovers. If a consumer processes the same event twice without idempotency protection, it may corrupt data (e.g. credit a bank account twice). Implement idempotency using a unique event ID (idempotency key): before processing, check if the event ID has already been processed; if so, skip it. Alternatively, use upsert (INSERT ON CONFLICT DO NOTHING) database operations with the event ID as the primary key.

What is a poison pill message, and what should happen when a consumer receives one?

A poison pill is a message the consumer cannot deserialise or process — typically due to a schema mismatch, corrupt bytes, or an unexpected payload format. Without handling, the consumer cannot advance the partition offset and all subsequent messages are blocked. The consumer should catch the deserialisation exception, route the poison pill to a dead letter queue topic, commit the offset to unblock the partition, and continue processing the next message. The DLQ must be monitored for manual remediation.

What is the difference between at-least-once and exactly-once delivery in Kafka?

At-least-once delivery (the default) means every message is delivered at minimum once, but duplicates are possible. Exactly-once delivery (available with Kafka transactions and idempotent producers) means each message is processed precisely once, even in the event of retries or failures. Exactly-once has higher latency and complexity. Most NZ teams use at-least-once with idempotent consumers — the consumer handles duplicates — rather than paying the performance cost of exactly-once at the broker level.

What is an AsyncAPI document, and how does it relate to schema testing?

AsyncAPI is an open specification (similar to OpenAPI for REST) that documents event-driven APIs: which topics exist, what schema each message uses, and which services are publishers or subscribers. In a CI pipeline, AsyncAPI can be used to validate that the actual schema in the Schema Registry matches the AsyncAPI specification — catching schema drift before it reaches production. It also enables consumer-driven contract tests, where consumers publish their required schema and producers must not break it.

Name three Kafka test scenarios that a Testcontainers-based integration test covers that an in-memory Kafka mock cannot.

1. Consumer group rebalancing — only a real Kafka broker manages consumer group coordination; mocks do not reproduce partition rebalancing behaviour. 2. Schema Registry enforcement — a real Confluent Schema Registry validates Avro schema compatibility; most mocks bypass schema validation entirely. 3. DLQ routing configuration — the error handler that routes poison pills to the DLQ topic is a Kafka Streams or Spring Kafka configuration that only takes effect against a real broker. Mocks typically call the consumer directly, skipping this layer.

14 ISTQB Mapping

ISTQB CTAL-TA (Certified Tester Advanced Level — Test Analyst)

Chapter 3 — Structure-Based Testing: Integration testing patterns, interface testing, system integration
Chapter 4 — Defect-Based Testing: Fault injection, error handling verification

ISTQB CTAL-DevOps (Advanced DevOps Tester)

Chapter 2 — Continuous Testing: Async system test automation, integration testing in CI pipelines
Chapter 5 — Test Environments: Containerised test environments (Testcontainers), infrastructure-as-code for test brokers

ISTQB CT-TAS (Test Automation Strategies)

Test isolation patterns, test data management for event-driven systems, contract testing strategies

← GitHub Actions for QA Back to Specialised →