20 min read · 9 self-checks · Updated June 2026

Structural / Integration · CTFL 4.0, CTAL-TA

Webhook Testing

Test asynchronous event-driven APIs where external services call your endpoints to notify you of events. Webhooks are unreliable by nature — messages can be delayed, retried, or lost entirely. You need to test accordingly.

Senior Test Lead ISTQB CTFL 4.0 · CTAL-TA

1 The Hook

A Wellington courier startup integrates a payment provider. When a customer pays, the provider POSTs a charge.succeeded webhook to the startup’s endpoint, which marks the parcel as paid and books the pickup. In testing everything looks perfect: pay, parcel goes green, pickup booked. Ship it.

A fortnight later, drivers start turning up to collect parcels that customers swear they never paid for, while other customers who definitely paid are stuck on “awaiting payment”. The cause: under real network conditions the startup’s endpoint was occasionally slow to return a 200, so the provider timed out and retried. The handler had no idempotency, so a single payment booked two pickups. And when the handler was slow, the provider eventually gave up and the paid parcel never went green at all. None of this showed up in the tidy happy-path test, because that test fired one webhook, once, with no delay and no retry.

This is the trap with webhooks: the feature works when each event arrives once, in order, and instantly, and in production events almost never arrive that way. The defects live in the retries, the duplicates, the out-of-order arrivals, and the timeouts. A test that sends one clean event can never see them.

Senior Engineer Insight

The mistake I see most often is teams that test idempotency by sending the webhook twice in sequence and checking the outcome once. That is not idempotency testing — that is serial deduplication. The real failure mode is concurrent delivery: your handler is still processing the first request when the retry arrives because your endpoint ran slow and the provider timed out. I have seen this take down a NZ insurance platform mid-batch because the handler was doing synchronous document generation before returning 200. Two threads, same order, both past the idempotency check simultaneously. Return 200 in under two seconds, always. Put everything else on a queue. Test the concurrent case explicitly, not just the sequential one.

2 The Rule

A webhook is an unreliable, externally-controlled event — so never test only the single clean delivery. Test the duplicate, the out-of-order arrival, the slow handler, the invalid signature, and the retry, because that is where the real defects live.

3 The Analogy

Analogy

A courier leaving a parcel on the doorstep.

The courier (the external system) doesn’t wait for you to confirm you got the parcel — they drop it and drive off. If the tracking app doesn’t register the delivery, they come back and leave another one, so now you have two. If you’re out, the parcel might sit there for hours before you notice. And anyone walking past could leave a fake parcel on your step that looks just like a real one. You don’t control when it arrives, whether it arrives twice, or whether it’s genuine.

Testing webhooks is checking what your household does with that doorstep: do you take in two identical parcels as one order, or two? Do you accept a parcel with no sender label? Do you still notice the one that’s been sitting there since the morning? The single tidy delivery is never the interesting case.

What it is

A webhook is a callback — when Event X happens in an external system (payment processed, file uploaded, subscription cancelled), that system makes an HTTP POST request to your endpoint to notify you. Unlike APIs where you pull data on demand, webhooks push data to you when something important happens.

Webhook testing verifies that your system correctly receives, processes, and acts upon webhook notifications. It’s an integration test: you’re checking that an external system and your system synchronise correctly even when messages are delivered out of order, delayed, or duplicated.

Webhooks are inherently unreliable. Networks fail. Your endpoint goes down. Retries can result in duplicate messages. The external system may not know whether the webhook was delivered, so it keeps retrying. Testing must account for all these failure modes.

Why webhooks are tricky to test

Asynchronous timing: You send a request to create a resource in an external system. The system returns 200 immediately. But the webhook is delivered 100ms later, or 5 seconds later, or after a retry following a timeout. Your test must wait for the event, not assume it happened instantly.

External control: You can’t trigger the webhook directly — the external system controls when it fires. You can trigger the action (charge a card, upload a file), but you must wait and observe the webhook arriving.

Duplicate delivery: If your endpoint is slow to respond, the external system may retry. Your system may now see the same webhook twice. Is your application idempotent? Does it create two records or one?

Order of delivery: Three webhooks may arrive out of order due to retries and network delays. Does your system handle them correctly?

Webhook flow: trigger → POST → handler → verification

A typical webhook flow has four stages:

Webhook anatomy

Stage	What happens	Test focus
1. Trigger	Event occurs in external system (payment gateway processes a charge)	Can you trigger the webhook from the external system? Does it fire at all?
2. Delivery	External system POSTs JSON payload to your endpoint (e.g. https://yourapp.co.nz/webhooks/payment)	Does your endpoint receive it? Is the HTTP status code 200? Is the payload what you expected?
3. Processing	Your system parses the payload, validates it, and takes action (update order status, send confirmation email)	Did the right thing happen? Order status updated? Email sent? Database record created?
4. Acknowledgement	Your endpoint returns 200 to signal success, or 5xx to signal failure (triggering retry)	Does your endpoint return the right status? Does it respond within the provider’s timeout?

Test scenarios every webhook system needs

Happy path

Webhook arrives, system processes it correctly, and acts accordingly. User sees the expected outcome (order confirmed, payment recorded, notification sent).

Retry and duplicate handling

Your endpoint times out or returns 5xx. The external system retries 3 times over the next hour. Only one action should happen, not three. This requires idempotency — usually via a unique webhook ID that your system tracks.

Test: Send the same webhook payload twice (with the same ID). Verify that the side effect (charge, update, send email) happens exactly once.
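The test above can be sketched in Python as follows. This is a minimal illustration, assuming a hypothetical in-memory handler: `handle_webhook`, `seen_ids`, `emails_sent`, and the payload fields are made-up names, not any provider's API. A real handler would normally record seen IDs in a database rather than in memory.

```python
# Hypothetical in-memory handler, used only to illustrate the duplicate test.
seen_ids = set()      # webhook IDs that have already been processed
emails_sent = []      # stand-in for the real side effect


def handle_webhook(payload: dict) -> int:
    """Process a webhook once per ID and return an HTTP-style status code."""
    webhook_id = payload["id"]
    if webhook_id in seen_ids:
        return 200  # duplicate: acknowledge it, but do nothing
    seen_ids.add(webhook_id)
    emails_sent.append(payload["order_id"])  # the side effect under test
    return 200


event = {"id": "evt_001", "type": "charge.succeeded", "order_id": "ORD-42"}
first = handle_webhook(event)
second = handle_webhook(event)  # simulated retry carrying the same ID
```

Both deliveries are acknowledged, but the side effect list should contain one entry. Note that this covers sequential duplicates only; concurrent delivery needs its own test.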

Invalid payload handling

Webhook arrives but is malformed: missing required fields, wrong data types, or values out of range. Your system should log the error and return 400 or 422, not crash or silently ignore it. (An invalid signature is an authentication failure and is better rejected with 401 or 403.)

Test: Send a webhook with a missing field, a negative price, or a corrupted signature. Verify the error is logged and the system doesn’t process it.
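A rough sketch of the validation step in Python, assuming a hypothetical three-field schema. `REQUIRED_FIELDS`, `validate_payload`, and the field names are illustrative only; a real handler would validate against the provider's documented payload.

```python
# Assumed schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"id": str, "type": str, "amount_cents": int}


def validate_payload(payload: dict) -> tuple[int, str]:
    """Return (status_code, reason) for an already-parsed webhook body."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            return 400, f"missing field: {field}"
        if not isinstance(payload[field], expected_type):
            return 422, f"wrong type for field: {field}"
    if payload["amount_cents"] < 0:
        return 422, "amount_cents must not be negative"
    return 200, "ok"


valid = {"id": "evt_001", "type": "charge.succeeded", "amount_cents": 1500}
missing_field = {"id": "evt_002", "type": "charge.succeeded"}
negative_price = {"id": "evt_003", "type": "charge.succeeded", "amount_cents": -5}
```

Each negative case should come back with a 4xx status and a reason that can be logged, and the handler should take no further action on it.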

Out-of-order delivery

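Three webhooks arrive in the wrong order: webhook 3, webhook 1, webhook 2. Does your system handle them correctly? Doing so usually requires a timestamp or sequence number carried in the event itself.

Test: Send the events in reverse or shuffled order and verify that the final state reflects the latest event, not the last one to arrive.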
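One way to picture the out-of-order scenario is the sketch below. It assumes each event carries a `sequence` number and the full current status, which is an assumption about the provider; `apply_event` and `orders` are hypothetical names.

```python
orders = {}  # order_id -> {"status": ..., "sequence": ...}


def apply_event(event: dict) -> bool:
    """Apply an event only if it is newer than the state already stored."""
    current = orders.get(event["order_id"])
    if current is not None and event["sequence"] <= current["sequence"]:
        return False  # stale or duplicate event: ignore it
    orders[event["order_id"]] = {
        "status": event["status"],
        "sequence": event["sequence"],
    }
    return True


# Deliver webhook 3, then 1, then 2, as in the scenario above.
events = [
    {"order_id": "ORD-42", "sequence": 3, "status": "shipped"},
    {"order_id": "ORD-42", "sequence": 1, "status": "created"},
    {"order_id": "ORD-42", "sequence": 2, "status": "paid"},
]
results = [apply_event(e) for e in events]
```

The order should end up as "shipped" even though "paid" arrived last. This last-writer-wins approach only fits events that describe the whole state; events that each carry a partial change need a different strategy.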

Timeout and slow processing

Your endpoint takes 8 seconds to process the webhook. The external system has a 5-second timeout. The system retries. You now have concurrency: the first request is still processing while the retry arrives.

Test: Add artificial delays to your webhook handler and verify that retries don’t cause race conditions.
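A possible shape for that test in Python, using threads to deliver the same webhook twice at the same moment. All names here are hypothetical, and the lock-protected `claim` function is a stand-in for whatever atomic check the real system uses (commonly a unique constraint in the database).

```python
import threading
import time

claimed_ids = set()
claim_lock = threading.Lock()
bookings = []  # stand-in for the real side effect


def claim(webhook_id: str) -> bool:
    """Atomically record the ID; only the first caller gets True."""
    with claim_lock:
        if webhook_id in claimed_ids:
            return False
        claimed_ids.add(webhook_id)
        return True


def slow_handler(payload: dict, delay_s: float = 0.2) -> int:
    if not claim(payload["id"]):
        return 200  # another delivery already owns this event
    time.sleep(delay_s)  # artificial delay standing in for slow processing
    bookings.append(payload["id"])
    return 200


def deliver_concurrently(payload: dict, copies: int = 2) -> list:
    """Fire identical deliveries at the handler at the same moment."""
    barrier = threading.Barrier(copies)
    statuses = []

    def worker():
        barrier.wait()  # release all threads together
        statuses.append(slow_handler(payload))

    threads = [threading.Thread(target=worker) for _ in range(copies)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return statuses


statuses = deliver_concurrently({"id": "evt_777"})
```

If the check and the insert in `claim` were not atomic, both threads could pass the check before either recorded the ID, and `bookings` would contain two entries. That is the race this test is meant to expose.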

Tools and setup

Webhook testing platforms

webhook.site (free) — creates a unique URL that captures all HTTP requests sent to it. Inspect the request method, headers, body, and timing. No code required. Good for quick exploration and seeing what data your payment gateway actually sends.

RequestBin (free) — similar to webhook.site; older but reliable. Captures and displays webhook payloads.

ngrok (free tier available) — tunnels your local development server to a public HTTPS URL. Allows external systems to POST to your local machine without deploying. Essential for testing during development.

ngrok setup example

Suppose your local webhook endpoint is http://localhost:8080/webhooks/payment. To make it accessible to a payment gateway:

1. Install ngrok: brew install ngrok (or download from ngrok.com)
2. Run: ngrok http 8080
3. ngrok outputs a public URL like https://abc123.ngrok.io
4. Register https://abc123.ngrok.io/webhooks/payment with the payment gateway
5. When the gateway sends a webhook, ngrok tunnels it to your local http://localhost:8080/webhooks/payment
6. You see every request in the ngrok dashboard and in your local logs

This is invaluable during development — you can trigger events in the external system and watch the webhook arrive in your local app in real time, without pushing to staging.

NZ worked example: payment processor webhook

A NZ e-commerce site uses Stripe (or a similar payment processor) to handle card payments. When a charge succeeds, Stripe sends a webhook to your system:

Stripe charge.succeeded webhook test scenarios

Test	Setup	Expected outcome
Happy path: payment succeeds	Charge a test card in Stripe dashboard. Webhook fires: `event.type: "charge.succeeded"`	Order status changes to "paid". Confirmation email sent. Inventory decremented. Payment ID stored in database.
Duplicate webhook	Intercept the webhook (with ngrok logs). Manually send it again with the same webhook ID.	Order status is already "paid". Email not sent again. Inventory not decremented again. Idempotency key prevents double-processing.
Webhook arrives before payment API call	The webhook is delivered, your handler queries the Stripe API for payment details, and that query times out.	System retries the query or falls back gracefully. Order is not left in a limbo state.
Invalid signature	Send a webhook with a tampered body or corrupted HMAC signature.	Webhook rejected with 401/403. Error logged. Order not updated.

Security testing: HMAC, replay attacks, TLS

HMAC signature verification

Webhooks must be verified to ensure they came from the external system, not an attacker. Many providers (Stripe, GitHub, Shopify, etc.) include an HMAC signature in a request header (e.g. Stripe-Signature). Your endpoint must verify this signature against the raw request body before processing.

Test: Send a webhook with a valid payload but an invalid signature. Your endpoint should reject it.

Test: Send a webhook with a valid signature but a tampered body (e.g. changed the amount). Your endpoint should reject it.
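Both tests can be sketched with Python's standard `hmac` module. This assumes a generic scheme in which the signature is the hex HMAC-SHA256 of the raw body; real providers define their own header format and signed string, so treat `sign`, `is_valid_signature`, and the secret as illustrative and prefer the provider's own library in production.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_valid_signature(secret: bytes, body: bytes, received: str) -> bool:
    expected = sign(secret, body)
    # compare_digest compares in constant time to avoid timing leaks
    return hmac.compare_digest(expected, received)


secret = b"test_secret"  # hypothetical shared secret
body = b'{"id":"evt_001","amount_cents":1500}'
good_signature = sign(secret, body)
bad_signature = "0" * 64
tampered_body = b'{"id":"evt_001","amount_cents":1}'
```

A valid body with `bad_signature` and a `tampered_body` with `good_signature` should both be rejected; only the untouched body with its own signature should pass.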

Replay attack prevention

An attacker captures a webhook (e.g. payment received) and resends it repeatedly. Without replay protection, your system records the same payment or fulfils the same order multiple times.

Test: Capture a webhook with a network sniffer or your logs. Replay it 10 times. Verify that idempotency (via webhook ID) prevents duplicate processing.

TLS/SSL validation

Ensure your webhook endpoint is HTTPS-only. If the external system can reach your endpoint via plain HTTP, an attacker on the network path can read or alter the webhook in transit.

Test: Register your webhook endpoint as http:// (not https://). Some systems will reject it. Others might not. Check the documentation.

Common bugs in webhook systems

No signature verification — endpoint accepts any POST, allowing attackers to forge notifications
No idempotency — duplicate webhooks cause duplicate orders, charges, or emails
Slow endpoint timeout — webhook handler takes 10 seconds to complete, but external system times out at 5 seconds, causing unnecessary retries
No logging — webhook arrives, is processed incorrectly, and there’s no audit trail of what happened
Wrong HTTP status — endpoint returns 200 even when processing failed, so the external system doesn’t retry
Race condition on first webhook — webhook arrives before the order is fully created in your database, causing a foreign key error

Tips

Use webhook.site first. Before writing any test code, paste a webhook.site URL into the external system and trigger an event. See what data actually arrives. Often the payload structure is different from what you expected, or fields are missing.

Log every webhook — even if processing succeeds. Log the webhook ID, timestamp, event type, and outcome. This is invaluable for debugging.
Test with ngrok during dev — don’t wait until staging. Catch webhook issues early when you can see them in your local logs.
Implement idempotency via webhook ID — track webhook IDs in your database and skip processing if you’ve seen this ID before
Return 200 quickly — do lightweight validation and return 200 to the webhook immediately. Put heavy processing (email sending, API calls) in a background queue.
Test with intentional delays — wrap your webhook handler in a sleep(5000) and trigger the webhook. Does the external system retry? Does your app handle it?
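The "return 200 quickly" tip might look like this in outline. It is a sketch under stated assumptions: `receive`, `drain_queue`, and the in-process `queue.Queue` are hypothetical stand-ins for a real endpoint and a real background job system.

```python
import queue

work_queue = queue.Queue()
processed = []  # stand-in for the heavy side effects


def receive(payload: dict) -> int:
    """Do only cheap checks, enqueue the work, and acknowledge."""
    if "id" not in payload:
        return 400
    work_queue.put(payload)
    return 200


def drain_queue() -> None:
    """Stand-in for a background worker doing the heavy processing."""
    while not work_queue.empty():
        payload = work_queue.get()
        processed.append(payload["id"])  # e.g. send email, call other APIs
        work_queue.task_done()


status = receive({"id": "evt_900", "type": "charge.succeeded"})
queued_before_drain = work_queue.qsize()
drain_queue()
```

The point of the split is that the acknowledgement no longer depends on how long the side effects take, so a test can assert the status code and the eventual side effect separately.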

4 Industry Reality

🏭 What you actually encounter on the job

Webhooks are rarely documented accurately. The payload you receive in production often differs from what the API docs show — extra fields appear, promised fields are missing, field names use inconsistent casing. Always point webhook.site at the real system before writing a single test, and treat the live payload as the source of truth.
Idempotency is almost never implemented first time. Development teams build the happy path, ship it, and only add idempotency logic after a customer complains about a duplicate charge. As a tester you will regularly find this gap — budget time to explore it even when the requirement says nothing about it.
ngrok is fine for dev; it does not exist in CI. In practice, webhook integration tests get skipped or manually triggered because nobody has wired up a stable tunnel or mock server in the pipeline. Senior testers push for a local mock (e.g. a simple Express route or WireMock) so the tests can run headlessly in CI without ngrok.
NZ teams using LedgerNZ, Stripe, or POLi see webhook bugs constantly. LedgerNZ webhooks carry a lastModifiedUtcDateTime that many handlers ignore, causing race conditions. POLi notifies on redirect, so the webhook can fire before the browser redirect completes — a common source of “payment confirmed but order still pending” bugs in NZ e-commerce.
Time pressure kills the failure-mode tests. Under sprint pressure, duplicate-delivery and out-of-order tests are the first to be cut. A senior tester negotiates their inclusion as acceptance criteria before the sprint starts, not after — because getting them in post-sprint is almost impossible once the feature is declared done.
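As an illustration of the local-mock idea above, the sketch below uses only the Python standard library: a throwaway HTTP server plays the handler under test, and `fire_webhook` plays the provider. The class, function, and path names are hypothetical, and a real suite would point `fire_webhook` at the application's actual endpoint.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # payloads the handler under test has seen


class WebhookHandler(BaseHTTPRequestHandler):
    """Hypothetical handler under test: records the payload, returns 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # keep CI output quiet
        pass


def fire_webhook(url: str, payload: dict) -> int:
    """Play the provider's role: POST a deterministic payload."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Empty ProxyHandler so loopback traffic ignores any configured proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return response.status


server = HTTPServer(("127.0.0.1", 0), WebhookHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/webhooks/payment"
status = fire_webhook(url, {"id": "evt_ci_1", "type": "charge.succeeded"})
server.shutdown()
server.server_close()
```

Because nothing here depends on a tunnel or a provider sandbox, a test like this can run headlessly on every commit.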

5 When to Use It — and When Not To

⚡ Decision guide

✓ Use it when

Your system receives async notifications from an external platform (payment gateway, shipping provider, government API, LedgerNZ, Stripe, Shopify)
The external system has a retry policy — duplicate delivery is possible, so idempotency must be verified
The webhook handler triggers irreversible side effects: charging a card, sending an email, booking a job, dispatching a courier
The feature has security implications — unauthenticated webhooks allow attackers to forge payment confirmations or trigger free fulfilment
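You are building on a platform that documents its delivery behaviour (most do): Stripe retries for up to 3 days, LedgerNZ retries for 72 hours, and GitHub does not retry automatically but lets you redeliver failed deliveries manually or through its API. Your tests need to cover the behaviour your provider documents.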

✗ Skip it when

Your system only sends webhooks outbound and the receiving system is fully owned by another team — test your send logic, not their receive logic
The integration is a one-off internal event with no retry, no external party, and no irreversible side effect — a simple API call or queue message is cheaper to test
The external system provides a reliable testing SDK that already handles delivery simulation — no need to recreate what the SDK already mocks
The “webhook” is actually a polling loop in disguise — if your team checks for new events every 30 seconds via GET, test it as a polling API, not a push scenario
You are under extreme time pressure and the webhook only carries non-critical notifications (e.g. a Slack digest with no financial or state-change implications) — basic happy-path coverage is acceptable, but document the risk explicitly

6 Context Guide — Where This Technique Fits

Webhook testing effort and priority vary significantly by organisation type. Use this table to calibrate how deeply to invest based on your context.

Context	Priority	Why
Fintech / payments e.g. Harbour Bank NZ, Stripe integrations, POLi	Essential	Every webhook failure mode has a direct financial consequence — duplicate charges, missed payments, forged disbursements. Signature verification, idempotency, and slow-handler testing are non-negotiable. POLi’s redirect-then-notify pattern means the webhook can arrive before the browser completes, so race conditions are endemic without explicit testing.
Government / public sector e.g. CoverNZ, Revenue NZ, Benefits NZ integrations	Essential	Webhook side effects (disbursing CoverNZ compensation, filing tax data with Revenue NZ) are irreversible and auditable. Duplicate delivery or forged notifications carry regulatory and legal risk. Full failure-mode testing — including concurrent duplicate delivery — should be in acceptance criteria before the sprint starts, not added post-ship.
E-commerce / retail e.g. Shopify NZ merchants, LedgerNZ-integrated retailers	High	Order fulfilment, stock decrement, and customer emails are all triggered by webhooks. Duplicates cause double fulfilment or double emails; missed webhooks leave orders in limbo. LedgerNZ’s `lastModifiedUtcDateTime` field is commonly ignored, causing race conditions in reconciliation. Idempotency and out-of-order testing are high-value investments.
SaaS / platform integrations e.g. NZ SaaS teams using GitHub, Slack, HubSpot webhooks	Medium	Side effects are real but often recoverable (a duplicate Slack notification or a double CRM entry is annoying but not catastrophic). Signature verification is still required — forged GitHub webhooks can trigger CI pipelines. Idempotency and slow-handler tests are worth including for any webhook that creates records or triggers external actions.
Internal / non-critical notifications e.g. internal monitoring alerts, team Slack digests	Low	No financial or state-change implications. A duplicate webhook just means a second Slack message. Happy-path delivery confirmation is sufficient, but document that the failure-mode tests were explicitly descoped and record the risk owner — so the decision is visible, not invisible.
Utilities / infrastructure e.g. Meridian, Vector smart-meter or SCADA integrations	High	Billing records and meter readings derived from webhook events are difficult to correct after the fact. Duplicate meter readings without idempotency produce incorrect bills. Out-of-order readings corrupt usage history. Slow-handler tests matter because batch meter-data uploads can stall handler threads mid-delivery.

Trade-offs

What you gain and what you give up when you choose Webhook Testing.

Advantage	Disadvantage	Use instead when…
Catches the defects that actually occur in production — retries, duplicates, forged payloads, and race conditions are invisible to any other test type.	Requires infrastructure: a live or sandbox external system, ngrok or a mock server, and careful timing to reproduce async behaviour reliably.	Your system only sends webhooks outbound and does not receive them — test your send logic with unit or integration tests instead.
Validates security properties — signature verification, replay prevention — that no unit test can verify because they require the real HTTP handshake and header inspection.	Slow to run end-to-end. Network round-trips, provider sandbox latency, and retry windows (minutes to hours) make full webhook regression suites impractical in rapid CI pipelines.	You need fast feedback on every commit — run a local mock server that fires deterministic webhooks at your handler instead of depending on a live provider sandbox.
Forces idempotency to be designed in early — raising duplicate-delivery tests in sprint planning is far cheaper than retrofitting idempotency logic after a production double-charge incident.	Hard to reproduce timing-sensitive failures deterministically. Concurrent duplicate delivery (two identical webhooks arriving within milliseconds) is the most dangerous failure mode and requires deliberate tooling or load injection to trigger reliably.	You are testing a pure polling integration where your app calls the external system on a schedule — use API integration testing rather than webhook testing; push/pull semantics are different.
Provides an audit trail that supports incident investigation — a webhook test suite that asserts the log entry alongside the side effect gives you a forensic record when a customer disputes a charge or a booking.	Scope creep risk — webhook failure modes are numerous, and without a prioritised test strategy teams either over-invest in low-risk notifications or under-invest in high-risk payment webhooks.	Message shape and schema correctness is your primary concern (rather than runtime delivery behaviour) — use contract testing with a tool like Pact to verify field names, types, and structure between producer and consumer.

Enterprise reality

How Webhook Testing changes at 200–300-developer scale in NZ enterprise

Governance replaces individual judgement. At small-team scale a senior tester decides which webhook failure modes to cover. At enterprise scale (KiwiFirst Bank, CloudBooks, or Revenue NZ running 20+ squads) that decision is captured in a mandatory test policy: every webhook that carries a financial or state-change side effect must have idempotency, signature-verification, and slow-handler tests in its definition of done. A squad that tries to ship without them is blocked at the release gate rather than just receiving a risk comment.
The Privacy Act 2020 and NZISM create hard logging requirements. Revenue NZ’s webhook integrations with third-party tax software must produce an immutable audit trail of every notification received, because the Privacy Act 2020 and the NZ Information Security Manual (NZISM) require demonstrable evidence that personal data was handled correctly. Enterprise webhook test suites assert the audit log entry alongside the side effect — a webhook that processed correctly but left no log record is a test failure.
Tooling shifts from ngrok to platform-managed mock infrastructure. ngrok is a developer tool; it does not exist in an enterprise CI pipeline. Organisations at this scale run WireMock or Mockoon instances in their Kubernetes cluster, or use provider-specific test harnesses (Stripe’s CLI, GitHub’s webhook replay API). These fire deterministic, parameterised webhook payloads — including timed duplicates for concurrency tests — against every PR. Teams at TechServNZ and TeleNZ standardise on a shared webhook-mock service so test parity exists across all squads consuming the same upstream events.
Cross-squad coordination turns webhook failures into incidents, not bugs. When 10+ squads consume the same event stream — common in Harbour Bank’s open banking platform or HealthNZ’s patient-event bus — a schema change in the upstream webhook silently breaks every consumer simultaneously. Enterprise teams mandate consumer-driven contract tests (Pact) published to a shared broker, so any breaking payload change is caught before deployment rather than as a cascade of production failures across squads that didn’t know the schema had changed.

⚖ Judgment — How Experienced Testers Decide

Three scenarios drawn from NZ organisations — and the reasoning behind each call.

Scenario 1 — CoverNZ claims portal, payment.disbursed webhook

Situation: A CoverNZ contractor asks you to test a new webhook handler. The sprint ticket says “verify happy path.” Side effects are irreversible: disbursing CoverNZ compensation once it fires cannot be undone. The government payment system’s SLA documents a 72-hour retry window.

I would push back on the sprint scope before the sprint starts, not after. I would raise duplicate delivery (idempotency), HMAC signature rejection, and slow-handler behaviour as minimum acceptance criteria. “Happy path only” for an irreversible CoverNZ disbursement is not a risk the team can accept silently — and a 72-hour retry window means duplicates are a documented certainty, not a theoretical edge case. I would frame this as a business risk conversation, not a technical one: “if this handler disburses the same payment twice, who owns that?”

Scenario 2 — LedgerNZ-integrated Wellington retailer, invoice reconciliation webhook

Situation: A mid-size Wellington retailer’s accounting team uses LedgerNZ. The dev team has wired up an invoice.updated webhook that syncs invoice state into the retailer’s ERP. The developer says “LedgerNZ only sends each webhook once, so idempotency isn’t needed.” You are reviewing the test plan three days before go-live.

I would ask the developer to show me LedgerNZ’s documented delivery guarantee. LedgerNZ’s own developer docs state at-least-once delivery and a retry window of 72 hours — the developer’s assumption is factually incorrect. Three days before go-live, I would not try to retrofit idempotency into the existing handler (too risky). Instead, I would add a documented known risk to the test plan, raise it to the product owner, and agree a monitoring plan: alert on duplicate invoice sync events for 30 days post-launch so the team can detect and manually correct any duplicates before they accumulate.

Scenario 3 — Meridian Energy smart-meter billing, meter-reading-submitted webhook

Situation: You are testing a smart-meter data pipeline. The provider sends a meter-reading-submitted webhook; the handler creates a billing record and queues a nightly usage calculation. The product manager says full failure-mode testing is out of scope because “we can just correct any billing errors manually.” The volume is 400,000 meter readings per night.

I would escalate with numbers, not principles. At 400,000 readings per night, even a 0.1% duplicate rate produces 400 incorrect billing records nightly — 12,000 per month. “Correct manually” at that scale is not a plan; it is a customer service catastrophe and a regulatory risk under the Electricity Industry Act. I would quantify the correction cost, present it to the product manager alongside the estimated time to implement idempotency (usually two to four hours for a standard webhook ID deduplication table), and ask them to make the call explicitly. Most product managers accept the test coverage once the true volume of downstream errors is visible.
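The arithmetic behind that escalation, as a quick check. The 30-night month is an assumption; the volume and duplicate rate come from the scenario.

```python
readings_per_night = 400_000
duplicate_rate = 0.001   # 0.1% of readings delivered twice
nights_per_month = 30    # assumed month length

duplicates_per_night = round(readings_per_night * duplicate_rate)
duplicates_per_month = duplicates_per_night * nights_per_month
```

That gives 400 incorrect records per night and 12,000 per month, which is the figure to put in front of the product manager.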

Bottom line: The judgment call is almost always about side-effect severity and retry probability, not about testing effort. When side effects are irreversible (disbursements, billing, fulfilment) and the provider documents retries (they all do), full webhook failure-mode testing is not optional — it is the minimum responsible baseline. When side effects are recoverable and volume is low, a conscious descope with documented risk acceptance is a legitimate call. The mistake is descoping without the conversation.

9 Best Practices

✓ What experienced testers do

✓ Capture the real payload before writing any test. Use webhook.site or ngrok inspector to capture the actual payload from the live (or sandbox) system. Verify field names, nesting, and data types against the API docs — discrepancies are common and will invalidate your entire test suite if undetected.
✓ Always test duplicate delivery with the same webhook ID. Replay the exact same payload twice. Assert that every side effect (charge, email, stock movement, record creation) happens exactly once. This is the single most common production bug in webhook systems.
✓ Test the rejection path before the happy path. Send a webhook with a bad HMAC signature first. If your endpoint processes it, the happy path passing means nothing from a security standpoint — the gate is open.
✓ Add a deliberate slow-handler test. Wrap the handler in an artificial delay that exceeds the provider’s documented timeout. Verify that the retry does not produce a race condition or a duplicate side effect. This is rarely covered in sprint tickets but catches serious production bugs.
✓ Assert the audit log, not just the outcome. Check that every webhook arriving at your endpoint is logged with its ID, timestamp, event type, and processing outcome. The audit log is the only thing that saves you when a customer disputes a charge or a duplicate booking.
✓ Use a local mock server in CI, not ngrok. ngrok is a development tool. For automated tests in CI, run a lightweight HTTP server (WireMock, Mockoon, or a minimal Node/Python route) that fires test webhooks at your handler deterministically, without relying on network tunnels.
✓ Test out-of-order delivery for anything with a sequence. If the external system can send order.created followed by order.updated, reverse the order in your test. The handler must use the event’s own timestamp, not arrival time, to determine final state.
✓ Verify the HTTP response code your handler returns. A handler that returns 200 when processing fails will never get retried, because the external system thinks it succeeded. A handler that returns 500 on success will be retried until the provider gives up, repeating the side effect each time unless the handler is idempotent. Both are bugs worth explicitly testing.
✓ Include a timestamp-window test for replay attack prevention. Many providers include a timestamp in the signed payload and document a tolerance window (e.g. Stripe’s official libraries default to a 5-minute tolerance). Test that your handler rejects an otherwise-valid webhook with a timestamp outside that window.
✓ Document the expected retry schedule in your test plan. Stripe, LedgerNZ, GitHub, and POLi all document different retry or redelivery behaviour, with windows ranging from minutes to days. Your tests should cover at least one retry cycle, and your test plan should state which retry scenarios are out of scope and why.
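The timestamp-window practice above can be sketched as a small check. The 300-second tolerance and the names `is_fresh` and `TOLERANCE_SECONDS` are assumptions for illustration; use the window your provider documents. The check is only meaningful if the timestamp is part of the signed content, since otherwise an attacker could simply change it.

```python
TOLERANCE_SECONDS = 300  # assumed five-minute window


def is_fresh(event_timestamp: int, now: int,
             tolerance: int = TOLERANCE_SECONDS) -> bool:
    """Accept only events whose signed timestamp is within the tolerance."""
    return abs(now - event_timestamp) <= tolerance


now = 1_700_000_000  # fixed clock so the test is deterministic
recent_event = now - 60
stale_event = now - 301
```

Passing the clock in as a parameter keeps the test deterministic: a recent event should be accepted and a replayed, stale one rejected, without waiting in real time.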

10 Common Misconceptions

❌ Myth: “If my endpoint returns 200, the webhook is handled correctly.”

Reality: Returning 200 tells the external system to stop retrying — it says nothing about whether your application actually processed the event correctly. A handler can return 200 immediately and then silently fail while updating the database, sending the email, or booking the job. You must assert the downstream state (order status, email sent, audit log entry) separately from the HTTP response code. A 200 with no verified side effect is just a silent failure.

❌ Myth: “We tested it with a real event in staging, so webhook testing is done.”

Reality: A single real event in staging is the equivalent of a single happy-path unit test. It verifies the plumbing works once, in order, with no delay. It tells you nothing about duplicate delivery, signature forgery, slow handlers, out-of-order arrivals, or what happens when your database is temporarily unavailable when the webhook lands. Staging smoke tests are a starting point, not a substitute for structured webhook failure-mode testing.

❌ Myth: “Idempotency is the external system’s problem — they should only send the webhook once.”

Reality: Most major webhook providers explicitly document that they will retry on timeout or 5xx, and that your endpoint must be idempotent. Stripe, LedgerNZ, and Shopify all say the same thing: expect duplicates, handle them. Your handler owns idempotency. Relying on the external system to only fire once is a design defect, and it will eventually cause a production incident involving duplicate charges, double bookings, or duplicate emails going out to NZ customers.

Senior engineer insight

The moment that changed how I think about webhook testing was watching a concurrent duplicate delivery take down a payment system mid-batch — not because the code was wrong, but because it had never been tested against two identical webhooks arriving within milliseconds of each other. We always test idempotency sequentially; production delivers duplicates concurrently. Treat every webhook handler as if the external system will retry the instant your server goes over two seconds, because it will.

The most common mistake: testing idempotency by sending the same webhook twice in sequence, a second apart, and calling it done. That is serial deduplication, not idempotency. The real failure is two threads both past the idempotency check simultaneously — which only happens when the handler is still running when the retry arrives.
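The gap between the sequential test and the concurrent one can be shown with a small Python sketch. Everything here is hypothetical: `Handler` is a toy in-memory handler, `atomic=True` stands in for an atomic claim of the webhook ID (in a real system, typically a unique constraint or an insert-if-absent in the database), and the `time.sleep` call simulates slow work between the check and the write.

```python
import threading
import time


class Handler:
    """Toy handler; atomic=True claims the webhook ID under a lock."""

    def __init__(self, atomic):
        self.atomic = atomic
        self.seen = set()
        self.bookings = 0
        self.claim_lock = threading.Lock()
        self.count_lock = threading.Lock()

    def _claim(self, webhook_id):
        if self.atomic:
            with self.claim_lock:  # check and record happen as one step
                if webhook_id in self.seen:
                    return False
                self.seen.add(webhook_id)
                return True
        # Naive check-then-act: nothing stops two threads passing the check.
        if webhook_id in self.seen:
            return False
        time.sleep(0.2)  # widens the gap between the check and the write
        self.seen.add(webhook_id)
        return True

    def handle(self, webhook_id):
        if self._claim(webhook_id):
            with self.count_lock:
                self.bookings += 1  # the side effect we must not duplicate
        return 200


def deliver_concurrently(handler, webhook_id, copies=2):
    """Fire the same webhook from several threads released at the same moment."""
    barrier = threading.Barrier(copies)

    def deliver():
        barrier.wait()
        handler.handle(webhook_id)

    threads = [threading.Thread(target=deliver) for _ in range(copies)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handler.bookings
```

Calling `handle` twice in a row on `Handler(atomic=False)` books once, so the sequential test passes. `deliver_concurrently(Handler(atomic=False), "wh_1")` will usually book twice, while the `atomic=True` variant books once. The exact interleaving depends on thread scheduling, which is why the concurrent case needs its own explicit test.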

From the field

A Wellington retailer switched to Windcave for card processing mid-year. Their developers wired up the payment.complete webhook, ran a happy-path test in the Windcave sandbox, and shipped it. Three weeks into production, customer services started getting calls: some orders were fulfilled twice, others never at all. The duplicates came from a 6-second packing-slip PDF generation that ran synchronously before the 200 response — Windcave's 5-second timeout fired and it retried. The missing ones came from a race condition where the webhook arrived before the order row was fully committed to the database. The lesson that generalises: webhook bugs are timing bugs. They cannot be found by a single synchronous test in a fast sandbox. You need explicit slow-handler tests, explicit concurrency tests, and you need to test against the provider's documented timeout value — not against whatever your local machine responds in.

11 Now You Try

Three graded exercises: spot, fix, then build. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot: name the missing scenarios

A Dunedin power retailer receives a meter-reading-submitted webhook from a smart-meter provider; the handler creates a billing record. The only test is: “fire one valid webhook, confirm one billing record is created.” List the webhook-specific scenarios this test misses and, for each, say what could go wrong in production.

Show model answer

The single happy-path test misses the scenarios where webhook bugs actually live:

1. Duplicate delivery (retry) — the provider times out waiting for a 200 and resends the same reading. Without idempotency the retailer creates two billing records for one reading, double-billing the customer.
2. Out-of-order delivery — yesterday's reading arrives after today's. If the handler just takes the latest write, the bill is calculated from a stale reading.
3. Invalid / unsigned payload — a malformed reading, a missing field, or an unverified signature. The handler should reject and log, not crash or silently create a bad record.
4. Slow handler / timeout — if creating the billing record takes longer than the provider's timeout, the provider gives up and retries, creating concurrency between the still-running first request and the retry.

A senior would also note: the handler should return 200 fast and do heavy work (billing calc, notifications) on a background queue, and every webhook should be logged with its ID for audit.
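As an illustration of the out-of-order scenario, here is a minimal Python sketch. It assumes the payload carries a `read_at` timestamp and a `meter_id`; those field names, the `latest` dict, and `apply_reading` are hypothetical and not taken from any real smart-meter API.

```python
from datetime import datetime

latest = {}  # meter_id -> (read_at, kwh); stands in for the billing store


def apply_reading(event):
    """Apply a reading only if it is newer than what is already stored."""
    read_at = datetime.fromisoformat(event["read_at"])
    current = latest.get(event["meter_id"])
    if current is not None and read_at <= current[0]:
        return "ignored_stale"
    latest[event["meter_id"]] = (read_at, event["kwh"])
    return "applied"


today = {"meter_id": "m1", "read_at": "2026-06-02T08:00:00", "kwh": 120}
yesterday = {"meter_id": "m1", "read_at": "2026-06-01T08:00:00", "kwh": 110}

# Deliver in the wrong order: today's reading lands first.
outcomes = [apply_reading(today), apply_reading(yesterday)]
```

The test asserts on the stored state, not just on the return values: after both deliveries the stored reading should still be today's. A real billing system might keep the older reading as history instead of discarding it; the sketch only protects the "latest reading" value.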

🔧 Exercise 2 of 3 — Fix: repair a flawed webhook handler design

A developer describes their order-paid webhook handler below. It has three serious webhook flaws. Identify each flaw and state the fix.

Flawed handler:
“When the webhook arrives we accept any POST to the URL, send the customer a confirmation email and decrement stock, then spend about 8 seconds generating a packing slip before returning 200. We don’t store the webhook ID anywhere.”

Identify the flaws and give the fix for each:

Show model answer

Three flaws:

1. No signature verification — "accept any POST" means an attacker can forge a payment notification and get free goods.
   Fix: verify the HMAC signature in the header before processing; reject (401/403) and log if it fails.

2. No idempotency — the handler doesn't store the webhook ID, so a retried duplicate sends a second confirmation email and decrements stock twice.
   Fix: record each webhook ID; if the ID has been seen before, skip the side effects and return 200.

3. Slow synchronous processing — 8 seconds to build a packing slip before returning 200 will blow the provider's timeout and trigger unnecessary retries (which, combined with flaw 2, multiply the damage).
   Fix: do lightweight validation, return 200 immediately, and move the packing-slip generation onto a background queue.
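The three fixes can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not a production handler: `SECRET`, `sign`, `handle_order_paid`, the in-memory `seen_ids` set, and the `jobs` queue are all hypothetical, and the signing scheme (HMAC-SHA256 over the raw body, hex encoded) is assumed because the exercise does not name a provider. The set-based ID check is also not safe under concurrent delivery; a real store would claim the ID atomically.

```python
import hashlib
import hmac
import queue

SECRET = b"test-secret"  # hypothetical shared secret
seen_ids = set()         # stands in for a table with a unique key on webhook ID
jobs = queue.Queue()     # stands in for a background job queue


def sign(body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()


def handle_order_paid(webhook_id: str, body: bytes, signature: str) -> int:
    # Fix 1: verify the signature against the raw body before anything else.
    if not hmac.compare_digest(sign(body).encode(), signature.encode()):
        return 401
    # Fix 2: skip side effects for a webhook ID that was already processed.
    if webhook_id in seen_ids:
        return 200
    seen_ids.add(webhook_id)
    # Fix 3: queue the slow work instead of doing it before the response.
    jobs.put(("email_stock_and_packing_slip", webhook_id))
    return 200
```

A forged request returns 401 and queues nothing. A valid request delivered twice returns 200 both times but queues exactly one job.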

🏗️ Exercise 3 of 3 — Build: design the test set for a refund webhook

An Auckland online retailer’s payment provider sends a refund.completed webhook; the handler should mark the order refunded, restock the item, and email the customer. Design a full set of test cases covering the webhook failure modes (happy path, duplicate, out-of-order, invalid signature, timeout/slow handler). For each, give the setup and the expected outcome.

Show model answer

refund.completed webhook — test set:

1. Happy path — Setup: trigger a real refund in the provider so one valid, correctly-signed webhook arrives. Expected: order marked "refunded", item restocked by one, one refund email sent, endpoint returns 200.

2. Duplicate delivery — Setup: capture the webhook (e.g. via ngrok logs) and resend it with the same webhook ID. Expected: idempotency key recognised; order stays "refunded", stock NOT incremented a second time, NO second email, endpoint returns 200.

3. Out-of-order — Setup: send a later event (e.g. a correction) before the original refund.completed, or two refund events with timestamps reversed. Expected: handler uses the timestamp/sequence, not arrival order; final state reflects the genuinely latest event, not whatever landed last.

4. Invalid signature — Setup: send a webhook with a tampered body or corrupted HMAC. Expected: rejected with 401/403, error logged, order NOT changed, no email, no restock.

5. Slow handler / timeout — Setup: add an artificial delay so the handler exceeds the provider's timeout, prompting a retry. Expected: no race condition between the in-flight request and the retry; the refund is applied exactly once; ideally the handler returns 200 fast and queues the heavy work.

A senior would add: log every webhook with its ID and outcome, and assert the audit trail is complete.

Why teams fail here

Testing only the happy path — a single clean delivery in a fast sandbox never exercises the retry, duplicate, or out-of-order conditions where webhook bugs live
Skipping signature verification — accepting any POST to the endpoint means an attacker can forge payment confirmations or trigger free fulfilment; this is a critical security gap, not just a test gap
Implementing idempotency as an afterthought — teams ship the happy path, then add idempotency logic only after a production double-charge or double-booking incident has already hit customers
Doing slow work before returning 200 — synchronous email sends, PDF generation, or third-party API calls inside the handler blow the provider's timeout and cause retry storms that idempotency cannot fully contain

Key takeaway

A webhook test suite that only sends one clean event has not tested webhooks — it has tested that HTTP POST works; the real tests are the duplicate, the forged signature, the slow handler, and the retry.

How this has changed

The field moved. Here is how Webhook Testing evolved from its origins to current practice.

Pre-2010

Webhooks exist but are inconsistently implemented. Testing webhook delivery means setting up a server to receive callbacks and manually verifying receipt. No tooling exists. Teams skip webhook testing and discover delivery failures in production.

2011

GitHub, Stripe, and Twilio popularise webhooks as the standard mechanism for async event notification. The webhook pattern becomes ubiquitous. Testing demand grows. RequestBin (2012) provides a simple webhook inspection tool — the first purpose-built webhook testing aid.

2015

ngrok enables tunnelling local development environments to public URLs — allowing webhook delivery to local test servers without deployment. Webhook testing becomes practical during development, not just in staging.

2018

Stripe CLI and similar vendor-specific tools allow webhook event simulation locally. Webhook delivery management platforms (Hookdeck, Convoy) add retry, logging, and delivery guarantee features — creating new testable behaviours around retry logic and idempotency.

Now

Webhook testing is standard in any integration involving Stripe, GitHub Actions, Shopify, or CRM systems. Testing covers: delivery confirmation, payload validation, signature verification (HMAC), retry handling, and duplicate delivery idempotency. AI event-driven systems increasingly use webhooks for async callbacks — extending the test surface.

Self-Check


Interview Questions

What NZ hiring managers ask about Webhook Testing — and what strong answers look like.

What test cases would you include in a webhook integration test for a Stripe payment webhook handler?

Strong answer: Delivery validation (handler returns 200 for a valid event), signature verification (handler returns 401 for an event with an invalid or missing Stripe-Signature header), event type routing (payment_intent.succeeded routes to the payment confirmation handler; charge.failed routes to the failure handler), idempotency (processing the same event ID twice produces the same result without duplicate side effects), retry handling (Stripe retries on any non-2xx response, so the handler must stay idempotent across retries), payload validation (what happens with a valid signature but missing required fields?), and timeout behaviour (if the handler takes longer than Stripe's response timeout, Stripe will retry, so the test should confirm that a retry arriving while the first request is still running does not duplicate side effects).

Mid/Senior

How do you test webhook delivery in a local development environment?

Strong answer: I use ngrok to create a public tunnel to my local development server, then configure the webhook provider (Stripe, GitHub, Shopify) to deliver to the ngrok URL. This lets me receive real webhook payloads on my local machine without deploying to staging. For automated tests, I use WireMock or the provider's CLI tool (Stripe CLI can trigger and replay webhook events locally without needing ngrok). For CI environments with no network access to external services, I record real webhook payloads and replay them against the handler using a test fixture. I always verify both the delivery mechanics and the business logic triggered by the event.

Junior/Mid
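The record-and-replay approach for CI can be sketched in Python as follows. The handler, the `routed` list, and the fixture contents are hypothetical; in a real suite the fixture would be a payload captured once from the provider and committed next to the tests, not generated inside the test as it is here.

```python
import json
import tempfile
from pathlib import Path

routed = []  # records which route handled which event ID


def handle_event(event: dict) -> int:
    """Hypothetical handler under test: routes on the event type."""
    if event.get("type") == "payment_intent.succeeded":
        routed.append(("confirm_payment", event["id"]))
    return 200  # unknown types are acknowledged but not routed


def replay_fixture(path: Path) -> int:
    """Load a recorded payload from disk and feed it to the handler."""
    event = json.loads(path.read_text(encoding="utf-8"))
    return handle_event(event)


# Demo only: write a stand-in fixture to a temp directory, then replay it.
fixture = Path(tempfile.mkdtemp()) / "payment_succeeded.json"
fixture.write_text(
    json.dumps({"id": "evt_demo", "type": "payment_intent.succeeded"}),
    encoding="utf-8",
)
status = replay_fixture(fixture)
```

Replaying from disk keeps the test independent of network access. The trade-off is that a recorded payload can drift from what the provider sends today, so fixtures need refreshing when the provider changes its schema.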

Q1: Why is a single happy-path webhook test almost worthless on its own?

Because webhook defects live in conditions a single clean delivery never creates — retries, duplicates, out-of-order arrivals, slow handlers, and forged payloads. One event fired once, in order, with no delay behaves correctly even when the system is badly broken for the real failure modes.

Q2: What is idempotency in a webhook handler, and why does it matter?

Idempotency means processing the same webhook more than once has the same effect as processing it once — usually achieved by tracking the unique webhook ID and skipping side effects if it has been seen before. It matters because external systems retry on timeout, so duplicates are normal; without it you get double charges, double emails, or double stock movements.

Q3: Why should a webhook endpoint return 200 quickly and do heavy work elsewhere?

The external system has a short timeout (often around 5 seconds). If your handler does slow work — emails, API calls, document generation — before responding, the system assumes failure and retries, causing concurrency and unnecessary load. Validate lightly, return 200 fast, and push heavy processing onto a background queue.
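One way to test this is to measure how long the acknowledgement takes while the slow work runs elsewhere. The Python sketch below is illustrative only: `TIMEOUT_BUDGET_S` is an assumed value that should be replaced with the provider's documented timeout, and `handle`, `worker`, and `slow_job` are hypothetical names with a thread standing in for a real background queue.

```python
import queue
import threading
import time

TIMEOUT_BUDGET_S = 5.0  # assumed provider timeout; use the documented value
work = queue.Queue()
done = []


def slow_job(webhook_id):
    time.sleep(0.3)  # stands in for email sending or PDF generation
    done.append(webhook_id)


def worker():
    """Drain the queue until a None sentinel arrives."""
    while True:
        webhook_id = work.get()
        if webhook_id is None:
            break
        slow_job(webhook_id)


def handle(webhook_id):
    work.put(webhook_id)  # enqueue only; no slow work on the request path
    return 200


t = threading.Thread(target=worker)
t.start()
start = time.perf_counter()
status = handle("wh_1")
ack_seconds = time.perf_counter() - start
work.put(None)  # stop the worker once the queued job has run
t.join()
```

The test asserts two separate things: the acknowledgement came back inside the budget, and the queued job still completed afterwards. Checking only the first would repeat the "200 means it worked" mistake.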

Q4: How do you verify a webhook genuinely came from the expected sender?

Verify the HMAC signature included in a header (e.g. Stripe-Signature) against the raw request body before processing. Reject any webhook with a missing, invalid, or mismatched signature, and test both a tampered body and a corrupted signature to confirm rejection.
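A generic version of that check and its negative tests might look like the Python sketch below. It assumes a plain scheme (HMAC-SHA256 over the raw body, hex encoded); `SECRET`, `expected_signature`, and `is_authentic` are hypothetical names. Each provider defines its own header format and signed payload, so a real handler should follow the provider's documentation or use its SDK.

```python
import hashlib
import hmac
from typing import Optional

SECRET = b"whsec_demo"  # hypothetical secret


def expected_signature(raw_body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw, unparsed request body."""
    return hmac.new(SECRET, raw_body, hashlib.sha256).hexdigest()


def is_authentic(raw_body: bytes, header: Optional[str]) -> bool:
    if not header:
        return False  # a missing header is rejected outright
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(
        expected_signature(raw_body).encode(), header.encode()
    )


body = b'{"event":"refund.completed","order":"o42"}'
good = expected_signature(body)
corrupted = ("0" if good[0] != "0" else "1") + good[1:]  # flip first character

checks = {
    "valid": is_authentic(body, good),
    "tampered_body": is_authentic(body + b" ", good),
    "corrupted_signature": is_authentic(body, corrupted),
    "missing_header": is_authentic(body, None),
}
```

Only the `valid` case should pass. Note that the check runs on the raw bytes: verifying against a re-serialised JSON body is a common source of false rejections.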

Q5: What tools let you see what a webhook actually sends and test it during development?

webhook.site or RequestBin give you a throwaway URL that captures and displays the real payload, headers, and timing — useful before writing any code. ngrok tunnels your local endpoint to a public HTTPS URL so an external system can POST to your machine and you can watch the webhook arrive in your local logs.

Q6: Your team is testing a CoverNZ claims portal that receives a payment.disbursed webhook from a government payment system. The sprint ticket only asks you to verify the happy path. What additional webhook scenarios would you push to include in the acceptance criteria before the sprint starts, and why?

A: At minimum push for duplicate delivery (idempotency), invalid HMAC signature rejection, and slow-handler behaviour — because a government payment system will have a documented retry policy and the side effects here (disbursing CoverNZ compensation) are irreversible. A duplicate webhook without idempotency could disburse the same payment twice. A missing signature check lets anyone forge a disbursement notification. Getting these into acceptance criteria before the sprint starts is far easier than after the feature is declared done — raise them in sprint planning, not in the review.

Q7: What is the key difference between webhook testing and contract testing, and when would you use each for a Revenue NZ integration?

A: Contract testing verifies that a provider and consumer agree on the shape of the message — field names, types, and structure — typically before the integration is live, using a broker like Pact. Webhook testing verifies runtime behaviour: delivery, retries, idempotency, signature verification, and timing. For a Revenue NZ integration you would use contract testing to confirm the payload schema matches what your handler expects (preventing schema-drift bugs), and webhook testing to confirm your handler correctly processes duplicates, rejects forged payloads, and returns 200 quickly without triggering unnecessary retries from the Revenue NZ system.

Q8: A developer says “We don’t need to test the retry scenario because our KiwiSaver provider guarantees at-most-once delivery.” What is wrong with this reasoning and how do you respond?

A: No major webhook provider genuinely guarantees at-most-once delivery in practice — they guarantee at-least-once, meaning duplicates are expected on timeout or 5xx. Even if the provider documentation claims at-most-once, network conditions, provider-side retries on their own infrastructure, or human resubmission through an admin console can all cause duplicate delivery. The correct response is to ask the developer to show you the exact retry policy in the provider SLA, then point out that idempotency is cheap to implement and removes an entire class of production incident. Relying on an external system’s delivery guarantee is a design defect, not a testing decision.

Q9: When would you decide NOT to apply full webhook failure-mode testing to an integration, and what should you document if you make that call?

A: Skip the full failure-mode suite when the webhook carries only non-critical notifications with no irreversible side effects — for example, a Slack alert about a new support ticket where a duplicate just means a second message in a channel. You might also reduce scope when the external system provides a comprehensive testing SDK that already simulates failure modes, or when the “webhook” is actually a polling loop in disguise. If you reduce scope, document the specific scenarios you are omitting, the justification (low risk, no financial/state-change impact), and the risk owner who accepted that decision — so it is a deliberate, visible trade-off rather than a silent gap.

Related: See API Testing for testing synchronous endpoints, and Security Testing for HMAC and TLS verification details.

