20 min read · 9 self-checks · Updated June 2026

Hybrid · Behaviour + Partial Internals

Grey-Box Testing

Test through the UI like a real user, but design your cases — and check your results — using partial knowledge of what sits underneath: the database schema, the API contracts, the architecture, the logs. You see enough of the inside to test the right things, and to know whether the system actually did them.

Junior Senior ISTQB CTFL v4.0 — 4.1 Test techniques overview

1 The Hook

A tester at a KiwiFirst Bank-style retail bank runs a transfer test. They log in, move $250 from the everyday account to the savings account, and the screen says “Transfer successful.” Green tick. Pass. They move on.

Two weeks later, reconciliation fails. The $250 left the everyday account but never arrived in savings — the credit leg of the transaction silently rolled back, yet the UI had already optimistically shown success. The money sat in limbo. A pure black-box test, watching only the screen, could never have caught it: the screen lied.

A grey-box tester would have done one more thing. After the “success” message, they would have run a query against the ledger table and confirmed two rows: a debit of $250 against the source account and a matching credit of $250 against the destination, both tied to the same transaction reference, both committed. The message is the claim; the ledger row is the proof. Knowing the schema turns a hopeful pass into a verified one.

💬

Senior Engineer Insight

The failure mode nobody tells you about is phantom grey box — teams that query the database once, see a row, and call it done. What they never check is the negative space: the rows that should not be there. On a government portability transfer project I worked on, the happy-path DB assertion passed every time — the correct record existed. What we missed for six weeks was that a retry bug was silently creating a second record on slow connections. Totals reconciled per row; the duplicate just sat there accruing. The fix: always assert a count, not just existence. One row where you expect one, zero rows in any adjacent table that should be clean. Proving the right thing happened is half the job. Proving nothing else happened is the other half.
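The "assert a count, not just existence" fix can be sketched in a few lines using Python's built-in `sqlite3`. Everything named here is a hypothetical stand-in for illustration: the `transfers` and `transfer_errors` tables, their columns, and the helper functions are assumptions, not the schema of the project described above.

```python
import sqlite3

# Hypothetical tables used only for this illustration.
ALLOWED_TABLES = {"transfers", "transfer_errors"}


def count_rows(conn, table, reference):
    """Count rows in `table` for one test reference.

    Table names cannot be bound as SQL parameters, so they are checked
    against a fixed allow-list before being placed in the query.
    """
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table: {table}")
    row = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE reference = ?", (reference,)
    ).fetchone()
    return row[0]


def assert_single_clean_transfer(conn, reference):
    """Pass only if exactly one transfer exists and the adjacent table is empty."""
    transfers = count_rows(conn, "transfers", reference)
    errors = count_rows(conn, "transfer_errors", reference)
    assert transfers == 1, f"expected 1 transfer row, found {transfers}"
    assert errors == 0, f"expected 0 error rows, found {errors}"


# Demo: an in-memory database where a retry bug wrote the record twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transfers (reference TEXT, amount_cents INTEGER)")
conn.execute("CREATE TABLE transfer_errors (reference TEXT, message TEXT)")
conn.execute("INSERT INTO transfers VALUES ('REF-1', 25000)")
conn.execute("INSERT INTO transfers VALUES ('REF-1', 25000)")  # the duplicate

# An existence check is satisfied by either row, so it hides the duplicate.
exists = conn.execute(
    "SELECT 1 FROM transfers WHERE reference = 'REF-1'"
).fetchone() is not None
print("existence check passes:", exists)

try:
    assert_single_clean_transfer(conn, "REF-1")
except AssertionError as err:
    print("count check fails:", err)
```

The existence query and the count query read the same table; only the count exposes the second row.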

2 The Rule

Drive the system the way a user does, but design and verify with partial knowledge of the inside — check the database row, the API response, and the log line, not just the message on the screen.

3 The Analogy

A restaurant health inspector with a kitchen pass.

A food critic (black box) only ever tastes the plate that arrives at the table. A line cook (white box) knows every recipe and every pot. The health inspector sits between them: they order from the menu like any diner, but they also hold a pass that lets them step into the kitchen, open the fridge, and read the temperature log. They do not rebuild the kitchen — they just check that the fridge is actually at 4°C and that the chicken was logged as cooked to temperature.

Grey-box testing is the inspector. You experience the system as a customer through the UI, but you carry a pass into the database, the API layer, and the logs — just enough internal access to confirm the meal was made the way the menu claimed, not to cook it yourself.

What it is

Grey-box testing (also spelled gray-box) is a hybrid. You test the application’s behaviour from the outside — through the UI or the public API — the way black-box techniques do, but you bring partial knowledge of the internals to the table. You might know the database schema, the API contracts, the message queues, the service architecture, or where the logs land. You do not have, or do not need, full source-code access the way white-box techniques require.

That partial knowledge does two jobs. First, it helps you design better cases — you know which fields map to which columns, which endpoints fire behind a button, and where a transaction spans two services. Second, it lets you verify internal state — you confirm the system genuinely did what it claimed, rather than trusting a UI message that might be wrong.

Where it sits between black and white box

The three approaches are best understood as a spectrum of how much of the inside you can see:

Black box: you know the specification and the inputs/outputs. You judge purely by observable behaviour. No internal knowledge.
Grey box: you know the specification and some of the structure — schema, contracts, architecture, logs. You test through the outside but verify on the inside.
White box: you have the source and design tests against the code structure — statements, branches, paths. Coverage is measured against the code itself.

Grey box is not “black box plus a peek at the code.” It is a distinct stance: you keep the user’s viewpoint as your test driver, and you use internal knowledge only as a design aid and an oracle for checking results.

Sources of internal knowledge

What counts as the “grey” in grey box? In practice it is one or more of these:

Database schema: tables, columns, constraints, and relationships. Lets you assert that a row was created, updated, or correctly left alone.
API contracts: request/response shapes, status codes, error formats (often an OpenAPI or similar spec). Lets you trace which calls a UI action triggers and assert on the payloads.
Architecture: which services, queues, and caches the request flows through. Lets you find the gaps between components where data can get lost.
Logs and traces: application logs, audit trails, distributed traces. Lets you confirm an event was recorded, an error was handled, and a request reached the service you expected.

Verifying state, not just the screen

The defining habit of a grey-box tester is the extra assertion after the visible result. The UI says one thing; you go and check whether the system agrees with itself underneath. Three common moves:

Database asserts: after an action, run a query to confirm the expected rows exist with the expected values — and that nothing unexpected changed.
Log checks: confirm the action wrote the audit or event log you expected, with the right user, timestamp, and outcome.
API tracing: watch the network or service calls behind a UI action to confirm the right endpoint was hit with the right payload and returned the right status.
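Of the three moves, the log check is often the simplest to script. The sketch below assumes the application writes one JSON object per line to its audit log; the field names (`reference`, `user`, `amount_cents`, `outcome`, `timestamp`) are hypothetical, and a real system may use a different format entirely.

```python
import json


def find_audit_entries(log_lines, reference):
    """Return the parsed JSON log records that carry the given reference."""
    entries = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip plain-text lines such as warnings or stack traces
        if isinstance(record, dict) and record.get("reference") == reference:
            entries.append(record)
    return entries


def assert_audit_logged(log_lines, reference, user, amount_cents):
    """Fail unless exactly one matching audit entry recorded the right facts."""
    entries = find_audit_entries(log_lines, reference)
    assert len(entries) == 1, f"expected 1 audit entry, found {len(entries)}"
    entry = entries[0]
    assert entry["user"] == user, "audit entry has the wrong user"
    assert entry["amount_cents"] == amount_cents, "audit entry has the wrong amount"
    assert entry["outcome"] == "SUCCESS", "audit entry did not record success"
    assert entry.get("timestamp"), "audit entry has no timestamp"
    return entry


# Demo with made-up log lines; a real test would read them from the log store.
sample_log = [
    '{"event": "TRANSFER", "reference": "TX-250", "user": "u-17", '
    '"amount_cents": 25000, "outcome": "SUCCESS", '
    '"timestamp": "2026-06-01T09:30:00Z"}',
    "WARN connection pool nearly exhausted",
    '{"event": "LOGIN", "reference": "SESSION-9", "user": "u-17"}',
]
entry = assert_audit_logged(sample_log, "TX-250", "u-17", 25000)
print("audit entry found at", entry["timestamp"])
```

Requiring exactly one entry, rather than at least one, means a retried request that logs twice fails the check.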

Real-world NZ Example: a KiwiFirst Bank-style ledger transfer

You transfer $250 between two of your own accounts and the screen shows “Transfer successful.” A black-box test stops there. A grey-box test continues:

DB assert: query the ledger table for the transaction reference — expect a debit row of −$250 on the source account and a credit row of +$250 on the destination, both committed and both linked to the same reference.
Balance check: confirm the source balance dropped by exactly $250 and the destination rose by exactly $250, so the totals still reconcile.
Log check: confirm an audit entry recorded the transfer with the correct user, amount, and timestamp.

The bug to fear is a half-committed transaction: money leaves one account but the credit leg fails and the UI still says success. Only the ledger row, not the screen, proves the transfer truly happened.
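As a rough illustration of the DB assert and balance check above, here is a minimal sketch using Python's built-in `sqlite3`. The `ledger` table, its columns, the account names, and the choice to store amounts as integer cents are all assumptions made for the example. The demo uses a single in-memory connection for brevity; a real test would typically query through its own connection so that it only sees committed rows.

```python
import sqlite3


def assert_transfer_recorded(conn, reference, source, destination, amount_cents):
    """Fail unless the ledger holds one debit and one matching credit."""
    rows = conn.execute(
        "SELECT account_id, amount_cents FROM ledger WHERE reference = ?",
        (reference,),
    ).fetchall()
    assert len(rows) == 2, f"expected 2 ledger legs, found {len(rows)}"
    legs = dict(rows)  # {account_id: amount_cents}
    assert legs.get(source) == -amount_cents, "debit leg missing or wrong"
    assert legs.get(destination) == amount_cents, "credit leg missing or wrong"


def account_balance(conn, account_id):
    """Balance derived by summing every ledger row for the account."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM ledger WHERE account_id = ?",
        (account_id,),
    ).fetchone()
    return row[0]


# Demo: opening balances, then a half-committed $250 transfer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ledger (reference TEXT, account_id TEXT, amount_cents INTEGER)"
)
conn.execute("INSERT INTO ledger VALUES ('OPEN-1', 'everyday', 100000)")
conn.execute("INSERT INTO ledger VALUES ('OPEN-2', 'savings', 50000)")
before = {a: account_balance(conn, a) for a in ("everyday", "savings")}

# Simulated bug: the debit leg is written, the credit leg never arrives.
conn.execute("INSERT INTO ledger VALUES ('TX-250', 'everyday', -25000)")

try:
    assert_transfer_recorded(conn, "TX-250", "everyday", "savings", 25000)
except AssertionError as err:
    print("grey-box check caught it:", err)

after = {a: account_balance(conn, a) for a in ("everyday", "savings")}
# A clean transfer between own accounts leaves the combined total unchanged.
print("change in combined balance (cents):", sum(after.values()) - sum(before.values()))
```

A non-zero change in the combined balance is the reconciliation failure that the on-screen message would have hidden.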

Worked example

A NZ retailer’s checkout applies a 10% loyalty discount when a signed-in member completes an order. You place an order through the UI and the confirmation page shows the discounted total. Here is the grey-box test set, layering an internal assertion onto each outside-in step.

Loyalty discount checkout — grey-box test cases

UI action (black box)	Internal check (grey box)	Expected
Add $100 of items, apply member discount	Query `order_lines` sum + `orders.discount` column	Discount = $10, total = $90
Confirmation page shows “Order placed”	Assert one row in `orders` with status `CONFIRMED`	Exactly one confirmed row
Click “Pay now”	Trace the call to `POST /payments` and its 2xx response	Payment endpoint returns 201
Apply discount as a signed-out guest	Assert `orders.discount` = 0 and a denied entry in the log	No discount; denial logged

Why each row needs both columns: the UI column tells you what a user would see; the internal column tells you what the system actually recorded. A discount that displays correctly but never reaches the orders.discount column will under-charge or over-charge once the visible figure and the stored figure drift apart. Grey box catches that drift.
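One way to catch that drift is to compare the figures scraped from the confirmation page with the stored row. This is a sketch under stated assumptions: `orders.discount` and the `CONFIRMED` status come from the table above, while the `total` and `price` columns, the integer-cents storage, and the helper names are invented for the example.

```python
import sqlite3
from decimal import Decimal


def to_cents(displayed):
    """Convert a UI string such as '$1,234.50' into integer cents."""
    cleaned = displayed.replace("$", "").replace(",", "").strip()
    return int(Decimal(cleaned) * 100)


def assert_order_matches_screen(conn, order_id, shown_discount, shown_total):
    """Fail if the stored order disagrees with what the page displayed."""
    order = conn.execute(
        "SELECT discount, total, status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    assert order is not None, "no order row was written"
    discount, total, status = order
    lines_sum = conn.execute(
        "SELECT COALESCE(SUM(price), 0) FROM order_lines WHERE order_id = ?",
        (order_id,),
    ).fetchone()[0]
    assert status == "CONFIRMED", f"order status is {status}"
    assert discount == to_cents(shown_discount), "stored discount differs from screen"
    assert total == to_cents(shown_total), "stored total differs from screen"
    assert lines_sum - discount == total, "lines, discount and total do not reconcile"


# Demo: the page showed a $10 discount, but the stored discount is zero.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, discount INTEGER, "
    "total INTEGER, status TEXT)"
)
conn.execute("CREATE TABLE order_lines (order_id INTEGER, price INTEGER)")
conn.executemany("INSERT INTO order_lines VALUES (1, ?)", [(6000,), (4000,)])
conn.execute("INSERT INTO orders VALUES (1, 0, 10000, 'CONFIRMED')")

try:
    assert_order_matches_screen(conn, 1, "$10.00", "$90.00")
except AssertionError as err:
    print("drift detected:", err)

# Once the stored row holds the discounted figures, the same check passes.
conn.execute("UPDATE orders SET discount = 1000, total = 9000 WHERE id = 1")
assert_order_matches_screen(conn, 1, "$10.00", "$90.00")
print("stored order agrees with the confirmation page")
```

Parsing the displayed amount with `Decimal` rather than `float` avoids introducing a rounding difference in the test itself.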

ISTQB mapping

ISTQB groups test techniques into black-box, white-box, and experience-based families. Grey box is not a separate syllabus category — it is a practical approach that combines black-box test design with structural knowledge used as an oracle. The mapping below shows the pieces a grey-box tester draws on.

ISTQB CTFL v4.0 reference

Syllabus ref	Topic	Level
4.1	Test techniques overview (black-box, white-box, experience-based)	CTFL Foundation
4.2	Black-box techniques — the design basis for grey-box cases	CTFL Foundation
4.3	White-box techniques — the structural knowledge a grey-box tester borrows	CTFL Foundation

Common mistakes

✗ Trusting the UI message as proof

A “success” toast is a claim, not evidence. If you stop at the screen you are doing black box, not grey box. Add the database or log assertion that confirms the system actually did it.

✗ Drifting into white-box testing

Grey box uses partial knowledge to design and verify. The moment you start writing tests against private functions and chasing code coverage, you have crossed into white box — a different goal with different tooling.

✗ Asserting on internals that are not contracted

If you assert on a column or log format that is purely incidental and may change without notice, your tests become brittle. Anchor internal checks to stable contracts — documented schema, published API responses, defined audit events.

✗ Checking only what changed, not what should not have

A transfer that credits the right account but also touches an unrelated row is still a bug. Assert the expected change and that nothing else moved — especially balances, statuses, and totals that must reconcile.
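A before-and-after snapshot is one simple way to assert that nothing else moved. The sketch below assumes a hypothetical `accounts` table with a `balance_cents` column, and it snapshots only the accounts the test itself owns so that other activity in a shared environment does not interfere.

```python
import sqlite3


def snapshot(conn, account_ids):
    """Read current balances for the accounts this test owns."""
    placeholders = ",".join("?" for _ in account_ids)
    rows = conn.execute(
        "SELECT account_id, balance_cents FROM accounts "
        f"WHERE account_id IN ({placeholders})",
        list(account_ids),
    )
    return dict(rows)


def deltas(before, after):
    """Return {account_id: change_in_cents} for every account that moved."""
    moved = {}
    for account_id in before.keys() | after.keys():
        change = after.get(account_id, 0) - before.get(account_id, 0)
        if change != 0:
            moved[account_id] = change
    return moved


def assert_only_expected_changes(before, after, expected):
    """Fail if any account moved that is not listed in `expected`."""
    actual = deltas(before, after)
    assert actual == expected, f"expected {expected}, got {actual}"


# Demo: a correct transfer plus a buggy one-cent side effect on 'bills'.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (account_id TEXT PRIMARY KEY, balance_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("everyday", 100000), ("savings", 50000), ("bills", 20000)],
)
owned = ["everyday", "savings", "bills"]
before = snapshot(conn, owned)

conn.execute(
    "UPDATE accounts SET balance_cents = balance_cents - 25000 "
    "WHERE account_id = 'everyday'"
)
conn.execute(
    "UPDATE accounts SET balance_cents = balance_cents + 25000 "
    "WHERE account_id = 'savings'"
)
conn.execute(
    "UPDATE accounts SET balance_cents = balance_cents - 1 "
    "WHERE account_id = 'bills'"
)
after = snapshot(conn, owned)

try:
    assert_only_expected_changes(
        before, after, {"everyday": -25000, "savings": 25000}
    )
except AssertionError as err:
    print("side effect detected:", err)
```

Because the comparison is on the full set of deltas, the expected change and the unexpected one are checked in a single assertion.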

✗ Letting test data and prod-like state collide

Querying a shared database to verify state can give false passes if another run left rows behind. Scope your asserts to the transaction reference or test account you created, not to a broad table count.

4 Industry Reality

🏭 What you actually encounter on the job

Access is negotiated, not automatic. You rarely get a root-level database connection handed to you on day one. In most NZ enterprise projects you raise a request, get read-only credentials scoped to specific tables, and sometimes the request takes days. Plan for it — don’t assume unfettered access.
Logs are often incomplete or inconsistent. The audit trail you want to assert on may be missing a field, truncated after 90 days, or swallowed by a log aggregator whose retention policy changed last quarter. Senior testers keep a note of what the log actually captures vs. what the spec says it should.
Schemas change without ceremony. The ERD the dev shared in the sprint kickoff is a month old. Column names get renamed, nullable constraints flip, new FK tables get added. Grey-box assertions that rely on a specific column then break with no warning, or keep passing against a column nobody writes to any more. Keep your DB asserts anchored to the API contract or a shared spec, not a screenshot of a whiteboard schema.
Time pressure collapses the technique to UI-only. Under sprint crunch, testers revert to checking the screen because standing up a DB connection takes setup effort. Senior testers have query snippets and test-data scripts ready before the sprint starts, so grey-box checks are one paste away rather than a half-day setup task.
State isolation is the real challenge. Shared test environments mean your DB assert for “exactly one order row” competes with three other testers running the same flow. The pattern that actually works: use a unique test reference (UUID prefix, test email domain) in every run and scope all queries to that reference.
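The unique-reference pattern from the last point might look like the following sketch. The `orders` table, its `test_reference` column, and the `place_order` stand-in for the UI flow are hypothetical; the point is only that the query is scoped to an identifier generated for this run.

```python
import sqlite3
import uuid


def new_test_reference(prefix="gbtest"):
    """Generate a reference that no other test run will share."""
    return f"{prefix}-{uuid.uuid4().hex}"


def place_order(conn, reference):
    """Stand-in for the UI flow: writes one order tagged with the reference."""
    conn.execute(
        "INSERT INTO orders (test_reference, status) VALUES (?, 'CONFIRMED')",
        (reference,),
    )


def confirmed_orders_for(conn, reference):
    """Count confirmed orders belonging to one test run only."""
    return conn.execute(
        "SELECT COUNT(*) FROM orders "
        "WHERE test_reference = ? AND status = 'CONFIRMED'",
        (reference,),
    ).fetchone()[0]


# Demo: three other testers run the same flow in the shared environment.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, test_reference TEXT, status TEXT)"
)
for _ in range(3):
    place_order(conn, new_test_reference())

my_reference = new_test_reference()
place_order(conn, my_reference)

table_total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print("whole-table count:", table_total)
print("count scoped to my reference:", confirmed_orders_for(conn, my_reference))
assert confirmed_orders_for(conn, my_reference) == 1
```

An "exactly one row" assertion on the whole table would fail here for reasons unrelated to the flow under test; the scoped one does not.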

5 When to Use It — and When Not To

⚡ Decision guide

✓ Use it when

The visible result and the stored result can diverge — payments, ledger entries, subscription state, stock levels, audit trails.
A “success” message is the only signal the UI gives you and you cannot trust it without a secondary check.
The flow spans multiple services or queues, where a failure in a downstream step is invisible at the UI layer.
You are testing financial, compliance, or safety-critical flows — anywhere the Privacy Act 2020 or Commerce Commission rules make silent data corruption legally significant.
You have a published API contract (OpenAPI spec) or a stable database schema you can safely anchor assertions to.

✗ Skip it when

The internal assertion you want to write has no stable contract — if the column or log format changes frequently, your test will break on harmless refactors rather than real bugs.
You are doing pure exploratory charter work and the overhead of setting up DB access breaks your flow; note the check and schedule it as a scripted follow-up.
Full white-box coverage is already in place for the same behaviour — adding a grey-box DB check on top of a unit-tested transaction is redundant effort, not defence in depth.
The environment gives you no access to the database, logs, or API traces — pure UI black-box is the only viable option and spending time on workarounds has negative ROI.
The feature is a pure display or layout change with no persistence or side-effects — grey box adds setup cost with zero additional signal.

Context guide

How the right level of grey-box testing effort changes based on project context.

Context	Priority	Why
Financial transactions — e.g. Harbour Bank or KiwiFirst Bank payment flows	Essential	Money can leave one account without arriving in another while the UI shows success. DB assertions on both ledger legs are the only reliable oracle. Reserve Bank settlement rules and AML obligations make undetected half-commits a compliance event, not just a bug.
Government benefit and entitlement systems — e.g. Benefits NZ or Revenue NZ portals	Essential	A half-written entitlement row or a failed async persist can leave a client without income with no visible error. Privacy Act 2020 obligations and the risk of judicial review make silent data-integrity failures unacceptable. Grey-box DB checks are non-negotiable on every write path.
Health records and clinical systems — e.g. HealthNZ patient portals	Essential	Medication doses, allergy flags, and immunisation records must persist exactly as entered. UI confirmation alone cannot verify that a critical record was committed and replicated to the clinical read store. Grey box catches the gap between the optimistic front-end and the authoritative back-end write.
SaaS subscription and billing flows — e.g. CloudBooks or Spark Business	High use	Plan upgrades, invoice generation, and proration calculations involve multiple table writes. Grey-box API and DB assertions confirm the subscription row, the invoice record, and the webhook event all landed correctly — protecting revenue integrity and reducing support escalations.
Land title and property transactions — e.g. LandNZ e-dealing or TransitNZ registration	High use	A property title change or vehicle registration must be atomically committed and correctly reflected across the authoritative register. Grey-box assertions against the title or registration table confirm the deal-settling state, not just the portal confirmation message.
Pure display and layout changes with no persistence — e.g. a static content update on Pacific Air’s marketing pages	Low	There is no stored state to verify. Adding a DB assertion here introduces setup cost with zero additional signal. Black-box visual or functional testing is the right tool; grey box would be over-engineering.

Trade-offs

What you gain and what you give up when you choose grey-box testing.

Advantage	Disadvantage	Use instead when…
Catches half-committed writes and silent back-end failures that a UI-only test can never detect — the most expensive class of production bugs in payment and data-integrity systems.	Requires database or API access that must be formally requested in most NZ enterprise environments. Setup time can run to days; testers who leave this until the last sprint often never get the credentials before go-live pressure hits.	The schema changes frequently mid-sprint with no migration files, making stable assertions impossible. Pure black-box testing is safer until the schema settles.
Produces better-designed test cases by using API contracts and schema knowledge to identify the exact boundaries and edge cases a purely UI-driven tester would never think to probe.	Assertions anchored to undocumented internals become brittle: a column rename or log-format change breaks tests on harmless refactors rather than real regressions, eroding team trust in the test suite.	The feature has no persistence or side-effects (a cosmetic change on a CoverNZ portal page, for example). Black-box testing gives you full coverage with none of the overhead.
Maintains the user perspective as the test entry point, so tests remain meaningful to stakeholders and trace directly to user journeys — unlike white-box tests that only developers can interpret.	State isolation in shared test environments is harder: if multiple testers query the same table, one run can corrupt another’s assertions. Scoping every query to a unique test reference adds discipline overhead that teams under time pressure often skip.	Equivalent white-box unit tests already cover the same logic with full branch coverage. Layering a grey-box DB check on top duplicates the signal without adding defence in depth.
Scales naturally to multi-service and async architectures: when a queue or downstream service can silently swallow a write, a grey-box log or event assertion is the only practical way to confirm end-to-end delivery without full white-box access to every service.	For async flows, timing between the UI confirmation and the back-end persist means assertions run too early may give false negatives. Testers need to build in a deliberate wait-or-poll pattern, which is error-prone if not handled consistently.	Exploratory testing is the current goal and the overhead of establishing DB access breaks the flow. Note the check and schedule it as scripted follow-up; forced grey box during charters costs more than it finds.
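The wait-or-poll pattern mentioned in the last row can be kept consistent by putting it in one shared helper. This is a minimal sketch; the name `poll_until` and its default timings are assumptions, and the `check` callable would normally wrap a database query or log lookup.

```python
import time


def poll_until(check, timeout=5.0, interval=0.2):
    """Call `check` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy result, or raises TimeoutError so a late or missing
    write shows up as a clear failure rather than a silent pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Demo: a fake check that only succeeds on the third attempt, standing in
# for "the row has been persisted by the async worker".
attempts = {"count": 0}


def row_has_landed():
    attempts["count"] += 1
    return attempts["count"] >= 3


poll_until(row_has_landed, timeout=2.0, interval=0.01)
print("condition met after", attempts["count"], "attempts")
```

Using `time.monotonic` for the deadline keeps the timeout unaffected by system clock adjustments during the run.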

Enterprise reality

How grey-box testing changes when you are one of 200–300 developers shipping to production at an NZ enterprise.

Automation replaces the manual DB query. At small-team scale a tester pastes a SQL snippet after each run. At TeleNZ or Coastal Bank scale those same assertions are baked into Playwright or REST-Assured test suites that execute on every merge to main — CI pipelines enforce the grey-box check, not individual discipline. If it is not in the automation framework it effectively does not exist.
Compliance raises the stakes for every unverified write. Under the Privacy Act 2020 and the NZ Information Security Manual (NZISM), government and financial-sector teams must demonstrate data integrity through audit evidence, not just passing test runs. Revenue NZ, for instance, requires that any system handling taxpayer records can produce a verifiable audit trail for each mutation — which means your grey-box log assertions must themselves be captured in a test report and retained, not run and discarded. PCI DSS and HISF add equivalent obligations for payment card and health data respectively.
Tooling choices get locked in early and at volume. Mature NZ enterprise teams converge on a small set: Playwright for browser-level grey-box runs, Postman / Newman or REST-Assured for API contract assertions, and Datadog or Splunk as the log oracle (rather than raw SSH access). Getting read-only database credentials piped into a test environment via a secrets manager (HashiCorp Vault is common on TechServNZ-run infrastructure projects) is a platform problem that must be solved at project inception, not in week eight of testing.
Cross-squad coordination determines what you can assert on. With 10–15 squads sharing a platform, the schema and API contracts you rely on for grey-box assertions are owned by another team. At this scale, contract testing (Pact or similar) becomes the formal mechanism: each consumer squad publishes the assertions it depends on and provider squads run them before merging. Grey-box testing without contract governance at this size produces brittle suites that break on every cross-squad deploy.

◆ What I would do

Professional judgment — when to reach for grey-box testing, when to skip it, and what to watch for.

Scenario 1 — Revenue NZ income tax return portal

Situation

A taxpayer submits a voluntary income disclosure through the Revenue NZ myIR portal. The screen confirms “Your disclosure has been lodged.” The team is deciding whether to add grey-box assertions. The schema is documented in the data dictionary and the team has been granted read-only credentials to the assessment tables.

I would

Insist on grey-box assertions before sign-off. The IR3 or VD row in the assessments table is the legal record; the portal message is decoration. I would query the table for the taxpayer’s Revenue NZ number, assert the row exists with the correct period, amounts, and a LODGED status, and confirm the audit log recorded the submission timestamp. A disclosure that shows lodged on screen but never committed would expose Revenue NZ to penalties under the Tax Administration Act 1994 and leave the taxpayer without a paper trail. The stable schema and available credentials mean the setup cost is low and the risk is extreme — grey box is the only acceptable choice here.

Scenario 2 — CoverNZ injury claim management system

Situation

A sprint has two stories: (1) a handler updates a claimant’s weekly compensation amount through the case management UI; (2) a developer renames a display label on the claim summary page from “Weekly benefit” to “Weekly entitlement.” Both are in scope for testing this sprint. The claim system’s compensation table schema is stable and documented.

I would

Apply grey box to story (1) and black box only to story (2). For the compensation update I would query compensation.weekly_amount after the UI save, assert the new value is stored with the correct effective date, and confirm no prior entitlement row was accidentally closed off. For the label rename I would verify the new text appears on screen — there is nothing to persist, so a DB query would add setup cost with zero signal. Treating both stories identically would either over-engineer the label test or under-engineer the financial write. The rule I apply: grey box whenever stored state can diverge from displayed state; black box when there is genuinely nothing to write.

Scenario 3 — Benefits NZ Jobseeker Support online application

Situation

A new applicant submits a Jobseeker Support application through the Benefits NZ MyMSD portal. The team wants grey-box assertions but has discovered the applications table schema is changing every sprint as new fields are added, and migration files are inconsistent. Database credentials have been requested but the security team says approval will take three weeks.

I would

Skip DB assertions for now and pivot to API-level grey box. If an internal POST /applications call can be traced through the network tab or a test proxy, I can assert on the request payload and the 201 response without needing database credentials or a stable column list. I would document the planned DB assertions in the test case as a deferred step, flag the schema volatility to the tech lead as a risk to assertion durability, and revisit once credentials arrive and the schema settles. Grey box is not all-or-nothing — API-contract assertions are a meaningful partial step that earns most of the value while the harder access problem is resolved in the background.
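An API-level assertion of this kind could be sketched as below. It assumes the proxy or test harness can export captured traffic as a list of dictionaries; that shape, the payload field names, and the `id` field in the response are all hypothetical, since the real contract is whatever the published spec defines.

```python
def find_exchanges(captured, method, path):
    """Pick out captured request/response pairs for one endpoint."""
    return [e for e in captured if e["method"] == method and e["path"] == path]


def assert_application_submitted(captured, expected_payload):
    """Fail unless exactly one POST /applications was sent and returned 201."""
    calls = find_exchanges(captured, "POST", "/applications")
    assert len(calls) == 1, f"expected 1 POST /applications, found {len(calls)}"
    call = calls[0]
    assert call["status"] == 201, f"expected 201, got {call['status']}"
    for field, value in expected_payload.items():
        sent = call["request_json"].get(field)
        assert sent == value, f"payload field {field!r} was {sent!r}"
    application_id = call["response_json"].get("id")
    assert application_id, "response did not include an application id"
    return application_id


# Demo with hand-written capture data in the assumed shape.
captured = [
    {
        "method": "GET",
        "path": "/session",
        "request_json": None,
        "status": 200,
        "response_json": {"user": "applicant-42"},
    },
    {
        "method": "POST",
        "path": "/applications",
        "request_json": {"benefit_type": "JOBSEEKER", "applicant": "applicant-42"},
        "status": 201,
        "response_json": {"id": "APP-1001"},
    },
]
application_id = assert_application_submitted(
    captured, {"benefit_type": "JOBSEEKER", "applicant": "applicant-42"}
)
print("application accepted with id", application_id)
```

The returned id is worth recording in the test evidence, because it is the key the deferred database assertion will need once credentials arrive.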

The bottom line: Grey-box testing earns its overhead on any flow where a UI success message could be wrong and stored state matters legally or financially. Reach for it by default on payment, benefit, health, and compliance paths; skip it on pure display changes; and when full DB access is blocked, API-contract assertions are a legitimate partial substitute that still beats trusting the screen.

6 Best Practices

✓ What experienced testers do

Treat the UI message as a claim, never as proof. Every “success” toast is the starting point of verification, not the end of it. Follow up with the DB query or API trace that confirms the system agrees with what the screen said.
Assert both sides of every write. On a transfer, check the debit and the credit. On an order, check the order row and the inventory decrement. On a deletion, check the row is gone and the audit record was created. Checking only one side misses half-committed failures.
Verify totals reconcile. After any financial or stock operation, assert that the sums add up. The individual rows can look correct while a rounding error or a duplicated write makes the total wrong.
Scope every assert to a unique test identifier. Use a UUID-prefixed reference, a dedicated test email domain, or a per-run customer ID. Never assert on a row count for the whole table in a shared environment — another tester’s run will break your count.
Keep query snippets in a team repo. Commonly needed DB asserts — “confirm both ledger legs”, “check audit entry”, “verify subscription active” — should live in version control as parameterised scripts, not as one-off queries typed from memory each sprint.
Anchor to contracts, not incidentals. Base your assertions on documented schema (ERD or migration files), published OpenAPI specs, or defined audit events. If you assert on an undocumented column that a dev plans to rename, you create test noise, not test value.
Check what must not have changed, not only what should. After a targeted operation, query the rows that the operation should have left untouched and confirm they are unchanged. This is how you catch unintended side-effects and cascading update bugs.
Pair with API tracing for multi-service flows. For flows that cross service boundaries, intercept the internal API calls (network tab, a proxy like Proxyman, or test-harness request capturing) and assert on payload and status, not just the final UI outcome.
Request read-only DB access early in the project. In most NZ enterprise environments (banks, insurers, government agencies) database access requires a formal request. Raise it at project inception, not the week before go-live.
Document grey-box steps explicitly in the test case. A test case that just says “transfer $250 and verify” is a black-box test until you add the explicit step “query ledger table with reference X and assert debit −$250 + credit +$250 both committed.” Make the internal check a numbered step, not an afterthought.
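The "query snippets in a team repo" practice above could start as something as small as a dictionary of named, parameterised queries. This is only a sketch: the snippet names, the `subscriptions` and `audit_log` tables, and their columns are invented for illustration.

```python
import sqlite3

# Named, parameterised count queries that would live in version control.
SNIPPETS = {
    "subscription_active": (
        "SELECT COUNT(*) FROM subscriptions "
        "WHERE customer_id = :customer_id AND status = 'ACTIVE'"
    ),
    "audit_entries": (
        "SELECT COUNT(*) FROM audit_log "
        "WHERE reference = :reference AND event = :event"
    ),
}


def run_count(conn, name, **params):
    """Run a named COUNT(*) snippet with bound parameters; return the number."""
    if name not in SNIPPETS:
        raise KeyError(f"unknown snippet: {name}")
    return conn.execute(SNIPPETS[name], params).fetchone()[0]


# Demo against an in-memory database with one active and one cancelled plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscriptions (customer_id TEXT, status TEXT)")
conn.execute("CREATE TABLE audit_log (reference TEXT, event TEXT)")
conn.execute("INSERT INTO subscriptions VALUES ('cust-7', 'CANCELLED')")
conn.execute("INSERT INTO subscriptions VALUES ('cust-7', 'ACTIVE')")
conn.execute("INSERT INTO audit_log VALUES ('TX-250', 'TRANSFER')")

active = run_count(conn, "subscription_active", customer_id="cust-7")
audited = run_count(conn, "audit_entries", reference="TX-250", event="TRANSFER")
print("active subscriptions:", active)
print("audit entries:", audited)
```

Binding values through named parameters, rather than formatting them into the SQL string, keeps the snippets safe to reuse with arbitrary test references.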

7 Common Misconceptions

❌ Myth: Grey-box testing just means “black-box testing where you also look at the database sometimes.”

Reality: Grey box is a deliberate stance, not an occasional extra step. It means consistently using partial internal knowledge — schema, API contracts, architecture, logs — to design your cases and to verify internal state. If you only glance at the DB when something looks suspicious, you are doing ad-hoc debugging, not grey-box testing. The technique requires that every test case with a write path has a planned internal assertion before you execute.

❌ Myth: Grey-box testing requires source-code access, just less of it than white-box.

Reality: Grey box requires structural knowledge, not source code. The database schema, an OpenAPI spec, an architecture diagram, and an application log format are all grey-box inputs — and none of them require you to read a line of application code. Many testers in NZ agencies and financial services use grey box their entire careers without ever having repository access; they work from published data dictionaries and API contracts instead.

❌ Myth: If the screen says it worked, the test passes — adding DB checks is gold-plating.

Reality: The most expensive production bugs in payment and data-integrity systems happen precisely because the UI reported success optimistically while the underlying write failed, rolled back, or partially committed. The KiwiFirst Bank-style half-committed transfer is not a contrived example — it is a real class of failure that appears regularly in financial-system retrospectives. The DB assertion is not gold-plating; it is the actual pass criterion for any flow where stored state matters.

8 Now You Try

Three graded exercises: spot, fix, then build. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot: classify the approach

For each of the four test descriptions below, say whether it is black box, grey box, or white box, and give a one-line reason. (1) Place an order and confirm the total on screen. (2) Place an order, then query the orders table to confirm the stored discount matches the displayed one. (3) Write a unit test that forces an exception branch inside the discount calculator. (4) Submit a transfer, then check the audit log recorded the right amount and user.

Model answer

1. On-screen total only — BLACK BOX. Judged purely by observable UI output, no internal knowledge used.

2. Order + DB query for the stored discount — GREY BOX. Driven through the UI like a user, but verified with knowledge of the schema (the orders table / discount column).

3. Unit test forcing an exception branch — WHITE BOX. Designed against the code structure to exercise a specific branch; needs source access and measures coverage.

4. Transfer + audit-log check — GREY BOX. The action is outside-in, but the verification uses internal knowledge of where and how the event is logged.

The dividing line: grey box drives from the outside and verifies on the inside; white box designs the test from the code structure itself.

🔧 Exercise 2 of 3 — Fix: repair a weak transfer test

A tester wrote the test below for a KiwiFirst Bank-style $250 transfer. It is too weak: it trusts the UI, verifies only one side of the ledger, and never checks that totals reconcile. Rewrite it as a proper grey-box test with the internal assertions a senior would expect.

Weak test:
1. Transfer $250 from everyday to savings.
2. Confirm the screen says “Transfer successful”.
3. Query the savings account for a +$250 credit row.
4. Pass if the credit row exists.


Model answer

Proper grey-box test for a $250 transfer:

Step 1: Record the starting balances of both accounts, then transfer $250 from everyday to savings through the UI.
Step 2: Query the ledger for the transaction reference and assert BOTH legs: a debit row of −$250 on the everyday account AND a credit row of +$250 on the savings account, both committed and sharing the same reference.
Step 3: Assert the totals reconcile — everyday dropped by exactly $250, savings rose by exactly $250, and no other account moved.
Step 4: Confirm an audit-log entry recorded the transfer with the correct user, amount, and timestamp. Only then treat the on-screen "success" message as confirmed.

What was missing in the original:
- It trusted the UI "success" message instead of treating it as a claim.
- It checked only the credit leg — a half-committed transaction (money left but never arrived) would still pass.
- It never confirmed the totals reconcile, so it could not catch money landing in the wrong place or being double-counted.

🏗️ Exercise 3 of 3 — Build: design grey-box checks for a webhook flow

A NZ SaaS billing system sends an invoice email after a customer upgrades their plan in the UI. Behind the scenes it calls POST /subscriptions, writes a row to the subscriptions table, and fires a webhook to an email service that logs an EMAIL_SENT event. Design a grey-box test: name the outside-in action and the internal assertions across the API, database, and logs. Note what you would check that the UI alone cannot show.

Model answer

Grey-box test for the plan-upgrade billing flow:

Outside-in action: Sign in as a customer and upgrade the plan through the UI; the confirmation page shows the new plan.

API assertion: Trace the POST /subscriptions call behind the upgrade button — assert it was sent with the correct plan id and customer id, and returned a 201 with the expected response body.

Database assertion: Query the subscriptions table for the customer — assert exactly one active row with the new plan, the correct effective date, and the previous plan correctly closed off (no two active rows).

Log / webhook assertion: Confirm the email service logged an EMAIL_SENT event tied to this customer and invoice, proving the webhook fired and was accepted — not just queued.

What the UI alone cannot show: whether the subscription row was actually written (the UI can show success optimistically), whether the old plan was closed cleanly, and whether the invoice email webhook actually reached the email service. A senior would also assert that no duplicate subscription row or duplicate EMAIL_SENT event was created on a retry.
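The database and log assertions above might look like this in Python. Treat it as a sketch under stated assumptions: the column names on `subscriptions` (`customer_id`, `plan_id`, `status`), the `active` and `closed` status values, and the shape of the parsed log events are all hypothetical, and an in-memory SQLite database stands in for the real one. The API assertion on POST /subscriptions is not covered here.

```python
import sqlite3


def make_subscriptions(rows):
    """Build an in-memory stand-in for the (hypothetical) subscriptions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subscriptions "
        "(customer_id TEXT, plan_id TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO subscriptions VALUES (?, ?, ?)", rows)
    return conn


def assert_upgrade_state(conn, events, customer_id, plan_id, invoice_id):
    """Assert one active subscription row and one EMAIL_SENT event."""
    active = conn.execute(
        "SELECT plan_id FROM subscriptions "
        "WHERE customer_id = ? AND status = 'active'",
        (customer_id,),
    ).fetchall()
    # Exactly one active row, and it is the new plan. Two active rows would
    # mean the old plan was never closed or a retry wrote a duplicate.
    assert active == [(plan_id,)], f"unexpected active rows: {active}"

    # `events` is assumed to be a list of dicts parsed from the email
    # service's log; the field names are illustrative only.
    sent = [
        e for e in events
        if e.get("event") == "EMAIL_SENT"
        and e.get("customer_id") == customer_id
        and e.get("invoice_id") == invoice_id
    ]
    assert len(sent) == 1, f"expected 1 EMAIL_SENT event, found {len(sent)}"
```

Comparing the active rows against a single expected tuple, rather than checking that the new plan merely appears, is what lets the same assertion catch both the unclosed old plan and the duplicate write.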

Senior engineer insight

The real power of grey-box testing only clicked for me when I stopped thinking of it as “black box with a DB query tacked on” and started treating the internal assertion as the actual test — the UI action is just the stimulus. Once that shift happened, I started writing the DB assertion first, before the test steps, because it forces you to articulate exactly what “pass” means in stored state. That discipline catches ambiguity in requirements faster than any review meeting. On API-first projects in New Zealand — where most business logic is buried behind REST layers — knowing which endpoint fires and what the response contract guarantees is often the difference between a test that proves something and one that just exercises a happy path.

Most common mistake: teams add one DB assertion to a flow, see a row, and mark it green — without checking the row count, the negative space, or whether any adjacent tables stayed clean. Existence is not correctness.

From the field

On an integration project for a NZ government agency that managed benefit entitlements via an API-first architecture, the frontend team was consuming a third-party benefit calculation service through a thin internal API layer. The visible confirmation screen was built to show success optimistically — the UI displayed “Entitlement confirmed” as soon as the internal API returned 200, before the downstream write to the entitlements table had completed asynchronously. We had been passing the happy path for three sprints. When I added a grey-box assertion that queried the entitlements table a second after the UI confirmation, we found a 15% failure rate on slow connections: the async write was losing the race with the optimistic UI response. The lesson that generalises: on any async or multi-layer stack, the UI confirmation event and the persistence event are not the same thing — grey-box assertions are the only way to prove they both happened.
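When the write is asynchronous, one way to structure the persistence check is to poll for the row up to a deadline rather than query once at a fixed delay. The helper below is a hypothetical sketch, not the approach used on that project: `wait_for`, its parameters, and the `client_ref` column in the usage comment are names invented for illustration. A write that never lands still fails the test once the deadline passes, so the race is reported rather than hidden.

```python
import time


def wait_for(check, timeout_s=5.0, interval_s=0.2):
    """Poll check() until it returns a truthy value or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            # The persistence event never happened within the window:
            # fail loudly instead of trusting the UI confirmation.
            raise AssertionError(f"condition not met within {timeout_s}s")
        time.sleep(interval_s)


# Hypothetical usage against a DB-API connection `conn`:
# row = wait_for(lambda: conn.execute(
#     "SELECT 1 FROM entitlements WHERE client_ref = ?", (ref,)
# ).fetchone())
```

The timeout should reflect how long the system is allowed to take, not how long it usually takes. If it is set too generously, the test will pass on writes that arrive later than users can tolerate.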

Interview Questions

What NZ hiring managers ask about Grey-Box Testing, and what strong answers look like.

What distinguishes grey-box testing from black-box testing, and when would you use each?

Strong answer: Black-box testing uses only the external interface: inputs, outputs, and documentation. Grey-box testing uses partial internal knowledge: API contracts, database schemas, log formats, event payloads, or architecture diagrams, without reading application source code. I use grey-box when I have access to the system's internal structure but not its code, for example API testing against a known schema, database testing against the schema, or integration testing where I know which services are called. Grey-box lets me design better test cases (testing the database state after an API call, not just the HTTP response) without requiring developer-level code knowledge.

Junior/Mid

You are testing a payment processing service. What grey-box information would you request, and how would you use it?

Strong answer: I would request: the API contract (OpenAPI spec) to understand all endpoints and required fields; the database schema to write assertions against persisted state; the message queue schema (Kafka topics, SQS message format) to verify events are published correctly; the error code taxonomy to understand what each error means; and the downstream integration documentation (bank gateway, Revenue NZ). I use the API contract to design boundary and equivalence partition tests. I use the database schema to verify that a successful payment creates the correct records. I use the message queue schema to verify that reconciliation events are published with the correct fields.

Mid/Senior

Self-Check

Q1: In one sentence, what makes grey-box testing different from black-box testing?

Both drive the system from the outside like a user, but grey box adds partial knowledge of the internals — schema, API contracts, logs — to design smarter cases and to verify internal state rather than trusting the UI message alone.

Q2: Name three sources of internal knowledge a grey-box tester typically uses.

Any three of: the database schema (tables, columns, constraints), API contracts (request/response shapes and status codes), the architecture (services, queues, caches), and logs or traces (audit trails, distributed traces). These are partial, contracted views — not full source access.

Q3: A transfer screen says “success” but only the debit leg committed. Which approach catches this, and how?

Grey box. After the UI message you query the ledger for the transaction reference and assert both legs — the debit and the matching credit — plus that the totals reconcile. The screen lied; the database row is the proof.

Q4: When does a grey-box test slip into being a white-box test?

When you stop driving from the outside and start designing tests against the code structure itself — targeting private functions, specific branches, or statement/path coverage. At that point you need full source access and your goal is structural coverage, not behaviour verification.

Q5: Why should internal assertions be anchored to contracts rather than incidental details?

Because asserting on an undocumented column name or a log format that may change without notice makes tests brittle — they break on harmless internal refactors. Anchoring to stable, documented schema, published API responses, or defined audit events keeps the checks meaningful and durable.

Q6: Your team is testing the Benefits NZ benefit payment portal. A tester verifies a new payment entitlement by checking the confirmation screen and the applicant’s payment history tab. Is this grey-box testing, and what single step would make it definitively grey box?

A: As described, it is black box — both checks are UI surfaces that the system itself controls and can display incorrectly. To make it grey box, add a direct assertion against the payments database: query the entitlements or scheduled_payments table using the applicant’s client reference and confirm the correct amount, start date, and payment frequency are stored as committed rows. The screen shows the claim; the database row is the evidence. In a government benefits context this matters doubly — a half-written entitlement that displays correctly but never persists could leave a client without income with no visible error trail.

Q7: What is the key difference between grey-box testing and API testing, given that both can assert on API responses?

A: API testing focuses entirely on the API layer — it drives the system through API calls, asserts on response shapes and status codes, and treats the API as the system under test. Grey-box testing uses API tracing as one verification tool among several (alongside database queries and log checks) while still driving the system from the user’s perspective through the UI. In grey box the API assertion is the oracle, not the entry point; the primary action remains outside-in. A tester clicking “Pay” in the TransitNZ licensing portal and then checking that the underlying POST /payments returned a 201 is grey box; a tester calling POST /payments directly and asserting the response body is API testing.

Q8: When should you NOT apply grey-box testing, even on a flow that writes data to a database?

A: Skip grey box when the internal assertions you need have no stable contract to anchor to. If the database schema is actively evolving mid-sprint without migration files, or the log format is undocumented and changes per release, any DB assertion you write will break on refactors rather than real bugs — creating noise, not signal. Also skip it when the overhead of getting database access outweighs the risk: a pure display change with no persistence side-effects, a low-risk cosmetic update to a KiwiSaver dashboard label, or an environment where read-only credentials are months away all make black-box testing the pragmatic choice. Grey box earns its cost on high-stakes write flows; it is not a default to apply everywhere.

Q9: A developer says, “Grey-box testing is just white-box lite — you’re still looking at the internals, so it’s the same thing with less access.” What is wrong with this, and how do you respond?

A: The statement confuses the source of knowledge with the purpose and entry point of the test. White-box testing designs cases from the code structure itself — targeting branches, statements, or paths — and measures coverage against the code. The code is the test design input. Grey-box testing keeps the user’s perspective as its entry point and uses structural knowledge (schema, API contracts, logs) only as a design aid and a verification oracle. You never drive grey-box tests from the source code, and you do not chase coverage metrics against it. The practical response: “In grey box I click through the Revenue NZ myIR flow as a taxpayer and then query the ledger table to confirm the payment row committed. I am not reading the payment-service source code or targeting any branch inside it — I do not need to. The database schema gives me enough signal to verify correctness without touching the code.”

Why teams fail here

Treating the UI “success” message as the test oracle rather than as a claim to be verified against the database or API layer beneath it.
Checking existence only — asserting that a row was created but never checking the row count, adjacent tables, or totals reconciliation, so half-written or duplicated writes slip through.
Anchoring assertions to undocumented internals — column names, log formats, or internal status codes that developers can rename without notice — which makes tests break on harmless refactors rather than real regressions.
Leaving database access setup to the last week of a project, so grey-box checks never get written because the credentials request is still pending when go-live pressure hits.

Key takeaway

Grey-box testing turns a hopeful “success” message into a provable fact — because the database row, the API response, and the audit log are evidence, and the screen is just a claim.

How this has changed

The field moved. Here is how Grey-Box Testing evolved from its origins to current practice.

Pre-2000

Testing is categorised as black-box (no code access) or white-box (full code access). The two are treated as separate disciplines practised by different people — testers do black-box, developers do white-box. Collaboration between the two perspectives is not formalised.

Early 2000s

The term "grey-box testing" emerges to describe testing that uses partial internal knowledge (API contracts, database schemas, log outputs) without reading source code. Integration testers and DBAs practise grey-box testing naturally without naming it.

2010s

Agile and DevOps break down the wall between tester and developer. Testers gain access to CI pipelines, API schemas, database queries, and monitoring dashboards. Pure black-box testing becomes less common as testers work more closely with the systems they test.

2015

Grey-box testing becomes the dominant approach for API testing, microservices testing, and integration testing. Testers use OpenAPI specs, database schemas, and event payloads to inform their test design without reading application code.

Now

The distinction between black, grey, and white-box is less important than understanding what information is available and how to use it effectively. AI systems are often natural grey-box targets: testers can see inputs, outputs, and usually some surrounding internals such as prompts, logs, or traces, but cannot reason from the model weights, so grey-box thinking fits AI testing well.

Interview Prep

“What is grey-box testing, and when would you choose it over black box?”

Grey box drives the system from the outside like a user but uses partial knowledge of the internals — schema, API contracts, logs — to design cases and verify state. I choose it whenever the visible result and the stored result can diverge: payments, ledgers, subscriptions, stock, audit trails. A black-box test trusts the “success” message; grey box confirms the database row, the API response, and the log line agree with it.

“Give a concrete example of an internal assertion you would add to a UI test.”

For a bank transfer, after the “Transfer successful” message I query the ledger for the transaction reference and assert both legs — the debit on the source and the matching credit on the destination — plus that the totals reconcile and an audit entry was written. That catches a half-committed transaction where money leaves one account but never arrives, which a screen-only test cannot see.

“How do you keep grey-box tests from becoming brittle?”

Anchor internal assertions to stable contracts — documented schema, published API responses, defined audit events — not incidental column names or log formats that may change on a refactor. I also scope database checks to the specific transaction reference or test account I created, so a leftover row from another run cannot cause a false pass or fail.
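Scoping a database check to one run's reference could be sketched like this. It assumes the test can attach its own reference to the transfer or capture the one the system returns; the `GBT` prefix, the `ledger` columns, and both function names are hypothetical.

```python
import sqlite3
import uuid


def new_test_reference(prefix="GBT"):
    """Return a reference unique to this test run (hypothetical scheme)."""
    return f"{prefix}-{uuid.uuid4().hex}"


def committed_legs(conn, txn_ref):
    """Fetch only the committed ledger rows that carry this run's reference."""
    return conn.execute(
        "SELECT account, amount_cents FROM ledger "
        "WHERE txn_ref = ? AND status = 'COMMITTED' ORDER BY account",
        (txn_ref,),
    ).fetchall()
```

Because every query filters on the run's own reference, a row left behind by an earlier run on the same test account cannot satisfy or break the assertion.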

Grey box builds on black-box techniques for case design and borrows from white-box coverage for structural knowledge — sitting deliberately between the two.

The internal checks themselves draw on API testing for tracing calls and asserting on payloads, and on database testing for writing the queries that verify rows and reconcile totals.

Where grey box shines: any flow where the visible result and the stored result can diverge — payments and ledgers, subscriptions, stock levels, audit trails, and multi-service transactions. If a “success” message could be wrong, grey box is the technique that proves it.
