Execution & Evidence
The first build of the Tūāpapa portal has landed in the test environment. Now you run the cases you designed last week — and record each result with enough evidence that anyone could trust it without looking over your shoulder.
1 The Hook
Build 1.0 of the Tūāpapa portal is deployed to the test environment. You sit down with the test suite you designed in Week 2 and start running it. By lunchtime you have a row of green ticks and a couple of reds. You message Aroha: “Mostly passing, a few fails, looking okay.”
Two days later a developer picks up one of your fails — the 5% rate being accepted — tries it on their machine, and it works fine for them. They reply: “Can’t reproduce. Which build? What data? What did you actually see?” You scroll back through your memory. You think it was on build 1.0, with the Test01 member, but you are not certain, and you did not capture what the screen showed. The defect bounces back to you as “cannot reproduce”, and a real bug sits unfixed because your result was a vibe, not a record.
A test result is only worth what its evidence proves. “It failed” with nothing behind it is an opinion a developer can wave away. “On build 1.0, member Test01, setting 5%, the portal accepted it and saved 5% — here is the screen and the audit-log row” is a fact a developer has to act on. This week is about turning execution into evidence: running deliberately, recording precisely, and never relying on memory.
2 The Rule
An untraceable result is no result. Every test you run must record what you ran it against — the build, the environment, the data — what you did, what you expected, and what you actually saw, with evidence. If a developer cannot reproduce it from your record alone, you have not finished the test.
3 The Analogy
A nurse charting observations on a ward.
When a nurse at Te Whatu Ora takes a patient’s observations, they do not just remember that the blood pressure “seemed high.” They chart the exact reading, the time, the patient, and who took it. The next person on shift reads the chart and acts on it without having been in the room. The chart is trusted because it is specific, timed, and attributed.
Your test results are that chart. A developer, a test lead, or an auditor reads them later without having watched you test, and acts on them — but only if they are specific, timed, and tied to a build and a data set. “Looked okay” is a nurse writing “patient seemed fine” and going home. It helps no one and it is not safe.
4 This Week’s Brief
You are executing the TUA-103 suite from Week 2 against build 1.0 in the SIT environment, using a known test member. Here is what you actually observe as you run three of the cases.
TC-103-01 (set 6%): rate saved, confirmation shown naming 6% — matches expected.
TC-103-07 (attempt 5%): the portal accepted 5% and saved it; no error, no allowed-rates message. Audit log shows old 3% → new 5%.
TC-103-11 (accessibility): the rate selector cannot be reached by keyboard alone — Tab skips straight past it to the Save button. Could not complete the screen-reader check because the control could not be focused.
One clean pass, one clear fail, and one case you literally could not finish. This week you learn to record all three correctly — including the one that is neither pass nor fail.
5 Pass, Fail, and Blocked
Every executed test ends in one of three states, and using the right one is part of the discipline.
Pass
The actual result matched the expected result exactly. Not “close enough”: if the confirmation was supposed to name the effective date and it did not, that is not a pass even though the rate saved. A partial match is a fail that produces a defect, not a generous pass.
Fail
You ran the test fully and the actual result did not match the expected. TC-103-07 is a fail: you could run it, and the portal did the wrong thing — it accepted 5%. A fail means the feature behaved incorrectly, and it produces a defect (Week 4).
Blocked
You could not run the test to completion through no fault of the feature under test — or because a defect upstream stopped you. TC-103-11 is interesting: the accessibility check is partly a fail (the selector is not keyboard-reachable) and the screen-reader half is blocked by that same defect. Record what you proved (keyboard failure) and mark the rest blocked, naming what blocked it. “Blocked” without a reason is just a gap.
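The three states can be sketched as a small decision rule. This is an illustrative sketch only, not part of any Tūāpapa tooling: the names `Status`, `classify`, `completed`, `matches_expected` and `blocked_by` are all hypothetical.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    BLOCKED = "BLOCKED"


def classify(completed: bool, matches_expected: bool = False,
             blocked_by: Optional[str] = None) -> Status:
    """Return the status for one executed check.

    completed        -- True if the check was run to completion
    matches_expected -- True only if actual matched expected exactly
    blocked_by       -- what stopped the run; required when not completed
    """
    if not completed:
        if not blocked_by:
            # "Blocked" without a reason is just a gap, so refuse it.
            raise ValueError("a blocked check must name what blocked it")
        return Status.BLOCKED
    # A partial match is not a pass: anything short of exact is a fail.
    return Status.PASS if matches_expected else Status.FAIL


# The three cases from this week's brief. TC-103-11 is split into its two checks.
tc_01 = classify(completed=True, matches_expected=True)
tc_07 = classify(completed=True, matches_expected=False)
tc_11_keyboard = classify(completed=True, matches_expected=False)
tc_11_reader = classify(completed=False,
                        blocked_by="rate selector cannot receive keyboard focus")
```

Treating TC-103-11 as two checks is what lets one case carry both a fail and a blocked outcome without blurring them.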
6 What Counts as Evidence
Evidence is what lets someone who was not there reach your conclusion. For a test result, the minimum useful evidence is:
- Build and environment: exactly which build (1.0.0-rc1) in which environment (SIT). A result with no build is unreproducible — the bug may already be fixed or may be environment-specific.
- Test data and starting state: the member used and their state (Test01, rate 3%). The same step gives different results from a different starting point.
- What you did: the actual steps taken, especially if they deviated from the written case.
- Expected vs actual: stated side by side, so the gap is unmistakable — “expected: 5% rejected; actual: 5% accepted and saved.”
- An artefact: a screenshot, the audit-log row, a console error, a network trace — something a developer can look at, not just your words.
The audit log is your friend here. For TC-103-07, the row showing old 3% → new 5% is proof the system did the wrong thing, independent of your screenshot — the kind of evidence the FMA would also want to see.
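One way to hold yourself to that minimum is a quick completeness check before you move to the next case. The sketch below is an assumption about how you might do it, not a prescribed tool: the field names and the `missing_evidence` helper are hypothetical, and the sample values come from TC-103-07.

```python
from typing import Dict, List

# The minimum useful evidence listed above, as hypothetical field names.
REQUIRED_FIELDS = (
    "build", "environment", "test_data", "steps",
    "expected", "actual", "artefacts",
)


def missing_evidence(record: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or empty in a result record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


tc_103_07 = {
    "build": "1.0.0-rc1",
    "environment": "SIT",
    "test_data": "Member Test01, current rate 3%",
    "steps": "Attempted to set the rate to 5%",
    "expected": "5% rejected",
    "actual": "5% accepted and saved",
    "artefacts": ["confirm-5pct.png", "audit-log row 88421"],
}

vague = {"actual": "looked okay"}

print(missing_evidence(tc_103_07))  # nothing missing
print(missing_evidence(vague))      # everything except "actual"
```

An empty list means a developer has what they need to reproduce the result; anything else is a gap to close while the build and the screen are still in front of you.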
7 Recording a Result
A recorded result has a fixed shape, so nothing important is left to memory. Here is TC-103-07 recorded properly:
Run date/time: 2026-06-03 11:42 NZST
Build: 1.0.0-rc1
Environment: SIT
Tester: (you)
Test data: Member Test01, current rate 3%, attempted rate 5%
Status: FAIL
Expected: 5% rejected; allowed rates shown; rate stays 3%; no call sent to the fund administration system.
Actual: 5% accepted and saved; no error; confirmation shown as if valid.
Evidence: Screenshot confirm-5pct.png; audit-log row id 88421 (3% → 5%).
Defect raised: (to be logged — Week 4)
This record reproduces itself. A developer reads it and knows the build, the data, the exact gap, and has an artefact to look at — no follow-up message needed. That is the standard every result should meet, pass or fail.
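If you keep results in a script or spreadsheet export rather than free text, the fixed shape can be enforced by the data type itself. The following is a minimal sketch under that assumption; `ResultRecord` and its field names are hypothetical, chosen to mirror the record above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ResultRecord:
    """One executed result. Every field is required, so none is left to memory."""
    run_at: str
    build: str
    environment: str
    tester: str
    test_data: str
    status: str
    expected: str
    actual: str
    evidence: str

    def render(self) -> str:
        """Render the record as 'Label: value' lines in a fixed order."""
        labels = {
            "run_at": "Run date/time",
            "build": "Build",
            "environment": "Environment",
            "tester": "Tester",
            "test_data": "Test data",
            "status": "Status",
            "expected": "Expected",
            "actual": "Actual",
            "evidence": "Evidence",
        }
        return "\n".join(
            f"{labels[f.name]}: {getattr(self, f.name)}" for f in fields(self)
        )


record = ResultRecord(
    run_at="2026-06-03 11:42 NZST",
    build="1.0.0-rc1",
    environment="SIT",
    tester="(you)",
    test_data="Member Test01, current rate 3%, attempted rate 5%",
    status="FAIL",
    expected="5% rejected; allowed rates shown; rate stays 3%",
    actual="5% accepted and saved; no error",
    evidence="Screenshot confirm-5pct.png; audit-log row id 88421",
)
print(record.render())
```

Because no field has a default, leaving out the build or the evidence raises a `TypeError` when the record is created, which is the point: the gap shows up at recording time, not two days later.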
8 Execution Discipline
Running tests well is a habit as much as a skill. A few rules keep your results trustworthy across a whole suite:
- Test what is there, not what you wrote. If the build differs from your expected result, that is a finding — record the actual behaviour, do not quietly “fix” reality to match your case.
- Reset state between cases. A member left at 5% from the previous test will skew the next one. Know and restore your starting state.
- Re-test on the right build. When a fix lands, confirm which build it is in before you re-run — a pass on the wrong build proves nothing.
- Keep a run log. Date, build, and result per case. The board, the regulator, and your Week 6 release report all draw on it — this is the raw material for the executive summary you will write later.
Discipline here is what makes Weeks 4, 5 and 6 possible. The defect you log next week, the regression you plan after, and the release call you make all stand on the quality of these records.
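A run log can be as simple as one entry per case, rolled up into counts. The sketch below assumes a plain list of dictionaries and a hypothetical `summarise` helper; it filters by build so that results from a different build never leak into the summary.

```python
from collections import Counter

# One entry per case: date, build, and result, as the run-log rule asks.
run_log = [
    {"case": "TC-103-01", "date": "2026-06-03", "build": "1.0.0-rc1",
     "results": ["PASS"]},
    {"case": "TC-103-07", "date": "2026-06-03", "build": "1.0.0-rc1",
     "results": ["FAIL"]},
    # One case can carry two outcomes: keyboard check failed,
    # screen-reader check blocked by that same defect.
    {"case": "TC-103-11", "date": "2026-06-03", "build": "1.0.0-rc1",
     "results": ["FAIL", "BLOCKED"]},
]


def summarise(log, build):
    """Count outcomes for one build only; entries from other builds are ignored."""
    counts = Counter()
    for entry in log:
        if entry["build"] == build:
            counts.update(entry["results"])
    return (f"{counts['PASS']} passed, {counts['FAIL']} failed, "
            f"{counts['BLOCKED']} blocked")


print(summarise(run_log, "1.0.0-rc1"))  # 1 passed, 2 failed, 1 blocked
```

Counting per build is a deliberate choice here: a pass recorded against an older build should not inflate the numbers for the build you are reporting on.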
9 Common Mistakes
🚫 Recording a result without the build number
Why it happens: Everyone “knows” which build is current right now, so noting it feels redundant.
The fix: Builds get redeployed and fixes land between your run and the developer’s. Without the exact build, a fail can’t be reproduced and a pass can’t be trusted. Capture the build on every single result.
🚫 Marking a near-miss as “pass”
Why it happens: The core action worked, so the missing detail feels minor and not worth a red mark.
The fix: If any part of the expected result did not happen — the effective date missing, the audit row absent — it is not a pass. A generous pass hides a real defect and erodes trust in your whole run.
🚫 Leaving a test that couldn’t be run with no status
Why it happens: A test you couldn’t complete feels like nothing happened, so it gets skipped silently.
The fix: Mark it Blocked and name what blocked it — a missing environment, a dependency down, or an upstream defect. A silent gap looks like coverage that does not exist; an honest “blocked, because X” is information the team needs.
🚫 Relying on memory instead of capturing evidence
Why it happens: The failure is obvious in the moment, so stopping to screenshot feels like a waste.
The fix: By the time a developer questions the result, the moment is gone and the build may have changed. Capture the screenshot, the audit row, or the console error as you see it — evidence you didn’t take is evidence you don’t have.
10 Now You Try
Three graded exercises on executing and recording results. Write your answer, get feedback on it from your test lead, then compare it to the model answer.
A teammate recorded the result below for a Tūāpapa test run. Identify 3 problems that would stop a developer acting on it, and say what is missing for each.
Test: contribution rate. Status: pass-ish. Notes: tried a few rates, mostly fine, one of them was a bit weird but probably okay. Looked good overall.
List 3 problems and what is missing for each:
Model answer
This record is a vibe, not a result. Any three of these earn full marks.
1. No build or environment: “tried a few rates” against which build, in which environment? Without 1.0.0-rc1 / SIT the result can’t be reproduced or trusted.
2. No specific test data or per-case status: “a few rates” and “pass-ish” are not results. Each case needs its own ID, the exact rate used, and a clear Pass/Fail/Blocked. “Pass-ish” hides the “a bit weird” case, which sounds like a real fail.
3. No expected vs actual and no evidence: there is no statement of what should have happened versus what did, and no screenshot or audit-log row. A developer has nothing to look at and nothing to act on.
Bonus: no run date/time and no tester.
The fix is to record each case as: ID, date/time, build, environment, data, status, expected, actual, evidence. The “a bit weird” rate is almost certainly the 5% fail and must be logged, not softened to “probably okay”.
Rewrite the vague record below as a complete, reproducible result record with the fields: Test ID, Run date/time, Build, Environment, Test data, Status, Expected, Actual, Evidence. It is the accessibility case (TC-103-11) where the rate selector can’t be reached by keyboard, so the screen-reader check couldn’t be completed.
“Accessibility didn’t really work, couldn’t test it properly. Marking as fail I guess.”
Rewrite as a complete result record:
Model answer
Test ID: TC-103-11 (accessibility: keyboard & screen reader)
Run date/time: 2026-06-03 14:10 NZST
Build: 1.0.0-rc1
Environment: SIT
Test data: Member Test01; navigation by keyboard only, then screen reader
Status: FAIL (keyboard) + BLOCKED (screen-reader sub-check, blocked by the keyboard defect)
Expected: The rate selector is reachable and operable by keyboard, and the selector, confirmation and any error are announced by a screen reader (NZ Government Web Accessibility Standard).
Actual: Tab order skips the rate selector entirely: focus jumps from the page heading straight to the Save button. The selector cannot be focused or operated by keyboard. The screen-reader announcement check could not be completed because the control could not receive focus.
Evidence: Screen recording kbd-tab-order.mp4 showing focus skipping the selector; note of the Tab sequence observed.
What makes this complete: it separates what was proven (keyboard failure) from what was blocked (screen-reader check) and names what blocked it, with a build, environment, data and an artefact. The original gave a verdict with no build, no data, no evidence, and conflated “fail” with “couldn’t test”.
Produce an execution run log for the three cases in this week’s brief (TC-103-01 set 6% — passed; TC-103-07 attempt 5% — accepted/saved wrongly; TC-103-11 accessibility — keyboard fail, screen reader blocked). For each, give ID, status, expected, actual, and evidence. Then add a one-line run summary (counts of pass/fail/blocked).
Model answer
Build: 1.0.0-rc1 | Environment: SIT | Member: Test01 (current rate 3%)
TC-103-01 (set 6%) | Status: PASS | Expected: 6% saved; confirmation names 6% and effective date; sent to fund administration system | Actual: 6% saved; confirmation shown naming 6% and effective date | Evidence: screenshot confirm-6pct.png; audit row 88420 (3% -> 6%)
TC-103-07 (attempt 5%) | Status: FAIL | Expected: 5% rejected; allowed rates shown; rate stays 3%; no call sent | Actual: 5% accepted and saved with no error or warning | Evidence: screenshot confirm-5pct.png; audit row 88421 (3% -> 5%)
TC-103-11 (accessibility) | Status: FAIL (keyboard) + BLOCKED (screen reader, blocked by the keyboard defect) | Expected: selector reachable by keyboard and announced by screen reader | Actual: Tab skips the selector; it cannot be focused, so the screen-reader check could not be completed | Evidence: recording kbd-tab-order.mp4
Run summary: 1 passed, 2 failed, 1 blocked (the screen-reader sub-check). Two defects to raise next week: the 5% acceptance and the keyboard-focus failure.
A strong log gives each case its own status, a precise expected-vs-actual, and a concrete artefact, then rolls up to honest counts. A weak log writes “mostly fine” and loses the two real defects.
11 Self-Check
Q1: Why must every test result record the build number?
Because builds get redeployed and fixes land between your run and a developer’s investigation. Without the exact build, a fail cannot be reproduced and a pass cannot be trusted — the behaviour may already have changed or be specific to one build.
Q2: What is the difference between Fail and Blocked?
Fail means you ran the test fully and the feature behaved incorrectly. Blocked means you could not complete the test through no fault of the feature, or because an upstream defect stopped you. The accessibility case fails on keyboard and is blocked on the screen-reader check — record both honestly and name what blocked it.
Q3: A rate saves correctly but the confirmation omits the effective date the requirement asked for. Pass or fail?
Fail. Any part of the expected result that did not happen makes it a fail, not a generous pass. A near-miss recorded as a pass hides a real defect and undermines trust in the whole run.
Q4: What is the minimum evidence a fail should carry?
The build and environment, the test data and starting state, what you did, expected versus actual stated side by side, and an artefact a developer can look at — a screenshot, an audit-log row, a console error or a network trace. Enough for someone who wasn’t there to reproduce and act on it.
Q5: Why record results in the moment rather than at the end of the day?
Memory degrades fast and the build can be redeployed under you. Capturing the build number, the data, and a screenshot while they are still true is the only way to keep the evidence accurate — evidence you didn’t take is evidence you don’t have.
12 Interview Prep
Real questions asked in NZ QA interviews. Read the model answers, then practise your own version.
“A developer says they can’t reproduce your bug. How do you respond?”
First I check whether they were on the same build and data I recorded, because most “cannot reproduce” cases come down to a build or environment difference. Then I walk them through my recorded result: the exact build, the member and starting state, the steps, and the evidence I captured, namely the screenshot and the audit-log row. If my record is complete, the conversation moves from “did this happen” to “why does it happen on this build and not yours”, which is the useful question. If I find my own record was thin, that is a lesson to capture evidence in the moment next time.
“What’s the difference between a failed test and a blocked test, and why does it matter?”
A fail means I ran the test and the feature did the wrong thing — it produces a defect. Blocked means I couldn’t complete the test, usually because of an environment problem or an upstream defect. It matters because they tell the project different things: a pile of fails says the build has bugs, a pile of blocked says we can’t even assess coverage yet. Marking a blocked test as a fail, or leaving it with no status at all, gives a false picture of where the build actually stands.
“How do you make your test results trustworthy to an auditor?”
I treat every result like a record someone will read without me in the room. That means each one is tied to a specific build and environment, names the test data and starting state, states expected versus actual, and carries an artefact — a screenshot or, even better, an independent trace like the audit-log row. For a regulated NZ financial product the audit log is gold, because it proves what the system did regardless of my screenshot. A run log built from records like that is something the FMA, or my own release report, can stand on.