Capstone · Week 3 of 7

Execution & Evidence

The first build of the Tūāpapa portal has landed in the test environment. Now you run the cases you designed last week — and record each result with enough evidence that anyone could trust it without looking over your shoulder.

~30 min read · ~70 min with exercises

1 The Hook

Build 1.0 of the Tūāpapa portal is deployed to the test environment. You sit down with the test suite you designed in Week 2 and start running it. By lunchtime you have a row of green ticks and a couple of reds. You message Aroha: “Mostly passing, a few fails, looking okay.”

Two days later a developer picks up one of your fails — the 5% rate being accepted — tries it on their machine, and it works fine for them. They reply: “Can’t reproduce. Which build? What data? What did you actually see?” You scroll back through your memory. You think it was on build 1.0, with the Test01 member, but you are not certain, and you did not capture what the screen showed. The defect bounces back to you as “cannot reproduce”, and a real bug sits unfixed because your result was a vibe, not a record.

A test result is only worth what its evidence proves. “It failed” with nothing behind it is an opinion a developer can wave away. “On build 1.0, member Test01, setting 5%, the portal accepted it and saved 5% — here is the screen and the audit-log row” is a fact a developer has to act on. This week is about turning execution into evidence: running deliberately, recording precisely, and never relying on memory.

Senior engineer insight

Testers who thrive in Week 3 treat every result as a handoff document: they write it as if they won’t be in the building when a developer reads it. Those who struggle stay in observer mode — they run the tests and assume the result is obvious, then scramble to reconstruct evidence when a defect bounces back. The specific pressure unique to this phase is time compression: a build arrives, stakeholders want a signal fast, and the urge to rush the recording to get back to executing is exactly when evidence gaps sneak in.

Most common capstone mistake at this stage: recording all three cases at the end of a session from memory, leaving the defect with no build number and no artefact — then discovering the build was redeployed overnight.

From the field

On a payments compliance project for a Wellington financial services firm, the team assumed the first build was close to ready: the feature leads had tested it informally and called it “looking good.” When formal execution started, the first two scripted cases passed cleanly and the mood stayed optimistic. Then, on day two, an exploratory session that used the test data in ways the developers hadn’t tried surfaced three boundary defects that had gone undetected. The audit log showed that the wrong effective dates had been persisted on roughly 15% of transactions, data that would have gone to a regulator. Two commitments changed their approach: running at least one unscripted exploratory charter per build alongside the scripted suite, and sharing the run log in the team Slack channel at the end of each day so that developers, BAs and the Wellington client lead could all see status daily rather than waiting for a formal report.

2 The Rule

An untraceable result is no result. Every test you run must record what you ran it against — the build, the environment, the data — what you did, what you expected, and what you actually saw, with evidence. If a developer cannot reproduce it from your record alone, you have not finished the test.

3 The Analogy

Analogy

A nurse charting observations on a ward.

When a nurse at Health NZ takes a patient’s observations, they do not just remember that the blood pressure “seemed high.” They chart the exact reading, the time, the patient, and who took it. The next person on shift reads the chart and acts on it without having been in the room. The chart is trusted because it is specific, timed, and attributed.

Your test results are that chart. A developer, a test lead, or an auditor reads them later without having watched you test, and acts on them — but only if they are specific, timed, and tied to a build and a data set. “Looked okay” is a nurse writing “patient seemed fine” and going home. It helps no one and it is not safe.

4 This Week’s Brief

You are executing the TUA-103 suite from Week 2 against build 1.0 in the SIT environment, using a known test member. Here is what you actually observe as you run three of the cases.

Environment: SIT · Build: 1.0.0-rc1 · Member: Test01 (current rate 3%)

TC-103-01 (set 6%): rate saved, confirmation shown naming 6% — matches expected.
TC-103-07 (attempt 5%): the portal accepted 5% and saved it; no error, no allowed-rates message. Audit log shows old 3% → new 5%.
TC-103-11 (accessibility): the rate selector cannot be reached by keyboard alone — Tab skips straight past it to the Save button. Could not complete the screen-reader check because the control could not be focused.

One clean pass, one clear fail, and one case you literally could not finish. This week you learn to record all three correctly — including the one that is neither pass nor fail.

5 Pass, Fail, and Blocked

Every executed test ends in one of three states, and using the right one is part of the discipline.

Pass

The actual result matched the expected result exactly. Not “close enough” — if the confirmation was supposed to name the effective date and it did not, that is not a pass even though the rate saved. A partial match is a fail or a defect, not a generous pass.

Fail

You ran the test fully and the actual result did not match the expected. TC-103-07 is a fail: you could run it, and the portal did the wrong thing — it accepted 5%. A fail means the feature behaved incorrectly, and it produces a defect (Week 4).

Blocked

You could not run the test to completion through no fault of the feature under test — or because a defect upstream stopped you. TC-103-11 is interesting: the accessibility check is partly a fail (the selector is not keyboard-reachable) and the screen-reader half is blocked by that same defect. Record what you proved (keyboard failure) and mark the rest blocked, naming what blocked it. “Blocked” without a reason is just a gap.

Pro tip: Never mark a test “pass” because it nearly worked, and never leave a test silently unrun. A truthful “blocked, because X” protects you and the project far better than an optimistic green tick that falls over in production.
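The three states can also be written down as a small decision rule. The Python below is a minimal sketch rather than part of any Tūāpapa tooling: the names Check, status_of and case_status are hypothetical, and it assumes each test case is split into named checks that were either observed (matching or not) or never observed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    BLOCKED = "BLOCKED"


@dataclass
class Check:
    name: str
    matched: Optional[bool]           # True/False once observed, None if never observed
    blocked_by: Optional[str] = None  # required whenever matched is None


def status_of(check: Check) -> Status:
    """Status of a single check; a blocked check must name its blocker."""
    if check.matched is None:
        if not check.blocked_by:
            raise ValueError(f"{check.name}: blocked without a reason is just a gap")
        return Status.BLOCKED
    return Status.PASS if check.matched else Status.FAIL


def case_status(checks: List[Check]) -> List[Status]:
    """PASS only if every check passed; otherwise report FAIL and/or BLOCKED."""
    if not checks:
        raise ValueError("a case with no checks has no status")
    statuses = {status_of(c) for c in checks}
    if statuses == {Status.PASS}:
        return [Status.PASS]
    return [s for s in (Status.FAIL, Status.BLOCKED) if s in statuses]


tc_103_11 = [
    Check("selector reachable by keyboard", matched=False),
    Check("screen-reader announcement", matched=None,
          blocked_by="keyboard-focus defect"),
]
near_miss = [
    Check("rate saved", matched=True),
    Check("confirmation names effective date", matched=False),
]

print([s.value for s in case_status(tc_103_11)])  # ['FAIL', 'BLOCKED']
print([s.value for s in case_status(near_miss)])  # ['FAIL']
```

The near-miss case shows why a partial match never rolls up to a pass: one unmatched check is enough to make the whole case a fail.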

6 What Counts as Evidence

Evidence is what lets someone who was not there reach your conclusion. For a test result, the minimum useful evidence is:

Build and environment: exactly which build (1.0.0-rc1) in which environment (SIT). A result with no build is unreproducible — the bug may already be fixed or may be environment-specific.
Test data and starting state: the member used and their state (Test01, rate 3%). The same step gives different results from a different starting point.
What you did: the actual steps taken, especially if they deviated from the written case.
Expected vs actual: stated side by side, so the gap is unmistakable — “expected: 5% rejected; actual: 5% accepted and saved.”
An artefact: a screenshot, the audit-log row, a console error, a network trace — something a developer can look at, not just your words.

The audit log is your friend here. For TC-103-07, the row showing old 3% → new 5% is proof the system did the wrong thing, independent of your screenshot. It is also the kind of evidence the Financial Markets Authority (FMA) would want to see.
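One way to make that minimum hard to skip is to check a record for gaps before it is shared. The sketch below assumes a result is held as a plain dictionary; the field names and the missing_evidence helper are illustrative choices, not a real tool's schema, and the wording of the steps value is assumed.

```python
# Fields taken from the minimum-evidence list above (names are illustrative).
REQUIRED = ("build", "environment", "test_data", "steps",
            "expected", "actual", "artefacts")


def missing_evidence(record):
    """Return the required evidence fields that are absent or empty."""
    missing = []
    for name in REQUIRED:
        value = record.get(name)
        if isinstance(value, str):
            value = value.strip()   # whitespace-only text counts as empty
        if not value:
            missing.append(name)
    return missing


vague = {"actual": "It failed"}

tc_103_07 = {
    "build": "1.0.0-rc1",
    "environment": "SIT",
    "test_data": "Member Test01, current rate 3%, attempted rate 5%",
    "steps": "As written in TC-103-07, no deviation",  # assumed wording
    "expected": "5% rejected; rate stays 3%",
    "actual": "5% accepted and saved",
    "artefacts": ["confirm-5pct.png", "audit-log row 88421"],
}

print(missing_evidence(vague))
# ['build', 'environment', 'test_data', 'steps', 'expected', 'artefacts']
print(missing_evidence(tc_103_07))
# []
```

An empty list does not prove the evidence is good, only that nothing on the minimum list was left blank.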

7 Recording a Result

A recorded result has a fixed shape, so nothing important is left to memory. Here is TC-103-07 recorded properly:

Test ID:        TC-103-07
Run date/time:  2026-06-03 11:42 NZST
Build:          1.0.0-rc1
Environment:    SIT
Tester:         (you)
Test data:      Member Test01, current rate 3%, attempted rate 5%
Status:         FAIL
Expected:       5% rejected; allowed rates shown; rate stays 3%; no call sent.
Actual:         5% accepted and saved; no error; confirmation shown as if valid.
Evidence:       Screenshot confirm-5pct.png; audit-log row id 88421 (3% → 5%).
Defect raised:  (to be logged in Week 4)

This record reproduces itself. A developer reads it and knows the build, the data, the exact gap, and has an artefact to look at — no follow-up message needed. That is the standard every result should meet, pass or fail.

Pro tip: Record the result the moment you see it, not at the end of the day. Memory degrades fast and the build may be redeployed under you — capture the build number and a screenshot while they are still true.
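If results are captured with a script rather than typed by hand, the fixed shape can be enforced by the data structure itself. The following is a rough sketch under stated assumptions: ResultRecord and its field names are hypothetical, and NZST is modelled as a fixed UTC+12 offset for simplicity, where a real tool would use a proper time-zone database. The timestamp defaults to the moment the record is created, so it is never reconstructed later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC+12 offset labelled NZST (ignores daylight saving).
NZST = timezone(timedelta(hours=12), "NZST")


@dataclass
class ResultRecord:
    test_id: str
    build: str
    environment: str
    tester: str
    test_data: str
    status: str
    expected: str
    actual: str
    evidence: str
    defect_raised: str = "(to be logged)"
    # Stamped at creation unless a time is passed in explicitly.
    run_at: datetime = field(default_factory=lambda: datetime.now(NZST))

    def render(self) -> str:
        """Lay the record out in the fixed label/value shape."""
        rows = [
            ("Test ID:", self.test_id),
            ("Run date/time:", self.run_at.strftime("%Y-%m-%d %H:%M %Z")),
            ("Build:", self.build),
            ("Environment:", self.environment),
            ("Tester:", self.tester),
            ("Test data:", self.test_data),
            ("Status:", self.status),
            ("Expected:", self.expected),
            ("Actual:", self.actual),
            ("Evidence:", self.evidence),
            ("Defect raised:", self.defect_raised),
        ]
        return "\n".join(f"{label:<16}{value}" for label, value in rows)


record = ResultRecord(
    test_id="TC-103-07",
    build="1.0.0-rc1",
    environment="SIT",
    tester="(you)",
    test_data="Member Test01, current rate 3%, attempted rate 5%",
    status="FAIL",
    expected="5% rejected; allowed rates shown; rate stays 3%; no call sent.",
    actual="5% accepted and saved; no error; confirmation shown as if valid.",
    evidence="Screenshot confirm-5pct.png; audit-log row id 88421 (3% → 5%).",
    # Passed explicitly here only to reproduce the example timestamp.
    run_at=datetime(2026, 6, 3, 11, 42, tzinfo=NZST),
)
print(record.render())
```

Because every field except the defect reference and the timestamp is required, a record without a build or evidence cannot be created at all.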

8 Execution Discipline

Running tests well is a habit as much as a skill. A few rules keep your results trustworthy across a whole suite:

Test what is there, not what you wrote. If the build differs from your expected result, that is a finding — record the actual behaviour, do not quietly “fix” reality to match your case.
Reset state between cases. A member left at 5% from the previous test will skew the next one. Know and restore your starting state.
Re-test on the right build. When a fix lands, confirm which build it is in before you re-run — a pass on the wrong build proves nothing.
Keep a run log. Date, build, and result per case. The board, the regulator, and your Week 6 release report all draw on it — this is the raw material for the executive summary you will write later.

Discipline here is what makes Weeks 4, 5 and 6 possible. The defect you log next week, the regression you plan after, and the release call you make all stand on the quality of these records.
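Two of these habits, resetting state and keeping a run log, lend themselves to a small helper. The sketch below is illustrative only: the members dictionary is a hypothetical in-memory stand-in for SIT test data (a real reset would go through whatever the environment provides), and starting_state and log_run are made-up names.

```python
import csv
import io
from contextlib import contextmanager

LOG_FIELDS = ["run_at", "build", "test_id", "status"]


@contextmanager
def starting_state(members, member_id):
    """Remember a member's rate before a case and restore it afterwards."""
    original = members[member_id]
    try:
        yield original
    finally:
        members[member_id] = original  # runs even if the case raises


def log_run(log, run_at, build, test_id, status):
    """Append one row per executed case; write the header on first use."""
    writer = csv.DictWriter(log, fieldnames=LOG_FIELDS)
    if log.tell() == 0:
        writer.writeheader()
    writer.writerow({"run_at": run_at, "build": build,
                     "test_id": test_id, "status": status})


members = {"Test01": 3}   # hypothetical stand-in: Test01 starts at 3%
run_log = io.StringIO()   # a real log would be a file or shared sheet

with starting_state(members, "Test01"):
    members["Test01"] = 5  # the state TC-103-07 leaves behind
    log_run(run_log, "2026-06-03 11:42 NZST", "1.0.0-rc1", "TC-103-07", "FAIL")

print(members["Test01"])   # 3
print(run_log.getvalue())
```

Logging inside the same block as the case keeps the build and time attached to the result at the moment it is observed, and the restore step means the next case starts from a known 3%.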

9 Common Mistakes

🚫 Recording a result without the build number

Why it happens: Everyone “knows” which build is current right now, so noting it feels redundant.
The fix: Builds get redeployed and fixes land between your run and the developer’s. Without the exact build, a fail can’t be reproduced and a pass can’t be trusted. Capture the build on every single result.

🚫 Marking a near-miss as “pass”

Why it happens: The core action worked, so the missing detail feels minor and not worth a red mark.
The fix: If any part of the expected result did not happen — the effective date missing, the audit row absent — it is not a pass. A generous pass hides a real defect and erodes trust in your whole run.

🚫 Leaving a test that couldn’t be run with no status

Why it happens: A test you couldn’t complete feels like nothing happened, so it gets skipped silently.
The fix: Mark it Blocked and name what blocked it — a missing environment, a dependency down, or an upstream defect. A silent gap looks like coverage that does not exist; an honest “blocked, because X” is information the team needs.

🚫 Relying on memory instead of capturing evidence

Why it happens: The failure is obvious in the moment, so stopping to screenshot feels like a waste.
The fix: By the time a developer questions the result, the moment is gone and the build may have changed. Capture the screenshot, the audit row, or the console error as you see it — evidence you didn’t take is evidence you don’t have.

10 Now You Try

Three graded exercises on executing and recording results. Write your answer, run it for feedback from your test lead, then compare to the model answer.

🔍 Exercise 1 of 3 — Spot the Recording Problems

A teammate recorded the result below for a Tūāpapa test run. Identify 3 problems that would stop a developer acting on it, and say what is missing for each.

Recorded result
Test: contribution rate. Status: pass-ish. Notes: tried a few rates, mostly fine, one of them was a bit weird but probably okay. Looked good overall.

List 3 problems and what is missing for each:

Model answer

This record is a vibe, not a result. Any three of these earn full marks.

1. No build or environment — "tried a few rates" against which build, in which environment? Without 1.0.0-rc1 / SIT the result can't be reproduced or trusted.

2. No specific test data or per-case status — "a few rates" and "pass-ish" are not results. Each case needs its own ID, the exact rate used, and a clear Pass/Fail/Blocked. "Pass-ish" hides the "a bit weird" case, which sounds like a real fail.

3. No expected vs actual and no evidence — there is no statement of what should have happened versus what did, and no screenshot or audit-log row. A developer has nothing to look at and nothing to act on.

Bonus: no run date/time and no tester. The fix is to record each case as: ID, date/time, build, environment, data, status, expected, actual, evidence — the "a bit weird" rate is almost certainly the 5% fail and must be logged, not softened to "probably okay".

🔧 Exercise 2 of 3 — Fix the Result Record

Rewrite the vague record below as a complete, reproducible result record with the fields: Test ID, Run date/time, Build, Environment, Test data, Status, Expected, Actual, Evidence. It is the accessibility case (TC-103-11) where the rate selector can’t be reached by keyboard, so the screen-reader check couldn’t be completed.

Original (too vague):
“Accessibility didn’t really work, couldn’t test it properly. Marking as fail I guess.”

Rewrite as a complete result record:

Model answer

Test ID: TC-103-11 (accessibility — keyboard & screen reader)
Run date/time: 2026-06-03 14:10 NZST
Build: 1.0.0-rc1
Environment: SIT
Test data: Member Test01; navigation by keyboard only, then screen reader
Status: FAIL (keyboard) + BLOCKED (screen-reader sub-check, blocked by the keyboard defect)
Expected: The rate selector is reachable and operable by keyboard, and the selector, confirmation and any error are announced by a screen reader (NZ Government Web Accessibility Standard).
Actual: Tab order skips the rate selector entirely — focus jumps from the page heading straight to the Save button. The selector cannot be focused or operated by keyboard. The screen-reader announcement check could not be completed because the control could not receive focus.
Evidence: Screen recording kbd-tab-order.mp4 showing focus skipping the selector; note of the Tab sequence observed.

What makes this complete: it separates what was proven (keyboard failure) from what was blocked (screen-reader check) and names what blocked it, with a build, environment, data and an artefact. The original gave a verdict with no build, no data, no evidence, and conflated "fail" with "couldn't test".

🏗️ Exercise 3 of 3 — Build the Execution Run Log

Produce an execution run log for the three cases in this week’s brief (TC-103-01 set 6% — passed; TC-103-07 attempt 5% — accepted/saved wrongly; TC-103-11 accessibility — keyboard fail, screen reader blocked). For each, give ID, status, expected, actual, and evidence. Then add a one-line run summary (counts of pass/fail/blocked).

Model answer

Build: 1.0.0-rc1 | Environment: SIT | Member: Test01 (current rate 3%)

TC-103-01 (set 6%) | Status: PASS | Expected: 6% saved; confirmation names 6% and effective date; sent to fund administration system | Actual: 6% saved; confirmation shown naming 6% and effective date; update sent to fund administration system | Evidence: screenshot confirm-6pct.png; audit row 88420 (3% → 6%)

TC-103-07 (attempt 5%) | Status: FAIL | Expected: 5% rejected; allowed rates shown; rate stays 3%; no call sent | Actual: 5% accepted and saved with no error or warning | Evidence: screenshot confirm-5pct.png; audit row 88421 (3% → 5%)

TC-103-11 (accessibility) | Status: FAIL (keyboard) + BLOCKED (screen reader, blocked by the keyboard defect) | Expected: selector reachable by keyboard and announced by screen reader | Actual: Tab skips the selector; it cannot be focused, so the screen-reader check could not be completed | Evidence: recording kbd-tab-order.mp4

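Run summary (3 cases): 1 passed, 2 failed, 1 blocked. The counts total four because TC-103-11 contributes both a fail (keyboard) and a blocked (the screen-reader sub-check). Two defects to raise next week: the 5% acceptance and the keyboard-focus failure.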

A strong log gives each case its own status, a precise expected-vs-actual, and a concrete artefact, then rolls up to honest counts. A weak log writes "mostly fine" and loses the two real defects.
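The roll-up in a run summary is simple counting, with one wrinkle: TC-103-11 carries two statuses, so three cases produce four status counts. A minimal sketch, assuming the run is held as a mapping from case ID to its list of statuses (the summarise function and the data layout are illustrative):

```python
from collections import Counter

# Statuses per case, as observed in this week's brief.
run = {
    "TC-103-01": ["PASS"],
    "TC-103-07": ["FAIL"],
    "TC-103-11": ["FAIL", "BLOCKED"],  # keyboard fail + screen-reader blocked
}


def summarise(run):
    """Count every status across all cases and report the totals."""
    counts = Counter(status for statuses in run.values() for status in statuses)
    return (f"{len(run)} cases: {counts['PASS']} passed, "
            f"{counts['FAIL']} failed, {counts['BLOCKED']} blocked")


print(summarise(run))  # 3 cases: 1 passed, 2 failed, 1 blocked
```

Reporting the case count alongside the status counts keeps the summary honest: a reader can see at once that the four statuses come from three cases, not four.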

Why teams fail here

Execution speed outpaces recording discipline — testers run through the suite quickly to show coverage, then try to reconstruct results from memory at the end of the day, by which point build numbers are uncertain and evidence is gone.
Optimistic status calls erode the run log — “pass-ish” and “probably fine” verdicts hide real defects, making the pass rate look healthy while blockers accumulate unseen until UAT or production.
Blocked cases are silently skipped rather than explicitly recorded — the gap appears as missing coverage in the release report, and nobody knows whether those features were ever assessed.
Scripted execution crowds out exploratory testing — teams tick through every written case but never sit with the build and probe it, so defects that live outside the happy path go undiscovered until a real user finds them.

Key takeaway

Week 3 teaches you that execution without evidence is just activity — and that the discipline of recording what you actually saw, the moment you saw it, is what separates a tester a team can trust from one they have to follow up.

11 Self-Check

Q1: Why must every test result record the build number?

Because builds get redeployed and fixes land between your run and a developer’s investigation. Without the exact build, a fail cannot be reproduced and a pass cannot be trusted — the behaviour may already have changed or be specific to one build.

Q2: What is the difference between Fail and Blocked?

Fail means you ran the test fully and the feature behaved incorrectly. Blocked means you could not complete the test through no fault of the feature, or because an upstream defect stopped you. The accessibility case fails on keyboard and is blocked on the screen-reader check — record both honestly and name what blocked it.

Q3: A rate saves correctly but the confirmation omits the effective date the requirement asked for. Pass or fail?

Fail. Any part of the expected result that did not happen makes it a fail, not a generous pass. A near-miss recorded as a pass hides a real defect and undermines trust in the whole run.

Q4: What is the minimum evidence a fail should carry?

The build and environment, the test data and starting state, what you did, expected versus actual stated side by side, and an artefact a developer can look at — a screenshot, an audit-log row, a console error or a network trace. Enough for someone who wasn’t there to reproduce and act on it.

Q5: Why record results in the moment rather than at the end of the day?

Memory degrades fast and the build can be redeployed under you. Capturing the build number, the data, and a screenshot while they are still true is the only way to keep the evidence accurate — evidence you didn’t take is evidence you don’t have.

12 Interview Prep

Real questions asked in NZ QA interviews. Read the model answers, then practise your own version.

“A developer says they can’t reproduce your bug. How do you respond?”

First I check whether they were on the same build and data I recorded, because most “cannot reproduce” cases come down to a build or environment difference. Then I walk them through my recorded result: the exact build, the member and starting state, the steps, and the evidence I captured (the screenshot and the audit-log row). If my record is complete, the conversation moves from “did this happen” to “why does it happen on this build and not yours”, which is the useful question. If I find my own record was thin, that is a lesson to capture evidence in the moment next time.

“What’s the difference between a failed test and a blocked test, and why does it matter?”

A fail means I ran the test and the feature did the wrong thing — it produces a defect. Blocked means I couldn’t complete the test, usually because of an environment problem or an upstream defect. It matters because they tell the project different things: a pile of fails says the build has bugs, a pile of blocked says we can’t even assess coverage yet. Marking a blocked test as a fail, or leaving it with no status at all, gives a false picture of where the build actually stands.

“How do you make your test results trustworthy to an auditor?”

I treat every result like a record someone will read without me in the room. That means each one is tied to a specific build and environment, names the test data and starting state, states expected versus actual, and carries an artefact — a screenshot or, even better, an independent trace like the audit-log row. For a regulated NZ financial product the audit log is gold, because it proves what the system did regardless of my screenshot. A run log built from records like that is something the FMA, or my own release report, can stand on.
