BDD & Three Amigos · Lesson 1

Three Amigos & Specification by Example

Most defects start as a misunderstanding, not a coding mistake. A short conversation that pins a vague story down to concrete examples — before any code is written — catches those misunderstandings when they are still free to fix. This lesson teaches that conversation.

Behaviour-Driven Development · Lesson 1 of 2 · ~30 min read · ~70 min with exercises

1 The Hook

A team at a fictional KiwiSaver provider, Tahi Wealth, picked up a story in sprint planning: “As a member, I want to change my contribution rate so that I can save more.” Everyone nodded. It was small, it was clear, and they sized it at three points. The developer built it. The tester wrote tests. They passed. It shipped.

Then the calls started. A member on a casual contract had set her rate to 3%, the legal minimum. At the next pay run, her rate quietly reset to 10%, because nobody had decided what the system should do when an employer’s payroll file disagreed with the member’s chosen rate. The business analyst had assumed the member’s choice always won. The developer had assumed the latest payroll file always won. The tester had tested the screen, not the conflict, because no one had told her the conflict existed.

No one wrote a bug. The code did exactly what it was built to do. The defect was older than the code — it was a gap in shared understanding that three people carried into the work, each filling it in differently and silently. The screen worked. The behaviour was wrong.

Here is the part that matters: that gap was findable in about ten minutes, in a room, before a line was written. All it took was someone asking “what happens when the member’s rate and the payroll file disagree?” That single question would have surfaced the conflict, forced a decision, and turned it into a concrete example everyone agreed on. Behaviour-Driven Development is the practice of asking that question on purpose, every time, with the right people in the room.

This lesson teaches the two halves of that practice: Specification by Example, which replaces vague rules with concrete examples, and the Three Amigos session, which is where the examples get agreed.

2 The Rule

A user story is a placeholder for a conversation, not a specification. The real specification is the set of concrete examples the business, development, and testing voices agree on together — before code is written. Shared understanding, captured as examples, is the output of BDD. The documents and the automation are just a record of it.

3 The Analogy

Briefing a builder before the concrete is poured.

You tell a builder “I want a deck off the back of the house.” The builder nods. If you both stop there, you might get a deck a metre off the ground with no rail, when you pictured a low platform with steps to the lawn. The fix once the posts are in the ground is expensive and grumpy. So a good builder asks first: how high? Stairs or no stairs? Does it need a handrail under the Building Code? Treated pine or hardwood? Each answer is a concrete example of what “a deck” actually means to you.

The Three Amigos session is that briefing, and the examples are the agreed measurements. You are far better off arguing about the handrail over a coffee than discovering the disagreement once the concrete has set. In software, the concrete sets the moment the code is built — so you ask the questions while the answers are still cheap.

4 Shift-Left and the Cost of a Gap

“Shift-left” means moving testing activity earlier in the work — to the left on a timeline that runs from idea to release. The reason is cost. A requirement gap gets more expensive to fix the later it is found, and the curve is steep.

In the conversation — someone asks a question, the team decides, the example is written. → cost: minutes.
In development — the developer hits the gap, stops, chases an answer, waits. → cost: hours and a context switch.
In testing — the tester finds the behaviour is wrong, raises a defect, the work goes back. → cost: a rebuild and a re-test.
In UAT — the business owner says “that is not what I meant”. → cost: a redesign, late in the sprint.
In production — a real member’s contribution rate resets without consent. → cost: a customer, a complaint, possibly a regulator.

BDD is a deliberate shift-left. It does not add a testing phase — it moves the most valuable testing thinking, the part that asks “what should this actually do, and what about the edges?”, all the way to the start, before anyone commits code. The tester stops being the last line of defence and becomes the first.

Pro tip: Testing is not a phase, it is an activity. The most powerful test a tester ever runs is the question they ask in the Three Amigos session, because it is the only test that can prevent a defect instead of catching one.

5 Specification by Example

Specification by Example (SBE) is a simple idea with a sharp edge: do not specify behaviour with abstract rules, specify it with concrete examples. People nod along to abstract rules while quietly picturing different things. They cannot do that with a concrete example — an example forces the disagreement into the open.

Take a rates-rebate eligibility rule at Auckland Council. The abstract version: “Low-income ratepayers may qualify for a rebate.” Everyone agrees. No one has agreed on anything. Now make it concrete:

Example A: A ratepayer with an income of $32,000 and no dependants → rebate of $290.
Example B: A ratepayer with an income of $32,000 and two dependants → rebate of $410 (dependants raise the threshold).
Example C: A ratepayer whose income is exactly on the threshold → ? — nobody knows; the rule never said whether the threshold is inclusive.

Example C is the whole point. The abstract rule hid an undecided boundary. The moment you write concrete numbers, the gap is visible and the team must decide. SBE is a gap-finding machine: real values, real expected outcomes, one row per case — including the edges, the negatives, and the “what if both are true” cases. Those agreed examples become the specification, and later they become the Gherkin scenarios in Lesson 2.
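
One way to keep such examples honest is to hold them as plain data, one row per case, and let an undecided outcome stay visibly undecided. The sketch below is a minimal illustration in Python, not a real Council system: the `Example` type, the `undecided` helper, and the field names are hypothetical. The dollar figures are the ones from Examples A and B, and Example C's income is left unset because the threshold itself was never stated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Example:
    """One concrete example: real inputs plus the agreed outcome (hypothetical type)."""
    name: str
    income: Optional[int]           # None = depends on a number nobody has stated yet
    dependants: int
    expected_rebate: Optional[int]  # None = the team has not decided the outcome
    note: str = ""


EXAMPLES = [
    Example("A", 32_000, 0, 290),
    Example("B", 32_000, 2, 410, "dependants raise the threshold"),
    Example("C", None, 0, None, "income exactly on the threshold: inclusive or not?"),
]


def undecided(examples):
    """Return the examples that still need a decision before the story is built."""
    return [e for e in examples if e.expected_rebate is None]


for e in undecided(EXAMPLES):
    print(f"Example {e.name} needs a decision: {e.note}")
```

Keeping the undecided row in the table, rather than dropping it, is the design choice that matters: the gap stays in front of the team until the business voice fills it in.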

Pro tip: Reach for boundary and negative examples first, not the happy path. The happy path is the one everyone already agrees on. The disagreement — and the defect — lives at the threshold, the empty field, and the “both conditions true” case.

6 The Three Amigos Session

The Three Amigos is a short, focused session held before a story is built. It brings together three perspectives — not three job titles, three viewpoints — so that the gaps each one would otherwise miss get caught while they are cheap.

The business voice

Asks: is it valuable? Is this what we actually need? Usually a business analyst or product owner. They hold the intent — why the story exists and what success looks like for the customer. They decide the answers when a question of business rules comes up: is the threshold inclusive, does the member’s choice win, what is the rebate at $32,000.

The development voice

Asks: is it feasible? What does it take to build, and what does the system already do? Usually a developer. They surface technical constraints and existing behaviour the business may not know about — “the payroll file already overwrites the rate every fortnight, so we have to decide which one wins.” They keep the examples buildable.

The testing voice

Asks: what could go wrong? What about the edges, the negatives, the conflicts? Usually a tester. They are the professional sceptic — the one who asks about the threshold, the empty field, the contractor with no fixed income, the two rules that collide. In the Three Amigos, the tester’s scepticism prevents defects instead of finding them.

The session is short — often 15 to 30 minutes per story — and it does not need the whole team. Three people, one story, concrete examples written down as you go. The output is not a long document. It is a shared understanding, captured as a handful of agreed examples, and a list of any questions that need an answer before the work can start.

Pro tip: Run it just before the work, not weeks ahead. Examples agreed too early go stale; agreed too late, the developer has already guessed and built. The sweet spot is during refinement or at the start of the sprint, on stories about to be picked up.

7 A Story Becomes Examples

Here is the Tahi Wealth story from the Hook, run through a Three Amigos session. Watch the vague story turn into agreed examples as each amigo asks their question.

The story: “As a member, I want to change my contribution rate so that I can save more.”

Testing voice: What rates are allowed? → Business: 3%, 4%, 6%, 8%, 10% — the legal options.
Testing voice: What if they enter 5%? → Business: rejected, with a message listing the valid rates.
Development voice: The payroll file sets a rate too, every fortnight. Which wins? → Business: the member’s chosen rate wins until they change it again. (This is the defect from the Hook, caught here for free.)
Testing voice: What if the change is made mid-pay-cycle? → Business: it applies from the next pay run, not the current one.
Business voice: And they must be sent confirmation of the new rate.

That session produces four concrete examples (the valid rates, the rejected 5%, the payroll conflict, and the mid-cycle timing) plus one confirmation rule. None of them existed in the original story. The team now agrees on the behaviour, and the developer builds it once, correctly. In Lesson 2, each of these examples becomes a Gherkin scenario: a precise, automatable, readable record of what was agreed here.
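
The agreed examples are concrete enough to sketch as executable checks. What follows is a minimal Python illustration under stated assumptions, not Tahi Wealth's system: `Member`, `change_rate`, `start_pay_run`, `apply_payroll_file`, and the message wording are hypothetical names invented for the sketch. Only the rules come from the session above. The confirmation rule is left out to keep it short, and the behaviour when the member has never chosen a rate is an assumption the session did not cover.

```python
from dataclasses import dataclass
from typing import Optional

VALID_RATES = (3, 4, 6, 8, 10)  # the rates the business voice listed


class InvalidRate(ValueError):
    """Raised when a member asks for a rate outside VALID_RATES."""


@dataclass
class Member:
    rate: int                           # rate used for the current pay run
    pending_rate: Optional[int] = None  # member's change, waiting for the next pay run
    member_chose: bool = False          # True once the member has set a rate themselves


def change_rate(member: Member, new_rate: int) -> None:
    """Member asks for a new rate; it takes effect from the next pay run."""
    if new_rate not in VALID_RATES:
        valid = ", ".join(f"{r}%" for r in VALID_RATES)
        raise InvalidRate(f"{new_rate}% is not a valid rate. Choose one of: {valid}")
    member.pending_rate = new_rate
    member.member_chose = True


def start_pay_run(member: Member) -> None:
    """A new pay run begins: any pending member change now applies."""
    if member.pending_rate is not None:
        member.rate = member.pending_rate
        member.pending_rate = None


def apply_payroll_file(member: Member, payroll_rate: int) -> None:
    """Payroll file arrives: the member's own choice wins if they have made one.

    Assumption: if the member has never chosen a rate, the payroll rate applies.
    """
    if not member.member_chose:
        member.rate = payroll_rate


# Example: a mid-cycle change applies from the next pay run, not the current one.
m = Member(rate=10)
change_rate(m, 3)
assert m.rate == 10
start_pay_run(m)
assert m.rate == 3

# Example: the payroll file disagrees, and the member's chosen rate wins.
apply_payroll_file(m, 10)
assert m.rate == 3

# Example: 5% is rejected with a message listing the valid rates.
try:
    change_rate(m, 5)
except InvalidRate as err:
    assert "3%, 4%, 6%, 8%, 10%" in str(err)
else:
    raise AssertionError("5% should have been rejected")
```

Each assertion maps to one line of the conversation, so a failing check points straight back to the decision it records.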

8 The Tester’s Job in the Room

A tester walking into a Three Amigos session is not there to write tests. They are there to find the questions no one else thought to ask. The whole value a tester brings is structured scepticism — a habit of probing the cases that vague stories quietly skip over. A few prompts to keep in your pocket:

  • The boundary: “What happens at exactly the threshold? Is it inclusive?” — the rates-rebate gap, the contribution minimum.
  • The negative: “What happens when they enter something invalid, or leave it blank?” — the 5% rate, the empty IRD number field.
  • The conflict: “What if two rules both apply, or two sources disagree?” — the member’s choice versus the payroll file.
  • The unhappy path: “What if the downstream service is down, or RealMe verification fails?” — the case the happy story never mentions.
  • The actor: “Does this work the same for a contractor with no fixed income, or someone with two employers?” — the under-represented user.

You do not need every answer in the room. A question you cannot resolve becomes a recorded action — “confirm with payroll team whether mid-cycle changes are allowed” — and the story is not built until it is answered. An open question parked in writing is worth far more than a silent assumption built into code.

Pro tip: If a Three Amigos session produces no new questions and no new examples, it was not a real session — it was a status update. A healthy session always surfaces at least one thing the story did not say.

9 Common Mistakes

🚫 Treating the user story as the specification

Why it happens: The story is written down and looks complete, so the team builds straight from it.
The fix: A story is a placeholder for a conversation, not a spec. “Change my contribution rate” says nothing about valid rates, conflicts, or timing. The specification is the agreed examples that come out of the conversation — build from those, not from the one-line story.

🚫 Running the session with only the happy path

Why it happens: The happy path is the easy case everyone already pictures, so the session feels done once it is covered.
The fix: The happy path almost never hides the defect. Spend the session on boundaries, negatives, and conflicts — the threshold, the invalid input, the two rules that collide. That is where the disagreements and the defects live.

🚫 Holding the session, then ignoring what came out of it

Why it happens: The meeting becomes a ritual; the examples are written somewhere and never looked at again.
The fix: The agreed examples are the deliverable. They drive the build and become the scenarios that get automated. If the examples do not shape the code and the tests, the session was theatre.

🚫 Thinking BDD is a tool you install

Why it happens: Teams equate BDD with a Gherkin runner and assume adopting the tool means adopting the practice.
The fix: BDD is the conversation that produces shared understanding. The tooling only records what was agreed. A team can do real BDD with a whiteboard and no automation, and a team can run a Gherkin tool while doing no BDD at all.

10 Now You Try

Three graded exercises that take vague NZ stories from spotting the gaps, to agreed acceptance criteria, to a set of examples. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot the Gaps

Read the ambiguous user story below for a fictional IRD myIR password-reset feature. As the testing voice in a Three Amigos session, list at least 4 questions you would ask to surface missing behaviour, and for each say which kind of gap it probes (boundary, negative, conflict, unhappy path, or actor).

Story: “As a myIR user, I want to reset my password so that I can get back into my account.”
Acceptance criteria: “User enters their email, gets a reset link, sets a new password, and can log in.”

List your questions and the gap each one probes:

Model answer
There are many good questions; any four that are genuine gaps, correctly categorised, earn full marks. Strong examples:

Negative — What if the email entered is not registered to any myIR account? Do we reveal that, or show the same message either way (an account-enumeration concern)?

Unhappy path — What if the reset link is clicked after it has expired, or used a second time? What is the expiry window?

Boundary — What are the new-password rules (minimum length, complexity), and what happens at exactly the minimum?

Conflict — What if a reset is already in progress and the user requests another link? Does the new link invalidate the old one?

Actor — Does this work for a user whose account is locked, or who only ever logs in via RealMe rather than a password?

Negative/security — How many reset attempts before the account or IP is rate-limited?

The trap to avoid: questions that only restate the happy path ("does the user get an email?"). The value is in the boundary, negative, conflict, and actor cases the story is silent on. Note that for an IRD system, the enumeration and rate-limiting gaps are the highest-value ones to surface.

🔧 Exercise 2 of 3 — Rewrite with Acceptance Criteria

The story below for a fictional Waka Kotahi online vehicle-licence (rego) renewal is too vague to build. Rewrite it with clear, testable acceptance criteria that cover the happy path plus at least three edge or negative cases. Make every criterion concrete enough that a tester could pass or fail it.

Original (too vague):
“As a vehicle owner, I want to renew my rego online so that I do not have to go to an agent. It should be quick and easy and let me pay.”

Rewrite the story and list testable acceptance criteria:

Model answer
Story (rewritten): As a vehicle owner with a registered vehicle, I want to renew my licence online and pay the fee, so that my vehicle stays road-legal without visiting an agent.

Acceptance criteria:
1. (Happy path) Given a vehicle with a current or recently expired licence and a passed WoF, when the owner enters the plate number and pays the correct fee by card, then the licence is renewed for the chosen term (3, 6, or 12 months) and a confirmation is shown and emailed.

2. (Negative — no current WoF) Given a vehicle without a current Warrant of Fitness, when the owner tries to renew, then renewal is blocked with a message explaining a current WoF is required first.

3. (Edge — expired more than the allowed window) Given a licence that lapsed beyond the period online renewal allows, when the owner tries to renew, then the system blocks online renewal and directs them to an agent.

4. (Negative — payment fails) Given a valid renewal, when the card payment is declined, then no licence is issued, the owner is told payment failed, and they can retry without re-entering plate details.

5. (Boundary — term selection) Given a renewal, the owner can choose only 3, 6, or 12 month terms; any other value is rejected.

What makes this strong: each criterion has a concrete trigger and a checkable outcome (a tester can make it pass or fail), the edges (no WoF, lapsed too long, declined payment) are covered, and the terms are named rather than left as "choose a term". A weak answer restates "it should be quick and easy", which is not testable.

🏗️ Exercise 3 of 3 — Build the Examples Table

Using Specification by Example, build a table of at least 5 concrete examples for a fictional MSD Winter Energy Payment eligibility rule. The rule: a person qualifies if they receive a main benefit OR NZ Super. Include real values and an expected outcome per row, and make sure at least two rows are boundary or negative cases that would force the Three Amigos to decide something the rule did not say.

Model answer
| Example | Inputs | Expected outcome | Why this row matters |
|---|---|---|---|
| 1 | Receives Jobseeker Support (a main benefit), age 40 | Eligible | Happy path — main benefit qualifies |
| 2 | Receives NZ Super, age 67 | Eligible | Happy path — the OR branch (NZ Super) qualifies |
| 3 | Receives no benefit and not on NZ Super, age 50, low income | Not eligible | Negative — low income alone does not qualify; forces the team to confirm income is NOT a separate path |
| 4 | Receives BOTH a main benefit AND NZ Super (rare overlap) | Eligible, paid once only | Conflict — forces a decision on double payment the rule never addressed |
| 5 | Benefit was granted partway through winter | Eligible from the date the benefit started? Or the whole season? | Boundary — the rule is silent on partial-period eligibility; the Three Amigos must decide |
| 6 (bonus) | Benefit cancelled mid-winter | Payment stops from when? | Boundary/negative — another timing gap the rule did not state |

What makes this strong: rows 1 and 2 cover both sides of the OR; rows 3–6 are exactly the kind of negative, conflict, and boundary cases that expose what the one-line rule never decided (income as a separate path, double payment, partial periods, mid-season changes). That is the purpose of SBE: concrete rows make the team decide the things the abstract rule left undecided. A weak answer gives five happy-path rows that all just confirm "benefit = eligible" and surface no decision.
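
Rows with a decided outcome can be checked mechanically against the rule, while the undecided rows stay flagged for the Three Amigos. Here is a minimal Python sketch, assuming a hypothetical `qualifies` function that encodes only the one-line rule as stated (main benefit OR NZ Super). The row layout and names are illustrative, and rows 5 and 6 carry no expected value because the rule never decided them.

```python
def qualifies(main_benefit: bool, nz_super: bool) -> bool:
    """The one-line rule exactly as stated: main benefit OR NZ Super."""
    return main_benefit or nz_super


# (row, main_benefit, nz_super, expected); expected=None means "not yet decided"
ROWS = [
    (1, True,  False, True),   # Jobseeker Support, age 40
    (2, False, True,  True),   # NZ Super, age 67
    (3, False, False, False),  # low income only: not a separate path
    (4, True,  True,  True),   # both apply: eligible ("paid once" is a separate decision)
    (5, True,  False, None),   # benefit granted partway through winter
    (6, False, False, None),   # benefit cancelled mid-winter
]

decided = [r for r in ROWS if r[3] is not None]
open_rows = [r[0] for r in ROWS if r[3] is None]

# Every decided row must agree with the rule as written.
for row, benefit, super_, expected in decided:
    assert qualifies(benefit, super_) == expected, f"row {row} disagrees with the rule"

print("Rows still needing a decision:", open_rows)
```

The timing rows cannot be expressed with two booleans at all, which is itself a finding: the rule as stated has no notion of dates, so the team has to add one before rows 5 and 6 can be answered.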

11 Self-Check

Q1: What is the real output of a Three Amigos session — the document, or something else?

Shared understanding, captured as a handful of agreed concrete examples (plus any open questions). The documents and the automation are only a record of that understanding. A team can do real BDD on a whiteboard with no tooling, and run a Gherkin tool while doing no BDD at all.

Q2: Name the three voices in a Three Amigos session and the question each one asks.

Business: is it valuable, is this what we need? Development: is it feasible, what does the system already do? Testing: what could go wrong, what about the edges and conflicts? They are viewpoints, not job titles.

Q3: Why does Specification by Example use concrete examples instead of abstract rules?

Because people nod along to abstract rules while privately picturing different things. A concrete example with real values and an expected outcome forces any disagreement into the open — the threshold case, the conflict, the boundary the abstract rule never decided. Examples are a gap-finding machine.

Q4: Why is shifting testing left cheaper, and what does the tester become?

A requirement gap gets steeply more expensive the later it is found — minutes in the conversation, a rebuild in test, a customer in production. Shifting left moves the “what should this do?” thinking to the start, so the tester becomes the first line of defence (preventing defects) rather than the last (catching them).

Q5: A Three Amigos session ends with no new questions or examples. What does that tell you?

It was probably a status update, not a real session. A healthy session always surfaces at least one thing the story did not say — a boundary, a negative, or a conflict. If nothing new came out, the scepticism was missing, usually because only the happy path was discussed.

12 Interview Prep

Real questions asked in NZ QA interviews for agile and BDD roles. Read the model answers, then practise your own version.

“What actually is BDD — is it just writing tests in Given/When/Then?”

No — that is the tooling, not the practice. BDD is building software around behaviour the team has agreed on first. The core of it is a conversation, usually a Three Amigos session, where a business voice, a development voice, and a testing voice turn a vague story into concrete examples before any code is written. Given/When/Then is just how we later record those agreed examples so they are precise and automatable. I have seen teams write perfect Gherkin while doing no real BDD, because they skipped the conversation. The value is the shared understanding; the syntax is the receipt.

“You are the tester in a Three Amigos session. What do you bring that the BA and developer do not?”

Structured scepticism. The BA holds the intent and the developer holds the feasibility; my job is to probe the cases the story is silent on — the boundary (“what happens at exactly the threshold, is it inclusive?”), the negative (“what if the field is blank or invalid?”), the conflict (“what if two rules both apply?”), and the unusual actor (“does this work for a contractor with no fixed income?”). Each question either gets decided on the spot and becomes an example, or becomes a recorded action before the story is built. That is testing preventing a defect rather than catching one.

“Give me an example where a Three Amigos session would have stopped a real defect.”

A KiwiSaver contribution-rate change. The story was “let a member change their rate”, and it shipped working — but no one had decided what happens when the member’s chosen rate and the fortnightly payroll file disagree. The BA assumed the member’s choice won; the developer assumed the latest file won; the tester only tested the screen. A member’s rate silently reset. One question in a Three Amigos session — “which source wins when they conflict?” — would have surfaced it in minutes and turned it into an agreed example, instead of a production complaint. The defect was a gap in shared understanding, and that is exactly the gap a Three Amigos session is designed to close.