Usability Testing
A system that passes every functional test can still be so confusing that users give up. Usability testing is how you catch the gap between "it works" and "people can use it".
1 The Hook
The NZ Ministry of Social Development launched a redesigned online application form for a benefit. Functional testing passed completely — every field saved correctly, every validation fired, every submission processed and reached the back-end system. The sign-off report had no open defects.
During the first week of live operation, completion rates were 31% lower than the previous form. Exit surveys told a consistent story: users were confused by the document upload step. The new design required selecting a document category before the upload button appeared. The button was hidden behind a dropdown that first-time users didn't recognise as a prerequisite step. UX had tested the design internally with team members who already knew how the form worked.
No usability test was run with actual benefit recipients before launch. MSD had to add supporting instructional text, reorganise the page layout, and ship a patch three weeks after go-live. The rework cost significantly more than a half-day think-aloud session with five users would have.
2 The Rule
Functional correctness and usability are separate properties. Test them separately. A page can be pixel-perfect and completely unusable.
3 The Analogy
Nielsen's 10 heuristics are a building code for software.
A building can pass a structural inspection (the foundation is sound, the roof won't collapse) and still fail a usability inspection. The staircase is in the wrong place. The exits aren't signposted. The bathroom door opens inward in a tiny cubicle. The building works structurally but people struggle to use it daily. Functional testing is the structural inspection. Heuristic evaluation is the usability inspection. You need both.
4 Watch Me Do It
Heuristic evaluation of a fictitious NZ government portal — MyService NZ. Walk through each of Nielsen's 10 heuristics and document findings.
A heuristic evaluation like this takes 2–3 hours and produces a structured defect list before a single user session is run. Raise each finding as a defect with severity (cosmetic / minor / major / critical) and the specific heuristic violated.
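One way to keep findings consistent is to record each one with its heuristic and severity in a small structure. The sketch below is only an illustration: the `Finding` class, the `by_severity` helper, and the two sample findings are hypothetical, not taken from any real MyService NZ evaluation. The heuristic names are Nielsen's published ten, and the severity levels are the four listed above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """The four severity levels used in this lesson, lowest to highest."""
    COSMETIC = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4


# Nielsen's 10 usability heuristics, keyed by their usual number.
NIELSEN_HEURISTICS = {
    1: "Visibility of system status",
    2: "Match between system and the real world",
    3: "User control and freedom",
    4: "Consistency and standards",
    5: "Error prevention",
    6: "Recognition rather than recall",
    7: "Flexibility and efficiency of use",
    8: "Aesthetic and minimalist design",
    9: "Help users recognize, diagnose, and recover from errors",
    10: "Help and documentation",
}


@dataclass(frozen=True)
class Finding:
    """One usability defect (hypothetical record shape)."""
    heuristic: int
    severity: Severity
    description: str

    def __post_init__(self):
        if self.heuristic not in NIELSEN_HEURISTICS:
            raise ValueError(f"Unknown heuristic number: {self.heuristic}")

    def summary(self) -> str:
        name = NIELSEN_HEURISTICS[self.heuristic]
        return f"[{self.severity.name}] H{self.heuristic} {name}: {self.description}"


def by_severity(findings):
    """Return findings ordered from most to least severe."""
    return sorted(findings, key=lambda f: f.severity.value, reverse=True)


# Hypothetical sample findings, for illustration only.
findings = [
    Finding(8, Severity.COSMETIC, "Footer repeats the same three links twice."),
    Finding(6, Severity.MAJOR, "Upload button stays hidden until a category is chosen."),
]

for finding in by_severity(findings):
    print(finding.summary())
```

Sorting by severity puts the findings that block users at the top of the defect list, which is usually the order stakeholders want to read them in.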
SUS (System Usability Scale). Run the 10-question Likert survey with users after any usability session. Each question is answered on a 1–5 scale. For odd-numbered questions, subtract 1 from the response. For even-numbered questions, subtract the response from 5. Sum the ten adjusted values and multiply by 2.5 to get a score from 0 to 100. The score is not a percentage. The commonly cited average across published SUS studies is 68, so a score above 68 is above average and a score below it is below average. Treat a below-average score as a warning sign for adoption and support load, not a guarantee. Present the raw score and where it sits against the benchmark to stakeholders, not just "users found it confusing".
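The scoring arithmetic is easy to get wrong by hand, so here is a minimal sketch of it in Python. The function name `sus_score` and the sample responses are made up for illustration. It assumes one participant's answers arrive as a list of ten integers in question order.

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten responses, each 1-5, in question order."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("Each response must be an integer from 1 to 5")

    total = 0
    for question, response in enumerate(responses, start=1):
        if question % 2 == 1:
            total += response - 1   # odd-numbered question: response minus 1
        else:
            total += 5 - response   # even-numbered question: 5 minus response
    return total * 2.5


# Hypothetical answers from one participant.
participant = [4, 2, 5, 1, 4, 2, 5, 2, 4, 1]
print(sus_score(participant))  # 85.0
```

To report a session, score each participant separately and then average the scores. Averaging the raw answers first hides the odd/even adjustment and makes mistakes harder to spot.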
5 When to Use It
Run usability testing when: a new feature has complex multi-step interaction flows; a public-facing government service is about to launch; drop-off or abandonment rates are unexpectedly high; accessibility testing has passed but users with disabilities or low digital literacy still struggle; a redesign has replaced a familiar interface.
Heuristic evaluation (no users needed) is the right choice when: you need rapid feedback with no budget for sessions; you want to front-load defect discovery before user research; the product isn't stable enough to put in front of real users yet.
When you can skip it: internal admin tools used exclusively by trained power users who already know the domain. Even then, a quick heuristic pass is worth an hour of anyone's time before a major release.
6 Common Mistakes
🚫 Treating usability as UX's problem, not QA's
I used to think: if UX designed it, usability is their responsibility, not mine.
Actually: testers are uniquely positioned to run heuristic evaluations because we read interfaces critically and we know the business rules. UX owns the design. QA validates that the design actually works for users. These are different activities and both need to happen.
🚫 Thinking usability testing needs an expensive lab setup
I used to think: usability testing requires specialist UX researchers and dedicated lab facilities.
Actually: five users in a think-aloud session over Zoom will typically surface most of the usability problems. Nielsen's widely quoted estimate is about 85%. You can run one in an afternoon. You need: a task list, a Zoom call with recording permission, and enough discipline not to help users when they get stuck.
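The figure of about 85% comes from a simple model by Nielsen and Landauer: if each user independently uncovers a fixed share of the problems, the share found by n users is 1 - (1 - L)^n. The sketch below assumes their reported average of L = 0.31. Your product's rate may well differ, so read the output as a rule of thumb, not a promise. The function name `proportion_found` is made up for illustration.

```python
def proportion_found(users, per_user_rate=0.31):
    """Estimated share of usability problems found by `users` test participants.

    Uses the Nielsen-Landauer model 1 - (1 - L)^n. The default L = 0.31 is
    their reported average across projects and is an assumption here.
    """
    if users < 0:
        raise ValueError("users must be zero or more")
    if not 0 <= per_user_rate <= 1:
        raise ValueError("per_user_rate must be between 0 and 1")
    return 1 - (1 - per_user_rate) ** users


for n in (1, 3, 5, 10):
    print(f"{n:>2} users: {proportion_found(n):.0%}")
```

The curve flattens quickly, which is the argument for running several small rounds of five users across design iterations and not one large study.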
🚫 Dismissing SUS as a satisfaction survey
I used to think: SUS is just a quick satisfaction check with no real analytical value.
Actually: SUS is a validated questionnaire created by John Brooke at Digital Equipment Corporation (DEC) in 1986 and used in thousands of studies since. The commonly cited average score is 68, so a result below that is below average and worth treating as a warning sign for adoption and support overhead. Cite the benchmark when you report the score, not just the number.
7 Now You Try
Evaluate this user story against Nielsen's 10 heuristics: "As a first-time user of the RealMe login page, I need to create an account." List 3 potential usability issues and the heuristic each violates. For each issue, suggest a specific fix. Then ask the AI to identify any additional issues you may have missed.