Mutation Testing
Mutation testing systematically breaks your code and checks whether your tests catch the breaks. A high mutation score means your tests verify the logic rather than merely executing it. Code coverage alone can flatter a weak suite; mutation score is much harder to game.
What it is
Mutation testing is a technique for measuring test quality. A mutation tool (like PIT or Stryker) automatically modifies your code in small ways — changing operators, boundary conditions, return values — and re-runs your test suite. If a test fails after the mutation, the test “kills” the mutant and is doing its job. If no test fails, the mutant “survives,” meaning your tests didn’t catch the change.
The key insight: 100% code coverage is meaningless if your tests don’t actually assert anything. A test that just calls a function without checking the result “covers” that function but doesn’t test it. Mutation testing forces you to write tests that have teeth.
Coverage is not quality. A test suite with 100% code coverage but a 40% mutation score lets 60% of the injected faults through undetected. Coverage measures what was executed; mutation score measures what was verified.
Code coverage vs. mutation score
| Coverage claim | Example | Mutation truth |
|---|---|---|
| 100% line coverage | int x = calculatePrice(item); (the test runs this line but never asserts the value of x) | The test doesn’t actually verify calculatePrice works. Mutation score: near 0%. |
| 100% branch coverage | if (age >= 18) { ... } else { ... } (the test hits both branches but never asserts correctness in each) | Changing >= to > is not caught. Mutation score: weak. |
| All methods called | Test calls calculateTotal() but doesn’t assert the return value is correct | Changing the calculation (+ to -, * to /) is not caught. Mutation score: low. |
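The branch-coverage row can be made concrete with a small sketch. It is an illustration only: the class name, the hand-written mutant, and the sample ages (30, 5, 18) are assumptions, and the first check does assert, but only on values far from the boundary, which leaves the same gap.

```java
import java.util.function.IntPredicate;

final class BoundaryMutantDemo {
    // Original condition from the table row: age >= 18.
    static final IntPredicate ORIGINAL = age -> age >= 18;

    // Hand-written mutant: >= replaced by >.
    static final IntPredicate MUTANT = age -> age > 18;

    // Hits both branches (30 -> true, 5 -> false) but never probes age 18.
    static boolean branchCoverageSuitePasses(IntPredicate isAdult) {
        return isAdult.test(30) && !isAdult.test(5);
    }

    // Adds the boundary value, the only input where the two versions differ.
    static boolean boundarySuitePasses(IntPredicate isAdult) {
        return branchCoverageSuitePasses(isAdult) && isAdult.test(18);
    }

    public static void main(String[] args) {
        // A suite "kills" the mutant when it fails against the mutated code.
        System.out.println("branch suite kills mutant:   " + !branchCoverageSuitePasses(MUTANT));
        System.out.println("boundary suite kills mutant: " + !boundarySuitePasses(MUTANT));
    }
}
```

Both suites reach 100% branch coverage of the condition; only the one that includes age 18 can distinguish the original from the mutant.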
How mutation testing works
- Mutation engine scans your code and identifies places to mutate: operators, conditionals, return values, etc.
- For each mutation location, the tool makes a single small change (e.g., change >= to >).
- The tool runs your entire test suite against the mutated code.
- If a test fails, the mutant is killed (your test caught the change — good).
- If all tests pass, the mutant survives (your test suite did not catch the change — gap in your testing).
- The tool reports a mutation score: (killed mutants / total mutants) * 100%.
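The loop described above can be sketched in a few lines. This is not how PIT or Stryker are implemented (real tools generate mutants automatically, typically from bytecode or syntax trees); here the mutants are written by hand, and the class, the multiplication example, and the three sample suites are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntBinaryOperator;
import java.util.function.Predicate;

final class MiniMutationRun {
    // Code under test: total price for a quantity of items.
    static final IntBinaryOperator ORIGINAL = (unitPrice, qty) -> unitPrice * qty;

    // A "suite" is modeled as a predicate: true means every test passed.
    static final Predicate<IntBinaryOperator> NO_ASSERTIONS = f -> { f.applyAsInt(2, 2); return true; };
    static final Predicate<IntBinaryOperator> WEAK_INPUT = f -> f.applyAsInt(2, 2) == 4;
    static final Predicate<IntBinaryOperator> BETTER_INPUT = f -> f.applyAsInt(3, 4) == 12;

    // Hand-written mutants, each differing from ORIGINAL by one small change.
    static Map<String, IntBinaryOperator> mutants() {
        Map<String, IntBinaryOperator> m = new LinkedHashMap<>();
        m.put("* -> +", (unitPrice, qty) -> unitPrice + qty);
        m.put("* -> /", (unitPrice, qty) -> unitPrice / qty);
        m.put("return 0", (unitPrice, qty) -> 0);
        return m;
    }

    // Runs the suite against every mutant and returns killed / total * 100.
    static double mutationScore(Predicate<IntBinaryOperator> suitePasses) {
        if (!suitePasses.test(ORIGINAL)) {
            throw new IllegalStateException("suite must pass on unmutated code first");
        }
        int killed = 0;
        for (IntBinaryOperator mutant : mutants().values()) {
            if (!suitePasses.test(mutant)) {
                killed++; // at least one test failed: mutant killed
            }
        }
        return 100.0 * killed / mutants().size();
    }

    public static void main(String[] args) {
        System.out.println("no assertions: " + mutationScore(NO_ASSERTIONS));
        System.out.println("weak input:    " + mutationScore(WEAK_INPUT));
        System.out.println("better input:  " + mutationScore(BETTER_INPUT));
    }
}
```

The suite without assertions kills nothing. The suite that asserts on (2, 2) kills two of the three mutants, because 2 + 2 happens to equal 2 * 2. The suite that asserts on (3, 4) kills all three.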
Common mutations
| Mutation | Example | Catches |
|---|---|---|
| Conditional boundary | >= becomes > | Off-by-one errors; tests that don’t check boundary values |
| Arithmetic operator | + becomes - | Tests that don’t verify the actual calculation |
| Logical operator | && becomes \|\| | Tests that don’t cover both branches of complex conditions |
| Return value | return true; becomes return false; | Tests that don’t assert the return value |
| Constant | return 10; becomes return 0; | Tests that don’t verify specific values |
| Statement deletion | Remove an entire line of code | Tests that don’t verify all side effects |
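The statement-deletion row is the least obvious one, so here is a minimal sketch of it. The Account class, its audit log, and the hand-written mutant are hypothetical; a real tool would produce the deletion automatically.

```java
import java.util.ArrayList;
import java.util.List;

final class StatementDeletionDemo {
    static class Account {
        int balance;
        final List<String> auditLog = new ArrayList<>();

        void deposit(int amount) {
            balance += amount;
            auditLog.add("deposit " + amount); // side effect a mutant may delete
        }
    }

    // Hand-written mutant: the audit-log statement has been removed.
    static class MutantAccount extends Account {
        @Override
        void deposit(int amount) {
            balance += amount;
        }
    }

    // Checks only the return-like state; blind to the missing log entry.
    static boolean balanceOnlyTestPasses(Account account) {
        account.deposit(50);
        return account.balance == 50;
    }

    // Also checks the side effect, so the deleted statement is noticed.
    static boolean sideEffectTestPasses(Account account) {
        account.deposit(50);
        return account.balance == 50 && account.auditLog.equals(List.of("deposit 50"));
    }

    public static void main(String[] args) {
        System.out.println("balance-only test kills mutant: " + !balanceOnlyTestPasses(new MutantAccount()));
        System.out.println("side-effect test kills mutant:  " + !sideEffectTestPasses(new MutantAccount()));
    }
}
```

The mutant survives the balance-only check and is killed as soon as the test also inspects the audit log.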
Interpreting mutation results
Mutation score
The mutation score is a percentage: (killed mutants / total mutants) * 100%. As a rough guide:
- 0–30%: Critical gaps. Your tests are barely testing anything. You need more assertions.
- 30–60%: Weak tests. You cover happy paths but miss edge cases, boundary conditions, error handling.
- 60–80%: Decent. Most logic is tested, but there are still gaps (usually error paths, rare branches).
- 80–90%: Strong. Your tests have teeth. Nearly all mutations are caught.
- 90%+: Excellent. Your tests are thorough and rigorous.
Survivors (surviving mutants)
The mutation tool reports which mutants survived (which mutations your tests did not catch). For each survivor, ask: is it legitimate, or do I need to write a better test?
- Legitimate survivor: The mutation produces code that is functionally equivalent (an equivalent mutant). E.g., changing if (x) { y(); return 1; } else { return 1; } to if (x) { y(); } return 1; leaves the behavior unchanged: y() still runs only when x is true and the result is always 1, so no test can tell the two versions apart. Ignore it or configure the tool to skip it.
- Illegitimate survivor: The mutation breaks logic, but your test didn’t catch it. You need a better test or an additional test case.
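One way to convince yourself that a survivor is equivalent is to compare both versions over the whole input space, when that space is small enough. The sketch below does this for the two code shapes from the example above; modeling y() as a call counter is an assumption made purely for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class EquivalentMutantDemo {
    // First shape from the text; y() is modeled as incrementing a counter.
    static int original(boolean x, AtomicInteger yCalls) {
        if (x) {
            yCalls.incrementAndGet();
            return 1;
        } else {
            return 1;
        }
    }

    // Second shape from the text.
    static int transformed(boolean x, AtomicInteger yCalls) {
        if (x) {
            yCalls.incrementAndGet();
        }
        return 1;
    }

    // True when return value and side effect agree for every possible input.
    static boolean behaveIdentically() {
        for (boolean x : new boolean[] {true, false}) {
            AtomicInteger originalCalls = new AtomicInteger();
            AtomicInteger transformedCalls = new AtomicInteger();
            if (original(x, originalCalls) != transformed(x, transformedCalls)
                    || originalCalls.get() != transformedCalls.get()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("equivalent: " + behaveIdentically());
    }
}
```

For code with a larger input space, exhaustive comparison is not possible and the judgment has to come from reading the code.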
When to use mutation testing
- Code quality audit: when you inherit a codebase or suspect test quality is weak, run mutation testing to get an honest assessment.
- Critical code: for business logic, payment calculations, and security code, run mutation testing to ensure tests are thorough.
- After refactoring: run mutation tests to confirm that the suite still detects faults in the restructured code. What matters is that the tests pin down behavior, not that they mirror the old code structure.
- Not on every CI run: mutation testing is slow (often 10–50x slower than running tests normally). Run it locally before committing, or as a nightly job.
Worked example: Testing a discount calculator
Consider a simple method:
    public BigDecimal applyDiscount(BigDecimal price, int discountPercent) {
        if (discountPercent > 100) {
            throw new IllegalArgumentException("Discount cannot exceed 100%");
        }
        if (discountPercent < 0) {
            throw new IllegalArgumentException("Discount cannot be negative");
        }
        BigDecimal discountAmount = price.multiply(
            BigDecimal.valueOf(discountPercent).divide(BigDecimal.valueOf(100))
        );
        return price.subtract(discountAmount);
    }
Weak test suite (high line coverage, but a low mutation score):

    @Test
    public void testDiscount() {
        BigDecimal result = applyDiscount(BigDecimal.valueOf(100), 10);
        // Oops: the test runs the code but never asserts the result.
        // If a mutant changed subtract(discountAmount) to add(discountAmount),
        // this test would still pass, because it asserts nothing.
    }

    @Test
    public void testDiscountOverHundred() {
        try {
            applyDiscount(BigDecimal.valueOf(100), 101);
        } catch (IllegalArgumentException e) {
            // Swallowed: the test passes whether or not the exception is thrown.
        }
    }
Strong test suite (full line coverage and a high mutation score):

    @Test
    public void testValidDiscount() {
        BigDecimal result = applyDiscount(BigDecimal.valueOf(100), 10);
        // compareTo ignores scale; equals would treat 90 and 90.0 as different.
        assertEquals(0, new BigDecimal("90").compareTo(result)); // Assert the calculation
    }

    @Test
    public void testZeroDiscount() {
        BigDecimal result = applyDiscount(BigDecimal.valueOf(100), 0);
        assertEquals(0, new BigDecimal("100").compareTo(result)); // Boundary: no discount
    }

    @Test
    public void testMaxDiscount() {
        BigDecimal result = applyDiscount(BigDecimal.valueOf(100), 100);
        assertEquals(0, BigDecimal.ZERO.compareTo(result)); // Boundary: full discount
    }

    @Test
    public void testDiscountOverHundred() {
        assertThrows(IllegalArgumentException.class,
            () -> applyDiscount(BigDecimal.valueOf(100), 101));
    }

    @Test
    public void testNegativeDiscount() {
        assertThrows(IllegalArgumentException.class,
            () -> applyDiscount(BigDecimal.valueOf(100), -1));
    }
Mutation results: The strong test suite kills mutations like:
- Changing > 100 to >= 100 (caught by testMaxDiscount).
- Changing < 0 to <= 0 (caught by testZeroDiscount).
- Changing subtract to add (caught by testValidDiscount and testMaxDiscount; testZeroDiscount cannot see it, because the discount amount is zero).
- Removing the exception throws (caught by testDiscountOverHundred and testNegativeDiscount).
Tools
- PIT (Pitest): Java; the most widely used mutation testing tool on the JVM. Integrates with JUnit, produces detailed HTML reports.
- Stryker: JavaScript/TypeScript; very good HTML reports and dashboard integration.
- mutmut: Python; simpler, good for getting started.
- cargo-mutants: Rust; a third-party Cargo subcommand (cargo mutants) rather than part of the standard toolchain.
- Stryker.NET: .NET; mutation testing for C#.
Common pitfalls
- Running mutation tests in CI for every commit: it’s too slow. Run locally or nightly.
- Dismissing survivors as equivalent without checking: some mutations do produce equivalent code, and the tool reports them as survivors. Verify they’re truly equivalent before ignoring them.
- Chasing 100% mutation score: some code (e.g., getters, logging) produces many mutants that are equivalent or not worth killing. Aim for 80%+ on critical code; 60%+ on non-critical.
- Mutating too broadly: mutation test only the code you want to improve. Mutating third-party libraries or generated code wastes time.
- Not fixing weak tests: once you see survivors, don’t ignore them. Write tests to kill them or document why they’re equivalent.
Best practice: Use mutation testing during development to drive better tests, not as a metric to hit. If your mutation score is 60%, the goal is not “get to 80%” — the goal is “write tests that actually verify the logic.” The score is a side effect of better testing.