Senior · AI Quality Assurance

AI Model Validation & ML Testing

Testing machine learning is fundamentally different from testing traditional software. Models are probabilistic, data-dependent, and can fail in ways that conventional testing never catches. This chapter teaches you how to test for fairness, drift, robustness, and adversarial edge cases.

Senior CT-GenAI Ch 4 — ML Quality Assurance ~18 min read + exercises

1 The Hook

An NZ insurance company deployed a machine learning model to score mortgage applications. In testing, the model's accuracy was 94%. It correctly approved good applicants and rejected risky ones. The team shipped it to production.

Six months later, external auditors noticed something: 89% of rejected applicants had Māori or Pacific surnames. The model was not explicitly programmed to discriminate — it learned the pattern from historical data. The historical data reflected decades of human bias in lending. The model amplified it.

The company had tested accuracy. They had not tested fairness. They had not checked whether the model's decisions differed systematically across demographic groups. And they had not built in monitoring for drift — the model's decisions had shifted over time as market conditions changed, but no one tracked it.

This is a real story (anonymised), and it is common. Model testing is not just about accuracy. It is about fairness, robustness, and understanding when and why the model might fail.

Senior engineer insight

The thing that changed how I think about ML testing: the model is never the problem — the framing is. Every catastrophic AI failure I have investigated started with a team asking "does this model perform well?" instead of "what does this model optimise for, and who does that harm?" Once you reframe testing as an adversarial audit of the objective function, you stop looking for bugs and start finding the design decisions that made those bugs inevitable.

Most common mistake: treating fairness testing as a compliance checkbox at the end of a sprint instead of a first-class requirement that shapes the test dataset from day one.

From the field

A central government agency in Wellington deployed a benefit eligibility model to help case workers prioritise reviews. The team ran full accuracy testing: 91% on the holdout set, well above the internal threshold. What they did not test was performance disaggregated by iwi affiliation or rurality, both of which were inferable from postcode. When an external audit against the NZ Algorithm Charter transparency requirements surfaced the outputs eighteen months later, it turned out the model was flagging rural Māori applicants for review at 2.3x the rate of urban Pākehā applicants with equivalent financial profiles. The model had learned a proxy from postcode. The fix required rebuilding the feature set, re-auditing twelve months of flagged cases, and a public disclosure. The lesson that generalises: any feature that correlates with ethnicity, gender, or disability becomes a bias vector, and in NZ that list is longer than most engineers assume because our demographic geography is distinctive.

2 The Rule

ML models are non-deterministic, data-dependent, and vulnerable to bias amplification. Testing a model requires validation of accuracy (does it work?), fairness (is it equitable?), robustness (does it handle edge cases?), and drift (does it degrade over time?). Testing only accuracy is testing blindly.

3 The Analogy

Analogy

Testing a model is like testing a doctor's diagnostic judgment, not a calculator.

A calculator has deterministic rules: 2 + 2 always equals 4. A doctor's diagnosis is probabilistic: they see symptoms, draw on experience, and reach a conclusion that can be correct or incorrect. You cannot test a doctor by running the same inputs twice and expecting identical outputs. Instead, you test them across diverse patient cases, you look for patterns in their mistakes, and you check whether they treat similar patients fairly. ML models work the same way.

4 Why ML Testing is Different

Non-determinism

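Running a model twice on the same input may produce slightly different outputs, for example when inference involves sampling or dropout, or when GPU kernels accumulate floating-point sums in a different order. Retraining with a different random seed can also produce a different model. Exact-match comparison of re-run outputs is therefore unreliable: either freeze the random seed so that runs are reproducible, or validate the distribution of outputs against a tolerance.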
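Both strategies can be sketched with the standard library alone. This is a minimal illustration, not a test harness: `noisy_score` is a hypothetical stand-in for a stochastic model, and the tolerances are made up for the example.

```python
import random
import statistics


def noisy_score(x: float, rng: random.Random) -> float:
    """Hypothetical stand-in for a stochastic model: signal plus Gaussian noise."""
    return 0.8 * x + rng.gauss(0, 0.05)


# Strategy 1: freeze the seed so two runs are exactly comparable.
run_a = noisy_score(1.0, random.Random(42))
run_b = noisy_score(1.0, random.Random(42))
assert run_a == run_b

# Strategy 2: leave the seed free and assert on the output distribution.
rng = random.Random()
samples = [noisy_score(1.0, rng) for _ in range(2000)]
mean_score = statistics.mean(samples)
spread = statistics.stdev(samples)
assert abs(mean_score - 0.8) < 0.02  # mean stays near the expected value
assert spread < 0.1                  # spread stays within tolerance
```

Strategy 1 suits regression tests in CI. Strategy 2 is closer to how the model behaves in production, where nobody controls the seed.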

Data dependency

A model's behaviour depends entirely on its training data. Good training data + good algorithm = good model. Bad training data + good algorithm = broken model. You must test the data before you test the model.

Bias amplification

If training data reflects historical bias (e.g., mortgage rejections biased against certain groups), the model learns and amplifies that bias. Testing for bias is not an afterthought — it is essential testing.

Drift

Models degrade over time when the real-world data distribution changes. A fraud detection model trained on 2023 transactions may fail on 2025 transactions with different patterns. Monitoring drift is as important as initial accuracy.

Edge case brittleness

Models often handle common cases well but fail on rare or adversarial inputs. An image classifier trained on dogs and cats might fail catastrophically on a dog wearing sunglasses. You must test edge cases deliberately.

5 Model Quality Metrics

Accuracy

Percentage of correct predictions (both true positives and true negatives). Formula: (TP + TN) / (TP + TN + FP + FN). Warning: Accuracy alone is misleading on imbalanced datasets. A model that always predicts "no fraud" is 99% accurate on a dataset with 1% fraud, but useless.

Precision

Of all positive predictions, how many were correct? Formula: TP / (TP + FP). High precision = few false alarms. Important for applications where false positives are costly (e.g., cancer screening — false positives cause unnecessary worry).

Recall (Sensitivity)

Of all actual positives, how many did the model catch? Formula: TP / (TP + FN). High recall = few false negatives. Important for applications where missing positives is costly (e.g., fraud detection — missing fraud costs money).

F1-Score

Harmonic mean of precision and recall. Formula: 2 * (Precision * Recall) / (Precision + Recall). Useful when you want to balance precision and recall, and the cost of false positives and false negatives is similar.
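The four formulas above can be checked with a few lines of Python. The sketch below uses made-up confusion-matrix counts for 1,000 transactions of which 10 are fraudulent; the `Confusion` class and both example models are hypothetical. Returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int
    tn: int
    fp: int
    fn: int

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn)

    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0  # nothing predicted positive

    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0  # no actual positives

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


# A model that always predicts "no fraud": high accuracy, zero recall.
always_negative = Confusion(tp=0, tn=990, fp=0, fn=10)

# A model that catches 8 of the 10 frauds at the cost of 20 false alarms.
detector = Confusion(tp=8, tn=970, fp=20, fn=2)

for name, cm in [("always_negative", always_negative), ("detector", detector)]:
    print(name, round(cm.accuracy(), 3), round(cm.precision(), 3),
          round(cm.recall(), 3), round(cm.f1(), 3))
```

The always-negative model wins on accuracy (0.99 vs 0.978) while catching no fraud at all, which is exactly why accuracy cannot be the only number in the report.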

Fairness Metrics

Demographic parity: Positive prediction rate should be the same across demographic groups. Equalized odds: True positive rate and false positive rate should be the same across groups. Calibration: If the model predicts 80% confidence, it should be right 80% of the time within each demographic group. Test explicitly for demographic disparities.
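Demographic parity and equalized odds reduce to comparing a handful of per-group rates. The following is a minimal sketch on made-up records for two hypothetical groups, "A" and "B"; the function names are illustrative, and in practice a fairness library would compute the same quantities.

```python
from collections import defaultdict


def group_rates(records):
    """records: iterable of (group, y_true, y_pred) with 0/1 values.
    Returns the positive prediction rate, TPR, and FPR for each group."""
    counts = defaultdict(lambda: {"n": 0, "pred_pos": 0, "pos": 0,
                                  "tp": 0, "neg": 0, "fp": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += y_pred
        if y_true == 1:
            c["pos"] += 1
            c["tp"] += y_pred
        else:
            c["neg"] += 1
            c["fp"] += y_pred
    return {
        g: {
            "positive_rate": c["pred_pos"] / c["n"],
            "tpr": c["tp"] / c["pos"] if c["pos"] else float("nan"),
            "fpr": c["fp"] / c["neg"] if c["neg"] else float("nan"),
        }
        for g, c in counts.items()
    }


def max_gap(rates, metric):
    """Largest difference in a metric between any two groups."""
    values = [r[metric] for r in rates.values()]
    return max(values) - min(values)


# Illustrative, made-up records: (group, actual label, predicted label)
records = (
    [("A", 1, 1)] * 40 + [("A", 1, 0)] * 10 + [("A", 0, 1)] * 5 + [("A", 0, 0)] * 45
    + [("B", 1, 1)] * 25 + [("B", 1, 0)] * 25 + [("B", 0, 1)] * 15 + [("B", 0, 0)] * 35
)
rates = group_rates(records)
parity_gap = max_gap(rates, "positive_rate")  # demographic parity
tpr_gap = max_gap(rates, "tpr")               # equalized odds, part 1
fpr_gap = max_gap(rates, "fpr")               # equalized odds, part 2
```

In this made-up data the positive prediction rates differ by only 5 percentage points, so demographic parity looks acceptable, yet the true positive rate differs by 30 points and the false positive rate by 20. A single fairness metric can hide a problem that another one exposes.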

AUC-ROC (Area Under Curve)

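Measures the model's ability to distinguish between classes across all thresholds. Values 0–1; higher is better. AUC of 0.5 = random guessing, AUC of 1.0 = perfect classification. More informative than accuracy on imbalanced datasets, although with severe imbalance a precision-recall curve often shows problems that AUC-ROC hides.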
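One way to build intuition for AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all positive/negative pairs. It is quadratic in the number of examples and intended only as an illustration; the labels and scores are made up.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)  # 3 of the 4 pairs are ranked correctly
```

Because the calculation only depends on ranking, AUC says nothing about whether the scores are well calibrated or which threshold to deploy with.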

Pro tip: Never use accuracy alone. Always report precision, recall, and fairness metrics. For the mortgage example: the model's 94% accuracy masked the disparity in rejection rates between demographic groups. Comparing rejection rates by group would have caught this immediately.

6 Data Quality Testing

Training data representativeness

Is the training data representative of real-world data the model will encounter? If the model is trained on mortgage data from wealthy suburbs only, it will fail on rural areas. Test: Check the distribution of training data. Verify that all demographic groups, regions, and income levels are represented proportionally to real-world data.
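The distribution check can be automated as a simple comparison of shares. This is a minimal sketch: the function name, the 5-percentage-point tolerance, and the reference shares are all made up for illustration. In a real audit the reference shares would come from a trusted population source.

```python
from collections import Counter


def representation_gaps(train_groups, reference_share, tolerance=0.05):
    """Return groups whose share of the training data differs from the
    reference share by more than `tolerance` (as a fraction)."""
    counts = Counter(train_groups)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference_share.items():
        actual = counts.get(group, 0) / total
        if abs(actual - expected) > tolerance:
            gaps[group] = round(actual - expected, 3)
    return gaps


# Hypothetical training set: 90% urban, 10% rural
train_regions = ["urban"] * 900 + ["rural"] * 100
# Hypothetical real-world shares
reference = {"urban": 0.75, "rural": 0.25}

gaps = representation_gaps(train_regions, reference)
print(gaps)
```

A positive gap means the group is over-represented in training, a negative gap means it is under-represented. The same check works for any categorical attribute: region, income band, or demographic group.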

Data labeling accuracy

Machine learning models learn from labels (annotations). Bad labels = bad model. If a training dataset is labeled "fraud" and "not fraud" by overworked contractors with no domain expertise, the labels will be noisy. Test: Spot-check labels. Have domain experts review a sample of labels for accuracy. Check inter-annotator agreement if multiple people labeled the data.
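Inter-annotator agreement is commonly summarised with Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The sketch below implements the standard two-annotator formula on made-up labels; the variable names are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for 100 transactions
annotator_1 = ["fraud"] * 10 + ["ok"] * 90
annotator_2 = ["fraud"] * 5 + ["ok"] * 5 + ["fraud"] * 5 + ["ok"] * 85

raw_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / 100
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 90% of items, but because "ok" dominates, most of that agreement is expected by chance and kappa is only about 0.44. On imbalanced labels, raw agreement flatters label quality in the same way accuracy flatters a model.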

Data contamination

If test data leaks into training data, the model's accuracy on test data is artificially inflated. Test: Verify that training and test sets are completely separate. Check for duplicate records or data points that might have leaked between sets.
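A duplicate check between the two sets can be as simple as hashing each record after light normalisation. The sketch below assumes records are flat dictionaries; the field names and the normalisation rules (trim whitespace, lower-case, ignore key order) are illustrative choices, and near-duplicates that differ in substance will not be caught.

```python
import hashlib


def row_fingerprint(row: dict) -> str:
    """Hash a record after normalising key order, case, and whitespace."""
    canonical = "|".join(
        f"{key}={str(row[key]).strip().lower()}" for key in sorted(row)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_leaked_rows(train_rows, test_rows):
    """Return test rows whose fingerprint also appears in the training set."""
    train_hashes = {row_fingerprint(r) for r in train_rows}
    return [r for r in test_rows if row_fingerprint(r) in train_hashes]


# Made-up records
train = [{"name": "Alice", "income": 52000}, {"name": "Bob", "income": 61000}]
test = [{"income": 61000, "name": " bob "}, {"name": "Carol", "income": 48000}]

leaked = find_leaked_rows(train, test)
print(f"{len(leaked)} of {len(test)} test rows also appear in training")
```

A non-empty result means the reported test accuracy cannot be trusted until the overlap is removed and the model is re-evaluated.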

Missing or corrupted data

Real-world datasets have missing values, corrupted fields, and outliers. Test: Check that the model handles missing data appropriately (it should not crash, and it should flag uncertainty). Test extreme values and outliers to ensure the model does not fail.

Historical bias in training data

Historical datasets encode past biases. Mortgage rejection data reflects decades of human bias. Test: Analyse the training data for disparities. Are certain demographic groups underrepresented in positive labels? If yes, the model will learn to disfavour those groups.

7 Model Behavior Testing

Adversarial examples

Small, intentional perturbations to inputs that cause the model to fail. A stop sign with a black sticker can fool image classifiers. Slightly reworded text can change sentiment analysis results. Test: Create adversarial examples and verify the model does not fail catastrophically (it should flag low confidence or abstain, not give wrong predictions with high confidence).

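# Python example: perturb an image with random noise.
# Note: random noise is a robustness probe, not a crafted adversarial attack.
import numpy as np
from keras.utils import load_img  # older Keras: keras.preprocessing.image

img = load_img('dog.jpg')
img_array = np.asarray(img, dtype=np.float32)  # pixel values in [0, 255]
# Add Gaussian noise (a std of 8 on the 0-255 scale is barely visible)
rng = np.random.default_rng(seed=0)
perturbed = img_array + rng.normal(0, 8.0, img_array.shape)
# Clip to the valid pixel range and restore the image dtype
perturbed = np.clip(perturbed, 0, 255).astype(np.uint8)
# Test: does the model still recognise the dog, and with what confidence?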

Fairness across demographics

Calculate precision, recall, and false positive rate separately for each demographic group. Do results differ significantly? If a model has 95% recall overall but only 70% recall for one demographic group, it is unfair. Test: Use fairness libraries (Fairness Indicators, AI Fairness 360) to compute demographic disparities.

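# Example: measure fairness across groups.
# Assumes test_data is a pandas DataFrame with 'gender' and 'label' columns
# and that model follows the scikit-learn predict() convention.
from sklearn.metrics import precision_score, recall_score

disparities = {}
for group in ['male', 'female', 'non-binary']:
    group_data = test_data[test_data['gender'] == group]
    preds = model.predict(group_data.drop(columns=['gender', 'label']))
    disparities[group] = {
        'precision': precision_score(group_data['label'], preds),
        'recall': recall_score(group_data['label'], preds),
    }
# Flag as unfair if recall differs by more than 5 percentage points
recalls = [m['recall'] for m in disparities.values()]
unfair = max(recalls) - min(recalls) > 0.05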

Robustness to input variations

Small variations in real-world data should not cause big changes in predictions. If slightly different phrasings of the same customer question produce wildly different sentiment, the model is brittle. Test: Generate variations (rephrase sentences, rotate images, add noise) and verify outputs remain stable.
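A stability test needs only three ingredients: a base input, a set of variations that should not change the answer, and a tolerance. The sketch below shows the shape of such a test. `toy_sentiment` is a deliberately brittle, hypothetical stand-in for a real model, and the 0.1 tolerance is an arbitrary illustration.

```python
def stability_report(predict, base_input, variants, max_delta=0.1):
    """Return (variant, score) pairs whose score moves more than
    max_delta away from the score of the base input."""
    base = predict(base_input)
    flagged = []
    for variant in variants:
        score = predict(variant)
        if abs(score - base) > max_delta:
            flagged.append((variant, score))
    return flagged


def toy_sentiment(text: str) -> float:
    """Hypothetical stand-in model: share of words found in a tiny
    positive-word lexicon."""
    positive = {"great", "good", "helpful"}
    words = text.lower().split()
    return sum(w.strip(".,!?") in positive for w in words) / max(len(words), 1)


base = "The support team was great and helpful"
variants = [
    "The support team was GREAT and helpful!",  # casing and punctuation
    "Our support team was great and helpful",   # harmless rewording
    "The support team was grate and helpful",   # single typo
]
unstable = stability_report(toy_sentiment, base, variants)
for variant, score in unstable:
    print(f"unstable: {variant!r} scored {score:.2f}")
```

The toy model survives casing and rewording but shifts on a single typo, which is the kind of brittleness this test is designed to surface.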

Explainability and interpretability

Can you explain why the model made a decision? A mortgage rejection with no explanation is not only unfair, it is legally risky: the EU AI Act places transparency obligations on high-risk AI systems, and the NZ Privacy Act gives people the right to access the personal information an agency holds about them. Test: Use SHAP, LIME, or similar tools to generate explanations for model predictions. Verify that explanations are accurate and actionable.

8 Integration Testing

Model + application interaction

The model might be accurate in isolation, but fail when integrated into the application. Test:

Does the application handle model errors gracefully? (model predictions are probability distributions, not certainties)
Are outputs formatted correctly for downstream systems?
Does latency meet requirements? (a model that takes 10 seconds to score a customer is probably too slow for a live application)

Graceful degradation

If the model fails or is offline, does the application degrade gracefully or crash? Test: Simulate model failures and verify the application either falls back to a simpler decision path or safely abstains from making decisions.
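A simulated-failure test is easiest to write when the model call is injected into the decision function. The sketch below assumes a hypothetical `decide` wrapper that falls back to manual review when the model raises or returns an invalid score. All names and outcomes are illustrative, and catching every exception is a simplification that production code would usually narrow.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    outcome: str  # "approve", "decline", or "manual_review"
    source: str   # "model" or "fallback"


def decide(features: dict, model_call: Callable[[dict], float],
           threshold: float = 0.5) -> Decision:
    """Score with the model; abstain safely if the model is unusable."""
    try:
        score = model_call(features)
    except Exception:
        return Decision("manual_review", "fallback")
    if not 0.0 <= score <= 1.0:  # also rejects NaN
        return Decision("manual_review", "fallback")
    return Decision("approve" if score >= threshold else "decline", "model")


# Simulated model behaviours for the test
def healthy_model(features: dict) -> float:
    return 0.9


def offline_model(features: dict) -> float:
    raise TimeoutError("model endpoint unavailable")


def broken_model(features: dict) -> float:
    return float("nan")


for fake in (healthy_model, offline_model, broken_model):
    print(fake.__name__, decide({}, fake))
```

Recording the `source` alongside the outcome lets monitoring count how often the fallback path fires, which is itself a useful health signal.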

Output handling

Does the application correctly interpret and act on model outputs? Test: Verify that confidence scores are respected, so that low-confidence predictions do not trigger high-stakes decisions without human review.
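The routing rule itself is small enough to test exhaustively at its boundaries. This is a sketch with made-up thresholds (0.3 and 0.7) and hypothetical action names. The right band depends on the cost of each kind of error in your application.

```python
def route_prediction(probability: float, low: float = 0.3,
                     high: float = 0.7) -> str:
    """Map a positive-class probability to an action.
    Scores in the uncertain band between low and high go to a human."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if probability >= high:
        return "auto_approve"
    if probability <= low:
        return "auto_decline"
    return "human_review"


# Boundary-focused checks
for p in (0.0, 0.3, 0.31, 0.5, 0.69, 0.7, 1.0):
    print(p, route_prediction(p))
```

Integration tests should then confirm that the application actually calls this routing step, rather than thresholding the raw score somewhere else.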

9 Production Monitoring

Drift detection

Monitor the distribution of real-world data vs. training data. If the input distribution has shifted (data drift), the relationship between inputs and outcomes has changed (concept drift), or the distribution of model outputs has changed (prediction drift), the model may be degrading. Monitor: Track mean, std dev, and percentiles of key features and predictions. Alert if metrics shift significantly.
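One widely used way to turn "the distribution has shifted" into a single number is the Population Stability Index (PSI), which compares binned shares of a feature between a baseline and a live sample. The sketch below is a plain-Python illustration on made-up data. The equal-width binning, the floor that avoids log(0), and the 0.25 alert level (a common rule of thumb, not a standard) are all choices you should revisit for your own data.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a live sample of one feature.
    Bins are equal-width over the baseline range; live values outside
    that range are counted in the first or last bin."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Small floor so empty bins do not produce log(0)
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up feature values: training baseline vs. two production samples
training = [i / 100 for i in range(1000)]   # roughly uniform on [0, 10)
unchanged = list(training)
shifted = [v + 3.0 for v in training]       # same shape, moved upwards

psi_unchanged = population_stability_index(training, unchanged)
psi_shifted = population_stability_index(training, shifted)

ALERT_LEVEL = 0.25  # rule-of-thumb threshold, an assumption
print(f"unchanged: {psi_unchanged:.3f}, shifted: {psi_shifted:.3f}")
```

PSI on input features detects data drift without needing labels, so it can alert well before accuracy metrics become available.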

Performance metrics in production

Accuracy in the lab may differ from accuracy in the wild. Monitor: Precision, recall, and fairness metrics on live data. If a model's recall drops from 92% to 85% over 3 months, something has drifted — retraining may be needed.

Fairness in production

Biases may emerge or worsen in production. Monitor: Demographic disparities in predictions. If the false positive rate for one group is twice that of another, escalate for investigation.

User feedback and errors

Collect and analyse user feedback. When a user says "the model's prediction was wrong," investigate. Monitor: Error patterns — are errors concentrated in certain scenarios, demographics, or time periods? This signals where the model needs retraining.

Pro tip: Use MLflow, Evidently, or similar tools to track model performance over time. Production monitoring is not a one-time test. It is an ongoing practice. Set alerts for drift and fairness issues, and respond quickly with retraining or rollback.

10 Common Mistakes

Mistake 1: Testing only accuracy

Why it happens: Accuracy is the easiest metric to compute and most familiar to engineers.
The fix: Always report precision, recall, F1, and fairness metrics. Accuracy can hide serious problems, especially on imbalanced datasets. For high-stakes decisions (hiring, lending, health), fairness metrics are mandatory.

Mistake 2: Not testing the training data

Why it happens: Teams assume training data is clean and representative.
The fix: Audit training data before building the model. Check for labeling errors, class imbalance, and demographic representation. Bad training data is the root cause of most model failures.

Mistake 3: Ignoring data leakage

Why it happens: Training and test data get mixed, inflating accuracy metrics.
The fix: Strictly separate training, validation, and test sets. Check for duplicate records. Be especially careful with time-series data — temporal leakage is common.
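For time-series data, the usual safeguard is to split by date rather than at random, so the model never trains on records from the period it is tested on. This is a minimal sketch with made-up monthly records; the `event_date` field name and the cutoff are illustrative.

```python
from datetime import date


def temporal_split(rows, cutoff, date_key="event_date"):
    """Train on rows strictly before the cutoff, test on rows on or after it."""
    train = [r for r in rows if r[date_key] < cutoff]
    test = [r for r in rows if r[date_key] >= cutoff]
    return train, test


# Made-up records: one per month of 2024
rows = [{"event_date": date(2024, month, 1), "amount": 100 * month}
        for month in range(1, 13)]

train, test = temporal_split(rows, cutoff=date(2024, 10, 1))

# Guard against temporal leakage: every training row precedes every test row
latest_train = max(r["event_date"] for r in train)
earliest_test = min(r["event_date"] for r in test)
assert latest_train < earliest_test
```

The same guard can run as a pipeline check, so that a later change to the splitting logic cannot silently reintroduce leakage. Features computed from future data (for example, a rolling average that looks ahead) need a separate review, because a date split alone does not catch them.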

Mistake 4: No monitoring in production

Why it happens: The model passes testing, ships, and the team assumes it will keep working.
The fix: Deploy monitoring from day one. Track accuracy, fairness, and drift continuously. Drift is not a theoretical concern — it happens regularly in real-world systems.

Mistake 5: Not testing edge cases or adversarial inputs

Why it happens: Testing focuses on common cases; edge cases are deprioritised.
The fix: Deliberately test rare inputs, noisy data, and adversarial examples. Models often fail catastrophically on edge cases. Verify that failures are safe (e.g., low confidence, human review) rather than confidently wrong.

Why teams fail here

They inherit the data scientist's test split and never question whether it is representative — the held-out set is often drawn from the same skewed population as the training set, so high holdout accuracy tells you nothing about real-world fairness.
They scope testing to pre-deployment and treat production as finished — drift is not a theoretical edge case, it is the default trajectory of every model, and without continuous monitoring the model you shipped is silently degrading.
They skip explainability testing because the model "just works", but if a decision cannot be explained to the person it affects (an expectation behind the NZ Privacy Act's access rights and the ISO 42119 transparency requirements), the system is not fit for purpose regardless of accuracy.
They conflate statistical significance with practical significance — a 93% vs 91% accuracy gap between demographic groups sounds small until you multiply it by hundreds of thousands of decisions; senior testers think in impact, not percentages.

Key takeaway

A model that passes accuracy testing but has never been audited for fairness, drift, and explainability has not been tested — it has been approved to cause harm at scale.

11 Self-Check


Q1: Why is accuracy alone an insufficient metric for model testing?

Accuracy measures overall correctness but hides important details. On imbalanced datasets, a model can be highly accurate while failing for minority classes. Example: a fraud detector with 99% accuracy might have only 10% recall for fraud (missing most actual fraud). For fairness evaluation, accuracy also masks disparities — the model might be 95% accurate overall but unfair to specific demographic groups.

Q2: What is data contamination and why does it matter for model testing?

Data contamination is when test data leaks into training data. The model learns the test data during training, making test accuracy artificially high. This hides real-world performance issues. Example: if a training set accidentally includes some rows also in the test set, the model "memorises" those rows, achieving unrealistic accuracy. Always separate training and test sets completely and check for duplicate records.

Q3: What is model drift and how do you detect it?

Model drift is when the model's performance degrades over time because the input data distribution has changed (data drift), the relationship between inputs and outcomes has changed (concept drift), or the model's output distribution has shifted (prediction drift). Detect it by monitoring accuracy, precision, recall, and fairness metrics on live data. If these metrics drop significantly, the model may need retraining. Also monitor the distribution of input features and predictions: ground-truth labels often arrive late, and significant distribution shifts are an earlier signal of drift.

Q4: How do you test a model for bias and fairness?

Calculate metrics separately for each demographic group. Compare precision, recall, false positive rate, and false negative rate across groups. If any metric differs by more than a defined threshold (commonly 5–10%), flag it as unfair. Use fairness libraries (Fairness Indicators, AI Fairness 360) to compute demographic parity and equalized odds. Test with data representative of real-world demographics, not just the majority group.

Q5: What are adversarial examples and why test for them?

Adversarial examples are small, intentional perturbations to inputs that cause the model to produce incorrect outputs. Example: a stop sign with a black sticker might fool an image classifier, or a sentence with typos might flip a sentiment classifier. Test by generating adversarial examples and verifying the model does not fail catastrophically (it should either abstain or flag low confidence rather than confidently give a wrong answer).

12 Interview Prep

Real questions from AI/ML testing interviews.

"Have you tested machine learning models? What was your testing strategy?"

Yes, I tested a recommendation model for an e-commerce platform. My strategy had three parts: (1) data quality testing — I checked that training data was representative, labeling was accurate, and there was no data leakage. (2) Model validation — I computed accuracy, precision, recall, and F1-score on a held-out test set and checked fairness metrics across user demographics. (3) Integration and monitoring — I verified the model integrated well with the application, handled errors gracefully, and deployed monitoring to track accuracy and fairness over time.

"How would you test for bias in a hiring recommendation model?"

I would: (1) Audit the training data — check whether rejected candidates differ systematically by gender, age, or ethnicity. (2) Calculate fairness metrics separately for each demographic group — precision, recall, false positive rate. (3) Look for disparities — if the model rejects women at 20% but men at 10%, that is a 2x disparity and a red flag. (4) Create test cases with identical CVs except for gender/name and verify the model does not systematically favour one group. (5) Deploy monitoring to track fairness in production — hiring recommendations change over time and biases can emerge.

"A model's test accuracy is 98%, but production accuracy dropped to 85%. What could cause this and how would you investigate?"

This is likely model drift: the real-world data distribution has changed since the model was trained. I would: (1) Compare the distribution of features in test data vs. production data. Have the input distributions shifted? (2) Check for concept drift. Have the underlying relationships changed? (3) Calculate accuracy by time window. If accuracy was close to 98% just after launch and has fallen since, the drift is recent; if it was low from day one, suspect an unrepresentative test set or data leakage instead. (4) Analyse error patterns. Are certain input types or demographics performing worse? (5) Recommend retraining on recent data or deploying a new model. This is why continuous monitoring is essential.
