Database Testing · Lesson 3

Migration & Schema-Change Testing

A migration rewrites the data you cannot afford to lose, often once, often with no easy way back. This lesson teaches you to prove every row arrived intact — before the old copy is gone.

Database Testing Database & Backend — Lesson 3 of 4 ~30 min read · ~75 min with exercises

1 The Hook

A fictional NZ KiwiSaver provider, Rimu Wealth, moved its member records from an old system to a new platform over a weekend. The migration ran overnight, the counts looked sensible the next morning, and the new platform went live on the Monday. Members logged in, saw their balances, and everything appeared fine.

Weeks later, a member queried their balance during a complaint. It was short by a few hundred dollars. The investigation found that the old system stored balances with four decimal places of precision, but the new balance column had been defined with only two. The migration had silently rounded every member’s balance on the way across. Multiplied over tens of thousands of members, the provider was now holding records that did not add up to the money it actually had.

No row was lost. The row count matched exactly — every member came across, which is why the morning check passed. The corruption was inside the values, not in the count. A column-type change had quietly truncated data, and the only check anyone ran was “did we get the same number of rows?”. That question cannot see a value that changed.

Here is the lesson hidden in that story. A migration is the one operation where you deliberately rewrite the data you most need to keep, and often the old copy is decommissioned soon after. “The counts match” proves nobody was dropped; it says nothing about whether the values survived intact. Migration testing is the discipline of proving, before the old system is gone, that every row arrived and every value is unchanged — and that you can get back if it did not.

2 The Rule

A matching row count proves nobody was dropped; it does not prove the values survived. A migration must be reconciled on both counts and content — the same number of rows, and the same data inside them, verified by checksum against the source. And you do not migrate without a tested rollback, because a migration is often the point of no return.

3 The Analogy

Analogy

Shifting a whole house’s contents with a removal truck.

When you move house, counting the boxes off the truck tells you all forty arrived. It does not tell you that the box marked “kitchen” still has the plates in one piece — some could be shattered inside a box that is present and correct on the count. A removal company that only counts boxes has not checked your china. You only find the breakage when you unpack, by which time the old house is sold.

Rimu Wealth counted forty boxes and called it done. The plates were chipped inside — the balances rounded — and nobody opened a box to check. Migration testing is opening the boxes: not just “did everything arrive?” but “is what arrived identical to what left?”. And you keep the old house’s keys until you have confirmed it — that is your rollback.

Senior engineer insight

The moment that changed how I test migrations was watching a Flyway-managed schema change pass every pre-production check, then silently corrupt a NUMERIC(10,2) column on the way into production — because the production database collation differed from staging and the implicit type coercion rules were different. Since then I treat the database engine configuration itself as a first-class test variable: collation, timezone, ANSI_NULLS, and numeric precision settings must match between environments, not just the schema DDL. A Flyway migration script that is byte-for-byte identical can produce different data in two environments if the engine underneath is configured differently.

The most common mistake: teams version-control their migration scripts with Flyway but never version-control the database engine configuration — so the migration is repeatable but the environment it runs in is not.

From the field

A Wellington-based government agency was migrating a legacy Oracle system to PostgreSQL — a common NZ pattern as agencies move off expensive proprietary licences. The team used Flyway for the migration scripts and validated each one in a staging environment that had been carefully cloned from production. Every Flyway migration ran green. The team assumed that meant the data was clean. What they had not accounted for was that Oracle stores DATE with a time component and PostgreSQL does not — so roughly 40,000 timestamp records had their time-of-day silently stripped to midnight during the load. The row count matched. The dates matched at the day level. Only a content reconciliation comparing TO_CHAR(source_date, 'YYYY-MM-DD HH24:MI:SS') against the target would have caught it. The lesson that generalised: cross-engine migrations introduce type-system impedance mismatches that are invisible to row counts and invisible to engineers who have only ever worked in one engine.

4 Data Preservation and Reconciliation

Data preservation is the core promise of a migration: every row and every value comes across unchanged, unless a transformation was deliberately specified. You prove it with reconciliation — comparing source and target and asserting they agree. Reconciliation has two layers, and most teams only do the first.

This is ETL testing. A data migration is an ETL pipeline — Extract from the source, Transform to fit the new model, Load into the target. Testers map their checks to those three stages: Extract — did every source row come out, with nothing truncated or lost to a bad query? Transform — were the deliberate conversions applied correctly, and only those (no silent rounding, type coercion, or encoding damage)? Load — did every transformed row land, with constraints, keys, and indexes intact? The same vocabulary covers nightly data-warehouse ETL and analytics feeds, not just one-off migrations — reconciliation is the common test pattern across all of them.

Row-count reconciliation is the cheap, necessary first check. Count the rows in each source table, count them in the target, and assert they match (allowing for any rows a transformation was meant to add or remove, with that number stated in advance).

-- source

SELECT COUNT(*) FROM legacy.member;   -- expect: 84,602

-- target

SELECT COUNT(*) FROM new.member;      -- must equal 84,602

A row count catches dropped and duplicated rows. It is completely blind to Rimu Wealth’s defect, because every row was present — the values inside had changed.

Content reconciliation is the layer that catches the rounding. The standard technique is a checksum: compute a fingerprint of the actual data — a hash or a set of column aggregates — on the source, compute the same on the target, and assert they match. If a single balance was rounded, the fingerprint differs, and the migration fails.

-- a simple content reconciliation: aggregates must match exactly

-- source

SELECT COUNT(*), SUM(balance), MIN(balance), MAX(balance) FROM legacy.member;

-- target — every figure must be identical

SELECT COUNT(*), SUM(balance), MIN(balance), MAX(balance) FROM new.member;

Had Rimu Wealth run this, SUM(balance) on the target would have been lower than the source by the total of all the rounding — an immediate, unmissable fail. For stronger assurance, a row-by-row hash of the key columns catches changes that happen to cancel out in a sum. The principle is constant: reconcile content, not just count.

Pro tip: A sum can hide offsetting errors — one balance up by $5, another down by $5, and the total looks fine. A per-row checksum (hash each row’s key columns and compare the set of hashes) catches changes a sum would mask. Use a sum for a fast smoke test, a row-level hash for the real gate.

5 Backward Compatibility

A schema change rarely happens in isolation. The same database is read and written by application code, batch jobs, reports, and integrations. A change that is correct for the new feature can break every one of those if it is not backward compatible.

The dangerous changes are the destructive ones. Dropping a column breaks any query that still selects it — a nightly report, an integration with MBIE, a fallback code path that runs once a quarter. Renaming a column is a drop and an add in disguise: everything referring to the old name breaks at once. Narrowing a type — Rimu Wealth’s four-decimals-to-two — truncates data and can break writes that send the old precision. Tightening a constraint — adding NOT NULL to a column that already has NULLs, or a UNIQUE where duplicates exist — can fail on the existing data the moment it is applied.

The tester’s job is to enumerate every consumer of the changed table and confirm each still works. The safe pattern for risky changes is expand then contract: add the new column alongside the old, migrate code to use it, and only drop the old column once nothing reads it. That sequencing lets old and new coexist, which is what makes the next idea — zero-downtime — possible. A tester who only checks that the new feature works, and never checks the old report that still selects the dropped column, ships a backward-incompatible change.

6 Zero-Downtime Migrations

Many NZ systems cannot simply go offline for a migration — an internet banking platform, a HealthNZ clinical system, a TransitNZ service. These run zero-downtime migrations, where the change is applied while the system stays live and serving real traffic. That introduces a window in which the schema is mid-change and both old and new code may be running at once.

The technique that makes this safe is the expand-then-contract pattern from the last section, run in phases:

Expand: add the new structure (a new column, a new table) without removing the old. The schema now supports both shapes.
Dual-write / backfill: the application writes to both old and new while a background job copies existing rows across. During this phase the two must stay in agreement.
Switch reads: point the application at the new structure once it is fully populated and verified.
Contract: only after everything reads the new structure, remove the old — the irreversible step, done last and deliberately.

For a tester, the new and hardest target is the in-between state. You must test that the system is correct at every phase, not just at the end — that during dual-write a row written the old way and the new way agree, that a read during the switch never sees a half-migrated record, and that the backfill does not collide with live writes. The defects in a zero-downtime migration live in the overlap, where old and new code touch the same data at the same time. Testing only the final state misses them entirely.

7 Rollback and the Point of No Return

The single most important question before any migration is: if this goes wrong, how do we get back? A migration that cannot be undone is a one-way door, and walking through it untested is how a bad weekend becomes a lost member’s balance with no clean source to restore from.

Rollback is not free, and not always possible. A purely additive change (adding a column) is easy to reverse. A destructive change (dropping a column, rewriting values in place) may have no way back once the old data is overwritten — the information needed to reverse it is simply gone. This is the point of no return: the moment after which the original data no longer exists to restore from.

So a tester insists on three things:

A taken, verified backup before the migration runs — and tested that it actually restores, not just that the backup job reported success. An untested backup is a guess.
A rollback procedure that has been rehearsed on a copy — run the migration, run the rollback, and reconcile that the data returned exactly to its starting state.
A clear, agreed point of no return — everyone knows the moment after which rollback is no longer possible, so the decision to cross it is deliberate and the reconciliation that gates it is run first.

The reconciliation from earlier is what gates the point of no return. You do not decommission the old system, and you do not cross into the irreversible, until content reconciliation has proven the target matches the source. Rimu Wealth crossed that line on a row count alone — and once the old four-decimal data was gone, there was nothing left to restore the truncated balances from.

8 Building Migration Test Cases

A migration test case names what it preserves, reconciles source against target on content, and ties the result to the go/no-go decision on crossing the point of no return.

Here is a migration test case written to catch the exact Rimu Wealth defect:

Test ID:            MIG-RECON-003

Preserves:          Member balances, to full source precision.

Risk category:      Data preservation — value corruption (type narrowing).

Reconciliation:     1) Row count: COUNT(*) target = COUNT(*) source.

                  2) Content: SUM(balance) and per-row hash of (member_id, balance)

                    match exactly between source and target.

                  3) Target balance column retains full source precision.

Pass criterion:     All three pass. Any mismatch blocks the point of no return.

Rollback:           Verified restore from pre-migration backup rehearsed on a copy.

Evidence required:  Source & target counts; source & target SUM(balance); hash

                  comparison result; rollback rehearsal log.

Traceability:       Risk R-02 (value corruption during migration, irreversible once old

                  system decommissioned).

Result:             [Pass / Fail]

Notice the shape: reconciliation is layered — count then content — so a passing count alone cannot let the migration through. The pass criterion explicitly gates the point of no return on content reconciliation, which is the exact control Rimu Wealth lacked. And rollback is a named, rehearsed prerequisite, not an afterthought. This is the difference between a migration you can defend to an auditor and one you hope worked.

9 Common Mistakes

🚫 Reconciling on row count alone

Why it happens: Counts are easy, fast, and feel like proof everything came across.
The fix: A matching count proves nobody was dropped, nothing more — the Rimu Wealth trap. Add content reconciliation: matching aggregates and a per-row hash of the key columns. A value that was truncated or rounded leaves the count untouched and the content fingerprint different.

🚫 Testing only the new feature, never the old consumers

Why it happens: The change was made for the new feature, so that is what gets tested.
The fix: A dropped or renamed column breaks every report, batch, and integration that still reads it — a quarterly MBIE feed you never thought to run. Enumerate every consumer of the changed table and confirm each still works. Use expand-then-contract so old and new can coexist.

🚫 Crossing the point of no return on an unverified backup

Why it happens: The backup job reported success, so the data is assumed safe.
The fix: A backup that has never been restored is a guess. Before the irreversible step, rehearse the actual restore on a copy and reconcile that it returns the data to its starting state. Once the old four-decimal data is gone, an untested backup will not bring it back.

🚫 Testing a zero-downtime migration only in its final state

Why it happens: The end state is the deliverable, so that is where the testing points.
The fix: The defects live in the overlap — the dual-write window where old and new code touch the same data. Test each phase: that dual-written rows agree, that a read mid-switch never sees a half-migrated record, and that the backfill does not collide with live writes.

10 Now You Try

Three graded exercises: spot the migration risks, fix the reconciliation, then build a migration test plan. Write your answer, run it for AI feedback, then compare to the model answer.

🔍 Exercise 1 of 3 — Spot the Migration Risks

Read the migration plan for a fictional Coastal Bank customer platform below. Identify 3 risks that could lose or corrupt data, or break existing consumers, and name the category each falls in (data preservation, backward compatibility, rollback, zero-downtime overlap).

Migration plan
The customer table’s phone column (free-text) is being split into country_code and phone_number, and the old phone column dropped in the same release. A monthly AML report and a marketing integration both still SELECT phone. The migration runs in place, overwriting the table; sign-off is “row counts match between a backup snapshot and the live table”. The platform stays online during the change. The old system is decommissioned the same evening once counts match.

List 3 migration risks and the category of each:

Show model answer

There are at least four real risks; any three well-explained earns full marks.

1. Backward compatibility — dropping the phone column in the same release breaks the monthly AML report and the marketing integration, both of which still SELECT phone. The AML report in particular is a compliance feed. Category: backward compatibility. Fix: expand-then-contract — keep phone until every consumer is migrated off it, drop it in a later release.

2. Data preservation / value corruption — splitting free-text phone into country_code and phone_number is a transformation that can silently mangle values (numbers with no country code, odd formats, extensions). Row counts will still match. Category: data preservation. Fix: content reconciliation — reconstruct phone from the two new columns and compare to the original per row.

3. Rollback / point of no return — the migration overwrites in place AND the old system is decommissioned the same evening on a row-count match alone. Once the original free-text phone is gone there is nothing to restore the split from if it went wrong, and there is no rehearsed rollback. Category: rollback. Fix: verified backup, rehearsed restore, and do not decommission until content reconciliation passes.

Bonus: zero-downtime overlap — the platform stays online, but there is no dual-write/expand plan, so reads during the in-place change can see a half-migrated table. Category: zero-downtime overlap.

The thread: row-count sign-off cannot see any of the value/compatibility damage. Reconcile content and keep a tested way back before decommissioning.

🔧 Exercise 2 of 3 — Fix the Reconciliation

A fictional HealthNZ migration moved prescription records to a new schema. The team’s entire reconciliation was the check below. It passed, but a later audit found some dose_mg values had been corrupted. Explain why this reconciliation is insufficient, and write a complete content reconciliation that would have caught the corruption.

    -- reconciliation that "passed"

    SELECT COUNT(*) FROM legacy.prescription;  -- 1,204,883

    SELECT COUNT(*) FROM new.prescription;     -- 1,204,883  ==> PASS

Explain the gap and write the content reconciliation:

Show model answer

Why row-count alone missed it: a row count only proves the same NUMBER of prescriptions came across. Every row was present, so the count matched — but the dose_mg values inside some rows had changed. Count is completely blind to value corruption. This is the Rimu Wealth defect in a clinical context, where a corrupted dose is a patient-safety issue.

Content reconciliation — match aggregates AND a per-row hash:

-- aggregate smoke test (fast)
SELECT COUNT(*), SUM(dose_mg), MIN(dose_mg), MAX(dose_mg) FROM legacy.prescription;
SELECT COUNT(*), SUM(dose_mg), MIN(dose_mg), MAX(dose_mg) FROM new.prescription;
-- every figure must be identical

-- the real gate: per-row hash of the key columns
-- source set of hashes
SELECT prescription_id, HASH(prescription_id, dose_mg, drug_code) FROM legacy.prescription;
-- target set of hashes — must match the source set row for row
SELECT prescription_id, HASH(prescription_id, dose_mg, drug_code) FROM new.prescription;
-- assertion: the two sets of (id, hash) are identical; any prescription_id
-- whose hash differs is a corrupted row. PASS = zero differing rows.

Why a SUM alone is still not enough: a sum can hide offsetting errors — one dose corrupted up by 5 mg and another down by 5 mg leave SUM(dose_mg) unchanged, so the aggregate passes while two rows are wrong. The per-row hash catches changes a sum would cancel out. Use the aggregate as a quick smoke test and the row-level hash as the actual pass gate.

🏗️ Exercise 3 of 3 — Build a Migration Test Plan

Design a migration test plan of 5 test cases for a fictional zero-downtime migration of an Harbour Bank internet-banking account table (the system must stay live). Each case needs at least: an ID, what it verifies, an acceptance criterion, and the evidence required. Cover row-count reconciliation, content/checksum reconciliation, backward compatibility, the dual-write overlap, and rollback.

Show model answer

MIG-01 | Verifies: no rows lost or duplicated | Acceptance criteria: COUNT(*) target equals COUNT(*) source for every table (plus any pre-stated transformation delta); 0 difference | Evidence required: source and target counts per table; the run ID

MIG-02 | Verifies: values preserved exactly (content reconciliation) | Acceptance criteria: matching aggregates AND a per-row hash of key columns match source-to-target; 0 differing rows | Evidence required: source/target aggregates; hash comparison result listing any mismatched IDs

MIG-03 | Verifies: existing consumers still work after the change (backward compatibility) | Acceptance criteria: every report, batch, and integration that reads the account table runs successfully and returns identical results pre- and post-change | Evidence required: list of consumers; before/after run outputs for each

MIG-04 | Verifies: dual-write overlap is consistent | Acceptance criteria: during the expand/dual-write phase, a row written via old and new paths agrees; a read mid-switch never returns a half-migrated record; backfill does not overwrite a newer live write | Evidence required: dual-write comparison samples; concurrent-read test results; backfill-vs-live-write test log

MIG-05 | Verifies: rollback returns the data to its starting state | Acceptance criteria: on a copy, run migration then rollback; reconcile that counts and per-row hashes match the original pre-migration state exactly | Evidence required: pre-migration snapshot fingerprint; post-rollback fingerprint; they must be identical

Strong plans: each case is specific, measurable, names concrete evidence, and together they cover count (MIG-01), content (MIG-02), backward compatibility (MIG-03), the zero-downtime overlap (MIG-04), and rollback (MIG-05). Weak plans say "check the migration worked" five times — that is the difference being marked. Note that the point of no return must be gated on MIG-02 passing.

Why teams fail here

Flyway or Liquibase gives a green run and teams treat that as a passing migration test — but a migration tool only proves the DDL executed without error, not that the data arrived correctly. Script success and data integrity are different assertions.
Zero-downtime expand-then-contract is planned but the dual-write window is never tested under realistic concurrent load — the defects (stale reads, backfill collisions, disagreeing writes) only appear at production throughput, not in a quiet staging environment.
The rollback procedure exists on paper but has never been end-to-end rehearsed — so when the team needs to roll back under pressure at 2am, they discover the restore takes four hours, not the forty minutes documented, and the old system is already decommissioned.
NZ legacy systems — especially those migrating off Oracle, SQL Server 2008, or Progress/OpenEdge — carry implicit type coercions, collation differences, and NULL-handling rules that are undocumented and only surface during a cross-engine migration when production data hits the edge cases that test data never did.

11 Self-Check

Click each question to reveal the answer.

Q1: Why is a matching row count not enough to sign off a migration?

Because a row count only proves nobody was dropped or duplicated — it is blind to whether the values inside the rows survived. Rimu Wealth’s balances were all present but silently rounded; the count matched and the money did not. You must add content reconciliation: matching aggregates and a per-row hash of the key columns.

Q2: Why is a SUM check alone still insufficient for content reconciliation?

Because a sum can hide offsetting errors — one value up by $5 and another down by $5 leave the total unchanged while two rows are wrong. A per-row hash of the key columns catches changes that cancel out in an aggregate. Use the sum as a fast smoke test and the row-level hash as the real pass gate.

Q3: What is the expand-then-contract pattern, and why does it matter?

Add the new structure alongside the old, migrate code to use it, and only drop the old once nothing reads it. It lets old and new coexist, which preserves backward compatibility and makes zero-downtime migrations possible. The contract step — removing the old — is the irreversible one, done last and deliberately.

Q4: Where do the defects in a zero-downtime migration live?

In the overlap — the dual-write window where old and new code touch the same data at once. The defects are dual-written rows that disagree, a read mid-switch seeing a half-migrated record, or a backfill colliding with a live write. Testing only the final state misses them; you must test each phase.

Q5: What is the “point of no return” and what gates crossing it?

The moment after which the original data no longer exists to restore from — an in-place overwrite or decommissioning the old system. Crossing it is gated on content reconciliation passing and a rehearsed, verified rollback. You never cross it on a row count alone, and never on an untested backup.

12 Interview Prep

Real questions asked in NZ QA interviews for data-heavy roles. Read the model answers, then practise your own version.

“How would you test that a data migration preserved the data correctly?”

I reconcile source against target on two layers. First, row counts per table, to prove nothing was dropped or duplicated — necessary but not sufficient. Then content reconciliation, because a matching count says nothing about the values: I compare aggregates like sum, min and max, and then run a per-row hash of the key columns and assert the two sets of hashes are identical. A sum alone can hide offsetting errors, so the row-level hash is my real gate — any row whose hash differs is corrupted. I also check that any deliberate transformation produced exactly the intended result, and I gate the point of no return on content reconciliation passing, so the old system is never decommissioned on a count alone.

“A column is being dropped from a busy table. What are you worried about?”

Backward compatibility. The table is read by more than the new feature — reports, batch jobs, integrations, fallback code paths that only run quarterly. Dropping the column breaks every one of those that still selects it, and a feed like an AML or MBIE report failing is a compliance problem, not just a bug. So I enumerate every consumer of the table and confirm each still works, and I push for expand-then-contract: keep the old column until everything has migrated off it, and drop it in a later release once nothing reads it. That sequencing lets old and new coexist and removes the all-at-once breakage.

“Why insist on a rollback plan if the migration tested fine?”

Because a migration is often a one-way door, and production has data and load that no test environment fully reproduces. If something the tests did not cover goes wrong after the point of no return — once the original data is overwritten or the old system is gone — there is nothing to restore from unless I prepared one. So I insist on a verified backup that has actually been restored, not just a backup job that reported success, and a rollback procedure rehearsed on a copy where I reconcile that the data returned exactly to its starting state. The reconciliation gates the irreversible step: I do not cross the point of no return until content reconciliation passes and I have a proven way back.

Key takeaway

A migration is the one operation where the cost of finding a bug after the fact is higher than the cost of proving correctness before — which is why content reconciliation and a rehearsed rollback are not optional extras, they are the engineering definition of done.

← Data & Referential Integrity Next: Test Data Management & Masking →

Migration & Schema-Change Testing

1 The Hook

2 The Rule

3 The Analogy

4 Data Preservation and Reconciliation

5 Backward Compatibility

6 Zero-Downtime Migrations

7 Rollback and the Point of No Return

8 Building Migration Test Cases

9 Common Mistakes

10 Now You Try

11 Self-Check

Related techniques

12 Interview Prep