Healthcare QA · Lesson 2

HL7 & FHIR Interoperability Testing

Clinical systems do not work alone — they pass messages and resources to each other constantly. This lesson teaches you to read those messages, validate them against their structure and profile, and break them on purpose.

Healthcare & Health Data · Lesson 2 of 5 · ~35 min read · ~80 min with exercises

1 The Hook

A fictional NZ private laboratory, Tui Diagnostics, sent test results to GP practices as HL7 v2 messages. For years it worked. Then a GP practice upgraded its patient management system, and within days a result came through that read as a sodium level of 145 when the lab had reported 14.5. The decimal point had moved.

The cause was small and entirely structural. The lab’s message put the result value and its units in separate fields of the observation segment. The new practice system read the value field but parsed the units field as part of the number on certain messages, because the two systems disagreed about exactly which delimiter separated those pieces. Most results were whole numbers and survived. Results with a decimal place — a minority — came through wrong.

No system crashed. Both ends believed they were speaking the same language. The lab’s message was, by its own rules, valid. The practice’s parser was, by its own rules, working. The defect lived in the gap between two systems’ interpretations of the same message — and only a tester who read the message structure itself, and tested the decimal-place case, would have caught it before a clinician did.

Here is the lesson hidden in that story. The integration had been tested by sending a result and seeing a result appear. That is testing the happy path of the channel, not the structure of the message. Interoperability testing is reading the segments and fields, validating them against an agreed profile, and deliberately sending the awkward cases — the decimal, the missing field, the unexpected repeat — that expose where two systems quietly disagree.

2 The Rule

An integration that delivers a message is not an integration that delivers the right message. Two systems can both be “working” and still disagree about what a field means. Test the structure, not just the flow: validate every segment, field, and resource against the agreed conformance profile, and deliberately send the malformed, missing, and edge cases. The message is correct only when the receiver reconstructs exactly what the sender meant — not when something simply arrives.

3 The Analogy

Posting a parcel with a NZ Post courier label.

A courier label has fixed boxes — recipient name here, street number here, suburb here, postcode here. The system works because both the sender and the depot agree precisely which box holds which piece of information. If a sender writes the unit number in the street-name box, the parcel still “has an address” and still gets scanned, but it may turn up at the wrong door. Nothing failed loudly; the meaning just shifted between two readers of the same label.

An HL7 message is a courier label for clinical data, and a FHIR resource is the same idea written as a structured form. Each field is a labelled box. Tui Diagnostics and the practice system disagreed about exactly where one box ended and the next began, so a decimal slid into the wrong place — the parcel arrived, at the wrong door. An interoperability tester is the person who checks every box is filled with what the agreed label says belongs there, and who deliberately posts the tricky parcels — missing postcode, two recipients, an address with a decimal in it — to see what the depot does.

4 HL7 v2 Messages

HL7 version 2 is the older but still dominant messaging standard in health. It is a text-based format that carries events between systems — a patient admitted, a result ready, an order placed. You do not need to memorise the specification, but you must be able to read the shape of a message and reason about its parts.

An HL7 v2 message is a series of segments, one per line, each starting with a three-letter code. Within a segment, fields are separated by a delimiter (commonly the pipe character), and fields can themselves be split into components by a sub-delimiter. A simplified result message looks like this:

MSH|^~\&|TUILAB|TUI|GPSYS|PRACTICE|20260605103000||ORU^R01|MSG0001|P|2.4
PID|1||ZAB1234^^^NHI||WIREMU^TANE||19800101|M
OBR|1||ORD987|SODIUM^Serum sodium
OBX|1|NM|NA^Sodium||14.5|mmol/L|135-145|L|||F

Reading it as a tester:

MSH is the message header — who sent it, who receives it, the message type (here ORU^R01, an observation result), and the HL7 version. A wrong version or message type is an immediate structural fail.
PID is the patient identification segment, and it is where the NHI lives (the ZAB1234^^^NHI value in PID-3). From Lesson 1, this is the field a patient-safety tester scrutinises most.
OBR is the observation request — what test was ordered.
OBX is the observation result itself — the value (14.5), its units (mmol/L), its reference range (135-145), and an abnormal flag (L for low). The Tui Diagnostics decimal bug lived exactly here, in the boundary between the value and the units.

The structural concepts a tester checks are cardinality (how many times a segment or field may or must appear — is PID required, can OBX repeat?), data type (is OBX-2 declared NM for a numeric result, and does the value actually parse as one?), and delimiters (are the field and component separators consistent, the exact disagreement that bit Tui). Reading the message and asking “is each field where the structure says it should be, and does its value match its declared type?” is the core of HL7 v2 testing.
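
The reading above can be sketched in a few lines of Python. This is a minimal illustration, not a full HL7 v2 parser: it assumes segments are separated by carriage returns, takes the field separator from the MSH segment, and ignores escape sequences, repetitions, and sub-components. The helper names (`parse_segments`, `field`, `check_obx`) are made up for this example.

```python
from decimal import Decimal, InvalidOperation

# The sample result message from this section, joined with carriage returns.
MESSAGE = "\r".join([
    r"MSH|^~\&|TUILAB|TUI|GPSYS|PRACTICE|20260605103000||ORU^R01|MSG0001|P|2.4",
    "PID|1||ZAB1234^^^NHI||WIREMU^TANE||19800101|M",
    "OBR|1||ORD987|SODIUM^Serum sodium",
    "OBX|1|NM|NA^Sodium||14.5|mmol/L|135-145|L|||F",
])


def parse_segments(message):
    """Split a message into segments; each is a list where index n holds field n."""
    field_sep = message[3]  # MSH-1: the character straight after "MSH"
    segments = []
    for line in message.split("\r"):
        if not line:
            continue
        fields = line.split(field_sep)
        if fields[0] == "MSH":
            # MSH-1 is the separator itself, so re-insert it to keep MSH-n at index n.
            fields = ["MSH", field_sep] + fields[1:]
        segments.append(fields)
    return segments


def field(segment, n):
    """Return field n of a segment, or an empty string if the segment is too short."""
    return segment[n] if n < len(segment) else ""


def check_obx(segment):
    """Return findings for one OBX segment; an empty list means nothing to raise."""
    findings = []
    value_type, value, units = field(segment, 2), field(segment, 5), field(segment, 6)
    if value_type == "NM":
        try:
            Decimal(value)  # data type check: does OBX-5 really parse as a number?
        except InvalidOperation:
            findings.append(f"OBX-5 {value!r} does not parse as NM")
    if not units:
        findings.append("OBX-6 (units) is empty")
    return findings


segments = parse_segments(MESSAGE)
msh, pid, obr, obx = segments
print("Message type:", field(msh, 9), "version:", field(msh, 12))
print("PID-3 components:", field(pid, 3).split("^"))
print("OBX findings:", check_obx(obx))
```

Swapping OBX-5 for a value such as `7.8 mmol/L` makes `check_obx` report that the value does not parse as NM, which is the kind of field-level finding an arrival-only test never sees.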

5 FHIR Resources

FHIR — Fast Healthcare Interoperability Resources — is the modern standard, and it is where NZ health interoperability is heading, including the national Hira platform. Where HL7 v2 is pipe-delimited text, FHIR represents the same clinical concepts as structured resources, usually exchanged as JSON or XML over a web API. The sodium result above, as a simplified FHIR Observation resource, looks like this:

{
  "resourceType": "Observation",
  "status": "final",
  "code": { "coding": [ { "system": "http://loinc.org", "code": "2951-2", "display": "Sodium" } ] },
  "subject": { "reference": "Patient/ZAB1234" },
  "valueQuantity": { "value": 14.5, "unit": "mmol/L", "system": "http://unitsofmeasure.org", "code": "mmol/L" },
  "referenceRange": [ { "low": { "value": 135 }, "high": { "value": 145 } } ]
}

Notice how the same pieces reappear, now as named structured fields rather than positional ones. For a tester, the key concepts are:

resourceType: which kind of resource this is — Patient, Observation, Encounter, ServiceRequest. The wrong type, or a type the receiver does not handle, is a structural fail.
References between resources: the subject points to a Patient resource. A FHIR record is a web of linked resources, and a reference to a patient who does not exist, or to the wrong patient, is the FHIR version of the Lesson 1 wrong-patient risk.
valueQuantity and coded values: the value carries an explicit unit and a unit code, and the test is identified by a coding system. Because the unit is its own named field, the FHIR design makes the Tui decimal-and-units confusion harder — but you still test that the value parses as a number and the unit is the one expected.
status and required elements: an Observation has a required status (for example final, preliminary, or amended). A result marked preliminary but treated as final, or a missing required element, is a meaningful defect.
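
The four checks in that list can be expressed as a short script. This is a hedged sketch rather than a replacement for a real FHIR validator: `check_observation` is a hypothetical helper, the status set is the list of Observation status codes from FHIR R4, and the reference check assumes a simple relative reference of the form `Patient/<id>`.

```python
import json
from decimal import Decimal

OBSERVATION_JSON = """
{
  "resourceType": "Observation",
  "status": "final",
  "code": { "coding": [ { "system": "http://loinc.org", "code": "2951-2", "display": "Sodium" } ] },
  "subject": { "reference": "Patient/ZAB1234" },
  "valueQuantity": { "value": 14.5, "unit": "mmol/L", "system": "http://unitsofmeasure.org", "code": "mmol/L" },
  "referenceRange": [ { "low": { "value": 135 }, "high": { "value": 145 } } ]
}
"""

# Status codes defined for Observation in FHIR R4.
KNOWN_STATUSES = {
    "registered", "preliminary", "final", "amended",
    "corrected", "cancelled", "entered-in-error", "unknown",
}


def check_observation(resource, expected_unit):
    """Return a list of findings; an empty list means the checks below all passed."""
    findings = []
    if resource.get("resourceType") != "Observation":
        findings.append("resourceType is not Observation")
    if resource.get("status") not in KNOWN_STATUSES:
        findings.append("status is missing or not a known Observation status")
    reference = resource.get("subject", {}).get("reference", "")
    if not reference.startswith("Patient/"):
        findings.append("subject does not reference a Patient")
    quantity = resource.get("valueQuantity", {})
    value = quantity.get("value")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float, Decimal)):
        findings.append(f"valueQuantity.value {value!r} is not a JSON number")
    if quantity.get("unit") != expected_unit:
        findings.append(f"unit is {quantity.get('unit')!r}, expected {expected_unit!r}")
    return findings


# parse_float=Decimal keeps 14.5 as an exact decimal instead of a binary float.
observation = json.loads(OBSERVATION_JSON, parse_float=Decimal)
print(check_observation(observation, expected_unit="mmol/L"))
```

A check like this only confirms that the reference has the right shape. Whether `Patient/ZAB1234` resolves to the correct patient still has to be tested against the receiving system.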

Pro tip: HL7 v2 and FHIR carry the same clinical meaning in different shapes. When you move between them, test the translation itself — the mapping from an OBX result into an Observation resource is a common integration point, and it is exactly where a value, unit, or patient reference can be dropped or transformed wrongly. Many NZ systems run both standards side by side, so the v2-to-FHIR bridge is prime testing ground.
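
A translation test can be sketched as a mapping function plus a field-by-field comparison of source and output. Everything named here is hypothetical: `obx_to_observation` stands in for the bridge under test, `STATUS_MAP` covers only two status codes as an assumed mapping, and the emitted resource is trimmed to the fields being compared (the test code is omitted).

```python
from decimal import Decimal

# Assumed mapping from OBX-11 result status to Observation.status; the real
# table belongs to the agreed mapping specification.
STATUS_MAP = {"F": "final", "P": "preliminary"}


def obx_to_observation(obx_line, patient_id):
    """Map one numeric OBX segment to a minimal Observation-shaped dict."""
    f = obx_line.split("|") + [""] * 12  # pad so short segments do not raise IndexError
    if f[2] != "NM":
        raise ValueError(f"unsupported value type {f[2]!r}")
    if f[11] not in STATUS_MAP:
        raise ValueError(f"unmapped result status {f[11]!r}")
    if not f[6]:
        raise ValueError("OBX-6 (units) is empty")
    return {
        "resourceType": "Observation",
        "status": STATUS_MAP[f[11]],
        "subject": {"reference": f"Patient/{patient_id}"},
        # Decimal raises on non-numeric text, so "see comment" is never coerced.
        "valueQuantity": {"value": Decimal(f[5]), "unit": f[6]},
    }


def translation_findings(obx_line, observation):
    """Compare the source OBX with the emitted resource, field by field."""
    f = obx_line.split("|")
    findings = []
    if observation["valueQuantity"]["value"] != Decimal(f[5]):
        findings.append("value changed in translation")
    if observation["valueQuantity"]["unit"] != f[6]:
        findings.append("unit changed in translation")
    return findings


obx = "OBX|1|NM|NA^Sodium||14.5|mmol/L|135-145|L|||F"
obs = obx_to_observation(obx, "ZAB1234")
print(obs["status"], obs["valueQuantity"], translation_findings(obx, obs))
```

In a real project the comparison would run against the resource the integration engine actually emits, with the mapping function replaced by a call to that engine.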

6 Conformance Profiles

A base standard like HL7 v2 or FHIR is deliberately broad — it has to serve every country and every use. A conformance profile narrows it for a specific context: which segments or resources are required, which fields are mandatory, what code sets and identifier systems must be used, and what the cardinalities are. It is the agreed contract that turns “valid HL7” into “valid for our exchange.”

This distinction is the heart of conformance testing. A message can be perfectly valid against the base standard and still violate the profile that two NZ systems agreed to. For example, the base FHIR Patient resource does not require any particular identifier, but an NZ profile will require the NHI as a specific identifier system. A patient resource with no NHI is valid base FHIR and a profile failure — and from Lesson 1 you know it is also a patient-safety risk.

So conformance testing happens at two levels, and you test both:

Against the base standard: is this structurally a legal HL7 v2 message or FHIR resource at all — correct segments, valid data types, parseable structure?
Against the agreed profile: does it meet the stricter local contract — the NHI present as the required identifier, the mandated code system used (such as a recognised terminology for the test or diagnosis), required fields populated, cardinalities respected?

The profile is your test oracle. When you have one, “is this message correct?” becomes a concrete, checkable question rather than a judgement call. When a project does not have a written profile, surfacing that absence is itself a valuable testing outcome — because without an agreed contract, two systems will eventually disagree, exactly as Tui Diagnostics and the practice did.
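
The two levels can be illustrated with a toy check on a Patient resource. This is a sketch under stated assumptions, not a validator: the `NHI_SYSTEM` URI and the "exactly one NHI" rule are assumptions standing in for whatever the agreed profile actually says, and the function names are hypothetical.

```python
# Assumed NHI identifier system URI; take the real one from the agreed profile.
NHI_SYSTEM = "https://standards.digital.health.nz/ns/nhi-id"


def base_findings(patient):
    """Level 1: is this plausibly a legal Patient resource at all?"""
    findings = []
    if patient.get("resourceType") != "Patient":
        findings.append("base: resourceType is not Patient")
    if not isinstance(patient.get("identifier", []), list):
        findings.append("base: identifier must be a list")
    return findings


def profile_findings(patient):
    """Level 2: does it meet the stricter local contract (NHI required)?"""
    nhis = [
        ident.get("value")
        for ident in patient.get("identifier", [])
        if ident.get("system") == NHI_SYSTEM
    ]
    if len(nhis) != 1 or not nhis[0]:
        return ["profile: exactly one populated NHI identifier is required"]
    return []


def validate(patient):
    """Run the profile level only once the base level has passed."""
    findings = base_findings(patient)
    return findings if findings else profile_findings(patient)


no_nhi = {"resourceType": "Patient", "name": [{"family": "Wiremu", "given": ["Tane"]}]}
with_nhi = dict(no_nhi, identifier=[{"system": NHI_SYSTEM, "value": "ZAB1234"}])

print("No NHI, base level:   ", base_findings(no_nhi))
print("No NHI, profile level:", profile_findings(no_nhi))
print("With NHI, both levels:", validate(with_nhi))
```

The Patient with no NHI passes the base level and fails the profile level, which is the distinction this section describes.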

Pro tip: Validation tooling exists for both standards, and FHIR in particular has validators that check a resource against a named profile. The exact tool matters less than the workflow: run the message or resource through a validator against the agreed profile, and treat each reported error and warning as a finding to triage, not noise to ignore.

7 Negative and Edge Cases

Happy-path interoperability testing — send a clean result, see it arrive — catches almost none of the defects that matter. The real failures live in the cases a healthy sender rarely produces but a real-world sender eventually will. A serious interoperability test suite is mostly negative and edge cases:

Missing required field: a message with no NHI, no result status, or no units. Does the receiver reject it clearly, or silently accept a half-message?
Wrong data type: a numeric result field carrying text (“see comment”), or a date in the wrong format. Does it error, or guess?
Boundary values: values with decimal places (the Tui case), very large numbers, negative values, zero, and empty strings.
Unexpected repeats and cardinality: two PID segments, an Observation with multiple values, a referral with no patient. Does the receiver handle the count the profile allows, and reject the count it does not?
Encoding and special characters: macrons in NZ names (ā, ē, ī, ō, ū), the delimiter character appearing inside a data value, and non-ASCII content. Does the name survive the round trip intact?
Status and amendment: a preliminary result later amended or cancelled. Does the receiver update or supersede the earlier value, or leave a stale result on the record?
Out-of-order and duplicate messages: an amendment arriving before the original, or the same message delivered twice. Does the record end up correct?
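
One way to build such a suite is to start from a known-good segment and change one field at a time, so each variant isolates a single awkward case. The sketch below does that for the OBX segment. `with_field` and the variant names are illustrative, and the simple pipe split assumes no escaped delimiters inside the data.

```python
BASE_OBX = "OBX|1|NM|NA^Sodium||14.5|mmol/L|135-145|L|||F"


def with_field(segment, index, value):
    """Return a copy of the segment with one field replaced."""
    fields = segment.split("|")
    fields[index] = value
    return "|".join(fields)


# Each variant changes exactly one field, so a failure points at one cause.
VARIANTS = {
    "missing units": with_field(BASE_OBX, 6, ""),
    "missing status": with_field(BASE_OBX, 11, ""),
    "text in numeric field": with_field(BASE_OBX, 5, "see comment"),
    "no decimal": with_field(BASE_OBX, 5, "14"),
    "negative value": with_field(BASE_OBX, 5, "-1.5"),
    "zero": with_field(BASE_OBX, 5, "0"),
    "empty value": with_field(BASE_OBX, 5, ""),
    "very large value": with_field(BASE_OBX, 5, "99999999999.9"),
    "preliminary status": with_field(BASE_OBX, 11, "P"),
}

for name, segment in VARIANTS.items():
    print(f"{name:24} {segment}")
```

Each variant still needs an expected receiver behaviour written next to it (reject, accept, or supersede); generating the messages is only the first half of the test.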

The macron case deserves special weight in NZ. A name like Māori or Ngāti that loses its macron in transit is both a data-quality defect and a respect-and-accuracy issue, and character encoding is a classic place for it to break silently between two systems.
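
The silent break is easy to reproduce. The sketch below encodes a name as UTF-8 and then shows three ways a receiver can get it wrong: decoding with the wrong character set, forcing it to ASCII, and returning a differently normalised form. `compare_names` is a hypothetical helper, and whether a normalisation-only difference counts as a defect is an assumption left to the agreed profile.

```python
import unicodedata


def compare_names(sent, received):
    """Classify a round trip as 'identical', 'normalisation differs', or 'corrupted'."""
    if sent == received:
        return "identical"
    if unicodedata.normalize("NFC", sent) == unicodedata.normalize("NFC", received):
        return "normalisation differs"
    return "corrupted"


name = "Ng\u0101ti"  # \u0101 is a-macron as a single precomposed character
wire_bytes = name.encode("utf-8")

decoded_correctly = wire_bytes.decode("utf-8")
decoded_as_latin1 = wire_bytes.decode("latin-1")  # receiver assumed the wrong charset
stripped = name.encode("ascii", errors="replace").decode("ascii")  # macron becomes "?"
decomposed = unicodedata.normalize("NFD", name)  # "a" followed by a combining macron

for label, received in [
    ("utf-8 both ends", decoded_correctly),
    ("decoded as latin-1", decoded_as_latin1),
    ("forced to ascii", stripped),
    ("decomposed form", decomposed),
]:
    print(f"{label:20} {compare_names(name, received)}")
```

The decomposed form looks the same on screen as the original, which is why the evidence for this test should be a character-level comparison rather than a visual check.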

8 Building Interoperability Test Cases

A strong interoperability test case names the standard and profile, asserts on specific segments or fields, and states the expected receiver behaviour — including rejection for the bad cases. The assertion is at the field level, not “the result appeared.”

Here is a worked test case for the Tui Diagnostics decimal bug:

Test ID:            INT-OBX-031
Standard / profile: HL7 v2.4 ORU^R01, lab-results profile
Clinical risk:      Corrupted numeric result (wrong value reaches clinician)
Segment / field:    OBX-2 (value type NM), OBX-5 (value), OBX-6 (units)
Pre-conditions:     Receiver configured with the agreed delimiters and profile.
Action:             Send an ORU result where OBX-5 = 14.5 and OBX-6 = mmol/L.
Expected result:    1) Receiver stores the value as exactly 14.5 (not 145).
                    2) Units stored as mmol/L in their own field (from OBX-6),
                       not merged into the value.
                    3) Abnormal flag and reference range round-trip unchanged.
Negative variants:  Value with no decimal (14), negative (-1.5), and text in a
                    numeric field; the text case must be rejected, not coerced.
Evidence required:  Sent message; received/stored field values; validator report
                    against the profile.
Traceability:       Interop risk register IR-04 (numeric corruption across boundary).
Result:             [Pass / Fail]

Notice what makes this catch the Hook bug: it asserts on specific fields (OBX-5 and OBX-6) and their exact values, it includes the decimal boundary as the primary case with negative variants, and it demands the field-level stored values plus a validator report as evidence — not a screenshot of a result on a screen. The standard and profile are named at the top, so the reviewer knows the contract being tested.
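
The expected results of INT-OBX-031 can be automated as a field-level comparison. The sketch below assumes the receiver's stored record can be read back as a simple dict. That dict shape and the `round_trip_failures` helper are hypothetical, and the function covers only the positive case where the sent OBX-5 is numeric.

```python
from decimal import Decimal, InvalidOperation


def round_trip_failures(sent_obx, stored):
    """Compare the sent OBX fields with the stored record; return a list of failures."""
    f = sent_obx.split("|")
    failures = []
    try:
        stored_value = Decimal(stored["value"])
    except InvalidOperation:
        failures.append(f"value: stored {stored['value']!r} is not numeric")
    else:
        # Numeric comparison: 145 and 14.5 differ, which is the Hook bug.
        if stored_value != Decimal(f[5]):
            failures.append(f"value: sent {f[5]}, stored {stored['value']}")
    if stored["units"] != f[6]:
        failures.append(f"units: sent {f[6]}, stored {stored['units']}")
    if stored["reference_range"] != f[7]:
        failures.append("reference range changed")
    if stored["abnormal_flag"] != f[8]:
        failures.append("abnormal flag changed")
    return failures


sent = "OBX|1|NM|NA^Sodium||14.5|mmol/L|135-145|L|||F"
good = {"value": "14.5", "units": "mmol/L", "reference_range": "135-145", "abnormal_flag": "L"}
hook_bug = dict(good, value="145")
merged = dict(good, value="14.5 mmol/L", units="")

print(round_trip_failures(sent, good))
print(round_trip_failures(sent, hook_bug))
print(round_trip_failures(sent, merged))
```

The negative variants need a separate assertion that the receiver rejected the message, since there is no stored record to compare against when rejection works as intended.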

9 Common Mistakes

🚫 Testing that a message arrives instead of what it contains

Why it happens: “A result appeared on the screen” feels like the integration works.
The fix: Arrival proves the channel, not the meaning — the Tui decimal arrived too. Assert on specific segments and fields, comparing sent and stored values exactly, so a corrupted value or misplaced unit is caught at the field level.

🚫 Validating only against the base standard, not the agreed profile

Why it happens: A validator says “valid HL7” or “valid FHIR” and that looks like a pass.
The fix: A message can be valid base HL7/FHIR and still break the local contract — for example a Patient with no NHI is valid base FHIR but fails an NZ profile. Validate against the agreed conformance profile, which is your real oracle.

🚫 Only sending clean, happy-path messages

Why it happens: Test data is usually generated by a healthy sender that never produces the awkward cases.
The fix: Real senders eventually produce missing fields, wrong data types, decimals, repeats, and amendments. Deliberately send malformed and edge messages and assert the receiver rejects or handles each correctly — that is where interoperability defects live.

🚫 Ignoring character encoding and NZ names with macrons

Why it happens: ASCII test names round-trip fine, so encoding never gets exercised.
The fix: Names with macrons (ā, ē, ī, ō, ū) and delimiter characters inside data values are classic silent-corruption points. Test that a Māori name survives the round trip exactly — it is both a data-quality and a respect issue.

10 Now You Try

Three graded exercises across HL7 v2, FHIR, and conformance. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot the Message Defects

Read the simplified HL7 v2 result message below, sent to an NZ GP practice. Identify 3 structural or clinical defects a tester should raise, and say what each could do to the patient record.

    MSH|^~\&|LAB|LAB|GPSYS|PRACTICE|20260605||ORU^R01|M1|P|2.4
    PID|1||^^^NHI||SMITH^JOHN||19750312|M
    OBR|1||ORD55|GLUC^Glucose
    OBX|1|TX|GLUC^Glucose||7.8 mmol/L||3.0-7.7|H|||P

List 3 defects and the impact of each:

Model answer

There are at least four real defects here; any three, well explained, earn full marks.

1. PID NHI is empty — the NHI component before ^^^NHI is blank, so the patient has no reliable identifier. Impact: the result may be matched by name+DOB and land on the wrong patient (the Lesson 1 risk). A profile requiring the NHI would reject this.

2. OBX-2 declared TX (text) for a numeric result — the value type is TX but the value is a quantitative measurement. Impact: a numeric glucose is stored as free text, so it cannot be trend-graphed, range-checked, or reliably parsed; it should be NM.

3. Value and units merged in OBX-5 — "7.8 mmol/L" sits in the value field with OBX-6 (units) empty, instead of value 7.8 and units mmol/L in their own fields. Impact: the value will not parse as a number; this is the Tui-style value/units confusion.

Bonus: status flag P (preliminary) — the result is preliminary, not final (F). Impact: if the receiver treats it as final, a clinician may act on a result still subject to change; the amended/final version must supersede it.

The trap: this message still "arrives" and shows a glucose on screen, so an arrival-only test passes it.

🔧 Exercise 2 of 3 — Fix the Test Case

The interoperability test case below only checks that a message arrives. Rewrite it to validate structure and meaning, with these fields: Test ID, Standard / profile, Clinical risk, Field(s) under test, Pre-conditions, Action, Expected result, Negative variants, Evidence required, Traceability. Use a fictional FHIR Observation lab result sent to an NZ clinical system as the context.

Original (too shallow):
“Send a FHIR Observation to the system. Check it is received. Pass if the result shows up.”

Rewrite as a structure-and-profile interoperability test case:

Model answer

Test ID: INT-FHIR-018

Standard / profile: FHIR R4 Observation, NZ lab-results profile (NHI required as patient identifier; recognised unit and test code systems)

Clinical risk: Corrupted/mis-mapped result or wrong patient reference

Field(s) under test: Observation.status, Observation.subject (Patient reference / NHI), Observation.valueQuantity (value + unit + unit code), Observation.code (test coding), Observation.referenceRange

Pre-conditions: Receiver loaded with the agreed NZ profile and validator; a known Patient resource with a valid NHI exists.

Action: POST a final Observation with valueQuantity value 14.5 unit mmol/L, subject referencing the known patient's NHI, and a valid test code.

Expected result: 1) Resource validates against the named profile with no errors. 2) Stored value is exactly 14.5 with unit mmol/L (value and unit not merged). 3) subject resolves to the correct Patient by NHI. 4) status final is stored as final; referenceRange round-trips.

Negative variants: missing NHI (must be rejected per profile); status preliminary (must not be treated as final); value as a string (must be rejected); subject referencing a non-existent Patient (must be rejected, not stored against a guess).

Evidence required: the sent resource; the validator report against the profile; the stored field values; the resolved patient reference.

Traceability: Interop risk register IR-02 (mis-mapped value/unit) and IR-03 (wrong patient reference).

What makes it strong: it names the standard and the NZ profile, asserts on specific fields and their exact values, validates against the profile (not just base FHIR), includes negative variants with required rejections, and demands field-level evidence plus the validator report — not "the result showed up".

🏗️ Exercise 3 of 3 — Build a Conformance Test Plan

Design a conformance and edge-case test plan of 5 test cases for a fictional HL7 v2 to FHIR translation bridge in an NZ integration engine (it receives HL7 v2 ORU results and emits FHIR Observations). Each case needs at least: an ID, what it verifies, an acceptance criterion, and the evidence required. Cover the value/unit mapping, the NHI/patient reference mapping, a missing required field, a macron-bearing Māori name, and a status/amendment case.

Model answer

BR-01 | Verifies: OBX value and units map to FHIR valueQuantity correctly | Acceptance criteria: OBX-5 value (incl. decimals like 14.5) maps to valueQuantity.value exactly and OBX-6 maps to valueQuantity.unit; value and unit never merged | Evidence required: source HL7 message; emitted FHIR resource; field-by-field comparison

BR-02 | Verifies: the PID NHI maps to the FHIR patient reference/identifier | Acceptance criteria: the NHI in PID becomes the correct Patient reference/identifier in the Observation.subject; no name+DOB fallback | Evidence required: source PID; emitted subject/identifier; resolution to the correct Patient

BR-03 | Verifies: a missing required field is handled, not silently emitted | Acceptance criteria: an HL7 message missing a profile-required field (e.g. NHI or status) produces a rejection/error, not a partial FHIR resource | Evidence required: the deficient source message; the rejection/error; confirmation no resource was stored

BR-04 | Verifies: a Māori name with macrons round-trips intact | Acceptance criteria: a name containing ā ē ī ō ū is preserved character-for-character from HL7 to the emitted FHIR resource; no mojibake or stripping | Evidence required: source name; emitted name; character-level comparison

BR-05 | Verifies: a status/amendment is reflected correctly | Acceptance criteria: a preliminary result emits status preliminary (not final); a later amendment supersedes the prior Observation rather than duplicating it | Evidence required: source messages (preliminary then amended); emitted status values; the supersede/amend linkage

Strong plans: each case is specific, has a measurable criterion, names concrete evidence, and together they cover value/unit mapping, NHI/patient mapping, a missing required field, macron handling, and status/amendment. Weak plans say "check the translation works" five times — that is the difference being marked.

11 Self-Check

Q1: Why is “the result arrived” not enough to prove an integration works?

Arrival proves the channel carried something, not that the meaning survived. Two systems can both be working and still disagree about what a field means — the Tui decimal arrived and was wrong. You must assert on the specific segments and fields, comparing sent and stored values exactly, not just that a result appeared.

Q2: What is the difference between an HL7 v2 message and a FHIR resource?

They carry the same clinical meaning in different shapes. HL7 v2 is pipe-delimited text made of segments (MSH, PID, OBR, OBX) with positional fields. FHIR is structured resources (Patient, Observation) with named fields, usually JSON or XML over a web API. Many NZ systems run both, so the v2-to-FHIR translation is a key integration point to test.

Q3: What is a conformance profile and why is it your test oracle?

A profile narrows a broad base standard to a specific context — which fields are required, which code systems and identifiers must be used, what the cardinalities are. It is the agreed local contract. A message can be valid base HL7/FHIR yet break the profile (for example, a Patient with no NHI), so the profile is what makes “is this correct?” a checkable question.

Q4: Why should most interoperability tests be negative and edge cases?

Because happy-path messages from a healthy sender rarely expose the disagreements between two systems. The defects live in missing fields, wrong data types, decimals and boundaries, unexpected repeats, amendments, and encoding. Deliberately sending those awkward cases is what catches the silent corruption a clean message never reveals.

Q5: Why is testing macrons in NZ names an interoperability concern?

Character encoding is a classic place for silent corruption between two systems. A name with macrons (ā, ē, ī, ō, ū) can lose them or become garbled in transit, which is both a data-quality defect and a respect-and-accuracy issue. Test that a Māori name survives the round trip character-for-character.

12 Interview Prep

Real questions asked in NZ QA interviews for health integration roles. Read the model answers, then practise your own version.

“How would you test an HL7 v2 lab-results integration between a lab and a GP system?”

I’d start from the message structure, not the channel. I read the segments — MSH for the message type and version, PID for the patient and the NHI, OBR for the order, OBX for the result value, units, reference range, and status. Then I validate against the agreed conformance profile, because valid base HL7 is not the same as valid for our exchange. The bulk of my testing is negative and edge cases: a missing NHI, a numeric field carrying text, decimal values, unexpected repeats, amendments, and macron-bearing names. For each I assert on the exact stored field values, comparing them to what was sent — arrival alone proves nothing, because a corrupted result still arrives.

“What is the difference between validating against the base standard and against a profile?”

The base standard tells you whether something is a legal HL7 message or FHIR resource at all — correct structure, valid data types. The profile is the stricter local contract two systems agreed to: which fields are mandatory, which code systems and identifiers must be used, what cardinalities apply. A message can pass the base standard and fail the profile — the classic NZ example is a FHIR Patient with no NHI, which is valid base FHIR but fails an NZ profile that requires it. I treat the profile as my oracle, and if a project has no written profile I raise that, because without an agreed contract two systems will eventually disagree.

“A numeric result came through wrong after a system upgrade. How would you approach it?”

My first hypothesis is a structural disagreement at a field boundary — a delimiter or value/units mismatch, like a decimal sliding because two systems parse the observation segment differently. I’d capture the raw message as sent and compare it field-by-field against what the receiver stored, focusing on the value and units fields. Then I’d reproduce with boundary values — decimals, negatives, no-decimal, and text in a numeric field — and check the value type is declared correctly. The fix I’d verify is that the value parses and stores exactly, units stay in their own field, and a non-numeric value is rejected rather than coerced. I’d also confirm there is a profile and validator in the pipeline so this is caught structurally next time.
