Test Manager BCP & Resilience

Business Continuity & Resilience

Ensuring the business survives the unthinkable. Learn how to verify failover and build resilient systems.

1 The Hook

A major NZ insurer suffered a data centre outage during a storm. Their Disaster Recovery (DR) plan said they would be back up in 4 hours. It actually took 3 days.

Why? Because while they had the "Plan," they had never **tested** it under real load. Their database restoration scripts failed when they tried to process 10 TB of real data. A plan that hasn't been tested is just a hallucination.

2 The Rule

Resilience is an Engineering Practice, not a Paper Document. You don't "have" a BCP; you "verify" a BCP through continuous failure testing.

Senior engineer insight

The most dangerous BCP document is the one that was accurate 18 months ago. I've seen organisations pass annual DR audits with a playbook that hadn't been touched since the cloud migration — every IP address, every hostname, every runbook step referenced infrastructure that no longer existed. The moment I started treating BCP test reports like regression test suites (version-controlled, date-stamped, owned by a named person), the gap between documented procedure and actual capability collapsed within two cycles.

Most common Test Manager mistake: signing off a BCP test that passed on synthetic data volumes without ever validating against production-scale data, then discovering during an actual incident that restoration takes four times as long as the RTO target.

From the field

A Wellington-based financial services company had a BCP certified against the NZISM resilience controls and GCIO information security requirements — auditors were satisfied, executive sign-off was in place. When a scheduled maintenance window triggered an unplanned failover to their Hamilton secondary site, the team assumed the automated DNS cutover would take three minutes as documented. What they discovered was that a configuration change three months earlier had silently broken the replication health check, and the secondary database was running 47 minutes behind primary — obliterating their 5-minute RPO target in a non-emergency scenario. The lesson that generalises: NZISM compliance demonstrates intent; only a live failover test with production-representative data demonstrates capability, and those two things are not the same.
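One way to guard against a silently broken health check is to treat a stale check as a failure in its own right. The sketch below is illustrative only: the function name, the status strings, and the 2-minute staleness limit are assumptions, not part of any particular monitoring product.

```python
from datetime import datetime, timedelta

def replication_status(lag, last_check_at, now, rpo,
                       max_check_age=timedelta(minutes=2)):
    """Classify replication health; a stale health check counts as a failure."""
    if now - last_check_at > max_check_age:
        return "STALE_CHECK"   # the check itself has stopped reporting
    if lag > rpo:
        return "RPO_BREACH"    # secondary is further behind than the RPO allows
    return "OK"

now = datetime(2024, 5, 1, 9, 0)   # hypothetical timestamp
rpo = timedelta(minutes=5)

# Secondary 47 minutes behind, check ran 30 seconds ago.
print(replication_status(timedelta(minutes=47), now - timedelta(seconds=30), now, rpo))
# Lag looks fine, but the check last reported 90 days ago.
print(replication_status(timedelta(0), now - timedelta(days=90), now, rpo))
```

The design choice worth noting is the order of the checks: a lag figure is only trusted if the check that produced it is recent.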

3 Watch Me Do It: Verifying the "Safe Zone"

Observe how a Test Manager verifies an organisation's survival metrics against its RTO and RPO targets.

Recovery Time Objective (RTO): The clock starts the moment the server dies and stops when the service is usable again. We must verify that the "Auto-Failover" completes in under 15 minutes.
Recovery Point Objective (RPO): We kill the database mid-transaction. We verify that the "Secondary Site" holds every transaction committed up to, at most, 5 minutes before the failure.
Connectivity Check: We verify that the Payment Gateway and NZ Post API integrations automatically point to the new IP address without manual intervention.
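The first two checks come down to timestamp arithmetic. A minimal sketch, assuming you can capture three timestamps from a failover test (when the primary failed, when service was restored, and the commit time of the last transaction present on the secondary); the function names and sample times are hypothetical:

```python
from datetime import datetime, timedelta

RTO = timedelta(minutes=15)   # target from the walkthrough above
RPO = timedelta(minutes=5)

def measure_failover(failed_at, restored_at, last_replicated_at):
    """Return (time to recover, data-loss window) for one failover test."""
    time_to_recover = restored_at - failed_at
    data_loss_window = failed_at - last_replicated_at
    return time_to_recover, data_loss_window

def passes(time_to_recover, data_loss_window, rto=RTO, rpo=RPO):
    """True only if both measurements are within their targets."""
    return time_to_recover <= rto and data_loss_window <= rpo

failed_at = datetime(2024, 5, 1, 14, 0, 0)
recovered, lost = measure_failover(
    failed_at,
    restored_at=datetime(2024, 5, 1, 14, 8, 0),
    last_replicated_at=datetime(2024, 5, 1, 13, 57, 0),
)
print(recovered, lost, passes(recovered, lost))
```

With these sample times the service is back in 8 minutes and the secondary is 3 minutes behind, so both targets are met.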

4 Resilience Lab: Chaos Engineering

In this lab, you must design a "Controlled Failure" to verify the Resilience of a new cloud-based banking app.

The Analogy

The Controlled Burn. Foresters burn small patches of land to prevent a massive wildfire. We kill small servers to prevent a massive outage.

Your Task: The Server Kill

You are testing a "High Availability" cluster. You have 3 servers. You want to kill Server #1 during peak traffic.

WORK THROUGH THESE STEPS:

The Hypothesis: What should happen? (e.g. Load balancer shifts traffic to #2 and #3).
The Verification: How do we prove no users were logged out?
The Rollback: How do we bring Server #1 back online safely?

Design your "Chaos Test." What is the one critical failure you would test first? Write it in your notes.
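To compare your design against something concrete, here is a toy simulation of the three steps (hypothesis, verification, rollback). It is a sketch only: the `Cluster` class is an in-memory stand-in, not a real load balancer, and it assumes sessions live in a store shared by all servers.

```python
import itertools

class Cluster:
    """Toy stand-in for a load-balanced cluster with a shared session store."""

    def __init__(self, servers):
        self.healthy = set(servers)
        self.sessions = {}             # shared by every server (assumption)
        self._counter = itertools.count()

    def login(self, user):
        self.sessions[user] = True

    def kill(self, server):
        self.healthy.discard(server)

    def restore(self, server):
        self.healthy.add(server)

    def route(self, user):
        """Handle one request round-robin; return the server that served it."""
        if not self.healthy:
            raise RuntimeError("no healthy servers")
        if user not in self.sessions:
            raise PermissionError(f"{user} was logged out")
        pool = sorted(self.healthy)
        return pool[next(self._counter) % len(pool)]

def run_experiment(cluster, users, victim):
    for user in users:
        cluster.login(user)
    cluster.kill(victim)                                   # inject the failure
    try:
        # Verification: every user is still served (route raises if logged out).
        handled_by = {cluster.route(user) for user in users}
        hypothesis_holds = victim not in handled_by
    finally:
        cluster.restore(victim)                            # rollback always runs
    return hypothesis_holds, handled_by

cluster = Cluster(["server-1", "server-2", "server-3"])
ok, handled_by = run_experiment(cluster, ["ana", "ben", "cy", "dee"], "server-1")
print(ok, sorted(handled_by))
```

The rollback sits in a `finally` block so the killed server is restored even when verification fails, which is the property you want from a real experiment too.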

5 RTO/RPO Calculation Worksheets

RTO and RPO are not IT numbers; they are business numbers. The CFO or Ops director defines them. Your job is to verify the technology can meet them.

RTO/RPO Definition Matrix

Metric	Definition	Example
RTO (Recovery Time Objective)	Max time the business can survive without the system. Measured from "system fails" to "system back online."	Banking app: 15 minutes. Ticketing app: 4 hours.
RPO (Recovery Point Objective)	Max data loss the business can tolerate, expressed as a window of time. If the primary fails at 2:00 PM and RPO = 5 minutes, we must be able to recover all data written up to at least 1:55 PM.	Core banking: 1 minute RPO. Reporting: 1 hour RPO.
RTA (Recovery Time Actual)	How long it actually took to recover in a real or simulated failure. Should be ≤ RTO.	Your recent failover test took 8 minutes. RTA = 8 min.
RPO Actual	How much data was actually lost. Should be ≤ RPO.	The backup was 3 minutes behind when primary failed. RPO Actual = 3 min.

NZ-Specific Risk: New Zealand's small market and concentrated infrastructure mean many companies run from a single data centre. In an earthquake, fire, or flood, a single-DC strategy can be catastrophic. Mandate geographic redundancy (e.g. Auckland primary, Wellington secondary) for any Sev 1 system.

RTO/RPO Calculation Template

Example: Payment Processing System

Component	Failure Impact	Recovery Steps	Time (mins)
Detect Failure	Monitoring alerts go off	Alerting system triggers (automated)	1
Manual Approval	On-call engineer confirms failure	Page on-call, they review alerts	5
Activate Secondary	Switch traffic to backup DC	Update DNS, enable secondary DB	3
Data Sync Check	Verify backup is current	Check replication lag	2
Smoke Tests	Verify system is online	Run automated health checks	4
TOTAL RTA			15 mins

Result: RTO target is 15 minutes. RTA is 1 + 5 + 3 + 2 + 4 = 15 minutes. We PASS, but with zero margin. If the on-call engineer takes 10 minutes to respond instead of 5, RTA becomes 20 minutes and we fail RTO.
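The worksheet arithmetic is easy to automate so you can try "what if" scenarios. A minimal sketch using the step times from the table; the dictionary layout and function name are illustrative choices:

```python
RTO_MINUTES = 15

# Step durations in minutes, taken from the Payment Processing example.
steps = {
    "Detect Failure": 1,
    "Manual Approval": 5,
    "Activate Secondary": 3,
    "Data Sync Check": 2,
    "Smoke Tests": 4,
}

def total_rta(step_minutes):
    """Sum sequential recovery steps into a total RTA in minutes."""
    return sum(step_minutes.values())

baseline = total_rta(steps)
slow_on_call = total_rta({**steps, "Manual Approval": 10})

print(baseline, baseline <= RTO_MINUTES)          # 15 True
print(slow_on_call, slow_on_call <= RTO_MINUTES)  # 20 False
```

This assumes the steps run strictly one after another; if any run in parallel, a simple sum overstates the RTA.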

6 DR Testing Frameworks: Tabletop to Full Failover

You cannot rely on full production failovers as your only DR test; they are too costly and risky to run often. Use a pyramid: tabletop exercises at the base, component tests and limited failovers in the middle, full failover tests at the top.

DR Testing Pyramid

Test Level	What You Test	Frequency	Cost / Risk
1. Tabletop Exercise	Walk through the DR playbook on paper. "If primary fails, what's step 1?" No systems involved.	Quarterly	Low cost, low risk
2. Component Test	Test ONE component. E.g. "Can we restore the database from backup?" Don't activate secondary yet.	Quarterly	Low cost, low risk
3. Limited Failover	Activate secondary, but keep primary online. Route 5% of traffic to secondary. Measure RTA and data consistency.	Twice yearly	Medium cost, medium risk
4. Full Failover Test	Kill primary completely. All traffic goes to secondary. Measure RTA and RPO Actual against the RTO and RPO targets. Full team involved.	Annually	High cost (1 full day), high risk
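A schedule like this is only useful if someone notices when a test is missed. The sketch below flags overdue test levels; the day counts are an assumed approximation of "quarterly", "twice yearly" and "annually", and the last-run dates are made up for illustration.

```python
from datetime import date, timedelta

# Assumed day counts approximating the frequencies in the pyramid.
INTERVALS = {
    "Tabletop Exercise": timedelta(days=91),
    "Component Test": timedelta(days=91),
    "Limited Failover": timedelta(days=182),
    "Full Failover Test": timedelta(days=365),
}

def overdue_tests(last_run, today):
    """Return test levels never run, or last run longer ago than their interval."""
    overdue = []
    for level, interval in INTERVALS.items():
        ran = last_run.get(level)
        if ran is None or today - ran > interval:
            overdue.append(level)
    return overdue

# Hypothetical history: the full failover test has never been run.
last_run = {
    "Tabletop Exercise": date(2024, 3, 1),
    "Component Test": date(2023, 11, 15),
    "Limited Failover": date(2024, 1, 10),
}
print(overdue_tests(last_run, today=date(2024, 5, 1)))
```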

Tabletop Exercise Template (60 minutes)

Scenario: "It's Tuesday 9 AM. The primary data centre loses all power. What happens?"

0-5 min: Who detects the failure first? How? (Monitoring, customer complaints, internal alerts?)
5-10 min: Who gets called? What's the escalation path? (On-call engineer → Tech Lead → VP Ops?)
10-20 min: Who decides to activate secondary? (VP Ops? CIO? Product Lead?) What's the approval process?
20-40 min: Walk through the actual failover steps: DNS changes, database sync, load balancer config, communication to customers.
40-60 min: Debrief: What gaps did we find? What's unclear? Who owns the actions to fix it?

Common Tabletop Failure

"Who owns the DNS change?" — Cricket sounds. No one knows. This is a critical gap. Assign ownership in the playbook.
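If the playbook is kept as structured data under version control, the ownership gap can be caught before the tabletop rather than during it. A minimal sketch, assuming a hypothetical playbook format of steps with an `owner` field:

```python
# Hypothetical playbook; the step names and owners are examples only.
playbook = [
    {"step": "Confirm primary is down", "owner": "On-call engineer"},
    {"step": "Approve failover", "owner": "VP Ops"},
    {"step": "Update DNS to secondary", "owner": ""},
    {"step": "Notify customers", "owner": None},
]

def steps_without_owner(steps):
    """Return the names of playbook steps whose owner is missing or blank."""
    return [s["step"] for s in steps if not (s.get("owner") or "").strip()]

print(steps_without_owner(playbook))
```

Run as a check on every playbook change, this turns "no one knows" into a failing build instead of a surprise in the room.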

7 NZ Earthquake & Natural Disaster Resilience

New Zealand sits on the Pacific Ring of Fire. Earthquakes, flooding, and volcanic activity are real risks. Your BCP must account for geographic failures, not just server failures.

NZ Resilience Checklist

Before your next release, verify:

Primary DC and Secondary DC are in different geographic regions (e.g. Auckland + Wellington, NOT two buildings in Auckland).
Backup power systems (UPS + Generators) are tested. In the Christchurch 2011 quake, one hospital's generator had been running continuously for 3 months and failed.
Network connectivity has redundant ISP routes (if one ISP's fibre is cut, traffic re-routes via another ISP).
Staff can work remotely. If the office is damaged, the team must be able to execute the DR plan from home.
Out-of-band communication plan exists (if the main office is destroyed, how do you reach the on-call engineer? Via SMS? Slack? Phone tree?).
Critical vendor contacts (AWS, hosting provider, ISP) are documented in a hardcopy runbook stored outside the office.
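The first checklist item can be sanity-checked with a great-circle distance calculation. This is a rough sketch: the coordinates are approximate city-level values and the 100 km threshold is an assumption chosen for illustration, not a regulatory figure.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MIN_SEPARATION_KM = 100.0   # assumed threshold, adjust to your risk appetite

def distance_km(a, b):
    """Haversine distance between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def far_enough(a, b, minimum_km=MIN_SEPARATION_KM):
    return distance_km(a, b) >= minimum_km

# Approximate coordinates, for illustration only.
auckland_cbd = (-36.85, 174.76)
auckland_north_shore = (-36.80, 174.75)
wellington = (-41.29, 174.78)

print(round(distance_km(auckland_cbd, wellington)), far_enough(auckland_cbd, wellington))
print(far_enough(auckland_cbd, auckland_north_shore))
```

Distance is a necessary check, not a sufficient one: two distant sites can still share a power grid dependency or an ISP trunk, so the remaining checklist items still apply.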

Government Requirement: If your company handles government data or provides critical services (banking, health), the Public Service Commission (formerly the State Services Commission) and MBIE require evidence of BCP testing. Keep test reports and RTA measurements to prove compliance.

8 Common Mistakes

⚠ Testing BCP on "Mock" Data

Why it fails: Restoration scripts that work for 1 GB of data often fail at 10 TB. You must test with **Real Volume** (masked for privacy) to verify restoration times.
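A quick calculation shows why volume matters. The sketch below uses the 10 TB data set and the 4-hour recovery plan from the opening story; the 200 MB/s restore throughput is an assumed figure, so substitute the rate you actually measure at production scale.

```python
TB = 10**12   # decimal terabyte, in bytes
MB = 10**6    # decimal megabyte, in bytes

def restore_hours(data_bytes, throughput_bytes_per_s):
    """Hours needed to restore a data set at a sustained throughput."""
    return data_bytes / throughput_bytes_per_s / 3600

# Assumed sustained restore rate of 200 MB/s.
hours = restore_hours(10 * TB, 200 * MB)

# Throughput (in MB/s) needed to restore 10 TB within a 4-hour target.
needed_mb_per_s = 10 * TB / (4 * 3600) / MB

print(round(hours, 1), round(needed_mb_per_s))   # 13.9 694
```

At the assumed rate the restore alone takes nearly 14 hours, so a 4-hour target would need roughly 3.5 times the throughput before any other recovery step is counted.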

⚠ Ignoring "Configuration Drift"

Why it fails: The Primary site is patched, but the Secondary site is forgotten. When you fail over, the software versions don't match, and the system crashes.
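Drift is straightforward to detect if each site can export a manifest of component versions. A minimal sketch, assuming manifests are plain dictionaries; the component names and version numbers are invented for the example:

```python
def find_drift(primary, secondary):
    """Return {component: (primary_version, secondary_version)} where they differ.

    A component missing from one site is reported with None on that side.
    """
    drift = {}
    for name in sorted(primary.keys() | secondary.keys()):
        p, s = primary.get(name), secondary.get(name)
        if p != s:
            drift[name] = (p, s)
    return drift

# Hypothetical manifests exported from each site.
primary = {"app": "4.2.1", "postgres": "15.6", "openssl": "3.0.13"}
secondary = {"app": "4.2.1", "postgres": "15.2", "nginx": "1.24.0"}

for component, versions in find_drift(primary, secondary).items():
    print(component, versions)
```

The same comparison can be extended to certificate expiry dates or firewall rule hashes, as long as both sites export them in a comparable form.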

⚠ Single Data Centre in NZ

Why it fails (NZ-specific): One earthquake, one fire, one flood, and your entire business is down. Secondary DC must be in a different region with different power, network, and physical infrastructure.

Why teams fail here

RTO and RPO targets are set by IT, not the business — so they reflect what engineering thinks is achievable rather than what the organisation can actually survive, and nobody discovers the mismatch until an incident.
DR environments are treated as static clones that drift from production over months; configuration drift means the secondary site runs software versions, certificate chains, or firewall rules that will fail when they matter most.
BCP tests are scheduled during low-traffic windows with a skeleton crew — which means the human coordination bottlenecks (who approves the failover? who has production credentials?) are never stress-tested under realistic conditions.
New Zealand geographic risk is treated as theoretical: primary and secondary sites are in the same Auckland suburb (different buildings, same seismic zone, same ISP trunk), meaning a regional event takes both down simultaneously.

Key takeaway

A BCP that has never failed a test has never been tested — the only way to know your organisation will survive the real thing is to deliberately try to break it under controlled conditions before the crisis does it for you.

9 Self-Check

Q1. What is "Chaos Engineering"?

The practice of intentionally injecting failures into a system to prove it can survive them. It moves testing from "Reactive" to "Proactive."

Q2. Who defines the "Success" of a BCP test?

The Business Stakeholders. They define the RTO/RPO; the Test Manager proves whether the technology meets those business needs.

Q3. What's the difference between RTO and RPO?

RTO is time to recovery (how fast the system must come back). RPO is data loss (how much recent data you can afford to lose). A system with a 15-minute RTO and a 5-minute RPO must be back online within 15 minutes, having lost no more than the last 5 minutes of data.

Q4. Why should NZ companies avoid single data centre strategies?

New Zealand is prone to earthquakes, flooding, and volcanic activity. A single DC could be destroyed or cut off by one regional event, taking down the entire business. Use geographic redundancy across regions.
