Business Continuity & Resilience

Ensuring the business survives the unthinkable. Learn how to verify failover and build resilient systems.

1 The Hook

A major NZ insurer suffered a data centre outage during a storm. Their Disaster Recovery (DR) plan said they would be back up in 4 hours. It actually took 3 days.

Why? Because while they had the "Plan," they had never **tested** it under real load. They found that their database restoration scripts failed when trying to process 10TB of real data. A plan that hasn't been tested is just a hallucination.

2 The Rule

Resilience is an Engineering Practice, not a Paper Document. You don't "have" a BCP; you "verify" a BCP through continuous failure testing.

3 Watch Me Do It: Verifying the "Safe Zone"

Observe how a Test Manager verifies an organisation's survival metrics against its RTO and RPO targets.

  • Recovery Time Objective (RTO): The clock starts the moment the server dies. We must verify that the "Auto-Failover" completes in under 15 minutes.
  • Recovery Point Objective (RPO): We kill the database mid-transaction. We verify that the "Secondary Site" holds every transaction committed up to 5 minutes before the failure, so no more than 5 minutes of data is lost.
  • Connectivity Check: We verify that the Payment Gateway and NZ Post API automatically point to the new IP address without manual intervention.
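
The RTO and RPO checks above reduce to two subtractions on timestamps captured during the drill. Here is a minimal Python sketch, assuming the drill records three timestamps. The names `failed_at`, `restored_at`, `last_replicated_at` and the function `check_failover` are hypothetical; the 15-minute and 5-minute targets come from the bullets above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FailoverResult:
    rta: timedelta          # time from failure to service restored
    rpo_actual: timedelta   # age of the newest data on the secondary at failure
    rto_met: bool
    rpo_met: bool


def check_failover(failed_at, restored_at, last_replicated_at,
                   rto=timedelta(minutes=15), rpo=timedelta(minutes=5)):
    """Compare measured recovery figures against the business targets."""
    rta = restored_at - failed_at
    rpo_actual = failed_at - last_replicated_at
    return FailoverResult(rta, rpo_actual, rta <= rto, rpo_actual <= rpo)


# Hypothetical drill: server killed at 14:00, service back at 14:08,
# newest transaction on the secondary committed at 13:57.
result = check_failover(
    failed_at=datetime(2024, 1, 9, 14, 0),
    restored_at=datetime(2024, 1, 9, 14, 8),
    last_replicated_at=datetime(2024, 1, 9, 13, 57),
)
print(result.rta, result.rto_met)          # 0:08:00 True
print(result.rpo_actual, result.rpo_met)   # 0:03:00 True
```

In a real drill the timestamps would come from monitoring and replication logs rather than being typed in by hand.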

4 Resilience Lab: Chaos Engineering

In this lab, you must design a "Controlled Failure" to verify the Resilience of a new cloud-based banking app.

The Analogy

The Controlled Burn. Foresters burn small patches of land to prevent a massive wildfire. We kill small servers to prevent a massive outage.

Your Task: The Server Kill

You are testing a "High Availability" cluster. You have 3 servers. You want to kill Server #1 during peak traffic.

WORK THROUGH THESE STEPS:

  1. The Hypothesis: What should happen? (e.g. Load balancer shifts traffic to #2 and #3).
  2. The Verification: How do we prove no users were logged out?
  3. The Rollback: How do we bring Server #1 back online safely?

Design your "Chaos Test." What is the one critical failure you would test first? Write it in your notes.
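
One way to make the three steps concrete is to write the hypothesis as an assertion before anything is killed. The sketch below is a toy in-memory simulation in Python, not a real chaos tool. The `Cluster` class, its round-robin routing, and the assumption that sessions live in a store shared by all servers are hypothetical.

```python
class Cluster:
    """Toy model of a load-balanced cluster with a shared session store."""

    def __init__(self, servers):
        self.healthy = {name: True for name in servers}
        self.sessions = {}      # session_id -> user; shared, so it survives a node loss
        self._next = 0

    def route(self):
        """Return the next healthy server, round-robin."""
        names = list(self.healthy)
        for _ in names:
            name = names[self._next % len(names)]
            self._next += 1
            if self.healthy[name]:
                return name
        raise RuntimeError("no healthy servers left")

    def kill(self, name):
        self.healthy[name] = False

    def restore(self, name):
        self.healthy[name] = True


cluster = Cluster(["server-1", "server-2", "server-3"])
cluster.sessions = {"s1": "aroha", "s2": "ben", "s3": "mei"}

# 1. Hypothesis: after server-1 dies, traffic only reaches server-2 and server-3.
before = set(cluster.sessions)
cluster.kill("server-1")
served_by = {cluster.route() for _ in range(30)}
assert served_by == {"server-2", "server-3"}

# 2. Verification: the set of live sessions is unchanged, so nobody was logged out.
assert set(cluster.sessions) == before

# 3. Rollback: restore server-1 and confirm it takes traffic again.
cluster.restore("server-1")
assert "server-1" in {cluster.route() for _ in range(30)}
```

Against a real cluster, the same three assertions would be backed by load balancer logs, the actual session store, and health check results.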

5 RTO/RPO Calculation Worksheets

RTO and RPO are not IT numbers; they are business numbers. The CFO or Ops director defines them. Your job is to verify the technology can meet them.

RTO/RPO Definition Matrix

| Metric | Definition | Example |
| --- | --- | --- |
| RTO (Recovery Time Objective) | Max time the business can survive without the system. Measured from "system fails" to "system back online." | Banking app: 15 minutes. Ticketing app: 4 hours. |
| RPO (Recovery Point Objective) | Max data loss the business can tolerate. If the primary fails at 2:00 PM, we can recover data up to 1:55 PM (RPO = 5 minutes). | Core banking: 1 minute RPO. Reporting: 1 hour RPO. |
| RTA (Recovery Time Actual) | How long it actually took to recover in a real or simulated failure. Should be ≤ RTO. | Your recent failover test took 8 minutes. RTA = 8 min. |
| RPO Actual | How much data was actually lost. Should be ≤ RPO. | The backup was 3 minutes behind when primary failed. RPO Actual = 3 min. |

NZ-Specific Risk: New Zealand's geography means many companies use single data centres. In case of earthquake, fire, or flood, a single-DC strategy can be catastrophic. Mandate geographic redundancy (e.g. Auckland primary, Wellington secondary) for any Sev 1 system.

RTO/RPO Calculation Template

Example: Payment Processing System

| Component | Failure Impact | Recovery Steps | Time (mins) |
| --- | --- | --- | --- |
| Detect Failure | Monitoring alerts go off | Alerting system triggers (automated) | 1 |
| Manual Approval | On-call engineer confirms failure | Page on-call, they review alerts | 5 |
| Activate Secondary | Switch traffic to backup DC | Update DNS, enable secondary DB | 3 |
| Data Sync Check | Verify backup is current | Check replication lag | 2 |
| Smoke Tests | Verify system is online | Run automated health checks | 4 |
| TOTAL RTA | | | 15 |

Result: RTO target is 15 minutes. RTA is 15 minutes. We PASS (just barely), with zero margin. If the on-call engineer takes 10 minutes to respond instead of 5, RTA rises to 20 minutes and we fail RTO.
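
The worksheet arithmetic can be kept as a small script so that "what if" questions are cheap to ask. This Python sketch uses the step durations from the payment processing example and assumes the steps run strictly one after another. The function names `total_rta` and `verdict` are hypothetical.

```python
RTO_MINUTES = 15

# Step durations in minutes, copied from the payment processing worksheet.
steps = {
    "Detect Failure": 1,
    "Manual Approval": 5,
    "Activate Secondary": 3,
    "Data Sync Check": 2,
    "Smoke Tests": 4,
}


def total_rta(step_minutes):
    """RTA is the sum of the sequential recovery steps."""
    return sum(step_minutes.values())


def verdict(step_minutes, rto=RTO_MINUTES):
    rta = total_rta(step_minutes)
    return f"RTA {rta} min vs RTO {rto} min: {'PASS' if rta <= rto else 'FAIL'}"


print(verdict(steps))  # RTA 15 min vs RTO 15 min: PASS

# What-if: the on-call engineer takes 10 minutes instead of 5.
slow_response = {**steps, "Manual Approval": 10}
print(verdict(slow_response))  # RTA 20 min vs RTO 15 min: FAIL
```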

6 DR Testing Frameworks: Tabletop to Full Failover

You cannot rely on full production failovers alone to test DR. Use a pyramid: tabletop exercises at the base, limited failovers in the middle, full failover tests at the top.

DR Testing Pyramid

| Test Level | What You Test | Frequency | Cost / Risk |
| --- | --- | --- | --- |
| 1. Tabletop Exercise | Walk through the DR playbook on paper. "If primary fails, what's step 1?" No systems involved. | Quarterly | Low cost, low risk |
| 2. Component Test | Test ONE component. E.g. "Can we restore the database from backup?" Don't activate secondary yet. | Quarterly | Low cost, low risk |
| 3. Limited Failover | Activate secondary, but keep primary online. Route 5% of traffic to secondary. Measure RTA and data consistency. | Twice yearly | Medium cost, medium risk |
| 4. Full Failover Test | Kill primary completely. All traffic goes to secondary. Measure RTA and RPO Actual against the RTO and RPO targets. Full team involved. | Annually | High cost (1 full day), high impact |
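
The frequencies in the pyramid can be turned into a simple overdue check. In this Python sketch the day counts are rough approximations of "quarterly", "twice yearly" and "annually". The test history and the function name `overdue_tests` are hypothetical.

```python
from datetime import date, timedelta

# Maximum gap between runs, derived from the pyramid's frequencies.
# The day counts are rough approximations (assumption).
MAX_GAP = {
    "Tabletop Exercise": timedelta(days=92),
    "Component Test": timedelta(days=92),
    "Limited Failover": timedelta(days=183),
    "Full Failover Test": timedelta(days=366),
}


def overdue_tests(last_run, today):
    """Return the test levels whose last run is older than the allowed gap.

    A level that has never been run (missing from last_run) is always overdue.
    """
    overdue = []
    for level, gap in MAX_GAP.items():
        ran = last_run.get(level)
        if ran is None or today - ran > gap:
            overdue.append(level)
    return overdue


# Hypothetical test history.
last_run = {
    "Tabletop Exercise": date(2024, 5, 1),
    "Component Test": date(2024, 1, 15),
    "Limited Failover": date(2024, 3, 1),
}
print(overdue_tests(last_run, today=date(2024, 6, 30)))
# ['Component Test', 'Full Failover Test']
```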

Tabletop Exercise Template (60 minutes)

Scenario: "It's Tuesday 9 AM. The primary data centre loses all power. What happens?"

  • 0-5 min: Who detects the failure first? How? (Monitoring, customer complaints, internal alerts?)
  • 5-10 min: Who gets called? What's the escalation path? (On-call engineer → Tech Lead → VP Ops?)
  • 10-20 min: Who decides to activate secondary? (VP Ops? CIO? Product Lead?) What's the approval process?
  • 20-40 min: Walk through the actual failover steps: DNS changes, database sync, load balancer config, communication to customers.
  • 40-60 min: Debrief: What gaps did we find? What's unclear? Who owns the actions to fix it?

Common Tabletop Failure

Ask "Who owns the DNS change?" and the room goes silent. No one knows. This is a critical gap. Assign ownership in the playbook.

7 NZ Earthquake & Natural Disaster Resilience

New Zealand sits on the Pacific Ring of Fire. Earthquakes, flooding, and volcanic activity are real risks. Your BCP must account for geographic failures, not just server failures.

NZ Resilience Checklist

Before your next release, verify:

  • Primary DC and Secondary DC are in different geographic regions (e.g. Auckland + Wellington, NOT two buildings in Auckland).
  • Backup power systems (UPS + Generators) are tested. In the Christchurch 2011 quake, one hospital's generator had been running continuously for 3 months and failed.
  • Network connectivity has redundant ISP routes (if one ISP's fibre is cut, traffic re-routes via another ISP).
  • Staff can work remotely. If the office is damaged, the team must be able to execute the DR plan from home.
  • Out-of-band communication plan exists (if the main office is destroyed, how do you reach the on-call engineer? Via SMS? Slack? Phone tree?).
  • Critical vendor contacts (AWS, hosting provider, ISP) are documented in a hardcopy runbook stored outside the office.

Government Requirement: If your company handles government data or provides critical services (banking, health), the Public Service Commission (formerly the State Services Commission) and MBIE require evidence of BCP testing. Keep test reports and RTA measurements to prove compliance.

8 Common Mistakes

⚠ Testing BCP on "Mock" Data

Why it fails: Restoration scripts that work for 1GB of data often fail at 10TB. You must test with **Real Volume** (masked for privacy) to verify restoration times.
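
A back-of-envelope calculation shows how far apart the two volumes are even in the best case. This Python sketch assumes a sustained restore throughput of 200 MB/s, an illustrative figure that is not from any real system. The function name `min_restore_hours` is hypothetical.

```python
def min_restore_hours(data_bytes, throughput_bytes_per_sec):
    """Lower bound on restore time: data size divided by sustained throughput."""
    return data_bytes / throughput_bytes_per_sec / 3600


GB = 10**9
TB = 10**12
THROUGHPUT = 200 * 10**6   # 200 MB/s, an assumed sustained restore rate

small = min_restore_hours(1 * GB, THROUGHPUT)
large = min_restore_hours(10 * TB, THROUGHPUT)

print(f"1 GB:  {small * 3600:.0f} seconds")   # 1 GB:  5 seconds
print(f"10 TB: {large:.1f} hours")            # 10 TB: 13.9 hours
```

This linear estimate is only a floor. It says nothing about scripts that break outright at scale, which is why the measurement has to come from a restore at real volume.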

⚠ Ignoring "Configuration Drift"

Why it fails: The Primary site is patched, but the Secondary site is forgotten. When you fail over, the software versions don't match, and the system crashes.
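
Drift of this kind can be caught before a failover by comparing version manifests from both sites. This is a minimal Python sketch. The component names, the version numbers and the function `find_drift` are hypothetical, and how the manifests are collected is left out.

```python
def find_drift(primary, secondary):
    """Return {component: (primary_version, secondary_version)} for every mismatch.

    A component missing from one site is reported with None on that side.
    """
    drift = {}
    for component in sorted(set(primary) | set(secondary)):
        p, s = primary.get(component), secondary.get(component)
        if p != s:
            drift[component] = (p, s)
    return drift


# Hypothetical version manifests collected from each site.
primary = {"app": "4.2.1", "postgres": "15.6", "openssl": "3.0.13"}
secondary = {"app": "4.1.0", "postgres": "15.6"}

for component, (p, s) in find_drift(primary, secondary).items():
    print(f"{component}: primary={p} secondary={s}")
# app: primary=4.2.1 secondary=4.1.0
# openssl: primary=3.0.13 secondary=None
```

Running a check like this on a schedule, and failing the build when the result is non-empty, keeps the Secondary from being forgotten.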

⚠ Single Data Centre in NZ

Why it fails (NZ-specific): One earthquake, one fire, one flood, and your entire business is down. Secondary DC must be in a different region with different power, network, and physical infrastructure.

9 Self-Check

Q1. What is "Chaos Engineering"?

The practice of intentionally injecting failures into a system to prove it can survive them. It moves testing from "Reactive" to "Proactive."

Q2. Who defines the "Success" of a BCP test?

The Business Stakeholders. They define the RTO/RPO; the Test Manager proves whether the technology meets those business needs.

Q3. What's the difference between RTO and RPO?

RTO is time to recovery (how fast the system must come back). RPO is data loss (how much data the business can afford to lose). A system with a 15-minute RTO and a 5-minute RPO must be back online within 15 minutes and may lose at most the last 5 minutes of data.

Q4. Why should NZ companies avoid single data centre strategies?

New Zealand is earthquake-prone. A single DC in Auckland could be destroyed, taking down the entire business. Use geographic redundancy across regions.