Business Continuity & Resilience
Ensuring the business survives the unthinkable. Learn how to verify failover and build resilient systems.
1 The Hook
A major NZ insurer suffered a data centre outage during a storm. Their Disaster Recovery (DR) plan said they would be back up in 4 hours. It actually took 3 days.
Why? Because while they had the "Plan," they had never **tested** it under real load. They found that their database restoration scripts failed when trying to process 10TB of real data. A plan that hasn't been tested is just a hallucination.
2 The Rule
Resilience is an Engineering Practice, not a Paper Document. You don't "have" a BCP; you "verify" a BCP through continuous failure testing.
3 Watch Me Do It: Verifying the "Safe Zone"
Observe how a Test Manager verifies an organisation's survival metrics against its RTO and RPO targets.
- Recovery Time Objective (RTO): The clock starts the moment the server dies. We must verify that the "Auto-Failover" completes in under 15 minutes.
- Recovery Point Objective (RPO): We kill the database mid-transaction. We verify that the "Secondary Site" holds every transaction committed up to 5 minutes before the failure, so no more than 5 minutes of data is lost.
- Connectivity Check: We verify that the Payment Gateway and NZ Post API automatically point to the new IP address without manual intervention.
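The connectivity check above can be automated. The sketch below assumes the integrations are reached by hostname and that failover is DNS-based; the hostnames and IP addresses used with it are hypothetical placeholders, not real endpoints. The resolver is injectable so the check can be rehearsed without touching the network.

```python
import socket
from typing import Callable, Dict


def check_endpoints(
    expected: Dict[str, str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, bool]:
    """Return {hostname: True/False}: does each hostname resolve to the expected IP?"""
    results = {}
    for host, expected_ip in expected.items():
        try:
            results[host] = resolve(host) == expected_ip
        except OSError:
            # A failed lookup (socket.gaierror is an OSError) counts as a failed check.
            results[host] = False
    return results
```

After a failover drill, you would call `check_endpoints` with the secondary site's addresses and treat any `False` as a manual-intervention finding.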
4 Resilience Lab: Chaos Engineering
In this lab, you must design a "Controlled Failure" to verify the Resilience of a new cloud-based banking app.
The Controlled Burn. Foresters burn small patches of land to prevent a massive wildfire. We kill small servers to prevent a massive outage.
Your Task: The Server Kill
You are testing a "High Availability" cluster. You have 3 servers. You want to kill Server #1 during peak traffic.
WORK THROUGH THESE STEPS:
- The Hypothesis: What should happen? (e.g. Load balancer shifts traffic to #2 and #3).
- The Verification: How do we prove no users were logged out?
- The Rollback: How do we bring Server #1 back online safely?
Design your "Chaos Test." What is the one critical failure you would test first? Write it in your notes.
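Before touching a real cluster, it can help to rehearse the three steps on a toy model. The sketch below is a simulation only: the `Cluster` class, the round-robin routing, and the request counts are all assumptions made for illustration, and it does not model user sessions, which would have to be checked against the real app's session store.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Cluster:
    """Toy 3-server cluster behind a health-checked load balancer."""
    healthy: Set[int] = field(default_factory=lambda: {1, 2, 3})

    def route(self, request_id: int) -> int:
        # Round-robin over healthy servers only.
        servers = sorted(self.healthy)
        if not servers:
            raise RuntimeError("no healthy servers left")
        return servers[request_id % len(servers)]


def run_experiment(total: int = 300, kill_at: int = 100, restore_at: int = 200) -> List[int]:
    """Kill Server #1 at request `kill_at`, bring it back at `restore_at`.

    Returns the server that handled each request, in order.
    """
    cluster = Cluster()
    served_by = []
    for i in range(total):
        if i == kill_at:
            cluster.healthy.discard(1)   # the controlled failure
        if i == restore_at:
            cluster.healthy.add(1)       # the rollback
        served_by.append(cluster.route(i))
    return served_by
```

The hypothesis maps to "every request in the outage window is served by #2 or #3", the verification to "no request raised an error", and the rollback to "Server #1 receives traffic again after it is restored".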
5 RTO/RPO Calculation Worksheets
RTO and RPO are not IT numbers; they are business numbers. The CFO or Ops director defines them. Your job is to verify the technology can meet them.
RTO/RPO Definition Matrix
| Metric | Definition | Example |
|---|---|---|
| RTO (Recovery Time Objective) | Max time the business can survive without the system. Measured from "system fails" to "system back online." | Banking app: 15 minutes. Ticketing app: 4 hours. |
| RPO (Recovery Point Objective) | Max data loss the business can tolerate. If the primary fails at 2:00 PM, we can recover data up to 1:55 PM (RPO = 5 minutes). | Core banking: 1 minute RPO. Reporting: 1 hour RPO. |
| RTA (Recovery Time Actual) | How long it actually took to recover in a real or simulated failure. Should be ≤ RTO. | Your recent failover test took 8 minutes. RTA = 8 min. |
| RPO Actual | How much data was actually lost. Should be ≤ RPO. | The backup was 3 minutes behind when primary failed. RPO Actual = 3 min. |
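The four metrics in the matrix reduce to two subtractions and two comparisons. A minimal sketch, assuming you have three timestamps from a drill (when the primary failed, when service was restored, and the last transaction replicated to the secondary); the function names are illustrative, not from any particular tool.

```python
from datetime import datetime, timedelta
from typing import Tuple


def recovery_actuals(
    failed_at: datetime,
    restored_at: datetime,
    last_replicated_at: datetime,
) -> Tuple[timedelta, timedelta]:
    """Return (RTA, RPO Actual) for one real or simulated failure."""
    rta = restored_at - failed_at                 # how long we were down
    rpo_actual = failed_at - last_replicated_at   # how much data we lost
    return rta, rpo_actual


def meets_objectives(rta: timedelta, rpo_actual: timedelta,
                     rto: timedelta, rpo: timedelta) -> bool:
    """Pass only if both actuals are within the business targets."""
    return rta <= rto and rpo_actual <= rpo
```

Using the examples from the matrix: a failure at 2:00 PM, recovery at 2:08 PM, and a secondary that was 3 minutes behind give an RTA of 8 minutes and an RPO Actual of 3 minutes, which passes a 15-minute RTO and a 5-minute RPO.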
NZ-Specific Risk: New Zealand's geography means many companies use single data centres. In case of earthquake, fire, or flood, a single-DC strategy can be catastrophic. Mandate geographic redundancy (e.g. Auckland primary, Wellington secondary) for any Sev 1 system.
RTO/RPO Calculation Template
Example: Payment Processing System
| Component | Failure Impact | Recovery Steps | Time (mins) |
|---|---|---|---|
| Detect Failure | Monitoring alerts go off | Alerting system triggers (automated) | 1 |
| Manual Approval | On-call engineer confirms failure | Page on-call, they review alerts | 5 |
| Activate Secondary | Switch traffic to backup DC | Update DNS, enable secondary DB | 3 |
| Data Sync Check | Verify backup is current | Check replication lag | 2 |
| Smoke Tests | Verify system is online | Run automated health checks | 4 |
| TOTAL RTA | | | 15 |
Result: The RTO target is 15 minutes and the RTA is 15 minutes. We PASS, with zero margin. If the on-call engineer takes 10 minutes to respond instead of 5, the RTA rises to 20 minutes (1 + 10 + 3 + 2 + 4) and we fail the RTO.
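The worksheet arithmetic is easy to script so that "what if this step is slower?" questions can be answered quickly. This sketch uses the step times from the Payment Processing example and assumes the steps run strictly one after another; the function names are illustrative.

```python
from typing import Dict

RTO_MINUTES = 15

# Step durations in minutes, taken from the worksheet above.
STEPS = {
    "Detect Failure": 1,
    "Manual Approval": 5,
    "Activate Secondary": 3,
    "Data Sync Check": 2,
    "Smoke Tests": 4,
}


def total_rta(steps: Dict[str, int]) -> int:
    """Sum of sequential recovery steps, in minutes."""
    return sum(steps.values())


def margin(steps: Dict[str, int], rto: int = RTO_MINUTES) -> int:
    """Minutes to spare against the RTO; negative means the RTO is missed."""
    return rto - total_rta(steps)


# What-if: the on-call engineer takes 10 minutes instead of 5.
slow_response = {**STEPS, "Manual Approval": 10}
```

Here `margin(STEPS)` is 0 and `margin(slow_response)` is -5, which is why the manual approval step is the first place to look for time savings.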
6 DR Testing Frameworks: Tabletop to Full Failover
You cannot test DR only in production. Use a pyramid: tabletop exercises at the base, limited failovers in the middle, full failover tests at the top.
DR Testing Pyramid
| Test Level | What You Test | Frequency | Cost / Risk |
|---|---|---|---|
| 1. Tabletop Exercise | Walk through the DR playbook on paper. "If primary fails, what's step 1?" No systems involved. | Quarterly | Low cost, low risk |
| 2. Component Test | Test ONE component. E.g. "Can we restore the database from backup?" Don't activate secondary yet. | Quarterly | Low cost, low risk |
| 3. Limited Failover | Activate secondary, but keep primary online. Route 5% of traffic to secondary. Measure RTA and data consistency. | Twice yearly | Medium cost, medium risk |
| 4. Full Failover Test | Kill primary completely. All traffic goes to secondary. Measure RTA and RPO Actual against the RTO and RPO targets. Full team involved. | Annually | High cost (1 full day), high impact |
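For the Limited Failover level, the 5% split is normally configured in the load balancer or DNS layer rather than in application code. As an illustration of the idea only, the sketch below shows one common approach, hashing a stable request key into 100 buckets so the same user always lands on the same site; the `route` function and key format are assumptions.

```python
import hashlib


def route(request_key: str, secondary_percent: int = 5) -> str:
    """Deterministically send roughly `secondary_percent`% of keys to the secondary site."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # bucket in 0..99
    return "secondary" if bucket < secondary_percent else "primary"
```

Because the routing is a pure function of the key, a user does not bounce between sites mid-session, which makes data consistency findings easier to trace.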
Tabletop Exercise Template (60 minutes)
Scenario: "It's Tuesday 9 AM. The primary data centre loses all power. What happens?"
- 0-5 min: Who detects the failure first? How? (Monitoring, customer complaints, internal alerts?)
- 5-10 min: Who gets called? What's the escalation path? (On-call engineer → Tech Lead → VP Ops?)
- 10-20 min: Who decides to activate secondary? (VP Ops? CIO? Product Lead?) What's the approval process?
- 20-40 min: Walk through the actual failover steps: DNS changes, database sync, load balancer config, communication to customers.
- 40-60 min: Debrief: What gaps did we find? What's unclear? Who owns the actions to fix it?
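The debrief question "who owns this?" can be checked mechanically if the playbook is kept as structured data. A small sketch, assuming a hypothetical playbook format where each step is a dictionary with a `name` and an optional `owner`; the step names echo the failover steps above and the owner is a made-up role.

```python
from typing import Dict, List, Optional


def unowned_steps(playbook: List[Dict[str, Optional[str]]]) -> List[str]:
    """Return the names of playbook steps that have no owner assigned."""
    return [step["name"] for step in playbook if not step.get("owner")]


# Hypothetical playbook extract; owners are illustrative roles.
playbook = [
    {"name": "DNS change", "owner": None},
    {"name": "Database sync", "owner": "DBA on-call"},
    {"name": "Load balancer config", "owner": ""},
    {"name": "Customer communication"},
]
```

Running `unowned_steps(playbook)` before the tabletop gives the facilitator a ready-made list of gaps to raise.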
Common Tabletop Failure
"Who owns the DNS change?" — Cricket sounds. No one knows. This is a critical gap. Assign ownership in the playbook.
7 NZ Earthquake & Natural Disaster Resilience
New Zealand sits on the Pacific Ring of Fire. Earthquakes, flooding, and volcanic activity are real risks. Your BCP must account for geographic failures, not just server failures.
NZ Resilience Checklist
Before your next release, verify:
- Primary DC and Secondary DC are in different geographic regions (e.g. Auckland + Wellington, NOT two buildings in Auckland).
- Backup power systems (UPS + Generators) are tested. In the Christchurch 2011 quake, one hospital's generator had been running continuously for 3 months and failed.
- Network connectivity has redundant ISP routes (if one ISP's fibre is cut, traffic re-routes via another ISP).
- Staff can work remotely. If the office is damaged, the team must be able to execute the DR plan from home.
- Out-of-band communication plan exists (if the main office is destroyed, how do you reach the on-call engineer? Via SMS? Slack? Phone tree?).
- Critical vendor contacts (AWS, hosting provider, ISP) are documented in a hardcopy runbook stored outside the office.
Government Requirement: If your company handles government data or provides critical services (banking, health), the State Services Commission (renamed the Public Service Commission in 2020) and MBIE require evidence of BCP testing. Keep test reports and RTA measurements to prove compliance.
8 Common Mistakes
⚠ Testing BCP on "Mock" Data
Why it fails: Restoration scripts that work for 1GB of data often fail at 10TB. You must test with **Real Volume** (masked for privacy) to verify restoration times.
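A quick back-of-envelope calculation shows why volume matters. The sketch below assumes a sustained restore throughput of 200 MB/s, which is an illustrative figure and not a measured one, and uses decimal units (1 TB = 10^12 bytes). Linear scaling is a best case; real restores often slow down further at volume, so this complements a real-volume test rather than replacing it.

```python
def restore_hours(size_bytes: float, throughput_bytes_per_sec: float) -> float:
    """Best-case restore time in hours, assuming constant throughput."""
    return size_bytes / throughput_bytes_per_sec / 3600


GB = 10**9
TB = 10**12
ASSUMED_THROUGHPUT = 200 * 10**6   # 200 MB/s, an assumption for illustration

small_test = restore_hours(1 * GB, ASSUMED_THROUGHPUT)    # seconds, not hours
real_volume = restore_hours(10 * TB, ASSUMED_THROUGHPUT)  # well over half a day
```

At this assumed rate, the 1GB test finishes in about 5 seconds while the 10TB restore needs roughly 14 hours, so a 4-hour recovery plan would be missed on throughput alone.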
⚠ Ignoring "Configuration Drift"
Why it fails: The Primary site is patched, but the Secondary site is forgotten. When you fail over, the software versions don't match, and the system crashes.
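Drift is cheap to detect if both sites can report their installed versions. A minimal sketch, assuming each site exposes a simple component-to-version mapping; the component names and version strings below are made up for illustration.

```python
from typing import Dict, Optional, Tuple


def find_drift(
    primary: Dict[str, str],
    secondary: Dict[str, str],
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {component: (primary_version, secondary_version)} for every mismatch.

    A component missing from one site is reported with None on that side.
    """
    drift = {}
    for name in sorted(primary.keys() | secondary.keys()):
        p, s = primary.get(name), secondary.get(name)
        if p != s:
            drift[name] = (p, s)
    return drift


# Hypothetical inventories for the two sites.
primary_site = {"app": "2.4.1", "database": "15.3", "os-patch": "2024-05"}
secondary_site = {"app": "2.4.1", "database": "15.1"}
```

Running this comparison on a schedule, and failing the release if the result is non-empty, turns "we forgot the Secondary" into a visible finding before a failover rather than during one.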
⚠ Single Data Centre in NZ
Why it fails (NZ-specific): One earthquake, one fire, one flood, and your entire business is down. Secondary DC must be in a different region with different power, network, and physical infrastructure.
9 Self-Check
Q1. What is "Chaos Engineering"?
The practice of intentionally injecting failures into a system to prove it can survive them. It moves testing from "Reactive" to "Proactive."
Q2. Who defines the "Success" of a BCP test?
The Business Stakeholders. They define the RTO/RPO; the Test Manager proves whether the technology meets those business needs.
Q3. What's the difference between RTO and RPO?
RTO is time to recovery (how fast the system must come back). RPO is data loss (how much data the business can afford to lose). A system with a 15-minute RTO and a 5-minute RPO must be back online within 15 minutes and may lose at most 5 minutes of data.
Q4. Why should NZ companies avoid single data centre strategies?
New Zealand is earthquake-prone. A single DC in Auckland could be destroyed, taking down the entire business. Use geographic redundancy across regions.