20 min read · 9 self-checks · Updated June 2026

Resilience & Performance

Chaos Engineering & Resilience Testing

Chaos engineering intentionally breaks systems in controlled ways to find weaknesses before users do. Test how your system handles dependency failures, network issues, resource constraints, and cascading failures. Netflix popularised the practice with Chaos Monkey and changed how the industry thinks about testing.

Test Lead Senior

1 The Hook

An Auckland food-delivery platform is proud of its test coverage. Every happy path is green: place an order, pay, track the courier, leave a review. They have never had a serious outage, so nobody has ever asked what happens when a dependency fails.

One Friday night their payments provider has a 90-second wobble. The delivery app has no timeout on the payment call, so every order thread sits and waits. Threads pile up, the connection pool drains, and within two minutes the entire app is frozen — menus, tracking, the lot — over a dependency that was only briefly slow. A 90-second blip in one service took down everything for half an hour at peak dinner time.

The bug was never in the happy path, so no amount of happy-path testing would have found it. The only way to find it before a Friday night is to deliberately make the payment service slow in a controlled test and watch what the rest of the system does. That is chaos engineering: you break things on purpose, in a contained way, so reality cannot break them for you at the worst possible moment.

2 The Rule

Form a hypothesis about how the system should behave when something fails, define normal as measurable steady-state metrics, then inject one controlled failure inside a limited blast radius and check whether reality matches the hypothesis — with a kill switch ready the whole time.

3 The Analogy

A fire drill in an office building.

You do not wait for a real fire to find out whether the alarms work, the exits are unlocked, and people know where to go. You set off the alarm on purpose, at a planned time, with the fire service briefed — and you watch. Does everyone get out? Does the assembly point work? Did one stairwell turn out to be blocked? You find the gaps while it is safe to find them.

Chaos engineering is a fire drill for software. The injected failure is the planned alarm, steady-state metrics tell you whether people are calmly filing out, the blast radius keeps the drill to one floor, and the kill switch is the ability to call it off. A real fire is a terrible time to discover the exit is locked — and a real Friday-night outage is a terrible time to discover there was no timeout.

Senior engineer insight

The most dangerous failure mode in distributed systems is not the dependency going down — it is the dependency going slow. A crashed service trips circuit breakers, throws connection errors, and is easy to detect. A service that responds in 15 seconds instead of 150ms is almost invisible to most monitoring until your entire thread pool is full of requests quietly waiting. Chaos engineering changed how I instrument systems: now the first thing I check in any code review is not whether there is a timeout, but whether the timeout is short enough to matter.

Most common mistake: teams run chaos experiments in staging only, declare success, and never run them in production. Staging has a fraction of the traffic and different connection-pool sizes — the bugs that matter only appear at production load.

From the field

An Auckland SRE team had instrumented their AWS microservices carefully and ran Chaos Mesh experiments in their staging cluster for six months with zero surprises. When they finally ran a pod-kill experiment against their production payments service, a completely different failure emerged: the staging environment used a mock identity-verification service with 50ms latency, but the production RealMe integration had a 3-second P99. Killing one payments pod caused all in-flight RealMe calls to be retried — simultaneously — against the surviving pods, which had no jitter on reconnect. Three pods fell over in sequence as each absorbed the thundering-herd reconnect from the pod that just died before it. The fix was jittered exponential backoff on reconnect and a 400ms cap on the RealMe call timeout. They would never have found it without running in production. The lesson that generalises: staging is not the system you are trying to protect.

Senior Engineer Insight

Everyone focuses on what happens when the dependency goes down. Nobody talks about what happens when it comes back. I have seen more outages caused by the recovery than by the failure itself. You kill a service, queue up 40,000 retries, then restore it — and all 40,000 reconnect simultaneously and take it straight back down. We call it a thundering herd. The fix is jittered exponential backoff on every retry, but most teams only add that after they have lived through it once. Before you run any chaos experiment, ask: what does our reconnect behaviour look like? If the answer is "immediate retry with no jitter," fix that first or your kill switch will make things worse.
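A minimal sketch of jittered exponential backoff, assuming the "full jitter" variant (each delay is drawn uniformly between zero and an exponentially growing ceiling). The function names, the 100ms base, and the 10s cap are illustrative assumptions, not values from the incidents described above.

```python
import random


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Return a full-jitter delay in seconds for a zero-based retry attempt.

    The ceiling doubles with each attempt (base, 2*base, 4*base, ...) up to
    `cap`, and the actual delay is a random point below that ceiling, so
    clients that failed at the same instant do not reconnect at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry_schedule(attempts, base=0.1, cap=10.0, seed=None):
    """Return the list of delays one client would wait across `attempts` retries."""
    rng = random.Random(seed)
    return [backoff_delay(n, base, cap, rng) for n in range(attempts)]
```

Two clients given different seeds produce different schedules, which is the property that spreads a reconnect storm out over time instead of landing it on the recovering service all at once.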

What it is

Chaos engineering is the discipline of testing system resilience by deliberately introducing failures into production (or production-like environments) and observing how the system responds. Unlike traditional testing, which verifies happy paths, chaos testing verifies failure paths: what happens when a service is slow, when a database goes down, when the network loses packets, when a server runs out of memory?

The goal: build confidence that your system can degrade gracefully when things break. Users lose confidence when your service becomes unavailable after a single failure; they forgive brief slowness if the service keeps working.

Netflix chaos engineering: Netflix built “Chaos Monkey,” a tool that randomly terminates instances in production. Why? Because it forces their engineers to build systems that survive random failures. The practice it popularised, deliberate failure injection, is now widely adopted across the industry.

Core principles

Build confidence, not metrics. The goal is not to get a score or pass a test. The goal is to understand how your system fails and to be confident it fails safely.
Find weaknesses before users do. Run chaos experiments before you deploy. Find the cascading failure scenarios and fix them in staging, not in production.
Test both the system and your team. Chaos testing reveals not just code defects, but operational defects: missing runbooks, unclear alerts, unclear ownership.
Start small. Don’t kill the entire database. Introduce one failure at a time, observe, then introduce more.

Types of chaos experiments

Common chaos experiment types and examples

Failure type	What breaks	What to test
Dependency failure	Service A calls Service B; Service B goes down	Does Service A time out gracefully? Does it retry? Does it circuit-break and fail fast?
Latency injection	Service B responds, but very slowly (2s instead of 200ms)	Does Service A time out at its configured limit instead of waiting the full 2s? Does it use timeout-aware retries? Do cascading services queue or fail?
Resource exhaustion	CPU / memory / disk fills up	Does the service gracefully shed load? Do queues back up or overflow? Does monitoring alert before it crashes?
Network partition	Nodes cannot reach each other (packet loss, jitter)	Can the service split-brain? Do reads/writes remain consistent? Do nodes rejoin correctly?
Data corruption	Database writes garbage; cache returns stale data	Do validations catch corrupted data? Is there a rollback mechanism? Do integrity checks exist?

Designing a chaos experiment

A well-designed chaos experiment follows this structure:

1. Hypothesis

Start with a hypothesis: “If the payment service becomes unavailable, the order service will time out and return a 503 to the user (instead of crashing).” or “If the database latency spikes to 5s, the web server will not run out of connections.”

2. Steady-state metrics

Define what “normal” looks like: error rate, latency p50/p95/p99, throughput, queue depth, memory usage. Before you introduce the failure, establish a baseline.
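One way to capture that baseline, sketched under assumptions: the caller has one latency sample per request for the measurement window, and percentiles use the nearest-rank method (real monitoring tools may interpolate differently). The function names and dictionary keys are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def baseline(latencies_ms, error_count, window_seconds):
    """Summarise a measurement window into steady-state numbers.

    Assumes `latencies_ms` holds one entry per request, including failed ones.
    """
    total = len(latencies_ms)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "error_rate": error_count / total,
        "throughput_rps": total / window_seconds,
    }
```

Recording the same summary during the experiment gives a like-for-like comparison against the baseline.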

3. Blast radius

Define the scope of the failure. Don’t kill everything; start small:

Stage 1: Kill 1 of 10 instances of the dependent service.
Stage 2: Kill 5 of 10 instances.
Stage 3: Kill 100% (full outage).
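The staged escalation above can be sketched as a loop that stops at the first stage where the hypothesis fails. This is an illustration only: `hypothesis_holds` is a hypothetical callback the caller supplies (it would inject the failure, observe, and return True or False), and the stage fractions mirror the 1 of 10, 5 of 10, and 100% stages.

```python
import math

# Fractions of instances to kill per stage: 1 of 10, 5 of 10, then all.
STAGES = (0.1, 0.5, 1.0)


def instances_to_kill(total_instances, fraction):
    """Number of instances for a stage, never fewer than one."""
    return max(1, math.ceil(total_instances * fraction))


def run_stages(total_instances, hypothesis_holds, stages=STAGES):
    """Escalate through the stages and stop at the first one that fails.

    Returns a list of (instances_killed, hypothesis_held) tuples, one per
    stage that actually ran.
    """
    results = []
    for fraction in stages:
        killed = instances_to_kill(total_instances, fraction)
        held = hypothesis_holds(killed)
        results.append((killed, held))
        if not held:
            break  # do not widen the blast radius after a failed stage
    return results
```

Stopping on the first failure is the design choice that matters: a gap found at 1 of 10 instances is cheaper to find there than at 10 of 10.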

4. Apply the failure

Introduce the failure (latency, packet loss, service down) in a controlled way.

5. Observe

Watch metrics in real time: does error rate spike? Does latency increase? Do queues back up? Does the system stay healthy or degrade?

6. Verify the hypothesis

Did the system behave as expected? If yes, you gained confidence. If no, you found a gap that needs fixing.
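Verification is easier to automate when the hypothesis is written as explicit bounds on the steady-state metrics. A minimal sketch, assuming the observed metrics arrive as a dictionary with `p99_ms` and `error_rate` keys; the class, field names, and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    statement: str         # the plain-language hypothesis
    max_p99_ms: float      # p99 latency must stay at or below this
    max_error_rate: float  # error rate (0.0 to 1.0) must stay at or below this


def verify(hypothesis, observed):
    """Return a list of violations; an empty list means the hypothesis held."""
    violations = []
    if observed["p99_ms"] > hypothesis.max_p99_ms:
        violations.append(
            f"p99 {observed['p99_ms']}ms exceeds {hypothesis.max_p99_ms}ms"
        )
    if observed["error_rate"] > hypothesis.max_error_rate:
        violations.append(
            f"error rate {observed['error_rate']:.1%} exceeds {hypothesis.max_error_rate:.1%}"
        )
    return violations
```

An empty result builds confidence; a non-empty one is the list of gaps to turn into tickets.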

7. Automate and repeat

Once you’ve manually run the experiment, automate it so it runs on a schedule (nightly, weekly).

Kubernetes-specific chaos

Kubernetes makes it easy to run chaos experiments because you can kill, pause, or reschedule pods programmatically:

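Pod termination: Kill a pod and watch its controller (for example a Deployment's ReplicaSet) create a replacement, while readiness probes keep traffic away from the new pod until it is ready.
Network policies: Block traffic between pods (simulate a network partition).
Resource limits: Set a low memory limit on a pod and watch it get OOMKilled, or set a low CPU limit and watch it get throttled.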
Node failure: Drain a node (move all pods off it) and verify your services still run on remaining nodes.

Kubernetes chaos experiment: pod termination

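# Assumes Chaos Mesh 2.x, where recurring experiments use the Schedule kind
# rather than a scheduler field inside PodChaos.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: kill-order-service
  namespace: production
spec:
  schedule: "0 2 * * *"  # Run at 2am daily
  type: PodChaos
  concurrencyPolicy: Forbid  # Never overlap two runs
  historyLimit: 5  # Keep the last 5 runs for review
  podChaos:
    action: pod-kill  # One-shot action, so no duration is set
    mode: fixed
    value: "1"  # Kill 1 pod per run (the field is a string)
    selector:
      namespaces:
        - production
      labelSelectors:
        app: order-service
    gracePeriod: 30  # Give pod 30s to shut down gracefully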

Worked examples

Example 1: Database timeout handling

Hypothesis: If the database becomes slow (latency spikes from 50ms to 5s), the application will time out after 3s and return a user-friendly error, not hang for 30s.

Experiment: Use Toxiproxy (a network proxy) to inject 5s latency on all database queries.

Baseline (steady state): p99 latency 100ms, error rate 0%, throughput 1000 req/s.

During chaos: Database queries take 5s (injected latency) + application overhead. Measure: Do requests time out after 3s? Or do they wait the full 5s+?

Expected result: Application times out after 3s, returns HTTP 504 (Gateway Timeout), user sees error message, no hanging.

If it fails: Application hangs (no timeout configured), threads pile up, memory fills, service crashes. Add timeouts to database calls.
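The fix can be sketched in a few lines. This is an illustration using Python's asyncio, not the application's actual stack: `slow_query` is a hypothetical stand-in for the database call with injected latency, and `handle_request` is a hypothetical handler. In the example above the values would be 5s of latency against a 3s timeout.

```python
import asyncio


async def slow_query(delay_s):
    """Stand-in for a database query that takes `delay_s` seconds to answer."""
    await asyncio.sleep(delay_s)
    return {"rows": 1}


async def handle_request(query_delay_s, timeout_s):
    """Return (status, body), giving up on the query after `timeout_s` seconds."""
    try:
        result = await asyncio.wait_for(slow_query(query_delay_s), timeout=timeout_s)
        return 200, result
    except asyncio.TimeoutError:
        # The pending query is cancelled, so nothing is left hanging on the slow call.
        return 504, {"error": "The service is busy. Please try again shortly."}
```

Calling `asyncio.run(handle_request(5, 3))` exercises the hypothesis directly: the handler should answer with the 504 branch at about the 3s mark instead of waiting for the query.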

Example 2: Circuit breaker verification

Hypothesis: If the payment service becomes unavailable, the circuit breaker will open after 3 failed requests and fail fast (200ms) instead of waiting for timeouts (15s each).

Experiment: Kill the payment service and measure order service latency.

Expected result: First 3 requests time out (15s each). 4th request fails immediately (circuit open, 200ms). Circuit stays open for 60s. After 60s, the breaker allows a trial request to check whether the payment service has recovered.

If it fails: Circuit breaker is not configured or not working. Every request waits the full 15s timeout, cascading timeout failures to users.
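To make the expected behaviour concrete, here is a minimal circuit-breaker sketch with the same numbers as the hypothesis (open after 3 failures, stay open for 60s, then allow a trial call). It is a simplified illustration, not a production library: the class and exception names are hypothetical, it is not thread-safe, and the clock is injectable only so the behaviour can be checked without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, open_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("failing fast: circuit is open")
            # Open period has elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

During the experiment, the fourth request should raise `CircuitOpenError` without ever reaching the payment service, which is what keeps its latency near zero.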

Example 3: Cascading failure detection

Hypothesis: If Service A depends on Service B, and Service B depends on Service C, and Service C becomes slow, then Service A should not also become slow (due to connection pool exhaustion).

Experiment: Inject 10s latency on Service C.

Expected result: Service B times out quickly (has timeouts configured), returns errors, Service A sees errors and either retries or fails fast. Service A latency stays low because it doesn’t wait for Service C.

If it fails: Service B doesn’t time out, waits for Service C, exhausts connections, Service A piles up waiting for connections. Latency cascades.
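The arithmetic behind connection exhaustion follows Little's Law: average requests in flight equals arrival rate multiplied by time spent per request. A rough sketch follows; the throughput, pool size, and timeout figures used with it are illustrative assumptions, not measurements from this example.

```python
def in_flight_requests(throughput_rps, latency_ms, timeout_ms=None):
    """Little's Law: average concurrent requests = arrival rate x time per request.

    If a timeout is configured, no request is held longer than the timeout.
    """
    held_ms = latency_ms if timeout_ms is None else min(latency_ms, timeout_ms)
    return throughput_rps * held_ms / 1000


def pool_exhausted(throughput_rps, latency_ms, pool_size, timeout_ms=None):
    """True when steady-state demand for connections exceeds the pool."""
    return in_flight_requests(throughput_rps, latency_ms, timeout_ms) > pool_size
```

With an assumed 100 req/s and a pool of 100 connections, 200ms calls need about 20 connections. Injecting 10s of latency raises that to about 1,000, so the pool is exhausted. A 500ms timeout caps demand at about 50 connections, and the pool survives.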

Observability during chaos

Chaos experiments are only useful if you can see what’s happening. Ensure you have:

Real-time metrics: Error rate, latency (p50/p95/p99), throughput, connection pools, queue depth, CPU/memory.
Distributed tracing: See a request flow from Service A through B, C, and back. Watch it fail in Service C and see how it cascades.
Logs: Application logs showing timeouts, retries, circuit breaker state changes.
Alerts: If the chaos breaks something, do your on-call alerts fire? If not, your monitoring has a gap.

Safety and governance

Chaos experiments can break things. Implement guardrails:

Start in staging, not production. Prove the experiment works in a non-critical environment first.
Schedule carefully. Run during business hours (not 3am) so your team can respond if something goes wrong.
Set a blast radius. Don’t start by killing 100% of a service; kill 1 instance, then 2, then more.
Set a time limit. A chaos experiment should run for minutes, not hours. If latency is injected for 5 minutes, it stops after 5 minutes (not indefinitely).
Implement kill switches. If an experiment goes wrong, have a way to stop it immediately.
Document the experiment. Other teams need to know why you’re breaking the system and what to do if it gets out of hand.
Brief the team. Before running a chaos experiment in production, tell your on-call and your team lead. It’s not a surprise.
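The time limit and kill switch from the list above can be combined in one small wrapper. This is a sketch under assumptions: `inject`, `rollback`, and `read_error_rate` are hypothetical callbacks that the caller wires to the real chaos tool and metrics source, and the error-rate threshold is the only abort condition shown.

```python
import time


def run_with_guardrails(inject, rollback, read_error_rate, max_error_rate,
                        time_limit_s, poll_s=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Run one experiment with a time limit and an automatic kill switch.

    Returns "aborted" if the error rate crossed the threshold, otherwise
    "completed". The rollback runs in every case, including on exceptions.
    """
    started = clock()
    try:
        inject()
        while clock() - started < time_limit_s:
            if read_error_rate() > max_error_rate:
                return "aborted"  # kill switch: stop as soon as the guardrail trips
            sleep(poll_s)
        return "completed"
    finally:
        rollback()  # always remove the injected failure
```

Putting the rollback in a `finally` block means the failure is removed whether the experiment finishes, aborts, or crashes partway through.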

Tools

Chaos Mesh — Kubernetes-native; inject pod failures, network delays, stress tests. Open source.
Gremlin — SaaS platform; point-and-click chaos experiments, curated experiments, integrations with monitoring.
Toxiproxy — Lightweight network proxy; inject latency, packet loss, connection resets. Good for testing individual service pairs.
Pumba — Docker-specific chaos; kill containers, pause/unpause, inject stress.
Chaos Toolkit — Framework for writing custom chaos experiments in Python.

Best practices and anti-patterns

Never run unknown chaos experiments. Always understand what failure you’re introducing and why before you run the experiment.
Don’t just focus on individual service failures. Test cascading failures too: what happens to Services A and B when Service C fails or slows down?
Don’t ignore the results. If a chaos experiment reveals a gap (e.g., no timeout configured), fix it before deploying.
Don’t run chaos experiments only in staging. Staging doesn’t have the same load, traffic patterns, or data as production. Run in production (with guardrails).
Don’t assume monitoring will catch everything. Run the chaos experiment and manually verify the system is behaving as expected, not just that metrics look okay.
Don’t skip the hypothesis. Running random chaos experiments is noise. Start with a specific hypothesis and verify or falsify it.

4 Industry Reality

🏭 What you actually encounter on the job

Most teams don't run chaos in production — they run it in staging and call it done. Staging has a fraction of the traffic, different connection-pool sizes, and often mocked dependencies. Bugs that only appear under real load (connection exhaustion, cache stampede, noisy-neighbour resource contention) survive every staging experiment and bite you anyway. Senior engineers push for a small, well-guarded blast radius in production, not because they want to break things, but because they know staging isn't the same system.
Legacy monoliths make chaos experiments hard to scope. In a well-decomposed microservice, killing one pod has a clean blast radius. In a 10-year-old Rails monolith, injecting latency on one database call can ripple through shared thread pools, global connection objects, and synchronous workers in ways that are nearly impossible to predict. Teams on legacy stacks often have to inject failures at the load balancer or DNS level rather than at the service level, and they should document what they can't control.
Time pressure kills hypothesis discipline. The textbook says: write a hypothesis, measure steady state, then run the experiment. In practice, a team under deadline pressure runs `kubectl delete pod` on a staging service and "sees what happens." This produces noise, not insight. Senior testers push back: no hypothesis, no experiment. They block out 30 minutes to write a one-paragraph hypothesis before touching anything.
On-call engineers are often not told chaos experiments are happening. This is more common than it should be. The SRE team schedules a chaos run, doesn't update the on-call handover, and the on-call engineer gets paged at 2pm and spends 45 minutes diagnosing what was actually a deliberate failure. A pre-experiment Slack message and a calendar entry cost nothing and prevent this entirely.
In NZ, regulated industries (banking, health, utilities) require change advisory board (CAB) approval before any deliberate failure injection in production. At Harbour Bank NZ or a DHB running on-premise systems, you can't just spin up Chaos Mesh on a Friday afternoon. The experiment needs a risk assessment, a rollback plan, and sign-off from both the IT risk team and the application owner. Build that lead time into your project plan: it's weeks, not days.

5 When to Use It — and When Not To

⚡ Decision guide

✓ Use it when

Your system has multiple independent services and you need confidence that a single dependency failure won't cascade to a full outage
You are deploying to production for the first time after a major architecture change (new service mesh, new database cluster, new cloud region) and need to validate failure modes before real users hit them
You have observed a production incident that you believe was caused by a missing timeout or circuit breaker, and you want to prove the fix actually works under realistic failure conditions
Your team is on-call and lacks confidence about how the system behaves when dependencies degrade — chaos experiments build that mental model and improve incident response speed
You are preparing for a high-stakes event (Black Friday, a major NZ government go-live, an All Blacks game stream) and need pre-event confidence that the system handles load spikes and dependency wobbles

✗ Skip it when

The system is a simple monolith with one database and no downstream dependencies — there is nothing to cascade; standard performance testing and backup/restore drills are more valuable
You don't yet have observability: no metrics dashboard, no distributed tracing, no alerting. Running chaos without visibility is just breaking things randomly; fix your observability gap first
The team doesn't have the capacity to act on findings — if known resilience gaps sit unresolved for months, more chaos experiments add anxiety, not safety
You are in the middle of a production incident or a code freeze; chaos experiments inject controlled failures, but a system already under stress can tip into a real outage
The blast radius cannot be contained — if your architecture means that any failure you inject will immediately affect all customers (e.g. a single-region system with no redundancy), harden the architecture first, then run chaos experiments once redundancy exists

Context guide

How the right level of chaos engineering effort changes based on project context.

Context	Priority	Why
Harbour Bank NZ payment processing services (EFTPOS settlement, open banking APIs, real-time payments)	Essential	A cascading timeout during settlement can freeze thousands of transactions. RBNZ resilience expectations for systemically important payment systems require evidence of failure-mode testing. CAB approval is needed, so plan weeks ahead.
HealthNZ or CoverNZ patient-facing portals after a cloud region migration or major infrastructure uplift	Essential	New network paths invalidate every timeout and circuit-breaker setting tuned for the old environment. Latency-injection against RealMe, document storage, and claims integrations must be verified before go-live — not after.
TransitNZ journey-planner or TeleNZ billing API — microservices calling multiple external dependencies	High	Multi-dependency fan-out creates hidden cascading failure paths that only appear under real load. Toxiproxy latency injection on each downstream is low effort, high return — and can run in a dev environment first.
Pacific Air or Benefits NZ self-service portals preparing for a high-traffic event (seat sale, benefit payment run)	High	Pre-event chaos experiments run two to four weeks before the peak validate that timeout and circuit-breaker settings hold under near-peak load. A confirmed hypothesis buys the team confidence when traffic spikes on the day.
Revenue NZ myIR or Benefits NZ Benefit Management — NZISM-classified systems with strict change-control windows	Medium	Value is real but the approval overhead is high. Focus chaos effort on staging environments that closely mirror production; keep production experiments to scheduled maintenance windows with a full CAB paper prepared at least four weeks in advance.
Small Wellington SaaS startup — single-region monolith with one PostgreSQL database and no downstream microservices	Low	There is nothing to cascade. Invest instead in backup-restore drills, database failover tests, and basic load testing. Add chaos experiments once you have multiple services and observability in place.

Trade-offs

What you gain and what you give up when you choose chaos engineering.

Advantage	Disadvantage	Use instead when…
Surfaces cascading failure paths that unit, integration, and load tests cannot reach — because they only exist when a dependency is absent or slow, not when everything is working.	Requires mature observability before it is useful. Without a metrics dashboard, distributed tracing, and alerting, you cannot distinguish a controlled experiment from an accidental outage.	Invest in observability first. A Datadog or Grafana/Prometheus stack is a prerequisite, not a parallel workstream.
Builds genuine on-call readiness. When engineers have run experiments and seen how the system fails, incident response is faster because the failure modes are familiar — not novel.	In regulated NZ environments (banking, health, government), production experiments require CAB approval, a documented rollback plan, and sometimes a risk sign-off from both IT and the business. Lead time is weeks, not hours.	Run fault-injection tests in staging only when the approval overhead for production chaos outweighs the expected finding. Treat staging findings as directional, not conclusive.
Validates the specific fixes from previous incidents. A circuit breaker added after last month's outage is only confirmed working if you inject the same failure again under production load and verify the breaker actually trips.	A chaos experiment without a hypothesis is a production incident you scheduled. Teams under deadline pressure skip the hypothesis step, run random failures, and produce noise rather than insight. Discipline is required.	Use structured fault injection (a specific known failure against a known recovery path) when the team is new to the practice. Reserve open-ended chaos experiments for teams with established hypothesis discipline.
Scales down to small teams. Toxiproxy costs nothing, runs locally, and can simulate a slow payment gateway in a dev environment in under ten minutes. You do not need Kubernetes or a Gremlin licence to get started.	Findings without remediation create anxiety, not safety. If resilience gaps discovered in experiments sit unresolved for months, the team loses trust in the process and stakeholders question the value.	Pause experiments and focus on fixing the backlog of known gaps before running more. More experiments on a broken foundation add noise, not confidence.

Enterprise reality

How Chaos Engineering changes at 200–300-developer scale in NZ enterprise — what gets formalised, what gets automated, and what gets much harder.

Automation replaces manual scheduling. At small-team scale you run a chaos experiment by hand before a release. At 200-developer scale you need automated GameDay pipelines: experiments scheduled via Gremlin or Chaos Mesh, triggered in CI/CD, with pass/fail gates blocking deployment if a hypothesis is violated. Manual scheduling doesn’t survive 30 squads pushing to production daily.
Governance overhead is real and non-negotiable. Harbour Bank runs chaos experiments under its operational resilience framework, which requires a risk assessment, documented rollback plan, IT Risk sign-off, and a CAB paper lodged at least three weeks before any production fault injection. The Privacy Act 2020 and NZISM controls mean that any experiment touching production customer data must be pre-approved and scoped to exclude PII from injected failure logs. Build that lead time into your programme — it is structural, not bureaucratic obstruction.
Tooling decisions at volume. At small scale, Toxiproxy running locally is sufficient. At enterprise scale, teams standardise on Gremlin (managed SaaS, audit trail, role-based access — important for regulated environments) or Chaos Mesh with a centralised Argo Workflows orchestrator. AWS Fault Injection Simulator is common on cloud-native stacks where IAM-bounded blast radius gives auditors confidence that the experiment cannot escape its defined scope.
Cross-squad coordination becomes the hardest part. With 10+ squads each owning different services, a single latency-injection experiment can trigger alerts across four on-call rotations simultaneously. Enterprise chaos programmes require a designated GameDay coordinator role, a shared Slack channel where all on-call engineers are briefed before the experiment window, and a blast-radius map showing which squads are potentially affected. Without this, chaos experiments produce real-looking incidents that burn on-call engineers’ goodwill and erode trust in the practice.

◆ What I would do

Professional judgment — when to reach for chaos engineering, when to skip it, and what to watch for.

If…

I am a QA lead on a CoverNZ claims portal migration to a new AWS region, and my test lead asks whether to run chaos experiments in the two-week go-live window

I would…

Say yes, but only to latency injection in staging against the three highest-risk integrations: RealMe identity verification, document storage, and the payments gateway. I would not attempt production chaos in the go-live window — that requires a separate CAB paper that will take weeks to approve, and a new region migration is already a high-change-risk period. The staging experiments give us directional confidence that timeout settings suit the new network path, which is the most likely gap after a region change. I would write up any findings as P1 tickets and track them to closure before the go-live date.

If…

A Harbour Bank NZ SRE team has just fixed a production incident caused by a missing circuit breaker on the CoreLogic property-valuation API, and wants to verify the fix holds

I would…

Design a single targeted fault-injection experiment rather than a broad chaos run. Hypothesis: "If CoreLogic returns 5xx on all calls for 60 seconds, the circuit breaker opens after three failures, loan-application requests fail fast within 200ms, and the error is logged with the correct correlation ID." Run it in staging first at production-representative load using Toxiproxy to return 503 on the CoreLogic endpoint. If the hypothesis holds, escalate to a production experiment in the next approved CAB window — keeping the blast radius to a 5% traffic slice. This approach verifies the specific fix rather than hunting for new problems, and keeps the compliance overhead manageable.

If…

A Revenue NZ myIR development team is enthusiastic about running their first chaos experiments but has no observability tooling — no Prometheus, no distributed tracing, no alerting beyond CloudWatch default dashboards

I would…

Block the experiments and push for observability first. Without latency percentiles, error-rate dashboards, and alerting that can detect a 2x latency spike within a minute, chaos engineering produces noise rather than insight — you cannot tell whether the experiment changed anything. I would propose a two-sprint investment: instrument the three highest-risk service calls with custom CloudWatch metrics, wire SLO alerts at the p99 threshold, and confirm the alerts fire during a synthetic smoke test. Once you can see the system clearly, run a single Toxiproxy latency injection against the first integration as a proof-of-concept. One well-observed experiment with a clear result is more persuasive to the Revenue NZ technical risk team than a month of undocumented kubectl delete pods.

The bottom line: Chaos engineering earns its keep when you have three things in place before you inject the first failure: a written hypothesis, a metrics dashboard you trust, and a kill switch you have tested. Skip any one of those and you are running a production incident on a schedule — not an experiment.

6 Best Practices

✓ What experienced testers do

Write the hypothesis before opening any terminal. A chaos experiment without a hypothesis is a production incident you scheduled yourself. One sentence: "If X fails, Y should happen and metric Z should stay within bounds W." Write it down, share it with the team, then run the experiment.
Measure steady state for at least five minutes before injecting anything. A five-minute baseline of error rate, latency percentiles, throughput, and connection-pool depth gives you a clean comparison point. Without it, you cannot separate the experiment's effect from ordinary traffic variation.
Start with a small blast radius, not 100%. Kill one pod of ten. Inject latency on 5% of requests. If the system behaves as expected, escalate. Never start at full failure; you will cause a real outage and learn nothing you couldn't have learned at 10%.
Automate the kill switch before you start. Have a one-command rollback ready before the experiment begins — not while something is on fire. On Kubernetes this is often `kubectl rollout undo` or deleting a Chaos Mesh resource; on Toxiproxy it's removing the toxic. Know it, test it in staging, keep the command in your clipboard.
Announce the experiment on the on-call channel and set a calendar block. Even in staging: "Chaos experiment running 2–2:30pm, Service: order-service, failure: pod kill. Contact @yourname if you see unexpected alerts." This prevents false-positive incident escalations and keeps the team informed.
Verify your monitoring catches the failure, not just that the system survives it. If the experiment runs, the hypothesis holds, and not a single alert fires, that is a monitoring gap, not a clean pass. A chaos experiment should trigger your SLO alerts. If it doesn't, your alerting thresholds are too loose.
Document findings in the same ticket system as bugs. A chaos experiment that reveals a missing timeout should produce a Jira/Linear ticket assigned to a developer, just like a functional bug. "We found it" is worthless if it doesn't get fixed. Track it, prioritise it, verify it.
Run experiments on a schedule, not just before releases. Resilience rots. A timeout configured today might be removed in a refactor six months from now. Weekly or fortnightly automated chaos experiments catch regressions before they matter. Chaos Mesh and Gremlin both support scheduled experiments.
Test your team's response, not just the system's response. Give the on-call engineer the experiment window but not the specific failure. Time how long it takes them to identify the cause and escalate. This is a runbook and on-call readiness drill as much as a technical resilience test.
In NZ regulated environments, run the experiment in a dedicated chaos-testing window approved by the CAB. Get the risk sign-off, prepare a rollback plan, nominate an incident commander for the window, and have the vendor's support number ready. Treat it like a planned maintenance window — because that is exactly what it is.

7 Common Misconceptions

❌ Myth: Chaos engineering means randomly breaking things to see what happens.

Reality: Random failure injection is a production incident with extra steps. Real chaos engineering starts with a specific, falsifiable hypothesis about how the system should behave under a defined failure, measures a steady-state baseline, controls the blast radius, and ends by verifying or falsifying the hypothesis. The "chaos" is in the failure type, not in the methodology — the methodology is disciplined, structured, and repeatable. Netflix's Chaos Monkey is famous for being random, but it runs inside a sophisticated safety framework that took years to build. Starting without that framework and calling it "chaos engineering" produces noise, not insight.

❌ Myth: Once you have 100% unit test and integration test coverage, chaos engineering is unnecessary.

Reality: Unit and integration tests verify that your code does what it is supposed to do when its inputs are clean and its dependencies behave. They cannot test what happens when a downstream service is slow, when the network drops packets, or when the database runs out of connections at 3am on a Tuesday. These failure modes live in the interactions between components and in the infrastructure layer, which unit tests don't touch. The Auckland food-delivery platform example on this page illustrates a common pattern: 100% green test suite, zero resilience against a 90-second dependency wobble. Coverage metrics tell you nothing about failure-mode handling.

❌ Myth: Chaos engineering is only for large companies like Netflix or Amazon that run thousands of microservices.

Reality: If your application calls any external service — a payment gateway, an email provider, a government API, a database hosted in a different availability zone — you have failure modes that chaos engineering can surface. A ten-person SaaS startup in Wellington that calls Stripe, SendGrid, and AWS RDS has at least three places where a timeout, a circuit breaker, or a fallback might be missing. The tooling (Toxiproxy, Pumba, even simple `tc netem` commands) works at any scale. The techniques scale down just as well as they scale up; you don't need a chaos monkey army, you need a hypothesis and a proxy.

8 Now You Try

Three graded exercises: spot, fix, then build. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot: name the resilience gap

A TeleNZ billing service calls a downstream tax-rate service on every invoice. In a chaos experiment, the team injects 8 seconds of latency into the tax-rate service. The billing service has no timeout and no circuit breaker. Steady-state was: P99 latency 300ms, error rate 0%, throughput 500 req/s. Predict what the metrics will do during the experiment, and name the specific resilience gap and its fix.

Model answer

What happens: billing-service latency tracks the injected delay — requests now take ~8s+ instead of 300ms, because the service waits for the slow tax-rate call with no timeout. Threads block waiting, the connection/thread pool drains, new requests queue then fail, and throughput collapses well below 500 req/s. Error rate climbs as requests time out at the edge or the pool is exhausted. This is a cascading failure: one slow dependency drags down the whole billing service.
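The size of the collapse can be estimated with Little's Law (requests in flight = throughput × time per request). The sketch below is a rough back-of-envelope calculation only. The pool size of 200 worker threads is an assumption for illustration, not a figure from the exercise, and the 300ms P99 is used as an upper bound on normal request time.

```python
def in_flight(throughput_rps, latency_s):
    """Little's Law: average number of requests being served at once."""
    return throughput_rps * latency_s


def max_throughput(pool_size, latency_s):
    """Upper bound on throughput when every request holds a thread for latency_s."""
    return pool_size / latency_s


POOL_SIZE = 200  # assumed thread/connection pool size, for illustration only

baseline_in_flight = in_flight(500, 0.3)      # threads busy at steady state
needed_in_flight = in_flight(500, 8.0)        # threads needed to hold 500 req/s at 8s each
ceiling_rps = max_throughput(POOL_SIZE, 8.0)  # what the assumed pool can actually sustain

print(f"steady state: ~{baseline_in_flight:.0f} requests in flight")
print(f"during experiment: {needed_in_flight:.0f} needed, pool has {POOL_SIZE}")
print(f"throughput ceiling during experiment: {ceiling_rps:.0f} req/s")
```

Under these assumptions the pool is comfortable at steady state but would need to be roughly twenty times larger to hold throughput during the experiment, which is why requests queue and then fail.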

The resilience gap: no timeout on the downstream call (and no circuit breaker). A dependency that is merely slow is allowed to consume every thread.

The fix: add an aggressive timeout to the tax-rate call (e.g. fail the call after ~1s rather than waiting 8s), and add a circuit breaker so that after a few failures the billing service fails fast instead of queuing. Decide a sensible fallback — cached/last-known tax rate, or a clear error — so a slow dependency degrades gracefully instead of freezing everything.
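The fix described above can be sketched as a small circuit breaker wrapped around the tax-rate call. This is a minimal illustration, not a production implementation: the class, the thresholds, and the cached rate are hypothetical, and the slow dependency is simulated by a function that raises `TimeoutError`, standing in for a real client call made with a 1s timeout.

```python
import time


class CircuitBreaker:
    """Opens after max_failures consecutive failures, then fails fast
    until reset_after seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Half-open: let one trial call through; one more failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return False
        return True

    def call(self, func, fallback):
        if self.is_open():
            return fallback()  # fail fast, do not touch the dependency
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


calls = []


def slow_tax_service():
    # Stand-in for a real call that hit its 1s timeout.
    calls.append(1)
    raise TimeoutError("tax-rate call exceeded 1s timeout")


def cached_rate():
    return 0.15  # illustrative last-known rate


breaker = CircuitBreaker(max_failures=3, reset_after=30.0)
results = [breaker.call(slow_tax_service, cached_rate) for _ in range(10)]
print(len(calls), results[0])
```

In this run only the first three invoices wait on the dependency. The remaining seven are answered from the fallback immediately, so a slow tax-rate service can no longer hold every thread.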

🔧 Exercise 2 of 3 — Fix: repair a reckless chaos plan

A team at a Christchurch logistics company wrote the chaos experiment below. It is dangerous and not actually an experiment. Rewrite it into a safe, well-formed chaos experiment.

Flawed plan:
1. At 3am with no one watching, kill the entire production database cluster.
2. See what happens.
3. No hypothesis, no baseline metrics.
4. Let it run until someone notices.
5. No kill switch, team not told.

Rewrite as a safe chaos experiment:

Model answer

Safe chaos experiment:

Hypothesis: state a specific, falsifiable expectation, e.g. "If one of the three database replicas is killed, the service stays available — reads fail over to a healthy replica and P99 latency stays under 500ms."

Steady-state metrics: record the baseline first — error rate, latency P50/P95/P99, throughput, connection-pool depth, CPU/memory. You cannot tell a regression from normal noise without this.

Blast radius: start small. Kill ONE replica of three, not the whole cluster. Only escalate (kill two, then simulate full outage) once the small failure behaves as expected.

Timing / briefing: run during business hours when the team can respond, not 3am unattended. Brief on-call and the team lead beforehand — it is a planned drill, never a surprise.

Kill switch + time limit: have a way to stop the experiment instantly, and bound it to a few minutes, not "until someone notices".

Also wrong: no hypothesis (so it is random noise, not an experiment); killing the entire cluster is an enormous, uncontrolled blast radius; and "let it run until someone notices" means you are using real customers as the alert.
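The checks in this model answer can also be written down as a pre-flight validation that runs before any failure is injected. The sketch below is one possible shape. The `ExperimentPlan` fields and the wording of each problem are hypothetical, not part of any chaos tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExperimentPlan:
    hypothesis: str = ""
    baseline_metrics: tuple = ()
    targets_killed: int = 0
    targets_total: int = 0
    max_duration_s: Optional[int] = None
    has_kill_switch: bool = False
    team_briefed: bool = False
    attended: bool = False


def problems(plan):
    """Return a list of reasons the plan is unsafe; empty means ready to run."""
    issues = []
    if not plan.hypothesis.strip():
        issues.append("no hypothesis")
    if not plan.baseline_metrics:
        issues.append("no baseline metrics")
    if plan.targets_total and plan.targets_killed >= plan.targets_total:
        issues.append("blast radius is the whole cluster")
    if plan.max_duration_s is None:
        issues.append("no time limit")
    if not plan.has_kill_switch:
        issues.append("no kill switch")
    if not plan.team_briefed:
        issues.append("team not briefed")
    if not plan.attended:
        issues.append("nobody watching")
    return issues


# The flawed plan from the exercise: kill all three nodes, unattended.
flawed = ExperimentPlan(targets_killed=3, targets_total=3)

# The rewritten plan: one replace of three, bounded, briefed, attended.
safe = ExperimentPlan(
    hypothesis="Killing 1 of 3 replicas keeps P99 under 500ms",
    baseline_metrics=("error_rate", "p99_latency", "throughput"),
    targets_killed=1,
    targets_total=3,
    max_duration_s=300,
    has_kill_switch=True,
    team_briefed=True,
    attended=True,
)

print(problems(flawed))
print(problems(safe))
```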

🏗️ Exercise 3 of 3 — Build: design a chaos experiment

A TransitNZ journey-planner API depends on a real-time traffic-data service. You want confidence that if the traffic-data service goes down, the journey planner still returns a route (using cached data) instead of failing. Design the full chaos experiment: hypothesis, steady-state metrics, blast radius, the failure you inject, what you observe, and your safety guardrails.

Model answer

Chaos experiment for the journey-planner:

Hypothesis: "If the real-time traffic-data service is unavailable, the journey planner detects the failure within ~1s, falls back to cached traffic data, and still returns a route. The planner's error rate stays under 1% and P99 latency stays under 800ms."

Steady-state metrics: error rate, P50/P95/P99 latency, throughput (routes/sec), cache hit rate, and the proportion of responses served from live vs cached data.

Blast radius (staged): Stage 1 — inject latency / errors on a small percentage of traffic-data calls. Stage 2 — make the traffic-data service return errors for all calls. Stage 3 — full outage (service unreachable). Escalate only while behaviour matches the hypothesis.

Failure to inject: use a network proxy (e.g. Toxiproxy) or the service mesh to first add latency, then return 5xx, then drop the connection entirely to the traffic-data dependency.

What to observe / pass criteria: the planner times out the traffic-data call quickly, the circuit breaker opens, responses switch to cached data, routes are still returned, and error rate / latency stay within the hypothesis bounds. Fail = the planner hangs, returns 5xx to users, or returns no route.

Safety guardrails: run in staging first, then production during business hours; brief on-call; bound the experiment to a few minutes; keep a kill switch to restore the dependency instantly; and verify the monitoring/alerts actually fire during the experiment.
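The staged escalation and the abort rule in this answer can be sketched as a small runner. Everything here is illustrative: the stage names, the `inject`, `restore`, and `measure` callables, and the canned metrics are hypothetical stand-ins for a real fault-injection tool and a real metrics query. The bounds come from the hypothesis above (error rate under 1%, P99 under 800ms).

```python
def run_staged(stages, inject, restore, measure, max_error_rate=0.01, max_p99_ms=800):
    """Run stages in order; stop at the first stage that breaks the hypothesis.

    Returns (completed_stages, failed_stage), where failed_stage is None
    if every stage stayed within bounds.
    """
    completed = []
    for stage in stages:
        inject(stage)
        try:
            metrics = measure()
        finally:
            restore()  # kill switch: the fault is always removed
        if metrics["error_rate"] >= max_error_rate or metrics["p99_ms"] >= max_p99_ms:
            return completed, stage
        completed.append(stage)
    return completed, None


# Canned metrics standing in for a real system (assumed values).
FAKE_METRICS = {
    "partial_errors": {"error_rate": 0.000, "p99_ms": 320},
    "all_errors": {"error_rate": 0.002, "p99_ms": 450},
    "unreachable": {"error_rate": 0.200, "p99_ms": 5000},
}
state = {"fault": None}


def inject(stage):
    state["fault"] = stage


def restore():
    state["fault"] = None


def measure():
    return FAKE_METRICS[state["fault"]]


completed, failed = run_staged(
    ["partial_errors", "all_errors", "unreachable"], inject, restore, measure
)
print(completed, failed)
```

With these canned numbers the planner survives the first two stages and fails the full outage, so the finding is specific: the cache fallback works for error responses but not for an unreachable service.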

Why teams fail here

Running experiments without a hypothesis — injecting random failures and "seeing what happens" produces noise, not insight, and gives stakeholders the impression chaos engineering is just intentional vandalism.
Never graduating from staging to production — staging has different load profiles, connection-pool sizes, and dependency latencies; the failures that matter most only appear at real traffic levels.
Finding gaps and not fixing them — a chaos experiment that reveals a missing timeout and produces no Jira ticket is theatre. The value is in the remediation, not the discovery.
Skipping observability setup before the first experiment — if your dashboards, tracing, and alerting are not in place, you cannot tell a controlled experiment from an accidental outage, and you cannot learn anything from what you observe.

Key takeaway

Chaos engineering is the only testing discipline where the experiment failing to trigger an alert is itself a bug — because it means your observability cannot see a real outage either.

How this has changed

The field moved. Here is how Chaos Engineering evolved from its origins to current practice.

2008

Netflix experiences a major database corruption in its own data centre and cannot ship DVDs for three days. The incident prompts its migration to AWS and later motivates the creation of Chaos Monkey (2010), a tool that randomly terminates production instances to force engineering teams to build systems that survive failure.

2012

Netflix open-sources Chaos Monkey as part of the Simian Army. The term "chaos engineering" begins to emerge. The practice is initially limited to Netflix-scale organisations with the engineering capacity to absorb random failures safely.

2016

Gremlin founded as chaos-engineering-as-a-service. Chaos engineering principles codified by Netflix and published as "Principles of Chaos Engineering." The practice becomes accessible beyond hyperscalers. Kubernetes makes fault injection at the infrastructure layer practical.

2019–2021

Chaos engineering moves into the cloud-native mainstream. LitmusChaos and Chaos Mesh become CNCF (Cloud Native Computing Foundation) sandbox projects in 2020. Together with Chaos Toolkit and, from 2021, AWS Fault Injection Simulator, they make structured chaos experiments practical for enterprise teams. GameDays become a standard practice.

Now

Chaos engineering is a continuous, automated discipline in mature organisations — not a periodic experiment. AI-assisted tools can propose chaos scenarios based on system topology analysis. NZ financial services regulators (RBNZ, FMA) expect evidence of resilience testing for critical payment and settlement systems.

Self-Check


Interview Questions

What NZ hiring managers ask about Chaos Engineering — and what strong answers look like.

What is the difference between chaos engineering and traditional resilience testing?

Strong answer: Traditional resilience testing verifies known failure scenarios against expected recovery behaviour: planned failover tests, DR drills, and load tests to predetermined thresholds. Chaos engineering runs controlled failure experiments continuously in production (or production-like environments) to discover unknown weaknesses such as cascading failures, unexpected dependencies, and failure modes you did not know to plan for. Chaos engineering treats resilience as a continuous property to verify, not a gate to pass before release. Netflix's Chaos Monkey terminates random instances so that teams cannot build services that assume an instance will never fail.

Mid/Senior

How do you determine whether it is safe to run a chaos experiment in production?

Strong answer: First define the steady-state hypothesis — measurable normal behaviour (requests per second, error rate, p99 latency) that should hold during the experiment. Define the blast radius — which services, percentage of traffic, and time window are at risk. Check that rollback is possible and fast. Confirm that on-call is aware and monitoring dashboards are visible. Start with the smallest possible experiment and expand as confidence grows. For a NZ payments system, I would not run chaos experiments during peak periods (EFTPOS end-of-day settlement) or near regulatory reporting windows.

Senior/Lead

What is the difference between fault injection and chaos engineering?

Strong answer: Fault injection is the deliberate introduction of specific known faults to test known recovery paths, for example injecting a database failure and verifying the application falls back to the read replica. Chaos engineering uses fault injection inside hypothesis-driven experiments, sometimes with randomised targets, to discover unknown weaknesses. Fault injection validates your recovery design; chaos engineering reveals whether your recovery design covers all the failures that can actually happen. Most teams start with fault injection to build confidence, then progress to chaos engineering as maturity grows.

Mid/Senior

Q1: Why can't happy-path testing find the failures chaos engineering targets?

Happy-path tests verify behaviour when everything works. The defects chaos engineering hunts — missing timeouts, absent circuit breakers, connection-pool exhaustion, cascading failures — only appear when a dependency is slow or down. Those failure paths are never exercised by happy-path tests, so the only way to find them before production is to inject the failure deliberately.

Q2: What is a hypothesis in a chaos experiment, and why start with one?

A hypothesis is a specific, falsifiable prediction of how the system should behave under a given failure, e.g. “if the payment service is unavailable, the order service times out and returns a 503 rather than crashing.” Without it you are running random chaos — just noise. The hypothesis gives you something concrete to verify or falsify, so the experiment either builds confidence or pinpoints a real gap.

Q3: What is the blast radius, and why start small?

The blast radius is the scope of the injected failure — how much of the system you allow to break. You start small (kill one instance of ten, not the whole cluster) so that if the system behaves worse than expected, only a contained slice is affected. You escalate (one, then several, then full outage) only while behaviour keeps matching the hypothesis.

Q4: Why measure steady-state metrics before injecting the failure?

Because “did the system degrade?” is only answerable against a known normal. Steady-state metrics — error rate, latency percentiles, throughput, queue depth, resource usage — are your baseline. During the experiment you compare against them to tell a genuine regression from ordinary variation.
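A baseline comparison can be as simple as a threshold derived from the steady-state samples. The sketch below uses a mean-plus-three-standard-deviations rule, which is an assumption made here for illustration. Latency is rarely normally distributed, so treat it as a crude first filter, and note that the sample values are invented.

```python
from statistics import mean, stdev


def is_regression(baseline, observed, sigmas=3.0):
    """Flag observed as a regression if it sits well above baseline variation."""
    threshold = mean(baseline) + sigmas * stdev(baseline)
    return observed > threshold


# Invented steady-state P99 samples in milliseconds.
baseline_p99_ms = [290, 310, 305, 295, 300, 298, 307]

ordinary = is_regression(baseline_p99_ms, 315)   # slightly high, within normal noise
degraded = is_regression(baseline_p99_ms, 8000)  # tracks an injected 8s delay
print(ordinary, degraded)
```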

Q5: What guardrails make a production chaos experiment responsible rather than reckless?

Prove it in staging first; run during business hours so the team can respond; keep the blast radius small and staged; set a short time limit so injected failures stop on their own; have a kill switch to abort instantly; brief on-call and the team lead beforehand; and document why you are breaking the system. Chaos is a planned drill, never a surprise.

Q: The CoverNZ online claims portal is about to go through a major platform migration to a new cloud region. Your test lead asks you to design a chaos experiment for the post-migration environment. What would you test first, and why?

A: The highest-priority target is the connection between the new region and CoverNZ's downstream dependencies — identity verification (RealMe), document storage, and the payments integration. The most dangerous gap after a region migration is missing or misconfigured timeouts and circuit breakers on those calls, because they were often tuned for the old network path and may not suit the new latency profile. Start with a latency-injection experiment against each downstream dependency in staging: form a hypothesis (e.g. "if RealMe responds in 4s instead of 200ms, the portal returns a user-friendly error within 5s and does not exhaust connection threads"), measure steady-state baseline, then inject latency with Toxiproxy and verify. Only move to production experiments after CAB approval and briefing the on-call team — in a Crown entity like CoverNZ, that approval process can take weeks.
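For the latency-injection step, Toxiproxy exposes an HTTP API for adding "toxics" to a proxy. The sketch below builds a request that would add a 4s latency toxic. It assumes Toxiproxy is listening on its default API address and that a proxy named `realme` has already been created in front of the dependency; both the proxy name and the toxic name are hypothetical. The request is only constructed here, not sent.

```python
import json
import urllib.request

TOXIPROXY_API = "http://localhost:8474"  # default API address, assumed unchanged


def latency_toxic(latency_ms, jitter_ms=0):
    """Payload for a latency toxic applied to responses from the dependency."""
    return {
        "name": "injected_latency",  # hypothetical toxic name
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,  # apply to every connection
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }


def build_request(proxy_name, toxic):
    """Build (but do not send) the POST that adds the toxic to a proxy."""
    return urllib.request.Request(
        f"{TOXIPROXY_API}/proxies/{proxy_name}/toxics",
        data=json.dumps(toxic).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("realme", latency_toxic(4000))
print(req.get_method(), req.full_url)

# With Toxiproxy running in staging, send it with:
# urllib.request.urlopen(req, timeout=5)
```

Removing the toxic again (a DELETE on the same toxic under the proxy) is the kill switch for this experiment, so it is worth scripting before the injection rather than after.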

Q: What is the key difference between chaos engineering and load/performance testing, and how do you explain that difference to a developer who thinks they are the same thing?

A: Performance testing asks "how does the system behave as concurrent users or request volume increases?" — it stresses the happy path. Chaos engineering asks "how does the system behave when a specific dependency fails, slows, or corrupts data?" — it stresses failure paths. A system can pass every load test at 10,000 concurrent users and still collapse in 90 seconds when the payment gateway wobbles, because load tests never remove a dependency. You can explain it to a developer this way: load testing fills the motorway with extra cars; chaos engineering removes a lane, a petrol station, or a traffic light — and checks whether the remaining infrastructure routes around it or grinds to a halt.

Q: When should you NOT run chaos experiments, even if the team is enthusiastic about trying them?

A: Skip chaos experiments when observability is absent — if you have no metrics dashboard, no distributed tracing, and no alerting, injecting failures is just randomly breaking things with no way to learn from it. Also skip them during an active production incident or code freeze: a system already under stress can tip from a controlled experiment into a real outage. In NZ regulated environments (banking, HealthNZ, CoverNZ), also skip if you have not yet obtained CAB approval and a documented rollback plan — the compliance risk of an unapproved production failure injection outweighs any benefit. Finally, skip if the team cannot action findings: if known resilience gaps sit unresolved for months, more experiments add anxiety without adding safety.

Q: A developer on your team says "we already have 95% integration test coverage, so chaos engineering would just be duplicating what we've already tested." What is wrong with this reasoning, and how do you respond?

A: Integration tests verify that components work correctly together when all their inputs and dependencies are well-behaved. They usually cannot test what happens when a dependency is slow, returns garbage, or becomes unreachable, because the test setup typically mocks or stubs dependencies to return clean responses. The 95% coverage metric tells you nothing about failure-mode handling. The Auckland food-delivery example on this page illustrates exactly this: 100% green test suite, zero resilience against a 90-second payment provider wobble. Respond to the developer by pointing out that coverage measures code paths exercised, not failure paths survived, and offer to design a single small chaos experiment, such as injecting latency on your most critical downstream dependency, to demonstrate what coverage misses. One experiment usually makes the argument more convincingly than any explanation.

Go Deeper

The Chaos Testing (Specialised) track goes further: multi-lesson deep-dive with NZ-specific compliance context, advanced tooling, and practice exercises. Recommended once you have the fundamentals on this page.

Resilience over perfection: Chaos engineering is not about building perfect systems; it’s about building systems that fail safely and allow teams to understand and handle failures. If your system can survive the loss of a database server, you’ve won.
