Chaos Testing
Break your system on purpose, in controlled conditions, so production does not break you on its own terms at 2am on a Saturday. Chaos testing is how you turn "I hope it fails gracefully" into evidence.
1 The Hook
In December 2021, AWS us-east-1 had a multi-hour outage. Netflix stayed up. Disney+ did not. Ring doorbells went dark; Netflix users kept watching.
The difference was not luck. Netflix had been deliberately killing its own production servers at random — in prod, during the day — since 2011. The tool, Chaos Monkey, would pick a running EC2 instance and terminate it. If anything broke, the team fixed it. By 2021, killing a single host was such a non-event that a whole region disappearing caused visible but survivable degradation.
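To make the mechanism concrete, here is a minimal sketch of the Chaos Monkey idea (not Netflix's actual implementation), assuming boto3 credentials and a hypothetical opt-in tag `chaos=enabled` that marks instances as fair game:

```python
import random
import boto3

def terminate_random_instance(region="us-east-1"):
    """Chaos Monkey in miniature: pick one opted-in instance and kill it."""
    ec2 = boto3.client("ec2", region_name=region)
    # Only instances explicitly tagged chaos=enabled are candidates
    # (a hypothetical opt-in convention, not an AWS default).
    reply = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos", "Values": ["enabled"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    candidates = [
        inst["InstanceId"]
        for reservation in reply["Reservations"]
        for inst in reservation["Instances"]
    ]
    if not candidates:
        print("No opted-in instances; nothing to do.")
        return None
    victim = random.choice(candidates)
    ec2.terminate_instances(InstanceIds=[victim])
    print(f"Terminated {victim}. If anything user-visible broke, that is a finding.")
    return victim
```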
Chaos testing is the discipline of finding the things that would break your system before nature finds them for you. It is not "testing in production for fun." It is a rigorous scientific method: hypothesise that the system is resilient, run a controlled experiment to try to disprove it, and learn something either way.
For NZ testers: this matters because most NZ products run on cloud infrastructure that has global blast radius. An Azure region failure, a CDN outage, a DNS blip at 2degrees or Spark — your product will meet at least one of these in its lifetime. Chaos testing is how you find out whether it survives.
2 The Rule
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.
The operative words are experiment and confidence. You are not randomly breaking things — you are running hypotheses. You are not trying to prove the system is perfect — you are trying to measure what happens when specific failures occur, and comparing that to what you thought would happen.
3 The Analogy
Fire drills.
Every office in NZ runs fire drills. Not because a fire is expected this Tuesday, but because the first time staff practice evacuating should not be during a real fire. The drill reveals that the fire door sticks, the deaf receptionist cannot hear the alarm, the accessible ramp is blocked by a recycling bin. Those are the findings. Chaos testing is fire drills for your infrastructure. You pull the alarm in a controlled window, with the fire brigade on standby, so that when a real fire happens everyone knows where the doors are.
4 The Principles of Chaos
The formal Principles of Chaos Engineering define a five-step method:
- Define a "steady state" — a measurable business metric that represents the system working normally. Not CPU; not "app is up." Think: orders per minute, KiwiSaver enrolments completed per hour, login success rate.
- Hypothesise the steady state will continue in both a control group and an experimental group when you inject a real-world failure.
- Introduce the failure — server crash, network partition, disk full, clock skew, slow dependency, DNS failure, region outage.
- Try to disprove the hypothesis by comparing the steady-state metric between control and experiment. If it diverges, you have found a resilience gap.
- Minimise blast radius — run the smallest experiment that can answer the question. Start in staging. Move to a prod canary. Move to full prod only with demonstrated confidence.
"If we terminate one of the three app-tier pods in Kubernetes, the checkout success rate will remain within 99.5–100% for the 5 minutes following termination."
"If we break a server, the app will still work." No measurable metric, no time window, no blast radius. Cannot pass or fail.
5 Experiment Types
| Category | Example experiments |
|---|---|
| Resource | Burn CPU, exhaust memory, fill disk, saturate IO, starve file descriptors. |
| Network | Add latency (500 ms on calls to the payments service), drop packets (5% loss at the API gateway), corrupt DNS, block a security group. |
| State | Terminate a pod, reboot a VM, kill a primary database, fail over a replica. |
| Application | Inject a 500 response on 10% of requests, slow response time 10x, return malformed JSON, throw unexpected exceptions. |
| Region / AZ | Black-hole traffic to one availability zone, simulate a region outage, break inter-region VPC peering. |
| Time | Skew the clock on one host, simulate daylight-saving edge cases, advance the clock to trigger certificate expiry. |
| People / process | "The on-call is unreachable." "The CI/CD system is down during an urgent fix." "The one person who knows the runbook is on holiday." |
For testers specifically: The application-level experiments (bad responses, slow responses, malformed payloads) are within easy reach with tools like Toxiproxy, Pumba, or even a Postman pre-request script. You do not need a Kubernetes cluster to start.
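For example, here is a sketch of the "add 500 ms to calls to payments" experiment from the table, driving Toxiproxy's HTTP API with `requests`. It assumes a Toxiproxy server on its default admin port 8474; the upstream address and proxy names are made up for illustration:

```python
import requests

TOXIPROXY = "http://localhost:8474"  # Toxiproxy's default admin port

# Route app -> payments traffic through a proxy we control.
# The upstream address is an assumption; substitute your own.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "payments",
    "listen": "127.0.0.1:26379",
    "upstream": "payments.internal:443",
}).raise_for_status()

# Inject 500 ms latency (with 100 ms jitter) on responses from payments.
requests.post(f"{TOXIPROXY}/proxies/payments/toxics", json={
    "name": "payments_latency",
    "type": "latency",
    "stream": "downstream",
    "toxicity": 1.0,
    "attributes": {"latency": 500, "jitter": 100},
}).raise_for_status()

# ... run the checkout flow against 127.0.0.1:26379 and watch the metric ...

# Clean up: remove the toxic so the next test starts from a clean baseline.
requests.delete(
    f"{TOXIPROXY}/proxies/payments/toxics/payments_latency"
).raise_for_status()
```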
6 Tools in Action
Chaos Monkey. The original. Open source, AWS-focused, terminates EC2 instances at random within a specified schedule. Part of Netflix’s "Simian Army" (alongside Chaos Kong for region outages, Latency Monkey for network delay, Doctor Monkey for health checks). When to use: you run on AWS and want to prove host-failure resilience.
Gremlin. Commercial, hosted chaos engineering platform. UI-driven, supports resource, network, and state attacks across cloud and on-prem, with halt-on-red-alert built in. When to use: you want a managed platform with guardrails rather than rolling your own.
LitmusChaos. Kubernetes-native. Declarative chaos experiments as custom resources, a library of pre-built "ChaosExperiments," and integration with Argo and Prometheus. When to use: your platform is Kubernetes and you want experiments checked into Git alongside code.
AWS Fault Injection Simulator (FIS). Managed AWS service for injecting real-world failures into AWS workloads. Natively supports EC2, ECS, EKS, RDS. When to use: heavy AWS user who wants managed experiments with AWS-level observability.
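A sketch of kicking off a FIS run from Python, assuming boto3 and an experiment template you have already defined (actions, targets, and stop conditions live in the template; the template ID below is a placeholder):

```python
import boto3

fis = boto3.client("fis", region_name="ap-southeast-2")

# Start an experiment from a pre-built template.
experiment = fis.start_experiment(
    experimentTemplateId="EXTxxxxxxxxxxxx",  # placeholder ID
    tags={"game-day": "payments-503s"},
)
experiment_id = experiment["experiment"]["id"]

# The abort button: stop conditions fire automatically on CloudWatch
# alarms, but you can also pull the plug by hand.
# fis.stop_experiment(id=experiment_id)
```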
Toxiproxy. A TCP proxy that lets you inject latency, jitter, bandwidth limits, and connection resets on any network path. Runs locally or in CI. When to use: you want network-level chaos in tests without touching infrastructure.
Azure Chaos Studio. Microsoft’s managed chaos service. Works with AKS, VMs, Cosmos DB, and network security groups. When to use: NZ agencies and vendors with government workloads on Azure (very common).
7 Running a Game Day
A "game day" is a scheduled, structured exercise where the team deliberately breaks the system and observes the human + technical response. Amazon baked this into the AWS Well-Architected Framework. It is the single highest-leverage chaos activity a small team can do.
Before the game day
- Pick one scenario. "The payments provider starts returning 503s for 30% of calls."
- Write the hypothesis and the pass/fail criteria.
- Get explicit sign-off from the product owner and ops lead.
- Pick a low-traffic window (NZ: 6am Sunday is common).
- Brief the wider team so they do not think a real incident is happening.
- Pre-stage the abort button: how do we stop if things go worse than expected? (One pattern is sketched after this list.)
- Set up a dedicated chat channel and a shared timeline doc.
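One way to pre-stage the abort button is to make cleanup automatic rather than a step someone must remember under pressure. A minimal sketch, reusing the hypothetical Toxiproxy setup from section 5, that guarantees the fault is removed even if the experiment script crashes:

```python
import contextlib
import requests

TOXIPROXY = "http://localhost:8474"

@contextlib.contextmanager
def injected_fault(proxy: str, toxic: dict):
    """Inject a Toxiproxy toxic for the duration of the block.
    The finally clause is the abort button: Ctrl-C, a failed
    assertion, or a crashed script all remove the fault."""
    requests.post(
        f"{TOXIPROXY}/proxies/{proxy}/toxics", json=toxic
    ).raise_for_status()
    try:
        yield
    finally:
        requests.delete(f"{TOXIPROXY}/proxies/{proxy}/toxics/{toxic['name']}")

# Usage during the game-day window:
with injected_fault("payments", {
    "name": "payments_latency",
    "type": "latency",
    "stream": "downstream",
    "attributes": {"latency": 500, "jitter": 100},
}):
    pass  # observe dashboards, let on-call respond; exiting aborts the fault
```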
During the game day
- Announce "start" in the channel. Timestamp.
- Inject the failure.
- Observe: what monitors fire? Who gets paged? How fast?
- Let the on-call responders work the incident. Do not help them — the point is to test the runbook.
- Record surprises: things that nobody expected, steps that were longer than imagined, tools that were missing, runbook pages that were wrong.
- Stop if blast radius exceeds the plan.
After the game day
- Blameless retro within 48 hours.
- For every surprise, file a ticket with an owner and a due date.
- Publish the report. Share what you found. Celebrate the finds — this is how the team gets better.
- Schedule the next game day before you leave the room.
What teams typically find in a first game day:
- Runbook pointed at a dashboard that had been deprecated 6 months prior
- On-call rotation was mis-configured; the person paged was on holiday
- Circuit breaker did not trip because the threshold was never tuned after launch
- Retry logic caused a thundering-herd effect that made the outage worse
- Fallback static page rendered with a broken CSS bundle
- Status page communication went to a Twitter account nobody maintains
8 Common Mistakes
🚫 Running chaos before you have observability
I used to think: We will break stuff and watch what happens.
Actually: Without baseline metrics and dashboards, you cannot measure the blast radius. Chaos without observability is just vandalism. Invest in logs, metrics, traces, and alerting first.
🚫 Running chaos in prod before it works in staging
I used to think: Staging is too different from prod to be useful.
Actually: Staging is where you find the obvious failures at no cost to users. If the experiment surfaces issues in staging, you fix them and retry. Only when staging is boring do you consider a prod canary.
🚫 Running chaos without a kill switch
I used to think: The experiment will stop itself.
Actually: Every experiment has a documented abort procedure. Named person, command, tested in advance. If you cannot stop within 30 seconds, you are not ready to start.
🚫 Measuring "did the server come back?" instead of a business metric
I used to think: If the instance restarts within our SLO, the experiment passes.
Actually: The instance may restart but drop in-flight orders. The test must measure the business-relevant outcome (orders completed, logins succeeded) not just the infrastructure event.
🚫 One game day and done
I used to think: We did it. We are resilient.
Actually: Chaos is a cadence, not an event. Quarterly at minimum; monthly for large systems. Systems drift; dependencies change; new failure modes appear.
9 Now You Try
Task: Pick a product you work on (or a public NZ service like RealMe, KiwiSaver enrolment, or an MSD service). Write a game-day plan for one failure scenario.
- Scenario (one sentence): e.g. "Our payment provider returns 503s for 50% of calls for 5 minutes."
- Steady-state metric: what business number should remain healthy?
- Hypothesis: what should happen?
- Pass/fail criteria: what numeric threshold makes this a pass?
- Blast radius: staging only? canary? full prod? Why?
- Tool: how do you inject the failure (Toxiproxy, Gremlin, feature flag)?
- Abort: how do you stop within 30 seconds?
- Observers: who watches what dashboard during the window?
- Expected surprises: what are you actually trying to learn?
10 Self-Check
Q1. What are the five steps of a chaos experiment per the Principles of Chaos?
1. Define a steady-state business metric. 2. Hypothesise the steady state continues under failure. 3. Introduce the failure. 4. Try to disprove the hypothesis by comparing control and experiment. 5. Minimise blast radius — start small and expand with confidence.
Q2. Why is "is the server up?" not a good steady-state metric?
It is a proxy for availability, not a business outcome. A server can be "up" while dropping 40% of orders. Steady state must be a user-visible or business-visible metric — orders, logins, enrolments, revenue.
Q3. What is the difference between Chaos Monkey and Gremlin?
Chaos Monkey is Netflix’s open-source tool, AWS-focused, that randomly terminates EC2 instances. Gremlin is a commercial SaaS platform that supports a broader catalogue of attack types (resource, network, state) across clouds with a UI, halt-on-alert, and managed guardrails.
Q4. Your team wants to run chaos in prod on day one. What is your answer?
No. Prove it works in staging first; you will find plenty. Graduate to a prod canary (5% traffic) with full observability and a 30-second kill switch. Full-prod experiments earn their way in with evidence, never on enthusiasm.
Q5. What typically gets discovered in a first game day that technical design missed?
Operational and human-process failures. Out-of-date runbooks, mis-configured on-call rotations, alerts that paged the wrong channel, status-page updates that never went out, customer-support team unaware of the incident, retry logic that amplified rather than softened the outage.
Q6. Your chaos experiment in staging causes a 30-minute outage. Is this a failure?
No — it is the reason you ran the experiment in staging. You have just prevented a real 30-minute outage in prod. File the findings, fix them, run again. Chaos experiments that never find anything are not passing; they are not being bold enough.