Chaos Engineering & Resilience Testing
Chaos engineering intentionally breaks systems in controlled ways to find weaknesses before users do. Test how your system handles dependency failures, network issues, resource constraints, and cascading failures. Netflix popularized the practice with Chaos Monkey, and it has changed how much of the industry thinks about resilience testing.
What it is
Chaos engineering is the discipline of testing system resilience by deliberately introducing failures into production (or production-like environments) and observing how the system responds. Unlike traditional testing, which verifies happy paths, chaos testing verifies failure paths: what happens when a service is slow, when a database goes down, when the network loses packets, when a server runs out of memory?
The goal: build confidence that your system can degrade gracefully when things break. Users lose confidence when your service becomes unavailable after a single failure; they forgive brief slowness if the service keeps working.
Netflix chaos engineering: Netflix runs “Chaos Monkey,” a tool that randomly terminates instances in production. Why? Because it forces their engineers to build systems that survive random failures. The practice it popularized is now widely adopted across the industry.
Core principles
- Build confidence, not metrics. The goal is not to get a score or pass a test. The goal is to understand how your system fails and to be confident it fails safely.
- Find weaknesses before users do. Run chaos experiments before a change reaches users. Find the cascading failure scenarios and fix them in staging first, rather than discovering them during a production incident.
- Test both the system and your team. Chaos testing reveals not just code defects, but operational defects: missing runbooks, unclear alerts, unclear ownership.
- Start small. Don’t kill the entire database. Introduce one failure at a time, observe, then introduce more.
Types of chaos experiments
| Failure type | What breaks | What to test |
|---|---|---|
| Dependency failure | Service A calls Service B; Service B goes down | Does Service A time out gracefully? Does it retry? Does it circuit-break and fail fast? |
| Latency injection | Service B responds, but very slowly (2s instead of 200ms) | Does Service A time out at its configured limit instead of waiting out the full 2s? Does it use timeout-aware retries? Do the callers of Service A queue up or fail? |
| Resource exhaustion | CPU / memory / disk fills up | Does the service gracefully shed load? Do queues back up or overflow? Does monitoring alert before it crashes? |
| Network partition | Nodes cannot reach each other (dropped traffic, severe packet loss) | Can the cluster end up in split-brain? Do reads/writes remain consistent? Do nodes rejoin correctly? |
| Data corruption | Database writes garbage; cache returns stale data | Do validations catch corrupted data? Is there a rollback mechanism? Do integrity checks exist? |
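The "timeout-aware retries" mentioned in the table can be sketched as follows. This is a minimal illustration, not a production client: `call_with_deadline`, its parameters, and the `operation(timeout_s)` callable are all hypothetical names, and the default budget values are assumptions chosen only for the example.

```python
import time


def call_with_deadline(operation, total_budget_s=1.0, per_try_timeout_s=0.3,
                       max_attempts=3, clock=time.monotonic):
    """Retry `operation(timeout_s)` without exceeding an overall time budget.

    `operation` is a hypothetical callable that accepts a timeout in seconds
    and raises TimeoutError when the dependency is too slow.
    """
    deadline = clock() + total_budget_s
    last_error = None
    for _ in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # budget spent: fail fast instead of piling on more work
        try:
            # Never wait longer than what is left of the overall budget.
            return operation(min(per_try_timeout_s, remaining))
        except TimeoutError as exc:
            last_error = exc
    raise TimeoutError("dependency call exceeded its time budget") from last_error
```

The point of the overall budget is that retries cannot multiply the caller's wait: three attempts with a 300 ms timeout each can never hold a request for longer than the 1 s budget.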
Designing a chaos experiment
A well-designed chaos experiment follows this structure:
1. Hypothesis
Start with a hypothesis: “If the payment service becomes unavailable, the order service will time out and return a 503 to the user (instead of crashing).” or “If the database latency spikes to 5s, the web server will not run out of connections.”
2. Steady-state metrics
Define what “normal” looks like: error rate, latency p50/p95/p99, throughput, queue depth, memory usage. Before you introduce the failure, establish a baseline.
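One way to capture such a baseline is to summarize a window of request samples. The sketch below assumes samples arrive as hypothetical `(latency_ms, succeeded)` pairs and uses a simple nearest-rank percentile; real monitoring systems typically compute these from histograms instead.

```python
import math
from dataclasses import dataclass


@dataclass
class SteadyState:
    error_rate: float
    p50_ms: float
    p95_ms: float
    p99_ms: float
    throughput_rps: float


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is 1..100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(samples, window_s):
    """Summarize (latency_ms, succeeded) samples observed over window_s seconds."""
    if not samples:
        raise ValueError("need at least one sample to establish a baseline")
    latencies = sorted(latency for latency, _ in samples)
    failures = sum(1 for _, succeeded in samples if not succeeded)
    return SteadyState(
        error_rate=failures / len(samples),
        p50_ms=percentile(latencies, 50),
        p95_ms=percentile(latencies, 95),
        p99_ms=percentile(latencies, 99),
        throughput_rps=len(samples) / window_s,
    )
```

Running `summarize` once before the failure and once during it gives two comparable snapshots for the later verification step.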
3. Blast radius
Define the scope of the failure. Don’t kill everything; start small:
- Stage 1: Kill 1 of 10 instances of the dependent service.
- Stage 2: Kill 5 of 10 instances.
- Stage 3: Kill 100% (full outage).
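The three stages above can be expressed as fractions of the instance pool. This is a small sketch under assumed names: `pick_targets` and `STAGES` are hypothetical, and the instance identifiers are placeholders for whatever your orchestrator exposes.

```python
import math
import random

# Fractions matching the stages above: 1 of 10, 5 of 10, then all instances.
STAGES = [0.1, 0.5, 1.0]


def pick_targets(instances, fraction, rng=random):
    """Choose which instances to fail for one stage of the experiment."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Always pick at least one instance so that small pools are still exercised.
    count = max(1, math.floor(len(instances) * fraction))
    return rng.sample(list(instances), count)
```

A driver would call `pick_targets` for each entry in `STAGES` in order and move to the next stage only after the previous one left the steady-state metrics within tolerance.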
4. Apply the failure
Introduce the failure (latency, packet loss, service down) in a controlled way.
5. Observe
Watch metrics in real time: does error rate spike? Does latency increase? Do queues back up? Does the system stay healthy or degrade?
6. Verify the hypothesis
Did the system behave as expected? If yes, you gained confidence. If no, you found a gap that needs fixing.
7. Automate and repeat
Once you’ve manually run the experiment, automate it so it runs on a schedule (nightly, weekly).
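The steps above can be tied together in a small harness. The sketch below is illustrative only: `run_experiment` and its four callables are hypothetical, and how you measure, inject, and roll back depends entirely on your tooling.

```python
def run_experiment(measure, inject, rollback, hypothesis_holds):
    """Run one experiment and report whether the hypothesis held.

    All four arguments are hypothetical callables supplied by the caller:
    measure() returns a metrics dict, inject() applies the failure,
    rollback() removes it, and hypothesis_holds(baseline, during) returns a bool.
    """
    baseline = measure()      # step 2: steady state before the failure
    try:
        inject()              # step 4: apply the failure
        during = measure()    # step 5: observe while the failure is active
    finally:
        rollback()            # always remove the failure, even if a step raised
    return {
        "baseline": baseline,
        "during": during,
        "hypothesis_held": hypothesis_holds(baseline, during),  # step 6
    }
```

A hypothesis such as "the error rate stays below 5%" then becomes a one-line function over the two metric snapshots, which makes the experiment straightforward to automate on a schedule.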
Kubernetes-specific chaos
Kubernetes makes it easy to run chaos experiments because you can kill, pause, or reschedule pods programmatically:
- Pod termination: Kill a pod and watch its controller (Deployment/ReplicaSet) create a replacement, while readiness probes keep traffic away from the new pod until it is ready.
- Network policies: Block traffic between pods (simulate a network partition).
- Resource limits: Set a low memory limit on a pod and watch it get OOMKilled, or set a low CPU limit and watch it get throttled.
- Node failure: Drain a node (move all pods off it) and verify your services still run on remaining nodes.
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-order-service
  namespace: production
spec:
  action: pod-kill
  mode: fixed
  value: "1"              # Kill 1 pod at a time (the field expects a string)
  scheduler:              # Chaos Mesh 1.x syntax; 2.x uses a separate Schedule resource
    cron: "0 2 * * *"     # Run at 2am daily
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-service
  duration: 5m            # Experiment window of 5 minutes (pod-kill itself is a one-shot action)
  gracePeriod: 30         # Give pod 30s to shut down gracefully
```
Worked examples
Example 1: Database timeout handling
Hypothesis: If the database becomes slow (latency spikes from 50ms to 5s), the application will time out after 3s and return a user-friendly error, not hang for 30s.
Experiment: Use Toxiproxy (a TCP proxy placed between the application and the database) to inject 5s of latency on the database connection.
Baseline (steady state): p99 latency 100ms, error rate 0%, throughput 1000 req/s.
During chaos: Database queries take 5s (injected latency) + application overhead. Measure: Do requests time out after 3s? Or do they wait the full 5s+?
Expected result: Application times out after 3s, returns HTTP 504 (Gateway Timeout), user sees error message, no hanging.
If it fails: Application hangs (no timeout configured), threads pile up, memory fills, service crashes. Add timeouts to database calls.
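The fix described here, a timeout around the database call, might look like the following sketch. It assumes an asyncio-based application; `slow_query` is a hypothetical stand-in for a real database call, and the 504 status mirrors the expected result stated in this example.

```python
import asyncio


async def slow_query(delay_s):
    """Hypothetical stand-in for a database call slowed by injected latency."""
    await asyncio.sleep(delay_s)
    return {"rows": 1}


async def handle_request(query_delay_s, timeout_s=3.0):
    """Return (status, body), giving up on the query after timeout_s seconds."""
    try:
        result = await asyncio.wait_for(slow_query(query_delay_s), timeout=timeout_s)
        return 200, result
    except asyncio.TimeoutError:
        # wait_for cancels the pending query, so it does not keep piling up.
        return 504, {"error": "The service is busy. Please try again shortly."}
```

With 5s of injected latency and the 3s default, `asyncio.run(handle_request(5.0))` should take about 3s and return the 504 branch rather than waiting for the query to finish.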
Example 2: Circuit breaker verification
Hypothesis: If the payment service becomes unavailable, the circuit breaker will open after 3 failed requests and fail fast (200ms) instead of waiting for timeouts (15s each).
Experiment: Kill the payment service and measure order service latency.
Expected result: First 3 requests time out (15s each). 4th request fails immediately (circuit open, 200ms). Circuit stays open for 60s. After 60s, it lets a trial request through to check whether the payment service has recovered.
If it fails: Circuit breaker is not configured or not working. Every request waits the full 15s timeout, and those timeouts cascade to users.
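For reference, the behavior this experiment verifies can be sketched as a minimal circuit breaker. This is an illustration only, not the implementation of any particular library: the class and exception names are hypothetical, the defaults mirror the numbers in this example (3 failures, 60s), and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("failing fast: circuit is open")
            # Reset timeout elapsed: let this call through as a trial.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In the experiment, the order service would wrap each payment call in `breaker.call(...)`; the fourth call after three timeouts should raise `CircuitOpenError` almost immediately instead of waiting 15s.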
Example 3: Cascading failure detection
Hypothesis: If Service A depends on Service B, and Service B depends on Service C, and Service C becomes slow, then Service A should not also become slow (due to connection pool exhaustion).
Experiment: Inject 10s latency on Service C.
Expected result: Service B times out quickly (has timeouts configured), returns errors, Service A sees errors and either retries or fails fast. Service A latency stays low because it doesn’t wait for Service C.
If it fails: Service B doesn’t time out, waits for Service C, and exhausts its connections, so Service A piles up waiting for connections. Latency cascades.
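The connection pool exhaustion in this example follows from Little's law: the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. The sketch below works through it with hypothetical numbers (100 req/s from Service B to Service C, a pool of 50 connections, a 300ms timeout); none of these figures come from a real system.

```python
def in_flight_requests(arrival_rate_rps, latency_ms):
    """Little's law: average requests in flight = arrival rate x time in system."""
    return arrival_rate_rps * latency_ms / 1000


def pool_exhausted(arrival_rate_rps, latency_ms, pool_size):
    """True when steady-state demand exceeds the connection pool."""
    return in_flight_requests(arrival_rate_rps, latency_ms) > pool_size


# Hypothetical numbers for Service B calling Service C.
RATE_RPS = 100
POOL_SIZE = 50

normal = in_flight_requests(RATE_RPS, 100)         # healthy 100 ms calls
no_timeout = in_flight_requests(RATE_RPS, 10_000)  # waits out the injected 10 s
with_timeout = in_flight_requests(RATE_RPS, 300)   # gives up after 300 ms
```

Under these assumptions, healthy traffic needs about 10 connections, waiting out the injected 10s needs about 1,000 (far beyond the pool of 50), and a 300ms timeout keeps demand at about 30. The same arithmetic shows that a timeout only protects the pool if it is short relative to pool size and traffic.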
Observability during chaos
Chaos experiments are only useful if you can see what’s happening. Ensure you have:
- Real-time metrics: Error rate, latency (p50/p95/p99), throughput, connection pools, queue depth, CPU/memory.
- Distributed tracing: See a request flow from Service A through B, C, and back. Watch it fail in Service C and see how it cascades.
- Logs: Application logs showing timeouts, retries, circuit breaker state changes.
- Alerts: If the chaos breaks something, do your on-call alerts fire? If not, your monitoring has a gap.
Safety and governance
Chaos experiments can break things. Implement guardrails:
- Start in staging, not production. Prove the experiment works in a non-critical environment first.
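- Schedule carefully. Run new or risky experiments during business hours (not 3am) so your team can respond if something goes wrong. Reserve unattended overnight runs for experiments that have already passed repeatedly.
- Set a blast radius. Don’t start by killing 100% of a service; kill 1 instance, then 2, then more.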
- Set a time limit. A chaos experiment should run for minutes, not hours. If latency is injected for 5 minutes, it stops after 5 minutes (not indefinitely).
- Implement kill switches. If an experiment goes wrong, have a way to stop it immediately.
- Document the experiment. Other teams need to know why you’re breaking the system and what to do if it gets out of hand.
- Brief the team. Before running a chaos experiment in production, tell your on-call and your team lead. It’s not a surprise.
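Two of these guardrails, the time limit and the kill switch, can be enforced in code rather than left to discipline. The following is a minimal sketch: `run_with_guardrails` and its parameters are hypothetical, `error_rate` is assumed to be a callable that reads the current error rate from your monitoring, and the 5% abort threshold is an arbitrary example value.

```python
import time


def run_with_guardrails(inject, rollback, error_rate, max_duration_s=300,
                        abort_error_rate=0.05, poll_s=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Run a failure injection with a hard time limit and an automatic abort.

    Returns "completed" if the time limit was reached, or "aborted" if the
    observed error rate crossed abort_error_rate first.
    """
    started = clock()
    try:
        inject()
        while clock() - started < max_duration_s:
            if error_rate() > abort_error_rate:
                return "aborted"  # kill switch: stop as soon as users are hurt
            sleep(poll_s)
        return "completed"
    finally:
        rollback()  # the failure is removed on every exit path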
Tools
- Chaos Mesh: Kubernetes-native; inject pod failures, network delays, stress tests. Open source.
- Gremlin: SaaS platform; point-and-click chaos experiments, curated experiments, integrations with monitoring.
- Toxiproxy: Lightweight TCP proxy; inject latency, bandwidth limits, timeouts, and connection resets. Good for testing individual service pairs.
- Pumba: Docker-specific chaos; kill containers, pause/unpause, inject stress.
- Chaos Toolkit: Open-source framework; experiments are declared in JSON or YAML and can be extended with Python.
Best practices and anti-patterns
- Never run unknown chaos experiments. Always understand what failure you’re introducing and why before you run the experiment.
- Don’t just focus on individual service failures. Test cascading and correlated failures too: what happens when a failure in Service C propagates to B and A, or when several services fail at once?
- Don’t ignore the results. If a chaos experiment reveals a gap (e.g., no timeout configured), fix it before deploying.
- Don’t run chaos experiments only in staging. Staging doesn’t have the same load, traffic patterns, or data as production. Run in production (with guardrails).
- Don’t assume monitoring will catch everything. Run the chaos experiment and manually verify the system is behaving as expected, not just that metrics look okay.
- Don’t skip the hypothesis. Running random chaos experiments is noise. Start with a specific hypothesis and verify or falsify it.
Resilience over perfection: Chaos engineering is not about building perfect systems; it’s about building systems that fail safely and allow teams to understand and handle failures. If your system can survive the loss of a database server, you’ve won.