Senior · Non-Functional Technique

Performance Testing Deep Dive

Load testing alone is not performance testing. A load test tells you what breaks; performance testing also has to explain why. Finding the bottleneck requires profiling, baseline data, SLO targets, and the discipline to measure before and after.

Senior ISTQB CTAL-TTA 5.3 — K4 Analyse ~14 min read + exercise

1 The Hook — Why This Matters

In 2021, a major NZ bank's payment processing system began slowing down during peak business hours (9am-11am, 1pm-3pm). Payment processing took 2 seconds at 8am, 8 seconds at 10am. Customers complained. The team ran a load test: "yes, it's slow under load." A junior engineer suggested "buy more servers!" They doubled the server count. Payments were still slow. The team commissioned a performance consultant at NZD 5,000/day. Within two days, the consultant found the real bottleneck: a database query in the transaction ledger was doing a full table scan instead of using an index. The index had been accidentally dropped during a deployment. Rebuilding the index took 30 minutes and cost NZD 0. The consultant's fee could have been saved with proper performance testing and profiling.

Performance testing without profiling is expensive guesswork. Profiling without baselines is chasing shadows. You must measure, analyse, and verify fixes systematically.

2 The Rule — The One-Sentence Version

Establish a baseline, define SLOs, profile for bottlenecks, make one change, re-measure, verify improvement, repeat. Measure p95 and p99, not just averages.

Performance testing has five layers: (1) Baselines — what is the current performance? (2) SLOs — what performance do we need? (3) Load Testing — what load causes issues? (4) Profiling — where is the bottleneck? (5) Optimisation & Verification — did the fix work? Test each layer independently, and always re-measure after changes.

3 The Analogy — Think Of It Like...

Analogy

It's like diagnosing why a car is running slowly instead of just pressing the accelerator harder.

You notice the car is slow (baseline). The owner wants it to do 100km/h on the motorway (SLO). You drive it and it's struggling at 60km/h (load test). Now, where's the problem? You could add more fuel, change the oil, upgrade the engine. But if the real problem is the parking brake is on, none of that helps. A mechanic would: check the baseline (tachometer reads 2000 rpm at 60km/h when it should read 1000), diagnose the brake issue (profiling), release the brake (fix), re-measure (now 1000 rpm at 60km/h), verify improvement (yes, the brake was the bottleneck). Without profiling, you'd waste money on engine upgrades.

Senior engineer insight

The moment that changed how I approach performance testing was realising that load models built from gut feel are worse than useless — they give you false confidence. Once I started building k6 scripts from production access logs (parsing nginx logs with a Python script to extract URL distributions, session lengths, and inter-request timing), every load test I ran told me something actionable. The synthetic traffic matched what users actually did, not what we assumed they'd do.
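As a rough illustration of the log-parsing step, the sketch below derives an endpoint mix from access log lines. It is written in JavaScript rather than the Python script mentioned above, assumes nginx's default "combined" log format, and runs on invented sample lines; `endpointMix` is a hypothetical helper name, not part of any tool.

```javascript
// Sketch: derive an endpoint mix from nginx access log lines.
// Assumes the default "combined" format; the sample lines are invented.
const REQUEST_RE = /"(GET|POST|PUT|PATCH|DELETE|HEAD|OPTIONS) (\S+) HTTP\/[\d.]+"/;

function endpointMix(lines) {
  const counts = new Map();
  let total = 0;
  for (const line of lines) {
    const match = REQUEST_RE.exec(line);
    if (!match) continue; // skip lines that do not look like a request
    const path = match[2].split('?')[0]; // drop the query string
    const key = `${match[1]} ${path}`;
    counts.set(key, (counts.get(key) || 0) + 1);
    total += 1;
  }
  // Convert counts to shares of total traffic, largest first
  return [...counts.entries()]
    .map(([endpoint, count]) => ({ endpoint, count, share: count / total }))
    .sort((a, b) => b.count - a.count);
}

const sampleLines = [
  '203.0.113.5 - - [01/Jul/2024:09:00:01 +1200] "GET /login HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
  '203.0.113.5 - - [01/Jul/2024:09:00:04 +1200] "GET /api/account-summary?id=7 HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
  '203.0.113.9 - - [01/Jul/2024:09:00:05 +1200] "GET /api/account-summary?id=9 HTTP/1.1" 200 2051 "-" "Mozilla/5.0"',
  '203.0.113.9 - - [01/Jul/2024:09:00:09 +1200] "POST /api/payments HTTP/1.1" 201 128 "-" "Mozilla/5.0"',
];

const mix = endpointMix(sampleLines);
console.log(mix);
```

The shares can then be used as weights when deciding how often each endpoint is hit in the load script. Session lengths and inter-request timing would need the timestamp and client fields as well, which this sketch ignores.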

The most common mistake senior testers make: defining SLOs based on what the system currently achieves rather than what users actually need — then calling a regression anything that changes the number, even when the original number was never fit for purpose.

From the field

A NZ council rates and billing system had been in production for three years without a formal performance test. The team assumed it was fine because "nobody complained." We were brought in before the annual rates-billing run — the single largest peak the system would see all year, when roughly 140,000 notices are generated and a spike of ratepayers log in to dispute their assessments over a 72-hour window. The team had modelled the load test at a flat 200 concurrent users based on average daily traffic. What we found in the nginx access logs was a very different story: peak login concurrency hit 900+ in the first four hours after notices dropped, with a secondary spike on day two when email reminders went out.

We rebuilt the k6 load model from two years of log data: ramping to 950 concurrent users over 30 minutes, holding for two hours, then tapering. The system fell over at 380 concurrent users because of a combination of database connection pool exhaustion and a PDF-generation service that held connections open for 8–12 seconds per notice. We found this in 40 minutes of profiling. The fix (connection pooling via PgBouncer and async PDF queuing) took three days and cost nothing in infrastructure. The lesson: production data is your load model. If you don't build from logs, you're guessing at the shape of your peak, and in the NZ public sector, the peak shape is everything.
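The ramp, hold, and taper shape described above can be sketched as an array of k6-style stages, plus a small helper that shows the target user count at any minute. This is plain JavaScript that runs under Node rather than a full k6 script; the 30-minute taper length is an assumption (the account only says "then tapering"), and `toMinutes` and `targetAt` are hypothetical helpers.

```javascript
// Sketch of the ramp / hold / taper shape as k6-style stages.
// The 30-minute taper is an assumption; the account only says "then tapering".
const stages = [
  { duration: '30m', target: 950 }, // ramp to 950 concurrent users
  { duration: '2h', target: 950 },  // hold at peak
  { duration: '30m', target: 0 },   // taper (assumed length)
];

// Convert a duration such as '30m' or '2h' to minutes (only s, m, h handled here)
function toMinutes(duration) {
  const match = /^(\d+)([smh])$/.exec(duration);
  if (!match) throw new Error(`Unsupported duration: ${duration}`);
  const value = Number(match[1]);
  return { s: value / 60, m: value, h: value * 60 }[match[2]];
}

// Target virtual users at a given minute, interpolating linearly within each stage
function targetAt(minute, plan = stages, startUsers = 0) {
  let elapsed = 0;
  let from = startUsers;
  for (const stage of plan) {
    const length = toMinutes(stage.duration);
    if (minute <= elapsed + length) {
      const progress = (minute - elapsed) / length;
      return Math.round(from + (stage.target - from) * progress);
    }
    elapsed += length;
    from = stage.target;
  }
  return from; // after the last stage
}

console.log(targetAt(12));  // during the ramp
console.log(targetAt(90));  // during the hold
console.log(targetAt(165)); // during the taper
```

In a k6 script, an array of this shape would typically be supplied as `options.stages`. Under a linear ramp like this one, the 380-user level at which the system fell over is reached about 12 minutes in, well before the planned peak.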

4 Watch Me Do It — Step by Step

Here is a real NZ example: an API that returns user account summaries is slow. Follow these steps to find and fix the bottleneck.

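Establish a baseline
Make 100 requests to the API in serial (one after another) and measure each response time. Record p50, p95, and p99: p50 is the median, p95 is the value that 95% of requests beat, and p99 is the value that 99% beat. Don't rely on averages; they hide outliers. With only 100 samples the p99 rests on one or two requests, so treat it as indicative and collect more samples when you need a stable tail figure.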
```
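// Baseline: 100 serial requests (run as an ES module on Node 18+ for fetch and top-level await)
// BASE_URL is a placeholder; point it at your own environment.
const BASE_URL = 'http://localhost:3000';
const times = [];
for (let i = 0; i < 100; i++) {
  const start = performance.now();
  const res = await fetch(`${BASE_URL}/api/account-summary`);
  await res.arrayBuffer(); // read the full body so the timing covers the whole response
  times.push(performance.now() - start);
}
times.sort((a, b) => a - b);
// Nearest-rank percentile: smallest sample with at least p% of samples at or below it
const pct = (p) => times[Math.ceil((p * times.length) / 100) - 1];
console.log({
  p50: pct(50),  // median, e.g. 120ms
  p95: pct(95),  // 95th percentile, e.g. 450ms
  p99: pct(99)   // 99th percentile, e.g. 2100ms (outlier)
});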
```
Found: p50 is 120ms (acceptable), but p99 is 2100ms (bad). The typical request is fast, but the slowest 1% take two seconds or more. You must investigate the p99 outliers.

Define SLOs (Service Level Objectives)
What response time do you need? Example SLOs for an NZ fintech API: p95 < 200ms, p99 < 500ms. For payment processing: p95 < 100ms, p99 < 300ms. Define SLOs based on user experience and business requirements, not on what the system currently achieves.
SLO: p95 < 200ms, p99 < 500ms. Current: p95 = 450ms, p99 = 2100ms. We're missing the SLO at both p95 and p99.
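The comparison of measured percentiles against the SLO can be made explicit with a small check like the one below. It is only a sketch: `checkSlo` is a hypothetical helper, and the numbers are the ones from this walkthrough.

```javascript
// Sketch: compare measured percentiles against SLO thresholds (all in ms).
// checkSlo is a hypothetical helper; the numbers come from the walkthrough.
function checkSlo(measured, slo) {
  return Object.keys(slo).map((metric) => ({
    metric,
    measuredMs: measured[metric],
    limitMs: slo[metric],
    pass: measured[metric] < slo[metric], // the SLO is "strictly less than"
  }));
}

const slo = { p95: 200, p99: 500 };
const baseline = { p50: 120, p95: 450, p99: 2100 };

const report = checkSlo(baseline, slo);
console.table(report);
const sloMet = report.every((row) => row.pass);
console.log(sloMet ? 'SLO met' : 'SLO missed');
```

Keeping the thresholds in one object means the same check can be reused after each optimisation without retyping the targets.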

Load test to find the breaking point
Gradually increase concurrent requests (1, 5, 10, 50, 100, 200 concurrent users) and measure response times at each step. Find the lowest concurrency at which p95 exceeds the SLO. This is your breaking point.

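```
// Ramp step: 10 concurrent users, each request timed individually
// (assumes an ES module on Node 18+; the host is a placeholder)
const url = 'http://localhost:3000/api/account-summary';
const timed = async () => {
  const start = performance.now();
  const res = await fetch(url);
  await res.arrayBuffer(); // include the body download in the timing
  return performance.now() - start;
};
const concurrency = 10;
const times = await Promise.all(Array.from({ length: concurrency }, timed));
times.sort((a, b) => a - b);
const pct = (p) => times[Math.ceil((p * times.length) / 100) - 1];
// With only 10 samples p95 is simply the slowest request; run several rounds per step for a stable figure
console.log({ concurrency, p50: pct(50), p95: pct(95) });
// Repeat with 20, 50, 100, 200 concurrent users
```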

Found: At 50 concurrent users, p95 jumps to 1000ms, five times the 200ms SLO. The system starts struggling at around 50 concurrent users.
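Once each ramp step has a p95, picking out the breaking point is a simple search. The sketch below assumes results are collected as `{ concurrency, p95Ms }` objects; `findBreakingPoint` is a hypothetical helper, and apart from the 50-user figure the sample rows are illustrative rather than measured.

```javascript
// Sketch: find the first ramp step whose p95 exceeds the SLO.
// Only the 50-user figure comes from the walkthrough; other rows are illustrative.
function findBreakingPoint(steps, sloP95Ms) {
  const ordered = [...steps].sort((a, b) => a.concurrency - b.concurrency);
  const firstFailure = ordered.find((step) => step.p95Ms > sloP95Ms);
  return firstFailure ? firstFailure.concurrency : null; // null: SLO held at every step
}

const rampResults = [
  { concurrency: 10, p95Ms: 190 },
  { concurrency: 20, p95Ms: 200 },
  { concurrency: 50, p95Ms: 1000 },
  { concurrency: 100, p95Ms: 2600 },
];

console.log(findBreakingPoint(rampResults, 200));
```

The true breaking point lies somewhere between the last passing step and the first failing one, so it is usually worth re-running with finer steps in that range.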

Profile to find the bottleneck
Use CPU profiling (flame graphs), memory profiling (heap dumps), and database profiling (slow query logs). Run the API under load and capture profiles.
```
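# Node.js profiling with Clinic.js (run in a shell)
# Doctor gives a high-level diagnosis (CPU, event loop, GC, I/O)
clinic doctor -- node app.js
# Flame produces a flame graph showing which functions consume CPU
clinic flame -- node app.js
# While each is running, drive load from another terminal:
#   artillery run load-test.yml
# Stop the app (Ctrl+C) and Clinic.js writes an HTML report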
```
Found: CPU profiling shows 45% of CPU time is spent in calculateAccountBalance(). Database profiling shows this function runs the query SELECT * FROM transactions WHERE user_id = ? ORDER BY date DESC, which takes 500ms because there is no index on user_id.

Optimise: add the missing index
The transactions table has millions of rows. Without an index, the query does a full table scan (reads every row). Adding an index on user_id lets the database read only the rows for that user.
```
-- Add the missing index
CREATE INDEX idx_transactions_user_id ON transactions(user_id);
-- Verify the index is used (42 is an example user_id; EXPLAIN needs a concrete value)
EXPLAIN SELECT * FROM transactions WHERE user_id = 42 ORDER BY date DESC;
```

Re-measure and verify improvement
Run the same load test again under the same conditions and capture the new metrics. Compare p95 and p99 before and after.

// Before: p50=120ms, p95=450ms, p99=2100ms (at 50 concurrent users, p95=1000ms)
// After: p50=50ms, p95=120ms, p99=300ms (at 50 concurrent users, p95=150ms)
// We now meet SLO: p95 < 200ms ✓, p99 < 500ms ✓
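To put a number on the improvement, the before and after figures can be compared as percentage changes. The sketch below uses the serial-run numbers from this walkthrough; `compareRuns` is a hypothetical helper.

```javascript
// Sketch: percentage change per percentile between two runs of the same test.
function compareRuns(before, after) {
  const result = {};
  for (const metric of Object.keys(before)) {
    const changePct = ((after[metric] - before[metric]) / before[metric]) * 100;
    result[metric] = Number(changePct.toFixed(1)); // negative means faster
  }
  return result;
}

const before = { p50: 120, p95: 450, p99: 2100 };
const after = { p50: 50, p95: 120, p99: 300 };

console.log(compareRuns(before, after)); // { p50: -58.3, p95: -73.3, p99: -85.7 }
```

The comparison is only meaningful when both runs used the same script, data, and environment, which is why the test conditions belong in the baseline record.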

Test memory usage under load
Even if response time is fast, memory might grow unbounded. Run the load test for at least 10 minutes and monitor memory. If memory keeps growing after warm-up and never levels off, there is probably a leak.
Pattern: Use heap dump analysis (jmap in Java, heap snapshots in Node.js) to find which objects are growing. Common causes: cached objects never evicted, event listeners never removed, database connections not closed.
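One way to turn "memory grows continuously" into a number is to fit a trend line through periodic heap samples. The sketch below uses a least-squares slope over synthetic data; in a real soak test the samples might come from `process.memoryUsage().heapUsed` inside the Node.js service, and `heapGrowthMbPerMinute` is a hypothetical helper.

```javascript
// Sketch: estimate heap growth from periodic samples using a least-squares slope.
// In a real soak test the samples would come from process.memoryUsage().heapUsed
// in the service under test; the arrays below are synthetic.
function heapGrowthMbPerMinute(samples) {
  // samples: [{ minute, heapMb }]
  const n = samples.length;
  if (n < 2) throw new Error('Need at least two samples');
  const meanX = samples.reduce((sum, s) => sum + s.minute, 0) / n;
  const meanY = samples.reduce((sum, s) => sum + s.heapMb, 0) / n;
  let numerator = 0;
  let denominator = 0;
  for (const s of samples) {
    numerator += (s.minute - meanX) * (s.heapMb - meanY);
    denominator += (s.minute - meanX) ** 2;
  }
  return numerator / denominator;
}

const minutes = [0, 2, 4, 6, 8, 10];
// Heap climbs 5 MB every minute: the shape of a leak
const leaking = minutes.map((minute) => ({ minute, heapMb: 200 + 5 * minute }));
// Heap oscillates around 200 MB as GC runs: a healthy sawtooth
const steady = minutes.map((minute, i) => ({ minute, heapMb: 200 + (i % 2 === 0 ? 3 : -3) }));

console.log(heapGrowthMbPerMinute(leaking).toFixed(2));
console.log(heapGrowthMbPerMinute(steady).toFixed(2));
```

A slope near zero suggests the heap is stable; a persistently positive slope over a long run is the signal to take heap snapshots. What counts as "too steep" depends on the service, so any threshold is a judgment call.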

Test cache effectiveness
If your system uses caching (Redis, Memcached), measure the cache hit rate. A low hit rate (< 80%) means a large share of requests still fall through to the backing store. Optimise by increasing cache size, using better cache keys, or lengthening the TTL where the data can tolerate it.

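```
// Monitor cache metrics (metrics stands for whatever counters your cache client exposes)
const { cacheHits, cacheMisses } = metrics;
const total = cacheHits + cacheMisses;
const hitRate = total === 0 ? 0 : cacheHits / total; // avoid dividing by zero before any traffic
console.log({ hitRate: (hitRate * 100).toFixed(2) + '%' }); // Target: > 80%
```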

Test for regressions in CI/CD
Add performance tests to your CI/CD pipeline. On every PR, run a load test and fail the check if p95 degrades by more than 10%. This catches performance regressions before they ship.

```
# CI/CD performance gate (Python sketch; run_load_test() is a placeholder
# for whatever returns the measured p95 in milliseconds)
import sys

baseline_p95_ms = 150  # from the stored baseline measurement
current_p95_ms = run_load_test()
if current_p95_ms > baseline_p95_ms * 1.1:  # fail on more than 10% degradation
    sys.exit(f"Performance regression: p95 increased from {baseline_p95_ms}ms to {current_p95_ms}ms")
```

Pro tip: Always measure on the same hardware under the same conditions. Run tests at the same time of day (to avoid network congestion variations). Use k6 or Apache JMeter for repeatable, scriptable load tests. Use clinic.js (Node.js), Java Flight Recorder (Java), or Datadog APM for profiling.

5 When to Use It / When NOT to Use It

✅ Prioritise performance testing when...

The application is user-facing (e.g., web, mobile, API)
Performance directly affects user experience (slow = churn)
You process high traffic (payments, messaging, streaming)
You have SLA/SLO requirements (99.9% uptime, p95 < 100ms)
You've made architectural changes (database, caching, async)
You're preparing for peak load (Black Friday, tax deadline)

❌ Don't fall into these traps...

Running load tests without a baseline (you won't know if you improved)
Optimising without profiling (guessing where the bottleneck is)
Using average or median (p50) response time instead of p95/p99
Testing on different hardware than production (results won't transfer)
Ignoring memory and GC overhead (fast responses can still hide a slow memory leak)
Not testing for regressions in CI (performance degrades silently)

6 Common Mistakes — Don't Do This

❌ Optimising without profiling

I used to think: The API is slow, so I'll add caching and use async. That should help.
Actually: The NZ bank wasted time adding servers when the real problem was a missing database index. Profiling (CPU, memory, database) reveals the true bottleneck. Without profiling, you're guessing. The consultant saved days of blind optimisation by profiling first.

❌ Using average response time instead of percentiles

I used to think: If average response time is 100ms, that's good.
Actually: Average hides outliers. If 99% of requests take 80ms and 1% take 10 seconds, the average is about 179ms (0.99 × 80 + 0.01 × 10,000), a figure that describes neither the fast requests nor the terrible ones. Always report p50, p95, p99. SLOs should target p95 and p99, not the average.

❌ Testing on different hardware than production

I used to think: Performance results on my laptop are good enough.
Actually: Production hardware is different. CPUs, memory, network, disks vary. Measure on hardware identical to production or use cloud environments (AWS, Azure) that mimic production. Otherwise, your results won't transfer.

7 Now You Try — Interview Warm-Up

🎯 Interactive Exercise

Question: Your API's p95 response time is 500ms, but your SLO is p95 < 200ms. You have two options: (1) add more servers (costs NZD 500/month), or (2) profile and optimise. What do you do, and why?

Think through the logic before revealing.

Best answer: Profile first.

Why: Adding servers might help (if the bottleneck is CPU or I/O saturation), but if the bottleneck is a missing database index (like the NZ bank), adding servers does nothing. You'd waste NZD 500/month and still miss the SLO.

Process: (1) Profile: capture CPU, memory, database metrics under load. (2) Find the bottleneck: maybe it's a slow query, maybe it's garbage collection, maybe it's a missing cache. (3) Optimise: add the index, enable caching, etc. (4) Re-measure: verify p95 is now < 200ms. (5) If still not met, and CPU is at 95%, then add servers.

Cost savings: Profiling takes 2-4 hours. Fixes such as an index or a cache often cost nothing in infrastructure, and if they close the gap you avoid the NZD 500/month indefinitely. Adding servers first risks spending money without solving the problem.

Why teams fail here

They build load models from assumptions rather than production log data — the virtual user count is plausible but the traffic shape (burst pattern, endpoint mix, session length) bears no resemblance to reality, so the test passes and the real peak kills the system.
They run performance tests once, before launch, then never again — a six-month-old baseline against a codebase that has had 200 commits is not a baseline, it's archaeology. Without regression gates in CI, performance degrades one slow pull request at a time.
They optimise the thing they understand (adding cache, scaling horizontally) rather than the thing that is actually slow — because profiling feels harder than deploying more servers. The NZ bank in the hook section is not unusual; "buy more compute" is the default response to a problem that would have taken an afternoon to diagnose properly.
They test in isolation and declare victory — the API is fast in the load test, but in production it shares a database with six other services, runs behind a WAF that adds 80ms, and competes for connection pool slots at peak. Environment parity matters as much as the test script itself.

Key takeaway

Performance testing is not a load test you run once before go-live — it is a measurement discipline that starts with production data, compares against SLOs that users actually need, and runs automatically on every change so you know the moment you introduced a regression.

8 Self-Check — Can You Actually Do This?

Try to answer each question before reading the answer. If you get all three, you're ready to own performance.

Q1. What's the difference between p50, p95, and p99?

p50 (median): 50% of requests are faster, 50% are slower. p95: 95% of requests are faster than this value. p99: 99% of requests are faster. For SLOs, you should target p95 and p99, not p50. If p50=100ms but p99=2s, you're failing 1% of users, which is unacceptable at scale.

Q2. What's a baseline and why is it important?

A baseline is a measurement of current performance under defined conditions. Before optimising, you measure the baseline (e.g., p95=450ms). After optimising, you re-measure and compare (e.g., p95=150ms). Without a baseline, you can't tell if your optimisation actually helped. The baseline also helps you spot regressions in CI/CD.

Q3. How do you identify the bottleneck without guessing?

Use profiling tools: CPU profiler (shows which functions consume CPU), memory profiler (shows heap growth and GC), database profiler (shows slow queries and lock contention). Run the application under realistic load and capture profiles. The profiles tell you exactly where time is spent. Then, optimise that specific area and re-measure to verify improvement.

9 Interview Prep — Common Questions

Q. "How do you establish a performance baseline?"

I make many sequential requests to the system and measure response times (at least 100 requests, and considerably more when I need a stable p99, since the tail of a 100-request sample rests on a single request). I calculate p50, p95, p99, and min/max. I document the test conditions (hardware, concurrent users, time of day) so the baseline is reproducible. Then, after any changes, I re-run the same test and compare. Baselines are critical: without them, you can't tell if you improved or regressed.

Q. "How do you approach profiling a slow application?"

I use three profilers: (1) CPU profiler to see which functions consume CPU time (flame graphs). (2) Memory profiler to detect leaks and GC overhead. (3) Database profiler to find slow queries. I run the application under representative load, capture profiles, and analyse. The profiles show exactly where the bottleneck is. Then, I make one targeted fix, re-measure, and verify improvement. Iterate until I meet SLO targets.

Q. "What's the difference between load testing and stress testing?"

Load testing applies expected and peak load to check that the system meets its SLOs, and ramps up to find the load at which it stops meeting them (p95 exceeds SLO). Stress testing pushes load beyond that point to find the absolute breaking point (system becomes unavailable) and to see how the system fails and recovers. Load testing answers "at what load do we miss our targets?" Stress testing answers "how much abuse can the system take?" Both are important: load testing helps you set capacity limits; stress testing helps you understand failure modes.

Q. "How do you prevent performance regressions?"

I add performance tests to CI/CD. On every PR, I run a load test and measure p95. If p95 degrades by more than 10% compared to baseline, I fail the build. This catches regressions before they ship. I also monitor production (APM tools like Datadog, New Relic) and alert if p95 exceeds SLO. Combining CI/CD gates with production monitoring prevents silent performance degradation.
