Performance Testing
Does your application stay fast when 10,000 Kiwis try to use it at once? Performance testing proves it with numbers — before your customers find out the hard way.
1 The Hook
It was 9:00 AM on a Tuesday in March. A major NZ bank had just launched a new home loan application portal. The marketing team sent an email to 80,000 customers. Within 15 minutes, the site was down. Not hacked. Not buggy. Just overwhelmed.
The bank lost an estimated $2.3 million in application volume that morning. The root cause? Nobody had tested what happened when more than 200 people used the site at once. The database connection pool maxed out. Threads hung. CPU spiked to 100%. Customers saw spinning wheels and timeout errors.
A single afternoon of load testing would have caught this. But performance testing is often treated as an afterthought — something you do "if there is time." In enterprise software, that mindset is expensive.
In NZ, performance testing is especially critical for:
- Banking and finance: End-of-month payroll, tax season (IRD integrations), mortgage rate changes
- Government: Census submissions, benefit applications, visa processing portals
- Retail: Boxing Day sales, Click Frenzy, new product drops
- Utilities: Power company billing cycles, water restriction announcements
2 The Rule
Performance testing measures how a system behaves under a defined workload, compares it against agreed criteria, and identifies the breaking point before users do.
The key word is defined. "Make it fast" is not a requirement. "The 95th percentile response time for the checkout API must be under 800ms with 5,000 concurrent users" is. Performance testing turns subjective feelings ("it feels slow") into objective data.
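That requirement maps one-to-one onto machine-checkable pass/fail criteria. A minimal sketch in k6 (the endpoint URL is a placeholder):

```javascript
import http from 'k6/http';

export const options = {
  vus: 5000,       // 5,000 concurrent virtual users, per the requirement
  duration: '10m',
  thresholds: {
    // The SLA expressed as code: the test fails if p95 exceeds 800ms
    http_req_duration: ['p(95)<800'],
  },
};

export default function () {
  http.post('https://example.co.nz/api/checkout'); // placeholder endpoint
}
```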
3 The Analogy
A bridge engineer stress-testing a new motorway bridge.
Before opening the Harbour Bridge to traffic, engineers do not just guess it will hold. They calculate the maximum load (trucks, wind, earthquakes). They test with weights. They simulate rush hour. They measure flex and vibration. Performance testing is the same discipline applied to software: define the expected load, simulate it, measure the response, and certify it is safe for production.
4 Types of Performance Testing
Performance testing is an umbrella term. Each subtype answers a different question. A comprehensive strategy uses all of them.
Load Testing
Question: Does the system handle expected traffic gracefully?
Simulate the number of concurrent users you expect in production. For an NZ insurance quote portal, this might be 500 concurrent users during business hours. You are not trying to break anything — you are confirming the system meets its SLA under normal conditions.
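A load test profile is typically a flat line at the expected level. A sketch of what that might look like in k6 for the insurance portal example (the endpoint and SLA number are assumptions):

```javascript
import http from 'k6/http';

// Hold steady at expected production load; no attempt to break anything.
export const options = {
  scenarios: {
    business_hours: {
      executor: 'constant-vus', // a fixed pool of virtual users
      vus: 500,                 // expected concurrency from the example above
      duration: '30m',
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<1000'], // illustrative SLA
  },
};

export default function () {
  http.get('https://quotes.example.co.nz/api/quote'); // hypothetical endpoint
}
```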
Stress Testing
Question: Where does it break, and how does it fail?
Gradually increase load beyond expected levels until the system fails. Does it crash? Does it slow to a crawl? Does it return corrupted data? Most importantly: does it recover when load drops? Stress testing reveals your safety margin.
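One way to script that ramp in k6 is stepped stages plus a threshold that aborts the run once the SLA is breached, so the breaking point is recorded automatically (all numbers are illustrative):

```javascript
import http from 'k6/http';

export const options = {
  stages: [
    { duration: '5m', target: 500 },  // expected load
    { duration: '5m', target: 1000 }, // 2x
    { duration: '5m', target: 2000 }, // 4x
    { duration: '5m', target: 0 },    // ramp down: does it recover?
  ],
  thresholds: {
    // abortOnFail stops the test as soon as the limit is crossed
    http_req_duration: [{ threshold: 'p(95)<2000', abortOnFail: true }],
  },
};

export default function () {
  http.get('https://example.co.nz/'); // placeholder endpoint
}
```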
Scalability Testing
Question: Can the system grow with the business?
Measure how performance changes as you add resources (horizontal scaling: more servers; vertical scaling: bigger servers). An NZ SaaS startup planning to expand to Australia needs to know: if we double our user base, do we need to double our infrastructure cost, or does the system scale efficiently?
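A useful way to frame the answer is scaling efficiency: the share of theoretical throughput you actually gain per added node. A small sketch with hypothetical numbers:

```javascript
// Did doubling the servers actually double the throughput?
const singleNodeRps = 1000; // measured on 1 server (hypothetical)
const doubleNodeRps = 1800; // measured on 2 servers (hypothetical)

const efficiency = doubleNodeRps / (2 * singleNodeRps);
console.log(`Scaling efficiency: ${(efficiency * 100).toFixed(0)}%`); // 90%
```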
Endurance / Soak Testing
Question: Does performance degrade over time?
Run the system at a steady, moderate load for an extended period — 8 hours, 24 hours, even a week. Memory leaks, connection pool exhaustion, log disk filling up, and database index fragmentation all show up here. Critical for systems that must run continuously: payment gateways, IoT data pipelines, trading platforms.
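The soak profile itself is simple: a long, flat plateau. The valuable data comes from server-side monitoring during the run. A k6 sketch (durations are illustrative):

```javascript
import http from 'k6/http';

export const options = {
  stages: [
    { duration: '10m', target: 200 }, // gentle ramp up
    { duration: '8h', target: 200 },  // soak: watch memory, connections, disk
    { duration: '10m', target: 0 },   // ramp down
  ],
};

export default function () {
  http.get('https://example.co.nz/'); // placeholder endpoint
}
```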
Spike Testing
Question: Can it handle sudden, extreme bursts?
Instantly jump from 100 to 10,000 users. Simulate a viral tweet, a breaking news alert, or a celebrity endorsement. Unlike stress testing (gradual ramp), spike testing is abrupt. Many systems handle smooth ramps but collapse under sudden pressure because caches are cold, connection pools are not pre-warmed, and autoscaling takes minutes to react.
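In k6, the difference between a stress ramp and a spike is just the shape of the stages: the ramp duration collapses to seconds (targets are illustrative):

```javascript
import http from 'k6/http';

export const options = {
  stages: [
    { duration: '2m', target: 100 },    // baseline traffic
    { duration: '10s', target: 10000 }, // the spike hits
    { duration: '5m', target: 10000 },  // sustained burst
    { duration: '2m', target: 100 },    // back to baseline: did it survive?
  ],
};

export default function () {
  http.get('https://example.co.nz/'); // placeholder endpoint
}
```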
5 Key Metrics You Must Measure
If you cannot measure it, you cannot improve it. Every performance test report should include at least:
- Response time: average, median, p95, and p99, per endpoint
- Throughput: requests or transactions per second
- Error rate: percentage of failed or timed-out requests
- Concurrency: the number of simultaneous virtual users the system sustained
- Resource utilisation: server-side CPU, memory, disk I/O, and network
Percentiles matter more than averages. If your average response time is 400ms but your p99 is 12 seconds, 1% of your users are having a terrible experience. In a system with 100,000 daily users, that is 1,000 frustrated people.
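To see how averages hide outliers, here is a small self-contained sketch that computes both from a made-up sample of response times:

```javascript
// 98 fast responses and 2 very slow ones: the average looks fine, p99 does not.
const latenciesMs = [...Array(98).fill(300), 12000, 12000];

const average = latenciesMs.reduce((a, b) => a + b, 0) / latenciesMs.length;

// Nearest-rank percentile: the value below which p% of samples fall.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.ceil((p / 100) * sorted.length) - 1];
}

console.log(`average: ${average}ms`);                 // 534ms: looks healthy
console.log(`p99: ${percentile(latenciesMs, 99)}ms`); // 12000ms: a disaster
```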
6 Tools in Action
Each tool has strengths. The right choice depends on your tech stack, team skills, and budget.
| Tool | Best For | Scripting Language | NZ Context |
|---|---|---|---|
| JMeter | GUI-based test creation, enterprise teams, broad protocol support (HTTP(S), FTP, JDBC) | Java | Most common in NZ government and enterprise. Free, huge community, steep learning curve. |
| Gatling | Code-first tests, CI/CD integration, beautiful reports | Scala / Java / Kotlin | Popular in NZ fintech and SaaS startups. Elegant DSL, great for developers. |
| k6 | Developer-friendly, JavaScript-based, cloud-native | JavaScript (Go runtime) | Rapidly growing in NZ. Cloud execution, easy Docker integration, perfect for DevOps teams. |
| LoadRunner | Enterprise protocols (SAP, Citrix, legacy systems) | Proprietary C/Java/VBScript | Expensive but still used in large banks and telcos. The gold standard for complex protocols. |
| Locust | Python-based, distributed load, programmable behaviour | Python | Great for Python-heavy teams. Define user behaviour as code, scale with Kubernetes. |
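For example, here is a k6 script that drives a combined load-and-stress profile against a hypothetical NZ checkout endpoint: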
```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 }, // ramp up
    { duration: '5m', target: 100 }, // steady state
    { duration: '2m', target: 400 }, // stress
    { duration: '5m', target: 400 }, // sustained stress
    { duration: '2m', target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<800'], // 95% under 800ms
    http_req_failed: ['rate<0.01'],   // error rate under 1%
  },
};

export default function () {
  // k6 sends a plain object body as application/x-www-form-urlencoded
  const res = http.post('https://shop.example.co.nz/api/checkout', {
    product_id: '12345',
    quantity: '2',
    region: 'auckland',
  });

  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 800ms': (r) => r.timings.duration < 800,
    'has order_id': (r) => r.json('order_id') !== undefined,
  });

  sleep(1); // think time between iterations
}
```
This script ramps from 100 to 400 virtual users, measures p95 response time, checks for successful checkouts, and validates that every response contains an order ID. Run it with `k6 run checkout-test.js`.
7 When to Use It / When NOT to Use It
✅ Run performance tests when...
- Before any major release to production
- After significant architecture changes (new DB, microservices split)
- Before marketing campaigns or expected traffic spikes
- When SLA commitments are negotiated with clients
- During infrastructure migration (on-prem to cloud)
❌ Skip or defer when...
- The product has zero users and no launch date (premature optimisation)
- Only static content changed (CSS, copy updates)
- You have no production-like environment to test against
- The test environment is not representative (different hardware, no CDN)
8 Common Mistakes
🚫 Testing in production-like environments that are not actually production-like
I used to think: Running on a single t2.medium EC2 instance tells me how production will perform.
Actually: If your production has load balancers, CDNs, read replicas, and 8 application servers, your test environment must mirror that. A test on underpowered hardware gives misleading results.
🚫 Ignoring think time
I used to think: More requests = better test. Send as fast as possible.
Actually: Real users pause. They read. They click slowly. A test with zero think time creates artificial load. Always add realistic delays between actions (2-5 seconds is typical for web apps).
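In k6, think time is just a sleep between actions; randomising it keeps virtual users from hitting the server in lockstep:

```javascript
import { sleep } from 'k6';

export default function () {
  // ... requests go here ...
  sleep(2 + Math.random() * 3); // random 2-5 second pause, like a real user
}
```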
🚫 Testing only the happy path
I used to think: If the homepage loads fast, we are good.
Actually: Search, checkout, report generation, and file uploads are where bottlenecks hide. Test the full user journey, including data-heavy operations and complex database queries.
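In k6, group() makes it natural to script the whole journey rather than a single page hit. A sketch (the endpoints are hypothetical):

```javascript
import http from 'k6/http';
import { group, sleep } from 'k6';

export default function () {
  group('search', () => {
    http.get('https://shop.example.co.nz/api/search?q=jacket');
    sleep(3); // reading results
  });
  group('product page', () => {
    http.get('https://shop.example.co.nz/api/products/12345');
    sleep(5); // comparing options
  });
  group('checkout', () => {
    http.post('https://shop.example.co.nz/api/checkout', { product_id: '12345' });
  });
}
```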
🚫 Not monitoring the test environment itself
I used to think: The load tool tells me everything I need.
Actually: You need server-side metrics too: CPU, memory, disk I/O, database slow query logs, and application logs. APM tools like New Relic, Datadog, or Dynatrace are essential companions to load testing.
9 Now You Try
Scenario: You are testing an NZ event ticketing site. A popular concert goes on sale at 9:00 AM. Historically, this causes a traffic spike of 5,000 users in the first 10 minutes. The business requirement is: "The ticket purchase flow must complete in under 3 seconds for 95% of users during peak load."
Your task: Design a performance test strategy. Specify:
- What type(s) of performance testing you would run
- The load profile (ramp pattern, concurrent users, duration)
- The key metrics you would measure
- What tool you would choose and why
10 Self-Check
Q1. What is the difference between load testing and stress testing?
Load testing verifies behaviour under expected load. Stress testing pushes beyond expected load to find the breaking point and observe recovery behaviour.
Q2. Why are percentiles (p95, p99) more meaningful than average response time?
Averages hide outliers. A p95 of 3 seconds means 95% of users experience 3 seconds or less. The average could be 800ms while 5% of users wait 30 seconds — a hidden disaster.
Q3. What is the purpose of soak testing?
To detect degradation over extended periods. Memory leaks, connection pool exhaustion, log rotation failures, and disk space issues only appear after hours or days of continuous operation.
Q4. Why is "think time" important in performance tests?
It simulates realistic user behaviour. Without pauses between actions, you create artificially high load that does not represent real usage patterns, leading to false positives.