Security Automation in CI
Security found in a yearly pen test is security found too late. The senior QE’s job is to push security checks left into the pipeline — so every commit is scanned, every risky change is caught, and the build is the gate. This lesson shows you what to run, where it fits, and how to keep it useful instead of noise.
1 The Hook
Kowhai Pay, a fictional NZ payments startup, took security seriously — once a year. Every February they paid a firm to run a penetration test, fixed what came back, and moved on. In between, the team shipped to production a dozen times a day, and no security check ran in any of those deploys.
One November, a developer added a new logging library to fix a tracing issue. It pulled in a transitive dependency with a known critical vulnerability — one that had a public advisory and a patched version available for months. The pipeline ran the unit tests, they passed, and the change went live. The flaw sat in production for eleven weeks until the February pen test found it. By then it had been deployed across every service.
Here is the lesson in that story: the pen test was not wrong, it was just too slow. A yearly check cannot keep up with daily deploys. The vulnerable library would have been caught the moment it was added if a dependency scan had been running in the pipeline — a check that takes seconds and would have failed the build on the spot.
Security automation in CI is the discipline of moving these checks out of the annual audit and into the pipeline, so the feedback is continuous and the build itself becomes the gate. That is what this lesson teaches: what to scan, where each scan belongs, when to fail the build, and how to keep the results trustworthy.
2 The Rule
Security checks that run yearly cannot protect software that ships daily. Push the checks into the pipeline so every change is scanned automatically, fail the build on real findings, and triage the rest — an unconfigured scanner that no one trusts is worse than no scanner at all.
3 The Analogy
Airport security versus a single annual customs raid.
Imagine an airport that only checked bags once a year. Every other day, anyone could carry anything onto a plane, and the one big raid in February would catch a year’s worth of problems all at once — long after they had already flown. That is a yearly pen test against daily deploys. Security automation in CI is the permanent screening lane: every passenger, every bag, every time, with a fast scan that stops the dangerous items at the gate instead of discovering them later.
And just like airport screening, the value depends on tuning. A scanner that beeps at every belt buckle gets ignored, and a real weapon walks through. A security gate that floods developers with false alarms gets switched off or waved past — which is why triage is not optional, it is half the job.
4 The Four Scan Types
Security automation in CI is not one tool — it is a small family of scans, each answering a different question. A serious pipeline runs all four.
SAST — static application security testing
The question: does our own source code contain insecure patterns? SAST reads the code without running it, looking for known-dangerous constructs — SQL built by string concatenation, unescaped user input flowing into a page, weak cryptography, hard-coded logic that bypasses auth. It runs early because it only needs the source. Its weakness is false positives: it sees a risky pattern but cannot always tell whether the path is actually reachable.
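To make "reads the code without running it" concrete, here is a deliberately tiny SAST rule in Python. It uses the standard-library `ast` module. It is a sketch and does not reproduce any particular tool. The rule ID `sql-string-building`, the severity, and the heuristic are illustrative assumptions. The heuristic flags any `.execute(...)` call whose first argument is built with `+`, `%`, or an f-string.

```python
import ast

# Hypothetical rule ID and severity, for illustration only.
RULE_ID = "sql-string-building"


def builds_string_at_runtime(node: ast.AST) -> bool:
    """True if the expression is an f-string, or uses + / % to build a string."""
    if isinstance(node, ast.JoinedStr):  # f"SELECT ... {x}"
        return True
    # "SELECT ..." + x   or   "SELECT ... %s" % x
    return isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Add, ast.Mod))


def scan_source(source: str, filename: str = "<src>") -> list[dict]:
    """Parse source WITHOUT executing it and flag .execute() calls whose
    first argument is built dynamically."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "execute"
            and node.args
            and builds_string_at_runtime(node.args[0])
        ):
            findings.append({"rule": RULE_ID, "file": filename,
                             "line": node.lineno, "severity": "High"})
    return findings


sample = '''
def find_user(cursor, name):
    cursor.execute("SELECT * FROM users WHERE name = '" + name + "'")

def find_user_safe(cursor, name):
    cursor.execute("SELECT * FROM users WHERE name = %s", (name,))
'''

for finding in scan_source(sample, "users.py"):
    print(finding)
```

Only the concatenated query is flagged. The parameterised one passes. Notice what the rule cannot see: whether `name` ever comes from a user. That blind spot is exactly where SAST false positives come from.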
Dependency / SCA scanning
The question: are the third-party libraries we pull in known to be vulnerable? Software composition analysis (SCA) checks every direct and transitive dependency against public vulnerability advisories. This is the Kowhai Pay failure — the risk was not in code anyone wrote, it was in a library someone added. SCA is one of the highest-value, lowest-cost scans you can run, because most modern apps are mostly other people’s code.
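The core of SCA is a walk over the whole resolved dependency tree, not just the packages you listed yourself. Below is a minimal Python sketch under stated assumptions: a hypothetical advisory feed (`ADVISORIES`), made-up package names and advisory ID, and plain dotted version numbers. A real tool reads your lockfile, queries a public advisory database, and applies the ecosystem's full version-ordering rules.

```python
from dataclasses import dataclass, field


def parse_version(v: str) -> tuple[int, ...]:
    """'1.4.2' -> (1, 4, 2) so versions compare numerically.
    Assumes plain dotted integers; real ecosystems have richer rules."""
    return tuple(int(part) for part in v.split("."))


@dataclass
class Package:
    name: str
    version: str
    deps: list["Package"] = field(default_factory=list)


# Hypothetical advisory feed: name -> (first affected, first fixed, severity, id).
ADVISORIES = {
    "fastlog-core": ("1.0.0", "1.4.2", "Critical", "EXAMPLE-2024-0001"),
}


def scan_tree(root: Package) -> list[dict]:
    """Check direct AND transitive dependencies against the advisories."""
    findings = []

    def walk(pkg: Package, path: list[str]) -> None:
        here = path + [f"{pkg.name}@{pkg.version}"]
        advisory = ADVISORIES.get(pkg.name)
        if advisory:
            introduced, fixed, severity, adv_id = advisory
            if parse_version(introduced) <= parse_version(pkg.version) < parse_version(fixed):
                findings.append({"id": adv_id, "severity": severity,
                                 "fixed_in": fixed, "path": " > ".join(here)})
        for child in pkg.deps:
            walk(child, here)

    for direct in root.deps:
        walk(direct, [root.name])
    return findings


# The developer added only "tracing-logger"; the vulnerable library came with it.
app = Package("payments-api", "0.1.0", [
    Package("tracing-logger", "3.1.0", [Package("fastlog-core", "1.3.0")]),
    Package("http-client", "2.0.0"),
])

for finding in scan_tree(app):
    print(finding)
```

The finding's `path` shows how a library nobody added directly still reaches production. That is the Kowhai Pay failure in miniature.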
Secret scanning
The question: has anyone committed a credential into the repository? Secret scanning hunts for API keys, tokens, passwords, and private keys accidentally hard-coded into source or config. A leaked AWS key or a database password in a commit is a direct path to a breach. This scan should run on every commit and, ideally, as a pre-commit hook as well — once a secret hits the git history it must be treated as compromised and rotated, even after it is removed.
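Most secret scanners start from pattern matching over every line of every changed file. Below is a minimal Python sketch built on three illustrative regexes. The `AKIA` prefix is the one used by long-term AWS access key IDs, and `AKIAIOSFODNN7EXAMPLE` is the example key AWS publishes in its own documentation. The other patterns and the rule names are assumptions. Real scanners carry many more rules and usually add entropy checks for random-looking strings.

```python
import re

# Illustrative rules only; real scanners ship many more patterns and
# usually add an entropy check for long random-looking strings.
SECRET_PATTERNS = {
    # Long-term AWS access key IDs: "AKIA" followed by 16 characters.
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    # e.g. db_password = "..." or API_KEY: '...' with a value of 8+ characters
    "hardcoded-password": re.compile(
        r"""(?i)(?:password|passwd|secret|api_key)\w*\s*[:=]\s*["'][^"']{8,}["']"""
    ),
}


def scan_text(text: str, filename: str = "<text>") -> list[dict]:
    """Return one finding per (line, rule) match."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append({"rule": rule, "file": filename, "line": lineno})
    return findings


def exit_code(findings: list[dict]) -> int:
    """Any detected secret fails the commit or build (non-zero exit)."""
    return 1 if findings else 0


config = '''
region = "ap-southeast-2"
aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"
db_password = "hunter2-prod-2024"
'''

hits = scan_text(config, "settings.py")
for hit in hits:
    print(hit)
```

The same function could back both a pre-commit hook (run over staged files) and a CI step. Either way, a non-zero exit stops the change, and a real hit still means rotating the credential.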
DAST — dynamic application security testing
The question: when the app is actually running, can it be attacked? DAST tests the deployed, running application from the outside — the way an attacker would. A common pipeline entry point is the ZAP baseline scan: it spiders a running app and runs passive checks for issues like missing security headers, cookies without secure flags, and exposed endpoints, fast enough to sit in a pipeline. DAST needs a deployed target, so it runs later than the others, against a test environment.
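To make "passive checks" concrete, here is a tiny Python sketch of the kind of rule a baseline scan applies to each response it observes. It is not ZAP and does not call ZAP. The header list and messages are illustrative assumptions. It also inspects a single response you have already captured rather than crawling the app.

```python
from urllib.parse import urlparse

# A small, illustrative subset of headers a passive scan looks for.
REQUIRED_HEADERS = {
    "content-security-policy": "Missing Content-Security-Policy header",
    "x-content-type-options": "Missing X-Content-Type-Options header",
}


def passive_checks(url: str, headers: dict[str, str], set_cookies: list[str]) -> list[str]:
    """Inspect one captured response; sends no traffic of its own."""
    findings = []
    present = {name.lower() for name in headers}  # header names are case-insensitive
    for name, message in REQUIRED_HEADERS.items():
        if name not in present:
            findings.append(message)
    if urlparse(url).scheme == "https" and "strict-transport-security" not in present:
        findings.append("Missing Strict-Transport-Security header on HTTPS response")
    for cookie in set_cookies:
        cookie_name = cookie.split("=", 1)[0].strip()
        attrs = {part.strip().lower() for part in cookie.split(";")[1:]}
        if "secure" not in attrs:
            findings.append(f"Cookie '{cookie_name}' set without Secure flag")
        if "httponly" not in attrs:
            findings.append(f"Cookie '{cookie_name}' set without HttpOnly flag")
    return findings


resp_headers = {"Content-Type": "text/html", "X-Content-Type-Options": "nosniff"}
cookies = ["session=abc123; Path=/; HttpOnly"]
for finding in passive_checks("https://test.example.com/login", resp_headers, cookies):
    print(finding)
```

These checks need a real response to inspect, which is why DAST has to wait for a deployed test environment.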
5 Where Each Scan Fits in the Pipeline
The order matters. Cheap, fast checks that need no deployment go first, so a bad change fails in seconds rather than after a ten-minute build and deploy. The principle is “fail fast, fail cheap” — the same one that governs ordering unit tests before end-to-end tests.
Stage | Scan | When it runs | Why here
Pre-commit | Secret scan | Before code leaves the laptop | Stops a key reaching the repo at all
Commit / PR | Secret scan | On every push | Catches what slipped past pre-commit
Commit / PR | SAST | On every push | Needs only source; fast feedback
Build | SCA / dependency | After dependencies resolve | Now the full dependency tree is known
Post-deploy | DAST (ZAP baseline) | After deploy to test env | Needs a running target to attack
Two practical points. First, put the scans that need nothing but source — secret scanning and SAST — as early as possible, ideally on the pull request, so a reviewer sees the result before merge. Second, the DAST baseline scan runs against a deployed test environment, never against production, and never against an environment with real customer data in it.
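The ordering rule amounts to a loop that runs stages in sequence and stops at the first failure. In this hypothetical Python sketch, each stage is a stand-in lambda rather than a real scanner call. The pre-commit hook is modelled as a stage, even though it really runs on the developer's machine.

```python
from typing import Callable

# (stage name, check) where check() returns True if the stage passed.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order; stop at the first failure (fail fast, fail cheap)."""
    ran = []
    for name, check in stages:
        ran.append(name)
        if not check():
            print(f"FAILED at {name}; later stages skipped")
            return False, ran
    return True, ran


# Hypothetical stand-ins for real scanner invocations.
stages: list[Stage] = [
    ("pre-commit: secret scan", lambda: True),
    ("PR: secret scan + SAST", lambda: False),  # e.g. SAST reported a High
    ("build: SCA", lambda: True),
    ("deploy to test env", lambda: True),
    ("post-deploy: ZAP baseline DAST", lambda: True),
]

ok, ran = run_pipeline(stages)
print(ok, ran)
```

Because the PR stage fails, the build, deploy and DAST stages never run. The bad change is stopped in seconds, before anything is deployed.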
6 Failing the Build on Findings
A scanner that only writes a report no one reads is theatre. The point of putting security in CI is that the pipeline can stop — a real finding fails the build and the change does not ship until it is dealt with. But fail on everything and the team drowns; fail on nothing and you are back to a report no one reads. The art is the threshold.
The usual approach is severity-gated. Decide as a team, and write down, which severities break the build:
- Critical and High: fail the build. A known-exploitable dependency or a hard-coded secret should never reach production silently.
- Medium: warn and track, often allowed to pass but raised as a ticket with an owner and a due date, so it does not accumulate forever.
- Low / informational: logged for visibility, not a gate.
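Severity gating ultimately comes down to the scan step's exit code, because that is what CI uses to pass or fail a job. Below is a minimal Python sketch of the policy above. It assumes findings arrive as dictionaries with hypothetical `id`, `severity` and optional `kind` keys. Real scanners each have their own report format.

```python
FAIL_ON = {"Critical", "High"}
WARN_ON = {"Medium"}


def gate(findings: list[dict]) -> int:
    """Return an exit code: non-zero fails the CI job.
    Assumed policy: Critical/High and any secret fail; Medium warns; Low/Info logs."""
    blocking = 0
    for finding in findings:
        sev, fid = finding["severity"], finding["id"]
        if sev in FAIL_ON or finding.get("kind") == "secret":
            blocking += 1
            print(f"BLOCK  {sev:<8} {fid}")
        elif sev in WARN_ON:
            print(f"WARN   {sev:<8} {fid} (raise a tracked ticket)")
        else:
            print(f"INFO   {sev:<8} {fid}")
    return 1 if blocking else 0


print(gate([{"id": "EXAMPLE-1", "severity": "Medium"},
            {"id": "EXAMPLE-2", "severity": "Low"}]))
```

In a CI step, the result would be handed to the process, for example `sys.exit(gate(findings))`. A blocking finding then turns the job red instead of just appearing in a report.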
Two refinements keep this workable. A baseline lets you adopt scanning on an existing app without the first run failing on hundreds of pre-existing findings — you record the current state as accepted, and the gate only fails on new findings introduced by this change. And a documented, time-boxed exception process (often called a suppression or waiver, with an expiry date and an approver) handles the rare case where a High finding is genuinely accepted as a risk — with a name against it, not a silent skip.
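The baseline and waiver logic can be sketched in Python under two assumptions. First, a finding's identity (its "fingerprint") is rule + file + offending snippet. Second, a waiver is a small record with an approver, a reason and an expiry date. Real tools define their own fingerprint and suppression formats. Keying on the snippet rather than the line number is a deliberate choice: unrelated edits that shift lines should not turn old findings into "new" ones. In a pipeline, this filter would run before the severity check.

```python
from datetime import date


def fingerprint(finding: dict) -> str:
    """Assumed identity for a finding: rule + file + offending snippet."""
    return f"{finding['rule']}|{finding['file']}|{finding['snippet'].strip()}"


def blocking_findings(findings: list[dict], baseline: set[str],
                      waivers: dict[str, dict], today: date) -> list[dict]:
    """Keep only findings that are NEW (not in the baseline) and not covered
    by an approved waiver that is still within its expiry date."""
    blocking = []
    for finding in findings:
        fp = fingerprint(finding)
        if fp in baseline:
            continue  # pre-existing; accepted when scanning was adopted
        waiver = waivers.get(fp)
        if waiver and waiver.get("approver") and today <= waiver["expires"]:
            continue  # risk accepted, with a name and an end date
        blocking.append(finding)
    return blocking


# Hypothetical findings, baseline and waiver.
legacy = {"rule": "weak-hash", "file": "auth.py", "snippet": "md5(pw)"}
new = {"rule": "sql-string-building", "file": "users.py", "snippet": "execute(q + name)"}
waived = {"rule": "outdated-tls", "file": "gateway.py", "snippet": "TLSv1_1"}

baseline = {fingerprint(legacy)}
waivers = {fingerprint(waived): {"approver": "security-lead", "reason": "Partner API migration",
                                 "expires": date(2025, 3, 31)}}

print(blocking_findings([legacy, new, waived], baseline, waivers, date(2025, 1, 15)))
```

After the waiver's expiry date, the waived finding blocks again. That is the point at which it comes back for review.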
7 Triaging False Positives
Every security scanner produces false positives — findings that are technically flagged but not a real risk in your context. SAST is the worst offender: it sees a dangerous pattern but cannot always tell whether user input ever actually reaches it. If you treat every finding as a real bug, developers learn the gate cries wolf and stop trusting it. Triage is how you keep the signal high.
A simple, repeatable triage for each finding:
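1. Is it real? Check whether the flagged code is actually reachable with untrusted input; SAST often flags a dangerous pattern that user input never reaches. → If it is not, it is a false positive.
2. Is it exploitable here? A vulnerable function in a dependency you import but never call may not be exploitable in your usage. → Still patch when you can, but it may not warrant breaking the build today.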
3. If false, suppress it — visibly. Record the suppression in config with a reason and an author, so the next person sees why it was dismissed. Never silence a whole rule to kill one false positive.
4. If real, fix or risk-accept. Patch the dependency, fix the code, or raise a documented, time-boxed exception with an owner.
The discipline that separates a senior QE here: a false positive is suppressed with a recorded reason, never by disabling the rule wholesale. Turning off the rule to clear one noisy alert blinds you to every real instance of that rule forever — the single most common way a security gate quietly stops protecting anything.
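Recording a suppression "visibly" can be as simple as a checked-in list in which every entry names one finding, a reason and an author. Below is a minimal Python sketch. The suppression format, keyed on rule, file and line, is a hypothetical assumption; many tools use inline comments or fingerprints instead. Nothing here turns the rule off, so the same rule still fires everywhere else.

```python
REQUIRED_FIELDS = {"rule", "file", "line", "reason", "author"}


def apply_suppressions(findings: list[dict], suppressions: list[dict]) -> list[dict]:
    """Drop only the exact findings named by a complete suppression entry.
    Entries without a reason and an author are rejected outright."""
    suppressed = set()
    for entry in suppressions:
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            raise ValueError(f"suppression {entry} is missing {sorted(missing)}")
        suppressed.add((entry["rule"], entry["file"], entry["line"]))
    return [f for f in findings if (f["rule"], f["file"], f["line"]) not in suppressed]


findings = [
    {"rule": "sql-string-building", "file": "reports.py", "line": 42},
    {"rule": "sql-string-building", "file": "users.py", "line": 3},
]
suppressions = [{
    "rule": "sql-string-building", "file": "reports.py", "line": 42,
    "reason": "Query uses only a fixed allow-list of column names; no user input reaches it",
    "author": "qe-team",
}]

print(apply_suppressions(findings, suppressions))
```

The same rule still catches the real issue in `users.py`. Disabling the rule would have hidden it.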
8 Audit-Ready Evidence
Security in CI produces something a yearly pen test cannot: a continuous, dated record that every change was scanned. For NZ teams handling payments, health, or government data, that record is exactly what an auditor or a customer’s security questionnaire asks for. Make it deliberate:
- Scan results retained per build: the SAST, SCA, secret, and DAST reports stored against the build that produced them, not just the latest run.
- The gate decision recorded: what severities fail the build, captured in pipeline config under version control, so the policy itself is auditable.
- Suppressions and exceptions traceable: every dismissed finding has a reason, an author, and (for risk-accepted Highs) an expiry and approver.
- Trend over time: findings opened versus closed, so you can show the security debt is shrinking, not quietly growing.
This is the bridge between security and the CI/CD work you already do. The pipeline is not just where tests run — it is the system of record that proves, build by build, that security was checked. “We scan on every deploy and here are the reports” is a far stronger answer than “we had a pen test in February.”
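Below is a minimal Python sketch of a per-build evidence record and an opened-versus-closed trend. The field names, build ID and commit hash are hypothetical. In a real pipeline, the JSON would be saved as a build artifact next to the raw scanner reports.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def evidence_record(build_id: str, commit: str, policy: dict,
                    findings: list[dict], suppressions: list[dict], passed: bool) -> dict:
    """Assemble one per-build evidence record (field names are assumptions)."""
    counts = Counter((f["scan"], f["severity"]) for f in findings)
    return {
        "build_id": build_id,
        "commit": commit,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "policy": policy,  # the gate itself, so the policy is auditable
        "findings_by_scan": {f"{scan}:{sev}": n for (scan, sev), n in sorted(counts.items())},
        "suppressions": suppressions,
        "gate_passed": passed,
    }


def trend(previous_ids: set[str], current_ids: set[str]) -> dict[str, int]:
    """Findings opened vs closed between two builds."""
    return {"opened": len(current_ids - previous_ids),
            "closed": len(previous_ids - current_ids)}


record = evidence_record(
    build_id="build-1287", commit="3f2c9ab",
    policy={"fail_on": ["Critical", "High", "secret"], "warn_on": ["Medium"]},
    findings=[{"scan": "sca", "severity": "Medium"},
              {"scan": "sast", "severity": "Low"},
              {"scan": "sast", "severity": "Low"}],
    suppressions=[{"rule": "sql-string-building", "file": "reports.py",
                   "reason": "No user input reaches the query", "author": "qe-team"}],
    passed=True,
)
print(json.dumps(record, indent=2))
print(trend({"A", "B", "C"}, {"B", "D"}))
```

Storing one record per build, rather than overwriting the latest, is what lets you answer "was this change scanned?" for any past deploy.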
9 Common Mistakes
🚫 Running the scan but never failing the build
Why it happens: Teams add a scanner that writes a report and feel covered, without wiring it to the exit code.
The fix: A report no one reads protects nothing. Gate the build on Critical/High and secrets so a real finding actually stops the deploy. The pipeline must be able to say no.
🚫 Turning on full strictness against a legacy codebase on day one
Why it happens: Strict feels responsible, and the first scan looks alarming, so the instinct is to block everything.
The fix: Hundreds of pre-existing findings on the first run get the gate switched off within a week. Record a baseline so only new findings fail the build, then tighten over time.
🚫 Killing a false positive by disabling the whole rule
Why it happens: One noisy alert is annoying, and turning off the rule makes it go away instantly.
The fix: Disabling the rule blinds you to every real instance of it from then on. Suppress the single finding with a recorded reason and author, and leave the rule active.
🚫 Running DAST against production or real data
Why it happens: Production is the “real” environment, so it seems like the most honest place to scan.
The fix: An active scan can corrupt data, trip alerts, and breach customer privacy. Run the DAST baseline against a deployed test environment with no real customer data, never against production.
10 Now You Try
Three graded exercises across the scan types and the pipeline. Write your answer, then compare it to the model answer.
Read the description of a fictional NZ insurance API team’s pipeline below. Identify 3 security automation gaps and name the scan type that would close each.
The team deploys to production several times a day. The pipeline runs unit tests and integration tests, and that is all that gates a deploy. Dependencies are pulled fresh on every build and updated whenever a developer feels like it; no one checks them against advisories. A developer once committed a database password to fix something quickly and removed it the next day. Security is covered by a penetration test booked once a year. The running app has never been scanned for missing security headers or exposed endpoints.
Model answer:
There are at least four real gaps; any three, well explained, earn full marks.
1. SCA / dependency scanning: Dependencies are pulled fresh and never checked against advisories. A known-vulnerable library could ship to production unnoticed (the classic transitive-dependency breach). Add SCA gated on Critical/High at the build stage.
2. Secret scanning: A password was committed and only removed the next day; nothing scans for credentials. Once a secret is in git history it must be rotated, not just deleted. Add secret scanning on every commit and, ideally, a pre-commit hook.
3. DAST (ZAP baseline): The running app has never been scanned for missing security headers or exposed endpoints. Add a baseline DAST scan post-deploy against a test environment.
Bonus gap: SAST. The team's own source is never scanned for insecure patterns (e.g. SQL built by concatenation). Add SAST on the pull request.
The theme: a yearly pen test cannot keep up with several deploys a day. Every gap above is a check that belongs in the pipeline, running on each change.
The build-failure policy below is unworkable. Rewrite it into a sensible severity-gated policy for a fictional NZ government services portal that is adopting scanning on an existing, legacy codebase. Cover: which severities fail the build, how to handle existing findings, the false-positive process, and the exception process.
“Fail the build on any finding of any severity. If there are too many, disable the rules that fire most.”
Model answer:
Build-failure thresholds: Fail the build on Critical and High findings and on any detected secret. Medium = warn and raise a tracked ticket with an owner and due date (build passes). Low / informational = logged for visibility, not a gate.
Existing / pre-existing findings: Record a baseline of the current state as accepted on adoption, so the first run does not fail on hundreds of legacy findings. The gate fails only on NEW findings introduced by a change. Burn down the baseline over time as a separate, tracked piece of work.
False-positive handling: Triage each finding (is it reachable and exploitable here?). If it is genuinely false, suppress that single finding in config with a recorded reason and author. Never disable the whole rule to silence one alert, because that hides every real future instance.
Exception (risk-accept) process: For a real High that the business chooses to accept, raise a documented, time-boxed waiver with an approver and an expiry date, never a silent skip. It reappears at expiry for review.
Why this works: it gates hard on what actually matters (Critical/High + secrets), it lets a legacy codebase adopt scanning without the gate being switched off, and it keeps every suppression and exception named and auditable. The original failed on all three: gating on everything is unworkable, and disabling noisy rules blinds you to real issues.
Design the security stages of a CI/CD pipeline for a fictional NZ fintech mobile-banking backend that deploys several times a day. For each of the 5 stages, give: the stage, the scan(s) that run there, and a one-line reason for the placement. Cover secret scanning, SAST, SCA, and a ZAP baseline DAST scan.
Model answer:
Stage 1 | Scan(s): Secret scan (pre-commit hook) | Reason: Stop a credential before it ever leaves the developer's machine and enters git history.
Stage 2 | Scan(s): Secret scan + SAST (on the pull request) | Reason: Both need only source, so they run fast and a reviewer sees results before merge; the secret scan here is the backstop for anyone who skipped the hook.
Stage 3 | Scan(s): SCA / dependency scan (build stage) | Reason: After dependencies resolve, the full direct-and-transitive tree is known and can be checked against advisories; gate on Critical/High.
Stage 4 | Scan(s): None; deploy to test environment | Reason: DAST needs a running target, so the app must first be deployed to a non-production test env with no real customer data.
Stage 5 | Scan(s): ZAP baseline DAST scan (post-deploy) | Reason: With the app running, scan it from the outside for missing security headers, insecure cookies, and exposed endpoints.
Strong plans follow "fail fast, fail cheap": the checks that need only source (secret, SAST) run earliest, SCA runs once dependencies are resolved, and DAST runs last because it needs a deployed target (and never against production). Weak plans put DAST early (impossible, since nothing is running yet) or never fail the build.
11 Self-Check
Q1: Why is a yearly penetration test not enough for a team that deploys daily?
Because a yearly check cannot keep up with daily change — a vulnerability introduced the day after the pen test can sit in production for months before the next one finds it. Security automation in CI runs the checks on every change, so feedback is continuous and the build itself is the gate.
Q2: Name the four scan types and the question each answers.
SAST — does our own source contain insecure patterns? SCA / dependency — are our third-party libraries known to be vulnerable? Secret scanning — has a credential been committed? DAST — can the running app be attacked from the outside?
Q3: Why does DAST run later in the pipeline than SAST, SCA, and secret scanning?
Because DAST tests a running application from the outside, so it needs a deployed target — it cannot run until the app is deployed to a test environment. SAST, SCA, and secret scanning need only the source or dependency list, so they run early and fast under “fail fast, fail cheap.”
Q4: What is a baseline, and why does it matter when adopting scanning on a legacy codebase?
A baseline records the current set of findings as accepted, so the gate fails only on new findings introduced by a change — not on the hundreds of pre-existing ones. Without it, the first run on a legacy app fails on everything and the team switches the gate off within a week.
Q5: What is the right way to handle a false positive, and what should you never do?
Suppress the single finding in config with a recorded reason and author, leaving the rule active. Never disable the whole rule to silence one noisy alert — that blinds you to every real future instance of it, which is the most common way a security gate quietly stops protecting anything.
12 Interview Prep
Real questions asked in NZ QA interviews for senior automation and DevSecOps roles. Read the model answers, then practise your own version.
“We have a yearly pen test. Why would you add security scanning to the pipeline as well?”
Because a yearly pen test cannot keep up with daily deploys — a vulnerable dependency or a committed secret can ship the day after the test and sit in production until the next one. I’d add the checks that run on every change: secret scanning on every commit, SAST and SCA on the pull request and build, and a ZAP baseline DAST scan after deploy to a test environment. I’d gate the build on Critical/High and any secret, warn on Medium, and keep the results as per-build evidence. The pen test still adds value as a deeper, human-led check — the pipeline scanning is the continuous backstop between them.
“A developer says the security gate is too noisy and wants it turned off. What do you do?”
I’d treat noise as a tuning problem, not a reason to switch off protection. First, check we’re gating only on what matters — Critical/High and secrets — and warning rather than failing on Medium. Second, if we’re adopting on a legacy codebase, set a baseline so only new findings fail the build. Third, triage the noisy findings: real ones get fixed or a time-boxed waiver, false ones get suppressed individually with a recorded reason — never by disabling the whole rule. A gate the team trusts and keeps is worth far more than a strict one they bypass.
“Where would you place a DAST scan in the pipeline, and what would you scan against?”
DAST tests a running application from the outside, so it has to run after a deploy — I’d place a ZAP baseline scan post-deploy, against a dedicated test environment with no real customer data, never against production. A baseline scan spiders the app and runs passive checks for things like missing security headers and insecure cookies, fast enough to live in a pipeline. The scans that need only source — secret scanning, SAST — run much earlier, with SCA at the build stage once dependencies resolve. The ordering follows fail-fast: cheap, deployment-free checks first.