Senior · Backend Debugging

Reading Logs & Stack Traces

A user saw a 500 error. The cause is buried in the logs. You have 2 minutes to find the root cause, understand why it happened, and report it to the developer. Learn to read logs like a detective.

Senior ISTQB CTAL-TA 3.3 (Test Tooling) — K3 Apply ~12 min read + exercise

1 The Hook — Why This Matters

A NZ government agency runs a passport renewal portal. One user clicks "Submit Application" and sees a blank page. No error message. Just a timeout. The support team gets a ticket. The QA manager says "It worked in testing." The developers dig through 2 million log lines from yesterday looking for that one user's transaction.

If the QA team had known how to read logs, they would have pulled that user's request ID from the portal's receipt page, searched Kibana for that ID, and found the exact error in 90 seconds: "Database connection pool exhausted at 2024-04-24 14:32:15 UTC." From there, the developer could see the problem was a missing database index causing slow queries, not a bug in the application logic.

Senior testers do not wait for developers to dig through logs. You learn to read them yourself. You become the first line of triage. You separate "user typo" from "real bug" in minutes, not hours.

Senior engineer insight

The single biggest shift in how I read logs was realising the error message is rarely the actual problem — it is where the system gave up. A NullPointerException in a payment service is almost always a data integrity issue two steps upstream: an optional field the UI let through, a migration that never backfilled a column, or a third-party API that returned a partial response when it should have returned nothing at all. Once I started reading the five lines before the ERROR entry instead of just the ERROR line itself, my diagnosis time dropped from 30 minutes to 5.

Most common senior tester mistake: jumping straight to the stack trace and filing a bug against the class that threw the exception, rather than asking what state the system was in when it got there.

From the field

A government agency in Wellington was running a benefit payments portal on AWS, with logs shipped to CloudWatch. The team assumed intermittent timeout errors were a network issue — their ops team had been chasing it for three weeks. When the QA lead finally filtered CloudWatch Insights by request_id instead of searching for the word "timeout," the picture changed entirely: every failing request was hitting the same downstream SOAP service call at exactly the 29-second mark, while the load balancer's idle timeout was set to 30 seconds. The ops team had been looking at infrastructure; the answer was a single misconfigured integration timeout in the application layer. The lesson: log aggregation tools like CloudWatch Insights and Splunk (widely deployed in NZ enterprise) give you query power — but only if you filter by correlation ID rather than keyword, and look at the timing deltas between entries, not just the error line.

2 The Rule — The One-Sentence Version

Every error has a story in the logs: find the request ID, follow the breadcrumb trail, and trace from user action to root cause.

Logs are not just for developers. They are your detective notebook. Every system event, error, and timing information is recorded. Your job is to learn the language.

3 The Analogy — Think Of It Like...

Analogy

A crime scene investigation with a case file.

The user's error (blank page) is the crime: the symptom. The logs are the case file: every witness statement, every clue, every timestamp. You do not guess what happened; you read the evidence. The request ID is your case number, tying all clues together. The timestamp tells you when. The error message tells you what. The stack trace shows you the chain of events (what called what) leading to the crime. You follow the breadcrumbs from the user's click all the way to the root cause: "NullPointerException at OrderService.java:42: customer.email is null."

4 Watch Me Do It — Step by Step

Here is the systematic approach to diagnosing a backend error using logs and stack traces.

Understand the error symptom: The user saw a blank page, a 500 status code, or a timeout. Record the exact time it happened and what the user was doing (clicking "Submit," viewing a report, etc.). This is your starting clue.
Find the request ID in the application UI: Many applications display a receipt, confirmation, or error page with a request ID or transaction ID. If not, check the browser DevTools Network tab for the request. The request ID is your lifeline; every log entry for that request will have it.
Open the log aggregation tool (Kibana, Datadog, CloudWatch): Search for the request ID. You will see every log entry related to that single request. This eliminates noise from other users' requests.
Read the log levels and timestamps: DEBUG (most verbose), INFO (normal operation), WARN (something unexpected but not fatal), ERROR (failure), FATAL (system down). Look at the timeline: INFO entries show the request's path. ERROR entries show where it failed.
Find the error message: Look for the first ERROR or FATAL log. The message is usually one of: database error, network timeout, authentication failure, validation error, or null pointer exception. Read the exact wording.
Read the stack trace: The stack trace shows the chain of method calls that led to the error. Read from top to bottom (the most recent frame is first). Each line shows: class, method, file, and line number. The topmost frame inside your application code (not a framework) is usually the root cause.
Check surrounding logs for context: Look at logs from 5 seconds before the error. What was the application doing? Was it waiting for a database, calling an external API, validating input? The surrounding logs give you context.
Determine if this is a bug or user error: Is the error a validation failure ("Email is required") or a system crash (NullPointerException)? Validation is usually user error. Crashes are bugs. Report accordingly.
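
The search-and-scan part of these steps can be sketched in a few lines of Python. This is a minimal illustration, not how Kibana or Datadog work internally: it assumes log lines shaped like the examples in this lesson (timestamp, bracketed level, then key=value fields), and the function names and the second, unrelated log line are hypothetical.

```python
import re

# Lines copied from the payment example in this lesson, plus one unrelated
# (hypothetical) line from another request to show the filtering effect.
LOG_LINES = [
    "2024-04-24T14:32:15.123Z [INFO] user_id=5421 request_id=req-9f4d2c8e request received: POST /api/v1/orders/5678/pay",
    "2024-04-24T14:32:16.000Z [INFO] user_id=9999 request_id=req-00000000 request received: GET /api/v1/health",
    "2024-04-24T14:32:18.123Z [WARN] user_id=5421 request_id=req-9f4d2c8e stripe timeout after 2.5 seconds",
    "2024-04-24T14:32:18.234Z [ERROR] user_id=5421 request_id=req-9f4d2c8e StripeTimeoutException: Payment processing took too long",
]

# Assumed line shape: "<timestamp> [<LEVEL>] <rest of the line>"
LINE_PATTERN = re.compile(r"^(?P<ts>\S+) \[(?P<level>[A-Z]+)\] (?P<rest>.*)$")


def entries_for_request(lines, request_id):
    """Return (timestamp, level, rest) tuples for one request ID, in log order."""
    token = f"request_id={request_id}"
    result = []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match and token in match.group("rest").split():
            result.append((match.group("ts"), match.group("level"), match.group("rest")))
    return result


def first_failure(entries):
    """Return the first ERROR or FATAL entry, or None if there is none."""
    for entry in entries:
        if entry[1] in ("ERROR", "FATAL"):
            return entry
    return None


timeline = entries_for_request(LOG_LINES, "req-9f4d2c8e")
failure = first_failure(timeline)
print(failure)
```

Filtering on the exact request_id token, rather than a keyword such as "timeout", is what keeps the other user's request out of the timeline.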

Real Example: 500 Error on Payment Submit

Scenario: A user tried to pay for an order and got a 500 error. Here are the logs.

2024-04-24T14:32:15.123Z [INFO] user_id=5421 request_id=req-9f4d2c8e request received: POST /api/v1/orders/5678/pay
2024-04-24T14:32:15.245Z [INFO] user_id=5421 request_id=req-9f4d2c8e order_id=5678 validating payment details
2024-04-24T14:32:15.312Z [INFO] user_id=5421 request_id=req-9f4d2c8e payment method: card ending in 4242
2024-04-24T14:32:15.456Z [DEBUG] user_id=5421 request_id=req-9f4d2c8e attempting stripe charge: amount=15000 currency=NZD
2024-04-24T14:32:18.123Z [WARN] user_id=5421 request_id=req-9f4d2c8e stripe timeout after 2.5 seconds
2024-04-24T14:32:18.234Z [ERROR] user_id=5421 request_id=req-9f4d2c8e StripeTimeoutException: Payment processing took too long
2024-04-24T14:32:18.245Z [ERROR] user_id=5421 request_id=req-9f4d2c8e javax.net.ssl.SSLException: Connection reset by peer
2024-04-24T14:32:18.256Z [FATAL] user_id=5421 request_id=req-9f4d2c8e Response: 500 Internal Server Error

Diagnosis: The logs tell the story. The request came in at 14:32:15. Validation passed. Stripe was called at 14:32:15.456Z. At 14:32:18.123Z (about 2.7 seconds later), the application logged that Stripe had exceeded its 2.5-second timeout. The timeout was not handled gracefully: two ERROR entries follow, and the user got a 500.
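
The timing gaps between entries can be computed rather than eyeballed. The sketch below assumes ISO-8601 UTC timestamps with millisecond precision, as in the payment logs above; the helper names are hypothetical.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a UTC timestamp such as 2024-04-24T14:32:15.456Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def gaps(timestamps):
    """Return the seconds elapsed between each consecutive pair of timestamps."""
    parsed = [parse_ts(ts) for ts in timestamps]
    return [round((b - a).total_seconds(), 3) for a, b in zip(parsed, parsed[1:])]


# Timestamps of the eight entries in the payment example, in order.
PAYMENT_TIMESTAMPS = [
    "2024-04-24T14:32:15.123Z",
    "2024-04-24T14:32:15.245Z",
    "2024-04-24T14:32:15.312Z",
    "2024-04-24T14:32:15.456Z",
    "2024-04-24T14:32:18.123Z",
    "2024-04-24T14:32:18.234Z",
    "2024-04-24T14:32:18.245Z",
    "2024-04-24T14:32:18.256Z",
]

payment_gaps = gaps(PAYMENT_TIMESTAMPS)
largest_gap = max(payment_gaps)
print(largest_gap, "seconds before entry", payment_gaps.index(largest_gap) + 2)
```

Every gap is well under a second except the one between the Stripe call (entry 4) and the WARN entry (entry 5). The measured gap is slightly longer than the 2.5 seconds named in the log message, which is normal: the message reports the timeout that was exceeded, while the timestamps show when each line was written.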

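Root cause: The Stripe API was slow or unreachable, so the trigger is a network issue with the payment processor rather than faulty payment logic. The application still has a gap worth reporting: it let the timeout surface as a raw 500 instead of handling it.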

Report to developer: "Payment timeout on Stripe at 2024-04-24 14:32:18 UTC (request_id: req-9f4d2c8e). Stripe took > 2.5 seconds and returned a connection reset. Application returned 500. Recommendation: increase timeout, implement retry logic, and handle Stripe timeouts gracefully."

Pro tip: Always include the request ID, timestamp, and exact error message in your bug report. Developers can search logs for that request and see the full context instantly. Never say "The payment page is broken." Always say "Payment timed out at 2024-04-24 14:32:18Z (request_id: req-9f4d2c8e) with StripeTimeoutException: Payment processing took too long, followed by SSLException: Connection reset by peer."
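
One way to make that habit stick is a small template that refuses to produce a report without the key fields. The class and field names below are hypothetical, chosen only for illustration; the values come from the payment example.

```python
from dataclasses import dataclass


@dataclass
class BackendBugReport:
    """Minimum fields a backend bug report should carry (hypothetical template)."""
    request_id: str
    timestamp: str
    action: str
    error_message: str
    recommendation: str = ""

    def summary(self):
        """Render the one-line summary a developer can search the logs with."""
        text = (
            f"{self.action} failed at {self.timestamp} "
            f"(request_id: {self.request_id}) with {self.error_message}."
        )
        if self.recommendation:
            text += f" Recommendation: {self.recommendation}"
        return text


report = BackendBugReport(
    request_id="req-9f4d2c8e",
    timestamp="2024-04-24T14:32:18Z",
    action="Payment submit",
    error_message="StripeTimeoutException: Payment processing took too long",
    recommendation="handle Stripe timeouts gracefully instead of returning a 500.",
)
print(report.summary())
```

Because request_id, timestamp, action, and error_message have no defaults, leaving any of them out raises a TypeError when the report is created.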

Second Example: NullPointerException in Login

A user tried to log in and saw a blank error page. No useful message.

2024-04-24T16:45:22.567Z [INFO] request_id=req-8b3f1a2c request received: POST /api/v1/auth/login
2024-04-24T16:45:22.678Z [INFO] request_id=req-8b3f1a2c validating email format
2024-04-24T16:45:22.789Z [INFO] request_id=req-8b3f1a2c querying user database
2024-04-24T16:45:22.912Z [INFO] request_id=req-8b3f1a2c user found: id=8234 email=john@example.com
2024-04-24T16:45:22.934Z [DEBUG] request_id=req-8b3f1a2c comparing password hash
2024-04-24T16:45:23.045Z [ERROR] request_id=req-8b3f1a2c NullPointerException in AuthService.java line 127
2024-04-24T16:45:23.056Z [ERROR] request_id=req-8b3f1a2c at AuthService.validatePassword(AuthService.java:127)
2024-04-24T16:45:23.067Z [ERROR] request_id=req-8b3f1a2c at LoginController.login(LoginController.java:42)
2024-04-24T16:45:23.078Z [FATAL] request_id=req-8b3f1a2c Response: 500 Internal Server Error

Diagnosis: The user was found in the database. The password hash comparison started. At line 127 of AuthService.java, a NullPointerException was thrown. This means something the code expected to be an object was actually null.

Stack trace reading: The top frame is AuthService.validatePassword line 127. This is where the exception was thrown, so start there. Ask a developer: "What is null at AuthService.java:127? User password? Salt? Hash function result?" The developer can look at that line immediately.

Hypothesis: The user's password was never hashed (database corruption or migration issue). When validatePassword tried to compare a null password hash to the user's input, it crashed.

Report: "Login returns 500 for user id=8234 at 2024-04-24 16:45:23Z (request_id: req-8b3f1a2c). NullPointerException in AuthService.java:127 during password validation. Suggests user password_hash is null in the database. Check user id=8234 password_hash column."
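
If you have read access to a test copy of the database, the hypothesis can be checked directly. The sketch below uses an in-memory SQLite database so it runs anywhere; the table name, column names, and the second user row are assumptions made for illustration, not the real schema.

```python
import sqlite3

# Hypothetical schema and data standing in for the real users table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, password_hash TEXT)"
)
conn.executemany(
    "INSERT INTO users (id, email, password_hash) VALUES (?, ?, ?)",
    [
        (8234, "john@example.com", None),  # the user from the logs, hash missing
        (8235, "jane@example.com", "hypothetical-hash-value"),
    ],
)


def users_missing_hash(connection):
    """Return the IDs of users whose password_hash column is NULL."""
    rows = connection.execute(
        "SELECT id FROM users WHERE password_hash IS NULL ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]


missing = users_missing_hash(conn)
print(missing)
```

Querying for every NULL hash, not just user 8234, also tells the developer whether this is one corrupted record or a migration that skipped many users.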

Log severity levels quick reference

Level	What it means	Example	Severity
DEBUG	Detailed information for developers (very verbose)	"Attempting to fetch user from cache"	Lowest
INFO	Normal operation milestones	"User logged in successfully"	Low
WARN	Something unexpected but not critical	"Retry attempt 2 of 3 for database connection"	Medium
ERROR	Error occurred but app is still running	"Failed to send email notification"	High
FATAL	Critical failure, app may crash or be unavailable	"Database connection pool exhausted"	Critical
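
The ordering in this table is what log tools use when you filter by "WARN and above." A minimal sketch, using the five level names and example messages from the table; the function name is hypothetical.

```python
# Levels in ascending severity, as listed in the table above.
SEVERITY_ORDER = ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"]
RANK = {level: index for index, level in enumerate(SEVERITY_ORDER)}


def at_or_above(entries, threshold):
    """Keep (level, message) pairs whose level is at least `threshold`."""
    floor = RANK[threshold]
    return [(level, msg) for level, msg in entries if RANK.get(level, -1) >= floor]


EXAMPLES = [
    ("DEBUG", "Attempting to fetch user from cache"),
    ("INFO", "User logged in successfully"),
    ("WARN", "Retry attempt 2 of 3 for database connection"),
    ("ERROR", "Failed to send email notification"),
    ("FATAL", "Database connection pool exhausted"),
]

warn_and_up = at_or_above(EXAMPLES, "WARN")
for level, message in warn_and_up:
    print(level, message)
```

Level names vary slightly between logging libraries. Python's standard logging module, for example, calls these two levels WARNING and CRITICAL rather than WARN and FATAL.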

5 When to Use It / When NOT to Use It

✅ Read logs when...

A user sees a 500 error, timeout, or blank page
A test passes locally but fails in staging/production
An API endpoint returns an unexpected response code
A performance issue occurs (slow page load, timeouts)
You are investigating intermittent failures
A feature works for some users but not others

❌ Logs are not useful for...

Purely visual bugs (CSS colour, alignment)
Frontend JavaScript errors (check browser console instead)
Features that work as designed (no error occurred)
User confusion about workflow (not a system error)
Systems where logging is not configured (there is nothing to read)

📋 Checklist: Do You Need to Read Logs?

Is there a request ID or transaction ID? If yes, you can trace the request through the system.
Do you have access to logs? Kibana, Datadog, CloudWatch, or local log files.
Is there a clear error message or status code? (5xx, timeout, exception)
Can you reproduce the issue or know the exact time it occurred? Logs are searchable by timestamp.

6 Common Mistakes — Don't Do This

🚫 Searching logs without a request ID

I used to think: I'll search for "error" and browse all errors from that day.
Actually: That is millions of log entries. A request ID is your filter. Always ask the user for a receipt/transaction ID, timestamp, or have them note the exact time the error happened. Then search Kibana for that ID. This reduces noise from 1,000,000 entries to 5–10 entries from one request.

🚫 Confusing a symptom with a root cause

I used to think: The error says "NullPointerException" so there is a null pointer bug.
Actually: NullPointerException is the symptom. Why is the pointer null? Was it never initialized? Did a database query fail silently? Did a third-party API not return expected data? The logs tell you why. Read the surrounding logs, not just the error line.

🚫 Ignoring the stack trace order

I used to think: The first line of the stack trace is the root cause.
Actually: The top frame (first line) is the most recent call, the point where the exception was thrown. It may sit inside a library or framework rather than your own code. Scroll down to find the first frame inside your application code. That is usually the root cause. Look for file names you recognize (LoginService.java, not Framework.jar).
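
Finding the first application frame can be automated once you know your team's package prefix. In the sketch below, the trace text, the org.hypothetical framework frames, and the com.example prefix are all invented for illustration; only the "at Class.method(File.java:line)" frame format is standard for Java stack traces.

```python
import re

# Matches one Java frame: "at package.Class.method(File.java:123)"
FRAME = re.compile(
    r"^\s*at\s+(?P<cls>[\w.$]+)\.(?P<method>[\w$<>]+)"
    r"\((?P<file>[^:()]+)(?::(?P<line>\d+))?\)\s*$"
)

# Hypothetical trace: a utility frame on top, application frames in the middle.
TRACE = """java.lang.NullPointerException
    at org.hypothetical.util.Checks.requireNonNull(Checks.java:12)
    at com.example.auth.AuthService.validatePassword(AuthService.java:127)
    at com.example.web.LoginController.login(LoginController.java:42)
    at org.hypothetical.framework.Dispatcher.dispatch(Dispatcher.java:88)
"""


def first_app_frame(trace, app_prefix):
    """Return the topmost frame whose class starts with `app_prefix`, or None."""
    for raw in trace.splitlines():
        match = FRAME.match(raw)
        if match and match.group("cls").startswith(app_prefix):
            return match.groupdict()
    return None


frame = first_app_frame(TRACE, "com.example.")
print(frame["file"], frame["line"], frame["method"])
```

The top frame here is a utility class, but the line to quote in the bug report is AuthService.java:127, the first frame the team actually owns.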

⚠ When Log Debugging Fails

Log debugging fails when logs are not configured or are too verbose (millions of DEBUG entries hiding the error). It also fails when the error occurs in a third-party service, because you have no logs from their system. And it fails when the error is a race condition or timing issue that only manifests under load: a single request's logs might look normal, but 1,000 concurrent requests might reveal the issue. In these cases, look for patterns across multiple requests or work with the ops/infrastructure team.
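
Looking for patterns across requests usually means grouping failures by what they have in common. The sketch below counts distinct failing requests per error "signature" (the message with numbers collapsed). The log lines and helper names are hypothetical, and real tools offer this as a built-in aggregation.

```python
import re
from collections import Counter

# Hypothetical ERROR/INFO lines from several different requests.
LINES = [
    "2024-04-24T09:00:29.010Z [ERROR] request_id=req-a1 upstream call timed out after 29000 ms",
    "2024-04-24T09:05:29.020Z [ERROR] request_id=req-b2 upstream call timed out after 29004 ms",
    "2024-04-24T09:07:11.500Z [ERROR] request_id=req-c3 NullPointerException in AuthService.java line 127",
    "2024-04-24T09:08:00.000Z [INFO] request_id=req-d4 request received",
]

ERROR_LINE = re.compile(r"\[(?:ERROR|FATAL)\].*?request_id=(\S+)\s+(.*)$")


def error_signature(message):
    """Collapse digit runs so messages differing only in numbers group together."""
    return re.sub(r"\d+", "N", message)


def failures_by_signature(lines):
    """Count distinct failing requests per error signature."""
    seen = set()
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if not match:
            continue
        key = (match.group(1), error_signature(match.group(2)))
        if key not in seen:  # count each request once per signature
            seen.add(key)
            counts[key[1]] += 1
    return counts


counts = failures_by_signature(LINES)
print(counts.most_common(1))
```

A signature shared by many requests points to a systemic cause, while one that appears for a single request points to that request's data.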

7 Now You Try — Diagnose a Real Error

🎯 Interactive Exercise

Scenario: You receive a bug report: "Report download is slow and sometimes times out. User tried to download a 50MB CSV at 2024-04-24 10:15:30 UTC. Request ID: req-c9f1b8e3."

You pull the logs:

2024-04-24T10:15:30.234Z [INFO] request_id=req-c9f1b8e3 user_id=7621 request received: GET /api/v1/reports/9876/export?format=csv
2024-04-24T10:15:30.456Z [INFO] request_id=req-c9f1b8e3 query_start: executing SQL for 500,000 rows
2024-04-24T10:15:35.123Z [WARN] request_id=req-c9f1b8e3 query slow: took 4.7 seconds to fetch data
2024-04-24T10:15:35.234Z [DEBUG] request_id=req-c9f1b8e3 formatting 500,000 rows to CSV
2024-04-24T10:15:42.567Z [WARN] request_id=req-c9f1b8e3 csv formatting slow: took 7.3 seconds
2024-04-24T10:15:42.678Z [DEBUG] request_id=req-c9f1b8e3 compressing CSV
2024-04-24T10:15:50.890Z [WARN] request_id=req-c9f1b8e3 compression slow: took 8.2 seconds
2024-04-24T10:15:50.999Z [INFO] request_id=req-c9f1b8e3 response complete: 50.3 MB sent, total time 20.8 seconds

Your task: Analyse these logs and answer: (1) What is the root cause of the slowness? (2) Which step is the bottleneck? (3) What would you recommend?

Analysis:

Root cause: The report export is slow at every stage: SQL (4.7s), CSV formatting (7.3s), compression (8.2s). Total: 20.8 seconds. With a 30-second timeout, or under higher concurrency, a timeout is a real risk.
Bottleneck: Compression (8.2s) is the slowest step in absolute terms, with CSV formatting (7.3s) close behind. Together they account for about three quarters of the total time, so the post-query processing of 500K rows, not the SQL, is the likely culprit.
Recommendations: (a) Add an index for the report query (SQL took 4.7s for 500K rows). (b) Stream the CSV instead of building it in memory. (c) Use a faster CSV library. (d) Add pagination: "Download first 100K rows" instead of 500K at once. (e) Increase the HTTP timeout on the client and server.

Tip: The logs gave you timing for each step. This is actionable data. You do not need a developer's intuition; the metrics tell you what to optimise.
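
Ranking the steps by duration can be done mechanically. The sketch below reads the three WARN lines from the exercise and assumes the "<step> slow: took N seconds" message shape they share; the helper name is hypothetical.

```python
import re

# The three WARN entries from the exercise logs.
EXPORT_LOG = [
    "2024-04-24T10:15:35.123Z [WARN] request_id=req-c9f1b8e3 query slow: took 4.7 seconds to fetch data",
    "2024-04-24T10:15:42.567Z [WARN] request_id=req-c9f1b8e3 csv formatting slow: took 7.3 seconds",
    "2024-04-24T10:15:50.890Z [WARN] request_id=req-c9f1b8e3 compression slow: took 8.2 seconds",
]

STEP = re.compile(
    r"request_id=\S+ (?P<step>.+?) slow: took (?P<secs>\d+(?:\.\d+)?) seconds"
)


def step_durations(lines):
    """Map each step name to the duration its WARN entry reports, in seconds."""
    durations = {}
    for line in lines:
        match = STEP.search(line)
        if match:
            durations[match.group("step")] = float(match.group("secs"))
    return durations


durations = step_durations(EXPORT_LOG)
slowest_step = max(durations, key=durations.get)
accounted = round(sum(durations.values()), 1)
print(slowest_step, durations[slowest_step], "of", accounted, "seconds accounted for")
```

The three steps add up to 20.2 of the 20.8 seconds reported for the whole request, so the remainder is overhead between steps.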

Why teams fail here

No request ID discipline — logs are searched by keyword ("error", "fail") instead of correlation ID, producing hundreds of irrelevant matches and hiding the real trace in noise.
Logging access is gate-kept by developers — testers have never been onboarded to Kibana, Splunk, or CloudWatch, so they rely on developers to translate logs for them and lose hours waiting in queues.
Stack traces are copy-pasted verbatim into bug reports without interpretation — 80-line Java stack dumps dumped into Jira with the comment "something broke," giving developers nothing to act on.
Teams treat WARN-level log entries as noise and tune them out — in practice, a sustained stream of WARN entries (retries, slow queries, cache misses) is often the early warning of an imminent ERROR that only surfaces under load.

Key takeaway

A senior tester who can read logs independently is worth three who cannot — because they close the loop between user symptom and root cause without waiting for a developer to translate.

8 Self-Check — Can You Actually Do This?

If you got all three, you are ready to practice.

Q1. What is a request ID and why is it critical for debugging?

A request ID is a unique identifier assigned to every HTTP request. It ties together all log entries related to that single request, across multiple services and servers. Instead of searching 1 million log entries for "error," you search for one request ID and get 5–10 relevant entries. It is the lifeline of distributed system debugging.
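
To see where request IDs come from, here is one way an application might attach an ID to every log line, sketched with Python's standard logging and contextvars modules. The logger name, log format, and "req-" prefix are assumptions for illustration; real services usually do this in framework middleware and pass the ID to downstream services in a header.

```python
import contextvars
import io
import logging
import uuid

# Holds the current request's ID; "-" when no request is being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


stream = io.StringIO()  # captured in memory so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("[%(levelname)s] request_id=%(request_id)s %(message)s")
)
handler.addFilter(RequestIdFilter())

logger = logging.getLogger("lesson.request_id_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request():
    """Simulate one request: assign an ID, log twice, then clear the ID."""
    request_id = "req-" + uuid.uuid4().hex[:8]
    token = request_id_var.set(request_id)
    try:
        logger.info("request received")
        logger.warning("retry attempt 2 of 3")
    finally:
        request_id_var.reset(token)
    return request_id


rid = handle_request()
captured = stream.getvalue().splitlines()
print(captured)
```

The code that writes each log message never mentions the ID; the filter adds it, which is why every entry for a request carries it consistently.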

Q2. What is the difference between a log level WARN and ERROR?

WARN means something unexpected happened but the application continued (e.g., retry attempt 2 of 3, connection slow). ERROR means something failed and the application had to handle it, usually by returning an error to the user. WARN does not always indicate a bug; ERROR usually does.

Q3. How do you read a Java stack trace?

Top to bottom, but the root cause is not always the first line. The topmost frame shows the most recent method. Scroll down to find the frame inside your application code (not a framework). Look at the file name and line number. That line is usually the root cause. Read the logs around that line for context.

9 Interview Prep — Q&A

Q. Walk me through how you would debug a 500 error a user reported.

First, I ask the user for a request ID or transaction ID from the error page, and the exact time it occurred. I open the log aggregation tool (Kibana or Datadog) and search for that request ID. I look at the log timeline: INFO entries show what the request did, ERROR/FATAL entries show where it failed. I find the stack trace, read from the top (most recent) down to my application code. I look at the line number and the surrounding logs for context. I determine: is this a bug in our code, a database issue, a third-party API timeout, or user error? Then I report with the request ID, timestamp, exact error message, and suggested root cause.

Q. What would you do if logs are missing or not configured?

This is a risk. I would escalate to the development or infrastructure team immediately and ask them to enable logging (or increase the verbosity). Without logs, you can only reproduce the issue manually and read the browser console. I might use browser DevTools Network tab to capture the HTTP response code and headers. If the issue is consistent, I would ask a developer to add temporary logging to diagnose it.

Q. How do you differentiate between a user error and a system bug?

If the logs show a validation error (e.g., "Email is required"), it is usually user error: they submitted invalid data. If the logs show an exception in application code (NullPointerException, database connection error, timeout), it is usually a system problem, whether a code bug or a failing dependency. The error message and stack trace tell you which it is.

Q. What information would you include in a bug report about a backend error?

Always include: (1) request ID, (2) exact timestamp, (3) what the user was doing (action), (4) exact error message from the logs, (5) stack trace file and line number, (6) surrounding context (what was the request about, what happened before the error). Never say "The page is broken." Always cite the logs: "NullPointerException in UserService.java:142 at 2024-04-24T10:15:23Z (request_id: req-abc123)."
