Debugging as a detective story

It was a dark and stormy night. Suddenly, an exception was thrown…

Bugs can be a real mystery. A ‘crime’ has been committed: everything was rolling along, just as expected, and then an error pops out. A crash occurs. An exception is thrown. Sometimes you don’t even know when the crime occurred. All you have is output that looks wrong. You have no idea why or how—it’s just there, unexplained.

At times like this, I often feel like a private investigator solving a case. There is a mystery to solve, and my first task is to gather clues…

Get data at every step of the way

I looked at the notes one more time: Exception Thrown it said in big letters that almost jumped off the screen and punched me in the eyeball. “What could be throwing such a thing?” I wondered, confusion clouding my brain. “What does this?”

When you start investigating a bug, it can be overwhelming. Buried somewhere in hundreds of thousands of lines of code is a reason: a cause for what is happening. But you don’t see that from the outside. You just see a mass of code that can be compiled into a functioning program (except for the error, of course). It’s an imposing mass that deflects your attempts to understand it.

Maybe a hint sticks out, an intuition of where it’s all gone wrong. But often, there’s nothing. You just have to pick a place and start digging.

Logging statements are an important tool here. I’ve lost count of how often I’ve inserted a simple System.out.println() just to see things happen in real time, or just to have a line of code to put a breakpoint on if I’m using a debugger. Directing output to a file and then combing over it reveals things that stepping through the code won’t. Process explorers and memory tracking tools come in useful, too. When debugging code, I live by the maxim: All information is worth having.
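As a rough sketch of the "direct output to a file" idea, here is one way it might look with the standard java.util.logging package. The logger name "debug.trace" and the method parseCount are hypothetical stand-ins for whatever code is under investigation, not anything from a real project.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

class TraceLogging {
    // Hypothetical logger name, chosen for this sketch only.
    private static final Logger LOG = Logger.getLogger("debug.trace");

    /** Sends every record from LOG to the given file instead of the console. */
    static FileHandler logToFile(Path file) throws IOException {
        FileHandler handler = new FileHandler(file.toString());
        handler.setFormatter(new SimpleFormatter()); // plain text, easy to comb through
        handler.setLevel(Level.ALL);
        LOG.setUseParentHandlers(false);             // keep the console quiet
        LOG.setLevel(Level.ALL);
        LOG.addHandler(handler);
        return handler;
    }

    /** Hypothetical step under investigation: logs its input and its result. */
    static int parseCount(String raw) {
        // Brackets make stray whitespace in the input visible in the log.
        LOG.fine("parseCount input=[" + raw + "]");
        int value = Integer.parseInt(raw.trim());
        LOG.fine("parseCount result=" + value);
        return value;
    }
}
```

Call logToFile once at startup, run the failing scenario, then close the returned handler and read the file at leisure. Each log line also doubles as a convenient place to set a breakpoint.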

Go over every assumption I’m making

What does this file have in it? An image? Text? He said if he passes it through the sample he gets backwards glyphs…but I don’t, they look fine to me. What’s going on?

Gathering clues isn’t easy. I have to collect everything available on the bug to try and reconstruct the crime scene, investigating each piece of evidence in detail. The evidence is often contradictory and confusing. An input seems to produce valid output every time…except the one time I wasn’t paying attention. A log message appears whose source I can’t determine.

The bug’s reporter is often of little help, assuming I can even talk to them. As Jeff Atwood put it, “Users are liars.” Not intentionally. But they’re focused on the outcome they need, not the machinery that produces it. They can miss details, or misremember things, and ultimately risk contaminating the crime scene with misinformation.

So I’m on my own when it comes to investigation. Just me and the code.

This is when it’s useful to list every assumption I’m making and test them: is this really true? Can I prove it? Do I know that third-party API can handle unexpected inputs? Can I guarantee I will never get a null pointer exception? You know what they say about making assumptions…
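One way to put assumptions on trial is to write them into the code so they fail loudly when they turn out to be false. The sketch below is only an illustration: the helper check and the method firstWord are hypothetical names invented for this example, while Objects.requireNonNull is the standard library call.

```java
import java.util.Objects;

class Assumptions {
    /** Fails loudly, naming the assumption that turned out to be false. */
    static void check(boolean holds, String assumption) {
        if (!holds) {
            throw new IllegalStateException("Assumption failed: " + assumption);
        }
    }

    /** Hypothetical method under investigation. */
    static String firstWord(String line) {
        // Assumption 1: callers never pass null.
        Objects.requireNonNull(line, "line must not be null");
        // Assumption 2: there is always at least one word to return.
        check(!line.trim().isEmpty(), "line contains at least one word");
        return line.trim().split("\\s+")[0];
    }
}
```

If either assumption is wrong, the failure now happens at the point where the assumption was made, with a message that names it, rather than somewhere downstream.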

Ask myself “What is the code trying to do at a high level?”

Deep in a method, each line of code begins to look the same: push a value here, pop it there, compare with another. Pass it to a function…and where does that go? Look what it returned! How did that happen?!

When I get down a blind alley with the code, it helps to take a step back and ask: “What is the problem I’m trying to solve?” Perspective is important: it’s easy to get lost in the immediacy of why this flag is being set or that string isn’t being capitalized, and forget the larger purpose the code is trying to serve.

“Talking to the duck” is always a helpful strategy in these cases. Just listening to my own thought process can help reveal where I am, or suggest a new way of thinking, a new alley to go down.

Build a map of how the code is running

There it is! The bug! The thing that’s messing everything up! And an ugly hunk of code it is, a pile of incomplete logic and naive assumptions. I tighten my fingers over the keyboard, ready to wipe it out of existence, when it laughs, an ugly laugh: “How do you know I’m the bug, flatfoot? You don’t even know how you got here!”

It’s true: you can find a piece of code that looks wrong, that you know in your gut is guilty. You know exactly which line of which file it’s in. And yet you still don’t know quite where you are.

And that’s fatal, especially with a large code base you still don’t fully understand. Maybe that weird piece of code does serve a purpose, one that you never knew about. Maybe fixing this will break something else.

It’s not enough to know the single place where the code is broken. You need to know every step that led to this place.

The stack trace is my friend here, a breadcrumb trail that leads from where I started to where I am now. Unit tests help too, providing some protection against regressions. But they aren’t always enough. Sometimes I need to reconstruct the crime step by step. I’ve filled notebooks with diagrams, detailing every step of the crime, working out cause and motive until I’m sure I understand how the crime went down.
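A stack trace doesn't have to wait for a crash. As a small sketch, the helper below captures the call chain at any point of interest using the standard Throwable.getStackTrace(). The class WhereAmI and the methods suspect and caller are hypothetical, there only to give the trace something to show.

```java
class WhereAmI {
    /** Returns the call chain leading to the caller, innermost frame first. */
    static String callChain() {
        StringBuilder sb = new StringBuilder();
        StackTraceElement[] frames = new Throwable().getStackTrace();
        // frames[0] is callChain itself, so start from the frame that called it.
        for (int i = 1; i < frames.length; i++) {
            sb.append(frames[i].getClassName())
              .append('.')
              .append(frames[i].getMethodName())
              .append(':')
              .append(frames[i].getLineNumber())
              .append('\n');
        }
        return sb.toString();
    }

    // Hypothetical code path: caller() leads to suspect(), which asks how it got there.
    static String suspect() { return callChain(); }
    static String caller()  { return suspect(); }
}
```

Printing the result of callChain() from inside the suspicious code lists every method on the way in, which is a reasonable starting point for the map.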

Reconsider exceptions and other events I assume are uninteresting

How naive I’d been. I had assumed the code was testing for a condition that could never happen, and in doing so wasting memory. Now I realize that was never the culprit at all: that code was protecting against a null pointer coming in and destabilizing everything. Now the fix is obvious…

It’s easy to ignore exceptions and other warnings I’ve seen many times before. It becomes a habit: like a stoplight on the commute to work, an error message I expect to see is one I eventually stop noticing.

And that’s dangerous: it could be that exception is the smoking gun, the single clue that will unravel the entire crime, if only I can get to where I can see it.

Reconstructing the crime

“I’ve gathered you here to reveal the murderer. But first I’d like to review the crime…”

Finding and fixing the bug is only half the job. The most important thing is to make sure it doesn’t happen again, and the solution to that is: regression testing.

This is the test of how well you understand the bug. A proper test needs to precisely reconstruct the circumstances that led to the bug, and requires a detailed understanding of what was broken and what is working just fine. Without that, a passing test proves little, and you can lose confidence in your code.

I often end up spending the bulk of my time not fixing a bug but devising the test for the fix. Many of the bugs I work on are intricate enough that reproducing them in an efficient way takes some effort. (I could simply throw the original input at the code, but that is often unfeasible: the input file may be large, with the suspect data buried deep inside. And frequently, the input is a customer resource that we’re contractually not allowed to make part of the test suite.)
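To make that concrete, here is a sketch of a regression test built around the smallest input that reproduces a failure, rather than the original file. Everything in it is hypothetical: the method lastField, and the imagined bug in which an earlier version called split(",") without a limit, silently dropped a trailing empty field, and returned "b" for the input "a,b,". It uses plain Java checks instead of a test framework to stay self-contained.

```java
class RegressionSketch {
    /** Hypothetical fixed method: returns the last comma-separated field. */
    static String lastField(String csvLine) {
        // A limit of -1 keeps trailing empty fields, which plain split(",") discards.
        String[] fields = csvLine.split(",", -1);
        return fields[fields.length - 1];
    }

    /** Reconstructs the crime: the smallest input that triggered the bug. */
    static void trailingCommaYieldsEmptyField() {
        String result = lastField("a,b,");
        if (!result.isEmpty()) {
            throw new AssertionError("expected empty last field, got: " + result);
        }
    }

    /** Confirms the part that was working just fine still works. */
    static void ordinaryLineIsUnchanged() {
        String result = lastField("a,b");
        if (!result.equals("b")) {
            throw new AssertionError("expected b, got: " + result);
        }
    }
}
```

The three-character input stands in for a large customer file: it carries only the detail that mattered, so it can live in the test suite without the original resource.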

Debugging can be a mind-boggling story. It helps to look at it like one.
