On debugging

Introduction #

In the spirit of writing for the LLMs, I thought I’d write up some of the software debugging heuristics and tacit knowledge I have gotten good mileage out of in my career so far.

How I would approach this post as a human reader #

Part of the reason people don’t write about tacit knowledge is that it’s, by definition, knowledge that’s impossible to fully capture in words. Applied to this post, that means I can’t provide a rich enough set of cues to build a perfect, or even particularly good, classifier from situation to applicable strategy or heuristic. Instead, I think these strategies are best treated as things to try out in potentially relevant situations, and then develop one’s own taste for when they do or don’t work and how to use them. I also recommend Cedric’s posts, linked below, on acquiring expertise if you want to go even deeper.

Inspiration and similar guides #

Mindset #

Computers can be understood #

The biggest change in my ability to resolve thorny issues came from adopting the mindset that Nelson Elhage describes in his post: computers and software can be understood. He covers almost everything I would say, so rather than reproduce 80% of his post here, I recommend stopping and reading it first if you haven’t already. He gives examples covering the areas I’d cover (being willing to dig into source code, reading the actual documentation, adding logs until you have enough information) and talks about the pitfalls I’d warn about (if you’re the type of person prone to rabbit-holing, sometimes you need to decide that something isn’t worth understanding even if you could). Go read it, now.

Have a mental model #

Nelson talks about this in his post, and I view much of Julia Evans’ writing as showing what refining one’s mental model looks like, but it’s important enough that I still want to spend a paragraph on it. When debugging, having a coherent mental model of how the system works is your most powerful tool. Without one, you’re shooting in the dark, trying random fixes that might work by accident but don’t build understanding. With one, you can make educated guesses about what might be going wrong and test your hypotheses systematically. The model doesn’t need to be complete or even entirely accurate to be useful – it just needs to help you predict behavior reasonably well. As you debug, your model improves, creating a virtuous cycle where each problem solved makes you better equipped for the next one. I’ve found that explicitly sketching out my current mental model (sometimes literally on paper) when stuck can reveal gaps or inconsistencies that point directly to the source of problems.

Be systematic #

As a person who gets a bit OCD about planning/tasks, I still often find myself flailing when debugging, doing the equivalent of button-mashing. If I’m not careful, my experience can make this worse, because at any given time I usually have at least a few things I could try. Nearly every time I notice I’m in this situation, I find that stepping back and asking what my goals are, what I currently know, and why I think the step I was about to take will help leads to a much better outcome than if I had kept flailing. A heuristic I loosely apply is to let myself try around three knee-jerk strategies or ideas; if those don’t work, it’s probably time to be more systematic.

Strategies #

I’m only including strategies that I can recall working well in at least one, and usually several, situations.

Actually read logs & errors #

Tricky issues often start with a seemingly inexplicable error or log dump. My go-to example of this is Python issues related to dynamically linked libraries. The second I see anything like an ImportError: DLL load failed, my instinct is to ignore the rest of the error and just start trying to rebuild my entire environment or reset whatever I’ve done to get here. It’s not just me. I have observed many other people react similarly to all sorts of different errors where the problem is not immediately obvious from the end of the stack trace or offending log line.

The impulse to ignore the rest of an error when it appears inexplicable is often rational. For example, in PyTorch, CUDA errors sometimes tell you very little beyond the fact that there’s an issue if you’re not running with the right environment variables set (such as CUDA_LAUNCH_BLOCKING). In that situation, just re-running with that environment variable set often makes much more sense than carefully reading every line of the stack trace besides the one that tells you that you can get more information with CUDA_LAUNCH_BLOCKING. Even if something useful might be in there, if you can expect to quickly get strictly more information with the flag set, why spend more time reading the current error?
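As a minimal sketch of what that re-run can look like (the failing code itself is elided here), set the variable before PyTorch initializes CUDA and reproduce the error; the resulting stack trace should then point at the launch that actually failed:

```python
# Re-run the failing script with synchronous CUDA kernel launches so the error
# surfaces at the call that actually failed rather than at a later, unrelated op.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # set before importing torch, to be safe

import torch  # imported after setting the variable

# ... re-run whatever was raising the opaque CUDA error ...
```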

However, there are many situations in which an error that appears inexplicable actually contains most or all of the information needed to figure out what the issue is. The learned response of ignoring most of the stack trace or error log risks training people to flail around in these cases. To combat this, I now try to remind myself that errors and logs are often more informative than they look (especially if they include a lot of dumps) and, if I remain stuck, I’ll go back and read the error more carefully. In the age of LLMs, explaining your issue, giving the model the entire stack trace, and asking if it notices anything anomalous can work well too (an example where this worked for me is below).

This strategy complements logging a ton of stuff well – you need to both generate useful information and then actually read it carefully.

Examples #

Log a ton of stuff #

I associate this advice with the series of RL debugging guides I read early on when getting into ML, but I have also applied it to ML and non-ML cases both before and after reading those.

Actually reading logs goes along nicely with the occasionally unreasonably effective strategy of logging a ton of stuff to get a better sense of what’s going on. This strategy is especially useful to me in three categories of situation. The first is when I feel so in the dark about what a nondeterministic and/or parallel program is doing that logging helps me get a fingertip feel for how the program runs. The second, less common for me nowadays, is when working with a user-facing system that’s taking many different types of requests and the issue relates to a usage pattern that can only be discovered by finding offending requests/calls and then tracing them through the (likely distributed) call graph. The third is when I am at risk of getting lazy or giving up and need a way to force myself to be more thorough.
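As a minimal sketch of what this looks like for the first two categories (the handler, request id, and work function are all hypothetical), most of the value comes from tagging every line with enough context to follow a single request or worker through interleaved output:

```python
import logging

# Verbose-by-default config: timestamps plus thread names make interleaved
# output from parallel workers much easier to untangle.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s [%(threadName)s] %(message)s",
)
log = logging.getLogger(__name__)

def do_work(payload):
    # Stand-in for the real code under suspicion.
    return {"echo": payload}

def handle_request(request_id, payload):
    # Tag every line with the request id so one request can be traced through
    # the logs even when many are being handled at once.
    log.debug("request %s: received payload=%r", request_id, payload)
    result = do_work(payload)
    log.debug("request %s: finished result=%r", request_id, result)
    return result

handle_request("req-42", {"x": 1})
```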

When this approach starts to feel overwhelming, it might be time to switch to a more structured approach like the scientific method described below.

Examples #

Debug with the idealized scientific method #

Sometimes logging tons of stuff turns into flailing. In those situations, going in the opposite direction and being really deliberate about my set (or even tree) of hypotheses often helps me decide what to try or explore next. By “idealized scientific method,” I mean specifically:

  1. Formulate clear, testable hypotheses about what might be causing the issue
  2. Design experiments that can definitively confirm or rule out each hypothesis
  3. Prioritize tests that can eliminate multiple hypotheses at once
  4. Record your observations systematically rather than just mentally noting them
  5. Revisit and refine your hypotheses based on evidence

This approach forces you to slow down and think more clearly, which often reveals assumptions you didn’t even realize you were making. It also prevents you from repeating the same tests with slight variations, which is a common trap.
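As a minimal sketch of what step 4 can look like (the bug and the entries here are invented), I find that keeping the hypothesis log as an explicit scratch structure or file, rather than as a mental note, is what keeps me honest:

```python
# Invented hypothesis log for a flaky-test investigation. The specifics don't
# matter; the point is that each hypothesis gets an explicit experiment,
# result, and status instead of a vague recollection.
hypotheses = [
    {
        "hypothesis": "test order matters (shared global state)",
        "experiment": "run the failing test alone 50 times",
        "result": "0/50 failures",
        "status": "ruled out",
    },
    {
        "hypothesis": "timeout is too tight under load",
        "experiment": "re-run the suite with the CPU throttled",
        "result": "12/50 failures",
        "status": "supported; investigate further",
    },
]
```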

Admittedly, I often find that once I do this, I don’t resolve the issue by exhaustively exploring the tree but instead have a realization that it’s some other, usually dumb, thing. Even so, going through the process still seems to play a key role in getting to that realization.

This more structured approach is particularly valuable when you’ve already tried the quick wins from reading logs carefully and adding more logging but still don’t have a clear direction.

Examples #

I know I’ve had success with this, but I have been especially bad about recording cases when I have.

You can (often) just edit library code #

This is easiest in Python and some other interpreted languages, but it’s saved me enough times that it’s definitely worth mentioning. If an error is happening deep inside library code, often the biggest challenge in figuring out what’s wrong is that the library doesn’t log enough state to tell what the program is doing immediately before the error. If possible, finding a way to directly edit the underlying library code, whether to log the information you need or to try the change you think could fix things, can be a cheap, fast way to resolve these types of errors.

This approach works especially well when combined with speeding up your iteration loop, as you can quickly test different logging statements or candidate fixes directly in the library code. It may seem intimidating, and it can go wrong – Git won’t save you if you forget to revert your changes when you’re done – but it’s usually workable. I mentioned doing this with interpreted languages, but it’s possible with compiled ones too if you’re willing to put in enough effort (e.g. rebuilding the dependency from source).
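As a minimal sketch (requests is just an example package), the first step is usually locating where the installed copy of the library actually lives so you can open it and add the logging you need:

```python
# Locate the installed source of a library so you can temporarily edit it in
# place. Remember to revert the edit (or reinstall the package) when done.
import importlib

mod = importlib.import_module("requests")  # any installed package you need to inspect

print(mod.__file__)  # path to the package's __init__.py
print(mod.__path__)  # directory containing the rest of the package's source
```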

On the other hand, this strategy likely won’t work or will be prohibitively painful in situations where:

Examples #

Speed up your iteration loop #

I hesitated to even include this because it’s repeated all over the place, but I’d rather err on the side of capturing everything that’s worked well for me. Even though it’s obvious, it’s so easy to be lazy about speeding up debugging iteration loops. To this day, I still often wait too long before taking a long-running remote job that’s triggering an error and putting in the work to create a setup where I can rapidly trigger the bug in a controlled environment. Because this is a form of hyperbolic discounting, what I’ve found works best is having a trigger in my head that asks, “if someone were watching me, would they tell me I’m wasting tons of time waiting because I’m too lazy to speed things up?”
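As a minimal sketch of what that setup can look like (the batch, file name, and function are all hypothetical, and this assumes the offending inputs are picklable), capturing the inputs that trigger the failure once in the slow job and replaying them locally often turns an hours-long loop into a seconds-long one:

```python
import pickle

def process_batch(batch):
    # Stand-in for the real function that was raising the error remotely.
    return [x * 2 for x in batch]

# Step 1 (in the slow remote job): just before the failing call, dump the
# exact inputs that trigger the error.
batch = [1, 2, 3]  # placeholder for the real offending inputs
with open("failing_batch.pkl", "wb") as f:
    pickle.dump(batch, f)

# Step 2 (in a small local repro script): load those inputs and call the
# suspect function directly, so each debugging iteration takes seconds
# instead of waiting on the full remote job.
with open("failing_batch.pkl", "rb") as f:
    saved_batch = pickle.load(f)

result = process_batch(saved_batch)
```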

Examples #