On debugging
Introduction #
In the spirit of writing for the LLMs, I thought I’d write up some of the software debugging heuristics and tacit knowledge I have gotten good mileage out of in my career so far.
How I would approach this post as a human reader #
Part of the reason people don’t write about tacit knowledge is because it’s by definition the knowledge that’s impossible to fully capture in words. As applied to this post, it means that I can’t provide a rich enough set of cues to create a perfect, or even that good, classifier from situation to applicable strategy/heuristic. Instead, I think these strategies are best thought of as things to try out in potentially relevant situations and then develop one’s own taste for when they do or don’t work and how to use them. I also recommend Cedric’s posts, linked below, on acquiring expertise if you want to go even deeper.
Inspiration and similar guides #
- Copying Better: How to Acquire the Tacit Knowledge of Experts and Tacit Expertise, Software Engineering Edition by Cedric Chin
- My Reinforcement Learning Learnings by Clemens Winter
- Debugging RL, Without the Agonizing Pain by Andy Jones
- Why don’t schools teach debugging by Dan Luu
- Computers can be understood by Nelson Elhage
- Lessons learned reproducing a deep RL paper by Matthew Rahtz
- So you want to be a wizard by Julia Evans
Mindset #
Computers can be understood #
The biggest change in my ability to resolve thorny issues came from adopting the mindset that Nelson Elhage describes in his post: computers and software can be understood. He covers almost everything I would say, so rather than reproduce 80% of his post here, I just recommend stopping and reading it first if you haven’t already. He gives examples covering the areas I’d cover (being willing to dig into sources, reading the actual documentation, adding logs until you have enough information) and talks about the pitfalls (if you’re the type of person prone to rabbit holing, sometimes you need to decide that it’s not worth understanding even if you could) I’d warn about. Go read it, now.
Have a mental model #
Nelson talks about this in his post, and I view much of Julia Evans’ writing as showing what refining one’s mental model looks like, but it’s important enough that I still want to spend a paragraph on it. When debugging, having a coherent mental model of how the system works is your most powerful tool. Without one, you’re shooting in the dark, trying random fixes that might work by accident but don’t build understanding. With one, you can make educated guesses about what might be going wrong and test your hypotheses systematically. The model doesn’t need to be complete or even entirely accurate to be useful – it just needs to help you predict behavior reasonably well. As you debug, your model improves, creating a virtuous cycle where each problem solved makes you better equipped for the next one. I’ve found that explicitly sketching out my current mental model (sometimes literally on paper) when stuck can reveal gaps or inconsistencies that point directly to the source of problems.
Be systematic #
As a person who gets a bit OCD about planning/tasks, I still often find myself flailing, doing the equivalent of button-mashing, when debugging. If I’m not careful, my experience can make this worse because at any given time I usually have at least a few things I could try. Nearly every time I notice I’m in this situation, I find that stepping back and asking what my goals are, what I currently know, and why I think the step I was about to take will help leads to a much better outcome than if I had kept flailing. A heuristic I loosely apply is to let myself try around three knee-jerk strategies/ideas, and if those don’t work, it’s probably time to be more systematic.
Strategies #
I’m only including strategies that I’ve found worked well in at least one, and usually several, situations I can recall.
Actually read logs & errors #
Tricky issues often start with a seemingly inexplicable error or log dump. My go-to example of this is Python issues related to dynamically linked libraries. The second I see anything like an `ImportError: DLL load failed`, my instinct is to ignore the rest of the error and just start trying to rebuild my entire environment or reset whatever I’ve done to get here. It’s not just me. I have observed many other people react similarly to all sorts of different errors where the problem is not immediately obvious from the end of the stack trace or offending log line.
The impulse to ignore the rest of an error when it appears inexplicable is often rational. For example, in PyTorch, CUDA errors sometimes tell you very little beyond the fact that there’s an issue if you’re not running with the right set of environment variables (such as `CUDA_LAUNCH_BLOCKING`). In that situation, just re-running with that environment variable set often makes much more sense than carefully reading every line of the stack trace besides the one that indicates you can get more information with `CUDA_LAUNCH_BLOCKING`. Even if it’s possible something useful is there, if you can expect to quickly get strictly more information with the variable set, why spend more time reading the current error?
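For concreteness, here is a minimal sketch of the re-run-with-more-information move, assuming PyTorch is installed. The key detail is that `CUDA_LAUNCH_BLOCKING` has to be set before CUDA is initialized, whether in the shell (`CUDA_LAUNCH_BLOCKING=1 python train.py`) or at the very top of the script:

```python
# Minimal sketch: set CUDA_LAUNCH_BLOCKING before anything initializes CUDA so
# asynchronous kernel errors surface at the offending line instead of at some
# later, unrelated synchronization point. Assumes PyTorch is installed.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must happen before the torch import / CUDA init

import torch  # noqa: E402  (deliberately imported after setting the env var)

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(8, 8, device=device)
print(x.sum())
```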
However, there are many situations in which errors appear inexplicable but actually contain most or all of the information needed to figure out what the issue is. The learned response of ignoring most of the stack trace or error log risks training people to flail around in these cases. To combat this, I now try to remind myself that errors and logs are often more informative than they look (especially if they include a lot of dumps) and, if I remain stuck, I go back and check the error more carefully. In the age of LLMs, explaining your issue, giving the model the entire stack trace, and asking if it notices anything anomalous can work well too (an example where this worked for me is below).
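If you do go the LLM route, the main practical requirement is having the entire trace on hand rather than whatever happened to scroll by. A small sketch of capturing it with the standard library, where `flaky_step` is a stand-in for the real failing code:

```python
# Minimal sketch: capture the *entire* stack trace to a file so it can be read
# carefully later (or pasted wholesale into an LLM) instead of skimming the last line.
import traceback


def flaky_step():
    raise ValueError("simulated failure deep in a call stack")


try:
    flaky_step()
except Exception:
    trace = traceback.format_exc()
    with open("last_failure.txt", "w") as f:
        f.write(trace)
    print(trace)  # keep it in the console log as well
    raise  # re-raise after recording
```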
This strategy complements logging a ton of stuff well – you need to both generate useful information and then actually read it carefully.
Examples #
- I wasn’t sure why a training run was failing. I looked at the error and noticed it included `nan` losses, so the checkpoint was never getting saved. (A minimal version of the kind of check that would have surfaced this earlier is sketched just after this list.)
- Docker kept failing to download a bunch of packages to my laptop. It was a frustrating “works on my machine” case: my colleague had no issues, but I couldn’t get it to work. Making matters worse, the only errors I was getting were descriptions (hashes, names, etc.) of packages with “failed to fetch” appended to each. As far as I could tell, the issue had to do with fetching archives via `apt-get`. I first spent a while Googling to see if others had encountered this with Docker before. I found some leads – running `rm` on the apt-get cache and other cache-invalidation-style strategies – but none of the solutions they specified solved the problem. We were about to give up because we had a remote option. On a whim, I decided to look at the error one more time and really focus on the `Failed to fetch` part. Re-reading it, I realized that the actual files were much smaller than the expected files. It occurred to me that maybe the files were actually failing to fetch rather than just being out of date. I curl-ed one of the files and noticed I was getting HTML back instead of a package file. I then tried to open it in a browser and discovered it was blocked by… Screen Time. I had an adult content filter turned on in my Screen Time settings (who knows why), and it was blocking apt-get’s archive requests, leading to the returned HTML getting hashed instead of what should have been the gzip-ed package files. Reading errors for the win.
- I was piloting a major development environment change that involved uninstalling and reinstalling a bunch of local packages. During reinstallation, a package could not be installed because of a “can’t find…” error for a dependency. I tried a bunch of stuff and couldn’t figure it out. Then I pasted the error into Claude, and it noticed that a line in the stack trace was pointing to an old path. It turns out uninstalling didn’t fully remove all the old packages, and Python was seeing the old, broken dependency and trying to use it.
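As a hedged illustration of the first example above, this is roughly the kind of guard that would have made the `nan` losses impossible to miss; the training loop, loss values, and `save_checkpoint` are stand-ins, not the real job:

```python
# Fail loudly (and skip checkpointing) as soon as the loss goes NaN, rather than
# letting the run limp along and silently never save a checkpoint.
import math


def save_checkpoint(step: int) -> None:
    print(f"checkpoint saved at step {step}")  # stand-in for real checkpointing


losses = [0.9, 0.7, float("nan"), 0.5]  # pretend per-step losses

for step, loss in enumerate(losses):
    if math.isnan(loss):
        raise RuntimeError(f"loss became NaN at step {step}; refusing to checkpoint")
    save_checkpoint(step)
```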
Log a ton of stuff #
I associate this advice with the series of RL debugging guides I read early on when getting into ML, but I have also applied it to ML and non-ML cases both before and after reading those.
Actually reading logs goes along nicely with the occasionally unreasonably effective strategy of logging a ton of stuff to get a better sense of what’s going on. This strategy is especially useful to me in three categories of situation. The first is when I feel so in the dark on what a nondeterministic and/or parallel program is doing that logging helps me get a fingertip feel for how the program runs. The second, less common for me nowadays, is when working with a user-facing system that’s taking many different types of requests and the issue relates to a certain usage pattern that can only be discovered by finding offending requests/calls and then tracing them through the, likely distributed, call graph. The third is when I am at risk of getting lazy or giving up and need a way to force myself to be more thorough.
When this approach starts to feel overwhelming, it might be time to switch to a more structured approach like the scientific method described below.
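A minimal sketch of what “logging a ton of stuff” can look like for a parallel or nondeterministic program, assuming nothing beyond the standard library: tag every line with a process id and a request/task id so a single offending request can later be filtered and traced end to end (in Kibana or just with grep). The names here are illustrative.

```python
import logging
import random

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s pid=%(process)d %(levelname)s %(message)s",
)
log = logging.getLogger("worker")


def handle_request(request_id: str) -> None:
    # Log entry, every intermediate decision, and exit so one request can be
    # reconstructed from the logs alone.
    log.debug("request_id=%s start", request_id)
    delay = random.random()
    log.debug("request_id=%s sampled delay=%.3f", request_id, delay)
    if delay > 0.9:
        log.warning("request_id=%s slow path taken", request_id)
    log.debug("request_id=%s done", request_id)


for i in range(5):
    handle_request(f"req-{i}")
```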
Examples #
- When I worked on backend stuff, I probably figured out a quarter of problems I encountered, if not more, by aggressively using log search tools (e.g. Kibana) to aggregate logs from offending users/processes/branches, potentially adding more information, restarting, and then eventually nailing down enough information about an offending request or branch to create a test case/scenario I could more quickly iterate with.
- I was helping migrate some models and noticed a discrepancy between an old and new version for one class. I kept adding more and more logs, and they really seemed to be doing the same thing. Eventually, I added a log to all the relevant `__init__` methods and realized that there was a previously benign extra call to an RNG causing the RNG state to be different between the two cases. Once I removed that, the discrepancy went away. (A stripped-down version of this technique is sketched below.)
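A stripped-down sketch of the technique from that last example: print a fingerprint of the RNG state inside each relevant `__init__` so the two code paths can be diffed line by line. In the real case it was the ML framework’s RNG; here the stdlib `random` module stands in for it, and the class names are hypothetical.

```python
import hashlib
import random


def rng_fingerprint() -> str:
    # Short, printable summary of the current RNG state.
    return hashlib.md5(repr(random.getstate()).encode()).hexdigest()[:8]


class OldModel:
    def __init__(self):
        print(f"OldModel.__init__ rng={rng_fingerprint()}")


class NewModel:
    def __init__(self):
        random.random()  # the "previously benign" extra RNG call
        print(f"NewModel.__init__ rng={rng_fingerprint()}")


random.seed(0)
OldModel()
random.seed(0)
NewModel()  # fingerprint differs from OldModel's -> this is where the paths diverge
```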
Debug with the idealized scientific method #
Sometimes logging tons of stuff turns into flailing. In such situations, going the opposite direction and really being deliberate about my hypothesis set or even tree often helps me decide what to try or explore next. By “idealized scientific method,” I mean specifically:
- Formulate clear, testable hypotheses about what might be causing the issue
- Design experiments that can definitively confirm or rule out each hypothesis
- Prioritize tests that can eliminate multiple hypotheses at once
- Record your observations systematically rather than just mentally noting them
- Revisit and refine your hypotheses based on evidence
This approach forces you to slow down and think more clearly, which often reveals assumptions you didn’t even realize you were making. It also prevents you from repeating the same tests with slight variations, which is a common trap.
Admittedly, I often find that once I do this, I don’t resolve the issue by exhaustively exploring the tree but instead realize that it’s some other, usually dumb, thing; even so, going through the process still seems to play a key role in getting to that realization.
This more structured approach is particularly valuable when you’ve already tried the quick wins from reading logs carefully and adding more logging but still don’t have a clear direction.
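For what it’s worth, a minimal sketch of what “recording observations systematically” can look like; the structure and the hypotheses are illustrative rather than a prescribed tool, and a plain text file works just as well:

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    description: str
    test: str                       # the experiment that would confirm or rule it out
    status: str = "open"            # "open", "ruled out", or "confirmed"
    evidence: list[str] = field(default_factory=list)


hypotheses = [
    Hypothesis("Stale search index", "re-run the query against a freshly rebuilt index"),
    Hypothesis("Service B drops the request", "add request-id logging at B's entry point"),
]

# After each experiment, update the log instead of keeping the result in your head.
hypotheses[0].status = "ruled out"
hypotheses[0].evidence.append("rebuilt index still reproduces the failure")

for h in hypotheses:
    print(f"[{h.status}] {h.description} -- next: {h.test}")
```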
Examples #
I know I’ve had success with this, but I have been especially bad about recording cases when I have.
- I was collaborating to debug a failure that spanned multiple systems, including a search index and a few services that relied on it. We got stuck for a while and then eventually mapped out the system and enumerated our hypotheses for what was happening. This helped, although I unfortunately don’t remember what the actual resolution was.
You can (often) just edit library code #
This is easiest in Python and some other interpreted languages, but it’s saved me enough times that it’s definitely worth mentioning. If an error is happening deep inside of library code, often the biggest challenge in figuring out what’s wrong is that the library doesn’t log enough state to be able to tell what’s happening in the program immediately before the error. If possible, finding a way to directly edit the underlying library code to log the information you need or try the change you think could fix it can be a cheap, fast way to resolve these types of errors.
This approach works especially well when combined with speeding up your iteration loop, as you can quickly test different logging statements or candidate fixes directly in the library code. This may seem intimidating, and it can go wrong – Git won’t save you if you forget to revert your changes when you’re done – but it’s usually possible with enough work. I mentioned doing this with interpreted languages, but even with compiled ones, if you’re willing to put in enough effort (e.g. rebuild the dependency from source), it’s possible as well.
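A small sketch of the first step, assuming a pure-Python dependency: find where the installed copy actually lives so you can drop a print/log straight into it (and remember to undo the edit afterwards). Compiled extensions need a rebuild from source instead, as noted above.

```python
# Locate the file to edit for a given installed library.
import importlib.util

spec = importlib.util.find_spec("json")  # swap in the library you're debugging
print("edit this file:", spec.origin)
```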
On the other hand, this strategy likely won’t help, or will be prohibitively painful, in situations where:
- Reproducing the issue requires long cycle times and running remotely, e.g. in a web server, a large-scale training job, etc.
- The issue isn’t hidden by layers of library code.
Examples #
- I kept having issues using torch checkpointing with some other features for a model we were training. Debugging was tricky because I couldn’t tell where exactly the failure was first occurring. Eventually, I got this running on a development box, went into the intermediate library, and logged when the training process hit certain points. This let me see exactly where the failure was occurring and resolve the issue.
Speed up your iteration loop #
I hesitated to even include this because it’s repeated all over the place, but I’d rather err on the side of capturing everything that’s worked well for me. Even though it’s obvious, it’s so easy to be lazy about speeding up debugging iteration loops. To this day, I still often wait too long before taking a long-running remote job that’s triggering an error and putting in the work to create a setup where I can rapidly trigger the bug in a controlled environment. Because this is a form of hyperbolic discounting, what I’ve found works best is having a trigger in my head that asks, “if someone were watching me, would they tell me I’m wasting tons of time waiting because I’m too lazy to speed things up?”
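One concrete, hedged version of “putting in the work”: when a long job fails inside some function, dump that function’s inputs at the failure point, then replay them in a small local script so the bug can be triggered in seconds instead of hours. `troublesome_step` and the pickle path below are illustrative stand-ins.

```python
import pickle


def troublesome_step(batch):
    if sum(batch) == 0:
        raise ValueError("degenerate batch")  # the bug we want to reproduce quickly
    return [x / sum(batch) for x in batch]


def wrapped_step(batch):
    try:
        return troublesome_step(batch)
    except Exception:
        with open("failing_input.pkl", "wb") as f:
            pickle.dump(batch, f)  # capture the offending input once, in the slow environment
        raise


# Later, in a fast local repro script:
#   with open("failing_input.pkl", "rb") as f:
#       batch = pickle.load(f)
#   troublesome_step(batch)  # iterate on the fix against this exact input

print(wrapped_step([1, 2, 3]))
```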
Examples #
- Nearly every ML bug I have debugged benefited from finding the smallest scale situation in which it was possible to reproduce and iterating there vs. in the large scale regime. In some cases, small scale may still mean complicated, such as if the issue requires training on multiple nodes. But even then, outside of cases where the issue is obvious, I don’t think I’ve ever regretted finding a smaller, more controlled scope in which an issue still occurs and debugging there.