Monday, August 30, 2010

Memory leaks, log analysis and Unix command-line tools

I had cause recently to tackle a fairly tricky memory leak bug in the Delphi compiler as hosted by the IDE. The test case for the bug was complicated: compiling and running a large portion of the IDE itself. After a compile, subsequent debug runs of the IDE kept on increasing the debugging IDE's memory usage.

The memory profiler in Automated QA's AQtime tool provided a very rough indication of the call stack that allocated the memory. The reason it was very rough, and frankly, not particularly useful, was because the Delphi compiler uses a memory suballocator for performance and complexity reasons. Dcc - the Delphi compiler - has roughly 3 different classes of memory allocation: memory associated with a unit, memory associated with a depth in the call stack, and temporary memory that will be done away with periodically, in particular, memory associated with parsing the syntax tree for any one procedure and generating code for it. What the stack from AQtime did tell me was that the memory was (at least initially) associated with unit allocations. Since the compiler reuses memory as units get freed, the initial allocation isn't necessarily the one that gets leaked

To get better insight into what happened around the allocator, I instrumented it. The Delphi compiler in its debug build has a simple logging system. Every logging point has a string associated with it, and by specifying a list of globs in an environment variable, the compiler will output more or less logging info, according to how those globs match the strings associated with each logging point. For example, a log point inside the memory allocator might look like "dcc.memory.alloc", and another like "". A glob like "dcc.memory.*" would then match both of these; alternatively, "*.free;*.alloc" would match both of them, but also other logical allocations and frees that use the same naming convention. Whether or not any logging point is matched by the log glob is usually cached in static variables at each logging point, so there's surprisingly little performance loss even from very aggressive logging.

I ended up with a log file which output every new unit allocation, along with all the unit frees, and some supplementary logging relating to how the compiler managed the state of units - information it used to decide whether or not to free the unit, or keep it around for reuse later. It ended up being about 2.4G in size. That's a little big to try and eyeball, so I used some tools to whittle down the interesting bits.

There's a general pattern I use; probably not the most efficient, but it works well for me because it uses the Unix-standard (Cygwin in my case) command-line tools I use most often, rather than other tools that might be more suited but I'm less familiar with. It generally goes like this:

egrep -o 'interesting regex' log | sed -r 's|.*boring (interesting).*|\1|'

That is, it extracts just the interesting lines out, then only the interesting bits of those lines, using extended regex with egrep, and extended regex with a capture in sed. The sed s command is a substitution; I substitute the entire line with the captured interesting bit.

Using this technique, I extracted every allocation address into one temporary output file and every deallocation address into another. I could then do set operations on them to discover allocations that aren't matched by deallocations:

sort allocs deallocs deallocs | uniq -u > leaks

Sort will bring all the allocations and deallocations together. Uniq, without any arguments, reduces consecutive duplicate lines down to a single line; but with -u, it removes all lines that are part of consecutive duplicates, and leaves the unique lines. By specifying deallocs twice, I guarantee that no deallocations with an unmatched allocation will appear in the output; only allocations with an unmatched deallocation.

Often this is sufficient. But in this case, the enormous amount of churn, thousands of units, meant that duplicate memory locations were showing up, associated with different units. How did I know this? I did this:

fgrep -f leaks log | egrep -o 'interesting etc.' ...

Fgrep searches for fixed strings rather than regexes, which is much faster. The -f argument tells fgrep to load the list of fixed strings from the subsequent file argument, one string per line. This is a quick way of whittling down a log file to only those lines containing one of possibly thousands of strings.

When slightly more processing is needed, a different approach can work, using bash:

for item in $(<leaks); do fgrep "$item" log | ...; done > output

That works well if you don't expect $item to contain spaces; bash breaks it up on each whitespace boundary, rather than each line. For lines, I prefer this:

cat leaks | while read -r line; do ...; done > output

I use the redundant cat at the start for readability and ease of use; bash would otherwise need the input redirector (<) at the end of the command. The bash command "read" loads a line from standard input into the specified variable. The -r argument prevents it from messing with the string, doing funny things with backslashes, which, in my experience, is almost always undesirable.

Anyhow, to get around the duplicate problem, I changed tack. I modified my filtering so that I had a simple text file consisting of lines like "alloc <address>" and "free <address>" in chronological order of the event, and then wrote a simple program which kept track of each address in a hash table, so that it wouldn't be fooled by reuse of addresses. This got me an unambiguous list of leaked allocations.

Once I had a reliable means of extracting the addresses of the leaks, I scripted the process (not much more than a list of command-lines in a bash script) so that I could easily experiment with deeper and more focused logging. This was the real meat of the process of fixing the bug, and involved a lot of scanning the log file with the "less" pager's search forward (/) and backward (?) commands, tracing the lifetime of various units. But the fact that I knew what to search for, and could whittle down the log file as necessary, was down to what had come earlier from the command-line tools.

I don't know how other programmers tackle these kinds of problems, but these are some of the tools in my toolbox, and they come in surprisingly useful every day. In many ways, debugging is the least talked-about discipline in software engineering, and I think it would be of great benefit to us all to share more of our experiences and techniques.


Anonymous said...

Hmmm, in this case AQTime didn't seem to be quite useful ...

Just curious: Should we expect the mentioned leak is fixed in XE ?

Barry Kelly said...

Well, AQtime certainly located the bug in the compiler, and gave a hint as to where to look, so it wasn't useless by any means.

I should expect that the bug is indeed fixed in XE.

Moritz Beutel said...

Thanks for sharing. My experience with Unix command line utilities doesn't go beyond sed and grep, so it was very useful to get a practical example.

Oh, and it might be the fact that most non-trivial bugs are really fixed by "psychic debugging" (a term which I believe was coined by Raymond Chen: instead of tool-based debugging that keeps more experienced people from discussing their approaches. Even though some people discuss how they do "psychic debugging" (Raymond being one of them), it mostly happens when someone had occasion to use/show off his "super powers" in the process of psychic debugging.

My approach to resolve most of my bugs is loosely based on "psychic debugging", but not spectacular at all. Of course I use a normal debugger to get rid of the "easy" bugs, but when I'm not able to find the cause of a difficult problem, I start reading the code (or recalling it, which is even easier if I wrote the code myself :) ), trying to grasp the intentions. Usually at some point there is this "light bulb" moment, and I see how the bug can happen in the given circumstances. But I really have no idea how I could discuss this approach in a way that is useful for others :)

jouniaro said...

Interesting article, Barry. Somehow this resembles the way I tackle the most complicated issues as well.

I am an extensive user of instrumentation based logs. These come very handy in situations that the bugs reveal themselves at customer sites only. And it has often helped me to solve the issues simply by looking at the logs without even tryin to reproduce myself. However, the log can easily become too big for most editors - and it takes some expertise to find the correct information from a 2GB log file :)

I often use the same Unix tools as you do (awk can also come handy for operations that require inspection of several lines) in Linux, where it is much easier to work with large files as well. I found cygwin too awkward already ~10 years ago, compared to a real unix...

Anonymous said...

Wow, that's interesting, Barry...

So, what was the bug eventually?
Where was it found?

debug.c? unit.c? writeobj.c?

-- A friend.

Barry Kelly said...

package.c as called by browcmgr.c