Entropy Overload: August 2010

I had cause recently to tackle a fairly tricky memory leak bug in the Delphi compiler as hosted by the IDE. The test case for the bug was complicated: compiling and running a large portion of the IDE itself. After a compile, subsequent debug runs of the IDE kept on increasing the debugging IDE's memory usage.

The memory profiler in Automated QA's AQtime tool provided a very rough indication of the call stack that allocated the memory. The reason it was very rough, and frankly, not particularly useful, was because the Delphi compiler uses a memory suballocator for performance and complexity reasons. Dcc - the Delphi compiler - has roughly 3 different classes of memory allocation: memory associated with a unit, memory associated with a depth in the call stack, and temporary memory that will be done away with periodically, in particular, memory associated with parsing the syntax tree for any one procedure and generating code for it. What the stack from AQtime did tell me was that the memory was (at least initially) associated with unit allocations. Since the compiler reuses memory as units get freed, the initial allocation isn't necessarily the one that gets leaked

To get better insight into what happened around the allocator, I instrumented it. The Delphi compiler in its debug build has a simple logging system. Every logging point has a string associated with it, and by specifying a list of globs in an environment variable, the compiler will output more or less logging info, according to how those globs match the strings associated with each logging point. For example, a log point inside the memory allocator might look like "dcc.memory.alloc", and another like "dcc.memory.free". A glob like "dcc.memory.*" would then match both of these; alternatively, "*.free;*.alloc" would match both of them, but also other logical allocations and frees that use the same naming convention. Whether or not any logging point is matched by the log glob is usually cached in static variables at each logging point, so there's surprisingly little performance loss even from very aggressive logging.

I ended up with a log file which output every new unit allocation, along with all the unit frees, and some supplementary logging relating to how the compiler managed the state of units - information it used to decide whether or not to free the unit, or keep it around for reuse later. It ended up being about 2.4G in size. That's a little big to try and eyeball, so I used some tools to whittle down the interesting bits.

There's a general pattern I use; probably not the most efficient, but it works well for me because it uses the Unix-standard (Cygwin in my case) command-line tools I use most often, rather than other tools that might be more suited but I'm less familiar with. It generally goes like this:

egrep -o 'interesting regex' log | sed -r 's|.*boring (interesting).*|\1|'

That is, it extracts just the interesting lines out, then only the interesting bits of those lines, using extended regex with egrep, and extended regex with a capture in sed. The sed s command is a substitution; I substitute the entire line with the captured interesting bit.

Using this technique, I extracted every allocation address into one temporary output file and every deallocation address into another. I could then do set operations on them to discover allocations that aren't matched by deallocations:

sort allocs deallocs deallocs | uniq -u > leaks

Sort will bring all the allocations and deallocations together. Uniq, without any arguments, reduces consecutive duplicate lines down to a single line; but with -u, it removes all lines that are part of consecutive duplicates, and leaves the unique lines. By specifying deallocs twice, I guarantee that no deallocations with an unmatched allocation will appear in the output; only allocations with an unmatched deallocation.

Often this is sufficient. But in this case, the enormous amount of churn, thousands of units, meant that duplicate memory locations were showing up, associated with different units. How did I know this? I did this:

fgrep -f leaks log | egrep -o 'interesting etc.' ...

Fgrep searches for fixed strings rather than regexes, which is much faster. The -f argument tells fgrep to load the list of fixed strings from the subsequent file argument, one string per line. This is a quick way of whittling down a log file to only those lines containing one of possibly thousands of strings.

When slightly more processing is needed, a different approach can work, using bash:

for item in $(<leaks); do fgrep "$item" log | ...; done > output

That works well if you don't expect $item to contain spaces; bash breaks it up on each whitespace boundary, rather than each line. For lines, I prefer this:

cat leaks | while read -r line; do ...; done > output

I use the redundant cat at the start for readability and ease of use; bash would otherwise need the input redirector (<) at the end of the command. The bash command "read" loads a line from standard input into the specified variable. The -r argument prevents it from messing with the string, doing funny things with backslashes, which, in my experience, is almost always undesirable.

Anyhow, to get around the duplicate problem, I changed tack. I modified my filtering so that I had a simple text file consisting of lines like "alloc <address>" and "free <address>" in chronological order of the event, and then wrote a simple program which kept track of each address in a hash table, so that it wouldn't be fooled by reuse of addresses. This got me an unambiguous list of leaked allocations.

Once I had a reliable means of extracting the addresses of the leaks, I scripted the process (not much more than a list of command-lines in a bash script) so that I could easily experiment with deeper and more focused logging. This was the real meat of the process of fixing the bug, and involved a lot of scanning the log file with the "less" pager's search forward (/) and backward (?) commands, tracing the lifetime of various units. But the fact that I knew what to search for, and could whittle down the log file as necessary, was down to what had come earlier from the command-line tools.

I don't know how other programmers tackle these kinds of problems, but these are some of the tools in my toolbox, and they come in surprisingly useful every day. In many ways, debugging is the least talked-about discipline in software engineering, and I think it would be of great benefit to us all to share more of our experiences and techniques.

Entropy Overload

Monday, August 30, 2010

Memory leaks, log analysis and Unix command-line tools

About Me

Blog Archive