Friday, July 07, 2006

The not so lazy garbage collector

A poster on microsoft.public.dotnet.languages.csharp (the original post is here) wrote in the other day about a bug he had - an access violation. He wasn't using any unmanaged code, so my curiosity was piqued. The reason? It's theoretically impossible to AV without using unmanaged code, so he had almost certainly found a CLR or BCL bug. I was duly motivated to get to the root of the problem.

I'm very glad I stuck with it, because I found out something I wasn't aware of, something which will make me very careful in similar scenarios in the future. If you're not interested in the path to discovery, the summary is at the end.

I did some investigation, and I discovered that the problem lay in the Ping class. Here's some code which reproduces the problem:

using System;
using System.Net;
using System.Net.NetworkInformation;

class App
{
    static void Main()
    {
        try
        {
            for (;;)
                new Ping().Send(IPAddress.Loopback);
        }
        catch (Exception ex)
        {
            if (ex.InnerException != null)
                ex = ex.InnerException;
            Console.WriteLine("Error: {0} ({1})", ex.Message,
                                                  ex.GetType().Name);
        }
    }
}
(I've registered this as a bug on the Microsoft Connect site.)

This wasn't enough for me. I wanted to know how it happened. I looked at the code for Ping using .NET Reflector, but it wasn't obvious to me. I fired up WinDbg and ran the above code until the access violation exception was triggered (it can take some seconds, depending on your machine):

(a94.87c): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=04f20b74 ebx=00000000 ecx=00000008 edx=00000000 esi=04f20b54 edi=012e7580
eip=78144d3a esp=0012f278 ebp=0012f280 iopl=0         nv up ei pl zr na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010246
MSVCR80!memcpy+0x5a:
78144d3a f3a5            rep  movsd ds:04f20b54=???????? es:012e7580=00000000
So, it was failing in memcpy. Where was that called from? SOS and !dumpstack gives the right info (excerpted):
> .load sos
> !dumpstack

OS Thread Id: 0x87c (0)
Current frame: MSVCR80!memcpy+0x5a [F:\RTM\vctools\...\intel\memcpy.asm:188]
ChildEBP RetAddr  Caller,Callee
0012f280 79e808d5 mscorwks!memcpyNoGCRefs+0x11, // ...
0012f290 79f02572 mscorwks!MarshalNative::CopyToManaged+0x11b, // ...
0012f2e0 79f02491 mscorwks!MarshalNative::CopyToManaged+0x22, // ...
0012f340 7a615416 (/*...*/ System.Net.NetworkInformation.PingReply..ctor(// ...
0012f358 7a61474e (/*...*/ System.Net.NetworkInformation.Ping.InternalSend(// ...
0012f37c 79eec356 mscorwks!FastAllocateObject+0xa6, calling mscorwks!Object// ...
0012f3fc 7a6139b4 (/*...*/ System.Net.NetworkInformation.Ping.Send(System.// ...
0012f43c 7a613752 (/*...*/ System.Net.NetworkInformation.Ping.Send(System.// ...
0012f454 00ca009e (MethodDesc 0x922fd8 +0x2e App.Main()), calling 7a87f55c
// ...
I can see here that the last managed code was inside PingReply's constructor. Here's what the appropriate constructor overload looks like, roughly:
internal PingReply(IcmpEchoReply reply)
{
    this.address = new IPAddress((long) reply.address);
    this.ipStatus = (IPStatus) reply.status;
    if (this.ipStatus == IPStatus.Success)
    {
        this.rtt = reply.roundTripTime;
        this.buffer = new byte[reply.dataSize];
        Marshal.Copy(reply.data, this.buffer, 0, reply.dataSize);
        this.options = new PingOptions(reply.options);
    }
    else
    {
        this.buffer = new byte[0];
    }
}
The call to MarshalNative::CopyToManaged() is inside Marshal.Copy(), and it's copying data from reply.data (an IntPtr) to the managed array, this.buffer. This structure is filled by the Windows API function IcmpSendEcho(). The buffer is allocated by the caller, as is usual for Windows API functions. The buffer is owned through a SafeHandle descendant, SafeLocalFree, and it's owned by the Ping class.

The memcpy function compiles to an x86 string operation, with esi and edi being the source and destination registers. From the original exception data at the top, one can see that ESI (the source) is an invalid pointer - that's what caused the access violation. Thus, the buffer has been freed early.

How did that happen? One can use !dso (aka !dumpstackobjects) to find all the live objects rooted on the stack or registers:

> !dso

OS Thread Id: 0x87c (0)
ESP/REG  Object   Name
0012f2ac 012e7508 System.Net.NetworkInformation.PingReply
0012f2c0 012e7508 System.Net.NetworkInformation.PingReply
0012f32c 012e7578 System.Byte[]
0012f350 012e7508 System.Net.NetworkInformation.PingReply
0012f410 012e74b4 System.Byte[]
0012f438 0128142c System.Net.IPAddress
0012f448 012e74b4 System.Byte[]
Aha! There's no Ping object! Even though several instance methods of the Ping object (two Ping.Send overloads and Ping.InternalSend) are currently executing, Ping is in fact eligible for garbage collection!

It appears that's what's happened. Let's see if that can be confirmed. If one runs !dumpheap -stat, there's still lots of Pings on the heap:

> !dumpheap -stat

// ...
7a779154     2041        40820 System.Net.SafeLocalFree
7a778ec0     2041        40820 System.Net.SafeCloseHandle
7915ff38     2041        65312 System.Threading.SendOrPostCallback
79124418     2043        89864 System.Byte[]
7a7812e0     2041       179608 System.Net.NetworkInformation.Ping
// ...
Let's see if we can find the Ping that's on our call stack:
> !dumpheap -type Ping

// ...
012e7040 7a7812e0       88     
012e710c 7a7812e0       88     
012e71d8 7a7812e0       88     
012e72a4 7a7812e0       88     
012e7370 7a7812e0       88     
012e743c 7a7812e0       88     
012e7508 7a781580       32     
total 2042 objects
Statistics:
      MT    Count    TotalSize Class Name
7a781580        1           32 System.Net.NetworkInformation.PingReply
7a7812e0     2041       179608 System.Net.NetworkInformation.Ping
Total 2042 objects
Now, since there is only one thread allocating Pings and there's nothing keeping them alive, I can expect that the Ping in the highest address is the most recently allocated one. To confirm that, I'll test the last two Ping objects listed (012e743c and 012e7370):
> !gcroot 012e7370
Note: Roots found on stacks may be false positives. Run "!help gcroot" for
more info.
Scan Thread 0 OSTHread 87c
Scan Thread 2 OSTHread 828

> !gcroot 012e743c
Note: Roots found on stacks may be false positives. Run "!help gcroot" for
more info.
Scan Thread 0 OSTHread 87c
ESP:12f390:Root:0130e1cc(System.Net.NetworkInformation.Ping)->
012e7494(System.Threading.SendOrPostCallback)->
012e743c(System.Net.NetworkInformation.Ping)
ESP:12f3f8:Root:0130e1cc(System.Net.NetworkInformation.Ping)->
012e7494(System.Threading.SendOrPostCallback)
ESP:12f434:Root:0130e1cc(System.Net.NetworkInformation.Ping)->
012e7494(System.Threading.SendOrPostCallback)
ESP:12f450:Root:0130e1cc(System.Net.NetworkInformation.Ping)->
012e7494(System.Threading.SendOrPostCallback)
Scan Thread 2 OSTHread 828
Here it can be seen that of the two, only 012e743c is alive. Because of the pattern of the loop, it would be expected that any locations holding Ping would be overwritten each time around, so this Ping at 012e743c should be the right one. Peeking inside it with !do (aka !dumpobject):
> !do 012e743c

Name: System.Net.NetworkInformation.Ping
MethodTable: 7a7812e0
EEClass: 7a7e9100
Size: 88(0x58) bytes
 (C:\WINDOWS\assembly\GAC_MSIL\System\2.0.0.0__b77a5c561934e089\System.dll)
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
790f9c18  4000184        4        System.Object  0 instance 00000000 __identity
// ...
7a779154  4002c06       20 ...Net.SafeLocalFree  0 instance 012e74f4 replyBuffer
// ...
I'm really only interested in this guy, the replyBuffer - because that's the guy who's pointing to invalid memory. Drilling into that location:
> !do 012e74f4

Name: System.Net.SafeLocalFree
MethodTable: 7a779154
EEClass: 7a7db890
Size: 20(0x14) bytes
 (C:\WINDOWS\assembly\GAC_MSIL\System\2.0.0.0__b77a5c561934e089\System.dll)
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
790fe160  40005b4        4        System.IntPtr  0 instance 82971448 handle
790fed1c  40005b5        8         System.Int32  0 instance        3 _state
79104f64  40005b6        c       System.Boolean  0 instance        1 _ownsHandle
79104f64  40005b7        d       System.Boolean  0 instance        1 _fullyInitialized
7a779154  40025ec      948 ...Net.SafeLocalFree  0   static 00000000 Zero
Now, the layout of the _state field is explained in the SSCLI 2.0, in the file clr/src/vm/safehandle.cpp:
// So the state field ends up looking like this:
//
//  31                                                        2  1   0
// +-----------------------------------------------------------+---+---+
// |                           Ref count                       | D | C |
// +-----------------------------------------------------------+---+---+
// 
// Where D = 1 means a Dispose has been performed and C = 1 means the
// underlying handle has (or will be shortly) released.
This shows that a _state value of 3 means that it's been both disposed and released. So, even though the Ping and SafeLocalFree objects haven't actually been garbage collected, the eager finalization of SafeHandle objects has released the buffer early. I'll consider that definitive.

Summary

There's at least two lessons from this:
  1. In general, one cannot rely on the fact that an instance method is on the call stack to keep the instance alive. One needn't be too paranoid, though: the objects can only be collected if the current line in each instance method is near or at the end of the method, so that 'this' is no longer on the stack or enregistered.
  2. Thus, one needs to be extra careful to keep SafeHandle instances alive, either by passing them along with any buffer / handle value extracted, or by using SafeHandle.DangerousAddRef()/SafeHandle.DangerousRelease() in a try/finally block to be absolutely sure that an early release doesn't occur.
Finally, to make everything crystal clear - what will the following program print?
using System;

class App
{
    class A
    {
        public B Foo()
        {
            return new B();
        }
        
        ~A()
        {
            Console.WriteLine("A finalized.");
        }
    }
    
    class B
    {
        public B()
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            Console.WriteLine("B now returning.");
        }
    }
    
    static void Main()
    {
        new A().Foo();
    }
}
If you've followed everything, you'll know that this is what it prints (compiled in Release mode):
A finalized.
B now returning.
Maybe this was obvious to everybody - but it certainly wasn't obvious to me, nor to the designer of the Ping class!

1 comment:

F Quednau said...

Greetings Barry,

completely unrelated to this post of yours,but nonetheless my thanks for the tip you posted on http://aspalliance.com/groups/microsoft_public_dotnet_framework_clr/ng-362954_Question_Dynamically_Raise_Event_using_Reflection_.aspx
concerning how to raise an event reflection-style! Now my generic keystroke handler should work nicely! The working dll can be downloaded at my website.

I can see that you're well on top of the game concerning dotnet and the surrounding stuff, so I shall keep an eye on your blog, thanks for the info you provide on this and that.