Monday, May 29, 2006

Call vs CallVirt for C# non-virtual instance methods

This entry was prompted by a post over in microsoft.public.dotnet.languages.csharp, where someone asked if the C# compiler should issue a warning for an expression like:
this == null
The answer is no, for a fairly complicated reason. If C# were the only language targeting the CLR, then the poster might have a point - calling an instance method on a null instance always throws an exception in C#. However, other languages targeting the CLR can invoke a non-virtual instance method on a null instance, without error. In particular, the Delphi object model's TObject.Free method takes advantage of this to only call the destructor for non-null objects. How does this work under the covers? Well, it comes down to the difference in semantics between the 'call' and 'callvirt' CIL instructions. Note: everything I mention in this entry applies only to non-virtual instance methods.

The C# compiler emits 'callvirt' even for calls to non-virtual instance methods. One case I am aware of where it generates a 'call' instruction instead is when an overriding method calls the base class's implementation of a virtual method. In that case the compiler can safely use 'call': the instance can't be null, because the overriding method was itself (ultimately) reached through virtual dispatch.

The rationale for using 'callvirt' instead of 'call' for C# non-virtual instance methods is, I would guess, to fail sooner: a call on a null reference is detected at the call site rather than somewhere inside the callee. For example, a method might check only its arguments, determine that nothing needs doing, and return without ever touching 'this', so no exception would occur. If the compiler didn't generate code that checked the instance, the fact that an instance method was called on a null reference might not be caught until much later, if at all.
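The early-return scenario can be tried from C# itself, without ilasm, by emitting the two instructions through System.Reflection.Emit. What follows is a minimal sketch added for illustration, not part of the test assembly analysed in this entry: Widget, Describe, NullThisDemo and the Invoker delegate are hypothetical names, and the sketch assumes a runtime on which DynamicMethod is available.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical class, used only for this illustration.
class Widget
{
    int count = 42;

    // Non-virtual instance method that touches 'this' only when asked to.
    public string Describe(bool touchState)
    {
        if (!touchState)
            return "returned early, this == null? " + ((object)this == null);
        return "count = " + count; // dereferences 'this'
    }
}

delegate string Invoker();

static class NullThisDemo
{
    // Emits: ldnull; ldc.i4.0; <call|callvirt> Widget::Describe(bool); ret
    public static Invoker Build(OpCode callOpcode)
    {
        MethodInfo describe = typeof(Widget).GetMethod("Describe");
        DynamicMethod dm = new DynamicMethod(
            "InvokeOnNull", typeof(string), Type.EmptyTypes, typeof(Widget).Module);
        ILGenerator il = dm.GetILGenerator();
        il.Emit(OpCodes.Ldnull);       // 'this' argument is null
        il.Emit(OpCodes.Ldc_I4_0);     // touchState = false
        il.Emit(callOpcode, describe); // the instruction under test
        il.Emit(OpCodes.Ret);
        return (Invoker)dm.CreateDelegate(typeof(Invoker));
    }

    // Runs the emitted method and reports what happened.
    public static string Try(OpCode callOpcode)
    {
        try { return Build(callOpcode)(); }
        catch (NullReferenceException) { return "NullReferenceException at the call site"; }
    }

    static void Main()
    {
        Console.WriteLine("call:     " + Try(OpCodes.Call));
        Console.WriteLine("callvirt: " + Try(OpCodes.Callvirt));
    }
}
```

With 'call', I would expect Describe to run and return early even though 'this' is null; with 'callvirt', the NullReferenceException should be raised before Describe runs at all.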

What's the difference in JIT-compiled code between 'call' and 'callvirt' on CLR 2.0.50727?

For this analysis, I started with this CIL:

.assembly extern mscorlib {}
.assembly Test {}
.subsystem 0x0003

.class App extends [mscorlib]System.Object
{
    .method public instance void Test()
    {
        ldstr "This is null\? {0}"
        ldarg.0     // this
        ldnull
        ceq         // this == null
        box [mscorlib]System.Boolean
        call void [mscorlib]System.Console::WriteLine(string,object)
        ret
    }

    .method public static void Main()
    {
        .entrypoint
        ldstr "First:"
        call void [mscorlib]System.Console::WriteLine(string)
        ldnull      // null 'this'
        call instance void App::Test()
        ldstr "Second:"
        call void [mscorlib]System.Console::WriteLine(string)
        ldnull      // null 'this'
        callvirt instance void App::Test()
        ret
    }
}
Roughly transliterated into C# code, it looks like this:
using System;

class App
{
    public void Test()
    {
        Console.WriteLine("This is null? {0}", this == null);
    }

    public static void Main()
    {
        ((App) null).Test(); // with 'call': not possible in MS C# 2.0
        ((App) null).Test(); // with 'callvirt': default for C# compiler
    }
}
This assembly's name is Test, but I compiled it to an executable called CallVirt.exe, with ilasm, and started the VS 2005 debugger:
ilasm -debug=opt
devenv -debugexe CallVirt.exe
I changed the project's debugger settings from Auto to Mixed, and stepped into the code. When disassembled with SOS, the code for the App.Main method looks like this:
.load sos
extension C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\sos.dll loaded

!name2ee CallVirt.exe App.Main
PDB symbol for mscorwks.dll not loaded
Module: 00912c14 (CallVirt.exe)
Token: 0x06000002
MethodDesc: 00912fe0
Name: App.Main()
JITTED Code Address: 00de0070

!u 00de0070
Normal JIT generated code
Begin 00de0070, size 61
>>> 00DE0070 833D84102B0200   cmp         dword ptr ds:[022B1084h],0
00DE0077 750A             jne         00DE0083
00DE0079 B901000000       mov         ecx,1
00DE007E E889D75678       call        7934D80C 
         (System.Console.InitializeStdOutError(Boolean), mdToken: 0600070f)
00DE0083 8B0D84102B02     mov         ecx,dword ptr ds:[022B1084h]
00DE0089 8B153C302B02     mov         edx,dword ptr ds:[022B303Ch]
00DE008F 8B01             mov         eax,dword ptr [ecx]
00DE0091 FF90D8000000     call        dword ptr [eax+000000D8h]
00DE0097 33C9             xor         ecx,ecx
00DE0099 FF1520309100     call        dword ptr ds:[00913020h]
00DE009F 833D84102B0200   cmp         dword ptr ds:[022B1084h],0
00DE00A6 750A             jne         00DE00B2
00DE00A8 B901000000       mov         ecx,1
00DE00AD E85AD75678       call        7934D80C 
         (System.Console.InitializeStdOutError(Boolean), mdToken: 0600070f)
00DE00B2 8B0D84102B02     mov         ecx,dword ptr ds:[022B1084h]
00DE00B8 8B1540302B02     mov         edx,dword ptr ds:[022B3040h]
00DE00BE 8B01             mov         eax,dword ptr [ecx]
00DE00C0 FF90D8000000     call        dword ptr [eax+000000D8h]
00DE00C6 33C9             xor         ecx,ecx
00DE00C8 3909             cmp         dword ptr [ecx],ecx
00DE00CA FF1520309100     call        dword ptr ds:[00913020h]
00DE00D0 C3               ret
This code is longer than strictly "necessary" because the Console::WriteLine(string) method has been inlined. The two relevant snippets of code for 'call' and 'callvirt', including the setting of the 'this' argument to null, are as follows:
00DE0097 33C9             xor         ecx,ecx
00DE0099 FF1520309100     call        dword ptr ds:[00913020h]

00DE00C6 33C9             xor         ecx,ecx
00DE00C8 3909             cmp         dword ptr [ecx],ecx
00DE00CA FF1520309100     call        dword ptr ds:[00913020h]
Thus, the difference with CALLVIRT is that it tests the pointer by dereferencing it. That causes a hardware exception when the pointer is null, and that hardware exception gets propagated to the CLR via Windows SEH.

Something interesting that can be observed from this: the calls to the App.Test() method are through an indirection. A peek in the address shows the data:

>d -format:fourbytes 0x00913020
0x00913020  00de00e8 00de0070 00000080 022b1ec4  
0x00913030  912fd8b8 e9ed8900 ffa2eed0 912fe0b8  
0x00913040  e9ed8900 ffa2eec4 576f62e8 cccc5e79  
0x00913050  00912fe0 00000000 00000000 00000000  
One can then disassemble the code at the indirect location:
!u 0x00de00e8

Normal JIT generated code
Begin 00de00e8, size 44
>>> 00DE00E8 57               push        edi

... etc.
So, calls to non-virtual instance methods compiled with the current C# compiler get turned into CIL 'callvirt' instructions, which, with the current JIT compiler, test the 'this' argument with a CMP instruction. Other languages which use 'call' simply call straight through without the test.

Sunday, May 21, 2006

Delphi/Win32 and COM Interface casting

Over in borland.public.delphi.language.delphi.win32, Kevin Donn asked some questions about memory allocations caused by casting Delphi object instances to interfaces (which in Delphi/Win32, are always COM interfaces). My answer applies to Delphi/Win32 only.
Presumably this would not create a memory leak:
  i: IMyInterface
  o: TMyObject // supports IMyInterface
  i:=o as IMyInterface
The opposite may be true: it might free the object sooner than you think. Interfaces in Delphi/Win32 on classes that derive from TInterfacedObject follow COM rules. That basically means that mixing object references and interface references is dodgy. As soon as you cast or assign an object reference to an interface reference, it gets AddRef'd for the first time. When the last interface reference goes out of scope, it gets Release'd. If you still have an object reference to the object, then it will be a bad pointer - nasty.

It's best to either stick to COM rules and only access the object through interfaces (and thus get refcounted lifetime management), or else implement IInterface (aka IUnknown) yourself and use manual memory management.

But, wisdom aside, will the following cause a memory leak?
  p: pointer
  o: TMyObject // supports IMyInterface
  p:=pointer(o as IMyInterface)
This will create a temporary value (i.e. a kind of anonymous local variable) of type IMyInterface (which gets AddRef'd in the process), convert the interface address to a pointer, and then Release the interface. That may or may not free TMyObject. If it freed TMyObject, then both o and p will point to dead memory. If it didn't, then p is still valid, but it's just a pointer into o's memory. No memory is allocated in this process, but the object might be freed if no other interface references to o are in scope.

Pointers to interfaces are pointers into the middle of the object. A picture:

--- TMyObject ---
0: TMyObject metaclass pointer --->
// ...
n: TMyObject's IMyInterface vtable --->
// ...
// object data

--- TMyObject.IMyInterface vtable ---
// ... other IMyInterface methods
Normally, a pointer to a value of type TMyObject points to the start of the object, which itself points to the metaclass (i.e. TMyObject).

A pointer to an interface points to a vtable. This is defined by COM, which is a binary standard. This vtable is a list of function pointers. (These functions adjust the 'Self' pointer that is passed in as the first argument, and then jump to the real implementation of the methods.)

I'd draw better pictures, but it's very tedious in ASCII.

Both the TMyObject metaclass and the vtables are statically allocated as part of the EXE or DLL image, and don't need to be freed.

Perhaps this program may make things clearer:

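program Test;

{$APPTYPE CONSOLE} // console application, so that Writeln has somewhere to write

uses SysUtils, Classes;

// Prints Count pointer-sized values starting at Start, one per line.
procedure Dump(Start: Pointer; Count: Integer);
var
  p: PPointer;
begin
  p := Start;
  while Count > 0 do
  begin
    Writeln(Format('%p: %p', [p, p^]));
    Inc(p);
    Dec(Count);
  end;
end;

var
  o: TInterfacedObject;
  i: IInterface;
begin
  o := TInterfacedObject.Create;
  Writeln('The Object');
  Dump(o, 4);
  Writeln('The Class');
  Dump(TInterfacedObject, 4);
  i := o;
  Writeln('The Interface');
  Dump(Pointer(i), 4);
end.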
When I run it on my system, this is what I get:
The Object
00A14E60: 0040111C
00A14E64: 00000000
00A14E68: 004010A1
00A14E6C: 00A14E81
The Class
0040111C: 6E495411
00401120: 66726574
00401124: 64656361
00401128: 656A624F
The Interface
00A14E68: 004010A1
00A14E6C: 00A14E81
00A14E70: 00000000
00A14E74: 00000001
Notice that the interface pointer is (in this case) at an offset of 8 from the object pointer. You can see that the vtable for TInterfacedObject's IInterface implementation is at $4010A1, while the metaclass is located at $40111C - relatively close together. Since .EXE images in Windows get linked so that their load address starts at $400000, you can infer from this that the metaclass and interface vtable are both part of the .EXE image.
More specifically, does the generation of an interface cause memory allocation and if so how does it get cleaned up?
The only memory used is part of the object, unless you've delegated the interface implementation to a property which returns an object derived from TAggregatedObject - which itself delegates AddRef and Release to its controller, the parent object.

I hope this makes it clearer. It's not a totally trivial question. You need to know what's going on beneath the hood to understand and use (and most especially implement) COM interfaces with any level of sophistication.

Wednesday, May 17, 2006

CLR TailCall Optimization (or lack thereof)

I've neglected this blog lately because I didn't know what I'd be putting in it. I'm not the type of person to post a public diary of my daily minutiae, yet dry didactic posts like my first bits on compiler implementation don't excite me either. So, the new approach I'm going to take is to put up on here some of my more interesting problems, solutions and analyses. That way they'll be there for my own reference later, I'll be able to point to them when answering questions on newsgroups, and perhaps even the good burghers of the net may wander in via Google.

Tasos Vogiatzoglou asked a question on the microsoft.public.dotnet.framework.clr newsgroup, wondering why tail. call was a slow MSIL sequence on the current .NET 2.0 CLR. My reply, including analysis, is below.

I assume that if the tail. command is supported by the jit (I do not think that is supported) it will be used in rather rare conditions of fully trusted code and perhaps code that does not access the execution stack (via stacktrace or sth) .

I think it is not optimized because no mainstream language currently uses it. It certainly is implemented in so far as the stack does not grow when you use tail. call to jump to the start of a method.

514 ms (with tailcall) / 77 ms (without tailcall). I cannot understand this ... Can anyone provide any helpful insight ?
The fact is, tail. call is not optimized via JIT to a jump currently. And when you try to debug all this under VS 2005, it lies to you about the code!

I started with this:

using System;

class App
{
    static double ArithmeticSum(int number, double result)
    {
        if (number == 0)
            return result;
        return ArithmeticSum(number - 1, number + result);
    }

    static void Main()
    {
        double result = 0;
        for (int i = 0; i < 10000; ++i)
            result = ArithmeticSum(10000, 1);
    }
}
I disassembled with ildasm, and rewrote ArithmeticSum to use tail. call:
// ...
    IL_000c:  ldarg.1
    IL_000d:  stloc.0
    IL_000e:  br.s       exit
// ...
    IL_0015:  ldarg.1
    IL_0016:  add
    tail. call float64 App::ArithmeticSum(int32, float64)
  } // end of method App::ArithmeticSum
I reassembled with ilasm /debug=OPT and, like you, I found that the tail-call version was much slower. So, I started "devenv /debugexe Test.exe", and stepped into the code.
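The same comparison can be set up without round-tripping through ildasm and ilasm, by emitting both variants of the method with System.Reflection.Emit. This is a minimal sketch added for illustration, not the program that was actually measured: TailCallDemo, SumFn, Build and TimeMs are hypothetical names, and the emitted IL is a hand-written equivalent of ArithmeticSum rather than the compiler's output. Timings depend heavily on the runtime and JIT in use, so none are shown.

```csharp
using System;
using System.Diagnostics;
using System.Reflection.Emit;

delegate double SumFn(int number, double result);

static class TailCallDemo
{
    // Emits a recursive ArithmeticSum(int, double), with or without the 'tail.' prefix.
    public static SumFn Build(bool useTailPrefix)
    {
        DynamicMethod dm = new DynamicMethod(
            "ArithmeticSum", typeof(double),
            new Type[] { typeof(int), typeof(double) },
            typeof(TailCallDemo).Module);
        ILGenerator il = dm.GetILGenerator();
        Label recurse = il.DefineLabel();

        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Brtrue_S, recurse); // if (number != 0) goto recurse
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(OpCodes.Ret);               // return result

        il.MarkLabel(recurse);
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldc_I4_1);
        il.Emit(OpCodes.Sub);               // number - 1
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Conv_R8);
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(OpCodes.Add);               // number + result
        if (useTailPrefix)
            il.Emit(OpCodes.Tailcall);      // the 'tail.' prefix
        il.Emit(OpCodes.Call, dm);          // recursive call to the method itself
        il.Emit(OpCodes.Ret);

        return (SumFn)dm.CreateDelegate(typeof(SumFn));
    }

    // Repeats the original loop: 10000 calls of ArithmeticSum(10000, 1).
    public static long TimeMs(SumFn sum, out double last)
    {
        last = 0;
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < 10000; ++i)
            last = sum(10000, 1);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        double r;
        Console.WriteLine("call:       {0} ms", TimeMs(Build(false), out r));
        Console.WriteLine("tail. call: {0} ms", TimeMs(Build(true), out r));
    }
}
```

Both variants compute the same value; only the instruction used for the recursive call differs, so any gap in the timings comes from how the runtime implements the tail. prefix.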

This is what VS says about the JIT compiled code in the disassembly window:

    IL_0000:  nop
00000000  push        ebp  
00000001  mov         ebp,esp 
00000003  push        edi  
00000004  push        esi  
00000005  push        ebx  
00000006  push        eax  
00000007  fld         qword ptr [ebp+8] 
    IL_0001:  ldarg.0
0000000a  test        ecx,ecx 
0000000c  setne       al   
0000000f  movzx       eax,al 
    IL_0009:  ldloc.1
00000012  test        eax,eax 
00000014  jne         00000018 

    IL_000c:  ldarg.1
00000016  jmp         00000039 

    IL_0010:  ldarg.0
00000018  mov         dword ptr [ebp-10h],ecx 
0000001b  fild        dword ptr [ebp-10h] 
0000001e  faddp       st(1),st 
00000020  sub         esp,8 
00000023  fstp        qword ptr [esp] 
00000026  dec         ecx  
00000027  mov         eax,dword ptr ds:[00923028h] 
0000002d  push        2    
0000002f  push        2    
00000031  push        1    
00000033  push        eax  
00000034  call        791B69B0         // <------- NOTE
00000039  pop         ecx  
0000003a  pop         ebx  
0000003b  pop         esi  
0000003c  pop         edi  
0000003d  pop         ebp  
0000003e  ret         8    
The important bit to note is the call to 791B69B0. Even in VS 2005 mixed mode debugging, it won't let you step into this code. When you try to step into it, the instruction pointer jumps back to the start of the method - effectively the call is implementing the tail call, but VS is "helpfully" hiding the details.

(Vance Morrison at MSFT shares my annoyance with this "feature".)

So, it's time to crack open SOS. In the Immediate window:

.load sos
!u 791B69B0
And I got this:
Unmanaged code
791B69B0 F8               clc
791B69B1 ??               db          ffh
791B69B2 ??               db          ffh
791B69B3 FF0400           inc         dword ptr [eax+eax]
791B69B6 0000             add         byte ptr [eax],al
791B69B8 0100             add         dword ptr [eax],eax
791B69BA 0000             add         byte ptr [eax],al
791B69BC 0000             add         byte ptr [eax],al
791B69BE 0C02             or          al,2
791B69C0 1000             adc         byte ptr [eax],al
There's something fishy going on here: this code isn't meaningfully executable!

So, I looked up my dear friend App.ArithmeticSum:

!name2ee Test.exe App.ArithmeticSum
Module: 00922c14 (Test.exe)
Token: 0x06000001
MethodDesc: 00922fd8
Name: App.ArithmeticSum(Int32, Double)
JITTED Code Address: 00de0100

!u 00de0100
Normal JIT generated code
App.ArithmeticSum(Int32, Double)
Begin 00de0100, size 41
>>> 00DE0100 55               push        ebp
00DE0101 8BEC             mov         ebp,esp
00DE0103 57               push        edi
00DE0104 56               push        esi
00DE0105 53               push        ebx
00DE0106 50               push        eax
00DE0107 DD4508           fld         qword ptr [ebp+8]
00DE010A 85C9             test        ecx,ecx
00DE010C 0F95C0           setne       al
00DE010F 0FB6C0           movzx       eax,al
00DE0112 85C0             test        eax,eax
00DE0114 7502             jne         00DE0118
00DE0116 EB21             jmp         00DE0139
00DE0118 894DF0           mov         dword ptr [ebp-10h],ecx
00DE011B DB45F0           fild        dword ptr [ebp-10h]
00DE011E DEC1             faddp       st(1),st
00DE0120 83EC08           sub         esp,8
00DE0123 DD1C24           fstp        qword ptr [esp]
00DE0126 49               dec         ecx
00DE0127 8B0528309200     mov         eax,dword ptr ds:[00923028h]
00DE012D 6A02             push        2
00DE012F 6A02             push        2
00DE0131 6A01             push        1
00DE0133 50               push        eax
00DE0134 E877691B79       call        79F96AB0 (JitHelp: CORINFO_HELP_TAILCALL)
00DE0139 59               pop         ecx
00DE013A 5B               pop         ebx
00DE013B 5E               pop         esi
00DE013C 5F               pop         edi
00DE013D 5D               pop         ebp
00DE013E C20800           ret         8
    call        79F96AB0 (JitHelp: CORINFO_HELP_TAILCALL)
This address, 79F96AB0, is different from the one in VS's most excellent disassembly view, 791B69B0.
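The two addresses are related by simple arithmetic. An x86 near call (opcode E8) stores its target as a signed 32-bit displacement relative to the address of the instruction that follows it. The sketch below is an added illustration rather than part of the original investigation; RelCall and Target are hypothetical names. It decodes the bytes E8 77 69 1B 79 from the listing twice: once at the real address 00DE0134 shown by SOS, and once at the method-relative offset 00000034 shown by VS.

```csharp
using System;

static class RelCall
{
    // Computes the target of a 5-byte "E8 rel32" call located at instructionAddress.
    public static uint Target(uint instructionAddress, byte[] encoding)
    {
        if (encoding == null || encoding.Length != 5 || encoding[0] != 0xE8)
            throw new ArgumentException("expected a 5-byte E8 rel32 call", "encoding");

        // rel32 is stored little-endian in bytes 1..4.
        int rel32 = encoding[1] | (encoding[2] << 8) | (encoding[3] << 16) | (encoding[4] << 24);

        // The displacement is relative to the address of the next instruction.
        long next = (long)instructionAddress + encoding.Length;
        return unchecked((uint)(next + rel32));
    }

    static void Main()
    {
        byte[] call = new byte[] { 0xE8, 0x77, 0x69, 0x1B, 0x79 };
        Console.WriteLine("{0:X8}", Target(0x00DE0134, call)); // prints 79F96AB0
        Console.WriteLine("{0:X8}", Target(0x00000034, call)); // prints 791B69B0
    }
}
```

The second result matches the address in the VS disassembly window, which suggests that the view resolved the displacement against its method-relative offsets instead of the real code address. That would explain why the bytes found at 791B69B0 were not meaningful code.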

A peek inside this method (which, as we can see from timings, must be quite expensive and is thus probably pretty complex):

!u 79F96AB0
Unmanaged code
79F96AB0 FF155012387A     call        dword ptr ds:[7A381250h]
79F96AB6 50               push        eax
79F96AB7 51               push        ecx
79F96AB8 52               push        edx
79F96AB9 F6058C44397AFF   test        byte ptr ds:[7A39448Ch],0FFh
79F96AC0 7409             je          79F96ACB
79F96AC2 F740045F000000   test        dword ptr [eax+4],5Fh
79F96AC9 7422             je          79F96AED
79F96ACB 68DDDDDDDD       push        0DDDDDDDDh
79F96AD0 68CCCCCCCC       push        0CCCCCCCCh
Wonder what this is doing? A quick grep through the SSCLI 2.0 sources gives this line:
./inc/jithelpers.h:    JITHELPER(CORINFO_HELP_TAILCALL, JIT_TailCall)
Another grep for this JIT_TailCall gives this:
./vm/i386/jithelp.asm:PUBLIC JIT_TailCall
A peek inside this file gives this information (excerpted):
        call    _GetThread  ; eax = Thread*
        push    eax         ; Thread*

        ; save ArgumentRegisters
        push    ecx
        push    edx

ExtraSpace      = 12    ; pThread, ecx, edx

        ; For GC stress, we always need to trip for GC
        test    _g_TailCanSkipTripForGC, 0FFh
        jz      TripForGC
        ; Trip for GC only if necessary
        test    dword ptr [eax+Thread_m_State], TS_CatchAtSafePoint_ASM
        jz      NoTripForGC


; Create a MachState struct on the stack

; return address is already on the stack, but is separated from stack 
; arguments by the extra arguments of JIT_TailCall. So we cant use it directly

        push    0DDDDDDDDh

; Esp on unwind. Not needed as we it is deduced from the target method

        push    0CCCCCCCCh
This looks like an exact match for the disassembled code above, especially given the magic numbers pushed.

So, if one is looking for reasons why tail. call is slow, one must look inside clr/src/vm/i386/jithelp.asm to see the work it is doing.

Here are some reasons:

  • The JIT doesn't attempt in any way to optimize for the tail. call case, so it doesn't generate machine code which would be compatible with a simple JMP.
  • The JIT_TailCall helper must work for all possible calls, in the presence of exception propagation, and be tolerant of possible GCs.
The best hope for getting Microsoft to optimize this is for people to complain that an important piece of software (i.e. language) that relies on tail. call runs too slowly on the CLR.

A good reason why most compilers don't normally produce tail. call for .NET already is that it removes a frame from the stack, and that interferes with code access security.