Sunday, July 09, 2006

Fun with asynchronous methods and scalability

I was reading Joe Duffy's article on the CLR threadpool, and I felt compelled to write a post about a gnawing feeling I'm getting about current asynchronous design patterns.

It's too easy for folks working on the CLR or ASP.NET or any other server technology to advise that if people want scalability, then they need to have no more threads than there are processors, and that they should keep the CPU as busy as possible at all times. Why does that advice so rarely get followed? Asynchronous code is too hard to write - or rather, it's too different from the imperative way that most people have learned to program.

A common "problem" is that people write code that looks like this:

void AnOperation()
{
    A a = BlockingCallA();
    a.Foo(); // real "work"
    B b = BlockingCallB(a);
    BlockingCallC(a, b);
}

... where BlockingCall?() are operations which take a relatively long time but aren't CPU intensive: remote database operations, network I/O, disk I/O. The "problem" is that these calls tie up the thread while BlockingCall?() are executing, and threads are an expensive resource. They consume stack space (if not committed, then reserved, which can still be a significant fraction of virtual address space on 32-bit machines), there are scheduling costs, context switching costs, working set costs and cache costs (all that work priming the cache is lost once a call blocks and the CPU needs to be rescheduled). Thus, the "solution" is to make this code asynchronous.

Now, it might occur to some people to use the "asynchronous delegate" pattern to make the method asynchronous. It's slightly awkward to present an "asynchronous API", simulating the Begin/End pattern on the back of asynchronous delegates, but not too difficult. Here's one effort:

delegate void Procedure();

IAsyncResult BeginAnOperation(AsyncCallback callback, object state)
{
    Procedure proc = AnOperation;
    
    // Awkward: the AsyncState slot has to carry the delegate itself so
    // that EndAnOperation can call EndInvoke on it, which means the
    // caller's "state" argument can't be surfaced through AsyncState.
    AsyncCallback inner = null;
    if (callback != null)
        inner = delegate(IAsyncResult ar)
        {
            callback(ar); // ar.AsyncState is the Procedure, not "state"
        };
    
    return proc.BeginInvoke(inner, proc);
}

void EndAnOperation(IAsyncResult ar)
{
    if (ar == null)
        throw new ArgumentNullException("ar");
    
    // Is this the right IAsyncResult? Who knows?
    Procedure proc = (Procedure) ar.AsyncState;
    proc.EndInvoke(ar);
}
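
Whichever way the Begin/End pair is implemented, the calling side looks the same. Here's a minimal sketch of a hypothetical caller; the requestId string and the Console output are purely illustrative:

void StartWork()
{
    string requestId = "request 42"; // captured by the anonymous delegate
    BeginAnOperation(delegate(IAsyncResult ar)
    {
        // Runs on a threadpool thread once AnOperation has completed.
        EndAnOperation(ar);
        Console.WriteLine("Finished {0}", requestId);
    }, requestId);
}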

Unfortunately, this approach won't add any scalability whatsoever. Why? Because the asynchronous delegate pattern is implemented using the .NET threadpool, so this just moves the blocking from one thread to another: a threadpool thread still ends up sitting there, blocked inside the blocking calls, for the duration.

So, at a second attempt, the code can be rewritten as something roughly like the following, using anonymous delegates to simulate continuation passing style (CPS):

object _anOperationCookie = new object();

IAsyncResult BeginAnOperation(AsyncCallback callback, object state)
{
    BasicAsyncResult result = new BasicAsyncResult(_anOperationCookie, state);
    BeginBlockingCallA(delegate(IAsyncResult arA)
    {
        A a = EndBlockingCallA(arA);
        a.Foo();
        BeginBlockingCallB(a, delegate(IAsyncResult arB)
        {
            B b = EndBlockingCallB(arB);
            BeginBlockingCallC(a, b, delegate(IAsyncResult arC)
            {
                EndBlockingCallC(arC);
                result.SetCompleted(null);
                if (callback != null)
                    callback(result);
            }, null);
        }, null);
    }, null);
    
    return result;
}

void EndAnOperation(IAsyncResult result)
{
    if (result == null)
        throw new ArgumentNullException("result");
    BasicAsyncResult basicResult = result as BasicAsyncResult;
    if (basicResult == null)
        throw new ArgumentException("Wrong IAsyncResult");
    
    // if non-void result, then prefix with:
    // return (T) 
    basicResult.GetRetVal(_anOperationCookie);
}

class BasicAsyncResult : IAsyncResult
{
    object _lock = new object();
    ManualResetEvent _event;
    object _state;
    volatile bool _completed;
    object _retVal;
    object _cookie;
    
    public BasicAsyncResult(object cookie, object state)
    {
        _cookie = cookie;
        _state = state;
    }
    
    public void SetCompleted(object retVal)
    {
        lock (_lock)
        {
            _retVal = retVal;
            _completed = true;
            if (_event != null)
                _event.Set();
        }
    }
    
    public WaitHandle AsyncWaitHandle
    {
        get
        {
            lock (_lock)
            {
                if (_event == null)
                    _event = new ManualResetEvent(_completed);
                return _event;
            }
        }
    }
    
    public object AsyncState
    {
        get { lock (_lock) return _state; }
    }
    
    public bool CompletedSynchronously
    {
        get { return false; }
    }
    
    public bool IsCompleted
    {
        get { lock (_lock) return _completed; }
    }
    
    public object GetRetVal(object cookie)
    {
        lock (_lock)
        {
            if (_cookie != cookie)
                throw new ArgumentException("Wrong IAsyncResult");
            
            if (_completed)
            {
                if (_event != null)
                    _event.Close();
                return _retVal;
            }
        }
        
        AsyncWaitHandle.WaitOne();
        
        lock (_lock)
        {
            _event.Close();
            return _retVal;
        }
    }
}

Now, I wrote this code on the spot. It took about 20 minutes, most of it in BasicAsyncResult. I'm not sure if it's fully correct. It might have a subtle bug related to threading and memory visibility. I've probably put too many "lock (_lock)" in there, but these things are so tricky it's far better to err on the side of caution - memory visibility semantics are mind-bending.

I've made an effort to write BasicAsyncResult such that it could be reused for any Begin/End async pattern implementation. That's largely what the cookie is for - to detect subtle errors caused by passing the wrong IAsyncResult to an End* method. So I'll leave BasicAsyncResult out of the comparison that follows.

So, 4 lines of code in one method have turned into around 18 lines in two methods. An advantage is that the translation is relatively automatic; but that's also a disadvantage, because it means that you're writing manually templated code, and thus are liable to slip in a mistake by not paying attention. But stand back a moment: is this seriously what the folks on the .NET team are advising people do in order to achieve scalability in their business applications, running on ASP.NET?

This example is among the easiest code to translate. Unfortunately, not all code looks like this. The asynchronous code may be deep in a call stack somewhere - it almost certainly will be if you've gotten your database operations nicely factored. The problem is that the asynchronous transformation described above can't pass through synchronous methods. In other words, suppose you've got a group of methods that look like this:

void X()
{
    A a = BlockingCallA();
    B b = Y(a);
    BlockingCallC(a, b);
}

B Y(A a)
{
    Twiddle1(a);
    B b = Z(a);
    Twiddle2(b);
    return b;
}

B Z(A a)
{
    Frobnicate(a);
    return BlockingCallB(a);
}

Now, only X() and Z() contain blocking calls. Unfortunately, to make X() and Z() work asynchronously, all the methods on the path between X() and Z() must be asynchronous too. Making your code work asynchronously forces you to make this mechanical transformation to CPS across every path in the call graph from the original asynchronous entrypoint to a method containing a blocking call. If your code is well factored, there could be tens or hundreds of these methods - or even more. If you're like me, the "advice" is starting to sound a little like an idle platitude at this point. Naturally, it isn't often done.
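
To make that concrete, here's a rough sketch - following the pattern above - of what Y() alone turns into, purely because Z(), which it calls, contains a blocking call. BeginZ/EndZ are assumed to be the asynchronous versions of Z(), and _yCookie is another cookie object like _anOperationCookie:

object _yCookie = new object();

IAsyncResult BeginY(A a, AsyncCallback callback, object state)
{
    BasicAsyncResult result = new BasicAsyncResult(_yCookie, state);
    Twiddle1(a);
    BeginZ(a, delegate(IAsyncResult arZ)
    {
        B b = EndZ(arZ);
        Twiddle2(b);
        result.SetCompleted(b);
        if (callback != null)
            callback(result);
    }, null);
    return result;
}

B EndY(IAsyncResult ar)
{
    return (B) ((BasicAsyncResult) ar).GetRetVal(_yCookie);
}

X() then has to consume BeginY/EndY in the same continuation-passing style, and so on outwards through the call graph to wherever the operation is actually started.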

This mechanical work is precisely what compilers were designed to do. Anonymous delegates made the above code at least possible for those who value their time. Without anonymous delegates, creating the nest of extra classes required to keep stack frame information and continuation code becomes far more tedious.
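
For a flavour of what that means, here's a sketch of roughly the kind of class you'd otherwise write by hand (and which the compiler effectively generates for you) for just the innermost continuation of the CPS sample above - one little class per continuation, existing only to carry the captured locals. The names are illustrative:

class AnOperationStageC
{
    readonly BasicAsyncResult _result;
    readonly AsyncCallback _callback;
    
    public AnOperationStageC(BasicAsyncResult result, AsyncCallback callback)
    {
        _result = result;
        _callback = callback;
    }
    
    // Matches the AsyncCallback signature; passed to BeginBlockingCallC
    // in place of the innermost anonymous delegate above.
    public void OnBlockingCallCDone(IAsyncResult arC)
    {
        EndBlockingCallC(arC); // as in the sample above
        _result.SetCompleted(null);
        if (_callback != null)
            _callback(_result);
    }
}

Multiply that by every continuation in every transformed method and the tedium is obvious.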

If the folks advising asynchronous code are serious about recommending this approach to scalability, then the enabling tools need to be available. It's conceivable that the CLR could do this automatically, with some kind of [AutoAsync] attribute, along with a couple of utility methods to access the generated Begin/End pair via reflection or whatever. That would keep the C# language clean while leaving the CLR open to a lazy implementation approach (using ThreadPool threads), but also with the power to create the transformations discussed above.

A lot of work? Yes, but better than people who write business code for a living having to work in a nest of hand-written continuations just to get scalability.

Update: Fixed bugs in CPS sample.
Update: Note that the above code doesn't deal with exceptions properly. Consider the above code as just an outline of an approach. I'll see if I can find the time to do a full article on this technique.

Friday, July 07, 2006

The not so lazy garbage collector

A poster on microsoft.public.dotnet.languages.csharp (the original post is here) wrote in the other day about a bug he had - an access violation. He wasn't using any unmanaged code, so my curiosity was piqued: it's theoretically impossible to get an access violation without unmanaged code being involved, so he had almost certainly found a CLR or BCL bug. I was duly motivated to get to the root of the problem.

I'm very glad I stuck with it, because I found out something I wasn't aware of, something which will make me very careful in similar scenarios in the future. If you're not interested in the path to discovery, the summary is at the end.

I did some investigation, and I discovered that the problem lay in the Ping class. Here's some code which reproduces the problem:

using System;
using System.Net;
using System.Net.NetworkInformation;

class App
{
    static void Main()
    {
        try
        {
            for (;;)
                new Ping().Send(IPAddress.Loopback);
        }
        catch (Exception ex)
        {
            if (ex.InnerException != null)
                ex = ex.InnerException;
            Console.WriteLine("Error: {0} ({1})", ex.Message,
                                                  ex.GetType().Name);
        }
    }
}
(I've registered this as a bug on the Microsoft Connect site.)

This wasn't enough for me. I wanted to know how it happened. I looked at the code for Ping using .NET Reflector, but it wasn't obvious to me. I fired up WinDbg and ran the above code until the access violation exception was triggered (it can take some seconds, depending on your machine):

(a94.87c): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=04f20b74 ebx=00000000 ecx=00000008 edx=00000000 esi=04f20b54 edi=012e7580
eip=78144d3a esp=0012f278 ebp=0012f280 iopl=0         nv up ei pl zr na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010246
MSVCR80!memcpy+0x5a:
78144d3a f3a5            rep  movsd ds:04f20b54=???????? es:012e7580=00000000

So, it was failing in memcpy. Where was that called from? SOS's !dumpstack command gives the right info (excerpted):
> .load sos
> !dumpstack

OS Thread Id: 0x87c (0)
Current frame: MSVCR80!memcpy+0x5a [F:\RTM\vctools\...\intel\memcpy.asm:188]
ChildEBP RetAddr  Caller,Callee
0012f280 79e808d5 mscorwks!memcpyNoGCRefs+0x11, // ...
0012f290 79f02572 mscorwks!MarshalNative::CopyToManaged+0x11b, // ...
0012f2e0 79f02491 mscorwks!MarshalNative::CopyToManaged+0x22, // ...
0012f340 7a615416 (/*...*/ System.Net.NetworkInformation.PingReply..ctor(// ...
0012f358 7a61474e (/*...*/ System.Net.NetworkInformation.Ping.InternalSend(// ...
0012f37c 79eec356 mscorwks!FastAllocateObject+0xa6, calling mscorwks!Object// ...
0012f3fc 7a6139b4 (/*...*/ System.Net.NetworkInformation.Ping.Send(System.// ...
0012f43c 7a613752 (/*...*/ System.Net.NetworkInformation.Ping.Send(System.// ...
0012f454 00ca009e (MethodDesc 0x922fd8 +0x2e App.Main()), calling 7a87f55c
// ...

I can see here that the last managed code was inside PingReply's constructor. Here's what the appropriate constructor overload looks like, roughly:
internal PingReply(IcmpEchoReply reply)
{
    this.address = new IPAddress((long) reply.address);
    this.ipStatus = (IPStatus) reply.status;
    if (this.ipStatus == IPStatus.Success)
    {
        this.rtt = reply.roundTripTime;
        this.buffer = new byte[reply.dataSize];
        Marshal.Copy(reply.data, this.buffer, 0, reply.dataSize);
        this.options = new PingOptions(reply.options);
    }
    else
    {
        this.buffer = new byte[0];
    }
}

The call to MarshalNative::CopyToManaged() is inside Marshal.Copy(), and it's copying data from reply.data (an IntPtr) to the managed array, this.buffer. That native reply structure is filled in by the Windows API function IcmpSendEcho(), and the buffer is allocated by the caller, as is usual for Windows API functions. Here it's wrapped in a SafeHandle descendant, SafeLocalFree, which in turn is owned by the Ping instance.

The memcpy function compiles to an x86 string operation, with esi and edi being the source and destination registers. From the original exception data at the top, one can see that ESI (the source) is an invalid pointer - that's what caused the access violation. Thus, the buffer has been freed early.

How did that happen? One can use !dso (aka !dumpstackobjects) to find all the live objects rooted on the stack or registers:

> !dso

OS Thread Id: 0x87c (0)
ESP/REG  Object   Name
0012f2ac 012e7508 System.Net.NetworkInformation.PingReply
0012f2c0 012e7508 System.Net.NetworkInformation.PingReply
0012f32c 012e7578 System.Byte[]
0012f350 012e7508 System.Net.NetworkInformation.PingReply
0012f410 012e74b4 System.Byte[]
0012f438 0128142c System.Net.IPAddress
0012f448 012e74b4 System.Byte[]

Aha! There's no Ping object! Even though several instance methods of the Ping object (two Ping.Send overloads and Ping.InternalSend) are currently executing, Ping is in fact eligible for garbage collection!

It appears that's what's happened. Let's see if that can be confirmed. If one runs !dumpheap -stat, there are still lots of Pings on the heap:

> !dumpheap -stat

// ...
7a779154     2041        40820 System.Net.SafeLocalFree
7a778ec0     2041        40820 System.Net.SafeCloseHandle
7915ff38     2041        65312 System.Threading.SendOrPostCallback
79124418     2043        89864 System.Byte[]
7a7812e0     2041       179608 System.Net.NetworkInformation.Ping
// ...

Let's see if we can find the Ping instance whose methods are executing on our call stack:
> !dumpheap -type Ping

// ...
012e7040 7a7812e0       88     
012e710c 7a7812e0       88     
012e71d8 7a7812e0       88     
012e72a4 7a7812e0       88     
012e7370 7a7812e0       88     
012e743c 7a7812e0       88     
012e7508 7a781580       32     
total 2042 objects
Statistics:
      MT    Count    TotalSize Class Name
7a781580        1           32 System.Net.NetworkInformation.PingReply
7a7812e0     2041       179608 System.Net.NetworkInformation.Ping
Total 2042 objects

Now, since there is only one thread allocating Pings and there's nothing keeping them alive, I can expect the Ping at the highest address to be the most recently allocated one. To confirm that, I'll test the last two Ping objects listed (012e743c and 012e7370):
> !gcroot 012e7370
Note: Roots found on stacks may be false positives. Run "!help gcroot" for
more info.
Scan Thread 0 OSTHread 87c
Scan Thread 2 OSTHread 828

> !gcroot 012e743c
Note: Roots found on stacks may be false positives. Run "!help gcroot" for
more info.
Scan Thread 0 OSTHread 87c
ESP:12f390:Root:0130e1cc(System.Net.NetworkInformation.Ping)->
012e7494(System.Threading.SendOrPostCallback)->
012e743c(System.Net.NetworkInformation.Ping)
ESP:12f3f8:Root:0130e1cc(System.Net.NetworkInformation.Ping)->
012e7494(System.Threading.SendOrPostCallback)
ESP:12f434:Root:0130e1cc(System.Net.NetworkInformation.Ping)->
012e7494(System.Threading.SendOrPostCallback)
ESP:12f450:Root:0130e1cc(System.Net.NetworkInformation.Ping)->
012e7494(System.Threading.SendOrPostCallback)
Scan Thread 2 OSTHread 828

Here it can be seen that, of the two, only 012e743c is alive. Because of the pattern of the loop, any location holding a Ping would be expected to be overwritten each time around, so this Ping at 012e743c should be the right one. Peeking inside it with !do (aka !DumpObj):
> !do 012e743c

Name: System.Net.NetworkInformation.Ping
MethodTable: 7a7812e0
EEClass: 7a7e9100
Size: 88(0x58) bytes
 (C:\WINDOWS\assembly\GAC_MSIL\System\2.0.0.0__b77a5c561934e089\System.dll)
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
790f9c18  4000184        4        System.Object  0 instance 00000000 __identity
// ...
7a779154  4002c06       20 ...Net.SafeLocalFree  0 instance 012e74f4 replyBuffer
// ...

I'm really only interested in one field here, replyBuffer - because that's what's pointing to the invalid memory. Drilling into that location:
> !do 012e74f4

Name: System.Net.SafeLocalFree
MethodTable: 7a779154
EEClass: 7a7db890
Size: 20(0x14) bytes
 (C:\WINDOWS\assembly\GAC_MSIL\System\2.0.0.0__b77a5c561934e089\System.dll)
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
790fe160  40005b4        4        System.IntPtr  0 instance 82971448 handle
790fed1c  40005b5        8         System.Int32  0 instance        3 _state
79104f64  40005b6        c       System.Boolean  0 instance        1 _ownsHandle
79104f64  40005b7        d       System.Boolean  0 instance        1 _fullyInitialized
7a779154  40025ec      948 ...Net.SafeLocalFree  0   static 00000000 Zero

Now, the layout of the _state field is explained in the SSCLI 2.0 sources, in the file clr/src/vm/safehandle.cpp:
// So the state field ends up looking like this:
//
//  31                                                        2  1   0
// +-----------------------------------------------------------+---+---+
// |                           Ref count                       | D | C |
// +-----------------------------------------------------------+---+---+
// 
// Where D = 1 means a Dispose has been performed and C = 1 means the
// underlying handle has (or will be shortly) released.

This shows that a _state value of 3 means that it's been both disposed and released. So, even though the Ping and SafeLocalFree objects haven't actually been garbage collected, the eager finalization of SafeHandle objects has released the buffer early. I'll consider that definitive.
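
Spelling the arithmetic out in C#, using the layout quoted above:

int state = 3;
bool released = (state & 1) != 0; // C bit: handle has been (or will shortly be) released
bool disposed = (state & 2) != 0; // D bit: Dispose has been performed
int refCount = state >> 2;        // remaining bits: the ref count
// => released == true, disposed == true, refCount == 0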

Summary

There are at least two lessons from this:
  1. In general, one cannot rely on the fact that an instance method is on the call stack to keep the instance alive. One needn't be too paranoid, though: the object can only be collected once execution in each of its active instance methods has passed the last point where 'this' is used, so that the reference is no longer live on the stack or in a register.
  2. Thus, one needs to be extra careful to keep SafeHandle instances alive, either by passing them along with any buffer / handle value extracted, or by using SafeHandle.DangerousAddRef()/SafeHandle.DangerousRelease() in a try/finally block to be absolutely sure that an early release doesn't occur (see the sketch below).
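
To illustrate the second lesson, here's a minimal sketch of the DangerousAddRef/DangerousRelease pattern: copying out of a native buffer owned by a SafeHandle without risking an early release. The SafeCopy class, CopyReply and its parameters are hypothetical, standing in for the kind of work PingReply's constructor does:

using System;
using System.Runtime.InteropServices;

static class SafeCopy
{
    public static byte[] CopyReply(SafeHandle replyBuffer, int size)
    {
        byte[] result = new byte[size];
        bool addedRef = false;
        try
        {
            // Take an explicit reference on the handle so that eager
            // finalization or a Dispose elsewhere can't free the native
            // memory while we're still copying from it.
            replyBuffer.DangerousAddRef(ref addedRef);
            Marshal.Copy(replyBuffer.DangerousGetHandle(), result, 0, size);
        }
        finally
        {
            if (addedRef)
                replyBuffer.DangerousRelease();
        }
        return result;
    }
}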

Finally, to make everything crystal clear - what will the following program print?
using System;

class App
{
    class A
    {
        public B Foo()
        {
            return new B();
        }
        
        ~A()
        {
            Console.WriteLine("A finalized.");
        }
    }
    
    class B
    {
        public B()
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            Console.WriteLine("B now returning.");
        }
    }
    
    static void Main()
    {
        new A().Foo();
    }
}

If you've followed everything, you'll know that this is what it prints (compiled in Release mode and run without a debugger attached):
A finalized.
B now returning.

Maybe this was obvious to everybody - but it certainly wasn't obvious to me, nor to the designer of the Ping class!
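
For completeness: the usual way to change that answer is to extend the lifetime of 'this' explicitly. A sketch of the one-line fix in A.Foo():

public B Foo()
{
    B b = new B();
    // Keeps 'this' reachable up to this point, so A can no longer be
    // finalized while B's constructor is running.
    GC.KeepAlive(this);
    return b;
}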

Tuesday, July 04, 2006

Covariance and Contravariance in .NET, Java and C++

Prompted by a Microsoft Research paper I read recently and a post on the MS .NET newsgroups, I investigated covariance and contravariance support in .NET, and contrasted it with the support in Java.

I won't try to describe covariance or contravariance directly. Hopefully they'll be clear from the examples. A brief description of covariance and contravariance can be found here (Wikipedia).

To begin with, I'll enumerate some of the different kinds of variance supported in C#, Common Language Infrastructure (CLI), C++ and Java, as far as I know them, along with the definitions I'll use in this article. These definitions aren't official in any sense I'm aware of.

  1. Override Variance

    This variance refers to the parameters and return types of an overridden method in a descendant class. C++ supports override covariance of return types.

  2. Definition-site Generic Variance

    With this kind of variance, the generic type, as part of its definition, defines how the subtype relation applies to instantiations of the generic type when the type arguments are themselves related by the subtype relation. The CLI (and thus the CLR) supports definition-site generic variance.

  3. Use-site Generic Variance

    With this kind of variance, the generic variable declaration (i.e. parameter, local or field), as part of the declaration, defines whether or not it is assignment-compatible with generic instantiations whose type arguments are more derived (covariant) or less derived (contravariant). Java wildcards are an implementation of use-site generic variance.

  4. Array covariance

    Java supports covariance of arrays of object types. This covariance isn't fully sound with respect to the type system at compile time, because arrays are mutable. Thus, run-time checks are used to patch up the hole. C# and the CLI support this feature chiefly to support Java on the CLI. To demonstrate the hole:

    Dog[] dogs = new Dog[10];
    Mammal[] mammals = dogs;
    mammals[0] = new Cat();
    
    The above code is statically correct under array covariance, so it compiles - but it isn't actually type-safe: the last assignment fails at run time (Java throws ArrayStoreException; the CLR throws ArrayTypeMismatchException).
  5. Delegate variance

    C# supports delegate variance only at the point of binding. The method to which a delegate value is bound may have a covariant return type and contravariant argument types. Once the delegate is bound, it is not assignment-compatible with another delegate type, even where the underlying method would be compatible according to variance rules - see the sketch below.
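
    As a concrete (if contrived) C# 2.0 sketch - Mammal, Dog and the MammalFactory delegate here are invented for illustration:

    using System;
    
    class Mammal { }
    class Dog : Mammal { }
    
    delegate Mammal MammalFactory(Dog d);
    
    class DelegateVarianceDemo
    {
        // Covariant return type (Dog rather than Mammal) and contravariant
        // parameter type (Mammal rather than Dog), relative to MammalFactory.
        static Dog Breed(Mammal m) { return new Dog(); }
        
        static void Main()
        {
            // Allowed: variance is applied when binding the method to the delegate...
            MammalFactory factory = Breed;
            Console.WriteLine(factory(new Dog()));
            
            // ...but not between delegate types: System.Converter<Dog, Mammal> has a
            // compatible shape, yet the following assignment won't compile in C# 2.0.
            // Converter<Dog, Mammal> other = factory;
        }
    }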

I'll drill a bit deeper into the first three of these, since most developers should be familiar with the last two.

Override Variance

This variance refers to the parameters and return types of an overridden method in a descendant class. C++ supports it for covariance of return types. In principle, return types and out parameters may safely be covariant, input parameters may be contravariant, and in-out parameters must be invariant. C++ example:
class Mammal
{
public:
    virtual Mammal* GetValue();
};

class Dog : public Mammal
{
public:
    virtual Dog* GetValue();
};
The equivalent example to demonstrate contravariance of input parameters can't be written in C++, since C++ doesn't support it. If one could declare it, it would look a bit like the following C#:
class DogComparer
{
    public virtual int Compare(Dog left, Dog right)
    {
    }
}

class MammalComparer : DogComparer
{
    public override int Compare(Mammal left, Mammal right)
    {
    }
}
Note the difference: the inheritance relationship is the other way around. That's where the 'contra' comes in. The arguments' subtype relationship is the opposite of the outer type's subtype relationship.

It's also intuitively true. A comparer of mammals is naturally also a comparer of dogs, since dogs are a subtype of mammal - thus a comparer of mammals is a subtype of a comparer of dogs!

C# and the CLI don't support override variance at all. C++ supports return-type override covariance, but not contravariant input arguments; Java 5 likewise allows covariant return types on overrides, though not contravariant parameters. C++/CLI doesn't support override covariant return types for managed classes. It gives this error:

error C2392: 'Dog ^Dog::GetValue(void)' : covariant returns types are
not supported in managed types, otherwise 'Mammal ^Mammal::GetValue(void)' would be overridden

Definition-site Generic Variance

The CLI (II 9.5) supports definition-site variance for interfaces and delegates, but not for reference classes and value types. It uses the syntax <+T> to denote covariance and <-T> for contravariance. Because covariance is only safe for outputs, a generic type can specify covariance on type parameters which are used in output positions only. Similarly, contravariance is allowed on type parameters which are used for input only.

Within these constraints, and pretending that C# supported this CLI feature, we could envision these types:

class Mammal { }
class Dog : Mammal { }

interface IReader<+T> // allows covariance
{
    T GetValue();
}

interface IWriter<-T> // allows contravariance
{
    void SetValue(T value);
}
With these definitions, covariance of generic parameters would allow this:
  IReader<Dog> dogReader = null;
  IReader<Mammal> mammalReader = dogReader;
Contravariance of generic parameters would allow this:
  IWriter<Mammal> mammalWriter = null;
  IWriter<Dog> dogWriter = mammalWriter;
These are both disallowed in C#, but allowed at the IL level. Here's an IL translation of the above imaginary C# which assembles and passes PEVerify:
.assembly extern mscorlib {}
.assembly Test {}

.class private auto ansi beforefieldinit Mammal
       extends [mscorlib]System.Object {}

.class private auto ansi beforefieldinit Dog
       extends Mammal {}

.class interface private abstract auto ansi IReader`1<+T>
{
  .method public hidebysig newslot abstract virtual 
          instance !T  GetValue() cil managed {}
}

.class interface private abstract auto ansi IWriter`1<-T>
{
  .method public hidebysig newslot abstract virtual 
          instance void  SetValue(!T 'value') cil managed {}
}

.class private auto ansi beforefieldinit App
       extends [mscorlib]System.Object
{
  .method private hidebysig static void Main() cil managed
  {
    .entrypoint
    .locals init (
             [0] class IReader`1<class Dog> dogReader,
             [1] class IReader`1<class Mammal> mammalReader,
             [2] class IWriter`1<class Mammal> mammalWriter,
             [3] class IWriter`1<class Dog> dogWriter)
    
    ldnull
    stloc.0

    ldloc.0
    stloc.1

    ldnull
    stloc.2

    ldloc.2
    stloc.3

    ret
  }
}
If one switches around the assignments, to try and treat covariance contravariantly and vice versa, changing the body of the Main method to:
    ldnull
    stloc.1
    
    ldloc.1
    stloc.0
    
    ldnull
    stloc.3
    
    ldloc.3
    stloc.2
    
    ret
One then gets the following errors from PEVerify:
[IL]: Error: [App::Main][found ref 'IReader`1[Mammal]'][expected ref
'IReader`1[Dog]'] Unexpected type on the stack.
[IL]: Error: [App::Main][found ref 'IWriter`1[Dog]'][expected ref
'IWriter`1[Mammal]'] Unexpected type on the stack.

Similarly, if one tries to make IReader<+T> contravariant, i.e. change it to IReader<-T>, and likewise make IWriter<-T> covariant, one gets the following errors from PEVerify:
[token  0x02000004] Type load failed.
[token  0x02000005] Type load failed.
So, the covariant and contravariant support is there.

Use-site Generic Variance

This is the definition of variance at the use site, rather than the definition site. That means that when declaring variables of a generic type, one can make the variable declaration open to instances of generic types with more (covariant) or less (contravariant) derived type arguments.

To make the example concrete, I'll use Java 5, which supports covariance and contravariance through wildcards.

class Mammal {
    public Mammal() {
    }
}

// ---

class Dog extends Mammal {
    public Dog() {
    }
}

// ---

class Cat extends Mammal {
    public Cat() {
    }
}

// ---

public class Holder<T> {
    T _value;
    
    public Holder() {
    }
    
    public T getValue() {
        return _value;
    }
    
    public void setValue(T value) {
        _value = value;
    }
}
Given these definitions, I can make use of covariance thusly:
Holder<Cat> catHolder = new Holder<Cat>();
catHolder.setValue(new Cat());
// Use covariance to fit cat-holder into mammal-holder.
Holder<? extends Mammal> mammalHolder = catHolder;
// Can now access return values (covariant is only safe for out).
Mammal mammal = mammalHolder.getValue();
System.out.println(mammal);
// This won't work: covariance doesn't work for input parameters.
mammalHolder.setValue(new Dog());
Similarly, I can make use of contravariance:
Holder<Mammal> mammalHolder = new Holder<Mammal>();
// Use contravariance to fit mammal-holder into cat-holder.
Holder<? super Cat> catHolder = mammalHolder;
// Can now access input parameters (contravariance is only
// safe for input).
catHolder.setValue(new Cat());
// This won't work: contravariance doesn't allow output.
Cat cat = catHolder.getValue();

Summary

The CLI and Java have two quite different generic variance capabilities. C# hasn't exposed any of the CLI's generic variance. The delegate variance exposed by C# appears to be more a feature of the CLR / CLI's loosening of delegate binding restrictions, since it doesn't use generic variance functionality. Intuitively, use-site generic variance looks like a superset of definition-site variance: any definition-site variance scheme could be replaced by an equivalent use-site version, but not the other way around, since using a covariant type parameter in an input position (or a contravariant one in an output position) is banned outright at the definition site, whereas use-site variance only restricts such uses at each individual point of use - but I haven't tried to prove that.

There is, however, a cost associated with generic type variance - conceptual complexity. Covariance and contravariance are simple enough once one gets used to them, but they represent yet another barrier to be overcome for newcomers to a language.