Sunday, July 09, 2006

Fun with asynchronous methods and scalability

I was reading Joe Duffy's article on the CLR threadpool, and I felt compelled to write a post about a gnawing feeling I'm getting about current asynchronous design patterns.

It's too easy for folks working on the CLR or ASP.NET or any other server technology to advise that if people want scalability, then they need to have no more threads than there are processors, and that they should keep the CPU as busy as possible at all times. Why? Asynchronous code is too hard to write - or rather, it's too different to the imperative way that most people have learned to program.

A common "problem" is that people write code that looks like this:

void AnOperation()
{
    A a = BlockingCallA();
    a.Foo(); // real "work"
    B b = BlockingCallB(b);
    BlockingCallC(a, b);
}

... where BlockingCall?() are operations which take a relatively long time but aren't CPU intensive: remote database operations, network I/O, disk I/O. The "problem" is that these calls tie up the thread while BlockingCall?() are executing, and threads are an expensive resource. They consume stack space (if not committed, then reserved, which can still be a significant fraction of virtual address space on 32-bit machines), there are scheduling costs, context switching costs, working set costs and cache costs (all that work priming the cache is lost once a call blocks and the CPU needs to be rescheduled). Thus, the "solution" is to make this code asynchronous.

Now, it might occur to some people to use the "asynchronous delegate" pattern to make the method asynchronous. It's slightly awkward to present an "asynchronous API", simulating the Begin/End pattern on the back of asynchronous delegates, but not too difficult. Here's one effort:

delegate void Procedure();

IAsyncResult BeginAnOperation(AsyncCallback callback, object state)
{
    Procedure proc = AnOperation;
    
    // Can't pass "callback" to BeginInvoke because we need the state
    // argument on the IAsyncResult for EndAnOperation.
    AsyncCallback inner = null;
    if (callback != null)
        inner = delegate
        {
            callback(state);
        };
    
    return proc.BeginInvoke(inner, proc);
}

void EndAnOperation(IAsyncResult ar)
{
    if (ar == null)
        throw new ArgumentNullException("ar");
    
    // Is this the right IAsyncResult? Who knows?
    Procedure proc = (Procedure) ar.AsyncState;
    proc.EndInvoke(ar);
}

Unfortunately, this approach won't add any scalability whatsoever. Why? Because the asynchronous delegate pattern is implemented using the .NET threadpool, so this just moves the blocking from one thread to another. The other thread will still be keeping a full thread alive and blocked in the blocking calls.

So, at a second attempt, the code can be rewritten as something roughly like the following, using anonymous delegates to simulate continuation passing style (CPS):

object _anOperationCookie = new object();

IAsyncResult BeginAnOperation(AsyncCallback callback, Object state)
{
    BasicAsyncResult result = new BasicAsyncResult(_anOperationCookie, state);
    BeginBlockingCallA(delegate(IAsyncResult ar)
    {
        A a = EndBlockingCallA(arA);
        a.Foo();
        BeginBlockingCallB(a, delegate(IAsyncResult arB)
        {
            B b = EndBlockingCallB(arB);
            BeginBlockingCallC(a, b, delegate(IAsyncResult arC)
            {
                EndBlockingCallC(arC);
                result.SetCompleted(null);
                if (callback != null)
                    callback(result);
            }, null);
        }, null);
    }, null);
    
    return result;
}

void EndAnOperation(IAsyncResult result)
{
    if (result == null)
        throw new ArgumentNullException("result");
    BasicAsyncResult basicResult = result as BasicAsyncResult;
    if (result == null)
        throw new ArgumentException("Wrong IAsyncResult");
    
    // if non-void result, then prefix with:
    // return (T) 
    basicResult.GetRetVal(_anOperationCookie);
}

class BasicAsyncResult : IAsyncResult
{
    object _lock = new object();
    ManualResetEvent _event;
    object _state;
    volatile bool _completed;
    object _retVal;
    object _cookie;
    
    public BasicAsyncResult(object cookie, object state)
    {
        _cookie = cookie;
        _state = state;
    }
    
    public void SetCompleted(object retVal)
    {
        lock (_lock)
        {
            _retVal = retVal;
            _completed = true;
            if (_event != null)
                _event.Set();
        }
    }
    
    public WaitHandle AsyncWaitHandle
    {
        get
        {
            lock (_lock)
            {
                if (_event == null)
                    _event = new ManualResetEvent(_completed);
                return _event;
            }
        }
    }
    
    public object AsyncState
    {
        get { lock (_lock) return _state; }
    }
    
    public bool CompletedSynchronously
    {
        get { return false; }
    }
    
    public bool IsCompleted
    {
        get { lock (_lock) return _completed; }
    }
    
    public object GetRetVal(object cookie)
    {
        lock (_lock)
        {
            if (_cookie != cookie)
                throw new ArgumentException("Wrong IAsyncResult");
            
            if (_completed)
            {
                if (_event != null)
                    _event.Close();
                return _retVal;
            }
        }
        
        AsyncWaitHandle.WaitOne();
        
        lock (_lock)
        {
            _event.Close();
            return _retVal;
        }
    }
}

Now, I wrote this code on the spot. It took about 20 minutes, most of it in BasicAsyncResult. I'm not sure if it's fully correct. It might have a subtle bug related to threading and memory visibility. I've probably put too many "lock (_lock)" in there, but these things are so tricky it's far better to err on the side of caution - memory visibility semantics are mind-bending.

I've made an effort to write BasicAsyncResult such that it could be reused for any Begin/End async pattern implementation. That's largely what the cookie is for - to detect subtle errors relating to passing the wrong IAsyncResult to an End* method. So, I'll discard the code of BasicAsyncResult when comparing.

So, 4 lines of code in one method have turned into around 18 lines in two methods. An advantage is that the translation is relatively automatic; but that's also a disadvantage, because the it means that you're writing manually templated code, and thus are liable to slip in a mistake by not paying attention. But stand back a moment: is this seriously what the folks on the .NET team are advising people do in order to achieve scalability in their business applications, running on ASP.NET?

This example is among the easiest code to translate. Unfortunately, not all code look like this. The asynchronous code may be deep in a call stack somewhere - it almost certainly will be if you've gotten your database operations nicely factored. The problem is that the asynchronous transformation described above can't pass through synchronous methods. In other words, if you've got a group of methods that look like:

void X()
{
    A a = BlockingCallA();
    B b = Y(a);
    BlockingCallC(a, b);
}

B Y(A a)
{
    Twiddle1(a);
    B b = Z(a);
    Twiddle2(b);
    return b;
}

B Z(A a)
{
    Frobnicate(a);
    return BlockingCallB(a);
}

Now, only X() and Z() contain blocking calls. Unfortunately, to make X() and Z() work asynchronously, all the methods on the path between X() and Z() must be asynchronous too. Making your code work asynchronously forces you to make this mechanical transformation to CPS across every path in the call graph from the original asynchronous entrypoint to a method containing a blocking call. If your code is well factored, there could be tens or hundreds of these methods - or even more. If you're like me, the "advice" is starting to sound a little like an idle platitude at this point. Naturally, it isn't often done.

This mechanical work is precisely what compilers were designed to do. Anonymous delegates made the above code at least possible for those who value their time. Without anonymous delegates, creating the nest of extra classes required to keep stack frame information and continuation code becomes far more tedious.

If the folks advising asynchronous code are serious about recommending this approach to scalability, then the enabling tools need to be available. It's conceivable that the CLR could do this automatically, with some kind of [AutoAsync] attribute, along with a couple of utility methods to access the generated Begin/End pair via reflection or whatever. That would keep the C# language clean while leaving the CLR open to a lazy implementation approach (using ThreadPool threads), but also with the power to create the transformations discussed above.

A lot of work? Yes, but better than people who write business code for a living having to work in a nest of hand-written continuations just to get scalability.

Update: Fixed bugs in CPS sample.
Update: Note that the above code doesn't deal with exceptions properly. Consider the above code as just an outline of an approach. I'll see if I can find the time to do a full article on this technique.

No comments: