Sunday, May 18, 2008

In Defense of Steve Vinoski and Erlang

There's been a minor scuffle going on between Ted Neward and Steve Vinoski over the wisdom of Erlang's approach to concurrency: whether it should be baked into the language or not on one hand, and whether it should be running on the JVM or CLR on the other.

I've already articulated my position on VMs, and I think it makes a lot of sense, particularly for prototyping, to build a VM specifically for a language implementation, particularly if the language has some primitives that are not normally available in commodity VMs. And to be frank, if one's language doesn't have some interesting new primitives or combination of primitives, it is unlikely to be moving the state of the art forward.

Erlang uses the actor concurrency model combined with lightweight aka green threads, though the execution engine may spawn as many threads as needed in order to get genuine concurrency from an underlying parallel architecture, such as multiprocessor or multicore. Erlang works under a shared-nothing model, however, so the "green threads" are more like "green processes".

I think Ted misses some appreciation of the power of the Erlang model, and in particular, its choice of primitives. Ted points to an implementation of the Actor model using Lift (written in Scala), and in particular some ballpark performance numbers:

We also had an occasion to have 2,000 simultaneous (as in at the same time, pounding on their keyboards) users of Buy a Feature and we were able to, thanks to Jetty Continuations, service all 2,000 users with 2,000 open connections to our server and an average of 700 requests per second on a dual core opteron with a load average of around 0.24... try that with your Rails app.

One of the obvious problems with this comment is that it doesn't sound very impressive when actually compared with Yaws, implemented in Erlang:

Our figure shows the performance of a server when subject to parallel load. This kind of load is often generated in a so-called "Distributed denial of service attack".

Apache dies at about 4,000 parallel sessions. Yaws is still functioning at over 80,000 parallel connections.

What this disparity tells me is that the JVM and CLR are likely lacking some primitives that help Erlang achieve this kind of result. For straight-up processing code, Erlang isn't terribly fast, by all accounts. The reason it wins is likely because it is avoiding the context-switching overhead through the use of lightweight processes. This in turn suggests to me that the equivalent of green threads, or some kind of automatic CPS transformation or native support for continuations is needed for CLR and JVM to be credible target platforms for Erlang. Lift is currently using Jetty Continuations, and when you read up about its implementation:

Behind the scenes, Jetty has to be a bit sneaky to work around Java and the Servlet specification as there is no mechanism in Java to suspend a thread and then resume it later. The first time the request handler calls continuation.suspend(timeoutMS) a RetryRequest runtime exception is thrown. This exception propagates out of all the request handling code and is caught by Jetty and handled specially. Instead of producing an error response, Jetty places the request on a timeout queue and returns the thread to the thread pool.

When the timeout expires, or if another thread calls continuation.resume(event) then the request is retried. This time, when continuation.suspend(timeoutMS) is called, either the event is returned or null is returned to indicate a timeout. The request handler then produces a response as it normally would.

Thus this mechanism uses the stateless nature of HTTP request handling to simulate a suspend and resume. The runtime exception allows the thread to legally exit the request handler and any upstream filters/servlets plus any associated security context. The retry of the request, re-enters the filter/servlet chain and any security context and continues normal handling at the point of continuation.

... you can see that it's clearly a hack to work around the limitations of the JVM - i.e. the fact that it doesn't have user-schedulable green threads (on top of native threads, not as a replacement, a bit like fibers on Windows), or an automatic CPS with some AOP-style weaving, or native continuation support.

In conclusion, the primitives in the VM matter a great deal. With the right primitives, wholly different styles of application become possible, because of the radically different performance profiles.

19 comments:

Anonymous said...

Excellent Post. You are dead on, in that what makes Erlang so scalable is it's own green thread scheduler(s). It would be absolutely necessary for this mechanism to be baked into the JVM and CLR in order to properly port Erlang. Furthermore, when the JVM and CLR also add all the monitoring and supervision of these green threads, then maybe, just maybe can they build an Erlang like language that runs with near equivelent concurrency and reliability on those platforms.

Anonymous said...

Yes, and for you your VM is your compiler. As we discussed in ng it must exist a balance between the logic implemented in a parallel library and primitives on which it relies. Because parallel programming is already *the* way of programming (or, at least, it should), the parallel primitives must be part of the language and to occupy a central place (I mean performance, ease of use, expressiveness etc.) not left in responsibility of a library which stays on top and tries to overcome the shortcomings of the language.
Just my 2c,
m. Th.

Anonymous said...

Me again, sorry for double posting. I just wish to stress that nowadays isn't about programming languages which support parallel programming or not.

Imho, today one of the decisive factors for a language to be successful is how it leverages the available cores.

Craig said...

This in turn suggests to me that the equivalent of green threads, or some kind of automatic CPS transformation or native support for continuations is needed for CLR and JVM to be credible target platforms for Erlang.

Why? Erlang doesn't use OS thread/fiber primitives even when they are available. The first implementations were in Prolog, which didn't have concurrency support at all. It always implements its process switching in software. Perhaps there would be some advantage in implementing Erlang-style processes inside the VM insofar as it makes the separation of the VM and application code cleaner, but it's not at all clear to me why the lack of Erlang-style processes is a barrier to being a "credible target platform," given that no Erlang implementation I'm aware of (and I will freely knowledge that there are many Erlang implementations that I am not aware of) targets a platform that actually has this baked in.

Barry Kelly said...

@craig: Erlang does use OS threads, and has done for a couple of years now. It's a necessary requirement to take advantage of hardware concurrency (multicore & multiproc) without having to spawn multiple processes.

It likely doesn't use OS fibers because it has its own VM (the BEAM VM, but there's also a newer HiPE which does more native compilation), which provides green threading the way Erlang wants it, which may be both more efficient and more portable than the way the OS does it, since the Erlang VM knows much more about Erlang bytecode than the OS. The BEAM/HiPE VM does user-mode scheduling.

So, Erlang does indeed target a platform that has green threads baked in - it targets the Erlang VM.

Barry Kelly said...

@m. th.: IMHO, I don't think current Delphi is a good place to start from to try and build a solid concurrency-aware language for the future. There is too much legacy code which relies on primitives in the language and libraries which are not themselves threadsafe, and are far too mutable to make a good base going forward. This is my opinion, of course, not my employer's.

Craig said...

OK, I think we're kind of saying the same thing then. The "credible target platform" for Erlang is... OTP/Erlang. And you can implement OTP on top of just about any other platform.

You're right though, I was imprecise in my comment on Erlang's use of native threads. It does use native threads, but not on a 1:1 mapping with Erlang processes.

Unknown said...

Retlang implements the actor concurrency model on the CLR. It can either use dedicated threads or rely on a a pool. Hundreds or thousands of Actors can share a pool of a few threads. The pool will grow based upon demand and system resources. Unlike Erlang, message immutability cannot be enforced, but in practice it isn't difficult to adopt immutability as a convention when working with Actors.

Retlang started as an Erlang library for .NET, but has evolved into a more event driven design that better supports the strengths of C#.

Mike Rettig
Retlang Developer

Anonymous said...

:-) :-) :-) ...Don't get daunted, don't get daunted... we'll fix them, we'll fix them... little by little we'll fix them... :-)

Yeah, I understand what you are saying but I'd rather think in a complementary way. I'm just a user. You see, there are a plethora of frameworks which leverage the actor model (see, even as a comment here in your blog post you have one). And this, not to mention languages like Ada (a modified form, more suited for local messaging - a sort of C/S called 'Rendezvous'), Erlang, Scala etc. As you know, this model has the big advantage (among others) that you can program in the same way, with the same language primitives except the 'new' part which is inter-thread/process communication which is isolated (and centralized in some cases - see Ada, I put the link in a message for you in .non-technical, and also again here, for convenience: http://www.iuma.ulpgc.es/users/jmiranda/gnat-rts/node20.htm) by the rest of the code and hence the IPC which is the new, sensible part is an entire new 'module' which doesn't rely (too much) on existing language primitives and/or their properties (immutability, thread-safety etc.). So, indeed the *legacy* code (& co.) isn't a good place to start. IMHO, we must start from the frameworks. These are the VM/abstractions which now we (the programmers in the entire world) are using for prototyping, as you say. For ex. we (our team here) have already written such a distributed kind-a-parallel framework which we used to execute different tasks on different computers (think a 'chat' program, but not between humans, but between processes) for our in-house needs. Also there are much more powerful commercial frameworks which can be used for this written in Delphi for Delphi (RO, RTC, MsgConnect etc.). So, it's possible, and from our small experience - quite easy. Of course I'm reserved here because we built one only for our specific needs and we grow it incrementally on demand. Why don't go on this route? Build the "best" parallel framework (DPL) which leverages the Actor model / message passing and ask "What are the critical parts for which is good/profitable/worth to build language elements for this?" Of course we don't want parallel language elements for the sake of it... only if you "don't have anything else to do" (TM) ;-). See, we did our "VM" ;-) for our framework in Delphi but isn't the same if this can be accomplished right in the compiler - imho, of course. And don't be afraid to much about thread-safety. The community is on your side. There were numerous threads in the last (let's say) 15-20 months about this theme and the community's pulse is clear to the 'message-passing' metaphor. I remember now a huge thread called 'Top 3 wishes' (IIRC) and one 'wish' was Threadsafe VCL and Nick asked 'What do you mean by this?'. In hundreds of messages (literally) the community voice was that 'we need rather to pump messages with ease in a human understandable way from threads to UI thread and back, informing that an action takes/has taken place etc.'. Also were plenty of threads in which at questions of 'How to do multi-threading?' the answers were 'First, dump everything what you know about Synchronize etc. and implement a Provider-Consumer model like this:...'. Imho, the thread/shared memory model which we have and we _must_ have, is to be used only when the other two communication models (actor model and his variant 'provider-consumer') doesn't fit very well (eg. speed issues). So, imho, because you (CG) own the basic library, language specification (a compiler so you can do checks before run) and you have already many frameworks to learn from their goods and bads, you are in a good position. Only to be balanced and with good discrimination/judgment of facts, because istm we live now, at least in our industry, a wave of extremes. Take a look at the community. They will show you the path. Mind you, we don't try to build a research language, we want to help a concrete community in their pains. And this can be more challenging because is closer to reality.

Just my 2c,

m. Th.

David Pollak said...

Howdy,

I'd like to correct one of your comparisons... lift was servicing 2,000 connection where each connection was doing something frequently. I can tell you to a 99% degree of certainty that an Erlang program cannot service 700 calculated and meaningful requests per second (no matter how many *open* connections it can service) on the same hardware.

So, comparing open connections (Erlang) to open connections plus pages being serviced seems a bit unfair. Firing up an ErlyWeb app that does group chat and putting 9 people in each chat room and seeing if you can handle 80,000 users or even 2,000 users would be a better test (Buy a Feature is a lot more complex than simple chat).

As to how it's done under the covers, once again, I don't think you're making a fair comparison. You're comparing how Jetty (and Scala) implement continuations (Scala Actors use a similar mechanism for event-based Actors) to what? Why not dig into the BEAM C code to see how BEAM implements scheduling and inter-process messaging? I'm once again better that the BEAM C code is less than pretty, but it's stable and gets the job done.

So, let's do a comparison of what actually matters. Let's use the micro-benchmark at the Great Language Shootout for message passing. Yep... looks like Erlang is 9 times faster than Scala for message passing. So in this dimension, the cost of passing the message on the JVM is higher. How much of your program is actual message passing? How much of your application is message passing of a very simple data structure (because Erlang copies of message send, the cost of passing an XML document would have been much higher in Erlang, but the same in Scala)? How much does this matter? It depends on your application. But I'll trade a 9x hit in message passing (no matter what the mechanism for thread and stack allocation) for a 5x difference in common and CPU intensive applications (parsing XML, calculating results, generating XML, etc.) These are the things that cost a lot in my applications. These are the scalability factors that I see. Just being able to hold open a connection (and nowhere is it claimed that Jetty can't keep 80,000 connections open and I've done tests with 1M Scala Actors and they "Just Work").

But, to my mind, the fact that one can build Erlang-style Actors that perform well on top of the JVM is a good indicator for the JVM's flexibility. How many languages are implemented on top of BEAM? I think the answer is 4. For the JVM, the answer is much larger. On the JVM, there's very little that's done in native code for performance reasons, but in Erlang, if you want performance, you have to write C. This doesn't work for projects that are moderately complex. See Avi Bryant's discussion of Turtles.

Some of my friends are running a moderate traffic eJabberD server. It's not the most stable piece of their infrastructure. The fantasy that Erlang can run for years without crashing or starting to perform badly or somesuch is not borne out in my observations. Yes, you can run an Erlang process for a relatively long time. You can do the same with a JVM process. OTP is a better library for doing graceful failover than existing on the JVM, so the failures may be less noticeable. But the Erlang VM is not worlds more stable than the JVM.

With all this being said, there are some (and an increasing number of) excellent applications running on Erlang, including RabbitMQ. Erlang provides a great environment for people who like it to program in. I'm glad Erlang exists and some Erlang code and coders have been inspirations for me.

But, Erlang vs. Scala is kinda a dumb thing to be debating. It's emacs vs. vi. Both are good. Each has strengths and each has weaknesses. Wouldn't it be better to focus on the strengths of the paradigm and where each platform does well? Wouldn't it be better for us to be talking about how Erlang and Scala's paradigms are a better way at looking at the problems than are available in Java and C#? That way we raise awareness rather than look like a bunch of squabbling 5 year olds "My language is better than your language."

Barry Kelly said...

@david - I agree with you that language squabbles are pointless. However, the thrust of my post wasn't to say that Erlang is better than Scala, far from it. What I most objected to was Ted dismissing the gestalt of Erlang: in that the way Erlang is built, both with its combination of primitives and its combination of consciously missing primitives, resulted in something that isn't trivially reproducible on the JVM.

As to BEAM's implementation, it's a pretty standard interpreted virtual machine loop at its core in beam_emu.c (process_main), and its process swapping is pretty simple - just save the process state to a structure (SWAPOUT, SWAPIN). I still object to any suggestion that replaying an entire request, in the manner of Jetty Continuations, is an admirable way of implementing continuations.

On the whole, I find your own comment here argues much more strongly for a language food fight. I'm not debating the strengths and weaknesses of Erlang vs. Scala. I didn't talk about any of Scala's features. If I had to write an application server tomorrow, for me personally, Scala would be much higher on the list than Erlang. All I was saying is that Erlang has a set of primitives that produce a gestalt that isn't easily reproducible on the JVM or CLR.

David Pollak said...

Points taken. You're right... my post was a lot more food-fightish than it should have been.

Jon Harrop said...

An absurdity runs throughout this discussion. You have made observations about Erlang and JVM languages *only* and then drawn conclusions about the CLR. That is just silly.

Moreover, the exact problems you are complaining about have long since been fixed on the CLR.

CPS is hard on the JVM because it still lacks tail calls. There are workarounds but they are so nasty and slow that Scala simply forfeits tail calls (except for special cases that are irrelevant here). In contrast, the CLR has supported tail calls for years and F# (which is the CLR equivalent of Scala) has been using them for years.

Then you complain about a lack of green threads on the JVM and CLR but F# provides monadic-style asynchronous workflows with a custom syntax specifically designed to make high-performance Erlang-style concurrency as easy to use as possible.

Finally, the nail in Erlang's coffin for general purpose concurrent programming is that it copies everything that is passed. In contrast, the sophisticated concurrent GCs in both the JVM and .NET make it easy to share data structures between threads, eliminating the performance hit and space consumption associated with unnecessary copying.

Craig said...

"Flying Frog" wrote:

Finally, the nail in Erlang's coffin for general purpose concurrent programming is that it copies everything that is passed.

As it would be impossible to write an efficient telecom system in such an environment, this is clearly false.

In particular, look at the binaries feature of Erlang.

Scaler types are copied, yes, and this has some real advantages ( Along with the disadvantages, of course). In particular, it builds in an automatic fine granularity for garbage collection, allowing soft real-time performance without a garbage collector as sophisticated as something like Metronome.

Barry Kelly said...

Jon Harrop wrote:


An absurdity runs throughout this discussion. You have made observations about Erlang and JVM languages *only* and then drawn conclusions about the CLR. That is just silly.


For my purposes, talking about green threading as a primitive of the Erlang VM, the CLR and JVM are equivalent. It is true that the CLR provides a couple of more primitives, but I don't think they materially alter my position. Certainly tail calls help in implementing something like green threads, but it requires that you transform the program into CPS in the first place - something that is not usually possible if there is any non-VM code on the stack.


Then you complain about a lack of green threads on the JVM and CLR but F# provides monadic-style asynchronous workflows with a custom syntax specifically designed to make high-performance Erlang-style concurrency as easy to use as possible.


A syntax-directed transformation into CPS like monadic async workflow certainly has its uses, but again, it doesn't help you a lot when you have native or non-VM code on the stack.


Finally, the nail in Erlang's coffin for general purpose concurrent programming is that it copies everything that is passed. In contrast, the sophisticated concurrent GCs in both the JVM and .NET make it easy to share data structures between threads, eliminating the performance hit and space consumption associated with unnecessary copying.


The copying is an implementation detail; if the data is immutable, references can be shared; if the data is mutable, copying is what you want. Shared mutable data is the root of all concurrency evil. Having a VM which specifically *excludes* troublesome mutability can be a non-obvious advantage; other, more traditional, languages for the same VM can't introduce hidden sharing.

That said, carefully-constructed APIs implemented behind the scenes using mutable shared data are definitely useful, but as a general programming model, the complexity doesn't scale. We need different models at this point for concurrency in the large and the small; threads (shared-memory by definition) are a one-size-fits-all, and aren't the answer.

Jon Harrop said...

You say "The copying is an implementation detail; if the data is immutable, references can be shared..." but, as I said, F# already does that efficiently whereas Erlang almost certainly never will.

Then you say "Shared mutable data is the root of all concurrency evil" but there are two obvious problems with that. Firstly, databases are a ubiquitous and essential counter example. Secondly, real programs will almost always want to use shared mutable data structures for their non-concurrent parallel aspects. Erlang may be great for the concurrent parts but as long as it sucks for everything else it will not gain traction because it is domain specific.

Barry Kelly said...

Jon Harrop wrote:

You say "The copying is an implementation detail; if the data is immutable, references can be shared..." but, as I said, F# already does that efficiently whereas Erlang almost certainly never will.


I for one don't have enough information in this case to predict the future, so I'm not going to try to.


Then you say "Shared mutable data is the root of all concurrency evil" but there are two obvious problems with that. Firstly, databases are a ubiquitous and essential counter example.


RDBs are different oranges to the in-memory apples of programming languages; the usage models, and in particular the actual amount of state that is mutated concurrently with certain performance expectations, is a lot lower with RDBs than in-memory. Transactions have a logical model that is in part translatable to in-memory via transactional memory, but there are problems in that space too, that I'm not going to get into now (though see Cliff Click's recent post at STMs vs Locks). Every solution we have to concurrent situations has drawbacks; essentially, we have a patchwork of solutions.

Secondly, real programs will almost always want to use shared mutable data structures for their non-concurrent parallel aspects. Erlang may be great for the concurrent parts but as long as it sucks for everything else it will not gain traction because it is domain specific.

I don't have enough experience with Erlang to understand what its idioms are for all situations, so I'm not going to speculate.

Doktor Towerstein said...

Jon Harrop wrote:

You say "The copying is an implementation detail; if the data is immutable, references can be shared..." but, as I said, F# already does that efficiently whereas Erlang almost certainly never will.

As far as I know, data are lazy copied in erlang. So, if you send a list and the recipient is not altering it, then nothing is copied. Any modification will trigger copying some or all of the shared data. Please detail what mechanism is used in F# for these purposes. They can surely be incorporated into the BEAM VM.

Then you say "Shared mutable data is the root of all concurrency evil" but there are two obvious problems with that. Firstly, databases are a ubiquitous and essential counter example.

Yes, they are an example... until you come to scalability. Only shared nothing databases can scale to any data volume; take Teradata. You could also check mnesia (written in erlang) as a shared nothing DB that provides replication and distribution of individual fragments.

Secondly, real programs will almost always want to use shared mutable data structures for their non-concurrent parallel aspects.

I would say that's just an assumption. After doing some erlang I find it easier to code certain algorithms, since I don't have to care about when to clone a specific Java object to avoid modifications in different parts of the code.

As for purely parallel algorithms, erlang is definitely not the best language. But
if you consider a group of processes with the same code, following the same path, then a group of Nvidia threads in a single core could run all those "processes" at the same time.

Javier Torres

Jon Harrop said...

@Javier Torres

As far as I know, data are immutable in Erlang so your statements about copy-on-write cannot be correct because there can be no "write". I believe the main Erlang implementation copies aggressively because this makes distribution and garbage collection trivial.

F# is built upon the CLR which provides a performant concurrent collector. This collects unreachable objects from all threads simultaneously and handles sharing. The CLR is the only mature performant VM supporting tail calls with concurrent garbage collection in existence. It is simply too difficult to write and debug your own (many other language implementations have tried and failed, e.g. OCaml, Haskell).