Wednesday, May 21, 2008

Life with Cygwin: Paths, Notepad and Clipboard

I have in the past written about my Cygwin setup, but several weeks ago, while I was in Scotts Valley and having dinner with Adam Markowitz and Steve Trefethen, Steve mentioned that I should write a bit more about my setup.

While the defaults for a new Cygwin install today are better than they have ever been, they still leave a lot to be desired. Using, as I do, a bash shell as my main command line, yet still being a Windows programmer running on Windows, means that I need to integrate with Windows command-line programs. Herein lies a problem: Cygwin uses Unix-like paths with '/' and no drive letter or colon (the colon separates entries in PATH on Unix systems), while Windows inherits the usual CP/M/DOS traditions of drive letters and backslashes. Incidentally, I mount my fixed drives at single-letter directories under the root to make converting between Windows and Cygwin paths easy:

$ mkdir /c
$ mount 'c:\' /c

Mounting removable drives like floppy disks and DVD-drives in the same way is more problematic, as the 'ls --color=auto' command (which wants to colour in files and directories corresponding to their types) will try to read the contents of these directories, which of course will be mounted to removable drives. This would normally cause delays when doing a listing of the root directory, as the removable drives in the system spin up etc. Consequently, for removable drives I use a different technique. For example, my DVD-drive is 'O:' (because it's round, and because I frequently add and remove drives and I don't like drive letters changing because it breaks things), so this is how I integrate the DVD-drive into the Cygwin file system:

$ ln -s '/cygdrive/o' /o

This creates a symbolic link which will just be a broken link when there is no DVD in the drive. I do similar things for my iPods, pen drives, floppy drive (I still keep one around, just in case :), etc.
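
To see how such a link behaves with and without media, here is a small sketch that uses a throwaway directory instead of a real drive. The names 'scratch' and 'missing-drive' are made up for the illustration; 'missing-drive' plays the part of an empty /cygdrive/o.

```shell
# Illustration only: a scratch directory stands in for the root directory,
# and "missing-drive" for an empty /cygdrive/o.
scratch=$(mktemp -d)
ln -s "$scratch/missing-drive" "$scratch/o"

# -L is true for the link itself; -e follows the link, so it is false
# while the target does not exist.
if [ -L "$scratch/o" ] && [ ! -e "$scratch/o" ]; then
    link_state="broken"
else
    link_state="ok"
fi
echo "without media: $link_state"

# Once the target appears (the disc is inserted), the same link resolves.
mkdir "$scratch/missing-drive"
if [ -e "$scratch/o" ]; then
    link_state="ok"
fi
echo "with media: $link_state"

rm -rf "$scratch"
```

Because nothing has to be read through a broken link, a listing of the root directory has no reason to wait on the drive.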

Anyway, back to Cygwin/Windows path interaction. Cygwin does provide a command to convert between paths, called 'cygpath'. It can be used fairly easily in an ad-hoc way on the command-line:

$ notepad /etc/bash.bashrc
# (this will fail, as notepad can't cope)

$ notepad $(cygpath -w /etc/bash.bashrc)
# (this will work here but fails when the Windows path has spaces)

$ notepad "$(cygpath -w /etc/bash.bashrc)"
# (this is more resilient)
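
The difference between the last two forms comes down to word splitting. Here is a minimal sketch of it that runs without Cygwin: 'fake_cygpath' is a made-up stand-in for 'cygpath -w' that always prints the same path containing a space, and 'count_args' simply reports how many arguments it was handed.

```shell
# Stand-in for `cygpath -w` so the sketch runs without Cygwin; it only
# echoes a fixed Windows-style path that contains a space.
fake_cygpath() {
    printf '%s\n' 'C:\Program Files\notes.txt'
}

# Reports how many arguments a program would receive.
count_args() {
    echo "$#"
}

unquoted=$(count_args $(fake_cygpath))    # split at the space
quoted=$(count_args "$(fake_cygpath)")    # kept as one argument

echo "unquoted: $unquoted argument(s)"
echo "quoted:   $quoted argument(s)"
```

The unquoted form reports two arguments and the quoted form one, which is why Notepad ends up looking for the wrong file when the quotes are left off.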

winexec

Using cygpath manually is a bit of a pain, so I wrote a little bash script I call winexec to capture the pattern:

#!/bin/bash

function usage
{
    echo "usage: $(basename $0) [options] <executable> [<args>...]"
    echo "Executes an executable with arguments, converting non-options into Win32 paths."
    echo "Options:"
    echo "  -f    Only convert paths to files or directories which actually exist."
    echo "  -s    Use cygstart to execute detached from console."
    echo "  -k    Skip converting paths until '**' found in arguments (and remove the '**')."
    echo "  --    Terminate $(basename $0) options processing."
    exit 1
}

# Process options to winexec itself.

while [ "$1" ]; do
    case "$1" in
        -f)
            ONLY_FILES=1
            ;;
        
        -s)
            USE_CYGSTART=1
            ;;
        
        -k)
            SKIP_TO_STAR=1
            ;;
        
        --)
            shift
            break
            ;;
        
        -*)
            # Give an error on unknown switches for future compat.
            usage
            ;;
        
        *)
            break
            ;;
    esac
    shift
done

EXECUTABLE="$1"
shift

test -z "$EXECUTABLE" && usage

# Options conversion and caching.

declare -a OPTS
function add_opt
{
    OPTS[${#OPTS[@]}]="$1"
}

function add_file_opt
{
    if [ -n "$SKIP_TO_STAR" ]; then
        if [ "$1" = "**" ]; then
            SKIP_TO_STAR=
            # Eat '**' but don't add.
        else
            # Haven't seen star yet, so add unconverted.
            add_opt "$1"
        fi
    else
        if [ -n "$ONLY_FILES" ]; then
            if [ -f "$1" -o -d "$1" ]; then
                add_opt "$(cygpath -w "$1")"
            else
                add_opt "$1"
            fi
        else
            add_opt "$(cygpath -w "$1")"
        fi
    fi
}

# Process arguments to executable.

# Loop on the argument count so that an empty argument doesn't cut processing short.
while [ $# -gt 0 ]; do
    case "$1" in
        -*)
            add_opt "$1"
            ;;
        
        *)
            add_file_opt "$1"
            ;;
    esac
    shift
done

# Actually start the executable.

if [ "$USE_CYGSTART" ]; then
    cygstart -- "$EXECUTABLE" "${OPTS[@]}"
else
    "$EXECUTABLE" "${OPTS[@]}"
fi

For an example of how I use that, I have another script called 'dir', for when I feel like I need classic 'dir' options:

#!/bin/bash

winexec -f -k cmd /c dir '**' "$@"
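
To make the '-k' handling a bit more concrete, here is a stripped-down sketch of just that part of the logic. It is not winexec itself: 'fake_convert' is a made-up stand-in for 'cygpath -w' that only tags its argument, the '-f' existence check is left out, and the function prints one resulting argument per line rather than running anything.

```shell
# Stand-in for `cygpath -w` so the sketch runs without Cygwin: it only
# tags the argument instead of really converting it.
fake_convert() {
    printf 'WIN(%s)\n' "$1"
}

# Prints one line per argument, mimicking winexec's -k handling:
# everything before '**' passes through untouched, '**' itself is
# dropped, and later non-option arguments are converted.
show_k_handling() {
    skipping=1
    for arg in "$@"; do
        if [ -n "$skipping" ]; then
            if [ "$arg" = "**" ]; then
                skipping=
            else
                printf '%s\n' "$arg"
            fi
        else
            case "$arg" in
                -*) printf '%s\n' "$arg" ;;
                *)  fake_convert "$arg" ;;
            esac
        fi
    done
}

show_k_handling /c dir '**' /etc -x
```

The '/c' and 'dir' that belong to cmd come out untouched, while the '/etc' after the marker is the only argument handed to the converter. In the real script, '-f' additionally leaves alone anything that isn't an existing file or directory, which is what lets switches like '/w' through.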

All these scripts, BTW, go in my ~/bin directory and are chmod'd 0755 to make them executable:

$ mkdir ~/bin
$ chmod -R 0755 ~/bin/*

My system's PATH (i.e. the Windows PATH, from System Properties | Advanced | Environment Variables) includes my home directory's bin directory before the Cygwin bin directories, but it also includes those. There can be some knots here though, which I won't get into today. The scripts also need to use Unix line-endings, though Cygwin was less strict about this in the past. It's easily enough done, though: the dos2unix command will normalize to Unix any text files given as arguments.
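
As a rough sketch of checking for the line-ending problem, the following finds files containing carriage returns and strips them with 'tr', for when dos2unix isn't to hand. The scratch directory and the file names in it are invented for the illustration and stand in for ~/bin.

```shell
# Illustration only: a scratch directory stands in for ~/bin.
scratch=$(mktemp -d)
printf '#!/bin/bash\r\necho hello\r\n' > "$scratch/with-crlf"
printf '#!/bin/bash\necho hello\n' > "$scratch/with-lf"

# A literal carriage return to search for.
cr=$(printf '\r')

# grep -l lists only the files that contain at least one carriage return.
crlf_files=$(grep -l "$cr" "$scratch"/*)
echo "needs converting: $crlf_files"

# tr can strip the carriage returns; the result goes to a new file.
tr -d '\r' < "$scratch/with-crlf" > "$scratch/fixed"
fixed_has_cr=$(grep -c "$cr" "$scratch/fixed" || true)
echo "lines with CR after fixing: $fixed_has_cr"

rm -rf "$scratch"
```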

n

Notepad is a classic programmer's tool - as in "all I need is Notepad and the compiler" (or maybe just Notepad ;), etc. Since Notepad doesn't react so well to multiple file arguments, it isn't completely suitable to the winexec trick. I have a customized script for Notepad:

#!/bin/bash

if [ -z "$1" ]; then
    echo "usage: $(basename $0) <file>..."
    echo "Starts notepad on the file(s)."
    echo "If <file> is -, then standard input is redirected to a temp file and opened."
    exit 1
fi

for file in "$@"; do
    if [ "$file" = "-" ]; then
        # Quote the temp file name in case the temp directory path contains spaces.
        file="$(mktemp)"
        cat - > "$file"
        (
            notepad "$(cygpath -w "$file")"
            rm "$file"
        ) &
    else
        cygstart -- notepad "$(cygpath -w "$file")"
    fi
done

Having created this little utility, I can open multiple files in notepad just using the bash wildcards:

$ n /c/windows/*.txt
# (there aren't too many of these guys)

Similarly, I can capture a program's output into Notepad for reference in a separate window and possible printing:

$ dir /c | n -
# (opens a notepad window containing the directory listing for C:\)

Copy and Paste

Finally (for now), good Windows integration requires good clipboard integration. The native-Windows rxvt terminal which ships with Cygwin already supports automatic copy on selection and paste with middle-click or Shift+Ins, familiar to Unix console and X users. However, I often want to copy the output of a command to the clipboard, or get a copied piece of text into a file, or transform the contents of the clipboard (perhaps to do a search and replace on it), etc. Thus, I wrote two little utilities in Delphi, copy-clipboard.dpr and paste-clipboard.dpr:

Copy

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, Clipbrd;

var
  list: TStringList;
  line: string;
begin
  try
    list := TStringList.Create;
    while not Eof(Input) do
    begin
      Readln(line);
      list.Add(line);
    end;
    Clipboard.AsText := list.Text;
  except
    on e: Exception do
      Writeln(ErrOutput, e.Message);
  end;
end.

(Freeing objects that have no external effect when freed before you're about to exit the program is the height of pointlessness, in case you were wondering.)

Paste

{$APPTYPE CONSOLE}

uses
  SysUtils, Clipbrd;

begin
  Write(Clipboard.AsText);
end.

These two utilities, having been compiled, renamed to c.exe and p.exe, and moved to my ~/bin directory, come in very handy. For example, to grab one of the above scripts, I would normally just select and copy the script text, and then:

$ p > ~/bin/winexec
$ chmod 0755 ~/bin/winexec

Similarly, I sometimes want to search and replace on text on the clipboard:

$ p
Similarly, I sometimes want to search and replace on text on the clipboard:
# (showing what's on the clipboard)
$ p | sed 's| |_|g' | c
# (replace all spaces with underscores)
$ p
Similarly,_I_sometimes_want_to_search_and_replace_on_text_on_the_clipboard:

A not usually unwelcome side-effect of my clipboard commands is that they normalize line endings and add a newline sequence at the end of the text, if there isn't one already.
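
The same kind of normalization can be sketched in the shell. This is only an analogue of the effect, not what c.exe and p.exe do internally, and it assumes Unix line endings are the wanted result; 'normalize_lines' is a name invented for the example.

```shell
# Rough analogue of reading text line by line and writing each line back
# out: carriage returns are dropped and the last line gains a newline.
normalize_lines() {
    awk '{ sub(/\r$/, ""); print }'
}

# The sample has one CRLF line followed by a final line with no newline.
before=$(printf 'one\r\ntwo' | wc -l | tr -d ' ')
after=$(printf 'one\r\ntwo' | normalize_lines | wc -l | tr -d ' ')

echo "newlines before: $before"
echo "newlines after:  $after"
```

The sample goes in with a single newline and comes out with two, since the unterminated last line has been completed.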

I hope I've given a few folk some ideas about optimizing their environment, particularly if they're command-line junkies like me.

Sunday, May 18, 2008

In Defense of Steve Vinoski and Erlang

There's been a minor scuffle going on between Ted Neward and Steve Vinoski over the wisdom of Erlang's approach to concurrency: whether it should be baked into the language or not on one hand, and whether it should be running on the JVM or CLR on the other.

I've already articulated my position on VMs, and I think it makes a lot of sense, particularly for prototyping, to build a VM specifically for a language implementation, particularly if the language has some primitives that are not normally available in commodity VMs. And to be frank, if one's language doesn't have some interesting new primitives or combination of primitives, it is unlikely to be moving the state of the art forward.

Erlang uses the actor concurrency model combined with lightweight aka green threads, though the execution engine may spawn as many threads as needed in order to get genuine concurrency from an underlying parallel architecture, such as multiprocessor or multicore. Erlang works under a shared-nothing model, however, so the "green threads" are more like "green processes".

I think Ted misses some appreciation of the power of the Erlang model, and in particular, its choice of primitives. Ted points to an implementation of the Actor model using Lift (written in Scala), and in particular some ballpark performance numbers:

We also had an occasion to have 2,000 simultaneous (as in at the same time, pounding on their keyboards) users of Buy a Feature and we were able to, thanks to Jetty Continuations, service all 2,000 users with 2,000 open connections to our server and an average of 700 requests per second on a dual core opteron with a load average of around 0.24... try that with your Rails app.

One of the obvious problems with this comment is that it doesn't sound very impressive when actually compared with Yaws, implemented in Erlang:

Our figure shows the performance of a server when subject to parallel load. This kind of load is often generated in a so-called "Distributed denial of service attack".

Apache dies at about 4,000 parallel sessions. Yaws is still functioning at over 80,000 parallel connections.

What this disparity tells me is that the JVM and CLR are likely lacking some primitives that help Erlang achieve this kind of result. For straight-up processing code, Erlang isn't terribly fast, by all accounts. The reason it wins is likely because it is avoiding the context-switching overhead through the use of lightweight processes. This in turn suggests to me that the equivalent of green threads, or some kind of automatic CPS transformation or native support for continuations is needed for CLR and JVM to be credible target platforms for Erlang. Lift is currently using Jetty Continuations, and when you read up about its implementation:

Behind the scenes, Jetty has to be a bit sneaky to work around Java and the Servlet specification as there is no mechanism in Java to suspend a thread and then resume it later. The first time the request handler calls continuation.suspend(timeoutMS) a RetryRequest runtime exception is thrown. This exception propagates out of all the request handling code and is caught by Jetty and handled specially. Instead of producing an error response, Jetty places the request on a timeout queue and returns the thread to the thread pool.

When the timeout expires, or if another thread calls continuation.resume(event) then the request is retried. This time, when continuation.suspend(timeoutMS) is called, either the event is returned or null is returned to indicate a timeout. The request handler then produces a response as it normally would.

Thus this mechanism uses the stateless nature of HTTP request handling to simulate a suspend and resume. The runtime exception allows the thread to legally exit the request handler and any upstream filters/servlets plus any associated security context. The retry of the request, re-enters the filter/servlet chain and any security context and continues normal handling at the point of continuation.

... you can see that it's clearly a hack to work around the limitations of the JVM - i.e. the fact that it doesn't have user-schedulable green threads (on top of native threads, not as a replacement, a bit like fibers on Windows), or an automatic CPS with some AOP-style weaving, or native continuation support.

In conclusion, the primitives in the VM matter a great deal. With the right primitives, wholly different styles of application become possible, because of the radically different performance profiles.

The point of VMs

There was a post on the Delphi newsgroups that stuck in my head for some reason, and I felt I had to write a reply. The reply ended up being a lot longer than I originally intended, because I felt I had to justify my stance. I'm reposting it here in edited form.

"Virtual machine" has acquired pejorative overtones due to historical and social reasons that are probably too emotive to go into. Suffice it to say that I think it's another case of "good ideas don't win, proponents of bad ideas die out instead".

The way I see it, a virtual machine (in the context of programming language implementations) is a software implementation of an abstract machine with a closed-by-default set of semantics.

Let's take that definition apart:

  • software implementation: Here, I don't mean that the machine cannot be implemented in hardware. Rather, I mean that if it's going to be "virtual", it is usually implemented in software, which gives rise to certain characteristics, which in turn imbue "virtual machine" with extra shades of meaning. It turns out that software implementation is better than implementing in hardware, largely because of flexibility.
  • abstract machine: Every programming language has an abstract machine implicit or explicit in its definition, or otherwise its promised semantics are meaningless - you need a machine at some point to actually do things, and have effects. So, the abstract machine bit isn't controversial; it's its qualities that matter. Note that I differentiate between two different abstract machine concepts: a language's abstract machine, which it uses to model effectful operations, and a platform as an abstract machine. A CPU (+ memory + etc.) specification is an abstract machine, and a platform; the physical device, however, is a real machine, running on the laws of physics.
  • closed-by-default semantics: Here, I mean that at the abstraction level of the abstract machine in question, undefined behaviour is outlawed. In defining our machine, we humbly accept our human frailties, and do our best to prevent "unknown unknowns" becoming a problem by reducing the scope of the problem domain. We limit the power of the machine, in other words.
    Since we do, eventually, want to be able to talk to hardware, legacy software and the rest of the real world, there do need to be carefully controlled holes and conduits built-in. But they're opt-in, not opt-out.

Let's look at some of the ramifications of this conception of VMs.

  • Software implementation delivers a tremendous amount of flexibility. Some examples: runtime metaprogramming (e.g. runtime code generation, eval); dynamic live optimization (e.g. Hotspot JVM [1]); auto-tuning garbage collection; run-time type-aware linking (solving the template instantiation code-bloat problem); rich error diagnostics (e.g. break into REPL in dynamic languages).
  • Abstract machine: Developments in programming language fashions have made object orientation come to the fore (perhaps even too much to the fore). However, our physical machines map much closer to procedural code and a separation between code and data than the trends in language and architecture design.
    In other words, the platforms that historically popular type-unsafe [2] languages (like C++ and Delphi) have targeted aren't a close match for those languages' abstract machines. When they want to interoperate, either with other modules or with modules written in different languages, they face barriers, because their common denominator is the abstraction of the physical CPU. Hence C-level APIs being de facto industry standards, along with limited attempts to raise the abstraction level with COM (largely defined at the binary level in terms of C, explicitly referring to vtable concepts that are otherwise just hidden implementation details of other languages).
    So, moving the abstraction level of the target machine closer to the average language abstract machine makes compiler implementation easier, reduces interoperation barriers, and provides more semantic content for the (typically) software implementation to work its flexibility magic.
  • Closed-by-default eliminates whole categories of bugs. Type-safety can be guaranteed by the platform. Never again [3] have a random memory overwrite that shows up as a crash 5 minutes or 5 hours later. It also improves security [4] by having a well-defined whitelist of operations, rather than trying to wall things in with blacklists and conventions ("this structure is opaque, only pass to these methods" etc.).

[1] Some notable optimizations that become feasible when the program is running live include virtual method inlining, lock hoisting and removal, redundant null-check removal (think about argument-checking at different levels of abstraction), etc. Steve Yegge's latest blog post, while rambling, covers many optimizations that apply equally to static languages running in a virtual machine and to dynamic languages (but of course he's interested in promoting them as they apply to dynamic languages):
http://steve-yegge.blogspot.com/2008/05/dynamic-languages-strike-back.html

[2] Any language that has dynamic memory allocation that it expects to be reclaimable (i.e. no infinite memory) and doesn't have a GC isn't type-safe. A single dangling pointer to deallocated memory kills your type safety: if a value of a different type gets allocated at the same location, you have a type violation.

[3] Unfortunately, RAM may occasionally flip bits due to cosmic rays etc. So, we want to use ECC RAM and checksum critical structures when it matters. Edge case nit.

[4] IMO, the capability-based security model is the best of those available, ideally including eliminating ambient authority.
http://en.wikipedia.org/wiki/Capability-based_security
Guess what: you need a type-safe virtual machine to make some strong guarantees about capabilities, otherwise someone could come along and steal all your capabilities by scanning your memory.
See Capability Myths Demolished for more info:
http://srl.cs.jhu.edu/pubs/SRL2003-02.pdf

Friday, May 16, 2008

In an odd coincidence...

Jeff Atwood has a new post up about forking open source projects, and in particular, pointing out how difficult it is. This very closely corresponds with the point I made yesterday...

The new "Open" isn't as open as it seems to be

I was reading Scott Koon's (lazycoder) post about RIAs (rich internet applications) being a platform play, what with Adobe AIR, Microsoft Silverlight and JavaFX. Scott noted that these platforms are all "open enough" that independent implementations can be made.

This idea reminded me of a podcast on open source business models with Dirk Riehle that I listened to a number of weeks back. The core ideas are also on Dirk's website.

Dirk makes clear a number of features of the open software market. In particular, he distinguishes between community open source and commercial open source, where a single vendor controls the direction of the project and employs the key contributors. Even though e.g. Silverlight isn't exactly open source, Microsoft has been open enough to let Novell's Moonlight be implemented. The strategies of Sun and Microsoft for their new platforms appear to be based around the ideas of commercial open source. Their profit items are hardware and services in the case of Sun, and OS and database licenses in the case of Microsoft. Adobe's long-term strategy for monetizing its platform play isn't yet clear to me, but when you've got folks locked in, you can start selling your captive audience one way or another.

Anyhow, my point is that this new openness isn't as open as it seems to be. Community open source needs strong personalities to deliver direction, and even then, it seems to work best when reimplementing a previously proprietary, closed-source technology. Commercial open-source has several key advantages in controlling the platform, because focus and longer-term strategy means it can usually force any potential competitors into trying to keep up with it, rather than forking and going their own way. Open platforms tend to rally around key personalities, which essentially become brands, and the commercial style means that the body corporate owns that brand. Microsoft can come out with all the IronPython and IronRuby they like, but they are unlikely to get much acceptance if they try to introduce features that Guido or Matz don't agree with. More corporately, it's unlikely that many competitors to Sun would ever become "go-to guys" for issues with the Java platform (though if it would be anybody, I'd bet on Azul in the long term).

Upshot is, going open isn't as open (or risky) as it might seem, provided that you've got something else to sell alongside.

Update: Fuzzyman in the comments has convinced me that I was wrong to suggest that the most used open source isn't that innovative, hence the overstrike. However, that wasn't really essential to the point of the post...

Tuesday, May 13, 2008

1001 Books you must read before you die

A long list of books, with many titles where I've said "I must read that some day". I read a lot more when I was a kid, before I got into this whole computer racket, and started focusing on more non-fiction. I've only read 50 or so from the list. At this rate, I'll never get more than 15% or so.

I wonder how much you can tell about a person from their reading list? Here are the ones I've read, though I may have missed one or two:

Slow Man – J.M. Coetzee
Choke – Chuck Palahniuk
Super-Cannes – J.G. Ballard
Memoirs of a Geisha – Arthur Golden
The Information – Martin Amis
Time’s Arrow – Martin Amis
London Fields – Martin Amis
The Long Dark Teatime of the Soul – Douglas Adams
Dirk Gently’s Holistic Detective Agency – Douglas Adams
The Old Devils – Kingsley Amis
Money: A Suicide Note – Martin Amis
The Hitchhiker’s Guide to the Galaxy – Douglas Adams
High Rise – J.G. Ballard
One Hundred Years of Solitude – Gabriel García Márquez
The Third Policeman – Flann O’Brien
One Day in the Life of Ivan Denisovich – Aleksandr Isayevich Solzhenitsyn
A Clockwork Orange – Anthony Burgess
Stranger in a Strange Land – Robert Heinlein
Catch-22 – Joseph Heller
The Tin Drum – Günter Grass
Breakfast at Tiffany’s – Truman Capote
The Lord of the Rings – J.R.R. Tolkien
Lord of the Flies – William Golding
Foundation – Isaac Asimov
The Catcher in the Rye – J.D. Salinger
I, Robot – Isaac Asimov
Nineteen Eighty-Four – George Orwell
Animal Farm – George Orwell
The Glass Bead Game – Hermann Hesse
Of Mice and Men – John Steinbeck
The Hobbit – J.R.R. Tolkien
Brave New World – Aldous Huxley
The Castle – Franz Kafka
The Trial – Franz Kafka
A Passage to India – E.M. Forster
Heart of Darkness – Joseph Conrad
The Hound of the Baskervilles – Sir Arthur Conan Doyle
The War of the Worlds – H.G. Wells
The Invisible Man – H.G. Wells
Dracula – Bram Stoker
The Island of Dr. Moreau – H.G. Wells
The Time Machine – H.G. Wells
The Adventures of Sherlock Holmes – Sir Arthur Conan Doyle
The Adventures of Huckleberry Finn – Mark Twain
The Brothers Karamazov – Fyodor Dostoevsky
Around the World in Eighty Days – Jules Verne
Through the Looking Glass, and What Alice Found There – Lewis Carroll
War and Peace – Leo Tolstoy
Crime and Punishment – Fyodor Dostoevsky
Alice’s Adventures in Wonderland – Lewis Carroll
A Christmas Carol – Charles Dickens
Castle Rackrent – Maria Edgeworth
Aesop’s Fables – Aesopus

HT to Jason Kottke.