Thursday, September 11, 2008

Exogenous Exceptions (oh and another Vista rant :)

Eric Lippert has just posted an entry about "vexing exceptions", talking about the various buckets he classifies exceptions into and different approaches for handling them:

Vexing exceptions are the result of unfortunate design decisions. Vexing exceptions are thrown in a completely non-exceptional circumstance, and therefore must be caught and handled all the time.

[...]

The classic example of a vexing exception is Int32.Parse, which throws if you give it a string that cannot be parsed as an integer. But the 99% use case for this method is transforming strings input by the user, which could be any old thing, and therefore it is in no way exceptional for the parse to fail.

[...]

And finally, exogenous exceptions appear to be somewhat like vexing exceptions except that they are not the result of unfortunate design choices. Rather, they are the result of untidy external realities impinging upon your beautiful, crisp program logic. [Example of file-not-found follows, and points out that File.Exists check would only be a race.]

However, I strongly disagree with his suggestion that you try to catch exogenous exceptions, and somewhat disagree about the "vexing" exceptions (here, it's probably just the specific example he chose that I don't really agree with).

In my opinion, exceptions should, in general, be propagated all the way to a user-visible dialog box / notification message, if the exception represents an error by the user. So, for example, if a user enters a floating-point value or a string value in a box that should semantically be an integer, it is fine for an exception in Int32.Parse to bubble its way back into the user's view - so long as the message in the exception is meaningful to the user, and is written in the correct language / jargon etc. If not, then certainly the exception should be wrapped in another exception and re-propagated, but the exception itself shouldn't be just caught.

Of course, if there is valid logic for a failure case that doesn't simply mean telling the user about a problem in their input / what they're trying to do, then that's a situation where TryParse etc. makes sense.
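To make the distinction concrete, here is a minimal sketch. It uses Python rather than .NET, so int() stands in for Int32.Parse, and the names parse_quantity and try_parse_int are hypothetical helpers invented for illustration. The first function lets the failure propagate, re-wrapped with a message worded for the user; the second is the TryParse-style alternative for when a failed parse has a real fallback.

```python
def parse_quantity(text: str) -> int:
    """Parse user input; on failure, raise an error worded for the user."""
    try:
        return int(text)
    except ValueError as exc:
        # Wrap rather than swallow: the original error stays attached
        # as __cause__, so no detail is lost on the way up to the UI.
        raise ValueError(
            f"Quantity must be a whole number, but got {text!r}."
        ) from exc


def try_parse_int(text: str, default: int) -> int:
    """TryParse-style helper: only for cases with genuine fallback logic."""
    try:
        return int(text)
    except ValueError:
        return default
```

The caller of parse_quantity does not catch anything; whatever top-level handler shows the dialog box can display the message as-is.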

On the exogenous question, catching these is extremely problematic, because when the user eventually sees the message, they'll need to figure out what went wrong. So, if you have a file access problem, the very first exception raised - the exogenous one when the CreateFile call at the heart of things failed - should have a message associated with it indicating that such-and-such file couldn't be accessed. This is the same message that should eventually be propagated to the user, either in an event log, a dialog box, or other notification mechanism, particularly if the high-level attempted action was user-initiated.

To do otherwise leaves the user stranded with a generic, non-specific (and thus meaningless) "I couldn't do something" message. I've seen so much MS software that gives non-specific and non-actionable error messages in failure situations that this kind of advice really annoys me unless it is very carefully qualified and described.

If anything, exceptions should be wrapped rather than caught, with higher-level semantic information wrapping the underlying reason. So, if you're trying to e.g. modify a record in a data entry application, the chain of failure might be "couldn't modify record because" -> "couldn't contact database for locking because" -> "couldn't connect to database server because" -> "remote server name FooBar could not be found". This kind of message has information about every level of the stack, and should a user have to e.g. contact IT, the full message (the technical details can be hidden in a pop-out dialog etc.) is 100% actionable, and even regular users, let alone power users, may find it actionable.
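The chain above can be sketched as follows. This is only an illustration in Python (the post itself is about .NET): AppError, the four functions, and describe_chain are all hypothetical names, and the lowest-level failure is simulated with a plain OSError rather than a real network call. Each layer wraps the failure from the layer below using exception chaining, and the top-level handler walks the chain to build the full message.

```python
class AppError(Exception):
    """Hypothetical error type whose message is meant to be shown to the user."""


def resolve_server(name: str) -> None:
    # Simulated exogenous failure at the bottom of the stack.
    raise OSError(f"remote server name {name} could not be found")


def connect_to_database(name: str) -> None:
    try:
        resolve_server(name)
    except OSError as exc:
        raise AppError("couldn't connect to database server") from exc


def lock_record(record_id: int) -> None:
    try:
        connect_to_database("FooBar")
    except AppError as exc:
        raise AppError("couldn't contact database for locking") from exc


def modify_record(record_id: int) -> None:
    try:
        lock_record(record_id)
    except AppError as exc:
        raise AppError("couldn't modify record") from exc


def describe_chain(exc: BaseException) -> str:
    """Join every message in the __cause__ chain, outermost first."""
    parts = []
    current = exc
    while current is not None:
        parts.append(str(current))
        current = current.__cause__
    return " because: ".join(parts)
```

A top-level handler would call describe_chain on whatever escapes modify_record and put the result in the dialog, with the lower levels hidden behind a "details" expander if they are too technical for most users.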

Software does not have AI-level capabilities, and is very far from it. Describing what went wrong is 100x more useful than presenting something vaguely actionable. Unless the error case is very common, and you are therefore very certain what the fix is, you should not present "actionable" advice in place of a description of what went wrong. Giving good actionable advice in general means embedding an expert system; unless that expert system is populated with data, it needs to include IT-support-level AI, which, like I said, isn't happening any time soon.

Here's a specific example that really burned me just the other day. Vista Ultimate has a full PC backup capability. I found out that my main HD is failing (SMART alert). I wanted to restore that backup onto a different disk of the same make and size (actually, right down to consecutive serial numbers). However, my machine is rather complex - there are 7 SATA disk devices in the machine. The Vista OS DVD failed to restore the backup to my perfectly-matched disk, but I have no idea why. All I do know is that the error message was "vaguely actionable":

Error details: There are too few disks on this computer or one or more of the disks is too small. Add or change disks so they match the disks in the backup and try the restore again. (0x80042401)

This message is completely and utterly useless, because it does not describe what went wrong, only "how" to fix it - but because the software isn't AI-level and doesn't include an expert system, it itself can't produce a specific set of instructions.

In this machine, I had 4x1TB disks, 1x400GB disk, and 1x200GB disk; the backup was on the 400GB disk and the target of the restore was the 200GB disk. 2 of the 1TB disks were blank and formatted. Thus, there was no lack of disks or space. Similarly, the target of the restore was at Disk 0, achieved through careful selection of the SATA connection on the motherboard. Still didn't work though, and I can't fix it because the error message is following the wrong philosophy for our current knowledge of AI.

FWIW, here's another user's experience with this problem. Notice the procedure to actually show the user the actual error, rather than the useless message:

  • Boot from the Vista DVD
  • Go to Repair Computer -> Command Prompt
  • Go into Regedit
  • Under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ create the key: Asr
  • Under Asr create the key LogFileSetting
  • Under LogFileSetting create the dword EnableLogging with the value 1
  • Under LogFileSetting create the string LogPathName with a value such as D:\Asr.log. You should specify a physical drive (e.g. I used the drive you are going to restore from), not the ramdrive (X:), so that the log is saved after reboot.
  • Exit Regedit
  • From the Repair menu launch Complete PC Restore
  • Attempt the Complete PC Restore
  • When you get the error, check the logging path to be sure the Asr.log file exists. I did that by going back to the Command Prompt and getting a directory listing before rebooting.

This is disgraceful, and frankly, unforgivable. Don't do this with your exogenous exceptions.
