Sunday, March 22, 2009

ZFS/Solaris as a NAS

I've finally got a semblance of a Solaris system up and running with a nice fat ZFS storage pool. It was neither trivial nor pleasant, however.

Here's the executive summary of what I learned:

  • Don't use OpenSolaris 2008.11; use at least NexentaCore 2 instead; others may work.
  • Don't use an Intel or Realtek network adapter; if you must, use Realtek with an older version of the gani driver.
  • Don't use Solaris's built-in CIFS implementation, smb/server, aka hereafter "cifs" (to contrast with "samba").
  • Don't let ZFS to default to 128KB max block size unless all your files are big (greater than about 2MB) or tiny (less than about 64KB).

OpenSolaris 2008.11, the version I tried, is not production ready unless you have extremely specific hardware and quite narrow requirements. I finally ended up using NexentaCore Unstable v2.0 Beta 2.

The prime requirements for a working NAS are:

  1. Lots of attachment points for disks
  2. Good physical network connectivity
  3. Good network protocol implementation
  4. Good filesystem implementation
There is no one of these that I didn't have issues with.

The first one, attachment points, was a nice-to-know gotcha that it took me a while to figure out. My motherboard (an old Abit AW9D-MAX) has 7 internal SATA attachment points, as well as 1 PATA channel for two extra PATA devices. This being insufficient, I procured an extra PCIe two-port SATA adapter. I was up and running fine with 5 disks attached (1 boot drive and 4 drives in a raidz ZFS pool - effectively RAID 5 without the need for NVRAM), and in preparation for creating my second raidz vdev (virtual device in ZFS lingo), I started attaching more drives, one by one so that I could label them (and thereby know which one failed when it fails - which it will, eventually). When I had 7 drives attached, the OS occasionally failed to boot, but would reliably hang upon execution of the 'format' command (the handiest way of figuring out the drive device name under Solaris).

The gotcha, however, is that the BIOS defaults to IDE emulation mode, which, for this particular BIOS, supports a maximum of 6 devices. It packs the first 2 hard drives into pretend-PATA (complete with "master" and "slave") even if they aren't actually PATA, and then labels the next 4 drives as SATA 1 through 4. I had to change the BIOS's mode to AHCI to get it to support more drives. Luckily, my boot drive is an old PATA one, so it didn't need compatibility to stay in the same place (master on the first PATA channel).

Physical network connectivity was a far, far harder task, and one that still isn't quite complete. My motherboard has two physical 1Gb Ethernet adapters built in. Unfortunately, OpenSolaris defaults to a driver it calls rge for the chipset (RTL8111/8168B PCIe), but this driver has big issues, that I suspect are related somehow to duplex operations. Big transfers using big buffers using cifs (the Windows-compatible protocol Solaris ships with) will work with reasonable performance (30MB/sec or so), but streaming video (small reads that expect low latency) performs abysmally. Even worse, at non-deterministically random times (as far as I could ascertain), the rge driver would fall into a mode where transfers were extremely slow, as in less than 1MB/sec slow. I took to keeping a ssh session open to the Solaris machine and simply echoing '.' back and forth (echo '.' loop piped to ssh cat -), and mysteriously enough, so long as there was small back and forth traffic, transfer performance rocketed back up again.

This being unsatisfactory, I tried to use the gani driver. I never could get gani-2.6.3 to work with my chipset, however. I didn't know if it was a problem with the driver, a conflict with my chipset, or my poor configuration skills with Solaris. All I know is that I tried every technique I could find, up to and including patching driver_aliases and running sys-unconfig to start things off from a fresh basis.

So, I bought an Intel 1000/PRO Gigabit ethernet adapter. Folks online, in response to reports of rge not working correctly, seemed to say that the Intel adapter "just worked", but I should have dug deeper... The driver Solaris uses for this is called e1000g. The version of this driver shipping in OpenSolaris 2008.11 doesn't work. It drops packets. Simply pinging the Solaris machine from the outside shows packet loss exceeding 5%. With a ssh session open to Solaris, simply holding down '.' on the keyboard and watching the character getting echoed in the terminal is sufficient to demonstrate the hiccups: the pauses are visible and disturbing. It actually affects typing in ssh, almost as if one was at the end of a dodgy Internet connection, perhaps some half-baked cafe wifi link in Marrakesh. A recursive diff of about 100MB of data across about 1000 files using cifs took a whole 45 minutes. A trivial test using dd to copy a 640KB blob of data using a 64KB buffer size took about 15 seconds and dd reported transfer rate of 45KB/sec.

It was this network connectivity that prompted me to start trying out other distributions. NexentaCore 2 was first in line. It defaults to the gani driver for the Realtek adapter, but an earlier version, 2.6.2. This works, but only at 100Mbps; try as I might, I can't convince it to go higher. (Windows running on the same hardware, same patch cable and same switch, never, ever dropped down to 100Mbps - something I can't say about the other Windows machines I have here.) Alos with significant importance, NexentaCore doesn't ship with a GUI (so no need to disable it), and does ship with apt-get and a reasonable selection of packages, including my preferred Linux/Unix editor, joe (very similar to old Borland DOS IDE text editors). I detest vi - I've never used a program that gleefully punished more heavily every transgression of the user.

Anyhow, that's how I got one reasonably-working 100Mbps connection out of three 1Gbps adapters, four driver versions and two operating system installs.

Next, the network protocol. Solaris defaults to using something that is colloquially called cifs, although the usual identifiers used when e.g. enabling and disabling with svcadm are smb/server and smb/client. I can't say much about smb/client - I'm sure it works well enough, but I don't have much use for it - but smb/server barely works at all. It's implemented as a kernel module, which unfortunately means it has odd and painful limitations compared to how user-level programs view the file system. In particular, only one ZFS filesystem can be navigated from a cifs share, significantly harming the usefulness of creating lots of ZFS filesystems. A ZFS file system mounted inside an existing filesystem, where the mountpoint is visible from a cifs share, will show up as an empty directory to clients, whereas it appears correctly mounted locally on the Solaris box.

Another limitation is symbolic links. By apparent design, cifs prohibits what Samba calls "wide links": symbolic links that resolve to locations that are outside the share's hard-linked subtree. Such symbolic links look much like dead links from Windows, i.e. the text little files they are implemented as. Samba defaults to "wide links" on for performance reasons if nothing else.

A final limitation is hard links created from Windows using CreateHardLink. Cygwin 'ln' ultimately uses this API. For whatever reason (and I didn't dig too deeply, like investigating Wireshark traces), Cygwin determines that 'ln' isn't supported on cifs shares and falls back to copying instead. Cygwin 'ln' works correctly on Windows shares, however, and it also works correctly over Samba.

The verdict: cifs isn't worth it. Needs more time to bake. Use Samba instead.

Finally, ZFS itself. ZFS is the big draw, and the reason I chose to use Solaris. Userspace implementations are available in Linux via FUSE, and also in FreeBSD, but I had bad performance experiences of NTFS-3G/FUSE in Linux, and FreeBSD's implementation sounded dangerously non-production ready. ZFS largely works as advertised. The primary limitation is the inflexibility in removing drives from pools. Raidz vdevs can't have drives removed at all. A word of warning, though: ZFS has what I can only consider a bad bug for files in the 128KB..2MB range. If you have a file of about 129KB, depending on how it was created, it ought to be using up about 100% more space than it should be. In some largeish directories of files I was seeing wastage of about 35% (as measured by 'du' versus 'du --apparent-size'), whereas NTFS on Windows had wastage in the region of 5%. Paring back the ZFS default block size (dynamically settable on ZFS - yay! - but only affects subsequent file operations) to 8KB or 16KB improves things immensely, but still not quite as space-efficient as NTFS.

In conclusion, ZFS/Solaris as a NAS can work well with very carefully chosen hardware and select software configuration. If you get it working better than I describe herein, be very careful what you touch, and whatever you do, don't upgrade your zpools until you're sure that whatever step forward you take is better than what you had previously.

Friday, March 06, 2009

OpenSolaris, ZFS, Dvorak and VI

Been experimenting with OpenSolaris to try out ZFS. OpenSolaris is not a pleasant experience even after using an average Linux.

Its termcap and terminfo databases don't include complete information for xterm as implemented by the very gnome-terminal it defaults to, much less gnome-terminal itself; nor does it include any entries for rxvt derivatives (so I have to export TERM=xterm upon ssh in from Cygwin). Thus, there is much fun with Del, Home, End, PgUp, PgDn etc. brokenness: Ctrl+A/E instead of Home/End, Ctrl+D instead of Del, f/b in less rather than PgUp/PgDn, etc. The terminfo DB is spartan even by comparison with Cygwin, and termcap rather, eh, antique:

solaris$ find /usr/share/lib/terminfo/ | wc
    1716    1716   58553
cygwin$ find /usr/share/terminfo/ | wc
   2544    2544   79257

solaris$ grep 1982 /usr/share/lib/termcap
# From research!ikeya!rob Tue Aug 31 23:41 EDT 1982
# From jwb Wed Mar 31 13:25:09 1982 remote from ihuxp
# Extensive changes to c108 by arpavax:eric Feb 1982

Also, there is apparently no text-mode Dvorak layout shipped with OpenSolaris. I managed to patch something rudimentary together reverse engineering the contents of /usr/share/lib/keytables and applying with loadkeys. Within X, xmodmap can load an appropriate translation, but getting that working before gdm comes up for the login and password screen is tedious. Getting a vnc server session running gdm is easier.

Finally, I have to resort to using vi to edit basic files. Vi is a line editor masquerading as a text editor. Now I feel like I'm living in the 80s. The early 80s.