Sunday, March 22, 2009

ZFS/Solaris as a NAS

I've finally got a semblance of a Solaris system up and running with a nice fat ZFS storage pool. It was neither trivial nor pleasant, however.

Here's the executive summary of what I learned:

  • Don't use OpenSolaris 2008.11; use at least NexentaCore 2 instead; others may work.
  • Don't use an Intel or Realtek network adapter; if you must, use Realtek with an older version of the gani driver.
  • Don't use Solaris's built-in CIFS implementation, smb/server, hereafter "cifs" (to contrast with "samba").
  • Don't let ZFS default to its 128KB maximum block size unless all your files are big (greater than about 2MB) or tiny (less than about 64KB).

OpenSolaris 2008.11, the version I tried, is not production ready unless you have extremely specific hardware and quite narrow requirements. I finally ended up using NexentaCore Unstable v2.0 Beta 2.

The prime requirements for a working NAS are:

  1. Lots of attachment points for disks
  2. Good physical network connectivity
  3. Good network protocol implementation
  4. Good filesystem implementation

I had issues with every one of these.

The first one, attachment points, was a nice-to-know gotcha that it took me a while to figure out. My motherboard (an old Abit AW9D-MAX) has 7 internal SATA attachment points, as well as 1 PATA channel for two extra PATA devices. This being insufficient, I procured an extra PCIe two-port SATA adapter. I was up and running fine with 5 disks attached (1 boot drive and 4 drives in a raidz ZFS pool - effectively RAID 5 without the need for NVRAM), and in preparation for creating my second raidz vdev (virtual device in ZFS lingo), I started attaching more drives, one by one so that I could label them (and thereby know which one failed when it fails - which it will, eventually). When I had 7 drives attached, the OS occasionally failed to boot, but would reliably hang upon execution of the 'format' command (the handiest way of figuring out the drive device name under Solaris).
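For reference, the pool setup itself is brief; a sketch, assuming hypothetical device names c1t0d0 through c1t3d0 (yours will differ, which is exactly what 'format' is for):

```shell
# List attached disks and their Solaris device names; this is the
# command that reliably hung with 7 drives in IDE-emulation mode.
# Redirecting stdin makes the normally-interactive tool just list and exit.
format < /dev/null

# Create a 4-drive raidz pool (device names here are illustrative).
zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0
zpool status tank
```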

The gotcha, however, is that the BIOS defaults to IDE emulation mode, which, for this particular BIOS, supports a maximum of 6 devices. It packs the first 2 hard drives into pretend-PATA (complete with "master" and "slave") even if they aren't actually PATA, and then labels the next 4 drives as SATA 1 through 4. I had to change the BIOS's mode to AHCI to get it to support more drives. Luckily, my boot drive is an old PATA one, so it didn't need compatibility to stay in the same place (master on the first PATA channel).

Physical network connectivity was a far, far harder task, and one that still isn't quite complete. My motherboard has two physical 1Gb Ethernet adapters built in. Unfortunately, OpenSolaris defaults to a driver it calls rge for the chipset (RTL8111/8168B PCIe), but this driver has big issues that I suspect are somehow related to duplex operation. Big transfers using big buffers over cifs (the Windows-compatible protocol Solaris ships with) work with reasonable performance (30MB/sec or so), but streaming video (small reads that expect low latency) performs abysmally. Even worse, at random times (non-deterministic as far as I could ascertain), the rge driver would fall into a mode where transfers were extremely slow, as in less than 1MB/sec slow. I took to keeping an ssh session open to the Solaris machine and simply echoing '.' back and forth (an echo '.' loop piped to ssh cat -), and mysteriously enough, so long as there was small back-and-forth traffic, transfer performance rocketed back up again.
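The keep-alive hack looked roughly like this (a sketch; the hostname is a placeholder and the exact loop I used is paraphrased):

```shell
# Trickle one tiny packet per second over ssh to keep the rge link
# from falling into its slow mode; 'nas' stands in for the hostname.
while true; do echo '.'; sleep 1; done | ssh nas 'cat -'
```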

This being unsatisfactory, I tried to use the gani driver. I never could get gani-2.6.3 to work with my chipset, however. I didn't know if it was a problem with the driver, a conflict with my chipset, or my poor configuration skills with Solaris. All I know is that I tried every technique I could find, up to and including patching driver_aliases and running sys-unconfig to start things off from a fresh basis.
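For the record, rebinding a chipset to a different driver normally goes through update_drv (or a hand-edit of /etc/driver_aliases); a sketch, where the PCI ID shown is an assumption you should verify against your own hardware:

```shell
# Find the adapter's PCI vendor/device ID (10ec is Realtek's vendor ID).
prtconf -pv | grep -i pci10ec

# Bind the gani driver to that ID (the ID here is illustrative).
update_drv -a -i '"pci10ec,8168"' gani

# Then reboot with device reconfiguration - or, failing that, the
# full sys-unconfig dance described above.
touch /reconfigure && reboot
```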

So, I bought an Intel PRO/1000 Gigabit Ethernet adapter. Folks online, in response to reports of rge not working correctly, seemed to say that the Intel adapter "just worked", but I should have dug deeper... The driver Solaris uses for this is called e1000g. The version of this driver shipping in OpenSolaris 2008.11 doesn't work. It drops packets. Simply pinging the Solaris machine from the outside shows packet loss exceeding 5%. With an ssh session open to Solaris, simply holding down '.' on the keyboard and watching the character get echoed in the terminal is sufficient to demonstrate the hiccups: the pauses are visible and disturbing. It noticeably affects typing in ssh, almost as if one were at the end of a dodgy Internet connection, perhaps some half-baked cafe wifi link in Marrakesh. A recursive diff of about 100MB of data across about 1000 files using cifs took a whole 45 minutes. A trivial test using dd to copy a 640KB blob of data using a 64KB buffer size took about 15 seconds; dd reported a transfer rate of 45KB/sec.
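The dd test was along these lines (file and mount-point paths are placeholders):

```shell
# Copy a 640KB blob (10 x 64KB) to the cifs share with a 64KB buffer;
# GNU dd reports the effective throughput when it finishes.
dd if=blob.bin of=/mnt/nas/blob.bin bs=64k count=10
```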

It was this network connectivity that prompted me to start trying out other distributions. NexentaCore 2 was first in line. It defaults to the gani driver for the Realtek adapter, but an earlier version, 2.6.2. This works, but only at 100Mbps; try as I might, I can't convince it to go higher. (Windows running on the same hardware, with the same patch cable and the same switch, never, ever dropped down to 100Mbps - something I can't say about the other Windows machines I have here.) Also of significant importance, NexentaCore doesn't ship with a GUI (so no need to disable it), and does ship with apt-get and a reasonable selection of packages, including my preferred Linux/Unix editor, joe (very similar to the old Borland DOS IDE text editors). I detest vi - I've never used a program that punished every transgression of the user more gleefully.
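For anyone wanting to pick the same fight, forcing the advertised link speed goes through dladm link properties on builds of this vintage; a sketch only - the interface name gani0 and the property name are assumptions, and property names vary by driver, so inspect before setting:

```shell
# Inspect which link properties the driver actually exposes.
dladm show-linkprop gani0

# Attempt to advertise gigabit full duplex (property name is the
# conventional one, but verify it against show-linkprop output first).
dladm set-linkprop -p adv_1000fdx_cap=1 gani0
```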

Anyhow, that's how I got one reasonably-working 100Mbps connection out of three 1Gbps adapters, four driver versions and two operating system installs.

Next, the network protocol. Solaris defaults to using something that is colloquially called cifs, although the usual identifiers used when e.g. enabling and disabling with svcadm are smb/server and smb/client. I can't say much about smb/client - I'm sure it works well enough, but I don't have much use for it - but smb/server barely works at all. It's implemented as a kernel module, which unfortunately means it has odd and painful limitations compared to how user-level programs view the file system. In particular, only one ZFS filesystem can be navigated from a cifs share, significantly harming the usefulness of creating lots of ZFS filesystems. A ZFS file system mounted inside an existing filesystem, where the mountpoint is visible from a cifs share, will show up as an empty directory to clients, whereas it appears correctly mounted locally on the Solaris box.
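For reference, this is the in-kernel path I mean; the commands are the standard ones, with pool/filesystem/share names as placeholders:

```shell
# Enable the kernel CIFS server (with its dependencies) and share
# a ZFS filesystem under a chosen share name.
svcadm enable -r smb/server
zfs set sharesmb=name=media tank/media

# A child filesystem like this mounts fine locally, but shows up to
# cifs clients as an empty directory - the limitation described above.
zfs create tank/media/photos
```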

Another limitation is symbolic links. By apparent design, cifs prohibits what Samba calls "wide links": symbolic links that resolve to locations outside the share's directory subtree. Such symbolic links look much like dead links from Windows, i.e. the little text files they are implemented as. Samba defaults to "wide links" on, for performance reasons if nothing else.
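Under Samba, the behaviour is a per-share setting; a minimal sketch (share name and path are illustrative, and the defaults have shifted between Samba versions, so verify rather than assume):

```shell
# In smb.conf, under the share definition (sketch):
#   [media]
#     path = /tank/media
#     wide links = yes
#     follow symlinks = yes
#
# Validate the configuration and see what Samba actually resolved:
testparm -s
```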

A final limitation is hard links created from Windows using CreateHardLink. Cygwin 'ln' ultimately uses this API. For whatever reason (and I didn't dig too deeply, like investigating Wireshark traces), Cygwin determines that 'ln' isn't supported on cifs shares and falls back to copying instead. Cygwin 'ln' works correctly on Windows shares, however, and it also works correctly over Samba.

The verdict: cifs isn't worth it. Needs more time to bake. Use Samba instead.

Finally, ZFS itself. ZFS is the big draw, and the reason I chose to use Solaris. Other implementations are available - via FUSE in Linux, and in FreeBSD - but I had bad performance experiences with NTFS-3G/FUSE in Linux, and FreeBSD's implementation sounded dangerously non-production-ready. ZFS largely works as advertised. The primary limitation is the inflexibility in removing drives from pools: raidz vdevs can't have drives removed at all. A word of warning, though: ZFS has what I can only consider a bad bug for files in the 128KB..2MB range. A file of about 129KB, depending on how it was created, can consume nearly 100% more space than it should. In some largeish directories of files I was seeing wastage of about 35% (as measured by 'du' versus 'du --apparent-size'), whereas NTFS on Windows had wastage in the region of 5%. Paring back the ZFS default block size (dynamically settable on ZFS - yay! - though it only affects subsequent file operations) to 8KB or 16KB improves things immensely, but the result is still not quite as space-efficient as NTFS.
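The tuning and the measurement look like this; pool/filesystem names are placeholders, and the allocated sizes you see will depend on your own file mix:

```shell
# On the ZFS side, the tuning is one command, affecting only
# subsequently written files (names are placeholders):
#   zfs set recordsize=16K tank/media
#
# The wastage measurement itself, demonstrated on a 129KB file:
dd if=/dev/zero of=/tmp/f129 bs=1k count=129 2>/dev/null
du -k --apparent-size /tmp/f129   # logical size: 129
du -k /tmp/f129                   # allocated size; on a 128KB-recordsize
                                  # ZFS filesystem this would be ~256
```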

In conclusion, ZFS/Solaris as a NAS can work well with very carefully chosen hardware and select software configuration. If you get it working better than I describe herein, be very careful what you touch, and whatever you do, don't upgrade your zpools until you're sure that whatever step forward you take is better than what you had previously.


Anonymous said...

I was thinking about having a NAS setup for my storage, and I had almost the same hardware configuration as you described in mind, but I was thinking about Linux/XFS.
I'm wondering whether you chose Solaris/ZFS out of curiosity, or whether ZFS really had considerable benefits. It's because whenever I've moved toward Unix implementations (OpenSolaris, FreeBSD and OpenBSD) I've always had very unpleasant experiences and have moved back to Linux or even Windows.

Like you, I'll use the machine mostly for big video & audio files, or database backups, and will have no source code on it, with terabyte-sized disks in RAID 5, as new hard disks tend to fail more than older hard disks.

Can you give any insights into your particular OS/FS selection, especially as it sounds like it gave you quite a headache?

Barry Kelly said...

I wanted to have redundancy, but not so much as mirroring, so I was considering RAID-5. However, I have had bad experiences with hardware raid, and am in general leery of the quality of hardware that tries to be intelligent, so I wanted a software implementation. For normal RAID-5, however, that requires non-volatile memory between the controllers and the drives - this is the RAID-5 write hole.

ZFS has a pool strategy it calls RAID-Z, which is basically RAID-5 but with atomic update logic, so that transactional semantics apply to the writing - either it all gets through, or nothing gets through. In addition, ZFS uses checksums on disk to detect bit rot, i.e. silent corruption of the data.

This, combined with the fact that ZFS can pool storage from multiple virtual devices (each of which may be mirrored, raidz, etc.) and treat the storage like one big bunch of drives, was attractive because it meant that I wouldn't have to worry about how much data was on which drive, or about shuffling things around. That can get very tedious, particularly since transferring lots of files takes a long time, even at good sequential read and write rates (100MB/sec), because of non-adjacent directory structure I/O etc.
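Concretely, growing the pool is a matter of adding another vdev; a sketch with hypothetical device names:

```shell
# Append a second 4-drive raidz vdev to the existing pool; ZFS
# stripes new writes across both vdevs, so there's no manual
# shuffling of data between drives.
zpool add tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0
zpool list tank
```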

Thorsten Engler said...

From personal experience I can very much recommend a 3ware 9650SE + battery backup module as a RAID6 hardware solution that just works. And works well.

I've got 2 of 'em with 8 1TB drives in SATA hotswap bays each in a server running W2K8.

The BBU keeps the cache alive for up to 72 hours in case of power outage.

The RAID6 gives double redundancy (some time ago I had a RAID5 fail fatally when a 2nd drive failed while the first was rebuilding after a fail, so I'm only going for RAID6 now).

The controllers perform a full verify every night (takes about 2.5 hours at low priority) which detects failing drives early (and rebuilds any defective sectors found using the reserve sectors to keep redundancy till the failing drive can be replaced).

Replacing failed drives while the system is in use works flawlessly (full rebuild takes about 4-5 hours).

Performance is excellent with about 600MB/s when copying between the 2 arrays.

Anonymous said...

I have always worked in Windows centric environments. At the same time I have opted for the last 6 years to use Samba for storage. I have used Samba at home for 8 years. Currently, I am using ZFS/CIFS at home. It's a breeze and works wonderfully on a home server. 5x 1TB drives on an Intel dual core desktop with 4GB ram, total investment under $1,200 USD.

To simply say Samba is better is somewhat incomplete. ZFS/CIFS has features that aren't available with a ZFS/Samba implementation, and the inverse is true as well. Either implementation will have caveats.

OpenSolaris is a wise choice for storage. Buying RAID cards for *nix OSes is silly in many situations. If one is using Windows or using apps that are write and processor intensive then buy a RAID card.

All in all I must admit, implementing an OpenSolaris storage device for home or business is fairly simple if one checks the HCL, reviews the features and compares them to their needs. Just like anything, a little analysis will smooth the implementation and maintenance process.

Anonymous said...

Dude, thanks for putting this together. I have 6x500Gb HDs around, and was to install them on a new machine, which was to use Solaris + ZFS - saving me the $$$$ of a RAID card.

Now looks like it might be less painful to spend the money on said card instead . . .