About ZFS and high-end storage

I had an interesting discussion today that inspired me to write this blog entry. The discussion was about ZFS and those large storage boxes from EMC, HDS (and their OEM customers Sun and HP) or IBM. I'm talking about their top-end systems … not their mid-range modular storage offerings or the entry-level arrays. Those components have a certain disadvantage: the disk space is more expensive when compared to midrange storage or JBODs. There are many good reasons to spend this money, and when you need those reasons they are worth their bucks. However … this is not the point of this blog entry.

A basic question - some basic answers

The question is a different one, a more basic one: Should you use the RAID in the box and provide just a single LUN to the system, working without redundancies provided by ZFS, or should a customer use the RAID in ZFS to aggregate many small LUNs?
Of course, the latter is somewhat out of the question when you have such big disk boxes. You already have them, and often you bought them for the data services available in the box, which aren't available in ZFS or other less sophisticated combinations of filesystems/volume managers. It doesn't sound reasonable to provide small LUNs from the box just to RAID-Z them in the server - you should have bought JBODs if you just wanted to use your high-end storage as an expensive disk rack ;)

The other extreme - the single-LUN configuration - has a problem, too: ZFS can detect errors without redundancies, but it can't repair them. So when the big storage box provides just a single LUN, you can't repair corrupted data. You have to go to your tapes to recover the missing data.
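To make the two extremes concrete, here is a minimal sketch with the zpool command - the device names (c2t0d0 and so on) are hypothetical placeholders for the LUNs your array presents:

    # Extreme 1: one big LUN from the box - ZFS detects corruption,
    # but has no redundancy of its own to repair it
    zpool create tank c2t0d0

    # Extreme 2: many small LUNs, aggregated by RAID-Z in the server -
    # redundancy in ZFS, but you paid high-end prices for JBOD duty
    zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0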

An attempt at giving not-so-basic answers

Now you can say: "Hey, we've spent a lot of money for the box; they are perfect, they are proven. They never corrupted a single bit." Let's assume for a second that this is true. Some high-end storage indeed goes to great lengths to protect the data it receives on its ports to the outside world - ECC everywhere and so on. But can you say the same for the CPU, the southbridge, the PCIe buses, the HBA (hardware and firmware), the GBIC, the fibre, the switch (hardware and firmware), the next fibre, the GBIC in the storage box? Do you really want to vouch for every component you purchased from the cheapest vendor, that there never was a single corrupted bit and never will be? Really? The value of the ZFS checksums is that they start at the CPU, not at the storage. In the end even the most sophisticated storage system is just a BIBO device: when you put bullshit into it, it will give you bullshit back, albeit perfectly protected. Hey, if you really think that the whole way down to the disks is safe, you could just as well switch off checksums and save many CPU cycles … ;)

High-end storage may protect you against some failure modes, but not all … especially not those outside the control of the high-end storage. But by using just a single LUN you give up an essential feature of ZFS: the auto-repair capability. ZFS simply doesn't have the redundant data to copy from or to recalculate the correct data. With RAID1 it could check whether the data on the other half of the mirror is correct … with RAIDZ it could do the combinatorial reconstruction. But using just a single LUN doesn't allow this.

That's not to say that people using just one LUN are morons. Not at all, and that's the point where the big storage boxes come into the game: with the prices of storage in the high end, you try to save as much capacity as possible, and creating a mirror to automatically repair data is something you really think about before going to your management to ask for the money to double the capacity.

I think the answer is somewhere in between: a ZFS RAID-1 or RAID-Z built out of two or more RAID-5 or RAID-0 LUNs provided by the storage system where you can't afford to go to the tapes - combining the advantages of both worlds - and a single LUN where such a situation is acceptable. Of course the sweet spot of ZFS is the usage of cheap storage: you just use cheap disks, many of them, and the problems of cheap storage disappear in the software. But even with high-end storage you have to think about using some of the mechanisms that are indispensable when using cheap storage.

As an alternative you could use a special feature of ZFS: by setting the property copies with the zfs command you can tell ZFS to write copies of your data on the same LUN. As you control this at a "per dataset" granularity, you can use it just for your most important data. But the basic problem for many people stays the same: it's like RAID1 on a single disk … you double the space consumption of your data. So the basic challenge for people with big storage arrays with a high $-value per gigabyte is still the same: you have data redundancy (RAID1, RAID5, RAID6) in your box to protect the availability of the data, and then you add redundancy (copies) on top to protect the end-to-end integrity of your data. While this is similar to a ZFS RAID1 out of two RAID5 LUNs, a RAIDZ out of RAID-0 LUNs is more space efficient.
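A minimal sketch of these in-between configurations with the zpool and zfs commands - the device names (c2t0d0 and so on) and the dataset name are hypothetical placeholders; the LUNs themselves would be the RAID-5 or RAID-0 volumes configured in the array:

    # ZFS mirror over two RAID-5 LUNs: self-healing, at the price of half the usable capacity
    zpool create tank mirror c2t0d0 c3t0d0

    # more space-efficient: RAID-Z over several RAID-0 LUNs
    zpool create tank raidz c2t0d0 c2t1d0 c3t0d0 c3t1d0

    # or: a single-LUN pool with extra copies just for the most important data
    # (copies only affects data written after the property is set)
    zpool create tank c2t0d0
    zfs create tank/important
    zfs set copies=2 tank/important

    # a scrub walks all data and repairs whatever the available redundancy allows
    zpool scrub tank

With any of the redundant layouts, a scrub (or an ordinary read that hits a bad checksum) lets ZFS rewrite the corrupted block from the intact copy instead of sending you to the tapes.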
For notebooks and desktops, though, this copies feature is really great. The disks in those devices got too big a long time ago - do you really need half a terabyte in your notebook? You can use all this capacity to increase the availability of your data. However, it doesn't help you if your hard disk dies completely.

The fruit of knowledge

But do me a favor: Don't think this is a problem of ZFS just because no other filesystem forces these thoughts on you. ZFS is just another bite of the fruit of knowledge: Adam and Eve knew that they were naked … we know that data moving around is at risk of being corrupted. Corruption happens. And even when every component has some protection by checksums, there is still a probability that a corruption isn't detected. And even the most sophisticated storage array doesn't protect you against errors in the firmware of the various components involved in writing data, in the OS, or just against a flaky GBIC … perhaps writing perfectly correct data at a perfectly incorrect location.

Conclusion

In the end it's all about decisions again. Engineering (and sizing and the planning of system architectures is engineering) is to a large part the science of acceptable trade-offs. What's the probability of an error, what are the costs of recovering from an unrecoverable error, and what are the additional costs of implementing storage redundancy in ZFS? Can the customer live with going to the tapes, or does he need the corrected data right here, right now? The customer has to answer this question. That's the point where Sun can help but can't decide, because in the end it's your data, not ours ;)