MAID, ZFS and some further thoughts ...
I had an interesting discussion with a blog reader and customer via ICQ (oh, wonderful new world of business). It started with a question about the J4500 and ended in a discussion about whether MAID is really a viable way of storing data, or just one of those hypes working their way through the storage business every year.
MAID? What is MAID? MAID stands for Massive Array of Idle Disks. The idea is to save power by switching off disks while they aren't in use. You can buy such components from companies like Copan.
At first: I don't really think this technology justifies companies that do MAID for a living. It's too easy to integrate this feature into other systems. In the end, MAID is just hard disk power management in action. When a disk isn't used, switch it off. When you need data from it, switch it on again. As it would be senseless to switch off a disk just to reactivate it two seconds later (or the disk wouldn't be switched off at all if you chose a longer interval), the concept of cache drives was added to MAID. These drives cache the most frequently requested blocks. Unlike the MAID drives, the cache drives are active at all times. As the cache drives satisfy a large share of the disk requests, the MAID drives don't have to be activated, and the power management of the hard disks can keep them idle. In the end, that's all there is to it.

Well ... then I started to think about it and came to an interesting conclusion: every OpenSolaris user has the components to build their own MAID. Let's assume you have an array with 48 disks, the Sun Fire X4540. There are three important functions in OpenSolaris that help to implement a MAID-like storage array:
- the L2 ARC
- the separated ZIL
- the copy-on-write behaviour of ZFS
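To make this more concrete, here is a minimal sketch of what such a pool could look like on an X4540. This is just an illustration, not a tested configuration: the pool name and all device names are invented, and on a real X4540 you would have to substitute the actual controller/target numbers.

```shell
# Hypothetical MAID-like pool: mirrored pairs for the data,
# 4 disks as L2ARC cache devices, a mirrored pair as separated ZIL.
# All device names below are made up for this sketch.
zpool create maidpool \
    mirror c1t0d0 c1t1d0 \
    mirror c1t2d0 c1t3d0 \
    cache  c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
    log    mirror c3t0d0 c3t1d0
```

The `cache` keyword adds the L2ARC devices, the `log` keyword adds the separated ZIL; the remaining mirrored pairs hold the data and are the candidates for spinning down.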
Let's assume the following configuration: you take the CF card of the X4540 to boot the system, you take 4 disks to build a 4 TB L2ARC, and you take 2 disks for the separated ZIL. A well-primed L2ARC should satisfy most read requests from the cache. The separated ZIL deflects the writes from the data pool until the ZIL is committed to the pool. At this moment the COW behaviour of ZFS kicks in: data is never overwritten at its original location; new blocks are written at a new position. Assuming the pool is constructed as a RAID0 of RAID1 (a stripe of mirrored pairs), the writes should be concentrated on two disks. Thus the standard power management of the hard disks should kick in.

Other systems could use this MAID-by-OpenSolaris via the usual protocols: CIFS, NFS, iSCSI, FC target, and later iSER and SAS target.

But there is one problem at the moment: ZFS is too intelligent. It tries to level the load on the disks by directing writes to the least loaded vdev. For MAID we would need a real concat, which writes data to the first free blocks of a disk and only jumps to the next disk when no free space is left on the current one (okay, I assume it's RFE time). Besides this last component, everything is already implemented in ZFS and Solaris, and even for this last problem I have an idea for a quick hack: simple and plain just-in-time addition of new disks when the old disks are 95% full. This is the reason why I really think that companies that do MAID for a living are indeed living on borrowed time.

But the discussion ended with a different thought. I have my problems with the MAID concept. I'm now 15 years in this business, and my experience tells me that electronic devices (especially hard disks) tend to die in the moment you switch them on ... not while they are running. Now MAID implements a mechanism that deliberately switches electronic devices on and off, on and off, and so on. I'm not really convinced of the long-term reliability of such a concept.
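Coming back to the just-in-time hack mentioned above: it could be as simple as a small script run from cron. The following is only a sketch under invented assumptions (pool name `maidpool`, a 95% threshold, made-up device names), not a production tool.

```shell
#!/bin/sh
# Sketch of the "just-in-time addition of new disks" hack:
# when the pool crosses the threshold, add one more mirrored pair.
# Pool name, threshold, and device names are invented for this example.

POOL=maidpool
THRESHOLD=95

# Strip the trailing '%' from the capacity column of `zpool list`.
capacity_pct() {
    echo "$1" | tr -d '%'
}

# Only talk to zpool where it actually exists on this machine.
if command -v zpool >/dev/null 2>&1; then
    cap=$(capacity_pct "$(zpool list -H -o capacity $POOL)")
    if [ "$cap" -ge "$THRESHOLD" ]; then
        # Grow the pool by one more mirrored pair; ZFS will direct new
        # writes to the empty vdev, so the full disks can stay idle.
        zpool add $POOL mirror c1t6d0 c1t7d0
    fi
fi
```

Crude, but it mimics concat-like behaviour with what `zpool add` gives us today.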
But perhaps I'm too old-fashioned with my standpoint that data is only safe when it has been written to two different tapes, with two different tape drives, stored at two different locations.