No, ZFS really doesn't need a fsck

There is a discussion at osnews.com about a simple question: “Should ZFS Have a fsck Tool?”. The answer is simple: No. I could stop now, as this answer is pretty obvious once you have worked with ZFS for a while, but I want to explain my position. And I want to ask a different question at the end. I already wrote this explanation in the comment section of osnews.com, so the following text is a reworked version of my comment at that site.
(Update: There is a new article from March 2013 about this topic - No, ZFS still doesn’t need a fsck. Really!)

Filesystems, checks and promises

Yes, there are situations that put a ZFS pool in a state rendering it unimportable, thus “crashing” the filesystem. And yes, I know … we say ZFS is crash-proof and always consistent and so on. Both are true. We are talking about a situation that occurs when hardware components aren’t telling the truth about what they do and how they do it. That is in no way the normal use case of any filesystem. And still the statement “consistent and crash-proof” remains true. Some of the characteristics of ZFS allow you to recover into a consistent state even under these circumstances, it just needs help to do so. But it doesn’t need the help of a fsck tool. It doesn’t need such a tool because of differences in its internals when compared to other combinations of LVM and filesystems.

The plague of sub-sub-substandard components

To allow ZFS to be crash-proof (in the sense of automatically reacting to such a situation), certain really basic mechanisms must be implemented in a way that adheres to specifications and standards. For example: FLUSH CACHE should only return when the cache is actually flushed. But there are dirt-cheap converter chips that send the FLUSH CACHE to the disk, yet report a successful FLUSH CACHE back to the OS in the same moment (of course without having NVRAM on the disk or in a controller, which would make it legitimate to ignore the CACHE FLUSH). Or interface converters that reorder commands in really funny ways. Through such reordering it may happen that the uberblock is written to disk before the rest of the structure has been written. Filesystems are hardened against many failures today, but against devices that don’t even adhere to basic standards you have no chance. This can happen with any filesystem. Storage components not adhering to the most basic standards are really a plague in IT, and I have no idea why companies get away with it. Perhaps because everybody points to the universal scapegoat “Windows” in this case instead of the real root of the problem.
But perhaps it’s just because a special situation is needed to see the problem: the write stream has to be interrupted before all data has reached the disk, for example by unplugging power or USB from the disk. Only in this case is it harmful that the new uberblock was written before the rest of the data. Normally the new uberblock is written as the last block of any transaction. As long as this new uberblock isn’t written to disk, the valid on-disk state is still the old one. Until the new uberblock is on disk, the new blocks are just rubbish after a crash and not part of the on-disk state. But now think about the situation that the uberblock is written before the data because a converter reorders commands at its own will. This uberblock points to a state that isn’t complete. But as it is the uberblock with the highest transaction group number and a correct checksum, the import of the filesystem starts here. Gotcha - an unimportable state when the disk loses power at exactly this moment.
There was another question on osnews.com: why aren’t NTFS or ext3 complaining about that situation, too? Well … they have the same problem, but it has different effects that aren’t as visible as in ZFS. ZFS can end up with an uberblock pointing to nowhere (that’s easily repairable), other filesystems end up with an inconsistent state of the data, or storage for a database that no longer adheres to the necessities of ACID (“Ladies and gentlemen, start the tapes”). And by the way: with NTFS or ext3 you can’t know if your data is consistent and unharmed. You have no way to check it. There is no scrub for these filesystems, as there are no checksums. You can only say that your filesystem metadata is consistent. And even more important: Linux has similar problems with problematic hardware and software, like components not honoring write barriers. Interestingly, one of these problematic components is the LVM. Support for write barriers in linear mode isn’t that old, and only in really recent versions of Linux does the LVM honor write barriers in more complex RAID levels. Thus on Linux you had to trade the integrity of your data for availability, or you had to sacrifice performance by switching off write caching on the disks. But that’s a different story.
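If you suspect such a component, the workaround mentioned above - switching off the write cache of the disk itself - is something you can do from the command line. A minimal sketch for Linux, assuming the disk shows up as /dev/sdb (the device name is just an example):

  # query the current write-cache setting of the disk
  hdparm -W /dev/sdb

  # switch the write cache off, trading performance for integrity
  hdparm -W 0 /dev/sdb

It costs you a lot of write performance, but at least a lying cache can’t bite you anymore.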
In the end, all operating systems suffer from components whose firmware has bugs or cuts corners for a shorter time to market. And those are the problems we are talking about here. Not the normal “power failure while writing” stuff. Most filesystems are hardened against that problem reasonably well. And ZFS has several protections against the normal dangers of filesystem operation: for example copies of metadata blocks, the described method of an always-consistent on-disk state, checksums on metadata, autorepair functionality and so on. No … we are talking about a greater and deeper problem.

Why doesn’t ZFS need a fsck?

With ZFS you can tackle the problem from a different perspective. You do not repair it, you jump to a consistent state slightly before the crash. How does this work? First you have to keep two things in mind (sorry, simplifications ahead): ZFS works with transaction groups, and ZFS is copy-on-write. Furthermore you have to know that there isn’t a single uberblock, there are 128 of them (the transaction group number modulo 128 selects the exact uberblock used for a certain transaction group). Given these points, there is a good chance that you have a consistent state of your filesystem from shortly before the crash and that it hasn’t been overwritten since, thanks to the COW mechanism in ZFS.
An effect of the transactional behaviour is that you have older consistent states on your disks as well. In the end the on-disk state is just the consequence of all the transactions that took place from transaction group 0 to the last transaction group. Sounds obvious, but this is the key to the solution. ZFS doesn’t overwrite live data, so it’s perfectly possible that you have several perfectly mountable and perfectly consistent older states of the data in your pool on the disks. In a normal situation you don’t need this effect, but things are different at the moment: we have the problem that the last state was corrupted by functionally challenged disk subsystems. ZFS supports this recovery through another behaviour of its implementation: as far as I understand the inner workings of ZFS, freed blocks aren’t even reused immediately, so getting back to an importable state is almost guaranteed. The reuse of blocks was deferred with the putback of PSARC 2009/479. (Okay, at most you can have 127 consistent old states, because there are 128 versions of the uberblock, but it is unlikely that normal write operations don’t touch a single freed block while writing the last 128 transaction group commits, and the system doesn’t defer reuse that long.)
So you just have to roll back transaction groups until you have a state that can be scrubbed without errors. The result of these steps is a recovered state that is consistent and whose integrity has been validated - metadata as well as normal data. You just lost the last few transactions. So you don’t need a fsck tool, you need a tool for a transaction rollback of your filesystem. And tools to check if you recovered to a consistent state. Anyway: you do not repair the last state of the data. And in my opinion you should not try to repair it … at least not by automatic means. Such a repair would be risky in any case. Do you really know what the disk has done, and in which sequence, when it or components on the way to the disk don’t even adhere to the basics? You have to keep in mind that we got into this situation because of disks you can’t really trust. In this situation I would just take my money from the table and call it a day. You may lose the last few changes, but your tapes are older anyway.
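To make this less abstract: on an importable pool you can look at the uberblock bookkeeping yourself, and after any rollback a scrub is the tool that validates the state you landed on - data and metadata alike. A minimal sketch, assuming a pool named tank:

  # display the currently active uberblock of the pool, including its
  # transaction group number (txg)
  zdb -u tank

  # after a rollback, let the checksums validate everything end to end
  zpool scrub tank
  zpool status -v tank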

PSARC 2009/479

Anyway … the outcome of PSARC 2009/479 is exactly the kind of tool mentioned above. You can use those zpool commands to import an unimportable pool by rolling back transactions made to your filesystem until you are back at an importable state, so that the other means of correcting the filesystem (like redundancies, self-healing et al.) can do their work. Some may call the result of PSARC 2009/479 something like a fsck tool, but it isn’t. It just leverages the transactional behaviour of ZFS to enable other tools to do their work. It just activates another uberblock as the current one. Of course it would be easy now to add a function that searches for the first importable state automatically, but the decision to do a recovery should be a conscious one, not one made by a machine. Perhaps you want to make an image of your disk before you try to recover it … or you want to see a short summary of what the recovery would do without actually doing it. Surely it took a little long to get such a functionality into the ZFS toolset, but I can only assume that it wasn’t clear how commonplace such functionally challenged components are.
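To give an idea what this looks like in practice: the recovery support shows up as additional options to zpool import. A minimal sketch, again assuming a pool named tank - check the zpool manpage of your release for the exact set of options:

  # dry run: report whether discarding the last few transactions would
  # make the pool importable again, without actually doing it
  zpool import -nF tank

  # do the recovery import, rolling back to the last importable state
  zpool import -F tank

The dry run is exactly the “short summary” mentioned above: you see what you would lose before you decide to lose it.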

Just an idea: Rolling Recovery Snapshots

By the way … I just had an idea: Rolling Recovery Snapshots. The last 10 or 100 transaction group commits would automatically be available as snapshots. As blocks used in snapshots will not be reused until the snapshots are deleted by the automatism, it’s ensured that you have a recovery state in the filesystem that wasn’t harmed by the write operations of the interrupted transaction group commit, even if you freed blocks shortly before the last transaction group commit. I think I will write an RfE for it …
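Until something like that exists in the filesystem itself, you could approximate the idea in userland with nothing but standard zfs commands - a crude sketch, driven per cron interval instead of per transaction group commit, with made-up dataset and snapshot names:

  #!/bin/sh
  # crude approximation of rolling recovery snapshots: take a timestamped
  # snapshot of an example dataset and keep only the newest $KEEP of them.
  # (assumes a POSIX shell; dataset name and retention count are made up)
  POOL=tank/data
  KEEP=100

  zfs snapshot "$POOL@recovery-$(date +%Y%m%d%H%M%S)"

  # count the recovery snapshots and destroy the oldest ones beyond $KEEP
  total=$(zfs list -H -t snapshot -o name -s creation -r "$POOL" | grep -c "^$POOL@recovery-")
  excess=$((total - KEEP))
  if [ "$excess" -gt 0 ]; then
    zfs list -H -t snapshot -o name -s creation -r "$POOL" \
      | grep "^$POOL@recovery-" \
      | head -n "$excess" \
      | while read snap; do
          zfs destroy "$snap"
        done
  fi

It’s not the same thing of course - a real implementation would hook into the transaction group commit itself - but it keeps a few known-good states around that the next writes can’t destroy.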

You can't do this with a fsck

All the stuff above isn’t done by a fsck tool. You can’t roll back transactions with such a tool. You can’t guarantee the integrity of the data after the system reports back to you that the filesystem has been recovered. After a successful check of the filesystem you have exactly this: a recovered filesystem. Nothing less, but nothing more as well. In the end the filesystem is just a tool that enables you to get data from your disk without writing down the sector numbers to get it back via dd. And fsck just takes care of the filesystem’s own problems, not the ones of the data. It checks the filesystem, but not the data. It’s called fsck and not datack for a reason. So we end up with a mountable filesystem, but the data in it … that’s a different story.

Conclusion and a provocation?

ZFS doesn’t need a fsck tool, because such a tool doesn’t solve the real problem. ZFS needs something better, and with all the features of ZFS in conjunction with PSARC 2009/479 it obviously delivers something better. In the end the solution has to start somewhere else: first you should throw the sub-sub-substandard hardware in the next available trash bin, after copying the data to a storage subsystem of better quality and wiping the old disks. Perhaps, after all, the correct question should be: how long can other filesystems get away with just a filesystem check but without a data check?