Some thoughts about deduplication

Deduplication is one of the big hype topics at the moment. Deduplication sounds good, but … well … hmmm … let's put it this way: this technology has to mature. It started with a link on an internal mailing list. Deduplication is pretty simple … at least if you have checksums for every block in your storage. Deduplication is just a very clever usage of checksums. Such a checksum is nothing else than a hash. It's a safe assumption that a block has a higher probability of being a duplicate if an older block has the same checksum. But as we all know: hash algorithms are susceptible to collisions. Before simply linking your new block to the old one, you have to check if they are really the same (I'll sketch that at the end of this post). And that's where the problem starts.

Let's assume your storage is a large heap of boxes with screws. To accelerate searching, you sort the boxes into rows: small screws, large screws, blade head, torx head. You mark your boxes with the combinations, but there are several boxes with the same name: small Torx T9 screws come in several shapes, and each shape gets its own box. Now you want to sort in a large torx screw with a silver finish and a Whitworth thread. You can hash the screw ("Oh, it's a large torx") and look at your screw boxes. Then you have to open every box and check whether the screws inside are really the same. First box: okay, those are the black ones. Next box: damn … those are the ones with the metric thread. This can take a while. Now let's assume you have a more sophisticated hashing … eeerr … sorting scheme: you have a box for each size, each type of head, each type of threading and so on. Albeit there are still possible collisions, you have to open fewer boxes to find out which one is really the right box for your screw.

Now there are two ways to dedupe: you can do it right at the delivery of the screws, or you can put them into a box of their own at first and sort them in later. If it's easy to find the correct box, you can do it immediately. If it's hard to find the box, you should do it later. You can't explain to a mechanic that you have to sort screws first before you can hand him his screw.

It's pretty much the same for storage and data deduplication. The sorting scheme is the hash, the boxes with screws are the blocks in your storage. And there are exactly such problems. NetApp for example has a 16-bit checksum. Let's assume you use the biggest possible dedupable volume … 16 TB. Your checksum has 65,536 possible values. At 4 KB per block that's roughly four billion blocks, and assuming an even distribution over all possible checksum values, this leads to 65,536 blocks per value. To really check if a block is identical, you may have to compare it against up to 65,536 old blocks. This isn't something I found out myself: Jered Floyd pointed to this fact in his blog entry Jet Engine on a Duck: You Can't Retrofit Dedupe. It's an interesting read; he has done the math about these hash collisions there.
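Just to make that number tangible, here is a quick back-of-the-envelope calculation in Python (a sketch under the assumptions above: 4 KB blocks and a perfectly even distribution of checksum values):

# Back-of-the-envelope: candidates per fingerprint on a maxed-out
# 16 TB volume with a 16-bit checksum (4 KB blocks assumed).
volume_bytes = 16 * 2**40             # 16 TB
block_bytes = 4 * 2**10               # 4 KB per block
blocks = volume_bytes // block_bytes  # 4,294,967,296 blocks (2**32)
fingerprints = 2**16                  # 65,536 possible checksum values
print(blocks // fingerprints)         # 65536 blocks sharing each value on average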

Reading this article, I thought at first that there are additional fingerprints, but a technical director at NetApp wrote:

Accordingly, every block of data on disk is protected with a checksum. NetApp uses this checksum as its fingerprint. Since we were going to compute it anyway, we get it "for free"—there is no additional load on the system.

Of course there is no additional load from checksum generation at write time, but you pay for it at deduplication time. The cure for this problem is easy: use stronger hashes. Let's assume a deduplication for ZFS. ZFS uses 256-bit checksums. That leads to 1.15792089 × 10^77 possible states. 16 TB is roughly four billion blocks at 4 KB per block. Even a 32-bit checksum would yield as many states as there are blocks, potentially giving each block its own hash value. ZFS has a 256-bit one. We can detect the validity of a block with a probability of 0.9999999999999999999, and thus we can detect the duplicity of a block with the same probability (validity and duplicity are two sides of the same coin). It's safe to assume that we would have to check only a single old block for every new block of data. You still have to do the cross-check: when you are about to delete the other block, 0.9999999999999999999 isn't enough. But it's a difference whether you have to check 1 block or 65,536 blocks.

By the way: deduplication is the reason why Linux badly needs a checksumming filesystem, and thus btrfs, for the future. Without checksums in the filesystem you have to do the duplicate detection outside of the filesystem (thus additional effort to do dedup) and you have to ensure that your fingerprint database stays in sync with the real filesystem. Doesn't look efficient.

And there is another problem: you lose redundancy with deduplication. When dozens of files use the same block and this block goes corrupt, you have a really huge problem. Or what happens if you read an incorrect block, the checksum doesn't detect it, and the block gets cached and never read again? With deduplication all files would be incorrect at once. This is another tradeoff. I don't think of a duplicated file as wasted space. It adds redundancy to your data, and you have to decide if you want to take that risk, even if it's small, or not. I wouldn't do dedup without RAID-1 (duplicate the deduplication ;) ) and really strong checksums, just to be sure.

And when you really think about it: people want dedup because of the cost per TB of fast enterprise disks. But 2 TB disks are starting to emerge, and you could use SSDs for speed. So you can implement huge amounts of storage with high speed at small cost. The concept of hybrid storage pools pushes the need for deduplication some time into the future.

As I have written before: deduplication is a technology that has to mature before it can be used as a commodity technology. And in some cases you have to do your homework in the underlying technology. Sun has solved the underlying problems "en passant" with ZFS; thus we started to develop "deduplication … done right" some time ago.
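To wrap this up, here is the minimal sketch of the verify-before-link write path I promised above. It assumes an in-memory fingerprint table and SHA-256 as the strong hash; the class and method names are mine for illustration, not NetApp's or ZFS's actual implementation:

import hashlib

class DedupStore:
    # Toy block store that deduplicates on write. Sketch only:
    # fingerprints live in a dict, block contents in another dict.
    def __init__(self):
        self.blocks = {}    # block address -> raw bytes
        self.by_hash = {}   # fingerprint -> list of block addresses
        self.next_addr = 0

    def write_block(self, data):
        fingerprint = hashlib.sha256(data).digest()  # strong hash: few candidates
        # Open the "box" and compare byte for byte before linking --
        # a fingerprint match alone is not proof of identity.
        for addr in self.by_hash.get(fingerprint, []):
            if self.blocks[addr] == data:
                return addr                          # duplicate: link, store nothing
        addr = self.next_addr                        # no match: store a new block
        self.next_addr += 1
        self.blocks[addr] = data
        self.by_hash.setdefault(fingerprint, []).append(addr)
        return addr

store = DedupStore()
first = store.write_block(b"A" * 4096)
second = store.write_block(b"A" * 4096)   # same content: same address, no new block
print(first == second, len(store.blocks)) # True 1

With a weak 16-bit fingerprint the candidate list behind a single hash value can grow to tens of thousands of entries; with a 256-bit checksum it will almost always hold at most one, which is the whole point made above.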