ZFS Dedup Internals
Questions about deduplication are quite frequent, not so much about the basic function as about some implications of the implementation. Over time I have written a cheat sheet for myself to answer such questions.
- Deduplication in ZFS is in-band: it occurs when data is written to disk.
- The instrument for deduplication is checksums. If a block has the same checksum as a block already written to the pool, it is considered a duplicate and only a pointer to the already stored block is written to disk (to put it simply).
- As it's not reasonable to scan through all blocks on disk and read their checksums in order to find a duplicate, the checksums are stored in a structure called the deduplication table, or DDT.
- The DDT is kept as a so-called AVL tree (that's obvious from the many calls to the system-wide AVL tree code). This makes searching efficient, so ZFS doesn't just grep linearly through the DDT; see the sketch below.
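To make this more tangible, here is a minimal sketch in C of the write-time dedup decision. It is not the real ZFS code: the dedup_entry_t structure, the write_new_block callback, and the use of POSIX tsearch()/tfind() in place of ZFS's own AVL routines are simplifying assumptions for illustration.

```c
/*
 * Minimal sketch of the write-time dedup decision, NOT the real ZFS code.
 * A POSIX binary tree (tsearch/tfind) stands in for the in-core AVL tree,
 * and dedup_entry_t / write_new_block() are made up for this example.
 */
#include <search.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct dedup_entry {
	uint64_t checksum[4];	/* 256-bit checksum of the block's contents */
	uint64_t block_ptr;	/* where the unique copy lives on disk */
	uint64_t refcount;	/* how many logical blocks reference it */
} dedup_entry_t;

static void *ddt_root;		/* root of the in-memory lookup tree */

static int
ddt_compare(const void *a, const void *b)
{
	return (memcmp(((const dedup_entry_t *)a)->checksum,
	    ((const dedup_entry_t *)b)->checksum,
	    sizeof (((const dedup_entry_t *)a)->checksum)));
}

/* Called for every block on its way to disk while dedup is active. */
static uint64_t
dedup_write(const uint64_t checksum[4], uint64_t (*write_new_block)(void))
{
	dedup_entry_t key, *entry, **found;

	memcpy(key.checksum, checksum, sizeof (key.checksum));
	found = tfind(&key, &ddt_root, ddt_compare);
	if (found != NULL) {
		/* Duplicate: only record another reference. */
		(*found)->refcount++;
		return ((*found)->block_ptr);
	}

	/* New unique block: write it out and remember its checksum. */
	entry = malloc(sizeof (*entry));	/* error handling omitted */
	memcpy(entry->checksum, checksum, sizeof (entry->checksum));
	entry->block_ptr = write_new_block();
	entry->refcount = 1;
	(void) tsearch(entry, &ddt_root, ddt_compare);
	return (entry->block_ptr);
}

static uint64_t next_block = 0;
static uint64_t fake_write(void) { return (next_block++); }

int
main(void)
{
	uint64_t csum[4] = { 1, 2, 3, 4 };

	/* The second write of an identical block hits the existing entry. */
	printf("first write  -> block %llu\n",
	    (unsigned long long)dedup_write(csum, fake_write));
	printf("second write -> block %llu\n",
	    (unsigned long long)dedup_write(csum, fake_write));
	return (0);
}
```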
- The DDTs are managed as a ZAP object on-disk
- ZAP is short for the ZFS Attribute Processor
- A ZAP object stores key-value pairs and is used for many tasks that organize data stored in ZFS
- ZAP data is metadata
- Metadata is cached by the ARC
- As they are cached in the ARC, the ZAP blocks and the data inside those blocks are subject to the cache eviction policies of the ARC
- Just increasing the memory size in order to get more room for the DDT may be insufficient, as the space metadata can occupy in the ARC is limited.
- By default it's limited to 25% of the maximum ARC size (see line 3478 of arc.c)
- The maximum ARC size is 3/4 of the system's memory or all memory minus 1 GB, whichever is larger.
- Both values are user-tunable.
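To see what those defaults mean in practice, here is a small back-of-the-envelope sketch; the 8 GB of RAM is just an assumed example figure, and the code only reproduces the two rules stated above, not the actual kernel logic.

```c
/*
 * Back-of-the-envelope sketch of the default limits described above.
 * The 8 GB example system is an assumption; only the arithmetic is shown.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t gb = 1ULL << 30;
	uint64_t physmem = 8 * gb;		/* example system: 8 GB RAM */

	uint64_t three_quarters = physmem / 4 * 3;
	uint64_t all_but_one_gb = physmem - gb;
	uint64_t arc_c_max = (three_quarters > all_but_one_gb) ?
	    three_quarters : all_but_one_gb;	/* whichever is larger */
	uint64_t arc_meta_limit = arc_c_max / 4; /* default: 25% of ARC max */

	printf("max ARC size   : %llu MB\n",
	    (unsigned long long)(arc_c_max >> 20));
	printf("metadata limit : %llu MB\n",
	    (unsigned long long)(arc_meta_limit >> 20));
	return (0);
}
```

So on such an example machine only about 1.75 GB of the ARC may hold metadata like the DDT by default, no matter how big the pool is.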
- You can set the amount of memory for the metadata to a user-chosen value with
set zfs:zfs_arc_meta_limit=<size in bytes>
in /etc/system.
- That may be necessary when you know that you will do a lot of deduplication work and the performance gained by keeping all or most of the DDT in memory outweighs the loss of memory for caching data. Mass importing data from non-deduplicated storage may be such a task, for example.
- The size of a DDT entry may vary between versions; at snv_133 it is 0x178 bytes, or 376 bytes in decimal.
- Such an entry exists for each unique block written to disk while deduplication is activated
- Thus blocks written while deduplication is deactivated don't get deduplicated (obviously), but you can't deduplicate other blocks against them either
- So obviously a pool using only 128K blocks needs much less memory to store the DDT than a pool using only 512-byte blocks; the sketch below puts numbers on this.
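The following sketch simply multiplies the block count by the 376-byte entry size from above. The 1 TB of unique data is an assumed example, and real pools mix block sizes, so treat the output as an order-of-magnitude estimate only.

```c
/*
 * Rough DDT sizing sketch: one ~376-byte entry per unique block.
 * The 1 TB of unique data is an assumed example figure.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t
ddt_estimate_bytes(uint64_t unique_data_bytes, uint64_t block_size)
{
	uint64_t entries = unique_data_bytes / block_size;
	return (entries * 376);
}

int
main(void)
{
	uint64_t tb = 1ULL << 40;

	printf("1 TB unique data in 128K blocks: ~%llu MB of DDT\n",
	    (unsigned long long)(ddt_estimate_bytes(tb, 128 * 1024) >> 20));
	printf("1 TB unique data in 512B blocks: ~%llu GB of DDT\n",
	    (unsigned long long)(ddt_estimate_bytes(tb, 512) >> 30));
	return (0);
}
```

The 256-fold difference in block count translates directly into a 256-fold difference in DDT size.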
- Everything stored in the ARC may be moved to the L2ARC, so an SSD may be used to increase the space available for caching the DDT. Due to the nature of an SSD (no head positioning latency), it is a much better place to search for the necessary parts of the DDT than rotating disks.
- However, keep in mind that the L2ARC doesn't come for free memory-wise. Everything stored in the L2ARC still leaves some information in the ARC. The relevant metadata structure is
arc_buf_hdr_t
(see line 431 of arc.c); its size is 0xb2 bytes, or 178 bytes in decimal.
- Before you ask about substituting a 376-byte object with a 178-byte one: the ARC doesn't store a single DDT entry per ZAP block; multiple entries are stored in each block.
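The following sketch estimates that cost with the 178-byte header size quoted above: for every buffer that moves to the L2ARC, one such header stays behind in RAM. The 100 GB L2ARC device and the two buffer sizes are assumed example figures.

```c
/*
 * Sketch of the ARC memory that L2ARC contents still consume:
 * one ~178-byte arc_buf_hdr_t per buffer held on the SSD.
 * The 100 GB device and the buffer sizes are assumed example figures.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t l2arc_bytes = 100ULL << 30;	/* example: 100 GB SSD cache */
	uint64_t hdr_size = 178;		/* arc_buf_hdr_t, per the text */
	uint64_t buf_sizes[] = { 128 * 1024, 4 * 1024 };

	for (int i = 0; i < 2; i++) {
		uint64_t headers = l2arc_bytes / buf_sizes[i];
		printf("%3llu KB buffers: ~%llu MB of headers stay in the ARC\n",
		    (unsigned long long)(buf_sizes[i] >> 10),
		    (unsigned long long)((headers * hdr_size) >> 20));
	}
	return (0);
}
```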
- Deduplication is not done when writing to the ZIL; it's done when the data is finally written into the pool. Thus deduplication doesn't happen while the OS blocks on a synchronous write, it's part of an asynchronous process afterwards. However, it's possible that you will see write throttling kick in when the system isn't capable of writing all the dirty pages out in a timely manner.
- The access pattern for blocks containing DDT data that is not in memory is very random.
- Systems with large pools and small amounts of memory will not work well, as each write will result in a larger number of reads in order to get the data from the DDT.
- The more unique blocks you have in the DDT, the slower dedup will get, given a small amount of memory. 1 GB of memory in the box and several TB of storage with very small recordsizes is a sure way to get horrible performance.
- Performance will be best when the complete DDT fits in memory.
- Systems with a medium amount of memory and an SSD will yield good results
- However, everything depends on the number of blocks written after activation of deduplication, as that determines the size of the DDT.
- In short: small pool, lots of memory - yeah; really big pool, minimal memory - nah. The truth is somewhere in between.