ZFS Dedup Internals
Questions about deduplication are quite frequent, not so much about the basic function as about some implications of the implementation. Over time I have written a cheat sheet for myself to answer such questions.
- Deduplication in ZFS is in-band: it occurs when data is written to disk.
- The instrument for deduplication is checksums. If a block has the same checksum as a block already written to the pool, it is considered a duplicate and only a pointer to the already stored block is written to disk (to put it simply).
- As it's not reasonable to scan through all blocks on disk and read their checksums in order to find a duplicate, the checksums are stored in a structure called the deduplication table, or DDT.
- The DDT is kept as a so-called AVL tree (that's obvious from the many calls to the system-wide AVL tree code). This makes searching efficient, so ZFS doesn't just grep linearly through the DDT; see the sketch below.
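To make this more tangible, here is a minimal sketch in C of the write-time dedup decision. It is not the real ZFS code: the dedup_entry_t structure, the write_new_block callback, and the use of POSIX tsearch()/tfind() in place of ZFS's own AVL routines are simplifying assumptions for illustration.

```c
/*
 * Minimal sketch of the write-time dedup decision, NOT the real ZFS code.
 * A POSIX binary tree (tsearch/tfind) stands in for the in-core AVL tree,
 * and dedup_entry_t / write_new_block() are made up for this example.
 */
#include <search.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct dedup_entry {
	uint64_t checksum[4];	/* 256-bit checksum of the block's contents */
	uint64_t block_ptr;	/* where the unique copy lives on disk */
	uint64_t refcount;	/* how many logical blocks reference it */
} dedup_entry_t;

static void *ddt_root;		/* root of the in-memory lookup tree */

static int
ddt_compare(const void *a, const void *b)
{
	return (memcmp(((const dedup_entry_t *)a)->checksum,
	    ((const dedup_entry_t *)b)->checksum,
	    sizeof (((const dedup_entry_t *)a)->checksum)));
}

/* Called for every block on its way to disk while dedup is active. */
static uint64_t
dedup_write(const uint64_t checksum[4], uint64_t (*write_new_block)(void))
{
	dedup_entry_t key, *entry, **found;

	memcpy(key.checksum, checksum, sizeof (key.checksum));
	found = tfind(&key, &ddt_root, ddt_compare);
	if (found != NULL) {
		/* Duplicate: only record another reference. */
		(*found)->refcount++;
		return ((*found)->block_ptr);
	}

	/* New unique block: write it out and remember its checksum. */
	entry = malloc(sizeof (*entry));	/* error handling omitted */
	memcpy(entry->checksum, checksum, sizeof (entry->checksum));
	entry->block_ptr = write_new_block();
	entry->refcount = 1;
	(void) tsearch(entry, &ddt_root, ddt_compare);
	return (entry->block_ptr);
}

static uint64_t next_block = 0;
static uint64_t fake_write(void) { return (next_block++); }

int
main(void)
{
	uint64_t csum[4] = { 1, 2, 3, 4 };

	/* The second write of an identical block hits the existing entry. */
	printf("first write  -> block %llu\n",
	    (unsigned long long)dedup_write(csum, fake_write));
	printf("second write -> block %llu\n",
	    (unsigned long long)dedup_write(csum, fake_write));
	return (0);
}
```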
- The DDTs are managed as a ZAP object on-disk
- ZAP is short for the ZFS Attribute Processor
- A ZAP object stores key-value pairs and is used for many tasks that organize data stored in ZFS
- ZAP data is metadata
- Metadata is cached by the ARC
- As they are cached in the ARC, the ZAP blocks and the data inside those blocks are subject to the cache eviction policies of the ARC
- Just increasing the memory size in order to get more room for the DDT may be insufficient, as the space metadata can occupy in the ARC is limited.
- By default it's limited to 25% of the maximum ARC size (see line 3478 of arc.c)
- The maximum ARC size is 3/4 of the system's memory or all memory minus 1 GB, whichever is larger.
- Both values are user-tunable.
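To see what those defaults mean in practice, here is a small back-of-the-envelope sketch; the 8 GB of RAM is just an assumed example figure, and the code only reproduces the two rules stated above, not the actual kernel logic.

```c
/*
 * Back-of-the-envelope sketch of the default limits described above.
 * The 8 GB example system is an assumption; only the arithmetic is shown.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t gb = 1ULL << 30;
	uint64_t physmem = 8 * gb;		/* example system: 8 GB RAM */

	uint64_t three_quarters = physmem / 4 * 3;
	uint64_t all_but_one_gb = physmem - gb;
	uint64_t arc_c_max = (three_quarters > all_but_one_gb) ?
	    three_quarters : all_but_one_gb;	/* whichever is larger */
	uint64_t arc_meta_limit = arc_c_max / 4; /* default: 25% of ARC max */

	printf("max ARC size   : %llu MB\n",
	    (unsigned long long)(arc_c_max >> 20));
	printf("metadata limit : %llu MB\n",
	    (unsigned long long)(arc_meta_limit >> 20));
	return (0);
}
```

So on such an example machine only about 1.75 GB of the ARC may hold metadata like the DDT by default, no matter how big the pool is.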
- You can set the amount of memory for the metadata to a user-chosen value with
set zfs:zfs_arc_meta_limit=<size in bytes>
in /etc/system.
- That may be necessary when you know that you will do a lot of deduplication work and the performance gained by keeping all or most of the DDT in memory outweighs the loss of memory for caching data. Mass importing data from non-deduplicated storage may be such a task, for example.
- The size of a DDT entry may vary between versions; at snv_133 it is 0x178 bytes, or 376 bytes in decimal.
- Such an entry exists for each unique block written to disk while deduplication is activated
- Thus blocks written while deduplication is deactivated don't get deduplicated (obviously), but you can't deduplicate other blocks against them either
- So obviously a pool using only 128K blocks needs much less memory to store the DDT than a pool using only 512-byte blocks; the sketch below puts numbers on this.
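The following sketch simply multiplies the block count by the 376-byte entry size from above. The 1 TB of unique data is an assumed example, and real pools mix block sizes, so treat the output as an order-of-magnitude estimate only.

```c
/*
 * Rough DDT sizing sketch: one ~376-byte entry per unique block.
 * The 1 TB of unique data is an assumed example figure.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t
ddt_estimate_bytes(uint64_t unique_data_bytes, uint64_t block_size)
{
	uint64_t entries = unique_data_bytes / block_size;
	return (entries * 376);
}

int
main(void)
{
	uint64_t tb = 1ULL << 40;

	printf("1 TB unique data in 128K blocks: ~%llu MB of DDT\n",
	    (unsigned long long)(ddt_estimate_bytes(tb, 128 * 1024) >> 20));
	printf("1 TB unique data in 512B blocks: ~%llu GB of DDT\n",
	    (unsigned long long)(ddt_estimate_bytes(tb, 512) >> 30));
	return (0);
}
```

The 256-fold difference in block count translates directly into a 256-fold difference in DDT size.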
- Everything stored in the ARC may be moved to the L2ARC, so an SSD may be used to increase the space available for caching the DDT. Due to the nature of an SSD (no head positioning latency), it is a much better place to search for the necessary parts of the DDT than rotating disks.
- However, keep in mind that the L2ARC doesn't come for free memory-wise. Everything stored in the L2ARC still leaves some information in the ARC. The relevant metadata structure is
arc_buf_hdr_t
(see line 431 of arc.c); its size is 0xb2 bytes, or 178 bytes in decimal.
- Before you ask about substituting a 376-byte object with a 178-byte one: the ARC doesn't store a single DDT entry per ZAP block; multiple entries are stored in each block.
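The following sketch estimates that cost with the 178-byte header size quoted above: for every buffer that moves to the L2ARC, one such header stays behind in RAM. The 100 GB L2ARC device and the two buffer sizes are assumed example figures.

```c
/*
 * Sketch of the ARC memory that L2ARC contents still consume:
 * one ~178-byte arc_buf_hdr_t per buffer held on the SSD.
 * The 100 GB device and the buffer sizes are assumed example figures.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t l2arc_bytes = 100ULL << 30;	/* example: 100 GB SSD cache */
	uint64_t hdr_size = 178;		/* arc_buf_hdr_t, per the text */
	uint64_t buf_sizes[] = { 128 * 1024, 4 * 1024 };

	for (int i = 0; i < 2; i++) {
		uint64_t headers = l2arc_bytes / buf_sizes[i];
		printf("%3llu KB buffers: ~%llu MB of headers stay in the ARC\n",
		    (unsigned long long)(buf_sizes[i] >> 10),
		    (unsigned long long)((headers * hdr_size) >> 20));
	}
	return (0);
}
```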
- Deduplication is not done when writing to the ZIL; it's done when the data is finally written into the pool. Thus deduplication doesn't happen while the OS blocks on a synchronous write, it's part of an asynchronous process afterwards. However, it's possible that you will see write throttling kick in when the system isn't capable of writing all the dirty pages out in a timely manner.
- The access pattern for blocks containing DDT data that is not in memory is very random.
- Systems with large pools and small amounts of memory will not work well, as each write will result in a larger number of reads in order to get the data from the DDT.
- The more unique blocks you have in the DDT, the slower dedup will get, given a small amount of memory. 1 GB of memory in the box and several TB of storage with very small recordsizes is a sure way to get horrible performance.
- Performance will be best when the complete DDT fits in memory.
- Systems with a medium amount of memory and an SSD will yield good results
- However, everything depends on the number of blocks written after activation of deduplication, as that determines the size of the DDT.
- In short: small pool, lots of memory - yeah; really big pool, minimal memory - nah. The truth is somewhere in between.