I have had this discussion with a number of customers in the past, because it’s one of the not-so-obvious consequences of the inner workings of ZFS, specifically of how ZFS allocates blocks for the data it writes to the pool. ZFS has behaved this way since it first appeared in Solaris 10, yet I had this discussion several times in the last year alone, so I thought it would be a good idea to write my thoughts down.

How to increase the pool size

Let’s assume you have a zpool consisting of 32 disks, configured as a mirrored pool, so you have 16 pairs. Let’s further assume that you just need a little bit of extra space, so you want to expand it.

The obvious thought is: “Where is the problem? I will just drop two disks into it and add them as a mirrored pair.” You are right, this will work. But doing so has implications you should think about beforehand. Those implications are obvious once you know a little bit about the inner workings. That said, for most of us this isn’t an issue, but you should think about it.

There are two ways to expand it. You could increase the size of all the devices: when all devices of a top-level vdev have been enlarged, the vdev will use the additional storage. The other way is to add additional devices.

If you are not using LUNs from a centralized storage that you can resize at will, you will probably opt for adding devices, as the capacity of a physical disk is something you can only change by physically swapping the disk for a larger model.

Which road you choose should be well considered, and there are some factors that are not obvious at first. For example: you can’t shrink vdevs back, but you can remove top-level vdevs, so there may be additional reasons for adding LUNs instead of increasing the size of the existing LUNs.

A coffin corner issue

Just to make it clear at the beginning: we are talking here about an issue you only see when you are sitting in that coffin corner of the usage envelope of ZFS.

Using 1 GBit/s Ethernet for your 4-disk backup/fileserver? You aren’t in the coffin corner. The bandwidth of your network will limit you before this issue can. Have a dual 25 GBit/s interface into your fileserver with 96 disks and are extending it with 48 disks? You aren’t in the coffin corner either: you probably won’t see the bandwidth side of the issue, and if you are IOPS bound you should probably use write/read accelerating SSDs anyway. If you are doing your backups onto the pool and are still on 10 GBit/s, you can most often simply ignore the situation; even with 40 GBit/s it is quite an effort to see it.

With all this, keep in mind that to see the issue you have to be able to pump enough data into the system and you have to have an extremely unbalanced configuration (adding an extremely small number of disks to a very large number of disks). It simply doesn’t matter if you can’t even put enough load through your network pipes to saturate the new disks.

Adhere to the best practices, and you won’t see it. And that’s what I mean when I say “It’s a coffin corner situation”.

The root of the considerations

So, what is the issue? I’m simplifying significantly, but when writing data ZFS prefers the disk that is least used in terms of capacity. There is a very good reason for this. ZFS wants to level out how much data is on each disk in a pool, because on a well-used pool this leads to a balanced use of all disks and thus to better performance in the end.

Obviously a new disk will always be the disk with the least amount of data on it because it starts as a clean slate, so the allocation mechanism will always prefer the new disks. So until everything levels out again after enough writing, you may be confronted with a rather unbalanced load. ZFS doesn’t actively rebalance data; the rebalancing simply happens through normal writes, thanks to the COW properties of ZFS. ZFS never overwrites active old blocks to change the data in them; it writes the data to a new location. And when the new location is on the disk that is least used from a capacity standpoint, everything will eventually level out. So the issue will eventually disappear; ZFS just doesn’t resolve it actively.
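To make this bias tangible, here is a deliberately crude toy model in Python. It is not how the real ZFS allocator works: the function name `simulate_writes`, the one-unit writes and the strict “always pick the emptiest vdev” rule are assumptions made purely for illustration, in the same simplifying spirit as the description above.

```python
def simulate_writes(used, capacity, n_writes):
    """Toy model: every write of one unit goes to the vdev with the lowest fill ratio.

    Assumption for illustration only; the real allocator is far more nuanced.
    Ties are resolved in favour of the vdev with the lowest index.
    """
    used = list(used)
    writes_per_vdev = [0] * len(used)
    for _ in range(n_writes):
        target = min(range(len(used)), key=lambda i: used[i] / capacity[i])
        used[target] += 1
        writes_per_vdev[target] += 1
    return writes_per_vdev


# 16 old mirror pairs at 70 % fill plus one freshly added, empty pair
capacity = [1000] * 17
used = [700] * 16 + [0]

writes = simulate_writes(used, capacity, 500)
new_share = writes[-1] / sum(writes)
print(f"share of writes landing on the new vdev: {new_share:.0%}")
```

In this model every single one of the 500 writes lands on the new pair, and the old pairs only start receiving writes again once the new pair has caught up with their fill level.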

However, the other side of the coin is that probably all new writes, and consequently all reads of this newly written data, will go to your new disks. And this is where the “interesting performance dynamics” have their root cause.

Even without this preference you would end up in this kind of situation, because at some point in time you would only have free blocks on the new disks, simply because you have used up all the free blocks on the old disks.

Don’t have the issue

So … what can you do about it when you need more capacity? Well … first of all you can go the “simply not have the issue” route. If you increase the size of all the vdevs, all the vdevs have roughly the same amount of free space, so the allocation will roughly be as before and the performance will be the same.

Ignore it

Obviously you can simply accept that it will be slower. This is the “ignore it” solution. That can really be an acceptable way to go. Depending on the load, the situation will disappear after a while. How long the situation persists, however, differs vastly between, for example, a backup with a retention of 30 days, where all data has been newly written after 30 days, and a filestore where data is never changed, is essentially kept forever and only slowly fills the added space with new data. Keep in mind that the performance dynamics in the latter case may be … well … very interesting, especially when you additionally just read the newest data most of the time.

Think about a large mailserver for example: you don’t change emails, you store them forever, and usually you just read the new mails, as the email clients download them for local caching anyway. But the performance dynamics are probably complex here as well: if your users are fetching their emails frequently, like every minute, the new mails will probably be in ARC or L2ARC and thus you have much less impact than you would assume. However, with such a load the “ignore it” road may not be feasible.

A backup server may be a good choice for the “ignore it” road but not always.

Circumvent it

Another feasible solution would be the “circumvent it” solution. Often it’s not necessary to have all the files in the same pool. You could simply create another filesystem, copy or shadow-migrate data to it and delete it on the other side. Or you could adapt your RMAN script to use more NFS exports and thus use both pools, by moving a few systems to the new pool. You can easily move old data into the new pool via shadow migration.

I would especially go the “circumvent it” route if there is a significant age difference between the new disks and the old disks in the pool. The performance dynamics of a mixed 2 TB and 14 TB pool should be interesting as well, and it is a best practice not to mix different sizes.

However, in this case I would most probably just swap the 2 TB disks for an equal number of 14 TB disks, simply because the disks are probably really, really old, and hard disks have a limited life whether they are lying around on a shelf in their anti-static wrapper or are used inside a shelf (obviously the first may be much longer than the second, but I’m not sure about it; being in the spares rack with an admin searching for things in a hurry is probably a hard life for a disk as well ;) ).

And mixing those disks would be contrary to the best practice as documented here.

“Use similar size disks so that I/O is balanced across devices.”

The reason for this suggestion is exactly what this blog entry is about.

However, what can you do when you neither want to nor can “not have it”, “ignore it” or “circumvent it”? Of course there are correct courses of action for this issue. What are your options?

The perfect solution

The perfect solution would be to always add the same number of disks that you already have in the pool. So when you have two disks in the pool, add two of the same size. If you have 32 disks in the pool, add 32 disks of the same size. This totally rules out any problems. Obviously. The new disks have the same performance as the old disks, and even if all requests go to the new disks, you will have at least the performance of the old disks.

That said, if you want to go this route you could go the “simply not have the issue” route as well. If I have to purchase the same number of disks that I already have, I can obviously swap out the disks if there is a capacity difference between the new and the old disks. So, for example, if the old disks are 8 TB and the new ones are 14 TB, and my existing disk bays support the new disk size, I can simply throw out the old disks and use the new ones, and thus don’t have to invest in new trays. On the other hand, more trays probably means more connections to the server and thus improved performance, and perhaps your disks are not so old that you want to throw them on the trash heap already, especially as simply throwing them on the trash heap isn’t possible: you have to invest time and effort to securely wipe the disks. The tree of decisions is often not simple.

The pragmatic solution

That said, the perfect solution is not always the efficient solution, and I can already hear you: “That’s expensive!” Swapping all my disks? Purchasing all my disks anew or doubling the number of disks is not cheap.

While in theory the perfect solution is totally correct, in practice you can often go a slightly different way.

It depends a little bit on how you have sized your storage. If you have sized your storage for performance and you need every single IOPS or MByte/s out of it, the perfect solution is the way to go.

However, in such cases you should obviously check

  • whether a full flash storage pool is perhaps a more sensible choice anyway.
  • whether an adequate hybrid storage pool configuration can shave off a significant part of the IOPS budget you need and thus allow you to plan with a significantly reduced IOPS budget from the rotating rust.

Both could lead to needing fewer disks for the extension of a storage pool.

In all other cases the pragmatic solution is more than valid. If you have sized the pool for capacity, because for example you needed 200 terabytes and performance was secondary, and you just need 20 additional terabytes, you will most probably get away with far fewer disks than doubling the storage would require. Most probably by really just adding enough disks for the capacity.

There is just one simple rule for this: know your performance requirements for the pool and check whether the new disks could, in the extreme case, carry this load alone. It’s worst-case planning of course, because it’s not as if the other disks won’t get any requests at all, but I like to plan for the worst case. If the minimal number of disks necessary to fulfill your additional capacity needs can’t cope with all the performance requirements, put the minimal number of disks into the system that would be able to do so.
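As a rough sketch of this rule, the following Python snippet takes the larger of “pairs needed for capacity” and “pairs needed to carry the read load alone”. The function name `mirror_pairs_needed` is made up for this example, and the numbers are assumptions: 100 IOPS per rotating disk and 200 read IOPS per two-disk mirror. Write load, bandwidth and caching are ignored entirely, so treat it as a back-of-the-envelope aid and nothing more.

```python
import math


def mirror_pairs_needed(extra_capacity_tb, disk_size_tb,
                        required_read_iops, iops_per_disk=100):
    """Worst-case sizing: the new mirror pairs must carry the whole read load alone.

    Assumptions (illustrative only): each pair adds the usable capacity of one
    disk, and reads can be served by both sides of a mirror.
    """
    pairs_for_capacity = math.ceil(extra_capacity_tb / disk_size_tb)
    pairs_for_iops = math.ceil(required_read_iops / (2 * iops_per_disk))
    return max(pairs_for_capacity, pairs_for_iops)


# 20 TB of extra capacity with 14 TB disks, for two different read loads
modest_load = mirror_pairs_needed(20, 14, required_read_iops=300)
heavy_load = mirror_pairs_needed(20, 14, required_read_iops=1000)
print(modest_load, heavy_load)
```

With the modest load the capacity requirement decides the number of pairs; with the heavy load it is the IOPS requirement that dictates how many disks you have to add.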

And with LUNs on centralized storage?

It gets both a little more complex and a little simpler when you are not directly adding disks to the zpool but LUNs from an intelligent storage. If you are adding a LUN from an intelligent storage device that is dispersed over many disks behind the controller head (and not just mapping a complete disk on the inside to a LUN on the outside), and it is on the same set of disks as the existing LUNs, you can think differently about this.

Let’s assume you have 120 disks behind the heads of your storage and the LUNs are just logical entities carving their blocks out of this pool of 120 disks. Then from a disk IOPS perspective all LUNs have essentially the same disk IOPS budget at their disposal (short of other bottlenecks). You have a budget of 12000 IOPS across all LUNs (assuming 100 IOPS per disk). And to vastly simplify it, it doesn’t matter that much whether you make your withdrawals from this budget with one LUN, two LUNs or three LUNs. However, due to the increased load on the new LUNs you will probably see more interaction with them. It’s probably a very good idea to increase vdev_max_pending, so these LUNs get enough commands in flight to get some load to your disks.

The reason is quite simple: with the default you have 10 commands in flight to a single disk at a given time. That’s one of the reasons why a single large LUN used for a ZFS pool is often a less than optimal idea.

When you have, let’s say, 10 LUNs, you can have 100 commands in flight. When you add, for example, two LUNs, you can have only 20 commands in flight to the LUNs that may take the brunt of the load for a while. By increasing the number of commands in flight, you give the system the opportunity to have enough commands to get the disks to work, no matter if you have one, two or three queues to the disks. Anyway: after doing such an extension I would keep a very keen eye on the queue lengths in iostat for the disks.
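The arithmetic behind this can be written down in a few lines of Python. The helper names `commands_in_flight` and `pending_needed` are invented for this sketch, and the “match the old total” target is just one way to look at it, not a tuning recommendation; the right value depends on what your storage can actually digest.

```python
import math


def commands_in_flight(luns, max_pending=10):
    """Upper bound on outstanding commands: one queue of max_pending per LUN."""
    return luns * max_pending


def pending_needed(new_luns, target_in_flight):
    """Per-LUN queue depth the new LUNs would need to reach a target total."""
    return math.ceil(target_in_flight / new_luns)


existing = commands_in_flight(10)      # ten LUNs with the default of 10
added = commands_in_flight(2)          # two new LUNs with the default of 10
needed = pending_needed(2, existing)   # depth per new LUN to match the old total
print(existing, added, needed)
```

So in this example the two new LUNs would need a per-LUN queue depth of 50 before they could have as many commands outstanding as the ten old LUNs had together.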

The “Do I really look like I care” situation

If you have an abundance of IOPS and massive bandwidth (like with a system full of NVMe SSDs) you can most probably wear your best “Do I really look like I care” face, albeit the bandwidth limit of a pair of devices is still there. However, it’s way harder to reach than the 200 read IOPS limit of a single two-disk mirror you have just added to a 120-disk pool.
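To put numbers on that comparison, here is a small Python sketch. The 100 IOPS per rotating disk is the assumption used earlier in this article; the 100000 IOPS per SSD is a purely hypothetical round figure chosen for illustration, not a measured or vendor value, and the helper `read_iops` is made up for this example.

```python
def read_iops(disks, iops_per_disk):
    """Naive aggregate read IOPS, assuming every disk can serve reads."""
    return disks * iops_per_disk


# Rotating rust: 120-disk pool versus the single mirror pair added to it
hdd_pool = read_iops(120, 100)
hdd_new_pair = read_iops(2, 100)
hdd_fraction = hdd_new_pair / hdd_pool

# Hypothetical SSD figure, only to show how the absolute floor moves
ssd_new_pair = read_iops(2, 100_000)

print(f"HDD: new pair offers {hdd_new_pair} of {hdd_pool} read IOPS ({hdd_fraction:.1%})")
print(f"SSD: new pair alone offers {ssd_new_pair} read IOPS")
```

The ratio between the new pair and the old pool is the same in both cases; what changes is that the absolute floor of an SSD pair is, under this assumed figure, already far above what the whole 120-disk rotating pool could deliver.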


It may look difficult, but most people don’t have the problem, because they have capacity-driven pools on their servers and not performance-driven pools. Those who do have performance-driven pools often have SSDs, where the situation is theoretically the same but, because of the massive performance of SSDs, appears at a much later point if at all. Hybrid storage pools change the equation as well.

It’s just something for you to consider when you are sitting in that coffin corner.

Written by

Joerg Moellenkamp

Grey-haired, sometimes grey-bearded Windows dismissing Unix guy.