Increasing ZFS pool sizes

I feel obliged to point out that this blog post is roughly 5 years and 3 months old. People change, opinions evolve. In just a few years, vast technological landscapes can shift. And don't get me started on config files. Please consider this text in the context of its time.

I have had this discussion with a number of customers in the past, because it’s one of the less obvious consequences of the inner workings of ZFS, specifically of how ZFS allocates blocks for the data it writes to the pool. ZFS has behaved this way since its beginning in Solaris 10, yet I had this discussion several times in the last year alone, so I thought it would be a good idea to write my thoughts down.

How to increase the pool size

Let’s assume you have a zpool consisting of 32 disks. Let’s assume you have a mirrored pool. So you have 16 pairs. Let’s further assume that you just need a little bit of extra space and so you want to expand it.

The obvious thought is: “Where is the problem? I will just drop two disks into it, and add them as a mirrored pair.” You are right, this will work. But doing so has implications you should think about beforehand. Those implications are obvious once you know a little bit about the inner workings. For most of us this isn’t an issue, but you should still think about it.

There are two ways to expand it: You could increase the size of all the devices. When all devices of a top level vdev have been enlarged, the vdev will use this additional storage. The other way is: you can add additional devices.

If you are not using LUNs from a centralized storage that you can resize at will, you will probably opt for adding devices, as the capacity of a physical disk is something you can only change by physically swapping the disk for a larger model.

Which road you choose should be well considered, and some of the factors are not obvious at first. For example: you can’t shrink vdevs back, but you can remove top-level vdevs, so there may be additional reasons for adding LUNs instead of increasing the size of the existing ones.

A coffin corner issue

Just to make it clear at the beginning: we are talking here about an issue that only shows up when you are sitting in the coffin corner of the usage envelope of ZFS.

Using 1 GBit/s Ethernet for your 4 disk backup or file server? You aren’t in the coffin corner. The bandwidth of your network will limit you before this issue can. Do you have dual 25 GBit/s interfaces into a file server with 96 disks and are extending it with 48 disks? You aren’t in the coffin corner either: you will probably not see the bandwidth side of the issue, and if you are IOPS bound you should probably use write and read accelerating SSDs anyway. If you are doing your backups on the pool and are still on 10 GBit/s, you can most often simply ignore the situation; even with 40 GBit/s it takes quite an effort to see it.

With all this, keep in mind that to see the issue you have to be able to pump enough data into the system and you need an extremely unbalanced configuration (adding an extremely small number of disks to a very large number of disks). It simply doesn’t matter if you can’t push enough load through your network pipes to saturate the new disks.

Adhere to the best practices, and you won’t see it. That’s what I mean by “It’s a coffin corner situation”.
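Whether you are anywhere near the coffin corner can be estimated with some back-of-the-envelope arithmetic. The following is a minimal sketch and not a sizing tool: the function name and the 150 MB/s streaming rate per top-level vdev are assumptions made purely for illustration, not measured values.

```python
def limiting_factor(network_gbit_s, new_vdevs, mb_s_per_vdev=150):
    """Tell which side saturates first if all writes hit the new vdevs.

    mb_s_per_vdev is an ASSUMED streaming write rate per top-level vdev;
    replace it with a number measured on your own hardware.
    """
    network_mb_s = network_gbit_s * 1000 / 8  # 1 Gbit/s is 125 MB/s
    new_vdev_mb_s = new_vdevs * mb_s_per_vdev
    if network_mb_s <= new_vdev_mb_s:
        return "network", network_mb_s, new_vdev_mb_s
    return "new vdevs", network_mb_s, new_vdev_mb_s


# Small file server on 1 Gbit/s, one mirrored pair added:
# the network runs out of bandwidth first.
print(limiting_factor(1, new_vdevs=1))

# Dual 25 Gbit/s, still only one mirrored pair added:
# now the single new vdev is the limit.
print(limiting_factor(50, new_vdevs=1))
```

As long as the first element of the result is the network, the unbalanced allocation is unlikely to be what you notice.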

The root of the considerations

So, what is the issue? I’m simplifying it significantly, but when writing stuff ZFS prefers the least used disk in regard to capacity. This has a very good reason. It wants to level out how much data is on all disks you have in a pool, because on a well used pool this will lead to a balanced use of all disks and thus to better performance in the end.

Obviously a new disk will always be the disk with the least amount of data on it because it starts as a clean slate, so the allocation mechanism will always prefer those new disks. So until everything levels out again after enough writing, you may be confronted with a rather unbalanced load. ZFS doesn’t actively rebalance data; rebalancing simply happens by writing to the pool, thanks to the COW properties of ZFS. ZFS never overwrites active old blocks to change data in them, it writes the data to a new location. And when the new location is on the least used disk from a capacity standpoint, everything will eventually level out. So the issue will eventually disappear; it just isn’t something you do actively.

However, the other side of the coin is that probably all new writes, and consequently all reads of this newly written data, will go to your new disks. And this is where the “interesting performance dynamics” have their root cause.

Even without this preference you would eventually see this kind of situation, because at some point the only free blocks left would be on the new disks, simply because you would have used up all the free blocks on the old ones.
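The preference described in this section can be illustrated with a toy model. This is a sketch under heavy assumptions, not the real ZFS allocator: every write is one fixed-size chunk, and each chunk goes strictly to the vdev with the lowest fill ratio. The function name and the numbers are made up for illustration.

```python
def simulate_writes(used, capacity, writes):
    """Toy model: send each chunk to the vdev with the lowest fill ratio.

    Returns the chunks stored per vdev and the chunks each vdev received
    during this run. The real allocator is more subtle than this.
    """
    used = list(used)
    received = [0] * len(used)
    for _ in range(writes):
        target = min(range(len(used)), key=lambda i: used[i] / capacity[i])
        used[target] += 1
        received[target] += 1
    return used, received


# 16 mirrored vdevs that are 70% full, plus one freshly added empty vdev.
capacity = [1000] * 17
used = [700] * 16 + [0]

_, received = simulate_writes(used, capacity, writes=500)
print(received[-1], sum(received[:-1]))   # → 500 0

# Only after the new vdev has caught up are writes spread evenly again.
_, received = simulate_writes(used, capacity, writes=870)
print(received[-1], received[0])          # → 710 10
```

In this model the new vdev takes every single write until it is as full as the old ones, which is the worst case the rest of this article plans for.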

Don’t have the issue

So what can you do about it when you need more capacity? Well, first you can go the “simply not have the issue” route. If you increase the size of all the vdevs, all the vdevs have roughly the same amount of free space, thus the allocation will be roughly like before and the performance will be the same.

Ignore it

Obviously you can simply accept that it will be slower. This is the “Ignore it” solution. That can really be an acceptable way to go. Depending on the load, the situation will disappear after a while. How long the situation persists, however, differs vastly between, for example, a backup with a retention of 30 days, where all data has been newly written after 30 days, and a filestore where data is never changed, is essentially kept forever and only slowly fills the added space with new data. Keep in mind that the performance dynamics in the latter case may be, well, very interesting, especially when you additionally just read the newest data most of the time.

Think about a large mailserver for example: you don’t change emails, you store them forever and usually you just read the new mails, as the email client downloads them anyway for local caching. But the performance dynamics are probably complex here as well. If your users are fetching their emails frequently, like every minute, the new mails will probably be in ARC or L2ARC and thus you have much less impact than you would assume. However, with such a load the “ignore it” road may not be feasible.

A backup server may be a good choice for the “ignore it” road but not always.

Circumvent it

Another feasible solution would be the “circumvent it” solution. Often it’s not necessary to have all the files in the same pool. You could simply create another filesystem in a new pool, copy or shadow-migrate data to it and delete it on the old side. Or you could adapt your RMAN script to use more NFS exports and thus both pools, by moving a few systems to the new pool. You can easily move old data into the new pool via shadow migration.

I would especially go the “circumvent it” route if there is a significant age difference between the new disks and the old disks in the pool. The performance dynamics of a mixed 2 TB and 14 TB pool should be interesting as well, and it is a best practice not to mix different sizes.

However, in this case I would most probably just swap the 2 TB disks for an equal number of 14 TB disks, simply because the disks are probably really, really old. Hard disks have a limited life whether they are lying around on a shelf in their anti-static wrapper or are used inside a disk shelf. (Obviously the first may be much longer than the second, but I’m not sure about it: being in the spares rack with an admin searching for things in a hurry is probably a hard life for a disk as well ;) )

And mixing those disks would be contrary to the documented best practice:

Use similar size disks so that I/O is balanced across devices

The reason for this suggestion is exactly the reason this blog entry is about.

However, what do you do when you neither want to nor can “not have it”, “ignore it” or “circumvent it”? Of course there are correct courses of action for this issue as well. What are your options?

The perfect solution

The perfect solution would be to always add the same number of disks you already have in the pool. So when you have two disks in the pool, add two of the same size. If you have 32 disks in the pool, add 32 disks of the same size. This totally rules out any problems. Obviously. The new disks have the same performance as the old disks, and even if all requests go to the new disks, you will have at least the performance of the old disks.

That said, if you want to go this route you could go the “simply not have the issue” route as well. If I have to purchase the same number of disks that I already have, I can obviously swap out the disks if there is a capacity difference between the new and the old disks. For example, if the old disks are 8 TB, the new ones are 14 TB and my existing disk bays support the new disk size, I can simply throw out the old disks and use the new ones, and thus don’t have to invest in new trays. On the other hand: more trays probably mean more connections to the server and thus improved performance. And perhaps your disks are not so old that you want to throw them on the trash heap already, especially as simply throwing them away isn’t possible: you have to invest time and effort to securely wipe them. The decision tree is often not simple.

The pragmatic solution

That said, the perfect solution is not always the efficient solution, and I can already hear you say “That’s expensive!”. Swapping all my disks? Purchasing all my disks new or doubling the number of disks is not cheap.

While in theory the perfect solution is totally correct, in practice you can often go a slightly different way.

It depends a little bit on how you have sized your storage. If you have sized your storage for performance and you need every single IOPS or MByte/s out of it, the perfect solution is the way to go.

However, in such cases you should obviously check whether

a full flash storage pool is perhaps a more sensible choice anyway.
an adequate hybrid storage pool configuration could shave off a significant part of the IOPS budget you need and thus allow you to plan with a significantly reduced IOPS budget from the rotating rust.

Both could lead to a need for fewer disks for the extension of a storage pool.

In all other cases the pragmatic solution is more than valid. If you have sized the pool for capacity because, for example, you needed 200 terabytes and performance was secondary, and you just need 20 additional terabytes, you will most probably get away with far fewer disks than doubling the storage. Most probably by really just adding enough disks for the capacity.

There is just one simple rule for this: know your performance requirements for the pool and check whether, in the extreme case, the new disks could carry this load alone. It’s worst case planning of course, because the other disks will still get some requests, but I like to plan for the worst case. If the minimal number of disks necessary to fulfil your additional capacity needs can’t cope with all the performance requirements, put in the minimal number of disks that would be able to do so.
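The rule above boils down to taking the larger of two numbers. Here is a minimal sketch for a mirrored pool, assuming 100 IOPS per rotating disk and that a two-disk mirror can serve reads from both sides; the function name and the example requirements are hypothetical.

```python
import math


def mirror_pairs_to_add(extra_tb, disk_tb, required_read_iops, iops_per_disk=100):
    """Worst case planning: the new mirrored pairs carry the whole read load.

    iops_per_disk is an ASSUMPTION for rotating disks; measure your own.
    """
    pairs_for_capacity = math.ceil(extra_tb / disk_tb)
    # A two-disk mirror can serve reads from both of its disks.
    pairs_for_iops = math.ceil(required_read_iops / (2 * iops_per_disk))
    return max(pairs_for_capacity, pairs_for_iops)


# 20 TB more on 14 TB disks, modest load: capacity decides.
print(mirror_pairs_to_add(20, 14, required_read_iops=300))    # → 2

# Same capacity need, but the pool has to deliver 1500 read IOPS.
print(mirror_pairs_to_add(20, 14, required_read_iops=1500))   # → 8
```

Write IOPS and bandwidth would need the same kind of check; they are left out here to keep the sketch short.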

And with LUNs on centralized storage?

It gets both a little more complex and a little simpler when you are not directly adding disks to the zpool but LUNs from an intelligent storage. If you are adding a LUN from an intelligent storage device that is dispersed over many disks behind the controller head (and not just a complete internal disk mapped to a LUN on the outside), and the new LUN is on the same set of disks as the old ones, you can think differently about this.

Let’s assume you have 120 disks behind the heads of your storage and the LUNs are just logical entities carving their blocks out of this pool of 120 disks. Then from a disk IOPS perspective all LUNs have essentially the same disk IOPS budget at their disposal (short of other bottlenecks). You have a budget of 12000 IOPS across all LUNs (assuming 100 IOPS per disk). And to vastly simplify it, it doesn’t matter that much if you make your withdrawals from this budget with one LUN, two LUNs or three LUNs. However, because the new LUNs are preferred for new writes, you will probably see much more traffic to them. It’s probably a very good idea to increase vdev_max_pending, so these LUNs get enough commands in-flight to get some load to your disks.

The reason is quite simple: with the default you have 10 commands in flight to a single device at a given time. That’s one of the reasons why a single large LUN used for a ZFS pool is often a less than optimal idea.

When you have, let’s say, 10 LUNs, you can have 100 commands in-flight. When you add, for example, two LUNs, you can have only 20 commands in-flight to the LUNs that may take the brunt of the load for a while. By increasing the number of commands in flight, you give the system the opportunity to have enough commands to get the disks to work, no matter if you have one, two or three queues to the disks. Anyway: after doing such an extension I would keep a very keen eye on the queue lengths in iostat for the disks.
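The queue arithmetic can be written down in a few lines. This is only a sketch of the calculation: the helper names are hypothetical, the default of 10 is the value mentioned above, and the result is not a tuning recommendation for vdev_max_pending on any particular release.

```python
import math


def commands_in_flight(luns, max_pending=10):
    """Upper bound of outstanding commands across all given LUNs."""
    return luns * max_pending


def max_pending_to_match(old_luns, new_luns, old_max_pending=10):
    """Queue depth the new LUNs would need to offer the same total
    number of outstanding commands as the old LUNs did."""
    return math.ceil(commands_in_flight(old_luns, old_max_pending) / new_luns)


print(commands_in_flight(10))        # → 100
print(commands_in_flight(2))         # → 20
print(max_pending_to_match(10, 2))   # → 50
```

Whether the storage behind the LUNs actually benefits from that many outstanding commands is something only the queue lengths and service times in iostat can tell you.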

The “Do I really look as if I would care” situation?

If you have an abundance of IOPS and massive bandwidth (like with a system full of NVMe SSDs) you can most probably wear your best “Do I really look as if I would care” face, although the bandwidth limit of a pair of devices is still there. However, it’s way harder to reach than the 200 read IOPS limit of the single two disk mirror you have just added to a 120 disk pool.

Closing

It may look difficult. However, most people don’t have the problem, because they have capacity driven pools in their servers and not performance driven pools. Those who do have performance driven pools often have SSDs, where the situation is theoretically the same but, because of the massive performance of SSDs, appears much later, if at all. Hybrid storage pools change the equation as well.

It’s just something for you to consider when you are sitting in that coffin corner.

Joerg Moellenkamp
