PSARC 2010/296 - tunable read-modify-write for flash

Bo Steven Zhou proposed an interesting feature for Solaris in “PSARC 2010/296: Add tunable to control RMW for Flash Devices”. RMW? Okay, this time we dig deep into the sd driver. Normally much of this stuff is hidden by other components of Solaris.

Okay … RMW … it’s the read-modify-write cycle you have to do when you want to change, for example, 512 bytes on a device that can only change its stored data in 4096-byte blocks. You have to read the 4096 bytes, modify the 4096 bytes in memory with the new 512 bytes, and write the 4096 bytes back. That’s one of the basic reasons why you should align your block sizes and block positions to the physical ones of the device, even if the block device offers you a smaller one. And afterwards you should write to the device in units of the physical block size. All the warnings for the F20/F5100 to ensure that writes are 4k aligned are there for a reason.

That said, the world isn’t perfect, and sometimes you have to do writes smaller than the physical block size (especially when the drive tells you it has a different one). So you have to do this read-modify-write. Or to be exact … a component in your system has to do it: you can leave it to the small computer called the flash memory controller, or you can pass complete, already modified 4096-byte blocks to your flash drive so that the drive just has to write them. Most of the time this is the fastest way to do it. But sometimes a flash controller is rather slow at doing RMW, and in such a situation it can pay off to leave the RMW to the server. With the tunable proposed by Bo Zhou, you can tell the Solaris sd driver either to do the RMW itself or to just pass the data to the disk and leave it to the firmware of the disk. Bo Steven Zhou reports some interesting performance jumps in microbenchmarks for the F20/F5100:

- ZFS random write, random fs block size test shows 100x improvement compared to running RMW in f/w.
- ZFS random write, 4K-aligned fs block size test shows 230x improvement compared to running RMW in f/w.
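
To make the read-modify-write cycle a bit more concrete, here is a minimal Python sketch of what "doing the RMW on the host" means. It is not how the sd driver implements it; the names `is_aligned` and `rmw_write` are made up for illustration, and a plain `bytearray` stands in for the flash device. The only thing the "device" accepts is whole 4096-byte blocks.

```python
PHYS = 4096  # assumed physical block size of the device, in bytes


def is_aligned(offset, length, block=PHYS):
    """True if a write starts on a physical block boundary and covers whole blocks."""
    return offset % block == 0 and length % block == 0


def rmw_write(device, offset, data, block=PHYS):
    """Write `data` at byte `offset`, touching the device in whole blocks only.

    `device` is a bytearray standing in for the flash medium (hypothetical);
    its length is assumed to be a multiple of `block`.
    Returns how many physical blocks had to be read before they could be written.
    """
    reads = 0
    end = offset + len(data)
    pos = offset - offset % block  # start of the first affected physical block
    while pos < end:
        lo = max(offset, pos)        # part of this block that receives new data
        hi = min(end, pos + block)
        if hi - lo == block:
            # The new data covers the whole block: no read needed.
            buf = bytearray(data[lo - offset:hi - offset])
        else:
            buf = bytearray(device[pos:pos + block])                # read
            reads += 1
            buf[lo - pos:hi - pos] = data[lo - offset:hi - offset]  # modify
        device[pos:pos + block] = buf                               # write
        pos += block
    return reads
```

A 512-byte write into the middle of a block costs one extra read, a write straddling two blocks costs two, and a 4k-aligned write costs none. That is exactly why the alignment warnings matter, no matter whether the host or the firmware ends up doing the work.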

As I wrote before … this proposal is really interesting. But I think there may be another point where such a feature could be worth the effort. However, I haven’t thought it through in all directions: Many SSDs have memory. Part of it is used to do this RMW stuff. However, this memory is the Achilles heel of many SSDs, as some disks report a successful write to the system as soon as the data is in that memory. The memory isn’t battery-backed, so you can lose data when the power fails in the window after the controller has reported the successful write but before the flash memory controller (FMC) has finally written the data to flash. When you do the RMW in the system, the controller doesn’t have to do it, so you don’t need to cache the data inside the flash memory controller for the RMW. Thus switching off the write cache of the SSD to ensure data integrity possibly has a smaller impact on performance, as the FMC can write the block directly to flash and doesn’t have to read the block first when it’s not in the cache of the device. A server also has the advantage of a potentially much larger cache. As I wrote at the beginning … this is just a thought … and I haven’t thought it through to the end so far.

By the way: Bo Zhou gave an interesting talk about the problems of the migration from 512 to 4k bytes at the Storage Developer Conference 2009 in Santa Clara. You can download it from the SNIA website.