Less known Solaris Features: SamFS - Part 2: The theory of Hierarchical Storage Management

I think I should start with the theory of Hierarchical Storage Management (HSM). The technology of managing data and its position on different media isn't really familiar to most administrators, so this introduction will be a fairly long one.

First observation: Data access pattern

There is a single fact in data management that has been true for the last 40 years and will stay true until we stop storing data: The longer you haven't accessed a piece of data, the lower the probability that you will access it again. This matches daily observation: You work on a text document, you access it every few minutes, you finalize your work and store it on your hard disk. And the micro square meters of rotating rust holding your document keep rotating with it. Perhaps you access it a few times in the next weeks to print it again. But after a year it gets more and more improbable that you will access the data. So why not simply delete it? The problems: 1. Murphy's Law: A nanosecond can be defined as the time between the deletion of a file and the moment the boss comes into your office wanting the document. 2. In most countries you will find regulations that prohibit the deletion of such a file.

Second observation: The price of storage

When you look at the prices for storage, you will see that the price rises with its speed. An enterprise-class 15k rpm FC drive is an expensive device. An enterprise-class 7.2k rpm SATA disk is much cheaper, but you can't operate it around the clock near its full load. An LTO-4 tape can store 800 GB for 60-70 Euro, whereas even a cheap consumer disk with 750 GB costs around 100 Euro. The disadvantage of tape: It takes quite a while to get the first byte. You have to load the tape, thread it into the drive, and wind it to the correct position.
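To put the tape and disk figures side by side, here is a small Python sketch that computes the price per gigabyte. It uses only the rough prices and capacities quoted above; the names in it are just labels I made up for illustration, and it ignores the cost of the tape drive or library itself.

```python
# Rough prices (Euro) and capacities (GB) as quoted in the text above.
media = {
    "LTO-4 tape (low)": (60.0, 800),
    "LTO-4 tape (high)": (70.0, 800),
    "consumer disk": (100.0, 750),
}

def euro_per_gb(price_eur, capacity_gb):
    """Return the media price per gigabyte."""
    return price_eur / capacity_gb

costs = {name: euro_per_gb(price, cap) for name, (price, cap) in media.items()}

for name, cost in costs.items():
    print(f"{name}: {cost:.3f} Euro/GB")
```

With these numbers the tape ends up at roughly 0.08 to 0.09 Euro per GB, the consumer disk at about 0.13 Euro per GB.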

Third observation: Capacity

This observation is so simple and so obvious that I don't have to explain it: The amount of data to store never gets smaller. I remember my first PC, purchased by my dad. It had a 20 megabyte hard disk, and I thought: That's pretty much. Today I need this capacity to store 2 raw images from my DSLR. And in companies it's the same: Regardless of how much space is provided, it isn't enough. I know some companies where small amounts of free space on a departmental server are the subject of interdepartmental quid-pro-quo deals.

Hierarchical Storage Management

The basic idea of hierarchical storage management leverages these observations. It makes sense to store current data on the fastest storage available, but it doesn't make sense to use that storage for data that hasn't been accessed for a year or so. The other way round: It will drive you mad if the text document you are currently working on is stored on a DLT tape in a big autoloader that needs 2 minutes to stream the first byte to the client. Hierarchical Storage Management can use a multitude of storage devices with different access behaviour to store your data. You can establish a hierarchy of storage systems: ultra-fast FC disks for data in regular use, cheaper SATA disks for files you need a few times a week, tapes for data needed a few times a year. The kicker behind HSM: Its inner workings are invisible to the user or the application. You just use the file, and the system fetches it for you from the other media.
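The idea of such a hierarchy can be sketched in a few lines of Python. This is only a toy illustration and not how SamFS decides anything: the tier names come from the text above, but the thresholds (7 and 90 days) and the function choose_tier are hypothetical values and names I picked for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    max_idle_days: float  # files idle longer than this belong on a slower tier

# Hypothetical thresholds, chosen only for illustration.
TIERS = [
    Tier("FC disk", 7),
    Tier("SATA disk", 90),
    Tier("tape", float("inf")),
]

def choose_tier(idle_days: float) -> str:
    """Return the name of the fastest tier whose idle limit covers the file."""
    for tier in TIERS:
        if idle_days <= tier.max_idle_days:
            return tier.name
    return TIERS[-1].name

for days in (0, 30, 400):
    print(days, "days idle ->", choose_tier(days))
```

A real HSM would of course look at more than the idle time, but the principle stays the same: the longer a file has not been touched, the further down the hierarchy it may go.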

An analogy in computer hardware

Most of you already use a hierarchical mechanism for storing data: the memory hierarchy in your computer. You have ultra-fast memory in your processor called registers, then you have an almost as fast memory called cache (most of the time in several stages, like Level 1 to Level 3 caches), then you have a much slower main memory. At the end you have the swap space on your hard disk, which is much slower again than the main memory. HSM does the same for storage. The advantages and the challenges are pretty much the same: By using the hierarchy in an intelligent manner, you can speed up the access to your data without spending too much on buying only the fastest memory. The challenge: You have to find the best place for your data, and you have to carefully size each of the stages. When a stage is too small, access times get longer, as you have to access a slower storage or memory more often.
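The effect of sizing can be shown with a simple weighted average. The following Python sketch assumes that you know which fraction of the accesses is served by each stage. All numbers are made up for illustration, except the 120 seconds for tape, which is the 2 minutes mentioned earlier.

```python
def average_access_time(stages):
    """stages: list of (fraction_of_accesses, access_time_in_seconds)."""
    total_fraction = sum(fraction for fraction, _ in stages)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * seconds for fraction, seconds in stages)

# Hypothetical distributions: (share of accesses, time to first byte).
# Order: fast disk, slow disk, tape.
well_sized = [(0.95, 0.005), (0.04, 0.010), (0.01, 120.0)]
too_small = [(0.80, 0.005), (0.10, 0.010), (0.10, 120.0)]

t_well = average_access_time(well_sized)
t_small = average_access_time(too_small)
print(f"well sized: {t_well:.2f} s, too small: {t_small:.2f} s")
```

With these assumed numbers the average time to the first byte grows from about 1.2 seconds to about 12 seconds, just because more accesses fall through to the tape.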

SamFS

SamFS is an implementation of this concept. It isn't the only one, but from my point of view it's the best implementation in the Unix world. SamFS stands for Storage Archive Manager File System. It's a fully POSIX-compliant file system with a rich feature set, so a user or an application won't see a difference to UFS, for example. I would suggest that you look at the Sun StorageTek SamFS page on the Sun website for an overview.