The CeBIT magic

I promised to publish a walkthrough of my ZFS demonstration at the CeBIT 2009 booth … it's the stuff Ingo Frobenius called magic. Well, it isn't really magic, but it is perhaps impressive when you demo it at high speed. For people used to the virtues of ZFS this speed seems normal, but you have to consider that most people know it otherwise … vastly slower, vastly less integrated and vastly less comfortable.
So … what was my demo case for ZFS at the CeBIT? As I had just one disk in my CeBIT system, I used the trick of using files as devices. So I had to create the file-backed devices first.

# mkfile 128m /testfile1 
# mkfile 128m /testfile2
# mkfile 128m /testfile3
# mkfile 128m /testfile4

Okay, now I create my test pool.

# zpool create tp mirror /testfile1 /testfile2

To show how mighty zpool create is, I showed the mount table afterwards to present the already mounted filesystem.

# mount
[..]
/tp on tp read/write/setuid/devices/nonbmand/exec/xattr/atime/dev=2d90013 on Fri Mar 20 12:58:24 2009

Afterwards I've created some filesystems. As I'm a child of the early eighties, I use Muppet Show names most of the time:

# zfs create tp/statler
# zfs create tp/gonzo  
# zfs create tp/waldorf
# zfs create tp/kermit

ZFS filesystem creation is so fast that I show the mount table again to prove to the audience that I've really created the filesystems.

# mount
[..]
/tp/statler on tp/statler read/write/setuid/devices/nonbmand/exec/xattr/atime/dev=2d90017 on Fri Mar 20 12:59:34 2009
/tp/gonzo on tp/gonzo read/write/setuid/devices/nonbmand/exec/xattr/atime/dev=2d90018 on Fri Mar 20 12:59:37 2009
/tp/waldorf on tp/waldorf read/write/setuid/devices/nonbmand/exec/xattr/atime/dev=2d90019 on Fri Mar 20 12:59:40 2009
/tp/kermit on tp/kermit read/write/setuid/devices/nonbmand/exec/xattr/atime/dev=2d9001a on Fri Mar 20 12:59:42 2009

The concept of the storage pool was new to many people at the CeBIT booth, so I told them to observe the third column, the available space.

# zfs list | grep "tp/"
tp/gonzo                 18K  90.8M    18K  /tp/gonzo
tp/kermit                18K  90.8M    18K  /tp/kermit
tp/statler               18K  90.8M    18K  /tp/statler
tp/waldorf               18K  90.8M    18K  /tp/waldorf

90.8 MB free. Now let's create a file in one of the filesystems.

# mkfile 10m /tp/gonzo/testfile

Okay, yet another look at the filesystems.

# zfs list | grep "tp/"
tp/gonzo               5.27M  85.6M  5.27M  /tp/gonzo
tp/kermit                18K  85.6M    18K  /tp/kermit
tp/statler               18K  85.6M    18K  /tp/statler
tp/waldorf               18K  85.6M    18K  /tp/waldorf

As all four filesystems share the same pool, all show the same reduced amount of available storage. Okay, now let's extend the pool. First, a short look at the current configuration.

# zpool status tp
  pool: tp
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        tp              ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            /testfile1  ONLINE       0     0     0
            /testfile2  ONLINE       0     0     0

errors: No known data errors

We have a mirror of two devices. Okay, let's add the other two file devices.

# zpool add tp mirror /testfile3 /testfile4

Let's have another look at our pool structure.

# zpool status tp
  pool: tp
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        tp              ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            /testfile1  ONLINE       0     0     0
            /testfile2  ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            /testfile3  ONLINE       0     0     0
            /testfile4  ONLINE       0     0     0

errors: No known data errors

We now have a stripe of two mirrors. And when you look at the filesystems, all filesystems of the pool show the same increased amount of available storage.

# zfs list | grep "tp/"
tp/gonzo               20.0M   194M  20.0M  /tp/gonzo
tp/kermit                18K   194M    18K  /tp/kermit
tp/statler               18K   194M    18K  /tp/statler
tp/waldorf               18K   194M    18K  /tp/waldorf
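
The same growth is visible at the pool level. I didn't show this at the booth, but a quick way to see the total size and free space of the pool itself is:

# zpool list tp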

Let's play around with filesystem snapshots. I've used the example of working in your home directory.

# cd /tp/gonzo
# touch monday
# touch tuesday

Okay, it would be nice to protect your work against mishaps. Let's do a snapshot.

# zfs snapshot tp/gonzo@tuesdayevening

The storyline of my demo repeats this for a while:

# touch wednesday
# zfs snapshot tp/gonzo@wednesdayevening
# touch thursday 
# zfs snapshot tp/gonzo@thursdayevening
# rm monday      
# touch friday    
# zfs snapshot tp/gonzo@fridayevening
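
By now quite a few snapshots have piled up. I didn't show it in the demo, but if you want to see them all at once, zfs list can limit its output to the snapshots of a single filesystem:

# zfs list -t snapshot -r tp/gonzo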

It's Saturday and the boss needs the results of Monday.

# ls -l
total 4
-rw-r--r--   1 root     root           0 Mar 20 13:11 friday
-rw-r--r--   1 root     root           0 Mar 20 13:10 thursday
-rw-r--r--   1 root     root           0 Mar 20 13:10 tuesday
-rw-r--r--   1 root     root           0 Mar 20 13:10 wednesday

Fsck … you've deleted them. But you could use the snapshots:

# cd .zfs
# cd snapshot
# cd tuesdayevening/
# ls -l
total 40979
-rw-r--r--   1 root     root           0 Mar 20 13:10 monday
-rw------T   1 root     root     20971520 Mar 20 13:05 testfile
-rw-r--r--   1 root     root           0 Mar 20 13:10 tuesday
# cp monday /tp/gonzo/monday
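
If the whole week had gone wrong and you wanted to go back to Tuesday completely, a rollback would do it, too. This wasn't part of my demo, so take it as a hedged sketch; note that rolling back to an older snapshot needs -r, which destroys the more recent snapshots:

# zfs rollback -r tp/gonzo@tuesdayevening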

You can just go to the .zfs directory in the root of your filesystem and access a snapshot by its name as a directory name.

Okay, most people were really impressed by now, but we can do more than that. We can do the same for raw devices. At first I showed them the creation of a sparsely provisioned volume, with the storyline: “Imagine a colleague telling you that he needs a raw volume of 5 gigabytes, but you know he only needs 128 megabytes. How do you give him 5 gigabytes without giving him 5 gigabytes worth of harddisks?” So I created such a volume.

# zfs create -V 5g -s tp/ufsvolume
# zfs list | grep "tp/"
tp/gonzo               20.1M   194M    19K  /tp/gonzo
tp/kermit                18K   194M    18K  /tp/kermit
tp/statler               18K   194M    18K  /tp/statler
tp/ufsvolume             16K   194M    16K  -
tp/waldorf               18K   194M    18K  /tp/waldorf

It's really a device. Just look at the device path:

# ls -l /dev/zvol/dsk/tp/ufsvolume 
lrwxrwxrwx   1 root     root          35 Mar 20 13:13 /dev/zvol/dsk/tp/ufsvolume -> ../../../../devices/pseudo/zfs@0:6c

Let's format it with UFS, just as an example. You could export it via iSCSI and format it with NTFS just as well.
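
As a side note, and not something I showed at the booth: on the Solaris builds of that time, exporting such a volume over iSCSI was roughly a one-liner, assuming the iSCSI target service (iscsitgt) is enabled. Take it as a hedged sketch:

# zfs set shareiscsi=on tp/ufsvolume

But back to UFS for the demo.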

# newfs /dev/zvol/dsk/tp/ufsvolume 
newfs: construct a new file system /dev/zvol/rdsk/tp/ufsvolume: (y/n)? y
Warning: 2082 sector(s) in last cylinder unallocated
/dev/zvol/rdsk/tp/ufsvolume:    10485726 sectors in 1707 cylinders of 48 tracks, 128 sectors
        5120.0MB in 107 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
 9539744, 9638176, 9736608, 9835040, 9933472, 10031904, 10130336, 10228768,
 10327200, 10425632
#

We need some mountpoints.

# mkdir /mountpoint1
# mkdir /mountpoint2
# mkdir /mountpoint3

Initially I mount the filesystem and create a timestamp file in it.

# mount /dev/zvol/dsk/tp/ufsvolume /mountpoint1
# date >> /mountpoint1/timestamp
# cat /mountpoint1/timestamp 
Fri Mar 20 13:19:04 CET 2009

I unmount it, make a snapshot of it, remount it and create another timestamp file, just to show that it's still writable.

# umount /mountpoint1
# zfs snapshot tp/ufsvolume@template      
# mount /dev/zvol/dsk/tp/ufsvolume /mountpoint1
# date >> /mountpoint1/timestamp2

Now let's have a short look at the contents of our UFS filesystem.

# ls -l /mountpoint1/
total 20
drwx------   2 root     root        8192 Mar 20 13:16 lost+found
-rw-r--r--   1 root     root          29 Mar 20 13:19 timestamp
-rw-r--r--   1 root     root          29 Mar 20 13:21 timestamp2

There are two timestamp files in it, as expected. Now we mount our snapshot. As snapshots are read-only by definition, we just mount it read-only.

# mount -o ro /dev/zvol/dsk/tp/ufsvolume@template /mountpoint2
# ls -l /mountpoint2
total 18
drwx------   2 root     root        8192 Mar 20 13:16 lost+found
-rw-r--r--   1 root     root          29 Mar 20 13:19 timestamp

But we can look in and see the version at the time of the snapshot … but now we want to have a writable version of the filesystem. We have to clone the snapshot. No problem.

# zfs clone tp/ufsvolume@template tp/workingvolume
# mount /dev/zvol/dsk/tp/workingvolume /mountpoint3
# cd /mountpoint3
# ls -l
total 20
drwx------   2 root     root        8192 Mar 20 13:16 lost+found
-rw-r--r--   1 root     root          29 Mar 20 13:19 timestamp
# mkfile 1k testfile

Initially it has the same content as our snapshot. But when we create an additional file, the filesystems start to differ. The nice thing: the cloned filesystem just takes the storage needed for the modifications, not a complete copy.

# ls -l
total 20
drwx------   2 root     root        8192 Mar 20 13:16 lost+found
-rw------T   1 root     root        1024 Mar 20 13:26 testfile
-rw-r--r--   1 root     root          29 Mar 20 13:19 timestamp
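
I didn't show it at the booth, but you can check this with zfs list: the clone should show up with only a tiny amount in the USED column, far from a full copy, because all unchanged blocks are shared with the snapshot it was cloned from.

# zfs list tp/workingvolume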

This was my ZFS CeBIT showcase. For many people it was a really impressive show.
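
By the way, if you want to try this walkthrough on your own machine, tearing the demo down afterwards is just as fast. Roughly (not part of the CeBIT demo): unmount the UFS mounts, destroy the pool and remove the backing files.

# cd /
# umount /mountpoint3
# umount /mountpoint2
# umount /mountpoint1
# zpool destroy tp
# rm /testfile1 /testfile2 /testfile3 /testfile4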