= ZFS =

ZFS support was added to Ubuntu Wily 15.10 as a technology preview and comes fully supported in Ubuntu Xenial 16.04.  Note that ZFS is only supported on 64-bit architectures.  Also note that currently only [[https://blog.ubuntu.com/2018/10/15/deploying-ubuntu-root-on-zfs-with-maas|MAAS allows ZFS]] to be installed as a root filesystem.

A minimum of 2 GB of free memory is required to run ZFS; however, it is recommended to use ZFS on a system with at least 8 GB of memory.

To install ZFS, use:

{{{
sudo apt install zfsutils-linux
}}}

Below is a quick overview of ZFS, intended as a getting-started primer.  For further information on ZFS, please refer to some [[https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux|excellent documentation]] written by Aaron Toponce.

= NOTE =

For the sake of brevity, devices in this document are referred to as /dev/sda, /dev/sdb and so on.  In practice one should avoid this and instead use persistent device paths, such as those under /dev/disk/by-id/, to uniquely identify drives and avoid boot-time failures if device name mappings change.
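
For example, persistent names can be listed and then used when creating a pool (the by-id names below are hypothetical and will differ on your system):

{{{
$ ls -l /dev/disk/by-id/
$ sudo zpool create mypool mirror \
    /dev/disk/by-id/ata-EXAMPLE_DISK_SERIAL1 \
    /dev/disk/by-id/ata-EXAMPLE_DISK_SERIAL2
}}}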

= ZFS Virtual Devices (ZFS VDEVs) =

A VDEV is a meta-device that can represent one or more devices. ZFS supports
7 different types of VDEV:

 * File - a pre-allocated file
 * Physical Drive (HDD, SSD, PCIe NVMe, etc.)
 * Mirror - a standard RAID1 mirror
 * RAIDZ - ZFS software raidz1, raidz2, raidz3 'distributed' parity-based RAID
 * Hot Spare - a hot spare for ZFS software RAID
 * Cache - a device for the level 2 adaptive read cache (ZFS L2ARC)
 * Log - ZFS Intent Log (ZFS ZIL)

VDEVs are dynamically striped by ZFS. A device can be added to a VDEV, but cannot
be removed from it.
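
For example, assuming a pool 'mypool' whose mirror VDEV already contains /dev/sdc (pool creation is covered below), a further device can be attached to that VDEV to make a 3-way mirror:

{{{
$ sudo zpool attach mypool /dev/sdc /dev/sde
}}}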

== ZFS Pools ==

A zpool is a pool of storage made from a collection of VDEVs. One or more
ZFS file systems can be created from a ZFS pool.

In the following example, a pool named "pool-test" is created from 3
physical drives:

{{{
$ sudo zpool create pool-test /dev/sdb /dev/sdc /dev/sdd
}}}

Striping is performed dynamically, so this creates a zero redundancy RAID-0
pool.

'''Notice:''' If you are managing many devices, it can be easy to confuse them, so
you should prefer /dev/disk/by-id/ names, which often include the serial numbers
of the drives. The examples here are not meant to suggest that 'sd_' names are
preferred; they merely make the examples easier to read.

One can see the status of the pool using the following command:

{{{
$ sudo zpool status pool-test
}}}

...and destroy it using:

{{{
$ sudo zpool destroy pool-test
}}}

=== A 2 x 2 mirrored zpool example ===

In the following example, we create a zpool containing a VDEV of 2 drives in a mirror:

{{{
$ sudo zpool create mypool mirror /dev/sdc /dev/sdd
}}}

Next, we add another VDEV of 2 drives in a mirror to the pool:

{{{
$ sudo zpool add mypool mirror /dev/sde /dev/sdf -f

$ sudo zpool status
  pool: mypool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
}}}

In this example:
 * /dev/sdc, /dev/sdd, /dev/sde, /dev/sdf are the physical devices
 * mirror-0, mirror-1 are the VDEVs
 * mypool is the pool

There are plenty of other ways to arrange VDEVs to create a zpool.

=== A single file based zpool example ===

In the following example, we use a single 2 GB file as a VDEV and make a zpool from just this one VDEV:

{{{
$ dd if=/dev/zero of=/home/user/example.img bs=1M count=2048
$ sudo zpool create pool-test /home/user/example.img
$ sudo zpool status
  pool: pool-test
 state: ONLINE
  scan: none requested
config:

	NAME                      STATE     READ WRITE CKSUM
	pool-test                 ONLINE       0     0     0
	  /home/user/example.img  ONLINE       0     0     0
}}}

In this example:
 * /home/user/example.img is a file based VDEV
 * pool-test is the pool

== RAID ==

ZFS offers different RAID options:

=== Striped VDEVS ===

This is equivalent to RAID0: there is no parity and no mirroring to rebuild the data, so it is not recommended, since a single drive failure loses the whole pool. For example, to create a striped pool from 4 single-disk VDEVs:

{{{
$ sudo zpool create example /dev/sdb /dev/sdc /dev/sdd /dev/sde
}}}

=== Mirrored VDEVs ===

Much like RAID1, a mirror VDEV can be made from 2 or more devices. In an N-way mirror, up to N-1 devices can fail without losing data. For example, to create a mirrored pool with one 2-device mirror VDEV:

{{{
$ sudo zpool create example mirror /dev/sdb /dev/sdc
}}}

=== Striped Mirrored VDEVs ===

Much like RAID10, this is great for small random-read I/O: create mirrored pairs and then stripe data over the mirrors. For example, to create a striped 2 x 2 mirrored pool:

{{{
sudo zpool create example mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde
}}}

or:

{{{
sudo zpool create example mirror /dev/sdb /dev/sdc
sudo zpool add example mirror /dev/sdd /dev/sde
}}}

=== RAIDZ ===

Like RAID5, this uses a variable-width stripe for parity.  It allows one to get the most capacity out of a set of disks while keeping parity protection, at the cost of some performance, and it can survive a single disk failure without losing data.  For example, to create a RAIDZ VDEV from 4 devices:

{{{
$ sudo zpool create example raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde
}}}

=== RAIDZ2 ===

Like RAID6, this uses double parity, so up to 2 disks can fail without data loss, with performance similar to RAIDZ. For example, to create a double-parity RAIDZ2 VDEV from 5 devices:

{{{
$ sudo zpool create example raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
}}}

=== RAIDZ3 ===

This uses triple parity, allowing up to 3 disk failures before data is lost, with performance similar to RAIDZ2 and RAIDZ. For example, to create a triple-parity RAIDZ3 VDEV from 6 devices:

{{{
$ sudo zpool create example raidz3 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
}}}

=== Nested RAIDZ ===

Like RAID50 or RAID60: striped RAIDZ VDEVs. This performs better than a single RAIDZ VDEV, at the cost of reduced usable capacity.  For example, to create a pool of 2 striped RAIDZ VDEVs:

{{{
$ sudo zpool create example raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde
$ sudo zpool add example raidz /dev/sdf /dev/sdg /dev/sdh /dev/sdi
}}}

== ZFS Intent Logs ==

ZIL (ZFS Intent Log) drives can be added to a ZFS pool to speed up the
write performance of any level of ZFS RAID.  One would normally use
a fast SSD for the ZIL.  Conceptually, the ZIL is a logging mechanism where data and metadata
to be written are stored and then later flushed to the pool as a transactional write. In reality, the ZIL is
more complex than this and is [[http://nex7.blogspot.co.uk/2013/04/zfs-intent-log.html|described in detail here]].
One or more drives can be used for the ZIL.

For example, to add an SSD (/dev/sdg) as a log device to the pool 'mypool', use:

{{{
$ sudo zpool add mypool log /dev/sdg -f
}}}
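
Because the ZIL holds writes that have not yet been committed to the main pool, a mirrored log device is often preferred; a minimal sketch, assuming two spare (hypothetical) SSDs /dev/sdi and /dev/sdj:

{{{
$ sudo zpool add mypool log mirror /dev/sdi /dev/sdj -f
}}}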

== ZFS Cache Drives ==

Cache devices provide an additional layer of caching between main memory and disk.
They are especially useful to improve random-read performance of mainly static data.

For example, to add a cache drive /dev/sdh to the pool 'mypool', use:

{{{
$ sudo zpool add mypool cache /dev/sdh -f
}}}
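
One can check how the cache device is laid out and how much data it holds using zpool iostat, for example:

{{{
$ sudo zpool iostat -v mypool
}}}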


== ZFS file systems ==

ZFS allows one to create a maximum of 2^64 file systems per pool. In the following
example, we create two file systems in the pool 'mypool':

{{{
sudo zfs create mypool/tmp
sudo zfs create mypool/projects
}}}

and to destroy a file system, use:

{{{
sudo zfs destroy mypool/tmp
}}}

Each ZFS file system can have properties set; for example, setting a maximum
quota of 10 gigabytes:

{{{
sudo zfs set quota=10G mypool/projects
}}}

or enabling compression:

{{{
sudo zfs set compression=on mypool/projects
}}}
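
Properties can be inspected, and all file systems in a pool listed, with for example:

{{{
sudo zfs get quota,compression mypool/projects
sudo zfs list
}}}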

== ZFS Snapshots ==

A ZFS snapshot is a read-only copy of a ZFS file system or volume. It can be used
to save the state of a ZFS file system at a point in time, and one can roll back
to this state at a later date.  One can even extract individual files from a snapshot
without performing a complete rollback.

In the following example, we snapshot the mypool/projects file system:

{{{
$ sudo zfs snapshot -r mypool/projects@snap1
}}}

..and we can see the collection of snapshots using:

{{{
$ sudo zfs list -t snapshot
NAME                     USED  AVAIL  REFER  MOUNTPOINT
mypool/projects@snap1   8.80G      -  8.80G  -
}}}

Now let's 'accidentally' destroy all the files and then roll back:

{{{
$ rm -rf /mypool/projects
$ sudo zfs rollback mypool/projects@snap1
}}}
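
Instead of rolling back the whole file system, individual files can also be copied out of the snapshot via the hidden .zfs/snapshot directory under the file system's mountpoint (the file name below is hypothetical):

{{{
$ cp /mypool/projects/.zfs/snapshot/snap1/important-file.txt /mypool/projects/
}}}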

One can remove a snapshot using the following:

{{{
$ sudo zfs destroy mypool/projects@snap1
}}}


== ZFS Clones ==

A ZFS clone is a writeable copy of a file system with the initial content of the
clone being identical to the original file system.  A ZFS clone can only be created
from a ZFS snapshot and the snapshot cannot be destroyed until the clones created
from it are also destroyed.

For example, to clone mypool/projects, first make a snapshot and then clone:

{{{
$ sudo zfs snapshot -r mypool/projects@snap1
$ sudo zfs clone mypool/projects@snap1 mypool/projects-clone
}}}
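
If the clone needs to outlive the snapshot it was created from, it can be promoted; this reverses the dependency so that the original file system becomes the one that depends on the snapshot:

{{{
$ sudo zfs promote mypool/projects-clone
}}}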


== ZFS Send and Receive ==

ZFS send serializes a snapshot of a file system into a stream that can be written to
a file or sent to another machine.  ZFS receive takes this stream and writes the copy
of the snapshot back out as a ZFS file system. This is great for backups, or for copying
a file system over the network (e.g. using ssh).

For example, make a snapshot and save it to a file:

{{{
sudo zfs snapshot -r mypool/projects@snap2
sudo zfs send mypool/projects@snap2 > ~/projects-snap.zfs
}}}

..and receive it back:

{{{
sudo zfs receive -F mypool/projects-copy < ~/projects-snap.zfs
}}}
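
The same stream can be piped over the network instead of going via a file; a sketch assuming a hypothetical remote host 'backuphost' with a pool 'backuppool' and a user permitted to run zfs receive there:

{{{
sudo zfs send mypool/projects@snap2 | ssh user@backuphost sudo zfs receive backuppool/projects-copy
}}}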

== ZFS Ditto Blocks ==

Ditto blocks store additional redundant copies of data, purely for extra protection.
With a storage pool of just one device, ditto blocks are spread across the device,
trying to place the blocks at least 1/8 of the disk apart.  With multiple devices
in a pool, ZFS tries to spread ditto blocks across separate VDEVs. 1 to 3 copies
can be set. For example, to keep 3 copies of mypool/projects:

{{{
$ sudo zfs set copies=3 mypool/projects
}}}

== ZFS Deduplication ==

ZFS dedup will discard blocks that are identical to existing blocks and will
instead use a reference to the existing block.  This saves space on the device,
but comes at a large cost in memory: the in-memory dedup table uses ~320 bytes
per block, and the larger the table grows, the slower write performance becomes.

For example, to enable dedup on mypool/projects, use:

{{{
$ sudo zfs set dedup=on mypool/projects
}}}
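
The deduplication ratio achieved across the whole pool is reported in the DEDUP column of zpool list:

{{{
$ sudo zpool list mypool
}}}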

For more pros/cons of deduping, refer to
[[http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe]]. Deduplication
is almost never worth the performance penalty.

== ZFS Pool Scrubbing ==

To initiate an explicit data integrity check on a pool, one uses the zpool scrub command. For
example, to scrub the pool 'mypool':

{{{
$ sudo zpool scrub mypool
}}}

One can check the status of the scrub using zpool status, for example:

{{{
$ sudo zpool status -v mypool
}}}
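
On Ubuntu, the zfsutils-linux package may also install a cron job that scrubs pools periodically; if present, it can be inspected with:

{{{
$ cat /etc/cron.d/zfsutils-linux
}}}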


== Data recovery, a simple example ==

Let's assume we have a 2 x 2 mirrored zpool:

{{{
$ sudo zpool create mypool mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf -f
$ sudo zpool status
  pool: mypool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
}}}

Now populate it with some data and checksum the data:

{{{
$ dd if=/dev/urandom of=/mypool/random.dat bs=1M count=4096
$ md5sum /mypool/random.dat
f0ca5a6e2718b8c98c2e0fdabd83d943  /mypool/random.dat
}}}

Now we simulate catastrophic data loss by overwriting one of the
VDEV devices with zeros:

{{{
$ sudo dd if=/dev/zero of=/dev/sde bs=1M count=8192
}}}

And now initiate a scrub:

{{{
$ sudo zpool scrub mypool
}}}

And check the status:

{{{
$ sudo zpool status
  pool: mypool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Tue May 12 17:34:53 2015
    244M scanned out of 1.91G at 61.0M/s, 0h0m to go
    115M repaired, 12.46% done
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sde     ONLINE       0     0   948  (repairing)
            sdf     ONLINE       0     0     0
}}}

...now let us remove the drive from the pool:

{{{
$ sudo zpool detach mypool /dev/sde
}}}

..hot swap it out and add a new one back:

{{{
$ sudo zpool attach mypool /dev/sdf /dev/sde -f
}}}

..and initiate a scrub to repair the 2 x 2 mirror:

{{{
$ sudo zpool scrub mypool
}}}
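
Once the scrub and resilver complete, the data can be verified again; the checksum should match the value recorded earlier:

{{{
$ md5sum /mypool/random.dat
}}}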

== ZFS compression ==

As mentioned earlier, one can compress data automatically with ZFS. With the speed of modern CPUs this is a useful option, as a reduced data size means less data to physically read and write, and hence faster I/O.  ZFS offers a comprehensive range of compression methods.  The default is lz4 (a high-performance replacement for lzjb), which offers faster compression and decompression than lzjb and a slightly higher compression ratio too. One can change the compression level, e.g.

{{{
sudo zfs set compression=gzip-9 mypool
}}}

or even the compression type:

{{{
sudo zfs set compression=lz4 mypool
}}}

and check the achieved compression ratio using:

{{{
sudo zfs get compressratio
}}}

lz4 is significantly faster than the other options while still compressing well; it is the safest choice.