ZFS
ZFS support was added to Ubuntu Wily 15.10 as a technology preview. It is only supported on 64 bit architectures.
To install ZFS, use:
sudo apt-get install zfsutils-linux
This will take a while to install, as the ZFS drivers need to be built from source using DKMS, so please be patient. Below is a quick overview of ZFS, intended as a getting-started primer. For further information on ZFS, please refer to the excellent documentation written by Aaron Toponce.
ZFS Virtual Devices (ZFS VDEVs)
A VDEV is a meta-device that can represent one or more devices. ZFS supports 7 different types of VDEV:
- File - a pre-allocated file
- Physical Drive (HDD, SSD, PCIe NVMe, etc)
- Mirror - a standard RAID1 mirror
- ZFS software raidz1, raidz2, raidz3 'distributed' parity based RAID
- Hot Spare - hot spare for ZFS software raid.
- Cache - a device for level 2 adaptive read cache (ZFS L2ARC)
- Log - ZFS Intent Log (ZFS ZIL)
VDEVS are dynamically striped by ZFS. A device can be added to a VDEV, but cannot be removed from it.
ZFS Pools
A zpool is a pool of storage made from a collection of VDEVS. One or more ZFS file systems can be created from a ZFS pool.
In the following example, a pool named "pool-test" is created from 3 physical drives:
$ sudo zpool create pool-test /dev/sdb /dev/sdc /dev/sdd
Striping is performed dynamically, so this creates a zero-redundancy RAID-0 pool.
One can see the status of the pool using the following command:
$ sudo zpool status pool-test
...and destroy it using:
$ sudo zpool destroy pool-test
A 2 x 2 mirrored zpool
In the following example, we create a zpool containing a 2 drive mirror:
$ sudo zpool create mypool mirror /dev/sdc /dev/sdd
Next, we add another 2 drive mirror to the pool:
$ sudo zpool add mypool mirror /dev/sde /dev/sdf -f
$ sudo zpool status
  pool: mypool
 state: ONLINE
  scan: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
In this example:
- /dev/sdc, /dev/sdd, /dev/sde, /dev/sdf are the physical devices
- mirror-0, mirror-1 are the VDEVs
- mypool is the pool
There are plenty of other ways to arrange VDEVs to create a zpool.
RAID
ZFS offers different RAID options:
Striped VDEVS
This is equivalent to RAID0. This has no parity and no mirroring to rebuild the data. This is not recommended because of the risk of losing data if a drive fails. Example, creating a striped pool using 4 VDEVs:
$ sudo zpool create example /dev/sdb /dev/sdc /dev/sdd /dev/sde
Mirrored VDEVs
Much like RAID1, one can use 2 or more VDEVs. With N VDEVs in the mirror, up to N-1 can fail without data loss; data is lost only if all N fail. Example, creating a mirrored pool with 2 VDEVs:
$ sudo zpool create example mirror /dev/sdb /dev/sdc
Striped Mirrored VDEVs
Much like RAID10, great for small random read I/O. Create mirrored pairs and then stripe data over the mirrors. Example, creating striped 2 x 2 mirrored pool:
$ sudo zpool create example mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde
or:
$ sudo zpool create example mirror /dev/sdb /dev/sdc
$ sudo zpool add example mirror /dev/sdd /dev/sde
RAIDZ
Like RAID5, this uses a variable-width stripe for parity. It allows one to get the most capacity out of a group of disks with parity checking, at some cost to performance, and tolerates a single disk failure without losing data. Example, creating a 4 VDEV RAIDZ:
$ sudo zpool create example raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde
RAIDZ2
Like RAID6, with double the parity for 2 disk failures with performance similar to RAIDZ. Example, create a 2 parity 5 VDEV pool:
$ sudo zpool create example raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
RAIDZ3
Triple parity, allowing for 3 disk failures before losing data, with performance like RAIDZ2 and RAIDZ. Example, create a 3 parity 6 VDEV pool:
$ sudo zpool create example raidz3 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
Nested RAIDZ
Like RAID50, RAID60, striped RAIDZ volumes. This is better performing than RAIDZ but at the cost of reducing capacity. Example, 2 x RAIDZ:
$ sudo zpool create example raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde
$ sudo zpool add example raidz /dev/sdf /dev/sdg /dev/sdh /dev/sdi
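As a rough aid to comparing the layouts above, the following sketch estimates how many disks' worth of space is left for data in a single VDEV built from equal-sized disks. The helper name usable_disks is hypothetical and not part of the ZFS tools, and the arithmetic ignores metadata and padding overhead, so the result is best read as an upper bound.

```shell
#!/bin/sh
# usable_disks LAYOUT NDISKS
# Prints the approximate number of disks' worth of usable space for one
# VDEV made of NDISKS equal-sized disks. Hypothetical helper; it ignores
# metadata and padding overhead.
usable_disks() {
    layout=$1
    n=$2
    case "$layout" in
        stripe)       echo "$n" ;;         # no redundancy, every disk holds data
        mirror)       echo 1 ;;            # every disk holds the same data
        raidz|raidz1) echo $((n - 1)) ;;   # one disk's worth of parity
        raidz2)       echo $((n - 2)) ;;   # two disks' worth of parity
        raidz3)       echo $((n - 3)) ;;   # three disks' worth of parity
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}

usable_disks raidz 4     # the 4 disk RAIDZ example
usable_disks raidz2 5    # the 5 disk RAIDZ2 example
usable_disks raidz3 6    # the 6 disk RAIDZ3 example
```

For the three parity examples above, each call prints 3: every layout leaves three disks' worth of space for data while tolerating one, two and three disk failures respectively.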
ZFS Intent Logs
ZIL (ZFS Intent Log) drives can be added to a ZFS pool to speed up the write capabilities of any level of ZFS RAID. One normally would use a fast SSD for the ZIL. ZFS writes metadata to the ZIL and when the drives in the pool are idle the metadata on the ZIL is flushed to the pool. One or more drives can be used for the ZIL.
For example, to add two SSDs as a mirrored log device to the pool 'mypool', use:
$ sudo zpool add mypool log mirror /dev/sdb /dev/sdg -f
ZFS Cache Drives
Cache devices provide an additional layer of caching between main memory and disk. They are especially useful to improve random-read performance of mainly static data.
For example, to add a cache drive /dev/sdh to the pool 'mypool', use:
$ sudo zpool add mypool cache /dev/sdh -f
ZFS file systems
ZFS allows one to create a maximum of 2^64 file systems per pool. In the following example, we create two file systems in the pool 'mypool':
sudo zfs create mypool/tmp
sudo zfs create mypool/projects
and to destroy a file system, use:
sudo zfs destroy mypool/tmp
Each ZFS file system can have properties set, for example, a maximum quota of 10 gigabytes:
sudo zfs set quota=10G mypool/projects
or enabling compression:
sudo zfs set compression=on mypool/projects
ZFS Snapshots
A ZFS snapshot is a read-only copy of a ZFS file system or volume. It can be used to save the state of a ZFS file system at a point in time, and one can roll back to this state at a later date. One can even extract files from a snapshot without performing a complete roll back.
In the following example, we snapshot the mypool/projects file system:
$ sudo zfs snapshot -r mypool/projects@snap1
...and we can see the collection of snapshots using:
$ sudo zfs list -t snapshot
NAME                    USED   AVAIL  REFER  MOUNTPOINT
mypool/projects@snap1   8.80G  -      8.80G  -
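Individual files can be read from a snapshot through the hidden .zfs/snapshot directory under the file system's mount point, without rolling anything back. The sketch below only builds that path; the helper name snapshot_path and the file name report.txt are hypothetical, and it assumes mypool/projects is mounted at its default location, /mypool/projects.

```shell
#!/bin/sh
# snapshot_path MOUNTPOINT SNAPSHOT RELATIVE_PATH
# Prints where a file from the named snapshot can be read, using the
# hidden .zfs/snapshot directory. Hypothetical helper for illustration.
snapshot_path() {
    printf '%s/.zfs/snapshot/%s/%s\n' "$1" "$2" "$3"
}

# report.txt is a made-up file name; /mypool/projects is the assumed
# default mount point of mypool/projects.
src=$(snapshot_path /mypool/projects snap1 report.txt)
echo "$src"

# Copy the old version back only if it really exists on this system.
if [ -f "$src" ]; then
    cp "$src" /mypool/projects/report.txt
fi
```

The echo prints /mypool/projects/.zfs/snapshot/snap1/report.txt, which can be copied like any other read-only file.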
Now let's 'accidentally' destroy all the files and then roll back:
$ rm -rf /mypool/projects
$ sudo zfs rollback mypool/projects@snap1
One can remove a snapshot using the following:
$ sudo zfs destroy mypool/projects@snap1
ZFS Clones
A ZFS clone is a writeable copy of a file system with the initial content of the clone being identical to the original file system. A ZFS clone can only be created from a ZFS snapshot and the snapshot cannot be destroyed until the clones created from it are also destroyed.
For example, to clone mypool/projects, first make a snapshot and then clone:
$ sudo zfs snapshot -r mypool/projects@snap1
$ sudo zfs clone mypool/projects@snap1 mypool/projects-clone
ZFS Send and Receive
ZFS send sends a snapshot of a filesystem that can be streamed to a file or to another machine. ZFS receive takes this stream and will write out the copy of the snapshot back as a ZFS filesystem. This is great for backups or sending copies over the network (e.g. using ssh) to copy a file system.
For example, make a snapshot and save it to a file:
sudo zfs snapshot -r mypool/projects@snap2
sudo zfs send mypool/projects@snap2 > ~/projects-snap.zfs
..and receive it back:
sudo zfs receive -F mypool/projects-copy < ~/projects-snap.zfs
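The same stream can be piped over ssh instead of going through a file. The sketch below only prints the pipeline so it can be reviewed before running it. The helper name replication_command, the host admin@backuphost and the pool backuppool are all hypothetical, and it assumes the remote user may run sudo zfs receive without a password prompt.

```shell
#!/bin/sh
# replication_command SNAPSHOT REMOTE TARGET
# Prints (does not run) a pipeline that streams SNAPSHOT to the file
# system TARGET on the machine REMOTE. Hypothetical helper.
replication_command() {
    printf 'sudo zfs send %s | ssh %s sudo zfs receive -F %s\n' "$1" "$2" "$3"
}

# admin@backuphost and backuppool are placeholders for a real host and pool.
replication_command mypool/projects@snap2 admin@backuphost backuppool/projects-copy
```

This prints: sudo zfs send mypool/projects@snap2 | ssh admin@backuphost sudo zfs receive -F backuppool/projects-copy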
ZFS Ditto Blocks
Ditto blocks store extra copies of data for added redundancy. With a storage pool of just one device, ditto blocks are spread across the device, trying to place the blocks at least 1/8 of the disk apart. With multiple devices in a pool, ZFS tries to spread ditto blocks across separate VDEVs. Between 1 and 3 copies can be set. For example, setting 3 copies on mypool/projects:
$ sudo zfs set copies=3 mypool/projects
ZFS Deduplication
ZFS dedup will discard blocks that are identical to existing blocks and will instead use a reference to the existing block. This saves space on the device but comes at a large cost to memory. The dedup in-memory table uses ~320 bytes per block. The greater the table is in size, the slower write performance becomes.
For example, to enable dedup on mypool/projects, use:
$ sudo zfs set dedup=on mypool/projects
For more pros/cons of deduping, refer to http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe. Deduplication is almost never worth the performance penalty.
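To get a feel for the memory cost, the figure of roughly 320 bytes per block can be turned into a back-of-the-envelope estimate. The sketch below assumes every block is unique and uses a 128 KiB block size unless told otherwise; the helper name dedup_table_mib is hypothetical, and real pools hold blocks of mixed sizes, so this is only an order-of-magnitude guide.

```shell
#!/bin/sh
# dedup_table_mib DATA_GIB [BLOCK_KIB]
# Estimates the in-memory dedup table size in MiB for DATA_GIB of unique
# data, at about 320 bytes per block. BLOCK_KIB defaults to 128 (an
# assumption; actual block sizes vary). Hypothetical helper.
dedup_table_mib() {
    data_gib=$1
    block_kib=${2:-128}
    blocks=$(( data_gib * 1024 * 1024 / block_kib ))   # GiB -> KiB -> blocks
    echo $(( blocks * 320 / 1024 / 1024 ))             # bytes -> MiB
}

dedup_table_mib 1024      # 1 TiB of data in 128 KiB blocks
dedup_table_mib 1024 8    # 1 TiB of data in 8 KiB blocks
```

The two calls print 2560 and 40960, that is about 2.5 GiB of table for 1 TiB of data in 128 KiB blocks, but about 40 GiB when the same data is stored in 8 KiB blocks.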
ZFS Pool Scrubbing
To initiate an explicit data integrity check on a pool, one uses the zpool scrub command. For example, to scrub pool 'mypool':
$ sudo zpool scrub mypool
One can check the status of the scrub using zpool status, for example:
$ sudo zpool status -v mypool
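When scripting around a scrub, it can help to pull just the completion percentage out of the status text. The sketch below is one way to do that with sed; the helper name scrub_progress is hypothetical, and it assumes the status output contains a line ending in something like "12.46% done", which may differ between ZFS versions. The sample text is fed in directly so the snippet runs without a pool.

```shell
#!/bin/sh
# scrub_progress
# Reads 'zpool status' text on standard input and prints the number in
# front of "% done", if such a line is present. Hypothetical helper; the
# exact wording of the status output can vary between ZFS versions.
scrub_progress() {
    sed -n 's/.* \([0-9.]*\)% done.*/\1/p'
}

# Sample status text, so the function can be tried without a real pool.
scrub_progress <<'EOF'
  scan: scrub in progress since Tue May 12 17:34:53 2015
        244M scanned out of 1.91G at 61.0M/s, 0h0m to go
        115M repaired, 12.46% done
EOF
```

With the sample text this prints 12.46. On a live system the same function would be fed from the real command, for example: sudo zpool status mypool | scrub_progress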
Data recovery, a simple example
Let's assume we have a 2 x 2 mirrored zpool:
$ sudo zpool create mypool mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf -f
$ sudo zpool status
  pool: mypool
 state: ONLINE
  scan: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
Now populate it with some data and checksum the data:
$ dd if=/dev/urandom of=/mypool/random.dat bs=1M count=4096
$ md5sum /mypool/random.dat
f0ca5a6e2718b8c98c2e0fdabd83d943  /mypool/random.dat
Now we simulate catastrophic data loss by overwriting one of the VDEV devices with zeros:
$ sudo dd if=/dev/zero of=/dev/sde bs=1M count=8192
And now initiate a scrub:
$ sudo zpool scrub mypool
And check the status:
$ sudo zpool status
  pool: mypool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Tue May 12 17:34:53 2015
        244M scanned out of 1.91G at 61.0M/s, 0h0m to go
        115M repaired, 12.46% done
config:
        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sde     ONLINE       0     0   948  (repairing)
            sdf     ONLINE       0     0     0
...now let us remove the drive from the pool:
$ sudo zpool detach mypool /dev/sde
..hot swap it out and add a new one back:
$ sudo zpool attach mypool /dev/sdf /dev/sde -f
..and initiate a scrub to repair the 2 x 2 mirror:
$ sudo zpool scrub mypool
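To confirm that the data survived, the file's checksum can be compared with the one recorded before the simulated failure. The sketch below wraps that comparison in a small function; the name verify_md5 is hypothetical, and the check against /mypool/random.dat only runs if that file exists on the system.

```shell
#!/bin/sh
# verify_md5 FILE EXPECTED_SUM
# Prints OK if the MD5 checksum of FILE equals EXPECTED_SUM, otherwise
# prints MISMATCH and returns a non-zero status. Hypothetical helper.
verify_md5() {
    file=$1
    expected=$2
    actual=$(md5sum "$file" | cut -d ' ' -f 1)   # keep only the checksum field
    if [ "$actual" = "$expected" ]; then
        echo "OK"
    else
        echo "MISMATCH"
        return 1
    fi
}

# Checksum taken from the md5sum output recorded earlier in this example.
if [ -f /mypool/random.dat ]; then
    verify_md5 /mypool/random.dat f0ca5a6e2718b8c98c2e0fdabd83d943
fi
```

An OK result here means the file read back from the repaired pool is byte-for-byte the same as what was originally written.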
ZFS compression
As mentioned earlier, one can compress data automatically with ZFS. With the speed of modern CPUs this is a useful option, as reduced data size means less data to physically read and write and hence faster I/O. ZFS offers a comprehensive range of compression methods. With gzip, the default compression level is 6, but one can change the compression level, e.g.
sudo zfs set compression=gzip-9 mypool
or even the compression type:
sudo zfs set compression=lz4 mypool
and check the achieved compression ratio using:
sudo zfs get compressratio
lz4 is significantly faster than the other options while still performing well; lz4 is the safest choice.
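The compressratio property is reported as a multiplier such as 2.00x, which can be easier to reason about as a percentage of space saved. The sketch below does that conversion with awk; the helper name space_saved_percent is hypothetical, and the commented zfs get line shows one assumed way of feeding it the value for mypool/projects.

```shell
#!/bin/sh
# space_saved_percent RATIO
# Converts a compression ratio such as "2.00x" into the percentage of
# space saved, computed as (1 - 1/ratio) * 100. Hypothetical helper.
space_saved_percent() {
    ratio=${1%x}    # drop the trailing "x" if present
    LC_ALL=C awk -v r="$ratio" 'BEGIN { printf "%.1f\n", (1 - 1 / r) * 100 }'
}

space_saved_percent 2.00x    # half the space is saved
space_saved_percent 1.25x

# On a system with ZFS, the ratio could be read directly, for example:
#   space_saved_percent "$(zfs get -H -o value compressratio mypool/projects)"
```

The two calls print 50.0 and 20.0, so a ratio of 1.25x means the data occupies a fifth less space than it would uncompressed.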