Tuning ZFS for Speed on Linux
ZFS Tuning for HPC
If you manage storage servers, chances are you are already aware of ZFS and some of the features and functions it boasts. In short, ZFS is a combined all-purpose filesystem and volume manager that simplifies data storage management while offering some advanced features, including drive pooling with software RAID support, file snapshots, in-line data compression, data deduplication, built-in data integrity, advanced caching (to DRAM and SSD), and more.
ZFS is licensed under the Common Development and Distribution License (CDDL), a weak copyleft license based on the Mozilla Public License (MPL). Although open source, ZFS and anything else under the CDDL was, and supposedly still is, incompatible with the GNU General Public License (GPL). This hasn't stopped ZFS enthusiasts from porting it over to the Linux kernel, where it remains a side project under the dominion of the ZFS on Linux [1] (ZoL) project.
The ZoL project not only helped introduce the advanced filesystem to Linux users, it also garnered its fair share of users, some developers, and an entire community to support it. With a significant user base and the filesystem seeing use in a wide variety of applications (HPC included), it often becomes necessary to know how to tune the filesystem and understand which knobs to turn.
Be aware that if you decide to apply the methods described in this article, you should do so with caution, and only after dry runs, before rolling them out into production.
Creating the Test Environment
To begin, you need a server (or virtual machine) with one or more spare drives. I advise more than one because when it comes to performance, spreading I/O load across more disk drives instead of bottlenecking a single drive helps significantly. Therefore, I use four local drives – sdc, sdd, sde, and sdf – in this article:
$ cat /proc/partitions|grep sd
   8        0  488386584 sda
   8        1       1024 sda1
   8        2  488383488 sda2
   8       16   39078144 sdb
   8       32 6836191232 sdc
   8       64 6836191232 sde
   8       48 6836191232 sdd
   8       80 6836191232 sdf
Make sure to load the ZFS modules,
$ sudo modprobe zfs
and verify that they are loaded:
$ lsmod|grep zfs
zfs                  3039232  3
zunicode              331776  1 zfs
zavl                   16384  1 zfs
icp                   253952  1 zfs
zcommon                65536  1 zfs
znvpair                77824  2 zfs,zcommon
spl                   102400  4 zfs,icp,znvpair,zcommon
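If you want the zfs module to come up automatically at boot on a systemd-based distribution (and your ZFS packages do not already arrange this for you), one option is a modules-load.d entry. The file name below is only an assumption for illustration:

$ echo zfs | sudo tee /etc/modules-load.d/zfs.conf
zfs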
With the four drives identified above, I create a ZFS RAIDZ pool, which is equivalent to RAID5,
$ sudo zpool create -f myvol raidz sdc sdd sde sdf
and verify the status of the pool (Listing 1) and that it has been mounted (Listing 2).
Listing 1
Pool Status
$ zpool status
  pool: myvol
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        myvol       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0

errors: No known data errors
Listing 2
Pool Mounted
$ df -ht zfs
Filesystem      Size  Used Avail Use% Mounted on
myvol            18T  128K   18T   1% /myvol
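As an aside, and in the spirit of the dry runs mentioned earlier, zpool create accepts a -n flag that only prints the layout it would build without touching the drives, so you can preview a pool before committing to it. The ashift=12 property shown here is an assumption for drives with 4KiB physical sectors and is worth checking against your hardware:

$ sudo zpool create -n -o ashift=12 myvol raidz sdc sdd sde sdf

Once the preview looks right, drop the -n to create the pool for real.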
Some Basic Tuning
A few general procedures can tune a ZFS filesystem for performance, such as disabling file access time updates in the file metadata. Historically, filesystems have always tracked when a user or application accesses a file, logging the most recent time of access even if the file was only read and not modified. Constantly updating this field can hurt metadata performance. To avoid the unnecessary I/O, simply turn off the atime parameter:
$ sudo zfs set atime=off myvol
To verify that it has been turned off, use the zfs get atime command:
$ sudo zfs get atime myvol
NAME   PROPERTY  VALUE  SOURCE
myvol  atime     off    local
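If some application on your system still expects access times, a possible compromise, assuming a reasonably recent ZoL release, is the relatime property, which (with atime enabled) only updates the access time when it is older than the modification time or more than a day old:

$ sudo zfs set atime=on myvol
$ sudo zfs set relatime=on myvol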
Another parameter that can affect performance is compression. Although some algorithms (e.g., LZ4) are known to perform extremely well compared with their counterparts, compression still eats up a bit of CPU time. Therefore, disable filesystem compression,
$ sudo zfs set compression=off myvol
and verify that compression has been turned off:
$ sudo zfs get compression myvol
NAME   PROPERTY     VALUE  SOURCE
myvol  compression  off    default
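Conversely, if your data compresses well and your workload is more disk bound than CPU bound, LZ4 often pays for the CPU it uses by reducing the amount of data read from and written to disk. Re-enabling it for a comparison during your dry runs is a one-liner:

$ sudo zfs set compression=lz4 myvol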
To view all available parameters, use zfs get all (Listing 3).
Listing 3
View Parameters
$ zfs get all myvol
NAME   PROPERTY              VALUE                  SOURCE
myvol  type                  filesystem             -
myvol  creation              Sat Feb 22 22:09 2020  -
myvol  used                  471K                   -
[ ... ]
ARC
Operating systems have relied on local (and volatile) memory (e.g., DRAM) to cache file data for decades. Waiting for a disk drive to read the requested data can be painfully slow, so operating systems – and, in turn, filesystems – cache data content in the hopes of never having to touch the back-end storage device. ZFS implements its own cache, the adaptive replacement cache (ARC), which is not based on a simple least recently used (LRU) scheme. In a standard LRU cache, the least recently used page of cache data is replaced with new cache data. The ARC is a bit more intelligent than this and maintains lists for:
1. recently cached entries,
2. recently cached entries that have been accessed more than once,
3. entries evicted from the list of (1) recently cached entries, and
4. entries evicted from the list of (2) recently cached entries that have been accessed more than once.
Caching reads is an extremely difficult task to accomplish. It is not possible to predict which data will need to remain in cache, and because read I/O profiles tend to be randomized, the likelihood is high that data will be evicted before it is needed again and then have to be read back into the cache.
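To watch how well the ARC copes with your particular read profile while a benchmark runs, ZFS on Linux ships an arcstat utility (named arcstat.py in older releases) that samples the cache counters at an interval; the exact column set varies between releases. Here it prints five one-second samples:

$ arcstat 1 5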
The amount of memory the ARC can use on your local system can be managed in multiple ways. For instance, if you want to cap it at 4GB, you can pass that value to the ZFS module when loading it with the zfs_arc_max parameter:
$ sudo modprobe zfs zfs_arc_max=4294967296
Or, you can create a configuration file for modprobe called /etc/modprobe.d/zfs.conf and save the following content in it:
options zfs zfs_arc_max=4294967296
You can verify the current setting of this parameter by viewing it under sysfs:
$ cat /sys/module/zfs/parameters/zfs_arc_max
0
Also, you can modify that same parameter over the same sysfs interface:
$ echo 4294967296 |sudo tee -a /sys/module/zfs/parameters/zfs_arc_max
4294967296
$ cat /sys/module/zfs/parameters/zfs_arc_max
4294967296
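Rather than hard-coding 4GB, you could also derive the cap from the installed memory. The following sketch dedicates half of physical RAM to the ARC, both persistently and for the running system; the 50 percent figure is only an assumption to adjust for your workload, and note that the tee call overwrites any existing /etc/modprobe.d/zfs.conf:

$ ARC_MAX=$(( $(awk '/^MemTotal/ {print $2}' /proc/meminfo) * 1024 / 2 ))
$ echo "options zfs zfs_arc_max=${ARC_MAX}" | sudo tee /etc/modprobe.d/zfs.conf
$ echo ${ARC_MAX} | sudo tee /sys/module/zfs/parameters/zfs_arc_max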
If you are ever interested in viewing the ARC statistics, they are all available in procfs (Listing 4).
Listing 4
ARC Statistics
$ cat /proc/spl/kstat/zfs/arcstats
13 1 0x01 96 26112 26975127196 517243166877
name                            type data
hits                            4    691
misses                          4    254
demand_data_hits                4    0
demand_data_misses              4    0
demand_metadata_hits            4    691
demand_metadata_misses          4    254
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    0
prefetch_metadata_misses        4    0
mru_hits                        4    88
mru_ghost_hits                  4    0
mfu_hits                        4    603
mfu_ghost_hits                  4    0
deleted                         4    0
mutex_miss                      4    0
access_skip                     4    0
evict_skip                      4    0
[ ... ]
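The raw counters become more meaningful once you turn them into ratios. A minimal awk sketch that computes the overall hit ratio and shows how the hits split between the MRU and MFU lists described earlier (using only the fields shown in Listing 4) could look like this; with the counters above, it reports a hit ratio of about 73 percent:

$ awk '
    /^hits /    {h=$3}
    /^misses /  {m=$3}
    /^mru_hits/ {mru=$3}
    /^mfu_hits/ {mfu=$3}
    END {if (h+m) printf "ARC hit ratio: %.1f%% (MRU hits: %d, MFU hits: %d)\n", 100*h/(h+m), mru, mfu}
  ' /proc/spl/kstat/zfs/arcstats
ARC hit ratio: 73.1% (MRU hits: 88, MFU hits: 603)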