Linux device mapper writecache

Kicking It Into Overdrive

Other Caching Tools

Tools earning honorable mention include:

  • RapidDisk. This dynamically allocatable memory disk Linux module uses RAM and can also be used as a front-end write-through and write-around caching node for slower media.
  • Memcached. A cross-platform, in-memory key-value caching system with client APIs for many languages, Memcached also relies on RAM to boost the performance of databases and other applications.
  • ReadyBoost. A Microsoft product, ReadyBoost was introduced in Windows Vista and is included in later versions of Windows. Similar to dm-cache and bcache, ReadyBoost enables SSDs to act as a cache for slower HDDs.

Working with dm-writecache

The only prerequisites for using dm-writecache are to be on a Linux distribution running a 4.18 kernel or later and to have Logical Volume Manager 2 (LVM2) installed at version 2.03.x or above. I will also show you how to enable a dm-writecache volume without relying on the LVM2 framework by invoking dmsetup manually instead.
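
A quick way to sanity check both prerequisites is to query the running kernel and the installed LVM2 build (output will vary by distribution):

$ uname -r           # should report 4.18 or later
$ sudo lvm version   # the LVM version line should read 2.03.x or above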

Identifying and Configuring Your Environment

Identifying the storage volumes and configuring them is a pretty straightforward process (Listing 1).

Listing 1

Storage Volumes

$ cat /proc/partitions
major   minor   #blocks name
   7        0      91264 loop0
   7        1      56012 loop1
   7        2      90604 loop2
 259        0  244198584 nvme0n1
   8        0  488386584 sda
   8        1       1024 sda1
   8        2  488383488 sda2
   8       16 6836191232 sdb
   8       32 6836191232 sdc

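If it is not obvious from the device names which disks are rotational and which are solid state, lsblk can report that for you (the ROTA column shows 1 for spinning disks and 0 for SSD/NVMe devices):

$ lsblk -d -o NAME,SIZE,ROTA,MODEL
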
In my example, I will be using both /dev/sdb and /dev/nvme0n1. As you might have already guessed, /dev/sdb is my slow device, and /dev/nvme0n1 is my NVMe fast device. Because I do not necessarily want to use my entire SSD (the rest could be used as a separate standalone or cached device elsewhere), I will place both the SSD and HDD into a single LVM2 volume group. To begin, I label the physical volumes for LVM2:

$ sudo pvcreate /dev/nvme0n1
  Physical volume "/dev/nvme0n1" successfully created.
$ sudo pvcreate /dev/sdb
  Physical volume "/dev/sdb" successfully created.

Then, I verify that the volumes have been appropriately labeled (Listing 2).

Listing 2

Volume Labels

$ sudo pvs
  PV           VG Fmt  Attr    PSize    PFree
  /dev/nvme0n1    lvm2 ---  <232.89g <232.89g
  /dev/sdb        lvm2 ---    <6.37t   <6.37t

Next, I add both volumes into a new volume group labeled vg-cache,

$ sudo vgcreate vg-cache /dev/nvme0n1 /dev/sdb
  Volume group "vg-cache" successfully created

verify that the volume group has been created as seen in Listing 3, and verify that both physical volumes are within it, as in Listing 4.

Listing 3

Volume Group Created

$ sudo vgs
  VG       #PV #LV #SN Attr   VSize VFree
  vg-cache   2   0   0 wz--n- 6.59t 6.59t

Listing 4

Physical Volumes Present

$ sudo pvs
  PV           VG       Fmt  Attr PSize   PFree
  /dev/nvme0n1 vg-cache lvm2 a--  232.88g 232.88g
  /dev/sdb     vg-cache lvm2 a--   <6.37t  <6.37t

Say I want to use 90 percent of the slow disk: I will carve a logical volume labeled slow out of the volume group, placing it on the slow device,

$ sudo lvcreate -n slow -l90%FREE vg-cache /dev/sdb
  Logical volume "slow" created.

and verify that the logical volume has been created (Listing 5).

Listing 5

Slow Logical Volume Created

$ sudo lvs vg-cache -o+devices
  LV   VG       Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  slow vg-cache -wi-a----- <5.93t                                                     /dev/sdb(0)

Using the fio benchmarking utility, I run a quick test with random write I/Os to the slow logical volume to get a better understanding of how poorly it performs (Listing 6).

Listing 6

Test Slow Logical Volume

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 \
--filename=/dev/vg-cache/slow --rw=randwrite --numjobs=1 --name=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=1401KiB/s][r=0,w=350 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3104: Sat Oct 12 14:39:08 2019
  write: IOPS=352, BW=1410KiB/s (1444kB/s)(82.8MiB/60119msec)
[ ... ]
Run status group 0 (all jobs):
  WRITE: bw=1410KiB/s (1444kB/s), 1410KiB/s-1410KiB/s (1444kB/s-1444kB/s), io=82.8MiB (86.8MB), run=60119-60119msec

I see an average of about 1.4MiBps (1,410KiBps) of throughput. Although that number is not great, it is to be expected when sending small random writes to an HDD. Remember, with mechanical and movable components, a large percentage of the time is spent seeking to new locations on the disk platters. All of that seeking introduces latency, so the disk drive takes much longer to return an acknowledgment that the write is persistent on disk.
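
If you want to convince yourself that seek time, and not raw drive bandwidth, is the culprit, you can rerun the same fio job with sequential instead of random writes; only the --rw parameter changes, and on a typical HDD you should see throughput climb dramatically:

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 \
--filename=/dev/vg-cache/slow --rw=write --numjobs=1 --name=seqtest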

Now, I will carve out a 10GB logical volume from the SSD and label it fast,

$ sudo lvcreate -n fast -L 10G vg-cache /dev/nvme0n1

verify that the logical volume has been created (Listing 7) and verify that it is created from the NVMe drive (Listing 8).

Listing 7

Fast Logical Volume Created

$ sudo lvs
  LV     VG       Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  fast   vg-cache -wi-a-----  10.00g
  slow   vg-cache -wi-a-----   5.93t

Listing 8

Fast Logical Volume Created from NVMe Drive

$ sudo lvs vg-cache -o+devices
  LV     VG       Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  fast   vg-cache -wi-a-----  10.00g                                                     /dev/nvme0n1(0)
  slow   vg-cache -wi-a-----   5.93t                                                     /dev/sdb(0)

As in the example above, I will run another quick fio test with the same parameters, this time against the fast volume (Listing 9).

Listing 9

fio Test of Fast Logical Volume

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 \
--filename=/dev/vg-cache/fast --rw=randwrite --numjobs=1 --name=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=654MiB/s][w=167k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1225: Sat Oct 12 19:20:18 2019
  write: IOPS=168k, BW=655MiB/s (687MB/s)(10.0GiB/15634msec); 0 zone resets
[ ... ]
Run status group 0 (all jobs):
  WRITE: bw=655MiB/s (687MB/s), 655MiB/s-655MiB/s (687MB/s-687MB/s), io=10.0GiB (10.7GB), run=15634-15634msec

Wow! You can see a night-and-day difference here: about 655MiBps of throughput.

If you have not already, be sure to load the dm-writecache kernel module:

$ sudo modprobe dm-writecache
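
To confirm that the module is present, list it (lsmod reports the name with an underscore):

$ lsmod | grep dm_writecache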

To enable the writecache volume via LVM2, you will first need to deactivate both volumes to ensure that nothing is actively writing to them. To deactivate the SSD, enter:

$ sudo lvchange -a n vg-cache/fast

To deactivate the HDD, enter:

$ sudo lvchange -a n vg-cache/slow

Now, convert both volumes into a single cache volume,

$ sudo lvconvert --type writecache --cachevol fast vg-cache/slow

activate the new volume,

$ sudo lvchange -a y vg-cache/slow

and verify that the conversion took effect (Listing 10).

Listing 10

Conversion

$ sudo lvs -a vg-cache -o devices,segtype,lvattr,name,vgname,origin
  Devices          Type       Attr       LV            VG       Origin
  /dev/nvme0n1(0)  linear     Cwi-aoC--- [fast]        vg-cache
  slow_wcorig(0)   writecache Cwi-a-C--- slow          vg-cache [slow_wcorig]
  /dev/sdb(0)      linear     owi-aoC--- [slow_wcorig] vg-cache

Now it's time to run fio (Listing 11).

Listing 11

Run fio

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 \
--filename=/dev/vg-cache/slow --rw=randwrite --numjobs=1 --name=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=475MiB/s][w=122k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1634: Mon Oct 14 22:18:59 2019
  write: IOPS=118k, BW=463MiB/s (485MB/s)(10.0GiB/22123msec); 0 zone resets
[ ... ]
Run status group 0 (all jobs):
  WRITE: bw=463MiB/s (485MB/s), 463MiB/s-463MiB/s (485MB/s-485MB/s), io=10.0GiB (10.7GB), run=22123-22123msec

At about 460MiBps, it's almost 330 times faster than the plain old HDD. This is awesome. Remember, the NVMe is a front-end cache to the HDD, and although all writes are hitting the NVMe, a background thread (or more than one) schedules flushes to the backing store (i.e., the HDD).
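
The point at which that background writeback kicks in can be tuned. The dm-writecache parameters high_watermark and low_watermark, for example, are the percentages of used cache blocks at which writeback starts and stops, and LVM2 can pass them to the kernel through the --cachesettings option at conversion time (recent LVM2 releases also accept them via lvchange --cachesettings). The exact set of supported settings depends on your LVM2 and kernel versions (see lvmcache(7)), so treat the command below, which would replace the plain lvconvert call shown earlier, purely as an illustration:

$ sudo lvconvert --type writecache --cachevol fast \
--cachesettings 'high_watermark=60 low_watermark=40' vg-cache/slow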

If you want to remove the volume, type:

$ sudo lvconvert --splitcache vg-cache/slow

Now you are ready to map the NVMe drive as the writeback cache for the slow spinning drive with dmsetup (useful in the event that you do not have a recent enough version of LVM2 installed). To invoke dmsetup, you first need to grab the size of the slow device in 512-byte sectors:

$ sudo blockdev --getsz /dev/vg-cache/slow
12744687616

You will plug this number into the next command and create a writecache device mapper virtual node called wc with a 4K blocksize:

$ sudo dmsetup create wc --table "0 12744687616 writecache s /dev/vg-cache/slow /dev/vg-cache/fast 4096 0"
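
Rather than copying the sector count by hand, you can capture it in a shell variable and substitute it into the table; this is a minimal sketch of the same command:

$ SLOW_SECTORS=$(sudo blockdev --getsz /dev/vg-cache/slow)
$ sudo dmsetup create wc --table "0 $SLOW_SECTORS writecache s /dev/vg-cache/slow /dev/vg-cache/fast 4096 0"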

Assuming that the command returns without an error, a new (virtual) device node will be accessible from /dev/mapper/wc. This is the dm-writecache mapping. Now you need to run fio again, but this time to the newly created device (Listing 12).

Listing 12

Run fio to New Device

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 \
--filename=/dev/mapper/wc --rw=randwrite --numjobs=1 --name=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=7055: Sat Oct 12 19:09:53 2019
  write: IOPS=34.8k, BW=136MiB/s (143MB/s)(9.97GiB/75084msec); 0 zone resets
[ ... ]
Run status group 0 (all jobs):
  WRITE: bw=136MiB/s (143MB/s), 136MiB/s-136MiB/s (143MB/s-143MB/s), io=9.97GiB (10.7GB), run=75084-75084msec
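
While the test runs (or at any point afterward), you can watch the cache fill and drain with dmsetup status. For a writecache target, the status line reports an error indicator, the total number of cache blocks, the number of free blocks, and the number of blocks currently under writeback (newer kernels append additional counters):

$ sudo dmsetup status wc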

Although it isn't near the standalone NVMe speeds, you can see a wonderful improvement in random write performance: at 136MiBps, throughput is roughly 90 times that of the original HDD. I am not entirely sure which parameters the dmsetup create invocation leaves unconfigured compared with the earlier LVM2 example, but this is still pretty darn good.
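
If you want to experiment, dm-writecache accepts optional tuning arguments (the same high_watermark, low_watermark, and related settings mentioned earlier) at the end of the table line; the count after the block size is the number of words that follow, with each key/value pair counting as two. You would have to remove and recreate the wc mapping to apply them. The parameter names come from the kernel's device mapper writecache documentation, and the values here are only an illustration:

$ sudo dmsetup create wc --table "0 12744687616 writecache s /dev/vg-cache/slow \
/dev/vg-cache/fast 4096 4 high_watermark 60 writeback_jobs 1024"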

To remove the device mapper cache mapping, you first need to force a manual flush of all pending write data to disk:

$ sudo dmsetup message /dev/mapper/wc 0 flush

Now it is safe to enter

$ sudo dmsetup remove /dev/mapper/wc

to remove the mapping.
