Lead Image © nikkikii, 123RF.com

Rethinking RAID (on Linux)

Madam, I'm mdadm

Article from ADMIN 62/2021
By Petros Koutoupis
Configure redundant storage arrays to boost overall data access throughput while maintaining fault tolerance.

Often, you find yourself trying to eke out a bit more performance from a computer system you are building or recycling, usually with limited funds at your disposal. Sure, you can tamper with CPU or memory settings, but if the I/O hitting the system needs to touch the underlying storage devices, those CPU tunings will make little to no difference.

In previous articles, I have shared methods by which one can boost write and read performance to slower disk devices by leveraging both solid state drives (SSD) and dynamic random access memory (DRAM) as a cache [1]. This time, I will instead shift focus to a unique way you can configure redundant storage arrays so that you not only boost overall data access throughput but also maintain fault tolerance. The following examples center around a multiple-device redundant array of inexpensive (or independent) disks (MD RAID) in Linux and its userland utility mdadm [2].

Conventional wisdom has always dictated that spreading the I/O load across multiple disk drives, rather than bottlenecking on a single drive, helps significantly as workloads grow. For instance, if instead of writing to a single disk drive you split the I/O requests and write that same amount of data as a stripe across multiple drives (e.g., RAID0), you reduce the amount of work any single drive must perform to accomplish the same task. For magnetic spinning disks (i.e., hard disk drives, HDDs), the advantage should be even more noticeable: the time it takes to seek across the medium introduces latency, and with randomly accessed I/O patterns, the throughput of a single drive suffers as a result. A striped approach does not solve all of these problems, but it does help a bit.
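
For comparison only, a two-drive stripe (RAID0) is created with the same mdadm tool used later in this article; the device names below are placeholders, and the remaining examples stick with mirrors:

$ sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdX1 /dev/sdY1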

In this article, I look at something entirely different. I spend more time focusing on increasing read throughput by way of RAID1 mirrors. In the first example, I discuss the traditional read balance in mirrored volumes (where read operations are balanced across both volumes in the mirrored set). The next examples are of read-preferred (or write-mostly) drives in a mirrored volume incorporating non-volatile media such as SSD or volatile media such as a ramdisk.

In my system, I have identified the following physical drives that I will be using in my examples:

$ cat /proc/partitions |grep -e sd[c,d] -e nvm
 259        0  244198584 nvme0n1
 259        2  244197543 nvme0n1p1
   8       32 6836191232 sdc
   8       33  244197544 sdc1
   8       48 6836191232 sdd
   8       49  244197544 sdd1

Notice that I have one non-volatile memory express (NVMe) drive and two serial-attached SCSI (SAS) drives. Later, I will introduce a ramdisk. Also notice that a single partition has been carved out on each drive, approximately equal in size across drives, which, as you will see, is necessary when working with the RAID logic.
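
If you still need to carve out matching partitions yourself, something along the following lines would do it. The GPT label and the 233GiB end point are assumptions chosen to fit the smaller device, so adjust them to your hardware; note that mklabel wipes the existing partition table:

$ sudo parted -s /dev/sdc mklabel gpt mkpart primary 1MiB 233GiB
$ sudo parted -s /dev/sdd mklabel gpt mkpart primary 1MiB 233GiB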

A quick benchmark of one of the SAS volumes with the fio performance benchmarking utility establishes a baseline for both random write (Listing 1) and random read (Listing 2) operations. The results show that the single HDD has a throughput of 1.4MBps for random writes and 1.9MBps for random reads.

Listing 1

Random Write Test

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 --filename=/dev/sdc1 --rw=randwrite --numjobs=1 --name=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1420KiB/s][w=355 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3377: Sat Jan  9 15:31:04 2021
  write: IOPS=352, BW=1410KiB/s (1444kB/s)(82.9MiB/60173msec); 0 zone resets
    [ ... ]
Run status group 0 (all jobs):
  WRITE: bw=1410KiB/s (1444kB/s), 1410KiB/s-1410KiB/s (1444kB/s-1444kB/s), io=82.9MiB (86.9MB), run=60173-60173msec
Disk stats (read/write):
  sdc1: ios=114/21208, merge=0/0, ticks=61/1920063, in_queue=1877884, util=98.96%

Listing 2

Random Read Test

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 --filename=/dev/sdc1 --rw=randread --numjobs=1 --name=test
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=1896KiB/s][r=474 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3443: Sat Jan  9 15:32:51 2021
  read: IOPS=464, BW=1858KiB/s (1903kB/s)(109MiB/60099msec)
     [ ... ]
Run status group 0 (all jobs):
   READ: bw=1858KiB/s (1903kB/s), 1858KiB/s-1858KiB/s (1903kB/s-1903kB/s), io=109MiB (114MB), run=60099-60099msec
Disk stats (read/write):
  sdc1: ios=27838/0, merge=0/0, ticks=1912861/0, in_queue=1856892, util=98.07%

To test a RAID1 mirror's read balance performance, I create a mirrored volume with the two HDDs identified earlier (Listing 3) and then view the status (Listing 4) and details (Listing 5) of the RAID volume.

Listing 3

Create a Mirrored Volume

$ sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
Continue creating array? y
mdadm: Fail create md0 when using /sys/module/md_mod/parameters/new_array
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

Listing 4

View RAID Status

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdd1[1] sdc1[0]
      244065408 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  6.4% (15812032/244065408) finish=19.1min speed=198449K/sec
      bitmap: 2/2 pages [8KB], 65536KB chunk
unused devices: <none>

Listing 5

View RAID Details

$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Sat Jan  9 15:22:29 2021
        Raid Level : raid1
        Array Size : 244065408 (232.76 GiB 249.92 GB)
     Used Dev Size : 244065408 (232.76 GiB 249.92 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent
     Intent Bitmap : Internal
       Update Time : Sat Jan  9 15:24:20 2021
             State : clean, resyncing
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0
Consistency Policy : bitmap
     Resync Status : 9% complete
              Name : dev-machine:0  (local to host dev-machine)
              UUID : a84b0db5:8a716c6d:ce1e9ca6:8265de17
            Events : 22
    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1

You will immediately notice that the array is initializing: the drives are being resynchronized so that both members hold identical data. You can use the array in this state, but overall performance will suffer until the resync finishes. Also, resist the temptation to skip the initial resync with the --assume-clean option. Even if the drives are new out of the box, it is better to know that your array is in a proper state before writing important data to it. This process will take a while, and the bigger the array, the longer the initialization. Before proceeding with the follow-up benchmarking tests, wait until the volume synchronization completes.
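
If you prefer not to poll /proc/mdstat by hand, mdadm can also block until the resync finishes and return only once the array is idle:

$ sudo mdadm --wait /dev/md0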

Next, verify that the mirror initialization has completed,

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdd1[1] sdc1[0]
   244065408 blocks super 1.2 [2/2] [UU]
   bitmap: 1/2 pages [4KB], 65536KB chunk
unused devices: <none>

and repeat the random write and read tests from before, but this time against the RAID volume (i.e., /dev/md0). Remember, the first random writes were 1.4MBps and random reads 1.9MBps. The good news is that although random writes dropped a tiny bit to 1.2MBps (Listing 6), random reads nearly doubled to 3.3MBps (Listing 7).

Listing 6

Random Write to RAID

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 --filename=/dev/md0 --rw=randwrite --numjobs=1 --name=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1280KiB/s][w=320 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3478: Sat Jan  9 15:46:11 2021
  write: IOPS=308, BW=1236KiB/s (1266kB/s)(72.5MiB/60102msec); 0 zone resets
        [ ... ]
Run status group 0 (all jobs):
  WRITE: bw=1236KiB/s (1266kB/s), 1236KiB/s-1236KiB/s (1266kB/s-1266kB/s), io=72.5MiB (76.1MB), run=60102-60102msec
Disk stats (read/write):
    md0: ios=53/18535, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=33/18732, aggrmerge=0/0, aggrticks=143/1173174, aggrin_queue=1135748, aggrutil=96.50%
  sdd: ios=13/18732, merge=0/0, ticks=93/1123482, in_queue=1086112, util=96.09%
  sdc: ios=54/18732, merge=0/0, ticks=194/1222866, in_queue=1185384, util=96.50%

Listing 7

Random Read from RAID

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 --filename=/dev/md0 --rw=randread --numjobs=1 --name=test
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=3184KiB/s][r=796 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3467: Sat Jan  9 15:44:42 2021
  read: IOPS=806, BW=3226KiB/s (3303kB/s)(189MiB/60061msec)
     [ ... ]
Run status group 0 (all jobs):
   READ: bw=3226KiB/s (3303kB/s), 3226KiB/s-3226KiB/s (3303kB/s-3303kB/s), io=189MiB (198MB), run=60061-60061msec
Disk stats (read/write):
    md0: ios=48344/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=24217/0, aggrmerge=0/0, aggrticks=959472/0, aggrin_queue=910458, aggrutil=96.15%
  sdd: ios=24117/0, merge=0/0, ticks=976308/0, in_queue=927464, util=96.09%
  sdc: ios=24318/0, merge=0/0, ticks=942637/0, in_queue=893452, util=96.15%

NVMe

Now I will introduce NVMe into the mix: two drives, one NVMe and one HDD, in the same mirror. The mdadm utility offers a neat little feature, the --write-mostly argument, which tells md to direct read operations to the other members and use the flagged drives mostly for writes (they are only read from if the preferred devices fail).
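
Because /dev/md0 and /dev/sdc1 were just used in the previous read-balance mirror, stop that array before building the new one; if you also want a clean slate, you can wipe the stale md superblock from the reused partition (this destroys the old array metadata on it):

$ sudo mdadm --stop /dev/md0
$ sudo mdadm --zero-superblock /dev/sdc1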

Now create the RAID volume (Listing 8). Next, view the RAID volume's details and pay particular attention to the drive labeled writemostly (Listing 9).

Listing 8

Create the RAID Volume

$ sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1p1 --write-mostly /dev/sdc1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdc1 appears to be part of a raid array:
       level=raid1 devices=2 ctime=Sat Jan  9 15:22:29 2021
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

Listing 9

View the RAID Volume

$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Sat Jan  9 15:52:00 2021
        Raid Level : raid1
        Array Size : 244065408 (232.76 GiB 249.92 GB)
     Used Dev Size : 244065408 (232.76 GiB 249.92 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent
     Intent Bitmap : Internal
       Update Time : Sat Jan  9 15:52:21 2021
             State : clean, resyncing
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0
Consistency Policy : bitmap
     Resync Status : 1% complete
              Name : dev-machine:0  (local to host dev-machine)
              UUID : 833033c5:cd9b78de:992202ee:cb1bf77f
            Events : 4
    Number   Major   Minor   RaidDevice State
       0     259        2        0      active sync   /dev/nvme0n1p1
       1       8       33        1      active sync writemostly   /dev/sdc1
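
Incidentally, the writemostly flag does not have to be set at creation time; on a running array, the md sysfs interface lets you toggle it per member. A quick sketch (the dev-sdc1 entry depends on your member names):

$ echo writemostly | sudo tee /sys/block/md0/md/dev-sdc1/state    # set the flag
$ echo -writemostly | sudo tee /sys/block/md0/md/dev-sdc1/state   # clear it again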

Then, repeat the same fio tests by executing the random write test (Listing 10) and the random read test (Listing 11).

Listing 10

Random Write with NVMe

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 --filename=/dev/md0 --rw=randwrite --numjobs=1 --name=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1441KiB/s][w=360 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3602: Sat Jan  9 16:14:10 2021
  write: IOPS=342, BW=1371KiB/s (1404kB/s)(80.5MiB/60145msec); 0 zone resets
     [ ... ]
Run status group 0 (all jobs):
  WRITE: bw=1371KiB/s (1404kB/s), 1371KiB/s-1371KiB/s (1404kB/s-1404kB/s), io=80.5MiB (84.4MB), run=60145-60145msec
Disk stats (read/write):
    md0: ios=100/20614, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=103/20776, aggrmerge=0/0, aggrticks=12/920862, aggrin_queue=899774, aggrutil=97.47%
  nvme0n1: ios=206/20776, merge=0/0, ticks=24/981, in_queue=40, util=95.01%
  sdc: ios=0/20776, merge=0/0, ticks=0/1840743, in_queue=1799508, util=97.47%

Listing 11

Random Read with NVMe

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 --filename=/dev/md0 --rw=randread --numjobs=1 --name=test
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=678MiB/s][r=173k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3619: Sat Jan  9 16:14:53 2021
  read: IOPS=174k, BW=682MiB/s (715MB/s)(10.0GiB/15023msec)
        [ ... ]
Run status group 0 (all jobs):
   READ: bw=682MiB/s (715MB/s), 682MiB/s-682MiB/s (715MB/s-715MB/s), io=10.0GiB (10.7GB), run=15023-15023msec
Disk stats (read/write):
    md0: ios=2598587/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=1310720/0, aggrmerge=0/0, aggrticks=25127/0, aggrin_queue=64, aggrutil=99.13%
  nvme0n1: ios=2621440/0, merge=0/0, ticks=50255/0, in_queue=128, util=99.13%
  sdc: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

Wow. Random writes see only a very small increase from the NVMe (up to 1.4MBps): with RAID1 you are only as fast as your slowest drive, so the HDD keeps write throughput hovering around the original single-drive baseline of 1.4MBps. The random reads are another story. The original single-HDD benchmark for random reads was 1.9MBps, and the read-balanced mirrored HDDs saw 3.3MBps. Here, with the NVMe volume serving the reads, throughput is a whopping 715MBps! I wonder if it can be better?

Ramdisk

What would happen if I introduced a ramdisk into the picture? The goal is to boost read operations while still persisting the data across system reboots. This should not be confused with caching: the data is not being staged temporarily on the ramdisk before being persisted to a backing store.

In the next example, the ramdisk is treated like a backing store, even though the volatile medium technically isn't one. Before proceeding, I need to carve out a partition on each HDD roughly the same size as the ramdisk. I have chosen a meager 2GB because the older system I am using does not have much memory installed to begin with:

$ cat /proc/partitions |grep -e sd[c,d]
   8       48 6836191232 sdd
   8       49  244197544 sdd1
   8       50    2097152 sdd2
   8       32 6836191232 sdc
   8       33  244197544 sdc1
   8       34    2097152 sdc2

Now to add the ramdisk. As a prerequisite, you need to ensure that the jansson development library is installed on your local machine. Clone the rapiddisk Git repository [3], build the package, install it,

$ git clone https://github.com/pkoutoupis/rapiddisk.git
$ cd rapiddisk/
$ make
$ sudo make install

insert the kernel modules,

$ sudo modprobe rapiddisk
$ sudo modprobe rapiddisk-cache

verify that the modules are installed,

$ lsmod|grep rapiddisk
rapiddisk_cache        20480  0
rapiddisk              20480  0

create a single 2GB ramdisk,

$ sudo rapiddisk --attach 2048
rapiddisk 6.0
Copyright 2011 - 2019 Petros Koutoupis
Attached device rd0 of size 2048 Mbytes

verify that the ramdisk has been created,

$ sudo rapiddisk --list
rapiddisk 6.0
Copyright 2011 - 2019 Petros Koutoupis
List of RapidDisk device(s):
 RapidDisk Device 1: rd0 Size (KB): 2097152
List of RapidDisk-Cache mapping(s):
  None

and create a mirrored volume with the ramdisk as the primary and one of the smaller HDD partitions set to write-mostly (Listing 12). Now, verify the RAID1 mirror state:

Listing 12

Create Mirrored Volume

$ sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/rd0 --write-mostly /dev/sdc2
mdadm: /dev/rd0 appears to be part of a raid array:
       level=raid1 devices=2 ctime=Sat Jan  9 16:32:35 2021
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdc2 appears to be part of a raid array:
       level=raid1 devices=2 ctime=Sat Jan  9 16:32:35 2021
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc2[1](W) rd0[0]
   2094080 blocks super 1.2 [2/2] [UU]
unused devices: <none>

The initialization time should be relatively quick here. Also, verify the RAID1 mirror details (Listing 13) and rerun the random write I/O test (Listing 14). As you saw with the NVMe drive earlier, you also see a small bump in random write operations at approximately 1.6MBps. Remember that you are only as fast as your slowest disk (i.e., the HDD paired with the ramdisk in the mirrored set).

Listing 13

Verify RAID1 Mirror Details

$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Sat Jan  9 16:32:43 2021
        Raid Level : raid1
        Array Size : 2094080 (2045.00 MiB 2144.34 MB)
     Used Dev Size : 2094080 (2045.00 MiB 2144.34 MB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent
       Update Time : Sat Jan  9 16:32:54 2021
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0
Consistency Policy : resync
              Name : dev-machine:0  (local to host dev-machine)
              UUID : 79387934:aaaad032:f56c6261:de230a86
            Events : 17
    Number   Major   Minor   RaidDevice State
       0     252        0        0      active sync   /dev/rd0
       1       8       34        1      active sync writemostly   /dev/sdc2

Listing 14

Ramdisk Random Write

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 --filename=/dev/md0 --rw=randwrite --numjobs=1 --name=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1480KiB/s][w=370 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=5854: Sat Jan  9 16:34:44 2021
  write: IOPS=395, BW=1581KiB/s (1618kB/s)(92.9MiB/60175msec); 0 zone resets
      [ ... ]
Run status group 0 (all jobs):
  WRITE: bw=1581KiB/s (1618kB/s), 1581KiB/s-1581KiB/s (1618kB/s-1618kB/s), io=92.9MiB (97.4MB), run=60175-60175msec
Disk stats (read/write):
    md0: ios=81/23777, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/11889, aggrmerge=0/0, aggrticks=0/958991, aggrin_queue=935342, aggrutil=99.13%
  sdc: ios=0/23778, merge=0/0, ticks=0/1917982, in_queue=1870684, util=99.13%
  rd0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

Now, run the random read test (Listing 15). I know I said wow before, but, Wow. Random reads achieve greater than 1GBps of throughput because they are literally hitting only RAM. On faster systems with faster memory, this number should be much larger.

Listing 15

Ramdisk Random Read

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 --filename=/dev/md0 --rw=randread --numjobs=1 --name=test
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1)
test: (groupid=0, jobs=1): err= 0: pid=5872: Sat Jan  9 16:35:08 2021
  read: IOPS=251k, BW=979MiB/s (1026MB/s)(2045MiB/2089msec)
       [ ... ]
Run status group 0 (all jobs):
   READ: bw=979MiB/s (1026MB/s), 979MiB/s-979MiB/s (1026MB/s-1026MB/s), io=2045MiB (2144MB), run=2089-2089msec
Disk stats (read/write):
    md0: ios=475015/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  sdc: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  rd0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

This setup has a problem, though. It has only a single persistent (non-volatile) volume in the mirror, and if that drive were to fail, only the volatile memory volume would be left. Also, if you reboot the system, the array comes up in degraded mode and reads solely from the HDD until you recreate the ramdisk and rebuild the mirror, which can be accomplished with a simple Bash script at bootup, as sketched below.
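
A minimal sketch of such a boot-time script follows; it assumes the array assembles in degraded mode from the HDD partition at boot and that the ramdisk reappears as /dev/rd0:

#!/bin/bash
# Recreate the volatile ramdisk and rebuild the mirror after a reboot
# (run once at startup, e.g., from a systemd unit or rc.local).
modprobe rapiddisk
rapiddisk --attach 2048        # recreate the 2GB ramdisk (rd0)
mdadm --run /dev/md0           # start the array (degraded) if it was assembled but not started
mdadm /dev/md0 --add /dev/rd0  # re-add the ramdisk; md rebuilds the mirror onto it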

How do you address this problem? A simple solution is to add a second persistent volume to the mirror, creating a three-copy RAID1 array. Recall that in the earlier example I created a 2GB partition on each of the two HDDs; both can now be handed to mdadm (Listing 16). When you verify the details (Listing 17), notice that both HDDs are set to writemostly.
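
If the two-device mirror from Listing 12 is still assembled, an alternative to recreating the array from scratch is to grow it in place. A sketch, assuming the spare 2GB partition is /dev/sdd2:

$ sudo mdadm /dev/md0 --add /dev/sdd2          # add the partition as a spare
$ sudo mdadm --grow /dev/md0 --raid-devices=3  # promote it to a third mirror member
$ echo writemostly | sudo tee /sys/block/md0/md/dev-sdd2/state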

Listing 16

Config 2GB Partition

$ sudo mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/rd0 --write-mostly /dev/sdc2 /dev/sdd2
mdadm: /dev/rd0 appears to be part of a raid array:
       level=raid1 devices=2 ctime=Sat Jan  9 16:32:43 2021
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdc2 appears to be part of a raid array:
       level=raid1 devices=2 ctime=Sat Jan  9 16:32:43 2021
mdadm: /dev/sdd2 appears to be part of a raid array:
       level=raid1 devices=2 ctime=Sat Jan  9 16:32:43 2021
Continue creating array? (y/n) y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

Listing 17

Verify 2GB Details

$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Sat Jan  9 16:36:18 2021
        Raid Level : raid1
        Array Size : 2094080 (2045.00 MiB 2144.34 MB)
     Used Dev Size : 2094080 (2045.00 MiB 2144.34 MB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent
       Update Time : Sat Jan  9 16:36:21 2021
             State : clean, resyncing
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0
Consistency Policy : resync
     Resync Status : 23% complete
              Name : dev-machine:0  (local to host dev-machine)
              UUID : e0e5d514:d2294825:45d9f09c:db485a0c
            Events : 3
    Number   Major   Minor   RaidDevice State
       0     252        0        0      active sync   /dev/rd0
       1       8       34        1      active sync writemostly   /dev/sdc2
       2       8       50        2      active sync writemostly   /dev/sdd2

Once the volume completes its initialization, do another run of fio benchmarks and execute the random write test (Listing 18) and the random read test (Listing 19). The random writes are back down a bit to 1.3MBps as a result of writing to the extra HDD and the additional latencies introduced by the mechanical drive.

Listing 18

2GB Random Write

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 --filename=/dev/md0 --rw=randwrite --numjobs=1 --name=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1305KiB/s][w=326 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=5941: Sat Jan  9 16:38:30 2021
  write: IOPS=325, BW=1301KiB/s (1333kB/s)(76.4MiB/60156msec); 0 zone resets
       [ ... ]
Run status group 0 (all jobs):
  WRITE: bw=1301KiB/s (1333kB/s), 1301KiB/s-1301KiB/s (1333kB/s-1333kB/s), io=76.4MiB (80.2MB), run=60156-60156msec
Disk stats (read/write):
    md0: ios=82/19571, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/13048, aggrmerge=0/0, aggrticks=0/797297, aggrin_queue=771080, aggrutil=97.84%
  sdd: ios=0/19572, merge=0/0, ticks=0/1658959, in_queue=1619688, util=93.01%
  sdc: ios=0/19572, merge=0/0, ticks=0/732934, in_queue=693552, util=97.84%
  rd0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

Listing 19

2GB Random Read

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 --filename=/dev/md0 --rw=randread --numjobs=1 --name=test
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1)
test: (groupid=0, jobs=1): err= 0: pid=5956: Sat Jan  9 16:38:53 2021
  read: IOPS=256k, BW=998MiB/s (1047MB/s)(2045MiB/2049msec)
        [ ... ]
Run status group 0 (all jobs):
   READ: bw=998MiB/s (1047MB/s), 998MiB/s-998MiB/s (1047MB/s-1047MB/s), io=2045MiB (2144MB), run=2049-2049msec
Disk stats (read/write):
    md0: ios=484146/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  sdd: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  sdc: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  rd0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

Notice that the 1GBps random read throughput is still maintained, now with the security of an extra volume for protection in the event of a drive failure. However, you will still need to recreate the ramdisk and rebuild the mirrored set on every reboot.

Conclusion

As you can see, you can rely on age-old concepts such as RAID to give your computing environments a boost of performance, and without resorting to a temporary cache. In some cases, you can breathe new life into older hardware.

Infos

  1. "Tuning ZFS for Speed on Linux" by Petros Koutoupis, ADMIN, 57, 2020, pp. 44-46
  2. mdadm(8): https://www.man7.org/linux/man-pages/man8/mdadm.8.html
  3. The RapidDisk Project: https://github.com/pkoutoupis/rapiddisk

The Author

Petros Koutoupis is a senior performance software engineer at Cray (now HPE) for its Lustre High Performance File System division. He is also the creator and maintainer of the RapidDisk Project (http://www.rapiddisk.org). Petros has worked in the data storage industry for well over a decade and has helped pioneer many technologies unleashed in the wild today.
