Lustre HPC distributed filesystem

Radiance

I/O and Performance Benchmarking

MDTest is an MPI-based application for testing the metadata performance of parallel filesystems, and IOR is a benchmarking utility for measuring their I/O performance. To put it more simply: With MDTest, you would typically exercise the metadata operations involved in creating, removing, and reading objects such as directories, files, and so on. IOR is more straightforward and focuses on benchmarking buffered or direct, sequential or random, write and read throughput to the filesystem. Both are maintained and distributed together under the IOR GitHub project [6]. To build the latest IOR package from source, you need to install a Message Passing Interface (MPI) framework, then clone, build, and install the test utilities:

$ sudo dnf install mpich mpich-devel
$ git clone https://github.com/hpc/ior.git
$ cd ior
$ MPICC=/usr/lib64/mpich/bin/mpicc ./configure
$ cd src/
$ make && sudo make install

You are now ready to run a simple benchmark of your filesystem.
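Although the rest of this article focuses on IOR, the same build and install steps also give you mdtest. A minimal metadata run against this cluster might look like the following sketch; the item count and target directory here are arbitrary choices, so adjust the host list and paths to match your environment:

```shell
# Each of the four clients runs one mdtest task; every task creates,
# stats, and removes 1,000 files/directories (-n) under /lustre/mdtest
# (-d), repeating the whole sequence three times (-i).
/usr/lib64/mpich/bin/mpirun --host 10.0.0.3,10.0.0.4,10.0.0.5,10.0.0.6 \
    /usr/local/bin/mdtest -n 1000 -i 3 -d /lustre/mdtest
```

At the end of the run, mdtest prints per-operation rates (creates, stats, removals per second) aggregated across all tasks.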

IOR

The benchmark will give you a general idea of how the filesystem performs in its current environment. I rely on mpirun to dispatch the I/O generated by IOR in parallel across the clients; at the end, I get an aggregated result for the entire job execution.

The filesystem is currently empty, with the exception of the file created earlier to test the filesystem. The MDT and OSTs hold no real file data (Listing 9, executed from the client).

Listing 9

Current Environment

$ sudo lfs df
UUID                    1K-blocks        Used   Available Use% Mounted on
testfs-MDT0000_UUID      22419556       10784    19944620   1% /lustre[MDT:0]
testfs-OST0000_UUID      23335208        1764    20852908   1% /lustre[OST:0]
testfs-OST0001_UUID      23335208        1768    20852904   1% /lustre[OST:1]
testfs-OST0002_UUID      23335208        1768    20852904   1% /lustre[OST:2]
filesystem_summary:      70005624        5300    62558716   1% /lustre

To benchmark the performance of the HPC setup, run a write-only instance of IOR from the four clients simultaneously. Each client will initiate a single process that writes 64MiB transfers to a 5GiB file (Listing 10).
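Before looking at the output, note that the expected aggregate file size follows directly from these parameters; with file-per-process mode (-F), each client writes its own file:

```shell
# 4 clients x one 5GiB file each (file-per-process) = 20GiB aggregate,
# which should match the "aggregate filesize" line in the IOR output.
clients=4
block_gib=5
echo "$(( clients * block_gib ))GiB"    # prints 20GiB
```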

Listing 10

IOR Write Only

$ sudo /usr/lib64/mpich/bin/mpirun --host 10.0.0.3,10.0.0.4,10.0.0.5,10.0.0.6 /usr/local/bin/ior -F -w -t 64m -k --posix.odirect -D 60 -u -b 5g -o /lustre/test.01
IOR-3.4.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Tue Jan 25 20:02:21 2022
Command line        : /usr/local/bin/ior -F -w -t 64m -k --posix.odirect -D 60 -u -b 5g -o /lustre/test.01
Machine             : Linux lustre-client1
TestID              : 0
StartTime           : Tue Jan 25 20:02:21 2022
Path                : /lustre/0/test.01.00000000
FS                  : 66.8 GiB   Used FS: 35.9%   Inodes: 47.0 Mi   Used Inodes: 0.0%
Options:
api                 : POSIX
apiVersion          :
test filename       : /lustre/test.01
access              : file-per-process
type                : independent
segments            : 1
ordering in a file  : sequential
ordering inter file : no tasks offsets
nodes               : 4
tasks               : 4
clients per node    : 1
repetitions         : 1
xfersize            : 64 MiB
blocksize           : 5 GiB
aggregate filesize  : 20 GiB
stonewallingTime    : 60
stoneWallingWearOut : 0
Results:
access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     1835.22    28.68      0.120209    5242880    65536      0.000934   11.16      2.50       11.16      0
Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt      blksiz     xsize  aggs(MiB)    API RefNum
write        1835.22    1835.22    1835.22       0.00      28.68      28.68      28.68       0.00   11.15941           NA             NA     0      4   1    1   1     0        1         0    0      1  5368709120  67108864   20480.0   POSIX      0
Finished            : Tue Jan 25 20:02:32 2022

Notice a little more than 1.8GiBps of write throughput to the filesystem. Considering that each client is writing to the target filesystem in a single process, and that you probably did not hit the limit of the GigE back end, this isn't a bad result. You will start to see the OST targets fill up with data (Listing 11).
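The bandwidth figure can be reproduced from the listing's own numbers: 20,480MiB of aggregate data moved over the 11.15941 seconds of mean write time:

```shell
# bw(MiB/s) = aggregate MiB moved / mean write time in seconds
awk 'BEGIN { printf "%.2f MiB/s\n", 20480 / 11.15941 }'   # prints 1835.22 MiB/s
```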

Listing 11

Writing to OST Targets

$ lfs df
UUID                    1K-blocks        Used   Available Use% Mounted on
testfs-MDT0000_UUID      22419556       10800    19944604   1% /lustre[MDT:0]
testfs-OST0000_UUID      23335208     5244648    15577064  26% /lustre[OST:0]
testfs-OST0001_UUID      23335208     5244652    15577060  26% /lustre[OST:1]
testfs-OST0002_UUID      23335208    10487544    10301208  51% /lustre[OST:2]
filesystem_summary:      70005624    20976844    41455332  34% /lustre

This time, rerun IOR, but in read-only mode. The command will use the same number of clients, processes, and transfer size, but will read a 1GiB block per process (Listing 12; the information missing under Options is identical to Listing 10).

Listing 12

IOR Read Only

$ sudo /usr/lib64/mpich/bin/mpirun --host 10.0.0.3,10.0.0.4,10.0.0.5,10.0.0.6 /usr/local/bin/ior -F -r -t 64m -k --posix.odirect -D 15 -u -b 1g -o /lustre/test.01
IOR-3.4.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Tue Jan 25 20:04:11 2022
Command line        : /usr/local/bin/ior -F -r -t 64m -k --posix.odirect -D 15 -u -b 1g -o /lustre/test.01
Machine             : Linux lustre-client1
TestID              : 0
StartTime           : Tue Jan 25 20:04:11 2022
Path                : /lustre/0/test.01.00000000
FS                  : 66.8 GiB   Used FS: 30.0%   Inodes: 47.0 Mi   Used Inodes: 0.0%
Options:
[...]
blocksize           : 1 GiB
aggregate filesize  : 4 GiB
stonewallingTime    : 15
stoneWallingWearOut : 0
Results:
access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
WARNING: Expected aggregate file size       = 4294967296
WARNING: Stat() of aggregate file size      = 21474836480
WARNING: Using actual aggregate bytes moved = 4294967296
WARNING: Maybe caused by deadlineForStonewalling
read      2199.66    34.40      0.108532    1048576    65536      0.002245   1.86       0.278201   1.86       0
Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt     blksiz    xsize aggs(MiB)   API RefNum
read         2199.66    2199.66    2199.66       0.00      34.37      34.37      34.37       0.00    1.86211           NA             NA     0      4   1    1   1     0        1         0    0      1 1073741824 67108864    4096.0 POSIX      0
Finished            : Tue Jan 25 20:04:13 2022
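The warnings in Listing 12 are harmless and follow from the arithmetic of the two runs: the read pass expects to move 4 tasks x 1GiB, but stat() sees the four 5GiB files kept from the write run (the -k flag), so IOR computes bandwidth from the bytes it actually moved. A quick sanity check of the byte counts:

```shell
gib=1073741824   # bytes in 1GiB

# Expected aggregate: 4 tasks x 1GiB block each (the -b 1g read run)
echo $(( 4 * 1 * gib ))   # prints 4294967296, matching "Expected aggregate file size"

# On-disk size stat() sees: 4 files x 5GiB, kept from the write run by -k
echo $(( 4 * 5 * gib ))   # prints 21474836480, matching "Stat() of aggregate file size"
```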

For a virtual machine deployment on a 1GigE network, I get roughly 2.2GiBps reads, which again, if you think about it, is not bad at all. Imagine this on a much larger configuration with better compute, storage, and network capabilities; more processes per client; and more clients. This cluster would scream with speed.
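As with the write run, the read bandwidth is simply the 4,096MiB actually moved divided by the 1.86211-second mean read time:

```shell
# bw(MiB/s) = MiB actually moved / mean read time in seconds
awk 'BEGIN { printf "%.2f MiB/s\n", 4096 / 1.86211 }'   # prints 2199.66 MiB/s
```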

Conclusion

That is the Lustre high-performance filesystem in a nutshell. To unmount the filesystem from the client, use the umount command, just as you would unmount any other device from a system:

$ sudo pdsh -w 10.0.0.[3-6] umount /lustre

Lustre is not the only distributed filesystem of its kind; alternatives include IBM's GPFS, BeeGFS, and plenty more. Despite the competition, Lustre is both stable and reliable and has cemented itself in the HPC space for nearly two decades; it is not going anywhere.

The Author

Petros Koutoupis is currently a senior performance software engineer at Cray (now HPE) for its Lustre High Performance File System division and is the creator and maintainer of the RapidDisk Project http://www.rapiddisk.org. He has worked in the data storage industry for well over a decade and has helped pioneer the many technologies unleashed in the wild today.
