Lustre HPC distributed filesystem
Radiance
I/O and Performance Benchmarking
MDTest is an MPI-based metadata performance testing application designed to exercise parallel filesystems, and IOR is a companion benchmarking utility that measures the throughput of distributed filesystems. To put it more simply: With MDTest, you typically test the metadata operations involved in creating, removing, and reading objects such as directories and files. IOR is more straightforward and focuses on benchmarking buffered or direct, sequential or random, write and read throughput to the filesystem. Both are maintained and distributed together under the IOR GitHub project [6]. To build the latest IOR package from source, install a Message Passing Interface (MPI) framework, then clone, build, and install the test utilities:
$ sudo dnf install mpich mpich-devel
$ git clone https://github.com/hpc/ior.git
$ cd ior
$ MPICC=/usr/lib64/mpich/bin/mpicc ./configure
$ cd src/
$ make && sudo make install
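Before dispatching a distributed job, it does not hurt to verify that the MPI launcher can actually reach every client. A minimal sanity check, assuming passwordless SSH between the nodes (as already required for pdsh) and the IOR binaries installed at the same path on each client, might look like this:

# One trivial process per listed client; expect one hostname back per node
# (add -np 4 if your launcher defaults to a single process)
$ /usr/lib64/mpich/bin/mpirun --host 10.0.0.3,10.0.0.4,10.0.0.5,10.0.0.6 hostname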
You are now ready to run a simple benchmark of your filesystem.
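Although the remainder of this article exercises IOR, the same mpirun invocation can drive MDTest for a metadata-oriented benchmark. The following is only an illustrative sketch: the iteration count, the number of items per process, and the /lustre/mdtest working directory are arbitrary values chosen for the example, not figures from this test setup:

# Create the working directory once from any client
$ sudo mkdir -p /lustre/mdtest
# Each client creates, stats, and removes 1,000 files and directories,
# repeated for three iterations
$ sudo /usr/lib64/mpich/bin/mpirun --host 10.0.0.3,10.0.0.4,10.0.0.5,10.0.0.6 /usr/local/bin/mdtest -i 3 -n 1000 -d /lustre/mdtest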
IOR
The benchmark will give you a general idea of how the filesystem performs in its current environment. I rely on mpirun to dispatch the I/O generated by IOR in parallel across the clients; in the end, I get an aggregated result of the entire job execution.
The filesystem is currently empty, with the exception of the file created earlier to test it; both the MDT and the OSTs hold essentially no real file data (Listing 9, executed from the client).
Listing 9
Current Environment
$ sudo lfs df
UUID                   1K-blocks        Used   Available Use% Mounted on
testfs-MDT0000_UUID     22419556       10784    19944620   1% /lustre[MDT:0]
testfs-OST0000_UUID     23335208        1764    20852908   1% /lustre[OST:0]
testfs-OST0001_UUID     23335208        1768    20852904   1% /lustre[OST:1]
testfs-OST0002_UUID     23335208        1768    20852904   1% /lustre[OST:2]

filesystem_summary:     70005624        5300    62558716   1% /lustre
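Because striping determines how file data spreads across the OSTs, it is also worth noting the default layout that the benchmark files will inherit before generating any load. Assuming the defaults have not been changed since the filesystem was formatted, a quick check from the client looks like this:

# Show the default stripe count and stripe size set on the filesystem root
$ lfs getstripe -d /lustre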
To benchmark the performance of the HPC setup, run a write-only instance of IOR from the four clients simultaneously. Each client will initiate a single process to write 64MB transfers to a 5GB file (Listing 10).
Listing 10
IOR Write Only
$ sudo /usr/lib64/mpich/bin/mpirun --host 10.0.0.3,10.0.0.4,10.0.0.5,10.0.0.6 /usr/local/bin/ior -F -w -t 64m -k --posix.odirect -D 60 -u -b 5g -o /lustre/test.01
IOR-3.4.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Tue Jan 25 20:02:21 2022
Command line        : /usr/local/bin/ior -F -w -t 64m -k --posix.odirect -D 60 -u -b 5g -o /lustre/test.01
Machine             : Linux lustre-client1
TestID              : 0
StartTime           : Tue Jan 25 20:02:21 2022
Path                : /lustre/0/test.01.00000000
FS                  : 66.8 GiB   Used FS: 35.9%   Inodes: 47.0 Mi   Used Inodes: 0.0%

Options:
api                 : POSIX
apiVersion          :
test filename       : /lustre/test.01
access              : file-per-process
type                : independent
segments            : 1
ordering in a file  : sequential
ordering inter file : no tasks offsets
nodes               : 4
tasks               : 4
clients per node    : 1
repetitions         : 1
xfersize            : 64 MiB
blocksize           : 5 GiB
aggregate filesize  : 20 GiB
stonewallingTime    : 60
stoneWallingWearOut : 0

Results:
access  bw(MiB/s)  IOPS   Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------  ---------  ----   ----------  ----------  ---------  --------  --------  --------  --------  ----
write   1835.22    28.68  0.120209    5242880     65536      0.000934  11.16     2.50      11.16     0

Summary of all tests:
Operation  Max(MiB)  Min(MiB)  Mean(MiB)  StdDev  Max(OPs)  Min(OPs)  Mean(OPs)  StdDev  Mean(s)   Stonewall(s)  Stonewall(MiB)  Test#  #Tasks  tPN  reps  fPP  reord  reordoff  reordrand  seed  segcnt  blksiz      xsize     aggs(MiB)  API    RefNum
write      1835.22   1835.22   1835.22    0.00    28.68     28.68     28.68      0.00    11.15941  NA            NA              0      4       1    1     1    0      1         0          0     1       5368709120  67108864  20480.0    POSIX  0
Finished            : Tue Jan 25 20:02:32 2022
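For reference, the IOR flags used in Listing 10 break down roughly as follows (descriptions paraphrased from the IOR documentation; double-check the help output of your build, because options occasionally change between versions):

# -F               one file per MPI process instead of a single shared file
# -w               write-only test
# -t 64m           transfer size of each I/O request
# -b 5g            block size, the amount written per process (4 x 5GiB = 20GiB aggregate)
# -k               keep the test files when the run finishes
# -u               unique working directory per task (hence the /lustre/0/... path)
# -D 60            stonewalling deadline, capping the run at 60 seconds
# --posix.odirect  use O_DIRECT with the POSIX back end to bypass the client page cache
# -o               path of the test file
$ sudo /usr/lib64/mpich/bin/mpirun --host 10.0.0.3,10.0.0.4,10.0.0.5,10.0.0.6 /usr/local/bin/ior -F -w -t 64m -b 5g -k -u -D 60 --posix.odirect -o /lustre/test.01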
Notice the write throughput of a little more than 1.8GiBps to the filesystem. Considering that each client is writing to the target filesystem with a single process and you probably did not hit the limit of the GigE back end, this isn't a bad result. You will also see the OST targets start to fill up with data (Listing 11).
Listing 11
Writing to OST Targets
$ lfs df
UUID                   1K-blocks        Used   Available Use% Mounted on
testfs-MDT0000_UUID     22419556       10800    19944604   1% /lustre[MDT:0]
testfs-OST0000_UUID     23335208     5244648    15577064  26% /lustre[OST:0]
testfs-OST0001_UUID     23335208     5244652    15577060  26% /lustre[OST:1]
testfs-OST0002_UUID     23335208    10487544    10301208  51% /lustre[OST:2]

filesystem_summary:     70005624    20976844    41455332  34% /lustre
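The skew across the OSTs (one target at 51 percent while the other two sit at 26 percent) is expected here: with a stripe count of 1, each file-per-process file is placed entirely on a single OST, so four 5GB files spread over three OSTs leave one OST holding two of them. To confirm the placement, you can ask Lustre which OST objects back each file; the wildcard below assumes the per-task subdirectories created by IOR's -u flag:

# List the stripe layout and assigned OST index of each benchmark file
$ lfs getstripe /lustre/*/test.01.*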
This time, rerun IOR, but in read-only mode. The command uses the same number of clients, processes per client, and transfer size, but reads only 1GB from each of the 5GB files left behind by the write test, which is why IOR warns about the aggregate file size mismatch (Listing 12; the missing information under Options is identical to Listing 10).
Listing 12
IOR Read Only
$ sudo /usr/lib64/mpich/bin/mpirun --host 10.0.0.3,10.0.0.4,10.0.0.5,10.0.0.6 /usr/local/bin/ior -F -r -t 64m -k --posix.odirect -D 15 -u -b 1g -o /lustre/test.01
IOR-3.4.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Tue Jan 25 20:04:11 2022
Command line        : /usr/local/bin/ior -F -r -t 64m -k --posix.odirect -D 15 -u -b 1g -o /lustre/test.01
Machine             : Linux lustre-client1
TestID              : 0
StartTime           : Tue Jan 25 20:04:11 2022
Path                : /lustre/0/test.01.00000000
FS                  : 66.8 GiB   Used FS: 30.0%   Inodes: 47.0 Mi   Used Inodes: 0.0%

Options:
[...]
blocksize           : 1 GiB
aggregate filesize  : 4 GiB
stonewallingTime    : 15
stoneWallingWearOut : 0

Results:
access  bw(MiB/s)  IOPS   Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------  ---------  ----   ----------  ----------  ---------  --------  --------  --------  --------  ----
WARNING: Expected aggregate file size       = 4294967296
WARNING: Stat() of aggregate file size      = 21474836480
WARNING: Using actual aggregate bytes moved = 4294967296
WARNING: Maybe caused by deadlineForStonewalling
read    2199.66    34.40  0.108532    1048576     65536      0.002245  1.86      0.278201  1.86      0

Summary of all tests:
Operation  Max(MiB)  Min(MiB)  Mean(MiB)  StdDev  Max(OPs)  Min(OPs)  Mean(OPs)  StdDev  Mean(s)  Stonewall(s)  Stonewall(MiB)  Test#  #Tasks  tPN  reps  fPP  reord  reordoff  reordrand  seed  segcnt  blksiz      xsize     aggs(MiB)  API    RefNum
read       2199.66   2199.66   2199.66    0.00    34.37     34.37     34.37      0.00    1.86211  NA            NA              0      4       1    1     1    0      1         0          0     1       1073741824  67108864  4096.0     POSIX  0
Finished            : Tue Jan 25 20:04:13 2022
For a virtual machine deployment on a 1GigE network, roughly 2.2GiBps of read throughput is, again, not bad at all. Imagine this on a much larger configuration with better compute, storage, and network capabilities; more processes per client; and more clients. Such a cluster would scream with speed.
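As a sketch of what scaling out might look like, MPICH's Hydra launcher lets you raise the process count with its -np and -ppn options. The values and the test.02 path below are purely illustrative; the right numbers depend on your client cores, LNet configuration, and storage back end:

# Hypothetical scaled-up run: 4 IOR processes on each of the 4 clients (16 tasks),
# each writing a 1GiB block in 64MiB transfers (16GiB aggregate)
$ sudo /usr/lib64/mpich/bin/mpirun -np 16 -ppn 4 --host 10.0.0.3,10.0.0.4,10.0.0.5,10.0.0.6 /usr/local/bin/ior -F -w -t 64m -b 1g -k -u -D 60 --posix.odirect -o /lustre/test.02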
Conclusion
That is the Lustre high-performance filesystem in a nutshell. To unmount the filesystem from the clients, use the umount command, just as you would unmount any other device from a system:
$ sudo pdsh -w 10.0.0.[3-6] umount /lustre
As with any technology, Lustre is not the only distributed filesystem of its kind: IBM's GPFS, BeeGFS, and plenty more compete in the same space. Despite the competition, Lustre is both stable and reliable, has cemented itself in the HPC world for nearly two decades, and is not going anywhere.
Infos
1. The Lustre project: https://www.lustre.org
2. Wiki: https://wiki.lustre.org/Main_Page
3. Documentation: https://doc.lustre.org/lustre_manual.xhtml
4. e2fsprogs: https://downloads.whamcloud.com/public/e2fsprogs/latest
5. e2fsprogs files used in this article: https://downloads.whamcloud.com/public/e2fsprogs/latest/el8/RPMS/x86_64
6. IOR (and MDTest): https://github.com/hpc/ior