Improving performance with environment variables
Trick or No Trick
NVBLAS
NVidia has several libraries you can use when writing programs. Some of these libraries are standards-conforming, such as cuBLAS [9]. NVidia has used cuBLAS as the basis of NVBLAS, a "drop-in" replacement BLAS library that provides BLAS level 3 routines [10]. Both cuBLAS and NVBLAS are included as part of CUDA [11]; to get them, simply follow the directions for downloading and installing CUDA. For this article, I used the cuBLAS and NVBLAS that came with the NVidia HPC SDK, version 21.3.
Before using NVBLAS, you have to configure it. From the NVBLAS documentation [12], "It must be configured through an ASCII text file that describes how many and which GPUs can participate in the intercepted BLAS calls." To use NVBLAS, create the file nvblas.conf in the directory in which you are running the scripts. For the example in this article, the contents of the file I used were:
# This is the configuration file to use NVBLAS Library
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so.0
NVBLAS_GPU_LIST 0
NVBLAS_AUTOPIN_MEM_ENABLED
The first setting in the file, NVBLAS_LOGFILE, names the logfile where NVBLAS writes any log information. The next line, NVBLAS_CPU_BLAS_LIB, defines the CPU-only BLAS library for cases in which there is no GPU routine: Any call that is not offloaded to the GPU falls through to this CPU BLAS library. In this case, I chose to use the OpenBLAS library.
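Because NVBLAS relies on the CPU library named in nvblas.conf, a quick check that the path really exists can save some head scratching before the first run. The following is only a minimal sketch: The helper names cpu_blas_lib and check_cpu_blas_lib are hypothetical (they are not part of NVBLAS), and the code assumes one keyword and one value per line, as in the file above.

```shell
#!/bin/sh
# Sketch: check the CPU BLAS library named in an nvblas.conf file.
# cpu_blas_lib and check_cpu_blas_lib are hypothetical helpers, not NVBLAS tools.

# Print the value of the NVBLAS_CPU_BLAS_LIB keyword
# (if it appears more than once, this sketch reports the last one).
# Comment lines never match because their first field starts with '#'.
cpu_blas_lib() {
    awk '$1 == "NVBLAS_CPU_BLAS_LIB" { lib = $2 } END { if (lib != "") print lib }' "$1"
}

# Return 0 and report the path if the library file exists; return 1 otherwise.
check_cpu_blas_lib() {
    _lib=$(cpu_blas_lib "$1")
    if [ -z "$_lib" ]; then
        echo "no NVBLAS_CPU_BLAS_LIB entry in $1" >&2
        return 1
    fi
    if [ ! -e "$_lib" ]; then
        echo "CPU BLAS library not found: $_lib" >&2
        return 1
    fi
    echo "CPU BLAS library: $_lib"
}
```

For example, check_cpu_blas_lib ./nvblas.conf reports the library path if the file it names is present and returns a non-zero status otherwise.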
The NVBLAS_GPU_LIST line lists the GPU devices that should be used. The numbering begins with 0. In this case, the laptop only has one NVidia GPU, so only one is listed. You can also use the keyword ALL to select all the GPUs in the system. The last line, NVBLAS_AUTOPIN_MEM_ENABLED, which enables pinned-memory mode, is something I took from an article about NVBLAS with Octave [13].
After configuring nvblas.conf, you have to take two steps to run Octave. The first step is to export the NVBLAS_CONFIG_FILE environment variable, which points to the location of the nvblas.conf file:
export NVBLAS_CONFIG_FILE=$HOME/PROJECTS/OCTAVE/nvblas.conf
The second step is the run command itself, which uses the LD_PRELOAD trick to load NVBLAS first:
LD_PRELOAD=/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/math_libs/11.2/targets/x86_64-linux/lib/libnvblas.so.11.4.1.1026 octave-cli ./sgemm.m
The command begins by defining LD_PRELOAD, pointing to the NVBLAS library, followed by the command that runs Octave (octave-cli). To run the script, you can simply put the export and the run command together (I tend to write a one-line Bash script for this). The results for the single- and double-precision scripts are shown in Table 4.
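One way to package the two steps is a small wrapper function. This is a sketch under stated assumptions: The function name run_with_nvblas, the NVBLAS_LIB variable, and the DRY_RUN switch are made up for illustration, and the two default paths are simply the ones used in this article, so they will differ on other systems. Instead of exporting NVBLAS_CONFIG_FILE, the sketch sets both variables for the one command only.

```shell
#!/bin/sh
# Sketch of a wrapper for the two steps (config variable plus LD_PRELOAD).
# run_with_nvblas, NVBLAS_LIB, and DRY_RUN are hypothetical names.

# Defaults taken from the article; override them in the environment as needed.
NVBLAS_CONFIG_FILE="${NVBLAS_CONFIG_FILE:-$HOME/PROJECTS/OCTAVE/nvblas.conf}"
NVBLAS_LIB="${NVBLAS_LIB:-/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/math_libs/11.2/targets/x86_64-linux/lib/libnvblas.so.11.4.1.1026}"

run_with_nvblas() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only show the command line that would be executed.
        echo "NVBLAS_CONFIG_FILE=$NVBLAS_CONFIG_FILE LD_PRELOAD=$NVBLAS_LIB $*"
        return 0
    fi
    # The VAR=value prefixes are placed in the environment of this one
    # command, so other programs started from the shell are not preloaded.
    NVBLAS_CONFIG_FILE="$NVBLAS_CONFIG_FILE" LD_PRELOAD="$NVBLAS_LIB" "$@"
}
```

Calling run_with_nvblas octave-cli ./sgemm.m then runs the script with NVBLAS preloaded; with DRY_RUN=1 set, the function only prints the command line, which is handy for checking the paths first.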
Table 4
Octave Results with the NVBLAS Library
N | Single-Precision, GPU: Elapsed Time (secs) | Single-Precision, GPU: GFLOPS | Double-Precision, GPU: Elapsed Time (secs) | Double-Precision, GPU: GFLOPS |
---|---|---|---|---|
2 | 0.001167 | 0.000014 | 0.001007 | 0.000016 |
4 | 0.000076 | 0.001678 | 0.000069 | 0.001864 |
8 | 0.000061 | 0.016777 | 0.000061 | 0.016777 |
16 | 0.000061 | 0.134218 | 0.000069 | 0.119305 |
32 | 0.000076 | 0.858993 | 0.000076 | 0.858993 |
64 | 0.000099 | 5.286114 | 0.000145 | 3.616815 |
128 | 0.000542 | 7.74304 | 0.000603 | 6.958934 |
256 | 0.000549 | 61.083979 | 0.001152 | 29.126136 |
512 | 0.016685 | 16.087962 | 0.012955 | 20.721067 |
1,024 | 0.008904 | 241.195353 | 0.039238 | 54.72975 |
2,048 | 0.01741 | 986.765913 | 0.250496 | 68.583432 |
4,096 | 0.093765 | 1465.776933 | 1.500099 | 91.619911 |
8,192 | 0.643051 | 1709.835418 | 12.03125 | 91.387979 |
I cannot explain the strange "blip" in the results at N = 512, but it happens very frequently. Similarly odd results at N = 256 and N = 512 also appeared in the CPU runs.
For the CPU results, double-precision performance is about half of single-precision performance, which is expected. On the GPU, however, double-precision performance is far less than half of single-precision performance (at N = 8,192, about 91 GFLOPS versus about 1,710 GFLOPS), because the GPU used (the GeForce GTX 1650) is a consumer-grade GPU focused primarily on 32-bit performance. As you can tell, it can run double-precision code, just not as well as the data center GPUs that focus on 64-bit performance.
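The GFLOPS column in Table 4 can be cross-checked with a little arithmetic. The sketch below assumes the usual count of 2N^3 floating-point operations for an N x N matrix multiply, which reproduces the table values; gflops is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: recompute GFLOPS from the matrix size N and the elapsed time.
# Assumes 2*N^3 floating-point operations per N x N matrix multiply.

# gflops N SECONDS: print 2*N^3 / SECONDS / 1e9 with one decimal place.
gflops() {
    LC_ALL=C awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 2 * n * n * n / t / 1e9 }'
}

gflops 8192 0.643051    # single precision, N = 8,192: prints 1709.8
gflops 8192 12.03125    # double precision, N = 8,192: prints 91.4
```

Dividing the two results gives a single-to-double ratio of roughly 19 at this matrix size.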
Summary
The LD_PRELOAD trick is something of a rite of passage for new system administrators. When they find out about the trick, it is a revelation because of how flexible it can be. Soon, it is no longer a trick but a part of what the admin uses every day. I hope the simple example in this article, which uses LD_PRELOAD to run computations on a GPU without any code changes, illustrates its utility.
If you knew of this trick but have forgotten it, or if you are just learning it, I hope this article proved useful.
Infos
[1] PATH: http://www.linfo.org/path_env_var.html
[2] Shared objects: https://man7.org/linux/man-pages/man8/ld.so.8.html
[3] BLAS: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
[4] Octave: https://www.gnu.org/software/octave/index
[5] Matlab: https://www.mathworks.com/
[6] i5-10300H CPU: https://ark.intel.com/content/www/us/en/ark/products/201839/intel-core-i5-10300h-processor-8m-cache-up-to-4-50-ghz.html
[7] NVidia GeForce GTX 1650 GPU: https://www.nvidia.com/en-us/geforce/graphics-cards/gtx-1650/
[8] OpenBLAS: https://en.wikipedia.org/wiki/OpenBLAS
[9] cuBLAS: https://developer.nvidia.com/cublas
[10] BLAS Level 3 routines: https://docs.nvidia.com/cuda/nvblas/index.html
[11] CUDA: https://developer.nvidia.com/cuda-toolkit
[12] NVBLAS documentation: https://docs.nvidia.com/cuda/nvblas/index.html#configuration-file
[13] NVBLAS with Octave: https://developer.nvidia.com/blog/drop-in-acceleration-gnu-octave/