Improving performance with environment variables
Trick or No Trick
A topic that system administrators learn as they gain experience is called the "LD_PRELOAD Trick." This trick can help fix misbehaving applications, upgrade applications, and even improve application performance. Of course, it is not really a trick, just the use of a feature in *nix operating systems.
Have you ever installed an application on Linux and tried to run it only to be told the application can't be found? To debug the issue, probably the first thing to check is your PATH [1], which is "an environment variable … that tells the shell which directories to search for executable files." In short, the path tells Linux where to look for applications. If the application is not in the path, then Linux "thinks" it does not exist.
Fortunately, environment variables in Linux can be changed. If Linux cannot find an application, you can replace, edit, or extend the PATH variable so that it includes the directory holding the application. Linux also uses environment variables to define aspects of the operating system beyond the location of executables. Many applications define and use their own environment variables, and users can define their own as well.
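As a minimal sketch of extending PATH for the current shell, the snippet below prepends a directory and confirms that the shell can then find a program in it. The temporary directory and the hello-tool script are hypothetical and are created only for illustration.

```shell
# Create a throwaway directory holding a tiny executable (hypothetical example tool).
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from demo tool"\n' > "$demo_dir/hello-tool"
chmod +x "$demo_dir/hello-tool"

# Before the change, the shell cannot find the tool by name.
command -v hello-tool || echo "hello-tool: not found in PATH"

# Prepend the directory to PATH for this shell and its child processes.
export PATH="$demo_dir:$PATH"

# Now the lookup succeeds and the tool runs by name.
command -v hello-tool
hello-tool
```

The change lasts only for the current shell session; other shells keep their own PATH.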
In addition to PATH, which helps locate applications, the LD_LIBRARY_PATH environment variable tells Linux where to search for the shared libraries used by applications, which allows you to control which libraries are "available." Like PATH, this variable can be changed, and each shell can have its own value.
The variable can be useful when debugging a new library because you can simply change LD_LIBRARY_PATH to the new library, test it, and then change it back. You can also use it when upgrading libraries. If there is little or no change to the APIs in the new library, then a simple change to LD_LIBRARY_PATH allows you to use the new library without changing anything else.
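A minimal sketch of that test-then-revert workflow: rather than exporting a new value and restoring it later, the helper below sets LD_LIBRARY_PATH for a single command, so the shell's own value never changes. The function name with_lib_dir and the directory /opt/newlib/lib are hypothetical.

```shell
# Run one command with DIR placed first in the library search path.
# Usage: with_lib_dir DIR COMMAND [ARGS...]
with_lib_dir() {
  lib_dir=$1
  shift
  # Keep any existing search path after the new directory.
  env LD_LIBRARY_PATH="$lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" "$@"
}

# Show what a child process would see; the current shell is unaffected.
with_lib_dir /opt/newlib/lib sh -c 'echo "child sees: $LD_LIBRARY_PATH"'
echo "shell has: ${LD_LIBRARY_PATH:-<unset>}"
```

To check which file the dynamic linker would actually pick, run ldd through the same helper, for example with_lib_dir /opt/newlib/lib ldd ./myapp, where ./myapp stands in for your application.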
A third environment variable that also works with libraries, and is at the heart of the "trick," is LD_PRELOAD, an environment variable that contains a list of shared objects (libraries) [2], separated by spaces or colons, that are loaded before all others. This variable gives you more control over the order in which the application finds libraries than LD_LIBRARY_PATH alone.
LD_PRELOAD can be a great help in debugging because you can set it to a new library without changing LD_LIBRARY_PATH. After debugging, just set LD_PRELOAD to its previous value.
Perhaps the greatest strength of LD_PRELOAD is that you can easily substitute a new library for an existing one, allowing you to upgrade a library in an attempt to get better performance. Inserting a library before another for whatever purpose you have in mind is the so-called LD_PRELOAD trick.
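In shell terms, the trick is a variable assignment placed in front of the command. The sketch below wraps that in a small helper, hypothetically named preload_run, that first checks whether the library file is readable. The check is worthwhile because the glibc dynamic linker only prints a warning and carries on without the library when a preload fails, which is easy to miss.

```shell
# Run a command with LIB loaded before all other shared libraries.
# Usage: preload_run LIB COMMAND [ARGS...]
preload_run() {
  preload_lib=$1
  shift
  if [ ! -r "$preload_lib" ]; then
    echo "preload_run: cannot read $preload_lib" >&2
    return 1
  fi
  # LD_PRELOAD is set for this one command only; any existing value is kept after the new entry.
  env LD_PRELOAD="$preload_lib${LD_PRELOAD:+:$LD_PRELOAD}" "$@"
}
```

A call such as preload_run /path/to/libnew.so ./myapp (both names are placeholders) runs the application with the new library in front, and the shell's own environment stays untouched.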
One use I've seen of LD_PRELOAD is to load a library that intercepts calls to a normal library. The "intercept library" uses the same symbols (functions) as the usual library so that it will intercept any function calls from the application that were intended for that library. This intercept library can then be used to gather telemetry information from the calling application, perhaps writing it to a file. The intercept library then calls the intended functions in the usual library. With LD_PRELOAD, you can load the intercept library before the usual library without having to change it or the application.
A classic use case for an intercept library is for gathering telemetry (information) about I/O functions. With LD_PRELOAD, the intercept library intercepts I/O function calls such as open(), close(), read(), and write() to gather information and then passes the function calls to the intended I/O library. The intercept library uses the same function names, but rather than rewrite the I/O functionality for these functions, the new library typically gathers information, writes it to a file, and then calls the normal library to perform the I/O functions. Although this example is a classic use case of LD_PRELOAD, it is not the only use case. The next section presents another use of LD_PRELOAD resulting in increased performance.
Octave
Probably one of the best examples I know for the use of the LD_PRELOAD trick is to push basic linear algebra subprogram (BLAS) [3] computations from a CPU onto an NVidia GPU. I will illustrate this with an example from Octave [4], a mathematics tool similar to Matlab [5].
To demonstrate the process, I'll use two Octave scripts: The first does a simple square matrix multiply in single precision for various matrix sizes (Listing 1). The second script (Listing 2) is the same as Listing 1, but uses double precision.
Listing 1
Single-Precision Square Matrix Multiply
# Example SGEMM
for N = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
  A = single( rand(N,N) );
  B = single( rand(N,N) );
  start = clock();
  C = A*B;
  elapsedTime = etime(clock(), start);
  gFlops = 2*N*N*N / (elapsedTime * 1e+9);
  disp(sprintf("N = %4d, elapsed Time = %9.6f, GFlops = %9.6f ", ...
       N, elapsedTime, gFlops) );
endfor
Listing 2
Double-Precision Square Matrix Multiply
# Example DGEMM
for N = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
  A = double( rand(N,N) );
  B = double( rand(N,N) );
  start = clock();
  C = A*B;
  elapsedTime = etime(clock(), start);
  gFlops = 2*N*N*N / (elapsedTime * 1e+9);
  disp(sprintf("N = %4d, elapsed Time = %9.6f, GFlops = %9.6f ", ...
       N, elapsedTime, gFlops) );
endfor
To begin, I'll run these scripts on a test system with the default BLAS library that comes with Octave; then, I can use the LD_PRELOAD trick to have Octave call a different BLAS library, resulting in different, conceivably better, performance.
The test system is my Linux laptop (see Table 1 for specifications). The laptop runs Ubuntu 20.04 with the 455.45.01 NVidia driver and CUDA 11.2. Octave 5.2.0 was used for the tests. All software was installed from the Apt repository for the specific distribution version.
Table 1
Test System Specs
CPU: Intel Core i5-10300H CPU [6] @2.50GHz |
---|
Processor base frequency 2.5GHz |
Max turbo frequency 4.5GHz |
Cache 8MB |
Four cores (eight with hyper-threading) |
45W TDP |
8GB DDR4-2933 memory |
Maximum of two memory channels |
Memory bandwidth 45.8GBps |
NVidia GeForce 1650 GPU [7] |
Architecture: Turing (TU117) |
Memory 4GB GDDR5 |
Memory speed 8Gbps |
Memory bandwidth 128GBps |
Memory bus 128-bit |
L2 cache 1MB |
TDP 75W |
Base clock 1,485MHz |
Boost clock 1,665MHz |
896 CUDA cores |
The two scripts were run several times (>15) for each case to get a feel for the performance; then, they were run for the results presented in this article.
Default BLAS Library
By default, Octave uses a multithreaded BLAS library. Specifically, Octave used the BLAS library located at /lib/x86_64-linux-gnu/libblas.so.3. The two scripts, one for single precision and one for double precision, were run under the default BLAS library. The straightforward command to run the single-precision code with all cores (the default) is:
$ octave-cli ./sgemm.m
To run with a single core, you modify the command slightly:
$ OMP_NUM_THREADS=1 octave-cli ./sgemm.m
The results for running the two scripts are presented in Table 2 (where GFLOPS is a billion floating-point operations per second). The scripts were run first on a single core and then on all cores. A fair amount of variability is evident for N = 256 and N = 512, which is also true for all subsequent CPU results.
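The GFLOPS figures follow the formula used in the listings: a square matrix multiply of size N is counted as 2*N*N*N floating-point operations, divided by the elapsed time and by 1e9. As a quick sanity check, the sketch below recomputes the value from N and a time in seconds; the helper name gflops is hypothetical.

```shell
# gflops N SECONDS: 2*N^3 floating-point operations divided by elapsed time, in units of 1e9.
gflops() {
  awk -v n="$1" -v t="$2" 'BEGIN { printf "%.6f\n", 2 * n * n * n / (t * 1e9) }'
}

# Single precision, one core, N = 1,024 with the elapsed time reported in Table 2.
gflops 1024 0.100395
```

The result can differ from the tabulated value in the last few digits because the printed elapsed time is rounded to six decimal places.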
Table 2
Octave Results with Default BLAS Library
N | Single-Precision, One Core: Elapsed Time (secs) | Single-Precision, One Core: GFLOPS | Double-Precision, One Core: Elapsed Time (secs) | Double-Precision, One Core: GFLOPS | Single-Precision, All Cores: Elapsed Time (secs) | Single-Precision, All Cores: GFLOPS | Double-Precision, All Cores: Elapsed Time (secs) | Double-Precision, All Cores: GFLOPS |
---|---|---|---|---|---|---|---|---|
2 | 0.000702 | 0.000023 | 0.000427 | 0.000037 | 0.000961 | 0.000017 | 0.000137 | 0.000117 |
4 | 0.000069 | 0.001864 | 0.000076 | 0.001678 | 0.000099 | 0.001291 | 0.000092 | 0.001398 |
8 | 0.000069 | 0.014913 | 0.000061 | 0.016777 | 0.000092 | 0.011185 | 0.000084 | 0.012202 |
16 | 0.000061 | 0.134218 | 0.000061 | 0.134218 | 0.000092 | 0.089478 | 0.000084 | 0.097613 |
32 | 0.000076 | 0.858993 | 0.000076 | 0.858993 | 0.000099 | 0.660764 | 0.000107 | 0.613567 |
64 | 0.000099 | 5.286114 | 0.000145 | 3.616815 | 0.000153 | 3.435974 | 0.000206 | 2.545166 |
128 | 0.000313 | 13.408678 | 0.000587 | 7.139686 | 0.000565 | 7.429133 | 0.000473 | 8.867029 |
256 | 0.001785 | 18.795071 | 0.003654 | 9.181725 | 0.000542 | 61.944317 | 0.001144 | 29.32031 |
512 | 0.013779 | 19.481934 | 0.027763 | 9.668693 | 0.0047 | 57.117487 | 0.022438 | 11.963404 |
1,024 | 0.100395 | 21.390301 | 0.215065 | 9.985277 | 0.02961 | 72.526405 | 0.055252 | 38.867022 |
2,048 | 0.776039 | 22.137891 | 1.612694 | 10.652902 | 0.199173 | 86.256026 | 0.455025 | 37.755903 |
4,096 | 5.855209 | 23.472936 | 12.275261 | 11.196418 | 1.575951 | 87.21019 | 3.468651 | 39.623174 |
8,192 | 39.343849 | 27.946214 | 102.974144 | 10.677551 | 12.247917 | 89.771315 | 26.561623 | 41.394746 |
OpenBLAS
One of the most popular BLAS libraries is OpenBLAS [8], which you can use with the LD_PRELOAD trick instead of the default BLAS library. The command to run the single-precision script is:
$ OMP_NUM_THREADS=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libopenblas.so.0 octave-cli ./sgemm.m
Table 3 contains the results. Note that the OpenBLAS library was installed from the Apt repository for this distribution and version. A version built on the test system itself would likely produce better results.
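The four columns of results come from running each script twice, once on a single core and once on all cores. A minimal sketch of that sweep follows. The OpenBLAS path is the one used above; the function name run_openblas_cases, the OCTAVE and BLAS_LIB override variables, and the file name dgemm.m for the double-precision script are assumptions.

```shell
# Defaults can be overridden from the environment, e.g. OCTAVE=echo for a dry run.
BLAS_LIB=${BLAS_LIB:-/usr/lib/x86_64-linux-gnu/libopenblas.so.0}
OCTAVE=${OCTAVE:-octave-cli}

# Run each script given as an argument with OpenBLAS preloaded,
# first on one core and then on all cores.
run_openblas_cases() {
  for script in "$@"; do
    echo "== $script: OpenBLAS, one core =="
    env OMP_NUM_THREADS=1 LD_PRELOAD="$BLAS_LIB" "$OCTAVE" "$script"
    echo "== $script: OpenBLAS, all cores =="
    env LD_PRELOAD="$BLAS_LIB" "$OCTAVE" "$script"
  done
}
```

Calling run_openblas_cases ./sgemm.m ./dgemm.m would then produce all four sets of timings in one pass.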
Table 3
Octave Results with OpenBLAS Library
N | Single-Precision, One Core: Elapsed Time (secs) | Single-Precision, One Core: GFLOPS | Double-Precision, One Core: Elapsed Time (secs) | Double-Precision, One Core: GFLOPS | Single-Precision, All Cores: Elapsed Time (secs) | Single-Precision, All Cores: GFLOPS | Double-Precision, All Cores: Elapsed Time (secs) | Double-Precision, All Cores: GFLOPS |
---|---|---|---|---|---|---|---|---|
2 | 0.000114 | 0.00014 | 0.000114 | 0.00014 | 0.001022 | 0.000016 | 0.000771 | 0.000021 |
4 | 0.000076 | 0.001678 | 0.000076 | 0.001678 | 0.000099 | 0.001291 | 0.000061 | 0.002097 |
8 | 0.000061 | 0.016777 | 0.000061 | 0.016777 | 0.000092 | 0.011185 | 0.000061 | 0.016777 |
16 | 0.000061 | 0.134218 | 0.000069 | 0.119305 | 0.000084 | 0.097613 | 0.000076 | 0.107374 |
32 | 0.000061 | 1.073742 | 0.000076 | 0.858993 | 0.000092 | 0.715828 | 0.000076 | 0.858993 |
64 | 0.000099 | 5.286114 | 0.000137 | 3.817749 | 0.000145 | 3.616815 | 0.000137 | 3.817749 |
128 | 0.000313 | 13.408678 | 0.000572 | 7.330078 | 0.000381 | 10.995116 | 0.000656 | 6.392509 |
256 | 0.001808 | 18.557158 | 0.003624 | 9.259045 | 0.000519 | 64.677155 | 0.001144 | 29.32031 |
512 | 0.013237 | 20.279177 | 0.026962 | 9.955963 | 0.004074 | 65.888337 | 0.008163 | 32.882591 |
1,024 | 0.101677 | 21.120656 | 0.20388 | 10.533061 | 0.035118 | 61.150332 | 0.052483 | 40.918008 |
2,048 | 0.774956 | 22.168839 | 1.59137 | 10.79565 | 0.201546 | 85.240558 | 0.410416 | 41.859683 |
4,096 | 5.741043 | 23.939718 | 11.007278 | 12.486188 | 1.558258 | 88.20038 | 3.523735 | 39.003771 |
8,192 | 39.33165 | 27.954882 | 84.512154 | 13.010101 | 12.305489 | 89.351318 | 26.867691 | 40.92319 |