Preload Trick

By using the LD_PRELOAD environment variable, you can improve performance without making changes to applications.

A topic that system administrators learn as they gain experience is the “LD_PRELOAD trick,” which can help fix misbehaving applications, upgrade applications, and even improve application performance. Of course, it is not really a trick, just the use of a feature in *nix operating systems.

Have you ever installed an application on Linux and tried to run it, and it tells you the application can’t be found? To debug the issue, probably the first thing to check is your PATH, which is “an environment variable … that tells the shell which directories to search for executable files.” In short, the path tells Linux where to look for applications. If the application is not in the path, then Linux “thinks” it does not exist.

Fortunately, environment variables in Linux can be changed. If Linux cannot find an application, you can replace, edit, or add to the value of PATH. Linux uses other environment variables to define aspects of the operating system beyond the location of executables. Applications often define and use their own environment variables, and users can define their own as well.
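As a minimal sketch of the changes described above, the following shell commands inspect PATH and append a directory to it. The directory /opt/myapp/bin is a hypothetical location used only for illustration.

```shell
# Hypothetical install directory, used only for illustration.
APP_DIR="/opt/myapp/bin"

# Show the directories the shell currently searches, one per line.
printf '%s\n' "$PATH" | tr ':' '\n'

# Append the new directory for the rest of this shell session.
export PATH="$PATH:$APP_DIR"

# The new directory is now the last entry searched.
printf '%s\n' "$PATH" | tr ':' '\n' | tail -n 1
```

The change applies only to the current shell and the processes it starts, so the export line usually goes into a startup file such as ~/.bashrc if it should persist.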

In addition to PATH, which helps locate applications, the LD_LIBRARY_PATH environment variable tells Linux where to search for the shared libraries used by applications, which allows you to control which libraries are “available.” Like PATH, this variable can be changed, and each shell can have its own value.

The variable can be useful when debugging a new library because you can simply point LD_LIBRARY_PATH at the directory that holds the new library, test it, and then change it back. You can also use it when upgrading libraries. If there is little or no change to the APIs in the new library, then a simple change to LD_LIBRARY_PATH allows you to use the new library without changing anything else.
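One way to try a new library directory without touching the shell's own settings is to set LD_LIBRARY_PATH for a single command. The sketch below assumes a hypothetical directory, /opt/newlib/lib, and uses a trivial child process in place of a real application.

```shell
# Hypothetical directory holding the library under test.
NEW_LIB_DIR="/opt/newlib/lib"

# Put the new directory first, keeping any existing entries after it.
TEST_PATH="$NEW_LIB_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Set the variable for one command only: the child process sees the new value ...
LD_LIBRARY_PATH="$TEST_PATH" sh -c 'echo "child sees: $LD_LIBRARY_PATH"'

# ... while the current shell keeps its original value.
echo "shell still has: ${LD_LIBRARY_PATH:-<unset>}"
```

Replacing the sh -c command with ldd followed by the path to an application should show which library files the loader would pick up under the modified search path.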

A third environment variable that also works with libraries, and is at the heart of the “trick,” is LD_PRELOAD, which contains a list of shared objects (libraries), separated by colons or spaces, that are loaded before all others. This variable gives you more control over the order in which the application finds libraries than LD_LIBRARY_PATH alone.

LD_PRELOAD can be a great help in debugging because you can set it to a new library without changing LD_LIBRARY_PATH. After debugging, just set LD_PRELOAD to its previous value.

Perhaps the greatest strength of LD_PRELOAD is that you can easily substitute a new library for an existing one, allowing you to upgrade a library in an attempt to get better performance. Inserting a library before another for whatever purpose you have in mind is the so-called LD_PRELOAD trick.
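The effect of LD_PRELOAD can be observed directly on Linux by looking at a process's memory map. The sketch below assumes a glibc-based system that provides the math library as libm.so.6 and a readable /proc filesystem. The library was chosen only because it is almost always installed, not because preloading it is useful.

```shell
# Library chosen only for illustration (assumes a glibc-based system).
LIB="libm.so.6"

# grep inspects its own memory map and counts the regions that belong to libm.
before=$(grep -c '/libm[-.]' /proc/self/maps || true)
after=$(LD_PRELOAD="$LIB" grep -c '/libm[-.]' /proc/self/maps || true)

echo "mapped regions without LD_PRELOAD: ${before:-0}"
echo "mapped regions with LD_PRELOAD:    ${after:-0}"
```

On many systems grep does not link against the math library, so the first count is zero and the second is not: The library was loaded solely because LD_PRELOAD asked for it.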

One use I’ve seen of LD_PRELOAD is to load a library that intercepts calls to a normal library. The “intercept library” uses the same symbols (functions) as the usual library so that it will intercept any function calls from the application that were intended for that library. This intercept library can then be used to gather telemetry information from the calling application, perhaps writing it to a file. The intercept library then calls the intended functions in the usual library. With LD_PRELOAD you can load the intercept library before the usual library without having to change it or the application.

A classic use case for an intercept library is gathering telemetry (information) about I/O functions. With LD_PRELOAD, the intercept library catches I/O function calls such as open(), close(), read(), and write() by using the same function names. Rather than reimplementing the I/O functionality, it typically records information about each call, writes it to a file, and then calls the normal library to perform the actual I/O. Although this is a classic use of LD_PRELOAD, it is not the only one. The next section presents another use that results in increased performance.
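The intercept pattern described above can be sketched with a few lines of C wrapped in a shell script. This is only an illustration under stated assumptions: a Linux system with glibc, a C compiler reachable as cc, and hypothetical file names (app.c, intercept.c, libintercept.so). The library defines its own write(), reports each call on stderr, and then forwards the request to the real write(), which it looks up with dlsym(RTLD_NEXT, ...).

```shell
#!/bin/sh
# Sketch only: assumes Linux with glibc and a C compiler available as "cc".
command -v cc >/dev/null 2>&1 || { echo "no C compiler found; skipping"; exit 0; }

WORK=$(mktemp -d) && cd "$WORK" || exit 1

# A stand-in "application" that knows nothing about the intercept library.
cat > app.c <<'EOF'
#include <unistd.h>
int main(void) {
    return write(STDOUT_FILENO, "hello\n", 6) == 6 ? 0 : 1;
}
EOF

# The intercept library: same symbol name as the C library's write().
cat > intercept.c <<'EOF'
#define _GNU_SOURCE            /* needed for RTLD_NEXT */
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

ssize_t write(int fd, const void *buf, size_t count) {
    static ssize_t (*real_write)(int, const void *, size_t);
    char msg[128];
    int n;

    /* Look up the next definition of write(), that is, the real one. */
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))dlsym(RTLD_NEXT, "write");

    /* Telemetry: report the call on stderr by way of the real write(). */
    n = snprintf(msg, sizeof msg, "[intercept] write(fd=%d, %zu bytes)\n", fd, count);
    if (n > 0)
        real_write(STDERR_FILENO, msg, (size_t)n);

    /* Hand the original request to the real function. */
    return real_write(fd, buf, count);
}
EOF

cc -o app app.c
cc -shared -fPIC -o libintercept.so intercept.c -ldl

./app                                      # normal run
LD_PRELOAD="$WORK/libintercept.so" ./app   # same binary, now intercepted
```

On a system that meets those assumptions, the first run should print only hello, whereas the second should print an [intercept] line before it, even though app was not recompiled. Reporting to stderr keeps the sketch short; a real telemetry library would more likely append to a logfile.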

Octave

One of the best examples I know of the LD_PRELOAD trick is pushing Basic Linear Algebra Subprograms (BLAS) computations from the CPU onto an Nvidia GPU. I will illustrate this with an example from Octave, a mathematics tool similar to Matlab.

To demonstrate the process, I’ll use two Octave scripts: The first does a simple square matrix multiply in single precision for various matrix sizes (Listing 1). The second script (Listing 2) is the same as Listing 1, but uses double precision.

Listing 1: Single-Precision Square Matrix Multiply

# Example SGEMM
 
for N = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
 
   A = single( rand(N,N) );
   B = single( rand(N,N) );
 
   start = clock();
   C = A*B;
   elapsedTime = etime(clock(), start);
 
   gFlops = 2*N*N*N / (elapsedTime * 1e+9);
 
   disp(sprintf("N = %4d, elapsed Time = %9.6f, GFlops = %9.6f ", ...
                N, elapsedTime, gFlops) );
 
endfor

Listing 2: Double-Precision Square Matrix Multiply

# Example DGEMM
 
for N = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
 
   A = double( rand(N,N) );
   B = double( rand(N,N) );
 
   start = clock();
   C = A*B;
   elapsedTime = etime(clock(), start);
 
   gFlops = 2*N*N*N / (elapsedTime * 1e+9);
 
   disp(sprintf("N = %4d, elapsed Time = %9.6f, GFlops = %9.6f ", ...
                N, elapsedTime, gFlops) );
 
endfor

To begin, I’ll run these scripts on a test system with the default BLAS library that comes with Octave; then, I can use the LD_PRELOAD trick to have Octave call a different BLAS library, resulting in different, conceivably better, performance.
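Both listings compute the rate as 2*N*N*N divided by the elapsed time. The factor comes from the operation count of a square matrix multiply: Each of the N x N output elements needs N multiplications and about N additions, for roughly 2N^3 floating-point operations in total. The same arithmetic can be sketched in the shell with awk, which is handy for checking a table entry by hand. The function name gflops and the input values are illustrative, not measurements.

```shell
# gflops N SECONDS: rate for one N x N matrix multiply (2*N^3 operations).
gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.3f\n", 2 * n * n * n / (t * 1e9) }'
}

# Illustrative inputs: a 1,000 x 1,000 multiply that takes 0.1 seconds.
gflops 1000 0.1    # prints 20.000
```

Because the elapsed times in the results are rounded to six decimal places, values recomputed this way can differ slightly from the GFLOPS figures the scripts print, especially for the smallest matrices.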

The test system is my Linux laptop:

  • CPU: Intel(R) Core(TM) i5-10300H CPU @2.50GHz
    • Processor base frequency 2.5GHz
    • Max turbo frequency 4.5GHz
    • Cache 8MB
    • Four cores (eight with hyper-threading)
    • 45W TDP
    • 8GB DDR4-2933 memory
    • Maximum of two memory channels
    • Memory bandwidth 45.8GBps
  • Nvidia GeForce 1650 GPU
    • Architecture: Turing (TU117)
    • Memory 4GB GDDR5
    • Memory speed 8Gbps
    • Memory bandwidth 128GBps
    • Memory bus 128-bit
    • L2 cache 1MB
    • TDP 75W
    • Base clock 1,485MHz
    • Boost clock 1,665MHz
    • 896 CUDA cores

The laptop runs Ubuntu 20.04 with the 455.45.01 Nvidia driver, and CUDA 11.2. Octave 5.2.0 was used for the tests. All software was installed from the Apt repository for the specific distribution version.

The two scripts were run several times (>15) for each case to get a feel for the performance; then, they were run for the results presented in this article.

Default BLAS Library

By default, Octave uses a multithreaded BLAS library. Specifically, Octave used the BLAS library located at /lib/x86_64-linux-gnu/libblas.so.3. The two scripts, one for single precision and one for double precision, were run with this default BLAS library. The straightforward command to run the single-precision code with all cores (the default) is:

$ octave-cli ./sgemm.m

To run with a single core, you modify the command slightly:

$ OMP_NUM_THREADS=1 octave-cli ./sgemm.m
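Before timing anything, it can be worth confirming which file the default BLAS path mentioned above actually resolves to, because library paths like this one are often symbolic links. The following sketch assumes the path from the text and a readlink that supports the -f option (as the GNU version does). The helper name resolve_lib is hypothetical.

```shell
# Path taken from the text above; adjust it if your distribution differs.
BLAS_LINK="/lib/x86_64-linux-gnu/libblas.so.3"

# Print the real file behind a (possibly chained) symbolic link.
resolve_lib() {
    if [ -e "$1" ]; then
        readlink -f "$1"
    else
        echo "not found: $1"
    fi
}

resolve_lib "$BLAS_LINK"
```

If the output names an OpenBLAS file, the "default" library is already OpenBLAS, which would be a reason to expect similar timings from both configurations.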

The results for running the two scripts are presented in Table 1 (where GFLOPS is a billion floating-point operations per second). First, they are run on a single core, then on all cores. A fair amount of variability is evident for N=256 and N=512, which is also true for all subsequent CPU results.

Table 1: Octave Results with Default BLAS Library

N      Single-Precision, One Core  Double-Precision, One Core  Single-Precision, All Cores Double-Precision, All Cores
       Time (secs)  GFLOPS         Time (secs)  GFLOPS         Time (secs)  GFLOPS         Time (secs)  GFLOPS
2      0.000702     0.000023       0.000427     0.000037       0.000961     0.000017       0.000137     0.000117
4      0.000069     0.001864       0.000076     0.001678       0.000099     0.001291       0.000092     0.001398
8      0.000069     0.014913       0.000061     0.016777       0.000092     0.011185       0.000084     0.012202
16     0.000061     0.134218       0.000061     0.134218       0.000092     0.089478       0.000084     0.097613
32     0.000076     0.858993       0.000076     0.858993       0.000099     0.660764       0.000107     0.613567
64     0.000099     5.286114       0.000145     3.616815       0.000153     3.435974       0.000206     2.545166
128    0.000313     13.408678      0.000587     7.139686       0.000565     7.429133       0.000473     8.867029
256    0.001785     18.795071      0.003654     9.181725       0.000542     61.944317      0.001144     29.32031
512    0.013779     19.481934      0.027763     9.668693       0.0047       57.117487      0.022438     11.963404
1,024  0.100395     21.390301      0.215065     9.985277       0.02961      72.526405      0.055252     38.867022
2,048  0.776039     22.137891      1.612694     10.652902      0.199173     86.256026      0.455025     37.755903
4,096  5.855209     23.472936      12.275261    11.196418      1.575951     87.21019       3.468651     39.623174
8,192  39.343849    27.946214      102.974144   10.677551      12.247917    89.771315      26.561623    41.394746

OpenBLAS

One of the most popular BLAS libraries is OpenBLAS, which you can use with the LD_PRELOAD trick instead of the default BLAS library. The command to run the single-precision script is:

$ OMP_NUM_THREADS=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libopenblas.so.0 octave-cli ./sgemm.m

Table 2 contains the results. Note that the OpenBLAS library was installed from the Apt repository for this distribution and version. A version built on the system itself would likely produce better results.

Table 2: Octave Results with OpenBLAS Library

N      Single-Precision, One Core  Double-Precision, One Core  Single-Precision, All Cores Double-Precision, All Cores
       Time (secs)  GFLOPS         Time (secs)  GFLOPS         Time (secs)  GFLOPS         Time (secs)  GFLOPS
2      0.000114     0.00014        0.000114     0.00014        0.001022     0.000016       0.000771     0.000021
4      0.000076     0.001678       0.000076     0.001678       0.000099     0.001291       0.000061     0.002097
8      0.000061     0.016777       0.000061     0.016777       0.000092     0.011185       0.000061     0.016777
16     0.000061     0.134218       0.000069     0.119305       0.000084     0.097613       0.000076     0.107374
32     0.000061     1.073742       0.000076     0.858993       0.000092     0.715828       0.000076     0.858993
64     0.000099     5.286114       0.000137     3.817749       0.000145     3.616815       0.000137     3.817749
128    0.000313     13.408678      0.000572     7.330078       0.000381     10.995116      0.000656     6.392509
256    0.001808     18.557158      0.003624     9.259045       0.000519     64.677155      0.001144     29.32031
512    0.013237     20.279177      0.026962     9.955963       0.004074     65.888337      0.008163     32.882591
1,024  0.101677     21.120656      0.20388      10.533061      0.035118     61.150332      0.052483     40.918008
2,048  0.774956     22.168839      1.59137      10.79565       0.201546     85.240558      0.410416     41.859683
4,096  5.741043     23.939718      11.007278    12.486188      1.558258     88.20038       3.523735     39.003771
8,192  39.33165     27.954882      84.512154    13.010101      12.305489    89.351318      26.867691    40.92319

NVBLAS

Nvidia has several libraries you can use when writing programs. Some of these libraries are standards-conforming libraries, such as cuBLAS. Nvidia has taken cuBLAS and used it as part of a “drop-in” replacement BLAS library, NVBLAS, that provides BLAS level 3 routines. Both NVBLAS and cuBLAS are included as part of CUDA, so you get them by following the directions for downloading and installing CUDA. For this article, I used the cuBLAS and NVBLAS that came with the Nvidia HPC SDK, version 21.3.

Before using NVBLAS, you have to configure it. From the NVBLAS documentation, “It must be configured through an ASCII text file that describes how many and which GPUs can participate in the intercepted BLAS calls.” To use NVBLAS, create the file nvblas.conf in the directory in which you are running the scripts. For the example in this article, the file I used contained:

# This is the configuration file to use NVBLAS Library
NVBLAS_LOGFILE  nvblas.log
NVBLAS_CPU_BLAS_LIB  /usr/lib/x86_64-linux-gnu/libopenblas.so.0
NVBLAS_GPU_LIST 0 
NVBLAS_AUTOPIN_MEM_ENABLED

The first line after the comment defines the logfile where NVBLAS writes any log information. The next line defines the CPU BLAS library that NVBLAS falls back on: Any BLAS routine that has no GPU implementation is passed to the library named by the NVBLAS_CPU_BLAS_LIB variable. In this case, I chose to use the OpenBLAS library.

The third line lists the GPU devices that should be used. The numbering begins with 0. In this case, the laptop only has one Nvidia GPU, so only one is listed. You can also use the keyword ALL to define all the GPUs in the system. The last line is something I used from an article about NVBLAS with Octave.

After configuring nvblas.conf, you have to take two steps to run Octave. The first step is to export the NVBLAS_CONFIG_FILE environment variable that points to the location of the nvblas.conf file:

$ export NVBLAS_CONFIG_FILE=$HOME/PROJECTS/OCTAVE/nvblas.conf

This environment variable just points to the ASCII configuration file you created. The second step is the run command itself, which uses the LD_PRELOAD trick to load NVBLAS first:

$ LD_PRELOAD=/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/math_libs/11.2/targets/x86_64-linux/lib/libnvblas.so.11.4.1.1026 octave-cli ./sgemm.m

The command begins by defining LD_PRELOAD, pointing to the NVBLAS library, which is then followed by the command that runs Octave (octave-cli). To run the script, you can simply concatenate the two commands together (I tend to write a one-line bash script for this). The results for the single- and double-precision scripts are shown in Table 3.
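A small wrapper along the lines of the one-line script mentioned above might look like the following sketch. The two paths are copied from the commands in this section. The function name run_with_nvblas and the readability check are additions for illustration, not part of NVBLAS or Octave.

```shell
#!/bin/sh
# Paths copied from the commands above; adjust them for your system.
NVBLAS_LIB="/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/math_libs/11.2/targets/x86_64-linux/lib/libnvblas.so.11.4.1.1026"
NVBLAS_CONF="$HOME/PROJECTS/OCTAVE/nvblas.conf"

# run_with_nvblas SCRIPT: run an Octave script with NVBLAS preloaded.
run_with_nvblas() {
    # Stop early if either file is missing, so a run is never mistaken
    # for a GPU run when the preload could not have taken effect.
    for f in "$NVBLAS_LIB" "$NVBLAS_CONF"; do
        [ -r "$f" ] || { echo "missing: $f" >&2; return 1; }
    done

    # Both variables are set for this one command only.
    NVBLAS_CONFIG_FILE="$NVBLAS_CONF" LD_PRELOAD="$NVBLAS_LIB" octave-cli "$@"
}

# Usage: run_with_nvblas ./sgemm.m
```

Setting both variables on the command line, rather than exporting them, keeps the preload from leaking into other programs started from the same shell.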

Table 3: Octave Results with the NVBLAS Library

N      Single Precision, GPU       Double Precision, GPU
       Time (secs)  GFLOPS         Time (secs)  GFLOPS
2      0.001167     0.000014       0.001007     0.000016
4      0.000076     0.001678       0.000069     0.001864
8      0.000061     0.016777       0.000061     0.016777
16     0.000061     0.134218       0.000069     0.119305
32     0.000076     0.858993       0.000076     0.858993
64     0.000099     5.286114       0.000145     3.616815
128    0.000542     7.74304        0.000603     6.958934
256    0.000549     61.083979      0.001152     29.126136
512    0.016685     16.087962      0.012955     20.721067
1,024  0.008904     241.195353     0.039238     54.72975
2,048  0.01741      986.765913     0.250496     68.583432
4,096  0.093765     1465.776933    1.500099     91.619911
8,192  0.643051     1709.835418    12.03125     91.387979

I cannot explain the strange blip in the results at N=512, but it happens very frequently. Note that the results at N=256 and N=512 were also erratic when using the CPU.

For the CPU, double-precision performance is about half of single-precision performance, which is expected. On the GPU, however, double-precision performance is far less than half of single-precision performance (about one-nineteenth at N=8,192), because the GPU used, the GeForce 1650, is a consumer-grade GPU with the focus primarily on 32-bit performance. As you can tell, it can run double-precision code, just not as well as the data center GPUs that focus on 64-bit performance.
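The single-to-double ratios can be checked with a little arithmetic on the N=8,192 rows of Tables 1 and 3. The sketch below also estimates theoretical peaks from the GPU specification list. The helper name ratio is hypothetical, and both the two-operations-per-core-per-cycle figure and the 1/32 double-precision rate are assumptions not stated in this article, so treat the peak numbers as rough guides only.

```shell
# ratio A B: print A divided by B to one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# GFLOPS at N=8,192 (single precision, then double precision).
echo "CPU, all cores: $(ratio 89.771315 41.394746)x"     # 2.2x
echo "GPU:            $(ratio 1709.835418 91.387979)x"   # 18.7x

# Rough peaks from the specification list: 896 CUDA cores at a 1,665MHz
# boost clock. Assumed: 2 operations per core per cycle (fused multiply-add)
# and a double-precision rate of 1/32 of the single-precision rate.
awk 'BEGIN {
    sp = 2 * 896 * 1.665
    printf "peak single: about %.0f GFLOPS, peak double: about %.0f GFLOPS\n", sp, sp / 32
}'
```

Under those assumptions the estimated double-precision peak is close to the roughly 91 GFLOPS measured, which would suggest the double-precision runs are limited by the hardware rather than by NVBLAS.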

Summary

The LD_PRELOAD trick is something of a rite of passage for new system administrators. Discovering it can be a revelation because of how flexible it is. Soon, it is no longer a trick but a part of what the admin uses every day. I hope the simple example in this article, which used LD_PRELOAD to move computations onto a GPU without any code changes, illustrates its utility.

If you knew of this trick but have forgotten it, or if you are just learning it, I hope this article proved useful.