Lead Image © sgame, fotolia.com

Improving performance with environment variables

Trick or No Trick

Article from ADMIN 64/2021
By using the LD_PRELOAD environment variable, you can improve performance without making changes to applications.

A topic that system administrators learn as they gain experience is called the "LD_PRELOAD Trick." This trick can help fix misbehaving applications, upgrade applications, and even improve application performance. Of course, it is not really a trick, just the use of a feature in *nix operating systems.

Have you ever installed an application on Linux and tried to run it only to be told the application can't be found? To debug the issue, probably the first thing to check is your PATH [1], which is "an environment variable … that tells the shell which directories to search for executable files." In short, the path tells Linux where to look for applications. If the application is not in the path, then Linux "thinks" it does not exist.

Fortunately, environment variables in Linux can be changed. If Linux cannot find an application, you can replace, edit, or append to the PATH variable. Linux uses other environment variables to define aspects of the operating system beyond the location of executables. Applications often define and use their own environment variables, and users can define their own as well.

In addition to PATH, which helps locate applications, the LD_LIBRARY_PATH environment variable tells Linux where to search for the shared libraries used by applications, which allows you to control which libraries are "available." Like PATH, this variable can be changed, and each shell can have its own value.

The variable can be useful when debugging a new library because you can simply point LD_LIBRARY_PATH to the directory that holds the new library, test it, and then change it back. You can also use it when upgrading libraries. If there is little or no change to the APIs in the new library, then a simple change to LD_LIBRARY_PATH allows you to use the new library without changing anything else.

A third environment variable that also works with libraries, and is at the heart of the "trick," is LD_PRELOAD, an environment variable that contains a list, separated by colons or spaces, of shared objects (libraries) [2] that are loaded before all others. This variable gives you more control over the order in which the application finds libraries than LD_LIBRARY_PATH alone.

LD_PRELOAD can be a great help in debugging because you can set it to a new library without changing LD_LIBRARY_PATH. After debugging, just set LD_PRELOAD to its previous value.

Perhaps the greatest strength of LD_PRELOAD is that you can easily substitute a new library for an existing one, allowing you to upgrade a library in an attempt to get better performance. Inserting a library before another for whatever purpose you have in mind is the so-called LD_PRELOAD trick.

One use I've seen of LD_PRELOAD is to load a library that intercepts calls to a normal library. The "intercept library" uses the same symbols (functions) as the usual library so that it will intercept any function calls from the application that were intended for that library. This intercept library can then be used to gather telemetry information from the calling application, perhaps writing it to a file. The intercept library then calls the intended functions in the usual library. With LD_PRELOAD, you can load the intercept library before the usual library without having to change it or the application.

A classic use case for an intercept library is for gathering telemetry (information) about I/O functions. With LD_PRELOAD, the intercept library intercepts I/O function calls such as open(), close(), read(), and write() to gather information and then passes the function calls to the intended I/O library. The intercept library uses the same function names, but rather than rewrite the I/O functionality for these functions, the new library typically gathers information, writes it to a file, and then calls the normal library to perform the I/O functions. Although this example is a classic use case of LD_PRELOAD, it is not the only use case. The next section presents another use of LD_PRELOAD resulting in increased performance.
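As a minimal sketch of the I/O intercept pattern described above, the following C source defines its own write(), counts calls and bytes, and forwards each call to the next write() in the library search order. The file name write_count.c, the counters, and the report format are hypothetical choices for illustration. The lookup relies on dlsym() with RTLD_NEXT, a GNU extension, and the counters are not thread safe.

```c
/* write_count.c: hypothetical LD_PRELOAD intercept library (illustrative sketch). */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* exposes RTLD_NEXT in <dlfcn.h> on glibc */
#endif
#include <dlfcn.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef RTLD_NEXT
#define RTLD_NEXT ((void *) -1L) /* fallback if _GNU_SOURCE was defined too late */
#endif

typedef ssize_t (*write_fn)(int, const void *, size_t);

static write_fn real_write;            /* the "usual" write(), resolved lazily */
static unsigned long write_calls;      /* telemetry: number of write() calls */
static unsigned long long write_bytes; /* telemetry: bytes successfully written */

/* Find the next definition of write() after this one (normally libc's). */
static write_fn lookup_real_write(void)
{
    if (real_write == NULL)
        real_write = (write_fn) dlsym(RTLD_NEXT, "write");
    return real_write;
}

/* Same name and signature as the usual write(), so callers land here first. */
ssize_t write(int fd, const void *buf, size_t count)
{
    write_fn next = lookup_real_write();
    if (next == NULL)
        return -1;                     /* no underlying write() was found */

    ssize_t n = next(fd, buf, count);  /* let the usual library do the I/O */
    write_calls++;
    if (n > 0)
        write_bytes += (unsigned long long) n;
    return n;
}

/* Runs when the process exits: print a one-line summary to stderr. */
__attribute__((destructor))
static void report(void)
{
    char line[128];
    write_fn next = lookup_real_write();
    int len = snprintf(line, sizeof line,
                       "[write_count] calls=%lu bytes=%llu\n",
                       write_calls, write_bytes);
    if (next != NULL && len > 0)
        next(STDERR_FILENO, line, (size_t) len);
}
```

Assuming GCC on Linux, the sketch could be built and preloaded as follows, where your_program is a placeholder for any dynamically linked application (older glibc releases need -ldl; on newer ones it is harmless):

$ gcc -shared -fPIC -o libwritecount.so write_count.c -ldl
$ LD_PRELOAD=./libwritecount.so your_program

Only calls that reach write() through the dynamic linker are counted. Output that the C library issues internally, for example on behalf of printf(), may bypass the intercept.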

Octave

Probably one of the best examples I know for the use of the LD_PRELOAD trick is to push basic linear algebra subprogram (BLAS) [3] computations from a CPU onto an NVidia GPU. I will illustrate this with an example from Octave [4], a mathematics tool similar to Matlab [5].

To demonstrate the process, I'll use two Octave scripts: The first does a simple square matrix multiply in single precision for various matrix sizes (Listing 1). The second script (Listing 2) is the same as Listing 1, but uses double precision.

Listing 1

Single-Precision Square Matrix Multiply

# Example SGEMM
for N = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
   A = single( rand(N,N) );
   B = single( rand(N,N) );
   start = clock();
   C = A*B;
   elapsedTime = etime(clock(), start);
   gFlops = 2*N*N*N / (elapsedTime * 1e+9);
   disp(sprintf("N = %4d, elapsed Time = %9.6f, GFlops = %9.6f ", ...
                N, elapsedTime, gFlops) );
endfor

Listing 2

Double-Precision Square Matrix Multiply

# Example DGEMM
for N = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
   A = double( rand(N,N) );
   B = double( rand(N,N) );
   start = clock();
   C = A*B;
   elapsedTime = etime(clock(), start);
   gFlops = 2*N*N*N / (elapsedTime * 1e+9);
   disp(sprintf("N = %4d, elapsed Time = %9.6f, GFlops = %9.6f ", ...
                N, elapsedTime, gFlops) );
endfor

To begin, I'll run these scripts on a test system with the default BLAS library that comes with Octave; then, I can use the LD_PRELOAD trick to have Octave call a different BLAS library, resulting in different, conceivably better, performance.

The test system is my Linux laptop (see Table 1 for specifications). The laptop runs Ubuntu 20.04 with the 455.45.01 NVidia driver and CUDA 11.2. Octave 5.2.0 was used for the tests. All software was installed from the Apt repository for the specific distribution version.

Table 1

Test System Specs

CPU: Intel Core i5-10300H [6] @2.50GHz
  Processor base frequency 2.5GHz
  Max turbo frequency 4.5GHz
  Cache 8MB
  Four cores (eight with hyper-threading)
  45W TDP
  8GB DDR4-2933 memory
  Maximum of two memory channels
  Memory bandwidth 45.8GBps
GPU: NVidia GeForce 1650 [7]
  Architecture: Turing (TU117)
  Memory 4GB GDDR5
  Memory speed 8Gbps
  Memory bandwidth 128GBps
  Memory bus 128-bit
  L2 cache 1MB
  TDP 75W
  Base clock 1,485MHz
  Boost clock 1,665MHz
  896 CUDA cores

The two scripts were run several times (>15) for each case to get a feel for the performance; then, they were run for the results presented in this article.

Default BLAS Library

By default, Octave uses a multithreaded BLAS library; on the test system, this is the library located at /lib/x86_64-linux-gnu/libblas.so.3. Both scripts, one for single precision and one for double precision, were run with this default library. The command to run the single-precision code with all cores (the default) is:

$ octave-cli ./sgemm.m

To run with a single core, you modify the command slightly:

$ OMP_NUM_THREADS=1 octave-cli ./sgemm.m

The results for running the two scripts are presented in Table 2 (where GFLOPS is a billion floating-point operations per second). The scripts were run first on a single core and then on all cores. A fair amount of variability is evident for N = 256 and N = 512, which is also true for all subsequent CPU results.
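As a check on how the GFLOPS column is derived, both listings count 2N^3 floating-point operations for an N x N matrix multiply (about N multiplications and N additions for each of the N^2 output elements) and divide by the elapsed time t in seconds. The worked example uses the single-precision, all-cores run at N = 8,192 from Table 2:

```latex
\[
\text{GFLOPS} = \frac{2N^{3}}{t \times 10^{9}}, \qquad
\frac{2 \times 8192^{3}}{12.247917 \times 10^{9}}
= \frac{1.0995 \times 10^{12}}{1.2248 \times 10^{10}}
\approx 89.77
\]
```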

Table 2

Octave Results with Default BLAS Library

(Single = single precision, Double = double precision; elapsed time in seconds)

        Single, One Core        Double, One Core        Single, All Cores       Double, All Cores
N       Elapsed (s) GFLOPS      Elapsed (s) GFLOPS      Elapsed (s) GFLOPS      Elapsed (s) GFLOPS
2       0.000702    0.000023    0.000427    0.000037    0.000961    0.000017    0.000137    0.000117
4       0.000069    0.001864    0.000076    0.001678    0.000099    0.001291    0.000092    0.001398
8       0.000069    0.014913    0.000061    0.016777    0.000092    0.011185    0.000084    0.012202
16      0.000061    0.134218    0.000061    0.134218    0.000092    0.089478    0.000084    0.097613
32      0.000076    0.858993    0.000076    0.858993    0.000099    0.660764    0.000107    0.613567
64      0.000099    5.286114    0.000145    3.616815    0.000153    3.435974    0.000206    2.545166
128     0.000313    13.408678   0.000587    7.139686    0.000565    7.429133    0.000473    8.867029
256     0.001785    18.795071   0.003654    9.181725    0.000542    61.944317   0.001144    29.32031
512     0.013779    19.481934   0.027763    9.668693    0.0047      57.117487   0.022438    11.963404
1,024   0.100395    21.390301   0.215065    9.985277    0.02961     72.526405   0.055252    38.867022
2,048   0.776039    22.137891   1.612694    10.652902   0.199173    86.256026   0.455025    37.755903
4,096   5.855209    23.472936   12.275261   11.196418   1.575951    87.21019    3.468651    39.623174
8,192   39.343849   27.946214   102.974144  10.677551   12.247917   89.771315   26.561623   41.394746

OpenBLAS

One of the most popular BLAS libraries is OpenBLAS [8], which you can use with the LD_PRELOAD trick instead of the default BLAS library. The command to run the single-precision script on a single core is:

$ OMP_NUM_THREADS=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libopenblas.so.0 octave-cli ./sgemm.m

Table 3 contains the results. Note that the OpenBLAS library was installed from the Apt repository for this distribution and version. A copy built on the test system itself would likely produce better results.

Table 3

Octave Results with OpenBLAS Library

(Single = single precision, Double = double precision; elapsed time in seconds)

        Single, One Core        Double, One Core        Single, All Cores       Double, All Cores
N       Elapsed (s) GFLOPS      Elapsed (s) GFLOPS      Elapsed (s) GFLOPS      Elapsed (s) GFLOPS
2       0.000114    0.00014     0.000114    0.00014     0.001022    0.000016    0.000771    0.000021
4       0.000076    0.001678    0.000076    0.001678    0.000099    0.001291    0.000061    0.002097
8       0.000061    0.016777    0.000061    0.016777    0.000092    0.011185    0.000061    0.016777
16      0.000061    0.134218    0.000069    0.119305    0.000084    0.097613    0.000076    0.107374
32      0.000061    1.073742    0.000076    0.858993    0.000092    0.715828    0.000076    0.858993
64      0.000099    5.286114    0.000137    3.817749    0.000145    3.616815    0.000137    3.817749
128     0.000313    13.408678   0.000572    7.330078    0.000381    10.995116   0.000656    6.392509
256     0.001808    18.557158   0.003624    9.259045    0.000519    64.677155   0.001144    29.32031
512     0.013237    20.279177   0.026962    9.955963    0.004074    65.888337   0.008163    32.882591
1,024   0.101677    21.120656   0.20388     10.533061   0.035118    61.150332   0.052483    40.918008
2,048   0.774956    22.168839   1.59137     10.79565    0.201546    85.240558   0.410416    41.859683
4,096   5.741043    23.939718   11.007278   12.486188   1.558258    88.20038    3.523735    39.003771
8,192   39.33165    27.954882   84.512154   13.010101   12.305489   89.351318   26.867691   40.92319
