Finding Your Way Around a GPU-Accelerated Cloud Environment
Speed Racer
Raw compute horsepower has migrated from the central processing unit into dedicated chips over the last decade. Starting with specialized graphics processing units (GPUs), the trend has evolved into ever more specialized options for artificial intelligence workloads, such as the tensor processing unit (TPU). Some emerging applications even make use of user-programmed field-programmable gate arrays (FPGAs) to execute customized in-silicon logic. These enhanced computing capabilities require adopting domain-specific data-parallel programming models, of which NVidia's CUDA [1] is the most widely used.
The rise of the cloud has made access to the latest hardware cost-effective even for individual engineers, because coders can purchase time on accelerated cloud instances from Amazon Web Services (AWS), Microsoft Azure, Google, or Linode, to name but a few options. This month I look at the tools needed to discover, configure, and monitor an accelerated cloud instance in my trademark style, employing the simplest possible tool that will get the job done.
Knock, Knock. Who's There?
On logging in to an environment configured by someone else (or by yourself a few weeks prior), the first question to ask is what acceleration capabilities, if any, are available. This is quickly discovered with the command:
$ ec2metadata | grep instance-type
instance-type: p3.2xlarge
Variations of the ec2metadata tool query the AWS metadata service, helping you identify the instance's type. Alon Swartz's original ec2metadata [2] is found in Ubuntu releases like Bionic (18.04), on which the Deep Learning Amazon Machine Image (DLAMI) is currently based [3]. It has since been replaced by ec2-metadata [4], as Canonical decided to standardize on Amazon's implementation of the tool beginning with Groovy Gorilla (20.10).
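Whichever variant you find installed, both are thin wrappers around the EC2 instance metadata service, which you can also query directly over HTTP (a quick check against the standard metadata endpoint; on this instance it returns the same answer):
$ curl -s http://169.254.169.254/latest/meta-data/instance-type
p3.2xlarge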
Documentation indicates this instance type is equipped with an NVidia Tesla V100 [5] datacenter accelerator (Figure 1). Built on the Volta microarchitecture [6], the V100 supports CUDA compute capability 7.0, and it was the first GPU to ship tensor cores, designed for superior machine learning performance over regular CUDA cores. You can also find this out without resorting to references by interrogating the hardware with lspci (Figure 2); equivalent information can also be obtained with lshw.
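One way to pose the question at the command line (a minimal sketch; lspci is filtered on the vendor string, and lshw, run as root, is restricted to the display class):
$ lspci | grep -i nvidia
$ sudo lshw -C display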
Stopwatch
A tidy and convenient utility to keep tabs on what is going on with the GPU is called gpustat [7]. Load information and memory utilization can be sourced alongside temperature, power, and fan speed. An updating watch [8] view is also present. After installing from the Python package repository (pip3 install gpustat), try the following:
$ gpustat -P
[0] Tesla V100-SXM2-16GB | 37'C, 0 %, 24 / 300 W | 0 / 16160 MB |
One GPU is present, running at a cool 37°C and drawing 24W while doing absolutely nothing. To proceed further, you need to find a load generator, because the trusted standbys stress [9] and stress-ng [10] do not yet supply GPU stressors. Obvious choices include the glmark2 [11] and glxgears [12] graphics load generators. Both tools measure a frame-rate benchmark and require a valid X11 display. A headless alternative is supplied by the password recovery utility hashcat [13], which includes a built-in GPU-accelerated hashing benchmark; version 6 adds a CUDA backend and can be found on the developer's website. Launching the nightmare workload profile will keep the GPU busy for some time, giving you a few minutes to test tools (Figure 3). Try it with:
$ sudo ./hashcat-6.1.1/hashcat.bin -b -O -w 4
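If a full benchmark run is more load than you need, hashcat can also benchmark a single hash mode; this variation (a sketch) restricts the run to MD5 with -m 0 for a much shorter burst of GPU activity:
$ sudo ./hashcat-6.1.1/hashcat.bin -b -m 0 -O -w 4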
Figure 4 shows the results with gpustat. Temperature is exceeding 55 degrees, and power consumption is approaching 230W; 2.5GB of memory is in use, and GPU load is now at 100%. At the same time, I took the opportunity to call on nvidia-smi [14], the NVidia System Management Interface utility, for an alternative view of the system's status. The nvidia-smi utility is the official GPU-configuration tool supplied by the vendor. Available on all supported Linux distributions, it encompasses all recent NVidia hardware. (See the "Intel and AMD Radeon GPU Tools" box.)
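To keep either view refreshing while the benchmark runs, both tools can be looped (a sketch; the fields passed to nvidia-smi are standard --query-gpu properties, sampled here every five seconds):
$ watch --color -n 1 gpustat --color
$ nvidia-smi --query-gpu=temperature.gpu,power.draw,utilization.gpu,memory.used --format=csv -l 5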
Intel and AMD Radeon GPU Tools
Users of hardware not manufactured by NVidia need not fear; Linux tools exist for their GPUs as well. The intel-gpu-tools package supplies the intel_gpu_top [15] command, which will produce a process and load listing (but alas no curses chart) on machines equipped with Intel hardware. For AMD chips, the radeontop [16] command provided by the eponymous package will do the trick, and it provides an interesting take on terminal graphics, showcasing loads in different parts of the rendering pipeline.
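Both are a single command away once installed (a sketch assuming Ubuntu package names; intel_gpu_top needs root to read the GPU's performance counters):
$ sudo apt install intel-gpu-tools radeontop
$ sudo intel_gpu_top
$ radeontop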
Another interesting bit of software coming out of Intel is the oneAPI Toolkit, which stands out for its ability to bridge execution across CPUs, GPUs, and even FPGAs with a single data-parallel abstraction [17].
The Real McCoy
This tour must inevitably end with a top-like tool. Maxime Schmitt's nvtop [18] is packaged in the universe repository starting with Focal (20.04), but it is easily compiled from source on the 18.04-based DLAMI: I was able to do so without incident in a few minutes (see the build sketch below). Packaged for the most popular Linux distributions, nvtop can handle multiple GPUs, and it produces an intuitive in-terminal plot. Conveniently, it can distinguish between graphic and compute workloads in its process listing, and it plots the load on each GPU alongside the use of GPU memory. The intermittent nature of Hashcat's many-part benchmark is shown clearly in a test (Figure 5).
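For reference, the source build on the Bionic-based DLAMI looked roughly like this (a sketch following the project's README at the time; package names are Ubuntu's, and the NVidia driver already present on the DLAMI supplies the NVML library nvtop links against):
$ sudo apt install cmake libncurses5-dev libncursesw5-dev git
$ git clone https://github.com/Syllo/nvtop.git
$ mkdir -p nvtop/build && cd nvtop/build
$ cmake ..
$ make
$ sudo make install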
One last, excellent option comes from AWS itself in the form of the CloudWatch service. CloudWatch does not track GPU metrics by default, but the DLAMI documentation provides instructions on how to configure and authorize a simple Python script that reports temperature, power consumption, GPU utilization, and GPU memory usage to the cloud service [19]. The results are great (Figure 6), and the data is stored in the service you should already be using to monitor your cloud instances, making a case for convenience and integration. You can customize the granularity of the sampling by modifying the supplied script. Take note of a minor inconsistency in the documentation: The store_resolution variable is really named store_reso.
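As a quick sketch of that customization (the variable name is the one noted above; running the script in the background with nohup is my own habit rather than part of the tutorial, and I assume the DLAMI's Python 3 environment):
$ grep -n store_reso gpumon.py
$ nohup python3 gpumon.py > gpumon.log 2>&1 &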
Infos
- [1] CUDA: https://developer.nvidia.com/cuda-zone
- [2] Alon Swartz – ec2metadata: https://www.turnkeylinux.org/blog/amazon-ec2-metadata
- [3] DLAMI: https://docs.aws.amazon.com/dlami/latest/devguide/what-is-dlami.html
- [4] ec2-metadata: http://manpages.ubuntu.com/manpages/groovy/en/man8/ec2-metadata.8.html
- [5] NVidia V100 Tensor Core GPU: https://www.nvidia.com/en-us/data-center/v100/
- [6] NVidia Volta microarchitecture: https://en.wikipedia.org/wiki/Volta_(microarchitecture)
- [7] Jongwook Choi – gpustat: https://pypi.org/project/gpustat/
- [8] watch (1) man page: http://manpages.ubuntu.com/manpages/bionic/en/man1/watch.1.html
- [9] Amos Waterland – stress (1) man page: http://manpages.ubuntu.com/manpages/bionic/man1/stress.1.html
- [10] Colin King – stress-ng (1) man page: http://manpages.ubuntu.com/manpages/bionic/man1/stress-ng.1.html
- [11] glmark2 (1) man page: http://manpages.ubuntu.com/manpages/bionic/en/man1/glmark2.1.html
- [12] glxgears (1) man page: http://manpages.ubuntu.com/manpages/bionic/en/man1/glxgears.1.html
- [13] Jens Steube – hashcat v6: https://hashcat.net/hashcat/
- [14] nvidia-smi: https://developer.nvidia.com/nvidia-system-management-interface
- [15] intel_gpu_top (1) man page: http://manpages.ubuntu.com/manpages/bionic/en/man1/intel_gpu_top.1.html
- [16] radeontop: https://github.com/clbr/radeontop
- [17] Intel oneAPI Toolkits: https://software.intel.com/content/www/us/en/develop/tools/oneapi/all-toolkits.html
- [18] Maxime Schmitt – nvtop: https://github.com/Syllo/nvtop
- [19] GPU monitoring with CloudWatch: https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-gpu-monitoring-gpumon.html