Getting the Most from Your Cores

In a general sense, high-performance computing means getting the most out of your resources. This translates to utilizing the CPUs (cores) as much as possible. Consequently, CPU utilization becomes a very important metric to determine how well an application is using the cores. On today’s systems with multiple cores per socket and various cache levels that may or may not be shared across cores, determining CPU utilization might not be easy or simple to determine.

To explain this, a definition of CPU utilization is needed. As a starting point, I’ll use the definition from Technopedia, which states:

CPU utilization refers to a computer’s usage of processing resources, or the amount of work handled by a CPU. Actual CPU utilization varies depending on the amount and type of managed computing tasks. Certain tasks require heavy CPU time, while others require less because of non-CPU resource requirements.

The definition goes on to state:

CPU utilization should not be confused with CPU load .

This is a very important point in the quest for measuring CPU utilization of HPC applications – don’t confuse CPU load and CPU utilization.

In the days of single-core CPUs, CPU utilization was fairly straightforward. If a processor was operating at a fixed frequency of 2.0GHz, CPU utilization was the percentage of time the processor spent doing work. (Not doing work is idle. ) For 50% utilization, the processor performed about 1 billion cycles worth of work in one second. Current processors have multiple cores, hardware multithreading, shared caches, and even dynamically changing frequencies. Moreover, the exact details of these components varies from processor to processor, making CPU utilization comparison difficult.

Current processors can have shared L3 caches across all cores or shared L2 and L1 caches across cores. Sometimes these resources are shared in interesting ways. The example in Figure 1 is the Xeon E5 v2 to Xeon E5 v3 processors (courtesy of EnterpriseTech). 

Figure 1: Xeon E5v2 to Xeon E5v3 architectures.

Notice how the specific CPU architecture changes, moving from the Ivy Bridge (Xeon E5 v2) generation to the Haswell (Xeon E5 v3) generation and how the resources are shared differently in the two architectures.

Equally as important is how the various applications or “bits of work” are distributed. As a result, when resources are shared, the net effect on performance becomes dependent on workload.

Illustrating this point is fairly simple. If an application benefits from a larger cache and the cache space is shared, the performance can suffer because of cache misses. Memory access performance is also an important aspect of application performance (getting/putting data into either caches or main memory). The reported CPU utilization times include the time spent waiting for cache or memory access. This time can be larger or smaller based on the amount and the kind of resource sharing that is going on in the CPU.

Symmetric Multithreading (SMT), as in Intel’s Hyper-Threading technology, have logical cores that are the same as physical cores and may have execution units shared between them. These processors also have non-uniform memory access (NUMA) characteristics, so the placement of processes, including pinning them to certain cores, can have an effect on CPU utilization.

Also affecting CPU utilization is the frequency of the cores (logical units) doing work. Many processors have the ability to turn up their frequency if neighboring cores are idle or doing very little work. The goal is to keep the temperature of the collective processors below a threshold. This means that the frequency of the various cores can vary while an application is running, which also affects CPU utilization and, more importantly, how the frequency or work capability of the processor is computed.

Yet another event with an effect on CPU utilization is virtualization. Virtualization can introduce more complexity into the problem of CPU utilization because allocation of work to the various cores is performed by the hypervisor rather than the guest OS, so the performance counters used for measuring CPU utilization should be hypervisor-aware.

All of these factors interact so that measuring CPU utilization is not as easy as it might seem. Furthermore, trying to translate CPU utilization from one CPU architecture to another can result in very different results. Understanding the CPU architectures and how CPU utilization is measured are keys for making the transition.

CPU Utilization or CPU Load?

As mentioned in the first section of the article, “CPU utilization should not be confused with CPU load.” This is a very critical distinction that will prevent confusion.

In Linux (and *nix computing in general), system load is a measure of work that the system performs. The classic uptime (or w ) command lists three load averages for 1-minute, 5-minute, and 15-minute time periods. When the system is idle, the load number is zero. For each process that is using or waiting for the CPU, the load is incremented by one. Typically, this includes the effect of processes blocked in I/O, because busy or stalled I/O systems are increasing the load average even though the CPU is not being used. Moreover, the load is computed as an exponentially damped/weighted moving average of the load number. Note that the average is computed.

As a result, using load to measure CPU utilization has drawbacks, because processes blocked in I/O mask the true load and the use of a computed load average skews the results. The moral is that if accurate CPU utilization measurements are needed, don’t use load measurements.

psutil

You can find a number of articles around the Internet about CPU utilization on Linux machines. Many of them use uptime or w , which aren’t the best ways to determine CPU utilization, particularly if testing an HPC application that uses a majority of the cores.

For this article, I use the psutil , which is a cross-platform library for gathering information on running processes and system utilization. It currently supports Linux, Windows, OS X, FreeBSD, and Solaris; has a very easy-to-use set of functions; and can be used to write all sorts of useful tools. For example, the author of psutil wrote a top- like tool in a couple of hundred lines of Python.

The pustil documentation discusses several functions for gathering CPU stats, particularly CPU times and percentages. Moreover, these statistics can be gathered with user-controllable intervals and for either the entire system (all cores) or every core individually. Thus, psutil is a great tool for gathering CPU utilization stats.

As an example of what can be done with psutil for gathering CPU utilization statistics, a simple Python program was written that gathers CPU stats and plots them using Matplotlib (Listing 1). The program is just an example of gathering CPU statistics with some values hard-coded in.

Related content

  • Getting the most from your cores
    CPU utilization metrics tell you how well your applications are using your processing resources.
  • Monitoring Storage with iostat

    One tool you can use to monitor the performance of storage devices is iostat . In this article, we talk a bit about iostat, introduce a Python script that takes iostat data and creates an HTML report with charts, and look at a simple example of using iostat to examine storage device behavior while running IOzone.

  • Processor and Memory Metrics

    One goal of HPC administration is effective monitoring of clusters. In this article, we talk about writing code that measures processor and memory metrics on each node.

  • Remora – Resource Monitoring for Users

    Remora provides per-node and per-job resource utilization data that can be used to understand how an application performs on the system through a combination of profiling and system monitoring.

  • A Bash-based monitoring tool
    In a sea of top-like tools, bashtop impresses with an easy-to-use and efficient interface.
comments powered by Disqus