Lead Image © podfoto, 123RF.com

CPU affinity in OpenMP and MPI applications

Bindings

Article from ADMIN 67/2022
By
Get better performance from your nodes by binding processes and associating memory to specific cores.

It's called high-performance computing (HPC), not low-performance computing (LPC), not medium-performance computing (MPC), and not even really awful-performance computing (RAPC). The focus is doing everything possible to get the highest performance possible for your applications.

Needless to say, but I will say it anyway, processors and systems have gotten very complicated. Individual CPUs can have 64+ cores, and this number is growing. They are being packaged in different ways, including multichip modules [1] with memory controllers connected in various locations, multiple memory channels, multiple caches sometimes shared across cores, chip and module interconnections, network connections, Peripheral Component Interconnect Express (PCIe) switches, and more. These elements are connected in various ways, resulting in a complex non-uniform memory access (NUMA) [2] architecture.

To get the best possible performance, you want the best bandwidth and least latency between the processing elements and between the memory and processors. You want the best performance from the interconnect between processing elements, the interconnect among processing and memory elements and accelerators, and the interconnect among the processors and accelerators to external networks. Understanding how these components are connected is a key step for improving application performance.

Compounding the challenge of finding the hardware path for best performance is the operating system. Periodically, the operating system runs services, and sometimes the kernel scheduler will move running processes from one core to another as a result. Then your carefully planned hardware path can be disrupted, resulting in poor performance.

I have run all types of code on my workstation and various clusters, including serial, OpenMP, OpenACC, and MPI code. I carefully watch the load on each core with GKrellM [3], and I can see the scheduler move processes from one core to another. Even when I leave one or two cores free for system processes, in the hope that processes won't be moved, I still see them move from one core to another. In my experience, serial code stays on a particular core for only a few seconds before being moved to another core.

When a process move takes place, the application is "paused" while its state moves from one processor to another, which takes time and slows the application. After the process is moved, it could be accessing memory from another part of the system that requires traversing a number of internal interconnects, reducing the memory bandwidth, increasing the latency, and negatively affecting performance. Remember, it's not LPC, it's HPC.

Fortunately, Linux has developed a set of tools and techniques for "pinning" or "binding" processes to specific cores while associating memory to these cores. With these tools, you can tell Linux to run your process on very specific cores or limit the movement of the processes, as well as control where memory is allocated for these cores.

In this article, I present tools you can use for binding processes. In "Processor Affinity for OpenMP and MPI" (online) [4], I show how they can be used with OpenMP and MPI applications.

Example Architecture

I'll use a simple example of a single-socket system with an AMD Ryzen Threadripper [5] 3970X CPU that has simultaneous multithreading (SMT) turned on.

A first step in understanding how the processors are configured is to use the command lscpu. The output of the command on the example system is shown in Listing 1. The output notes 64 CPUs and two threads per core, which indicates that SMT is turned on and means the system has 32 "real" cores and 32 SMT cores.

Listing 1

lscpu

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD Ryzen Threadripper 3970X 32-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         2198.266
CPU max MHz:                     3700.0000
CPU min MHz:                     2200.0000
BogoMIPS:                        7400.61
Virtualization:                  AMD-V
L1d cache:                       1 MiB
L1i cache:                       1 MiB
L2 cache:                        16 MiB
L3 cache:                        128 MiB
NUMA node0 CPU(s):               0-63
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

Also note the single socket and one NUMA node. The output also lists the L1d cache as 1MiB, the L1i cache as 1MiB, the L2 cache as 16MiB, and the L3 cache as 128MiB. However, it doesn't tell you how the caches are associated with cores.

One way to get most of this information in a more compact form is shown in Listing 2.

Listing 2

Compact lscpu

$ lscpu | grep -E 'Model name|Socket|Thread|NUMA|CPU\(s\)'
CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              2
Socket(s):                       1
NUMA node(s):                    1
Model name:                      AMD Ryzen Threadripper 3970X 32-Core Processor
NUMA node0 CPU(s):               0-63
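
The same lscpu fields can be turned into core counts with a little arithmetic. The following is a minimal sketch, not part of lscpu itself: physical_cores is a hypothetical helper name, and it assumes the English field labels shown in Listing 1 (hence the suggestion to run lscpu with LC_ALL=C).

```shell
#!/usr/bin/env bash
# Sketch: derive physical and logical core counts from lscpu-style text.
# physical_cores is a hypothetical helper name, not an lscpu feature.
physical_cores() {
    # Reads lscpu output on stdin and prints "<physical> <logical>".
    # The patterns are not anchored, because some lscpu versions indent
    # these lines.
    awk -F: '
        /Socket\(s\):/          { sockets = $2 + 0 }
        /Core\(s\) per socket:/ { cores   = $2 + 0 }
        /Thread\(s\) per core:/ { threads = $2 + 0 }
        END { print sockets * cores, sockets * cores * threads }
    '
}

# On a live system (assumes lscpu is installed):
#   LC_ALL=C lscpu | physical_cores
```

Fed the values from Listing 1 (one socket, 32 cores per socket, two threads per core), the helper reports 32 physical cores and 64 logical CPUs.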

An important question to be answered is: Which cores are "real," and which cores are SMT? One way is to look at the /sys filesystem for the CPUs:

$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,32

If the first number in the output [6] is equal to the CPU number in the command, then it's a real core. If not, it is an SMT core. For the example command, the CPU number in the command is 0 and the first number is also 0. This makes it a real core.

Now try the command on a few other CPUs (Listing 3). The first command looks at CPU 1, and it's a real core (the CPU number is 1, and the first number in the output is 1, which matches). CPUs 30 and 31 are also both real cores. However, when the command is run on CPU 32, the first number in the output is 0. Because 0 does not match 32, it is an SMT core. The same is true of CPU 33.

Listing 3

Real or SMT? Method 1

$ cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
1,33
$ cat /sys/devices/system/cpu/cpu30/topology/thread_siblings_list
30,62
$ cat /sys/devices/system/cpu/cpu31/topology/thread_siblings_list
31,63
$ cat /sys/devices/system/cpu/cpu32/topology/thread_siblings_list
0,32
$ cat /sys/devices/system/cpu/cpu33/topology/thread_siblings_list
1,33

You can also use the first number in the output for the SMT cores as the real core with which it is associated. For example, CPU 32 is associated with CPU 0 (the first number in the output). So CPU 0 is the real core and CPU 32 is the SMT core in the pair.
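
The rule just described (compare the CPU number with the first entry of its thread_siblings_list) is easy to script. What follows is only a sketch under that rule: classify_cpu is a hypothetical helper name, and the parsing assumes the list starts with a number followed by either a comma (as on the example system) or a hyphen (the range form some systems print).

```shell
#!/usr/bin/env bash
# classify_cpu is a hypothetical helper: given a CPU number and the
# contents of its thread_siblings_list, it prints "real" or "smt".
classify_cpu() {
    local cpu=$1 siblings=$2
    # First entry of the list; handles "0,32" as well as the range form "0-1".
    local first=${siblings%%[,-]*}
    if [ "$first" = "$cpu" ]; then
        echo "real"
    else
        echo "smt"
    fi
}

# Walk every CPU the kernel exposes under /sys and label it.
for dir in /sys/devices/system/cpu/cpu[0-9]*; do
    f=$dir/topology/thread_siblings_list
    [ -r "$f" ] || continue          # skip CPUs without topology data
    cpu=${dir##*cpu}                 # ".../cpu12" -> "12"
    siblings=$(cat "$f")
    printf 'CPU %s: %s (siblings %s)\n' \
        "$cpu" "$(classify_cpu "$cpu" "$siblings")" "$siblings"
done
```

The loop visits CPUs in glob order rather than numeric order, so pipe the output through sort -V if you want it sorted by CPU number.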

Understanding the numbering of the real and SMT cores is important, but you have another way to check whether the CPU is real or SMT. Again, it involves examining the /sys filesystem (Listing 4). The output from the command is in pairs, listing the real CPU number first and the associated SMT CPU number last. The first line of the output says that CPU 0 is the real core and CPU 32 is the SMT CPU. Really it's the same as the previous command, except it lists all of the cores at once.

Listing 4

Real or SMT? Method 2

$ cat $(find /sys/devices/system/cpu -regex ".*cpu[0-9]+/topology/thread_siblings_list") | sort -n | uniq
0,32
1,33
2,34
3,35
4,36
5,37
6,38
7,39
8,40
9,41
10,42
11,43
12,44
13,45
14,46
15,47
16,48
17,49
18,50
19,51
20,52
21,53
22,54
23,55
24,56
25,57
26,58
27,59
28,60
29,61
30,62
31,63

The lstopo tool can give you a visual layout of the hardware along with a more detailed view of the cache layout (Figure 1). Although it can include PCIe connections as well, I've chosen not to display that output.

Figure 1: lstopo output for the sample system.

Notice in the figure that each 16MB L3 cache has four groups of two cores. The first core in each pair is the real core and the second is the SMT core. For example, Core L#0 has two processing units (PUs), where PU L#0 is a real core listed as P#0 and PU L#1 is the SMT core listed as P#32. Each group of two cores has an L2 cache of 512KB, an L1d (data) cache of 32KB, and an L1i (instruction) cache of 32KB.

The eight L3 cache "groups," each holding four pairs of cores, make a total of 64 cores with SMT turned on (8 x 4 x 2 = 64).
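
If lstopo is not at hand, the kernel exposes the same cache-to-core association under /sys. The sketch below reads the level, type, and shared_cpu_list files that Linux provides for each cache of a CPU. The helper name cache_sharing and its optional second argument (an alternative sysfs root, useful for inspecting a copied tree) are assumptions for illustration.

```shell
#!/usr/bin/env bash
# cache_sharing is a hypothetical helper. The optional second argument
# points it at a copy of the sysfs CPU tree; it defaults to the real one.
cache_sharing() {
    local cpu=$1 root=${2:-/sys/devices/system/cpu}
    local idx
    for idx in "$root/cpu$cpu"/cache/index[0-9]*; do
        [ -r "$idx/shared_cpu_list" ] || continue   # no cache info exposed
        # level: 1, 2, 3 ...; type: Data, Instruction, or Unified
        printf 'L%s %s: CPUs %s\n' \
            "$(cat "$idx/level")" "$(cat "$idx/type")" \
            "$(cat "$idx/shared_cpu_list")"
    done
}

# Show which logical CPUs share each cache level with CPU 0.
cache_sharing 0
```

Going by the layout described for Figure 1, the L3 line for CPU 0 on the example system should name eight logical CPUs (four real cores plus their SMT siblings), whereas the L1 and L2 lines should name only the pair that shares a core.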

Affinity Tools

In this article, I discuss two Linux tools that allow you to set and control where application threads (processes) run, giving you great flexibility to achieve the performance you want. For example, a great many applications need memory bandwidth. The tools allow you to make sure that each thread gets the largest amount of memory bandwidth possible.

If network performance is critical to application performance (think MPI applications), with these tools, you can bind threads to cores that are close to a network interface card (NIC), perhaps not crossing a PCIe switch. Alternatively, you can bind processes to cores that are as close as possible to accelerators to get the maximum possible PCIe bandwidth.

The Linux tools presented here allow you to bind processes and memory to cores; you have to find the best way to use these tools for the best possible application performance.

taskset

The taskset command [7] is considered the most portable Linux way of setting or retrieving the CPU affinity (binding) of a running process (thread). According to the taskset man page, "The Linux scheduler will honor the given CPU affinity and the process will not run on any other CPUs."

An example of executing a process with the taskset command is:

$ taskset --cpu-list 0,2 application.exe

This command sets the affinity of application.exe to cores 0 and 2 and then executes it. You can also use -c, the short version of the --cpu-list option.
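
To confirm that a binding took effect, you can read the Cpus_allowed_list field the kernel records in /proc/<pid>/status. The sketch below is illustrative only: allowed_cpus is a hypothetical helper name, and the taskset call uses the hyphenated --cpu-list spelling given in the taskset man page. Running grep under taskset makes grep report its own inherited affinity through /proc/self, which avoids racing against a background job.

```shell
#!/usr/bin/env bash
# allowed_cpus is a hypothetical helper: it prints the CPU list a process
# is allowed to run on, as recorded in /proc/<pid>/status.
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "/proc/$1/status"
}

# Affinity of the current shell (typically every online CPU).
allowed_cpus $$

# Start a command bound to cores 0 and 2 and let it report its own
# affinity. Guarded, because taskset or those cores may be unavailable.
if command -v taskset >/dev/null 2>&1; then
    taskset --cpu-list 0,2 grep Cpus_allowed_list /proc/self/status ||
        echo "could not bind to cores 0 and 2 on this machine" >&2
fi
```

For a process that is already running, taskset --pid --cpu-list followed by the PID prints the same information without changing anything.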

If you want to change the affinity of a running process, you need its process ID (PID), which you pass to taskset with the --pid (-p) option. For example, if you have an application with four processes (or four individual processes), you get the PID of each process and then run the following commands to move them to cores 10, 12, 14, and 16:

$ taskset --pid --cpu-list 10 [pid1]
$ taskset --pid --cpu-list 12 [pid2]
$ taskset --pid --cpu-list 14 [pid3]
$ taskset --pid --cpu-list 16 [pid4]
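
With more than a handful of processes, pairing PIDs with cores by hand gets tedious. The following sketch generates the taskset commands rather than running them, so the plan can be reviewed first. The helper name pin_plan is hypothetical, the PIDs in the demonstration are placeholders, and the commands it prints use the --cpu-list spelling from the taskset man page.

```shell
#!/usr/bin/env bash
# pin_plan is a hypothetical helper. The first argument is a
# comma-separated list of cores; the remaining arguments are PIDs.
# It prints one taskset command per PID and executes nothing itself.
pin_plan() {
    local cores=$1 core pid
    shift
    for pid in "$@"; do
        core=${cores%%,*}      # next core in the list
        cores=${cores#*,}      # drop it; a single remaining core is kept,
                               # so surplus PIDs all land on the last core
        echo "taskset --pid --cpu-list $core $pid"
    done
}

# Placeholder PIDs, for illustration only.
pin_plan 10,12,14,16 1001 1002 1003 1004

# With real processes, you might collect the PIDs with pgrep and, after
# checking the printed plan, pipe it to a shell to execute it:
#   pin_plan 10,12,14,16 $(pgrep -f application.exe) | sh
```

Printing the commands instead of executing them is a deliberate choice here: a wrong PID or core number shows up in the plan before it affects a running job.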
