CPU affinity in OpenMP and MPI applications
Bindings
numactl
One key tool for pinning processes is numactl [8], which can be used to control the NUMA policy for processes, shared memory, or both. Unlike taskset, numactl can't be used to change the policy of a running application; however, you can use it to display information about your NUMA hardware and the current policy (Listing 5). Note that SMT is turned on for this system, so the output shows 64 CPUs.
Listing 5: numactl

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 64251 MB
node 0 free: 60218 MB
node distances:
node   0
  0:  10
The system has one NUMA node (available: 1 nodes), and all 64 CPUs are associated with that NUMA node. Because there is only one NUMA node, the node distance from NUMA node 0 to NUMA node 0 is listed as 10, which indicates it's the same NUMA node. The output also shows that the node has 64GB of memory (node 0 size: 64251 MB).
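Listing 5 tells you how many CPUs there are, but not which logical CPUs share a physical core. To see that, you can query the kernel's topology files directly. The following is a minimal sketch, assuming sysfs is mounted at /sys; the sibling pairing it reports depends on the machine:

# List each logical CPU together with its SMT siblings. With SMT
# enabled, each physical core appears as a pair of logical CPUs.
$ grep . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list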
The advantages of numactl come from its ability to place and bind processes, particularly in relation to where memory is allocated, for which it has several "policies" that are implemented as options to the command (a combined example follows this list):
- The --interleave=<nodes> policy has the application allocate memory in a round-robin fashion on "nodes." With two NUMA nodes, for example, this means memory will be allocated first on node 0, followed by node 1, then node 0, node 1, and so on. If the memory allocation cannot work on the current interleave target node (node x), it falls back to other nodes, but in the same round-robin fashion. You can control which nodes are used for memory interleaving or use them all:

$ numactl --interleave=all application.exe

This example command interleaves memory allocation on all nodes for application.exe. Note that the sample system in this article has only one node, node 0, so all memory allocation uses it.
- The --membind=<nodes> policy forces memory to be allocated from the list of provided nodes (including the all option):

$ numactl --membind=0,1 application.exe

This policy causes application.exe to use memory from node 0 and node 1. Note that a memory allocation can fail if no more memory is available on the specified nodes.
- The --cpunodebind=<nodes> option causes processes to run only on the CPUs of the specified node(s):

$ numactl --cpunodebind=0 --membind=0,1 application.exe

This policy runs application.exe on the CPUs associated with node 0 and allocates memory on node 0 and node 1. Note that the Linux scheduler is free to move the processes between those CPUs as long as the policy is met.
- The --physcpubind=<CPUs> policy executes the process(es) on the list of CPUs provided:

$ numactl --physcpubind=+0-4,8-12 application.exe

This policy runs application.exe on CPUs 0-4 and 8-12. You can also specify all, and it will use all of the CPUs.
- The --localalloc policy forces allocation of memory on the current node:

$ numactl --physcpubind=+0-4,8-12 --localalloc application.exe

This policy runs application.exe on CPUs 0-4 and 8-12, while allocating memory on the current node.
- The --preferred=<node> policy causes memory allocation on the node you specify, but if that allocation can't be satisfied, it falls back to using memory from other nodes. To set the preferred node for memory allocation to node 1, use:

$ numactl --physcpubind=+0-4,8-12 --preferred=1 application.exe

This policy can be useful if you want to keep application.exe running even if no more memory is available on the preferred node.
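One practical detail when combining these options: if the application takes flags of its own, put -- between the numactl options and the command so that numactl doesn't try to parse the application's flags. A minimal sketch, with application.exe and a hypothetical --input flag standing in for your own program:

# Everything after -- is passed untouched to application.exe.
$ numactl --cpunodebind=0 --membind=0 -- application.exe --input data.dat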
To show the NUMA policy setting for the current process, use the --show (-s) option:
$ numactl --show
Running this command on the sample system produces the output in Listing 6.
Listing 6: numactl --show

$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
cpubind: 0
nodebind: 0
membind: 0
The output is fairly self-explanatory. The policy is default, and the preferred NUMA node is the current one (this system has only one node). It then lists the physical cores (physcpubind) associated with the current node, the node to which the CPUs are bound (node 0), and the node to which memory allocation is bound (again, node 0).
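Because numactl --show reports the policy of the process it runs in, you can chain it under numactl to confirm that a policy is applied the way you expect. The binding values in this minimal sketch are just an illustration:

# The inner numactl inherits the outer policy, so its output should show
# physcpubind restricted to CPUs 0-3 and membind restricted to node 0.
$ numactl --physcpubind=0-3 --membind=0 numactl --show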
The next examples show some numactl options that define commonly used policies. The first example focuses on running a serial application – in particular, running the application on CPU 2 (a non-SMT core) and allocating memory locally:
$ numactl --physcpubind=2 --localalloc application.exe
The kernel scheduler will not move application.exe from core 2 and will allocate memory using the local node (node 0 for the sample system).
To give the kernel scheduler a bit more freedom, yet keep memory allocation local to provide the opportunity for maximum memory bandwidth, use:
$ numactl --cpunodebind=0 --membind=0 application.exe
The kernel scheduler can move the process among the CPU cores associated with node 0 while allocating memory on node 0. This policy helps the kernel adjust processes as it needs to without sacrificing too much memory performance. Personally, I find the kernel scheduler tends to move things around quite often, so I like binding my serial application to a specific core; the scheduler can then put other processes on other cores as needed, eliminating any latency from moving my application around.
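If you want to watch this behavior yourself, the psr column of ps reports the processor a process last ran on, so you can see whether the scheduler is moving an unpinned application between cores. A small sketch, assuming the process is named application.exe:

# Refresh every second; PSR is the CPU the process last ran on.
$ watch -n 1 'ps -o pid,psr,comm -C application.exe'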
Tool for Monitoring CPU Affinity
Both taskset and numactl allow you to check on any core or memory bindings. However, sometimes they aren't enough, which creates an opportunity for new tools. A good affinity monitoring tool, show_affinity [9], comes from the Texas Advanced Computing Center (TACC).
The tool shows "… the core binding affinity of running processes/threads of the current user." The GitHub site has a simple, but long, output example from running the command (Figure 2).
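If you can't install an extra tool, the proc filesystem provides a rough equivalent, because every thread of a process exposes the list of CPUs it is allowed to run on. A minimal sketch, where <pid> is a placeholder for the process ID you want to inspect:

# One line per thread; Cpus_allowed_list is the set of CPUs the thread may use.
$ grep Cpus_allowed_list /proc/<pid>/task/*/status
# taskset can report (or change) the binding of a single process.
$ taskset -cp <pid>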
Summary
Today's HPC nodes are complicated, with huge core counts, distributed caches, various memory connections, PCIe switches with connections to accelerators, and NICs, making it difficult to understand clearly where your processes are running and how they are interacting with the operating system. This understanding is critical to getting the best possible performance, so you have HPC and not RAPC.
If you don't pay attention to where your processes are running, the Linux process scheduler will move them around, introducing latency and reducing performance. The scheduler can move processes into non-optimal situations, where memory is used from a different part of the system, resulting in much-reduced memory bandwidth. It can also cause processes to communicate with NICs across PCIe switches and internal system connections, again resulting in increased latency and reduced bandwidth. The same is true for accelerators communicating with each other, with NICs, and with CPUs.
Fortunately, Linux provides a couple of tools that allow you to pin (also called binding or setting the affinity of) processes to specific cores along with specific directions on where to allocate memory. In this way, you can prevent the kernel process scheduler from moving the processes or at least control where the scheduler can move them. If you understand how the systems are laid out, you can use these tools to get the best possible performance from your application(s).
In this article, I briefly introduced two tools along with some very simple examples of how you might use them, primarily on serial applications.
Infos
- Multichip Modules: https://en.wikipedia.org/wiki/Multi-chip_module
- Non-Uniform Memory Access (NUMA): https://en.wikipedia.org/wiki/Non-uniform_memory_access
- GkrellM: http://gkrellm.srcbox.net/
- "Processor Affinity for OpenMP and MPI" by Jeff Layton: https://www.admin-magazine.com/HPC/Articles/Processor-Affinity-for-OpenMP-and-MPI
- AMD Ryzen Threadripper: https://www.amd.com/en/products/cpu/amd-ryzen-threadripper-3970x
- First number in the output: https://stackoverflow.com/questions/7274585/linux-find-out-hyper-threaded-core-id
- Taskset command: https://man7.org/linux/man-pages/man1/taskset.1.html
- numactl: https://linux.die.net/man/8/numactl
- show_affinity: https://github.com/TACC/show_affinity