Using loop directives to improve performance
Parallelizing Code
In the last half of 2018, I wrote about critical high-performance computing (HPC) admin tools [1]. HPC admins often become programming consultants by helping researchers get started with applications, debug them, and improve their performance. In addition to administering the system, then, they have to know good programming techniques and which tools to use.
MPI+X
The world is moving toward exascale computing – at least 10^18 floating-point operations per second (FLOPS) – at a rapid pace. Even though most systems aren't exascale, quite a few are at least petascale (>10^15 FLOPS) and use a large number of nodes. Programming techniques are evolving to accommodate petascale systems while getting ready for exascale. A key technique, called MPI+X, uses the Message Passing Interface (MPI) for data communication between nodes while using something else (the X) for application coding within each node.
The X can refer to any of several tools or languages, including the use of MPI across all nodes (i.e., MPI+MPI), which has been a prevalent programming technique for quite a while. Classically, each core assigned to an application is given an MPI rank and communicates over whatever network exists between the nodes. To adapt to larger and larger systems, data communication now uses multiple levels: MPI ranks within the same node can communicate directly, without a network interface card (NIC), whereas ranks on different physical nodes communicate through a NIC. Networking techniques can take advantage of specific topologies to reduce latency, improve bandwidth, and improve scalability.
Directives
A popular X category is the directive [2], which includes OpenMP and OpenACC, both of which were formed to standardize on directives that are not specific to a machine, operating system, or vendor. Directives are also referred to as "pragmas" and instruct the compiler to perform certain code transformations before compiling the resulting code.
If the compiler doesn't understand a directive (pragma), it ignores it. This feature is important because it allows a single codebase, reducing the likelihood of introducing errors into the code. For example, you can place OpenMP directives in your serial code and still run the code in either serial or parallel mode, depending on your compiler settings. In C/C++ code, a pragma takes the form #pragma token-string. For instance, #pragma omp parallel for might be all that's needed to parallelize a simple for loop. In this article, I look at OpenACC, a directives-based approach to parallelizing code and improving code performance.
OpenACC
OpenACC was originally developed to add the accelerator device support that was missing from OpenMP, and its design goals are a bit different. OpenACC takes a descriptive approach: directives describe the properties of a parallel region to the compiler, which then generates the best code it can for the hardware on which you plan to run. The goal of OpenACC is to support a wide range of accelerators, including multicore CPUs. Currently, it supports:
- POWER CPU
- Sunway
- x86 CPU
- x86 Xeon Phi
- Nvidia GPU
- PEZY-SC
As with OpenMP, OpenACC allows you to use a single codebase, which can reduce the errors that come with introducing new code; to a compiler without OpenACC support, the directives just look like comments. OpenACC uses parallel directives (marking regions that are parallelizable), data directives (controlling data movement to and from the accelerator devices), and clauses that modify them. Fundamentally, OpenACC requires that a parallel loop be free of data dependencies, which sometimes means loops have to be rewritten. When such refactoring is required, the resulting code often runs faster both with and without the directives.
OpenACC breaks the work into smaller pieces according to the directives used in the code and the target architecture. The run-time environment selects how that code is mapped to gangs – essentially groups of threads that can neither synchronize nor share data – on the target architecture. On CPUs, for example, gangs are mapped to cores; on GPUs, they are mapped to the GPU processors. For more parallelism, OpenACC can also use multiple gangs or combinations of gangs and lower level parallelism (to be covered later).