Parallelizing Code – Loops
In the last half of 2018, I wrote about critical HPC admin tools. Often, HPC admins become programming consultants by helping researchers get started with applications, debug the applications, and improve performance. In addition to administering the system, then, they have to know good programming techniques and what tools to use.
MPI+X
The world is moving toward Exascale computing (at least 10^18 floating point operations per second [FLOPS]) at a rapid pace. Even though most systems aren’t Exascale, quite a few are at least Petascale (>10^15 FLOPS) and use a large number of nodes. Programming techniques are evolving to accommodate Petascale systems while getting ready for Exascale. One key programming technique, called MPI+X, uses MPI in an application for data communication between nodes and something else (the X) for application coding within the node.
The X can refer to any of several tools or languages, including MPI itself (i.e., MPI+MPI), which has been a prevalent programming technique for quite a while. Classically, each core assigned to an application is assigned an MPI rank and communicates over whatever network exists between the nodes. As systems have grown larger, data communication has adapted to use multiple levels. MPI ranks within the same node can communicate directly without a network interface card (NIC). Ranks that are not on the same physical node communicate through the NIC. Networking techniques can take advantage of specific topologies to reduce latency, improve bandwidth, and improve scalability.
Directives
A popular X category is the directive, which includes OpenMP and OpenACC. Both standards were created to provide directives that are not specific to a machine, operating system, or vendor. Directives are also referred to as “pragmas” and instruct the compiler to perform certain code transformations before compiling the resulting code.
If the compiler doesn’t understand the directive (pragma), it will ignore it. This feature is important because it allows for a single codebase, reducing the likelihood of adding errors to the code. For example, you can place OpenMP directives in your serial code and still run the code in either serial mode or parallel mode, depending on your compiler setting. In C/C++ code, a pragma will look like #pragma token-string. For instance,
#pragma omp parallel for
might be all that’s needed to parallelize a simple for loop.
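To make the single-codebase idea concrete, here is a minimal sketch in C. The function names saxpy and built_with_openmp are hypothetical and chosen only for illustration; the _OPENMP macro is defined by compilers when OpenMP support is switched on. The same source should build and give the same result whether or not the directive is honored.

```c
/* Hypothetical helper for illustration: y[i] = a * x[i] + y[i]. */
void saxpy(int n, float a, const float *x, float *y)
{
    /* With OpenMP enabled, the iterations are divided among threads.
       Without it, the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Hypothetical helper: reports whether this build had OpenMP enabled.
   _OPENMP is defined by the compiler only when OpenMP is active. */
int built_with_openmp(void)
{
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}
```

With GCC, for example, adding -fopenmp to the compile line enables the directive; leaving the flag off builds the same source as serial code. Each loop iteration touches only its own element of y, which is why this loop is safe to parallelize with a single directive.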
In this article, I look at OpenACC, a directives-based approach to parallelizing code and improving code performance.
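As a first taste of what an OpenACC directive looks like, the following is a minimal sketch, not a tuned implementation. The function name scale is hypothetical. The parallel loop construct and the copy data clause are part of the OpenACC standard; a compiler without OpenACC support should simply ignore the pragma and run the loop serially.

```c
/* Hypothetical helper for illustration: scales n elements of v in place. */
void scale(int n, float factor, float *v)
{
    /* Ask the compiler to parallelize the loop, possibly on an accelerator.
       copy(v[0:n]) moves the n elements of v to the device before the loop
       and back to the host afterward. */
    #pragma acc parallel loop copy(v[0:n])
    for (int i = 0; i < n; i++) {
        v[i] *= factor;
    }
}
```

As with the OpenMP case, the loop body is untouched, so the serial version of the code remains available from the same source.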