New OpenMP 4.0 Spec
Recent news about the OpenMP 4.0 specification from OpenMP.org is exciting. In addition to several major enhancements, the specification provides a new mechanism to describe regions of code for which data, computation, or both should be moved to another computing device (e.g., a GPU). The following are a few of the features that should be of interest to the HPC crowd:
- Support for accelerators. The OpenMP 4.0 API specification effort required significant participation by all the major vendors to support a wide variety of compute devices, and several prototypes for the accelerator proposal have already been implemented (the first sketch after this list shows the new target construct, together with the simd construct described next).
- Single instruction, multiple data (SIMD) constructs to vectorize both serial and parallelized loops. All major processors have some form of SIMD units available to programmers. The OpenMP 4.0 API provides portable mechanisms to describe when multiple iterations of the loop can be executed concurrently using SIMD instructions and to describe how to create versions of functions that can be invoked across SIMD lanes.
- Mechanisms for thread affinity to define where to execute OpenMP threads. Platform-specific data and algorithm-specific properties are separated, offering deterministic behavior and simplicity of use. The advantages for the user are better locality, less false sharing, and more memory bandwidth.
- New tasking extensions added to the task-based parallelism support. Tasks can be grouped to support deep task synchronization, and task groups can be aborted to reflect completion of cooperative tasking activities, such as search. Task-to-task synchronization is now supported through the specification of task dependencies (the second sketch after this list shows a task dependency along with a user-defined reduction).
- Support for Fortran 2003 language features. Having these features in the specification allows users to parallelize Fortran 2003-compliant programs. This includes interoperability of Fortran and C, which is one of the most popular features of Fortran 2003.
- Support for user-defined reductions. Previously, the OpenMP API only supported reductions with base language operators and intrinsic procedures. The OpenMP 4.0 API now supports user-defined reductions.
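To make the accelerator and SIMD items more concrete, the following is a minimal sketch, not taken from the specification, of a loop that is offloaded with the new target directive and vectorized with simd; the arrays, sizes, and map clauses are illustrative choices only. If no accelerator is present, an OpenMP 4.0 runtime simply executes the region on the host.

```c
/* Illustrative sketch of OpenMP 4.0 offload + SIMD (requires an
 * OpenMP 4.0 compiler); the arrays and sizes are made up.          */
#include <stdio.h>

#define N 10000

int main(void)
{
    float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {   /* host-side initialization */
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }

    /* Copy a and b to the device, compute there, copy c back.
     * Without a device, the region runs on the host instead.      */
    #pragma omp target map(to: a, b) map(from: c)
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}
```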
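In the same spirit, here is a small hedged sketch of the tasking and reduction additions: the depend clauses order two tasks that touch the same variable, and declare reduction defines an operator (the hypothetical “absmax” below) that the base language does not supply.

```c
/* Illustrative sketch of OpenMP 4.0 task dependencies and a
 * user-defined reduction; "absmax" and the data are made up.       */
#include <stdio.h>
#include <stdlib.h>   /* abs() */

/* Keep whichever value has the largest magnitude. */
#pragma omp declare reduction(absmax : int : \
        omp_out = (abs(omp_in) > abs(omp_out) ? omp_in : omp_out)) \
    initializer(omp_priv = 0)

int main(void)
{
    int x = 0, y = 0, m = 0;
    int data[8] = { 3, -9, 4, -1, 7, -2, 5, 0 };

    #pragma omp parallel
    {
        #pragma omp single
        {
            /* The second task waits for the first: both name x in
             * their depend clauses, creating an ordering constraint. */
            #pragma omp task depend(out: x)
            x = 42;

            #pragma omp task depend(in: x)
            y = x + 1;
        }   /* implicit barrier: both tasks are complete here */

        /* Reduce across threads with the user-defined operator. */
        #pragma omp for reduction(absmax : m)
        for (int i = 0; i < 8; i++)
            if (abs(data[i]) > abs(m))
                m = data[i];
    }

    printf("y = %d, absmax = %d\n", y, m);   /* y = 43, absmax = -9 */
    return 0;
}
```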
Other new features and improvements in the specification might be of interest. In the future, the complete specification will be available as part of most compilers, and a subset of these new features may already be available in some existing compilers.
NVIDIA recently announced its purchase of The Portland Group (PGI), which has been a leader in the automatic use of GPUs with Fortran and C/C++ compilers. Similar to OpenMP directives, PGI has developed a set of “pragma” comments, or hints, that the compiler uses to create CPU/GPU programs. Because the hints are expressed as directives, the original source code of the program remains essentially unchanged. This approach makes it easier to implement new codes or port old codes than rewriting them in OpenCL or CUDA. Of course, the performance might not be as good as using a GPU-oriented language, but the “time cost” of using GPU accelerators is drastically reduced.
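PGI's exact directive syntax is not shown here, but the flavor is close to the OpenACC directives that PGI compilers support. The following is a minimal sketch under that assumption; the loop, arrays, and data clauses are invented for illustration, and a compiler that ignores the directive still builds the same serial C program.

```c
/* Illustrative OpenACC-style sketch (PGI compilers support OpenACC);
 * the arrays and the saxpy-like loop are made up for this example.  */
#include <stdio.h>

#define N 100000

int main(void)
{
    static float x[N], y[N];
    const float a = 2.5f;

    for (int i = 0; i < N; i++) {   /* host-side initialization */
        x[i] = (float)i;
        y[i] = 1.0f;
    }

    /* Hint to the compiler: copy x in, copy y in and out, and run
     * the loop iterations in parallel on the accelerator.           */
    #pragma acc parallel loop copyin(x) copy(y)
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}
```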
Reaching Beyond the Motherboard
Intel and AMD have plenty of multicore options, including standard servers and workstations with 12, 16, or more cores. Add to this GPUs or a Xeon Phi, and you have plenty of ways to pack FLOPS into a single node. The new OpenMP standard should allow programmers to use this raw hardware easily, in terms of their domain expertise (i.e., they can work with the Fortran and C/C++ software applications with which they are familiar).
All of this sounds great until you realize OpenMP is a shared memory solution and does not really help get your application working across SMP memory domains. Assuming that your application is scalable or that you might want to tackle larger data sets, what are the options to move beyond OpenMP? In a single word, MPI (okay, it is an acronym). MPI is quite different from OpenMP, although it is possible to create hybrid applications using both. Unfortunately, it might not be as easy as using OpenMP pragmas in an existing MPI program, because many of the parallel loops are cast in MPI calls. Moreover, the dynamics of the application communication/computation ratio can change, resulting in a loss of scalability.
Running existing MPI applications on SMP systems will work and is an easy “no coding” option. If you have been running a 16-way MPI application on a cluster and would like to run it on a single 16-core node, the performance could vary depending on the application. Assuming a pure OpenMP solution will always work better than an MPI application on a single node would be a mistake. Consider Table 1, which shows the results of the NAS Parallel Benchmarks (NPB 3.2.1), which have both MPI and OpenMP versions, run on a single 16-core node (dual Intel Xeon E5-2680 processors with 32GB of RAM). Results are in mega-operations per second (MOPS), which equates to megaflops for all tests except IS (integer sort). More information on each test can be found at the NAS site.
Table 1: OpenMP and MPI versions of the NAS Parallel Benchmarks. Both tests used gcc/gfortran version 4.4.6.

| Test^a | OpenMP (MOPS) | MPI, Open MPI 1.4.4 (MOPS) | % Difference |
|--------|---------------|----------------------------|--------------|
| BT     | 35,819.75     | 26,850.06                  | 25%          |
| CG     | 7,110.39      | 8,545.62                   | 17%          |
| EP     | 630.80        | 626.49                     | 0.7%         |
| FT     | 24,451.57     | 17,554.99                  | 28%          |
| IS     | 470.63        | 904.86                     | 48%          |
| LU     | 40,600.64     | 34,120.96                  | 16%          |
| MG     | 23,289.81     | 24,823.29                  | 6%           |
| SP     | 19,751.24     | 12,061.77                  | 39%          |

a. BT, block tri-diagonal; CG, conjugate gradient; EP, embarrassingly parallel; FT, 3D fast Fourier transform; IS, integer sort; LU, lower–upper Gauss–Seidel; MG, multigrid; SP, scalar penta-diagonal.
Note in the table that the EP results are close to the same for both methods because EP involves very little communication. Also note that OpenMP is not always the better choice. In some cases (BT, FT, and SP), the difference was notable; in other cases, the difference was less than 20% (CG, LU, and MG) and even favored MPI (CG and MG). Interestingly, IS, which is latency sensitive, performed much better with MPI than with OpenMP.
Of course, your application may vary, but the results in Table 1 indicate that using MPI for SMP multicore applications might not be a bad choice. The one big advantage of MPI is its ability to scale beyond the SMP memory domain. If your application can use more cores, then adding MPI offers a scalable pathway. Additionally, MPI offers a flexible execution model, in which an application can be run many different ways on a cluster (e.g., the tests above could have been run on 16 separate nodes, one 16-way node, or anything in between).
With an Eye To the Future
Thanks to the new OpenMP specification, the HPC developer will soon have some new choices, but the path is not quite clear. The ability of OpenMP 4.0 to address both cores and accelerators is very attractive. Casting applications in this framework is a good choice if you expect a single SMP memory domain will hold enough compute horsepower for your needs. Just as the processor clock “free lunch” hit a wall (because of heat), so will the current “just add cores” free lunch (because of memory issues). Expecting a steady growth curve of x86 cores in processors is probably not a safe bet. Even now, memory contention limits how fast some applications will run on a multicore processor (i.e., applications stop scaling because of a memory bottleneck). Adding more cores only exacerbates this problem.
Using MPI is a good choice, but as Table 1 shows, driving the multiple cores of an SMP node with MPI is not always as efficient as OpenMP. In terms of accelerators, managing two distinct memory domains (the CPU and the accelerator) could require some serious contortion of a new or existing MPI program.
As mentioned, hybrid programming strategies could be a solution. Certainly, the number of cores or accelerators per node is a constantly moving target and preferably should not be hard coded into an application. In the future, almost all processors will have multiple cores and, if you have been following the AMD Fusion technology, most likely integrated accelerators as well. Thus, using OpenMP to manage “things on the node” and MPI to manage “things between nodes” seems like a good strategy (i.e., nodes are treated as single entities from an MPI viewpoint), as the sketch below illustrates.
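A minimal hybrid sketch of that strategy follows, assuming a simple sum as the workload; the variable names and the MPI_THREAD_FUNNELED threading level are illustrative choices rather than a recommendation. MPI splits the index space across ranks (one rank per node), and OpenMP threads share each rank's chunk of the work.

```c
/* Illustrative hybrid MPI + OpenMP sketch: compile with an MPI
 * wrapper and OpenMP enabled, e.g. "mpicc -fopenmp hybrid.c".      */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char *argv[])
{
    int provided, rank, size;

    /* Request an MPI library that tolerates threaded ranks. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank (node) owns a contiguous chunk of the index space. */
    long chunk = N / size;
    long start = rank * chunk;
    long end   = (rank == size - 1) ? N : start + chunk;

    double local = 0.0, total = 0.0;

    /* OpenMP spreads this rank's chunk across its cores. */
    #pragma omp parallel for reduction(+:local)
    for (long i = start; i < end; i++)
        local += 1.0 / (double)(i + 1);

    /* MPI combines the per-rank partial sums on rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("harmonic sum H(%d) = %f\n", N, total);

    MPI_Finalize();
    return 0;
}
```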
The flexibility to run a program on N nodes (N ≥ 1), with C cores per node (C ≥ 1) and A accelerators per node (A ≥ 0), should be the goal of any programmer. By abstracting away C and A, the OpenMP 4.0 API can make such applications possible. Thus, using a hybrid strategy, N, C, and A should become run-time arguments for all HPC applications. Running code almost anywhere should be a real possibility. Although a trade-off between portability and efficiency is possible, the flexibility offered to end users would be hard to resist. If you are interested in current OpenMP/MPI hybrid approaches, you will find plenty of links via your favorite search engine.
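As a closing illustration, here is a hedged sketch of discovering N, C, and A at run time rather than hard coding them: the MPI rank count stands in for N (assuming the launcher places one rank per node), and the OpenMP 4.0 runtime reports the threads and accelerator devices available to each rank.

```c
/* Illustrative run-time discovery of N (ranks), C (threads), and
 * A (accelerator devices); omp_get_num_devices() is OpenMP 4.0.    */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nranks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);   /* N: one rank per node (launcher's choice) */

    int cores   = omp_get_max_threads();      /* C: threads available to this rank */
    int devices = omp_get_num_devices();      /* A: attached accelerator devices   */

    printf("rank %d of %d: %d threads, %d accelerator device(s)\n",
           rank, nranks, cores, devices);

    MPI_Finalize();
    return 0;
}
```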