OpenMP – Coding Habits and GPUs
When first using a new programming tool or language, it’s always wise to develop good general habits early. Everyone who codes with OpenMP directives develops their own habits – some good and some perhaps not so good. As this three-part OpenMP series concludes, I highlight best practices from the previous articles that can lead to good habits.
Enamored with new things, especially those that drive performance and scalability, I can’t resist throwing a couple more new directives and clauses into the mix. After covering these new directives and clauses, I will briefly discuss OpenMP and GPUs. This pairing is fairly recent, and compilers are still catching up to the newer OpenMP standards, but it is important for you to understand that you can run OpenMP code on targeted offload devices (e.g., GPUs).
Nested Loops Part 2
In the first OpenMP article, I briefly touched on nested loops from the perspective of ordering the loops to get the most parallelism possible. Remember that the directive omp do/for only parallelizes one loop: the loop right after the directive. To create even more parallelism in code, OpenMP has added a clause to the omp do/for directive.
The clause is of the form omp do/for collapse(N), where N is the number of loops in a nested loop that should be collapsed into one “iteration space.” The iteration space is then divided with the schedule clause, if provided.
A simple example of using the collapse clause is the code snippet in Listing 1. Note that for this simple code, OpenMP will collapse the two loops together to create a larger iteration space; then, it can take that iteration space and run it across the threads as directed.
Listing 1: collapse Clause
Fortran:

```fortran
!$omp do collapse(2)
do i=1,N
   do j=1,N
      A(i,j) = B(i,j) + C(i,j)
   enddo
enddo
```

C:

```c
#pragma omp for collapse(2)
for ( i = 0; i < N; i++ ) {
   for ( j = 0; j < N; j++ ) {
      A[i][j] = B[i][j] + C[i][j];
   }
}
```
The collapse clause has some restrictions. One is that you cannot put any statements between the loops (i.e., no code between the i and j loops in the example). Reordering or refactoring code to keep the loops next to each other can pay off, especially when targeting the GPUs discussed later. Regardless, CPUs are reaching 64 cores per socket, allowing more scalability, so collapse can help on CPUs, as well.
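As a rough sketch of how collapse and schedule work together (the array names mirror Listing 1, but the size N and the use of a combined parallel for are my own illustrative choices, not from the listing):

```c
#include <stdio.h>
#define N 512

static double A[N][N], B[N][N], C[N][N];

int main(void)
{
    /* collapse(2) merges the i and j loops into one N*N iteration space,
     * which schedule(static) then divides among the threads. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            A[i][j] = B[i][j] + C[i][j];
        }
    }

    printf("A[0][0] = %f\n", A[0][0]);
    return 0;
}
```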
SIMD
If you remember your college math classes, you will recall that a scalar is a single number, whereas a vector is a list or array of numbers. Many processors today, such as the Intel Xeon Scalable CPU, IBM Power CPU, some AMD CPUs, and newer ARM CPUs have the capability of operating on vectors of numbers, rather than just a single number, at the same time. Single instruction, multiple data (SIMD) is another way of saying “vector operations” or “data parallelism.”
Why do you care about vectorization? Fundamentally, it adds more parallelism to the code, which improves performance and scalability. Vectorization applies the same set of instructions to a “pipeline” of data, improving performance. Most likely the data is taken from the L1 or L2 cache, which is much faster than regular memory. Today’s cores and compilers can help you vectorize your code with auto-vectorization (so you don’t have to vectorize by hand).
Starting in OpenMP 4.0, the directive omp simd was introduced. It basically gives a hint to the compiler to look for places in which the next loop can take advantage of any vector processing support that the hardware provides. Much of this work is done by the compiler analyzing the code to create as much SIMD as possible for the cores. However, omp simd is not a magic directive that you can sprinkle on your code to make it run faster; you need to pay attention to some gotchas:
- Application performance will be sensitive to:
  - hardware vector length,
  - data being in cache (cache misses can be a killer on performance), and
  - proper cache alignment.
- Code vectorization changes the order of operations within a loop; however, program semantics must be maintained, so each result must be independent of the previous results. Therefore, the compiler will be conservative and produce correct results, even at the expense of vectorizing.
If the compiler can't vectorize certain loops, you might have to rework the code to reduce array dependencies, which is beyond the scope of this article; however, you can find materials online that explain how to do this.
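To make the dependency issue concrete, here is a minimal sketch (the arrays and loop bodies are illustrative, not from the article): the first loop carries a dependence from one iteration to the next, so it cannot be vectorized safely, whereas the second loop has independent iterations and is a good omp simd candidate.

```c
#include <stdio.h>
#define N 1000

int main(void)
{
    static double a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Loop-carried dependence: a[i] needs the a[i-1] computed in the
     * previous iteration, so the iterations cannot run as a vector. */
    for (int i = 1; i < N; i++)
        a[i] = a[i-1] + b[i];

    /* Independent iterations: a good candidate for omp simd. */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i] + 1.0;

    printf("%f %f\n", a[N-1], b[N-1]);
    return 0;
}
```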
For completeness, the omp simd directives for Fortran and C are shown in Listing 2.
Listing 2: omp simd Directive
Fortran:

```fortran
!$omp simd
do i=1,N
   ...
enddo
!$omp end simd
```

C:

```c
#pragma omp simd
for ( i = 0; i < N; i++ ) {
   ...
}
```
For Fortran you need the end simd directive to end the region. For C/C++, the end of the loop marks the end of the region. You can also combine directives and put omp simd on the same line as omp do/for.
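Here is a minimal sketch of the combined form (the loop body and array names are illustrative): the worksharing and SIMD directives sit on one line, so iterations are divided among threads and each thread’s chunk is then vectorized.

```c
#include <stdio.h>
#define N 10000

int main(void)
{
    static double x[N], y[N];
    for (int i = 0; i < N; i++) x[i] = i;

    /* Combined construct: iterations are shared among threads (for)
     * and each thread's portion is vectorized (simd). */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        y[i] = 3.0 * x[i] + 1.0;

    printf("%f\n", y[N-1]);
    return 0;
}
```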
Since OpenMP 4.0, the OpenMP 4.5 and 5.0 standards have added clauses to omp simd that give the compiler more hints on how to vectorize the code. These clauses can get complicated, so I won't discuss them here.
To determine whether the compiler is vectorizing code, look at the compiler output. First, be sure to use an option that produces “verbose” optimization reports (with GCC, for example, the -fopt-info-vec and -fopt-info-vec-missed options). Then, look for output that tells you whether the compiler was able to vectorize the loop (e.g., “OpenMP SIMD LOOP WAS VECTORIZED” or “loop was not vectorized: existence of vector dependence”). This feedback allows you to go back into the code to look for data dependencies or other issues that prevent vectorization.
Depending on the language you use and your code, vectorization (omp simd) can be awesome, but it can also create issues. A tutorial makes the following observations:
- Incorrect results are possible
- Poor performance is possible
- Memory errors are possible
To reduce the probability of any of these happening, be sure to test the output of your code for correctness and check the run time of the code.
As a final note, the directive omp simd is relatively new, so you will need to use more recent versions of GCC, Clang, or commercial compilers. The compilers that come standard with various distributions probably won’t understand the directive. However, because the directive appears to be a comment, the compiler will just ignore it. Once you get access to a compiler that understands it, you’re just a recompile away with no code changes.
Fewest Parallel Regions
The first habit I’ve gotten into is using the fewest parallel regions possible. Just the creation and destruction of a parallel region can take tens of microseconds. Although that doesn’t sound like much, creating and destroying parallel regions thousands or tens of thousands of times can cause a measurable delay in code execution and can reduce the scalability of the code according to Amdahl’s Law. In effect, this overhead becomes “serial time” that cannot be reduced by using more processors.
To determine how to balance the number of times a parallel region is created and destroyed compared with how much time is spent in the parallel region, I found a somewhat vague rule of thumb that says: The code inside a parallel region should run “several” times longer than the time to create or destroy a parallel region. I’m not quite sure how much longer the execution time should be, but the longer, the better.
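A minimal sketch of this habit (the arrays and loop bodies are illustrative): both worksharing loops live inside a single parallel region, so the thread team is created and destroyed only once rather than once per loop.

```c
#include <stdio.h>
#define N 100000

int main(void)
{
    static double a[N], b[N];

    #pragma omp parallel      /* thread team created once */
    {
        #pragma omp for       /* first worksharing loop */
        for (int i = 0; i < N; i++)
            a[i] = 0.5 * i;

        #pragma omp for       /* second loop reuses the same threads */
        for (int i = 0; i < N; i++)
            b[i] = a[i] + 1.0;
    }                         /* thread team released once, here */

    printf("%f\n", b[N-1]);
    return 0;
}
```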
Be Safe and Think Debugging
As you port your code to OpenMP, you should remember a few items. By default, for parallel regions and worksharing, any variable defined outside of the parallel region is usually shared, and a variable defined only inside the parallel region is private. It is very important to understand which variables are shared and which are private.
Relying on the default behavior of OpenMP can be dangerous because you have to pay very close attention to what is shared and what is private. A possible best practice is to use the default(private) clause at first, which forces all variables in each thread to be private: No variable values are shared between threads.
Another option, which can also help with debugging, is to use default(none), which forces you to declare whether each variable in the OpenMP region is private or shared. If you don’t declare a variable, the compiler throws an error telling you the variable needs to be defined (i.e., the code won't compile), providing a great way to track down every variable in the parallel region or worksharing construct and define it.
The difference between the two clauses is that default(none) forces you to define every variable, including shared variables, whereas default(private) makes all variables private and does not allow shared variables. You have to decide which option works for each specific variable (i.e., know thy code).
The best practice that I use is to start with default(private) to make sure the threads are created correctly and that at least the initial portions of the parallel code execute correctly. I don’t worry too much about the correctness of the output because I’m just checking that everything starts correctly. Then, I switch to default(none) and define all variables, both shared and private. Once the code compiles and the output is correct, I start debugging and working on improving performance and scalability.
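Here is a minimal sketch of that workflow (the variable names are illustrative): with default(none), every variable used in the region must be listed explicitly as shared or private.

```c
#include <stdio.h>
#define N 1000

int main(void)
{
    static double x[N];
    double scale = 2.0;
    double tmp;

    /* default(none): every variable used in the region must be listed;
     * forgetting one (e.g., scale) becomes a compile-time error. */
    #pragma omp parallel for default(none) shared(x, scale) private(tmp)
    for (int i = 0; i < N; i++) {
        tmp = scale * i;   /* tmp is private: each thread has its own copy */
        x[i] = tmp;        /* x is shared: threads write distinct elements */
    }

    printf("%f\n", x[N-1]);
    return 0;
}
```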
Best Practices for schedule
In the second OpenMP article of the series, the schedule clause was introduced, which tells OpenMP how to distribute the blocks of work (loop iterations) across the threads. The form of the Fortran clause is:
!$omp do schedule(kind[, chunksize])
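The corresponding C clause takes the same form on the omp for pragma:
#pragma omp for schedule(kind[, chunksize])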
The option kind has five possible values:
- static
- dynamic
- guided
- auto
- runtime
Each schedule has different properties, as discussed in that same article. The question is: Which one is “best” for running your code?
As a review, the second option to the schedule clause is the chunksize. If specified, the work is divided into chunks of that size. A quick review of the schedule options:
- static: If chunksize is specified, the iteration space is divided into chunks, each of chunksize iterations, and the chunks are assigned cyclically to the threads in order (a block cyclic schedule); otherwise, the iteration space is divided into roughly equal-sized chunks, one per thread.
- dynamic: This schedule divides the iteration space up into chunks of size chunksize and puts them into a task queue. The chunks are assigned to unoccupied threads on a first come, first served basis out of the task queue (default for chunksize is 1).
- guided: The guided schedule is similar to dynamic, but the chunks start off large and get smaller exponentially. The size of the next chunk is proportional to the number of remaining iterations divided by the number of threads, where the chunksize specifies the minimum size of the chunks. The default chunksize is 1.
- auto: The auto schedule lets the run time decide, on its own, the assignment of iterations to threads. For example, if the parallel loop is executed many times, the run time can evolve a good schedule with some good load balance characteristics and low overheads. However, creating a good schedule can take time.
- runtime: This option defers the scheduling decision until run time and is defined by an environment variable (OMP_SCHEDULE) that allows you to vary the schedule simply by changing the environment variable. You cannot specify a chunksize for this schedule.
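As a small illustration of the runtime schedule (the loop body is illustrative): the schedule is chosen when the program runs, so you can experiment by setting, for example, OMP_SCHEDULE="dynamic,4" in the shell, with no recompile.

```c
#include <math.h>
#include <stdio.h>
#define N 100000

int main(void)
{
    static double y[N];

    /* schedule(runtime): the actual schedule comes from the OMP_SCHEDULE
     * environment variable (or a call to omp_set_schedule()). */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < N; i++)
        y[i] = sin(0.001 * i);

    printf("%f\n", y[N-1]);
    return 0;
}
```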
With a static schedule and the default chunksize, the loop iteration range is divided into equal-sized chunks. For example, if you have four threads, then thread 0 will get the first quarter, thread 1 the second quarter, thread 2 the third quarter, and thread 3 the fourth quarter of the iteration range. This approach tends to improve cache utilization because the processor will grab subsequent memory locations, usually enough to fill a cache line, providing much faster data access for data in subsequent iterations.
As you can probably discern, the static schedule works well when the threads have close to the same amount of work to perform (uniform work), but when different iterations require differing lengths of time (i.e., the per-iteration workloads are non-uniform), the static schedule might not be ideal. Instead, one of the other schedules would be a better choice because they address non-uniform workloads and can often improve performance. For example, a dynamic schedule with a smaller chunksize does a better job of load balancing when iterations have widely varying run times.
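For example, here is a minimal sketch (the triangular inner loop is just an illustrative source of non-uniform work): with schedule(dynamic, 4), threads pull chunks of four iterations from the queue as they finish, so threads that draw the cheap early iterations don’t sit idle.

```c
#include <stdio.h>
#define N 2000

int main(void)
{
    static double row_sum[N];

    /* Iteration i does roughly i units of work, so later iterations are much
     * more expensive than early ones; dynamic scheduling balances the load. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < i; j++)
            s += (double)j / (i + 1);
        row_sum[i] = s;
    }

    printf("%f\n", row_sum[N-1]);
    return 0;
}
```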
A dynamic schedule, however, is not just a “free” change to a different schedule. While the application is running, the OpenMP run time has to make scheduling decisions, which can slow down the application. A static schedule, on the other hand, makes no decisions while the loop runs (the assignment of iterations is fixed up front) but might not be the most efficient division of the iteration range.
I typically use the static schedule, which divides the loop iterations into one block per thread; however, a number of people have developed best practices for choosing a schedule. By the way, the default schedule is implementation dependent, so it’s not a bad idea to specify explicitly the one you want.
Viktor Eijkhout of TACC (Texas Advanced Computing Center) has created OpenMP documentation with some schedule recommendations:
- Using the dynamic schedule (#pragma omp for schedule(dynamic[,chunk])) can deliver good load balancing if the iterations take very different amounts of time to execute, although it does carry run-time overhead for managing the queue of iteration tasks.
- Using the guided schedule (schedule(guided)) gradually decreases the chunk size. The thinking here is that large chunks carry the least overhead, but smaller chunks are better for load balancing.
- If you don’t want to decide on a schedule in your code, you can specify runtime and the schedule will then be chosen at run time by being read from the OMP_SCHEDULE environment variable. You can even just leave it to the run-time library by specifying the schedule as auto.