OpenACC Directives for Data Movement
OpenACC was designed with accelerators in mind: the host CPU and the accelerator device each have their own memory. In my previous article, I showed how to use OpenACC loop directives to offload regions of code from a host CPU to a multicore CPU or an attached accelerator device (e.g., a GPU). In this article, the accelerator is presumed to be a GPU, but it doesn't have to be.
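As a quick recap, here is a minimal sketch of that kind of loop offload in C (the array names and loop body are illustrative, not taken from the earlier article):

#include <stdio.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];

    /* Initialize on the host. */
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    /* Offload the loop; the compiler parallelizes it for the target
       (multicore CPU or GPU) and generates any needed data movement. */
    #pragma acc parallel loop
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}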
OpenACC Memory Layout
CPU memory is almost always larger than GPU memory, although GPU memory has much higher bandwidth than host memory. Separating these two memory pools is an I/O bus, commonly a PCIe bus. Any data transferred between the CPU and the GPU goes over this bus, which is slow relative to either memory's bandwidth. Most importantly, neither the CPU nor the GPU can compute on data until that data resides in its own memory.
OpenACC also accommodates the accelerator being a multicore CPU. In that case, the accelerator is the same as the host, so the host memory and the accelerator memory are one shared pool, and there is no memory to manage. When the accelerator has its own memory, however, data must be migrated between the CPU (host) memory and the GPU (accelerator) memory.
CUDA-Managed Memory
In general, OpenACC is designed for a host and an accelerator device, each with its own memory, which means the user has to manage moving data to and from the accelerator, as shown in Figure 1. To make things easier, OpenACC compilers have adopted the Unified Memory approach. Fundamentally, this means that a pointer can be dereferenced from either the CPU or the GPU. In non-CS-speak, the accelerator memory and the host memory appear to the application as one pool of memory.
With unified memory, you do not need to worry about specific data movement and can code as if the data will be on the accelerator when needed. In the background, the compiler and runtime handle the data movement between the host and the GPU; on NVIDIA platforms, this capability is generically referred to as CUDA Unified Memory.
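As an illustration, with the NVIDIA HPC SDK compiler this mode is typically enabled by a compiler flag rather than by source changes; the following is a minimal sketch, assuming nvc and its -gpu=managed option (the file name is illustrative):

/* Build with something like: nvc -acc -gpu=managed managed.c */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void)
{
    /* Ordinary heap allocations; with managed memory, the same
       pointers are valid on both the host and the GPU. */
    float *x = malloc(N * sizeof(float));
    float *y = malloc(N * sizeof(float));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 0.0f;
    }

    /* No data clauses needed: pages migrate on demand. */
    #pragma acc parallel loop
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i];

    printf("y[0] = %f\n", y[0]);
    free(x);
    free(y);
    return 0;
}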
When unified memory is used in an application, the compiler and runtime decide when and how to move data between the host and the accelerator, retiring pages from accelerator memory back to host memory according to usage. However, it may be better for data to stay on the accelerator if it will be used by subsequent code running there. Although OpenACC-compliant compilers improve the automatic data movement of unified memory with every new version, they are unlikely to know as much about the code as the programmer does. Therefore, OpenACC provides directives for data movement.
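For example, a structured data region states exactly what crosses the bus and keeps arrays resident on the device across several compute regions; the following is a minimal sketch (array names and sizes are illustrative):

#include <stdio.h>

#define N 1000000

int main(void)
{
    static float a[N], b[N];

    for (int i = 0; i < N; i++)
        a[i] = (float)i;

    /* Copy a to the device on entry; copy b back to the host on exit.
       Because b's host values are never needed on the device, copyout
       avoids a wasted host-to-device transfer. */
    #pragma acc data copyin(a) copyout(b)
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; i++)
            b[i] = 2.0f * a[i];

        /* b stays resident on the device between the two loops;
           no transfer happens here. */
        #pragma acc parallel loop
        for (int i = 0; i < N; i++)
            b[i] = b[i] + 1.0f;
    }

    printf("b[0] = %f, b[N-1] = %f\n", b[0], b[N - 1]);
    return 0;
}

The copy, copyin, copyout, and create clauses cover the common lifetimes, and OpenACC also provides unstructured enter data and exit data directives for data whose lifetime does not match a single lexical scope.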
Note that when the “accelerator” is just additional cores on the CPU, memory is already unified (i.e., there is only one pool of memory), so you can rely on unified memory and not worry about data movement.
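For completeness, with the NVIDIA HPC SDK the same source can be built for either target (the source file name here is hypothetical):

nvc -acc=multicore -Minfo=accel saxpy.c   # multicore CPU: one memory pool
nvc -acc=gpu -Minfo=accel saxpy.c         # GPU: separate device memory

When targeting the multicore CPU, the data clauses effectively become no-ops because there is only one memory pool.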