Secret Sauce
HIP: CUDA Integration with ROCm
The Heterogeneous-Compute Interface for Portability (HIP) is the secret sauce inside AMD’s new ROCm GPU-accelerated programming environment. AMD makes a grand re-entry into the HPC space with ROCm, a platform designed to let the programmer work in a familiar language and write GPU-accelerated code for multiple platforms.
What is HIP and why is it so important to the success of ROCm? I talked to AMD Senior Architect and Software Fellow Ben Sander about HIP and how it emerged from the ROCm development process. “When we started putting together the ROCm environment, one of the things we heard from customers was ‘what’s your solution for CUDA?’ All this CUDA code is out there. How will we plug into all that previous work?”
CUDA is a GPU programming language based on C++; it is maintained by Nvidia and works only with Nvidia GPUs. AMD wanted a solution for GPU-accelerated programming on their own GPUs, but they didn’t like the idea of building a solution within an isolated, proprietary environment. The AMD developers envisioned ROCm as an open platform for GPU-based programming that would support AMD GPUs but would also allow other vendors to support their own hardware through the ROCm code base.
According to Sander, the developers knew that ROCm would need to fit in with what HPC programmers are already doing. “We found there are a lot of people in the community who know how to program in CUDA. We knew that if we found a way for source-portable CUDA code to run on our platform, it would dramatically expand our software ecosystem.”
The solution was HIP – a C++ run-time API and kernel language that is structured in a way that supports easy, automated porting of CUDA code to a vendor-neutral format. This neutral HIP format provides a path for adapting CUDA code to fit in with AMD’s ROCm development stack. A ROCm “hipify” tool converts CUDA code to HIP format, thus making it compatible with ROCm. Sander says, “Typically 99% of the code is either the same or converts automatically with the tool. There are a few differences, so a developer has to go in and tweak the remaining 1%.” The ROCm developers recently converted the CUDA-based Caffe machine learning library to HIP and found that 99.6% of the 55,000 lines of code converted automatically, with the remaining code requiring less than one week of developer time.
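To make the conversion concrete, here is a minimal sketch of the kind of translation the hipify tool performs (the saxpy kernel and variable names are illustrative, not taken from AMD’s materials). The kernel body and index arithmetic survive unchanged; only the cuda* runtime calls and the <<<...>>> launch syntax are rewritten:

// A HIP version of a small CUDA saxpy program, roughly what hipify emits.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // same index math as CUDA
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    float *d_x, *d_y;
    hipMalloc((void **)&d_x, n * sizeof(float));   // was cudaMalloc
    hipMalloc((void **)&d_y, n * sizeof(float));
    hipMemcpy(d_x, x.data(), n * sizeof(float), hipMemcpyHostToDevice);  // was cudaMemcpy
    hipMemcpy(d_y, y.data(), n * sizeof(float), hipMemcpyHostToDevice);

    int blocks = (n + 255) / 256;
    // was: saxpy<<<blocks, 256>>>(n, 2.0f, d_x, d_y);
    hipLaunchKernelGGL(saxpy, dim3(blocks), dim3(256), 0, 0, n, 2.0f, d_x, d_y);

    hipMemcpy(y.data(), d_y, n * sizeof(float), hipMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);                   // expect 4.0
    hipFree(d_x);
    hipFree(d_y);
    return 0;
}

Because the HIP API deliberately mirrors the CUDA runtime call for call, most of the conversion is a mechanical renaming, which is what makes the high automation figures possible.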
The close compatibility between CUDA and HIP offers another interesting benefit: The vendor-neutral HIP format also lets you port code written for the ROCm environment to the Nvidia/CUDA stack. The result is an open environment in which a developer can write code once and use it with either Nvidia or AMD GPUs. Performance with HIP code on Nvidia GPUs is the same as using native CUDA on the same platform.
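As a rough sketch of what write-once looks like in practice, the same HIP source file builds with the hipcc compiler wrapper on either stack; the platform macros below are the ones early HIP releases defined for isolating the rare back-end-specific tweak and should be read as an assumption about this generation of the tool chain:

// One source file, two back ends; hipcc picks the tool chain.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
#if defined(__HIP_PLATFORM_NVCC__)
    printf("Built for the Nvidia/CUDA stack\n");   // hipcc dispatched to nvcc
#elif defined(__HIP_PLATFORM_HCC__)
    printf("Built for the AMD/ROCm stack\n");      // hipcc dispatched to hcc
#else
    printf("Unknown HIP platform\n");
#endif
    return 0;
}

On an Nvidia system, hipcc hands the file to nvcc; on a ROCm system, it hands it to hcc, so a command like hipcc saxpy.cpp -o saxpy works unmodified on both.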
To create a solution that ports up to 99% of the CUDA code automatically, the HIP developers knew they would need to create a fairly complete collection of CUDA-equivalent features. “HIP offers strong support for the most commonly used parts of the CUDA API,” Sander explains, “… streams, events, memory allocation and deallocation, profiling, and driver API support. On CUDA, developers can use native CUDA tools (nvcc, nvprof, etc.). On ROCm, developers can use native ROCm tools (hcc, CodeXL, etc.).”
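Here is a short sketch of the stream and event support Sander mentions, written just the way a CUDA programmer would write it with cudaStream_t and cudaEvent_t (the scale kernel is only a placeholder workload):

// Timing an async copy/kernel/copy pipeline with HIP streams and events.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h, *d;
    hipHostMalloc((void **)&h, n * sizeof(float), 0);  // pinned host memory, like cudaMallocHost
    hipMalloc((void **)&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    hipStream_t stream;
    hipEvent_t start, stop;
    hipStreamCreate(&stream);
    hipEventCreate(&start);
    hipEventCreate(&stop);

    hipEventRecord(start, stream);
    hipMemcpyAsync(d, h, n * sizeof(float), hipMemcpyHostToDevice, stream);
    hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, stream, d, n);
    hipMemcpyAsync(h, d, n * sizeof(float), hipMemcpyDeviceToHost, stream);
    hipEventRecord(stop, stream);
    hipEventSynchronize(stop);

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);  // same semantics as cudaEventElapsedTime
    printf("pipeline took %.3f ms, h[0] = %f\n", ms, h[0]);

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipStreamDestroy(stream);
    hipFree(d);
    hipHostFree(h);
    return 0;
}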
The ROCm platform isn’t just for C++ derivatives like CUDA and HIP. The platform currently supports OpenCL and Python, and you can even embed assembly language into your GPU-accelerated programs. ROCm also supports ISO C++ standards, including C++11, C++14, and some C++17 features.
This multilanguage support offers a point of entry for programmers from many diverse platforms and environments, providing access to the benefits of GPU acceleration for coders from a wide range of backgrounds. Sander adds that the benefits for Python programmers are particularly significant. “A lot of Python code is used for quick prototyping. The view of the Python guy is, ‘If I can very easily get some speedup from GPU acceleration without having to change my code, I’m happy.’ This on-the-fly quality of Python code often means no one is tuning it for performance, and the results of GPU acceleration can be quite pronounced. Our tests have shown it is not uncommon to see up to a 500x speedup with Python.”
Who’s using HIP? According to Sander, one important area of emphasis is artificial intelligence and machine learning, which is receiving a big share of attention from the high-performance computing community. “Machine learning frameworks are extremely well suited to the GPU. There is a huge amount of interest in running TensorFlow (machine learning library). We’ve learned a lot about the porting process working with TensorFlow and Caffe.”
The versatile and modular ROCm platform has the potential to become a central feature of the HPC ecosystem regardless of the programmer’s language choice, but Sander points out that both HIP and CUDA, which were created because the standard C++ environment does not offer a means for coding to the GPU without extensions, might ultimately prove to be transitional technologies. “A lot of work is going on inside the ISO C++ group for supporting GPUs. My impression before I started working on this was that the ISO C++ group wasn’t very interested in GPUs, but that has changed. The US Department of Energy is a driving force for supporting GPUs in C++. It could be 2020 to 2023 before we have everything we need for GPU support in ISO C++, but the transition is definitely coming.” At that point, ROCm, with its language-independent design, will still be around, but dialects like CUDA and HIP won’t be necessary.
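For a sense of the direction that committee work points toward, C++17 already lets you express the same saxpy-style data parallelism in pure ISO C++ through the parallel algorithms; whether a given implementation offloads this to a GPU is a tool-chain decision, not something the standard mandates:

// saxpy in pure ISO C++17: no kernel, no launch syntax, no vendor extension.
// Build with a C++17 tool chain; with GCC's libstdc++ you also link TBB
// (g++ -std=c++17 stdpar_saxpy.cpp -ltbb).
#include <algorithm>
#include <cstdio>
#include <execution>
#include <vector>

int main() {
    std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
    std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                   y.begin(), [](float xi, float yi) { return 2.0f * xi + yi; });
    printf("y[0] = %f\n", y[0]);  // expect 4.0
    return 0;
}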
Until then, AMD is betting that the ROCm platform and its HIP secret sauce will usher in a new era of cross-platform programming in the HPC space.