OpenCL
Summary and Conclusions
The GPU doesn’t guarantee a shorter execution time. First, the OpenCL kernel has to be compiled just in time, which adds overhead. Second, the data must be copied to GPU RAM before any computation can start, and this transfer costs time. For special cases (large convolution kernels) and large volumes of data, you can still save time without even considering optimization strategies.
Far larger speed boosts are possible if you optimize the kernel functions. The threads in a work group share local memory, which is three orders of magnitude faster than the global GPU RAM. In the naive kernel, every element of the convolution kernel matrix is fetched from global memory each time it is accessed. If, instead, the elements were loaded once per work group into local memory, the kernel would make far more efficient use of the video card’s potential.
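The local memory idea can be sketched as follows. This is a minimal sketch, not the kernel from this article: the kernel name convolve_local, its argument list, the float pixel format, and the clamp-to-edge border handling are all assumptions made for illustration, as are the two host-side helpers roundUp and localFilterBytes. The OpenCL C source is stored in a C++ string, ready to be handed to the C++ bindings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// OpenCL C source of a hypothetical kernel. Names, argument order, and the
// float image format are illustrative assumptions.
static const std::string kConvolveLocalSource = R"CLC(
__kernel void convolve_local(__global const float* in,
                             __global float* out,
                             __global const float* filter,
                             const int width,
                             const int height,
                             const int fsize,
                             __local float* lfilter)
{
    // Flatten the 2D local ID so the work items can share the copy.
    const int lid   = (int)(get_local_id(1) * get_local_size(0) + get_local_id(0));
    const int lsize = (int)(get_local_size(0) * get_local_size(1));

    // Each filter element is read from global memory once per work group.
    for (int i = lid; i < fsize * fsize; i += lsize)
        lfilter[i] = filter[i];

    // Wait until the whole filter is visible to every work item in the group.
    barrier(CLK_LOCAL_MEM_FENCE);

    const int x = (int)get_global_id(0);
    const int y = (int)get_global_id(1);
    if (x >= width || y >= height)
        return;  // padding work items; leaving after the barrier is safe

    const int r = fsize / 2;
    float sum = 0.0f;
    for (int j = 0; j < fsize; ++j) {
        for (int i = 0; i < fsize; ++i) {
            // Clamp-to-edge border handling (an assumption).
            const int sx = min(max(x - (i - r), 0), width - 1);
            const int sy = min(max(y - (j - r), 0), height - 1);
            sum += lfilter[j * fsize + i] * in[sy * width + sx];
        }
    }
    out[y * width + x] = sum;
}
)CLC";

// Bytes of local memory the host must reserve for the lfilter argument.
std::size_t localFilterBytes(int fsize) {
    return static_cast<std::size_t>(fsize) * fsize * sizeof(float);
}

// OpenCL 1.x expects the global size to be a multiple of the work group
// size, so the host rounds up and the kernel skips the padding work items.
std::size_t roundUp(std::size_t n, std::size_t multiple) {
    assert(multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}
```

On the host side, the last kernel argument would receive only a size, not a buffer. With the 1.1 C++ bindings [10], this is typically done with something like kernel.setArg(6, cl::__local(localFilterBytes(fsize))). For a 1000-pixel-wide image and work groups 16 items wide, roundUp(1000, 16) gives a global size of 1008 in that dimension.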
Additionally, image convolution offers further potential for optimization if you restrict the problem to separable kernels, which can be applied as two one-dimensional passes instead of one two-dimensional pass. However, I purposely left out improvements of this kind to keep the problem simple and provide an easier entry into OpenCL. At the same time, you can view this article as a guide that will help you solve problems by running portions of your programs on the video card.
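To illustrate what separability buys, the following CPU-only sketch compares a direct 2D convolution with two 1D passes. It is not part of the code for this article: the Image struct, the function names, and the clamp-to-edge border handling are assumptions chosen for the example. For a k x k kernel that factors into a column vector and a row vector, the direct pass needs k*k multiply-adds per pixel, whereas the two 1D passes need only 2*k.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major float image (illustrative type, not from the article's code).
struct Image {
    int width;
    int height;
    std::vector<float> px;

    Image(int w, int h)
        : width(w), height(h), px(static_cast<std::size_t>(w) * h, 0.0f) {}

    float& at(int x, int y) {
        return px[static_cast<std::size_t>(y) * width + x];
    }

    // Read access with clamp-to-edge border handling.
    float clamped(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return px[static_cast<std::size_t>(y) * width + x];
    }
};

// One 1D pass along x (horizontal == true) or along y.
Image convolve1D(const Image& in, const std::vector<float>& k, bool horizontal) {
    assert(k.size() % 2 == 1);
    const int n = static_cast<int>(k.size());
    const int r = n / 2;
    Image out(in.width, in.height);
    for (int y = 0; y < in.height; ++y) {
        for (int x = 0; x < in.width; ++x) {
            float sum = 0.0f;
            for (int i = 0; i < n; ++i) {
                const int d = i - r;
                sum += k[i] * (horizontal ? in.clamped(x - d, y)
                                          : in.clamped(x, y - d));
            }
            out.at(x, y) = sum;
        }
    }
    return out;
}

// Direct 2D pass with the kernel K[j][i] = col[j] * row[i].
Image convolve2D(const Image& in, const std::vector<float>& col,
                 const std::vector<float>& row) {
    assert(col.size() % 2 == 1 && row.size() % 2 == 1);
    const int ny = static_cast<int>(col.size());
    const int nx = static_cast<int>(row.size());
    Image out(in.width, in.height);
    for (int y = 0; y < in.height; ++y) {
        for (int x = 0; x < in.width; ++x) {
            float sum = 0.0f;
            for (int j = 0; j < ny; ++j)
                for (int i = 0; i < nx; ++i)
                    sum += col[j] * row[i] *
                           in.clamped(x - (i - nx / 2), y - (j - ny / 2));
            out.at(x, y) = sum;
        }
    }
    return out;
}

// Same result from two 1D passes: along the rows first, then the columns.
Image convolveSeparable(const Image& in, const std::vector<float>& col,
                        const std::vector<float>& row) {
    return convolve1D(convolve1D(in, row, true), col, false);
}
```

A Sobel filter is a typical separable case: the column vector {1, 2, 1} and the row vector {1, 0, -1} together describe the 3x3 kernel, and convolve2D and convolveSeparable should produce the same image up to floating-point rounding. On the GPU, each 1D pass would become its own kernel launch with an intermediate buffer in between.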
For more in-depth information, I recommend the NVidia OpenCL Programming Guide [13], which investigates the video card’s hardware architecture, as well as the sample code in the ATI and NVidia SDKs. OpenCL developers will not want to be without the OpenCL specification [16] and the documentation for the C++ bindings [17].
Info
[1] Wikipedia SIMD:
[http://en.wikipedia.org/wiki/SIMD]
[2] NVidia CUDA overview:
[http://www.nvidia.com/object/what_is_cuda_new.html]
[3] Official OpenCL website:
[http://www.khronos.org/registry/cl/]
[4] AMD/ATI system requirements, driver compatibility:
[http://developer.amd.com/gpu/ATIStreamSDK/pages/DriverCompatibility.aspx]
[5] Supported NVidia GPUs:
[http://www.nvidia.com/object/cuda_gpus.html]
[6] ATI Stream SDK download:
[http://developer.amd.com/gpu/ATIStreamSDK/downloads/]
[7] ATI Stream SDK DEB package:
[http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=125792]
[8] CUDA toolkit download:
[http://developer.nvidia.com/object/cuda_3_2_downloads.html#Linux]
[9] NVidia pre-release drivers:
[http://developer.nvidia.com/object/opencl.html]
[10] OpenCL 1.1 C++ bindings header file:
[http://www.khronos.org/registry/cl/api/1.1/cl.hpp]
[11] Wikipedia on convolution:
[http://en.wikipedia.org/wiki/Convolution]
[12] Code for this article:
[http://www.linux-magazine.com/Resources/Article-Code] (choose issue 127)
[13] NVidia OpenCL programming guide:
[http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_OpenCL_ProgrammingGuide.pdf]
[14] OpenCL extension cl_khr_byte_addressable_store:
[http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/cl_khr_byte_addressable_store.html]
[15] libpng website:
[http://www.libpng.org/pub/png/libpng.html]
[16] OpenCL documentation:
[http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/]
[17] OpenCL C++ bindings documentation:
[http://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.1.pdf]
The Author
Markus Roth is a student of Computer Science at the Karlsruhe Institute of Technology (KIT), Germany, where he is researching GPU-supported acceleration in computer vision at the Institute of Anthropomatics.