OpenCL
The Framework for GPU Applications
When you create a regular C++ program for execution on the CPU, the data is stored in RAM and in the CPU registers. If you want the GPU to process the data, you first must feed it to the GPU: You copy the data you want to process across the bus from RAM to video memory. The results are retrieved in the same way, but in the opposite direction.
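For illustration, a minimal sketch of this round trip with the OpenCL C++ bindings could look like the following. The names context, queue, hostIn, hostOut, and n are placeholders of mine, not part of Listing 3:

    // Sketch: move data from RAM to video memory, process it, copy it back
    cl::Buffer deviceBuf(context, CL_MEM_READ_WRITE, n * sizeof(float));
    queue.enqueueWriteBuffer(deviceBuf, true, 0, n * sizeof(float), hostIn);  // RAM -> video memory
    // ... run a kernel that reads and writes deviceBuf ...
    queue.enqueueReadBuffer(deviceBuf, true, 0, n * sizeof(float), hostOut);  // video memory -> RAM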
The video card runs one thread on each core. All threads run the same kernel function (not to be confused with the convolution kernel described above), but each with a different index. All threads that are active at any given moment execute exactly the same instruction. The kernel is a piece of C source code that tells each thread what it has to do, depending on its index.
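To see the index mechanism in isolation, consider a stripped-down example kernel of mine (not the convolution kernel from Listing 3) that adds two vectors; every thread uses its global index to pick the one element it is responsible for:

    __kernel void vectorAdd(__global const float *a,
                            __global const float *b,
                            __global float *c,
                            uint n)
    {
        size_t i = get_global_id(0); /* this thread's index */
        if(i < n)                    /* discard surplus threads */
        {
            c[i] = a[i] + b[i];
        }
    }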
Depending on the scale of the problem, you need to define the number of threads (in this case, the number of pixels in the output image) and the data (in this case, the input image, the output image, and the convolution kernel) and then launch the kernel.
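Because threads are started in work-groups of a fixed size, the global thread count is normally rounded up to the next multiple of the work-group size, and the surplus threads exit immediately. For example, with the 16x16 work-groups used below and a 1000x800-pixel output image, the global range becomes 1008x800; the eight extra threads per row are killed by the bounds check at the top of the kernel (lines 28-32 of Listing 3).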
Firing Up the GPU
Listing 3 shows the OpenCL code for the convolution example.
Listing 3: OpenCL Code
001 #define __CL_ENABLE_EXCEPTIONS
002 #include "convolve.hpp"
003 #include "timer.hpp"
004
005 #include "CL/cl.hpp"
006 #include <limits>   // uchar max, min
007 #include <assert.h>
008 #include <iostream>
009 #include <string.h> // strlen
010 /**
011  * The OpenCL kernel for image convolution.
012  */
013 const char* kernelSource = "\
014 __kernel void convolveKernel(\
015     global uchar *in,\
016     uint inWidth,\
017     uint inHeight,\
018     global uint *out,\
019     uint outWidth,\
020     uint outHeight,\
021     global float *convKernel,\
022     uint convKernelWidth,\
023     uint convKernelHeight)\
024 {\
025     size_t x = get_global_id(0);\
026     size_t y = get_global_id(1);\
027 \
028     /* Kill unneeded threads */\
029     if(x >= outWidth || y >= outHeight)\
030     {\
031         return;\
032     }\
033 \
034     float convolutionSum = 0.0f;\
035     for(size_t ky = 0; ky < convKernelHeight; ++ky)\
036     {\
037         for(size_t kx = 0; kx < convKernelWidth; ++kx)\
038         {\
039             convolutionSum += (float) in[(y + ky) * inWidth + (x + kx)] * convKernel[ky * convKernelWidth + kx];\
040         }\
041     }\
042     out[y * outWidth + x] = (uint) clamp(convolutionSum, 0.0f, 255.0f);\
043 }";
044
045 /**
046  * Convolve a grayscale image with a convolution kernel on the GPU using OpenCL.
047  */
048 grayImage convolveGPU(grayImage in, convolutionKernel convKernel)
049 {
050     grayImage out;
051     out.width = in.width - (convKernel.width - 1);
052     out.height = in.height - (convKernel.height - 1);
053     out.data = new uchar[out.height * out.width];
054
055     // Platforms
056     std::vector<cl::Platform> platforms;
057     cl::Platform::get(&platforms);
058     assert(platforms.size() > 0);
059
060     // Devices
061     std::vector<cl::Device> devices;
062     platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
063     assert(devices.size() > 0);
064     assert(devices[0].getInfo<CL_DEVICE_TYPE>() == CL_DEVICE_TYPE_GPU);
065
066     // Context
067     cl::Context context(devices);
068
069     // Create GPU buffers
070     cl::Buffer inGPU(context, CL_MEM_READ_ONLY, in.width * in.height * sizeof(uchar));
071     cl::Buffer convKernelGPU(context, CL_MEM_READ_ONLY, convKernel.width * convKernel.height * sizeof(float));
072     cl::Buffer outGPU(context, CL_MEM_WRITE_ONLY, out.width * out.height * sizeof(uint));
073
074     // Command queue
075     cl::CommandQueue queue(context, devices[0], 0);
076
077     // Upload in.data to inGPU
078     queue.enqueueWriteBuffer(inGPU, false /* FIFO */, 0, in.width * in.height * sizeof(uchar), in.data);
079
080     // Upload convKernel.data to convKernelGPU
081     queue.enqueueWriteBuffer(convKernelGPU, true /* blocking, for correct timing */, 0, convKernel.width * convKernel.height * sizeof(float), convKernel.data);
082
083     // Program
084     cl::Program::Sources source(1, std::make_pair(kernelSource, strlen(kernelSource)));
085
086     cl::Program program(context, source);
087     program.build(devices);
088
089     // Ranges
090     size_t localWidth = 16;
091     size_t localHeight = 16;
092
093     cl::NDRange localRange(localWidth, localHeight);
094     cl::NDRange globalRange(((out.width-1)/localWidth+1) * localWidth, ((out.height-1)/localHeight+1) * localHeight);
095
096     // Run kernel
097     cl::Kernel kernel(program, "convolveKernel");
098     cl::KernelFunctor func = kernel.bind(queue, globalRange, localRange);
099
100     cl::Event event = func(inGPU, in.width, in.height, outGPU, out.width, out.height, convKernelGPU, convKernel.width, convKernel.height);
101     event.wait();
102
103     // Download result
104     uint *outTemp = new uint[out.width * out.height];
105     queue.enqueueReadBuffer(outGPU, true, 0, out.width * out.height * sizeof(uint), outTemp);
106
107     // Convert uint array to uchar array
108     for(size_t i = 0; i < out.width * out.height; ++i)
109     {
110         out.data[i] = (uchar) outTemp[i];
111     }
112
113     delete[] outTemp;
114     return out;
115 }
The OpenCL C++ bindings are included with #include <CL/cl.hpp>. To use the bindings' exceptions rather than plain C error codes, I need to #define __CL_ENABLE_EXCEPTIONS before including the header. All of the classes live in the cl:: namespace. To tell the linker what to link against, I add an -lOpenCL argument to the g++ parameter list.
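With exceptions enabled, any failing OpenCL call throws a cl::Error. A typical wrapper around the convolution call (my sketch, not part of the listing; image and convKernel stand for previously loaded data) looks like this:

    try
    {
        grayImage result = convolveGPU(image, convKernel);
    }
    catch(cl::Error &e)
    {
        /* e.what() names the failing OpenCL call, e.err() is the error code */
        std::cerr << "OpenCL error in " << e.what() << ": " << e.err() << std::endl;
    }

A build line then looks something like g++ convolve.cpp -lOpenCL (the file name is an assumption of mine).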
In contrast to CUDA, OpenCL doesn't generate platform-dependent code until run time. Because of this, an OpenCL program first has to discover the hardware that will run the parallel code. To allow this to happen, I fill a vector of cl::Platform and a vector of cl::Device (Listing 3, lines 56 and 61).
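A short sketch of this discovery step (the loop and the output are my own illustration, assuming the same headers as Listing 3) queries the names of everything it finds:

    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    for(size_t p = 0; p < platforms.size(); ++p)
    {
        std::cout << "Platform: " << platforms[p].getInfo<CL_PLATFORM_NAME>() << std::endl;
        std::vector<cl::Device> devices;
        platforms[p].getDevices(CL_DEVICE_TYPE_ALL, &devices);
        for(size_t d = 0; d < devices.size(); ++d)
        {
            std::cout << "  Device: " << devices[d].getInfo<CL_DEVICE_NAME>() << std::endl;
        }
    }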
Note that there are two different cl::Platform profiles: “full profile” and “embedded profile.” In this article, I will be looking at programming with the full profile.
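Which profile a platform implements can be queried at run time with CL_PLATFORM_PROFILE; for example:

    /* Returns "FULL_PROFILE" or "EMBEDDED_PROFILE" */
    std::string profile = platforms[0].getInfo<CL_PLATFORM_PROFILE>();
    std::cout << "Profile: " << profile << std::endl;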
Within each cl::Platform, multiple cl::Devices can exist. A cl::Device represents a device such as a GPU or a CPU. The assert() calls in lines 63 and 64 make sure that at least one device supports OpenCL and that the first device is a supported video card. To execute your program on the CPU instead, you have to take a CL_DEVICE_TYPE_CPU type device from the vector.
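A sketch of that variant of line 62, assuming the platforms vector from above:

    /* Ask the platform for CPU devices instead of GPUs */
    std::vector<cl::Device> cpuDevices;
    platforms[0].getDevices(CL_DEVICE_TYPE_CPU, &cpuDevices);
    assert(cpuDevices.size() > 0);
    assert(cpuDevices[0].getInfo<CL_DEVICE_TYPE>() == CL_DEVICE_TYPE_CPU);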
A cl::Context manages objects such as command queues, memory objects, kernels, and execution objects across multiple cl::Devices. I will look at these objects in more detail in the following sections. In this case, the cl::Context only manages the video card.
I still need to define a cl::CommandQueue, in which commands are queued for execution; in the normal case, they execute in first-in, first-out (FIFO) order.
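The third constructor argument in line 75 selects the queue properties; passing 0 keeps the default in-order mode. If the device supports it, you can request an out-of-order queue instead, in which case you have to synchronize explicitly with events (a sketch, reusing the context and device from Listing 3):

    /* In-order (FIFO) queue, as in line 75 */
    cl::CommandQueue fifoQueue(context, devices[0], 0);
    /* Out-of-order queue: commands may be reordered for better utilization */
    cl::CommandQueue oooQueue(context, devices[0], CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);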