OpenCL
The Framework for GPU Applications
When you create a regular C++ program for execution on the CPU, the data resides in RAM and in the CPU registers. If you want the GPU to process data, you must first feed the data to it: You copy the data you want to process across the bus from RAM to video memory. The results travel the same way in the opposite direction.
The video card runs a thread on each core. The threads all run the same kernel function (not to be confused with the convolution kernel described above), but each with a different index. All threads that are active at the same time execute exactly the same instruction at any given moment. The kernel is a piece of C source code that tells each thread what to do, depending on its index.
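To make the index idea concrete, here is a minimal kernel sketch of my own (hypothetical, not part of the convolution example; the name addOne is made up). Every thread runs the same source code, and get_global_id() hands each thread its own index:

/* Hypothetical example kernel: all threads execute this same code,
 * but get_global_id(0) gives each thread a different index. */
__kernel void addOne(__global float *data, uint n)
{
    size_t i = get_global_id(0); /* this thread's unique index */
    if (i < n)                   /* surplus threads simply return */
    {
        data[i] += 1.0f;
    }
}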
Depending on the scale of the problem, you need to define the number of threads (in this case, the number of pixels in the output image) and the data (in this case, the input image, the output image, and the convolution kernel) and then start the kernel.
Firing Up the GPU
Listing 3 shows the OpenCL code for the convolution example.
Listing 3: OpenCL Code
001 #define __CL_ENABLE_EXCEPTIONS
002 #include "convolve.hpp"
003 #include "timer.hpp"
004
005 #include "CL/cl.hpp"
006 #include <limits> // uchar max, min
007 #include <assert.h>
008 #include <iostream>
009
010 /**
011 * The OpenCL kernel for image convolution.
012 */
013 const char* kernelSource = "\
014 __kernel void convolveKernel(\
015 global uchar *in,\
016 uint inWidth,\
017 uint inHeight,\
018 global uint *out,\
019 uint outWidth,\
020 uint outHeight,\
021 global float *convKernel,\
022 uint convKernelWidth,\
023 uint convKernelHeight)\
024 {\
025 size_t x = get_global_id(0);\
026 size_t y = get_global_id(1);\
027 \
028 /* Kill unneeded threads */\
029 if(x >= outWidth || y >= outHeight)\
030 {\
031 return;\
032 }\
033 \
034 float convolutionSum = 0.0f;\
035 for(size_t ky = 0; ky < convKernelHeight; ++ky)\
036 {\
037 for(size_t kx = 0; kx < convKernelWidth; ++kx)\
038 {\
039 convolutionSum += (float) in[(y + ky) * inWidth + (x + kx)] * convKernel[ky * convKernelWidth + kx];\
040 }\
041 }\
042 out[y * outWidth + x] = (uint) clamp(convolutionSum, 0.0f, 255.0f);\
043 }";
044
045 /**
046 * Convolve a grayscale image with a convolution kernel on the GPU using OpenCL.
047 */
048 grayImage convolveGPU(grayImage in, convolutionKernel convKernel)
049 {
050 grayImage out;
051 out.width = in.width - (convKernel.width - 1);
052 out.height = in.height - (convKernel.height - 1);
053 out.data = new uchar[out.height * out.width];
054
055 // Platforms
056 std::vector< cl::Platform > platforms;
057 cl::Platform::get(&platforms);
058 assert(platforms.size() > 0);
059
060 // Devices
061 std::vector<cl::Device> devices;
062 platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
063 assert(devices.size() > 0);
064 assert(devices[0].getInfo<CL_DEVICE_TYPE>() == CL_DEVICE_TYPE_GPU);
065
066 // Context
067 cl::Context context(devices);
068
069 // Create GPU buffers
070 cl::Buffer inGPU(context, CL_MEM_READ_ONLY, in.width * in.height * sizeof(uchar));
071 cl::Buffer convKernelGPU(context, CL_MEM_READ_ONLY, convKernel.width * convKernel.height * sizeof(float));
072 cl::Buffer outGPU(context, CL_MEM_WRITE_ONLY, out.width * out.height * sizeof(uint));
073
074 // Commandqueue
075 cl::CommandQueue queue(context, devices[0], 0);
076
077 // Upload in.data to inGPU
078 queue.enqueueWriteBuffer(inGPU, false /* FIFO */, 0, in.width * in.height * sizeof(uchar), in.data);
079
080 // Upload kernel.data to convKernelGPU
081 queue.enqueueWriteBuffer(convKernelGPU, true /* blocking for correct timing */, 0, convKernel.width * convKernel.height * sizeof(float), convKernel.data);
082
083 // Program
084 cl::Program::Sources source(1, std::make_pair(kernelSource, strlen(kernelSource)));
085
086 cl::Program program(context, source);
087 program.build(devices);
088
089 // Ranges
090 size_t localWidth = 16;
091 size_t localHeight = 16;
092
093 cl::NDRange localRange(localWidth, localHeight);
094 cl::NDRange globalRange(((out.width-1)/localWidth+1) * localWidth, ((out.height-1)/localHeight+1) * localHeight);
095
096 // Run kernel
097 cl::Kernel kernel(program, "convolveKernel");
098 cl::KernelFunctor func = kernel.bind(queue, globalRange, localRange);
099
100 cl::Event event = func(inGPU, in.width, in.height, outGPU, out.width, out.height, convKernelGPU, convKernel.width, convKernel.height);
101 event.wait();
102
103 // Download result
104 uint *outTemp = new uint[out.width * out.height];
105 queue.enqueueReadBuffer(outGPU, true, 0, out.width * out.height * sizeof(uint), outTemp);
106
107 // Convert uint array to uchar array
108 for(size_t i = 0; i < out.width * out.height; ++i)
109 {
110 out.data[i] = (uchar) outTemp[i];
111 }
112
113 delete[] outTemp; // allocated with new[]
114 return out;
115 }

The OpenCL C++ bindings are included with #include <CL/cl.hpp>. To use the exceptions provided by the bindings rather than plain C error codes, I need to #define __CL_ENABLE_EXCEPTIONS. All of the classes are located in the cl:: namespace. To tell the linker what to link against, I add an -lOpenCL argument to the g++ parameter list.
In contrast to CUDA, OpenCL doesn't generate platform-dependent code until run time. Before it can do so, OpenCL first has to discover the hardware that will run the parallel code. To allow this to happen, I create a vector of cl::Platform and a vector of cl::Device objects (Listing 3, lines 56 and 61).
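As a standalone sketch (my own example, not part of Listing 3), the discovery step on its own looks like this; it prints the name of every platform and of every GPU device the platform exposes:

#define __CL_ENABLE_EXCEPTIONS
#include "CL/cl.hpp"
#include <iostream>
#include <vector>

int main()
{
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms); // discover all available platforms

    for (size_t p = 0; p < platforms.size(); ++p)
    {
        std::cout << platforms[p].getInfo<CL_PLATFORM_NAME>() << std::endl;

        std::vector<cl::Device> devices;
        try
        {
            platforms[p].getDevices(CL_DEVICE_TYPE_GPU, &devices);
        }
        catch (cl::Error&) { /* this platform exposes no GPU devices */ }

        for (size_t d = 0; d < devices.size(); ++d)
        {
            std::cout << "  " << devices[d].getInfo<CL_DEVICE_NAME>() << std::endl;
        }
    }
    return 0;
}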
Note that a cl::Platform can implement one of two different profiles: "full profile" and "embedded profile." In this article, I will be looking at programming with the full profile type.
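You can query which profile a platform implements. Continuing the discovery sketch above, the call returns the string "FULL_PROFILE" or "EMBEDDED_PROFILE":

// Query the profile of the first platform found
std::string profile = platforms[0].getInfo<CL_PLATFORM_PROFILE>();
std::cout << profile << std::endl;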
Within each cl::Platform, multiple cl::Device objects can exist. A cl::Device stands for a single device, such as a GPU or a CPU. The assert() calls in lines 63 and 64 make sure that at least one OpenCL device is present and that the first device is a supported video card. To execute your program on the CPU, you have to take the CL_DEVICE_TYPE_CPU type device from the vector.
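A sketch of that CPU variant (assuming the installed platform ships a CPU driver at all, which not every vendor's does):

// Request CPU devices instead of GPU devices
std::vector<cl::Device> cpuDevices;
platforms[0].getDevices(CL_DEVICE_TYPE_CPU, &cpuDevices);
assert(cpuDevices.size() > 0);
assert(cpuDevices[0].getInfo<CL_DEVICE_TYPE>() == CL_DEVICE_TYPE_CPU);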
A cl::Context manages objects such as command queues, memory objects, kernels, and execution objects across multiple cl::Device objects. I will look at these objects in more detail in the following sections. In this case, the cl::Context manages only the video card.
I still need to define a cl::CommandQueue. This is where action objects are queued and, in the normal case, executed in first-in, first-out (FIFO) order.
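The third constructor argument in line 75 takes queue properties; passing 0 requests the default in-order (FIFO) queue. As a sketch, an out-of-order queue, if the device supports it, would be requested like this:

// 0 = default in-order (FIFO) queue, as used in Listing 3
cl::CommandQueue inOrder(context, devices[0], 0);

// Out-of-order execution has to be requested explicitly
cl::CommandQueue outOfOrder(context, devices[0],
                            CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);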