
Gaussian Blur: The OpenCL Implementation

2012/05/02
Fixstars Corp.
Revision 1.1

Contents

1 Introduction
  1.1 Prerequisites
  1.2 Testing Machine Specifications
2 Gaussian Blur Algorithm
  2.1 4-Nested Loops
  2.2 Convolution Separable
3 About This Project
  3.1 Workspace Structure
  3.2 Usage
4 The OpenCL Implementation
  4.1 Scalar
  4.2 Scalar Fast
  4.3 SIMD
  4.4 SIMD Fast
5 Performance Measurements
  5.1 Native Code Performance
  5.2 OpenCL Code Performance
  5.3 Workgroup Size Considerations
  5.4 Benefits of Using Vector Data Types

Chapter 1 Introduction

This paper demonstrates an implementation of Gaussian Blur for monochrome (BW) or colored (RGB) images using the OpenCL framework.

DISCLAIMER: Images used in this paper are made available under the GNU LGPL license [7].
1.1 Prerequisites

• OpenCL SDK
  – Intel: http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk/
  – NVIDIA: http://developer.nvidia.com/opencl
• CMake: http://www.cmake.org/

1.2 Testing Machine Specifications

• Intel® Core™ i7 CPU X 990 @ 3.47GHz (Nehalem)
• GeForce GTX 570 (Fermi)
• Ubuntu 11.04 Natty 2.6.38-8-generic x86_64
• CMake 2.8.3
• g++ 4.4.5

Chapter 2 Gaussian Blur Algorithm

Gaussian Blur [1] is widely used in image processing, especially in graphics software, to reduce image noise and detail. The blurred image is obtained by applying a two-dimensional Gaussian Filter [2] with a predefined standard deviation (σ). Fig. 2.1 shows an example of applying a 5 × 5 filter (shown in Fig. 2.2) to a given image.

Figure 2.1: Gaussian Blur Example

Figure 2.2: Gaussian Filter

The Gaussian Filter can be produced using the following formula, where G(x, y) is the weight for the pixel at position (x, y), and σ is the standard deviation:

$$G(x, y) = \frac{1}{2\pi\sigma^2} \, e^{-\frac{x^2 + y^2}{2\sigma^2}}$$

There are two ways of applying Gaussian Blur to an image, depending on the type of kernel used. If the kernel is non-separable, a 4-nested-loop implementation is unavoidable. If the kernel is separable, we can use the separable convolution method [3], which is considerably faster.
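The formula above and the separability it implies can be illustrated with a minimal C++ sketch. The function names (gaussian2d, gaussian1d, make_filter2d, make_filter1d) are hypothetical and are not taken from the project sources. The sketch assumes an odd filter side s with the origin at the filter center, and it does not renormalize the weights over the finite window; whether the project does so is not stated here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// 2D Gaussian weight G(x, y) for an offset (x, y) from the filter center.
double gaussian2d(int x, int y, double sigma) {
    const double kPi = 3.14159265358979323846;
    const double two_sigma_sq = 2.0 * sigma * sigma;
    return std::exp(-(x * x + y * y) / two_sigma_sq) / (kPi * two_sigma_sq);
}

// 1D Gaussian weight g(x). By construction g(x) * g(y) == G(x, y),
// which is what makes the 2D filter separable.
double gaussian1d(int x, double sigma) {
    const double kPi = 3.14159265358979323846;
    return std::exp(-(x * x) / (2.0 * sigma * sigma)) /
           (std::sqrt(2.0 * kPi) * sigma);
}

// Builds an s x s filter in row-major order (assumption: s is odd and the
// origin sits at the center element).
std::vector<double> make_filter2d(int s, double sigma) {
    assert(s > 0 && s % 2 == 1);
    const int r = s / 2;
    std::vector<double> filter(static_cast<std::size_t>(s) * s);
    for (int i = 0; i < s; ++i)
        for (int j = 0; j < s; ++j)
            filter[i * s + j] = gaussian2d(j - r, i - r, sigma);
    return filter;
}

// Builds the matching 1D filter of length s.
std::vector<double> make_filter1d(int s, double sigma) {
    assert(s > 0 && s % 2 == 1);
    const int r = s / 2;
    std::vector<double> filter(static_cast<std::size_t>(s));
    for (int i = 0; i < s; ++i)
        filter[i] = gaussian1d(i - r, sigma);
    return filter;
}
```

For a 5 × 5 filter, every entry of make_filter2d(5, sigma) equals the product of two entries of make_filter1d(5, sigma) up to rounding, so the 2D filter never has to be stored when the separable method is used.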
We use the following symbols:

GB = Resulting pixel after applying the Gaussian Filter
G = Gaussian filter
I = Pixel intensity, 0 ≤ I ≤ 255
h = Image height
w = Image width
s² = Filter size, where s is the length of one kernel side
y = Vertical distance from the origin (row index)
x = Horizontal distance from the origin (column index)

2.1 4-Nested Loops

• Computational complexity: $O(w_{filter} \, h_{filter} \, w_{image} \, h_{image})$
• Equation:

$$GB[I_{(y,x)}] = \sum_{i=0}^{s} \sum_{j=0}^{s} I_{(y+i-\frac{s}{2},\; x+j-\frac{s}{2})} \, G_{(i,j)}$$

where $\frac{s}{2} < y < h$, $y \in \mathbb{Z}$, and $\frac{s}{2} < x < w$, $x \in \mathbb{Z}$.

2.2 Convolution Separable

• Computational complexity: $O(w_{filter} \, w_{image} \, h_{image}) + O(h_{filter} \, w_{image} \, h_{image})$
• Equation:

$$GB_{row}[I_{(y,x)}] = \sum_{j=0}^{s} I_{(y-\frac{s}{2},\; x+j-\frac{s}{2})} \, G^{T}_{(j)}$$

where $\frac{s}{2} < x < w$, $x \in \mathbb{Z}$, and

$$GB_{column}[I_{(y,x)}] = \sum_{i=0}^{s} I_{(y+i-\frac{s}{2},\; x-\frac{s}{2})} \, G_{(i)}$$

where $\frac{s}{2} < y < h$, $y \in \mathbb{Z}$.

Chapter 3 About This Project

We demonstrate two kinds of implementations in this paper. The first implements a scalar version of the kernel, and the second implements a Single Instruction Multiple Data (SIMD) version. The SIMD implementation uses vector data types together with the built-in OpenCL vectorizer, which keeps the data elements inside the vector registers of the AVX architecture [4]. The Intel SSE architecture is equipped with vector registers that can store a group of data elements; for instance, float8 or int8. Some microarchitectures support vector registers up to 256 bits wide. With these vector registers, a considerable speedup can be achieved because the packed vector data elements are processed simultaneously [5].

NVIDIA's GPUs are optimized in a different way to fetch these kinds of data types. NVIDIA's most recent GPGPU architecture, Fermi, is equipped with 16 very fast load/store units [6] on each multiprocessor, which can commit 64-bit to 128-bit loads and stores in a single transaction. This reduces overall memory latency and increases effective throughput.
The hardware can process 256-byte and 512-byte transaction sizes per warp; hence, a suitably aligned float4 load/store request for a warp can be serviced in a single transaction, and a float8 load/store request in two transactions.

In short, Intel's CPUs can optimize the Gaussian Blur algorithm by vectorizing it onto the vector registers, and NVIDIA's GPUs can optimize the algorithm by performing loads and stores through the 16 load/store units.

3.1 Workspace Structure

All utilities and image files are placed in the ../../../common/ folder.

    .:
    cl  original

    ./cl:
    doc  LICENSE  GaussianFilter.cl  GaussianFilter_gold.cpp  TestRun  README
    GaussianFilter.cpp  CMakeLists.txt

    ./cl/doc:
    gaussian.pdf

    ./cl/TestRun:
    clean  test_run.py

    ./original:
    doc  LICENSE  GaussianFilter.cpp  CMakeLists.txt  TestRun  README
    GaussianFilter_gold.cpp

    ./original/doc:
    gaussian

    ./original/TestRun:
    clean

3.2 Usage

    $ ./GaussianFilter -h
    ./GaussianFilter [--verbose|-v] [--help|-h] [--output|-o FILENAME]
                     [--kernel|-k NUMBER] [--workitems|-w NUMBER] [--use-gpu|-g]
                     [--choose-dev] [--dev-info] [--prep-time] [--comp-result]
                     FILENAME [FILENAME2 ...]
    * Options *
     --verbose            Be verbose
     --help               Print this message
     --output=NAME        Write results to this file
     --kernel=KERNEL      Kernel mode (0, 1, 2, 3, [4, 5]) -- default = 0
                            [0] Scalar
                            [1] SIMD = Single Instruction Multiple Data
                            [2] Scalar Fast (Using convolution separable matrix)
                            [3] SIMD Fast (Using convolution separable matrix)
                            --debugging purpose--
                            [4] STSD = Single Thread Single Data
                            [5] STAD = Single Thread All Data
     --workitems=NUMBER   Number of (local) workitems for Scalar mode
     --use-gpu            Use GPU as the CL device
     --choose-dev         Choose which OpenCL device to use
     --dev-info           Show Device Info
     --prep-time          Show initialization, memory preparation and copyback time
     --comp-result        Compare native and OpenCL results

    If the output filename is not specified or the input images are more than 1,
    the output filename by default will be set to [filname]_out.[pgm|ppm]

    * Examples *
    ./GaussianFilter [OPTS...] -v -w 256 test_data.pgm test_data2.pgm
    ./GaussianFilter [OPTS...] --output=test_output.ppm test_data.ppm

Chapter 4 The OpenCL Implementation

The implementation of Gaussian Blur with OpenCL is divided into four kinds of algorithms: Scalar, SIMD, Scalar Fast, and SIMD Fast. The Scalar mode runs the Gaussian Blur with a two-dimensional circular kernel in parallel, calculating one pixel per workitem (hence the number of workitems equals $w_{image} \times h_{image}$). The SIMD mode calculates one image row per workitem. This kind of implementation is a disadvantage on a GPU, where executing fewer instructions with more workitems (or threads) is better. This does not apply to CPUs, however: because of the fast CPU clock rate, it is often far more suitable to run more instructions with fewer threads.

4.1 Scalar

The Gaussian Scalar OpenCL kernel, gaussian_scalar() in gaussian.cl, evaluates the equation $GB[I_{(y,x)}]$ for each pixel at the coordinates y = tid / width and x = tid % width, where tid = get_global_id(0).
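The index mapping described above can be emulated on the host. The following C++ sketch is not the project's OpenCL kernel: scalar_work_item and blur_scalar are hypothetical names, the loop over tid stands in for launching one workitem per pixel, and the filter indices are assumed to run from 0 to s − 1 with s/2 as integer division. Copying border pixels unchanged is also an assumption, since border handling is not specified here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates the work of one workitem: map the flat id to (y, x), then
// accumulate the s x s filter window around that pixel.
// Assumption: pixels whose window would leave the image are copied as-is.
float scalar_work_item(std::size_t tid, const std::vector<float>& image,
                       int width, int height,
                       const std::vector<float>& filter, int s) {
    const int y = static_cast<int>(tid / static_cast<std::size_t>(width));
    const int x = static_cast<int>(tid % static_cast<std::size_t>(width));
    const int r = s / 2;
    if (y < r || y >= height - r || x < r || x >= width - r)
        return image[tid];

    float sum = 0.0f;
    for (int i = 0; i < s; ++i)          // filter rows
        for (int j = 0; j < s; ++j)      // filter columns
            sum += image[(y + i - r) * width + (x + j - r)] * filter[i * s + j];
    return sum;
}

// Serial stand-in for a launch of width * height workitems.
std::vector<float> blur_scalar(const std::vector<float>& image,
                               int width, int height,
                               const std::vector<float>& filter, int s) {
    assert(image.size() == static_cast<std::size_t>(width) * height);
    assert(filter.size() == static_cast<std::size_t>(s) * s);
    std::vector<float> out(image.size());
    for (std::size_t tid = 0; tid < out.size(); ++tid)
        out[tid] = scalar_work_item(tid, image, width, height, filter, s);
    return out;
}
```

Together with the two image loops hidden in the tid range, the two filter loops form the 4-nested-loop structure of Section 2.1; in the OpenCL version the outer two loops are replaced by the parallel workitems.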
4.2 Scalar Fast

The gaussian_scalar_fast() kernel computes $GB_{row}[I_{(y,x)}]$ and is followed by the gaussian_scalar_fast_column() kernel, which computes $GB_{column}[I_{(y,x)}]$ for each pixel. Note that the column step can only start once the results of all row calculations have been synchronized, which reduces the overall performance.

4.3 SIMD

The gaussian_simd() kernel also evaluates $GB[I_{(y,x)}]$ for each pixel, but this time using the vector data type registers. One workitem, which is responsible for several pixels, packs the data elements into a vector register (the AVX technology of the SSE architecture), which may be 128 to 256 bits wide. These elements are then calculated concurrently, which greatly reduces the calculation time.

4.4 SIMD Fast

The gaussian_simd_fast() kernel, like the Scalar Fast kernel, computes $GB_{row}[I_{(y,x)}]$, followed by gaussian_simd_fast_column() for $GB_{column}[I_{(y,x)}]$, using the vectorizer technique. However, besides the bottleneck of synchronizing results after the row-wise calculations, the SIMD Fast kernel has another obstacle: the column-wise calculation cannot be performed without trading off something, such as memory.

Chapter 5 Performance Measurements

The performance of these four algorithms is measured by running them on a 512 × 512 monochrome image (../common/baboon.pgm). The two-dimensional Gaussian Filter is a 5 × 5 filter with σ = 2, and is circularly symmetric and convolution separable.
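Because the filter is convolution separable, the row-then-column structure used by the Fast kernels applies to it: for a 5 × 5 filter, each pixel needs 5 + 5 = 10 multiply-adds instead of 25. The C++ sketch below illustrates the two passes on the host under stated assumptions. The names row_pass, column_pass, and blur_separable are hypothetical, filter indices run from 0 to s − 1, and pixels too close to the border are copied unchanged. It is not the project's kernel code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Image = std::vector<float>;

// Row pass: apply the 1D filter g along x for every row.
// Assumption: pixels too close to the left/right edge are copied unchanged.
Image row_pass(const Image& in, int width, int height,
               const std::vector<float>& g) {
    const int s = static_cast<int>(g.size());
    const int r = s / 2;
    Image out(in);
    for (int y = 0; y < height; ++y)
        for (int x = r; x < width - r; ++x) {
            float sum = 0.0f;
            for (int j = 0; j < s; ++j)
                sum += in[y * width + (x + j - r)] * g[j];
            out[y * width + x] = sum;
        }
    return out;
}

// Column pass: apply the same 1D filter along y for every column.
// Assumption: pixels too close to the top/bottom edge are copied unchanged.
Image column_pass(const Image& in, int width, int height,
                  const std::vector<float>& g) {
    const int s = static_cast<int>(g.size());
    const int r = s / 2;
    Image out(in);
    for (int y = r; y < height - r; ++y)
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int i = 0; i < s; ++i)
                sum += in[(y + i - r) * width + x] * g[i];
            out[y * width + x] = sum;
        }
    return out;
}

// The column pass reads row-pass results from neighboring rows, so the whole
// intermediate image must be complete before it starts. This is the
// synchronization point between the two kernels.
Image blur_separable(const Image& in, int width, int height,
                     const std::vector<float>& g) {
    assert(in.size() == static_cast<std::size_t>(width) * height);
    const Image tmp = row_pass(in, width, height, g);
    return column_pass(tmp, width, height, g);
}
```

For interior pixels, the result matches a single 2D pass whose filter entry (i, j) is g[i] * g[j]. The intermediate image tmp is the extra memory that the two-pass approach trades for the lower operation count.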
The symbols used in the tables are explained below:

• KERNEL: The type of kernel (Scalar, Scalar Fast, SIMD, or SIMD Fast)
• DEVICE: The device used to run the code (CPU or GPU)
• WORKITEMS: Number of workitems inside a workgroup (local workitems)
• T_KERNEL[ms]: Time needed to run the kernel, in milliseconds
• SPEEDUP: Speedup ratio compared to the native code
  → Scalar and SIMD are compared to c99_scalar, and Scalar Fast and SIMD Fast are compared to c99_fast

5.1 Native Code Performance

TABLE 1
Speed Comparison of Sequential (Native) Code Performance

KERNEL        T_KERNEL[ms]
c99_scalar    27.303
c99_fast      13.430

5.2 OpenCL Code Performance

TABLE 2
Speed Comparison of OpenCL Code Performance

KERNEL                 DEVICE   T_KERNEL[ms]   SPEEDUP
gaussian_scalar        CPU      9.333          2.93
gaussian_scalar_fast   CPU      6.449          2.08
gaussian_simd          CPU      1.497          18.24
gaussian_simd_fast     CPU      2.413          5.57
gaussian_scalar        GPU      0.828          32.97
gaussian_scalar_fast   GPU      0.706          19.02
gaussian_simd          GPU      2.313          11.80
gaussian_simd_fast     GPU      2.146          6.26

5.3 Workgroup Size Considerations

OpenCL allows a workgroup size in the range of 1 to 1024 workitems. Table 3 below shows that the fastest times on both CPU and GPU were measured with 512 workitems per workgroup, although on the GPU the results for 128, 256, and 512 workitems differ by less than 2%. Meanwhile, using only 1 workitem per workgroup on the GPU slows the kernel down by a factor of about 23 (19.019 ms versus 0.816 ms).
TABLE 3
OpenCL Workgroup Size Comparison Table

KERNEL            DEVICE   WORKITEMS   T_KERNEL[ms]   SPEEDUP
gaussian_scalar   GPU      512         0.816          33.46
gaussian_scalar   GPU      128         0.824          33.13
gaussian_scalar   GPU      256         0.828          32.97
gaussian_scalar   CPU      512         6.908          3.95
gaussian_scalar   CPU      128         9.117          2.99
gaussian_scalar   CPU      256         9.333          2.93
gaussian_scalar   CPU      1           10.049         2.72
gaussian_scalar   GPU      1           19.019         1.44

5.4 Benefits of Using Vector Data Types

Explicitly using vector data types, such as float8, yields optimizations by removing unnecessary branches, saving memory bandwidth and cache usage, and compressing instruction stacks. As Table 4 below shows, this technique gives roughly a 6x speedup over the scalar version on the CPU (9.333 ms / 1.497 ms ≈ 6.2).

TABLE 4
Using Vector Data Type Registers in SSE Architecture

KERNEL            DEVICE   T_KERNEL[ms]   SPEEDUP
gaussian_simd     CPU      1.497          18.24
gaussian_scalar   CPU      9.333          2.93

References

[1] "Gaussian Blur," from Wikipedia, the free Encyclopedia. 2012. http://en.wikipedia.org/wiki/Gaussian_blur
[2] "Gaussian Filter," from Wikipedia, the free Encyclopedia. 2012. http://en.wikipedia.org/wiki/Gaussian_filter
[3] V. Podlozhnyuk, "Image Convolution with CUDA," NVIDIA Corporation, June, 2007.
[4] C. Lomont, "Introduction to Intel™ Advanced Vector Extensions," Intel Corporation, May, 2011.
[5] Intel Corporation, "Writing Optimal OpenCL™ Code with Intel® OpenCL SDK," Document Number: 325696-001US, Revision 1.3, Intel Corporation, 2011.
[6] P. Glaskowsky, "NVIDIA's Fermi: The First Complete GPU Computing Architecture," NVIDIA Corporation, September, 2009.
[7] "GNU LGPL Version 3," http://www.gnu.org/licenses/lgpl.html
