Gaussian Blur:
The OpenCL Implementation




                  2012/05/02
               Fixstars Corp.




                 Revision 1.1
Contents

1 Introduction
  1.1 Prerequisites
  1.2 Testing Machine Specifications

2 Gaussian Blur Algorithm
  2.1 4-Nested Loops
  2.2 Convolution Separable

3 About This Project
  3.1 Workspace Structure
  3.2 Usage

4 The OpenCL Implementation
  4.1 Scalar
  4.2 Scalar Fast
  4.3 SIMD
  4.4 SIMD Fast

5 Performance Measurements
  5.1 Native Code Performance
  5.2 OpenCL Code Performance
  5.3 Workgroup Size Considerations
  5.4 Benefits of Using Vector Data Types




Chapter 1

Introduction

This paper demonstrates an implementation of Gaussian Blur for monochrome (BW) or
colored (RGB) images using the OpenCL framework.
DISCLAIMER: The images used in this paper are made available under the GNU LGPL
license [7].


1.1     Prerequisites
   • OpenCL SDK

       – Intel: http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk/
       – NVIDIA: http://developer.nvidia.com/opencl

   • CMake: http://www.cmake.org/


1.2     Testing Machine Specifications
   • Intel® Core™ i7 CPU X 990 @ 3.47GHz (Nehalem)

   • GeForce GTX 570 (Fermi)

   • Ubuntu 11.04 Natty 2.6.38-8-generic x86_64

   • CMake 2.8.3

   • g++ 4.4.5




Chapter 2

Gaussian Blur Algorithm

Gaussian Blur [1] is widely used for image processing, especially in graphics manipulating
software, to reduce image noise and/or detail. The resulting blurred image is retrieved
by applying the two dimensional Gaussian Filter [2] with a predefined standard deviation
(σ). Fig. 2.1 shows an example of applying a 5 × 5 filter (shown in Fig. 2.2) to a given
image.




                          Figure 2.1: Gaussian Blur Example




                               Figure 2.2: Gaussian Filter

The Gaussian Filter can be produced using the following formula, where G(x, y) is the
filter weight for the pixel at position (x, y), and σ is the standard deviation:

                              G(x, y) = (1 / (2πσ²)) · e^(−(x² + y²) / (2σ²))
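As an illustration, the following Python sketch builds an s × s filter from G(x, y), with x and y measured from the kernel center (the project's actual code is C++/OpenCL; this snippet is only illustrative). Normalizing the weights so they sum to 1, as is usual for blur filters, is an assumption not stated by the formula itself:

```python
import math

def gaussian_filter(s, sigma):
    """Build an s x s Gaussian filter from
    G(x, y) = 1/(2*pi*sigma^2) * exp(-(x^2 + y^2) / (2*sigma^2)),
    with (x, y) measured from the kernel center, then normalize
    the weights so they sum to 1."""
    c = s // 2  # center index
    g = [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2.0 * sigma ** 2))
          / (2.0 * math.pi * sigma ** 2)
          for x in range(s)]
         for y in range(s)]
    total = sum(sum(row) for row in g)
    return [[v / total for v in row] for row in g]

# the 5 x 5, sigma = 2 filter used throughout this paper
flt = gaussian_filter(5, 2.0)
```

The largest weight sits at the center and the filter is circularly symmetric, which is what makes the separable optimization of Section 2.2 applicable.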
   There are two ways of applying Gaussian Blur to an image, depending on the type of
kernel used. If the kernel is non-separable, a 4-nested-loops implementation is inevitable.
If the kernel is separable, we can use the convolution separable [3] method, which is
considerably faster.




We define the following symbols:
GB = resulting pixel after applying the Gaussian Filter
G = Gaussian filter
I = pixel intensity, 0 ≤ I ≤ 255
h = image height
w = image width
s² = filter size, where s is the length of one kernel side
y = distance from the origin along the vertical axis (row)
x = distance from the origin along the horizontal axis (column)


2.1     4-Nested Loops
   • Computational complexity: O(w_filter · h_filter · w_image · h_image)

   • Equation:

                    GB[I(y,x)] = Σ_{i=0}^{s} Σ_{j=0}^{s} I(y+i−s/2, x+j−s/2) · G(i,j)

     where,

                    s/2 < y < h, y ∈ Z
                    s/2 < x < w, x ∈ Z
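The equation above translates directly into four nested loops. A minimal Python sketch, under the assumption that border pixels within s/2 of an edge are simply left unchanged (one common way to satisfy the s/2 < y < h, s/2 < x < w constraint):

```python
def gaussian_blur_4loops(img, g):
    """Direct 2-D convolution: for each interior output pixel, accumulate
    I(y+i-s/2, x+j-s/2) * G(i, j) over the s x s filter g.
    Border pixels are copied through unchanged."""
    h, w = len(img), len(img[0])
    s = len(g)
    half = s // 2
    out = [row[:] for row in img]  # borders keep their input values
    for y in range(half, h - half):
        for x in range(half, w - half):
            acc = 0.0
            for i in range(s):
                for j in range(s):
                    acc += img[y + i - half][x + j - half] * g[i][j]
            out[y][x] = acc
    return out
```

The four loops make the w_filter · h_filter · w_image · h_image complexity visible: the two outer loops walk the image, the two inner loops walk the filter.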

2.2     Convolution Separable
   • Computational complexity: O(w_filter · w_image · h_image) + O(h_filter · w_image · h_image)

   • Equation:

                    GB_row[I(y,x)] = Σ_{j=0}^{s} I(y−s/2, x+j−s/2) · Gᵀ(j)

     where,

                    s/2 < x < w, x ∈ Z

     and,

                    GB_column[I(y,x)] = Σ_{i=0}^{s} I(y+i−s/2, x−s/2) · G(i)

     where,

                    s/2 < y < h, y ∈ Z
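A Python sketch of the two-pass scheme follows. Note one assumption: the equations above write the row pass with a fixed y − s/2 offset, whereas a practical two-pass implementation filters each row in place (reading I(y, x+j−s/2)) and then each column of the row-pass result, which is what this sketch does:

```python
def gaussian_blur_separable(img, g1d):
    """Two-pass separable convolution with a 1-D kernel g1d:
    a row pass, then a column pass over the row-pass result.
    This replaces the O(s^2) inner loop of the direct version
    with two O(s) loops. Interior pixels only; borders unchanged."""
    h, w = len(img), len(img[0])
    s = len(g1d)
    half = s // 2
    # Row pass: GB_row[I(y,x)] = sum_j I(y, x+j-s/2) * G(j)
    tmp = [row[:] for row in img]
    for y in range(h):
        for x in range(half, w - half):
            tmp[y][x] = sum(img[y][x + j - half] * g1d[j] for j in range(s))
    # Column pass: GB_col[I(y,x)] = sum_i GB_row[I(y+i-s/2, x)] * G(i)
    out = [row[:] for row in tmp]
    for y in range(half, h - half):
        for x in range(half, w - half):
            out[y][x] = sum(tmp[y + i - half][x] * g1d[i] for i in range(s))
    return out
```

Because the Gaussian is separable, an interior pixel of this result equals the direct 2-D convolution with the outer product g1d ⊗ g1d. The column pass must not start until the row pass has finished, which is exactly the synchronization cost discussed for the Fast kernels in Chapter 4.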

Chapter 3

About This Project

We demonstrate two kinds of implementations in this paper. The first algorithm imple-
ments a scalar version of the kernel, and the second implements a Single Instruction
Multiple Data (SIMD) version. The SIMD implementation uses vector data types through
the built-in OpenCL vectorizer, which preserves the data elements inside the AVX
architecture's vector registers [4].
    The Intel SSE architecture is equipped with vector registers that can store a group of
data elements, for instance a float8 or int8. Some microarchitectures support vector
registers up to 256 bits wide. With these vector registers, a considerable speedup can be
achieved because the packed vector data elements are calculated simultaneously [5].
    NVIDIA's GPUs are optimized in a different way to fetch these kinds of data types.
NVIDIA's most recent GPGPU architecture, Fermi, equips each multiprocessor with 16
fast load/store units [6] that can commit 64-bit to 128-bit loads and stores in a single
transaction, which reduces overall memory latency and increases effective throughput.
The hardware can process 256- and 512-byte transaction sizes per warp; hence, a suitably
aligned float4 load/store request for a warp can be serviced in a single transaction, and
a float8 load/store request in two transactions.
    In short, Intel's CPUs can optimize the Gaussian Blur algorithm through vectorization
into the vector data registers, and NVIDIA's GPUs can optimize the algorithm by
conducting loads and stores through the 16 load/store units.






3.1       Workspace Structure
All utilities and image files are placed in the ../../../common/ folder:

.:
cl   original

./cl:
doc       LICENSE   GaussianFilter.cl    GaussianFilter_gold.cpp
TestRun   README    GaussianFilter.cpp   CMakeLists.txt

./cl/doc:
gaussian.pdf

./cl/TestRun:
clean test_run.py

./original:
doc      LICENSE    GaussianFilter.cpp        CMakeLists.txt
TestRun README      GaussianFilter_gold.cpp

./original/doc:
gaussian

./original/TestRun:
clean




3.2     Usage
$ ./GaussianFilter -h
./GaussianFilter [--verbose|-v] [--help|-h] [--output|-o FILENAME]
     [--kernel|-k NUMBER] [--workitems|-w NUMBER]
     [--use-gpu|-g] [--choose-dev] [--dev-info] [--prep-time] [--comp-result]
     FILENAME [FILENAME2 ...]

* Options *
 --verbose                  Be verbose
 --help                     Print this message
 --output=NAME              Write results to this file
 --kernel=KERNEL            Kernel mode (0, 1, 2, 3, [4, 5]) -- default = 0
                                       [0] Scalar
                                       [1] SIMD = Single Instruction Multiple Data
                                       [2] Scalar Fast (Using convolution separable matrix)
                                       [3] SIMD Fast (Using convolution separable matrix)
                                       --debugging purpose--
                                       [4] STSD = Single Thread Single Data
                                       [5] STAD = Single Thread All Data
--workitems=NUMBER          Number of (local) workitems for Scalar mode
--use-gpu                   Use GPU as the CL device
--choose-dev                Choose which OpenCL device to use
--dev-info                  Show Device Info
--prep-time                 Show initialization, memory preparation and copyback time
--comp-result               Compare native and OpenCL results

If the output filename is not specified, or more than one input image is given,
the output filename defaults to [filename]_out.[pgm|ppm]

 * Examples *
./GaussianFilter [OPTS...] -v -w 256 test_data.pgm test_data2.pgm
./GaussianFilter [OPTS...] --output=test_output.ppm test_data.ppm








Chapter 4

The OpenCL Implementation

The implementation of Gaussian Blur with OpenCL is divided into four kinds of algo-
rithms: Scalar, SIMD, Scalar Fast, and SIMD Fast. The Scalar mode runs the Gaussian
Blur with the two-dimensional circular kernel in parallel by calculating each pixel in one
workitem (hence the number of workitems equals w_image · h_image). The SIMD mode
is executed by calculating one image row per workitem. This kind of implementation
only brings disadvantages for a GPU, since executing fewer instructions with more
workitems (or threads) is better there. However, this does not apply to CPUs: it is often
more suitable to run more instructions with fewer threads because of the fast CPU clock
rate.

4.1     Scalar
Gaussian Scalar OpenCL kernel, gaussian scalar() in gaussian.cl, performs the equa-
tion GB[I(y,x) ] for each pixel, in the coordinate of y = tid/width and x = tid%width,
where tid = get global id(0).
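The flat-index arithmetic can be sketched in Python (the real kernel is OpenCL C, where `/` and `%` are integer operations on the global id):

```python
def tid_to_coord(tid, width):
    """Map a flat OpenCL global id to image coordinates, as in
    gaussian_scalar(): y = tid / width (row), x = tid % width (column)."""
    return tid // width, tid % width
```

With a global work size of w_image · h_image, every pixel of the image maps to exactly one workitem.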


4.2     Scalar Fast
gaussian scalar fast() kernel performs one GBrow [I(y,x) ], which is then followed by
gaussian scalar fast column() kernel that calculates GBcolumn [Iy,x ] for each pixel.
Note that in order to do the column step, we have to synchronize the elements after
row calculations, which reduces the overall performance.


4.3     SIMD
The gaussian_simd() kernel also executes GB[I(y,x)] for each pixel, but this time the
vector data type registers play the role. One workitem, covering several pixels to be
calculated, packs the data elements into a vector register of the SSE architecture's AVX
extension, which may range from 128 to 256 bits in size. These elements are then
calculated concurrently, which greatly reduces the calculation time.


4.4     SIMD Fast
gaussian simd fast() kernel, like the Scalar Fast kernel, performs GBrow [I(y,x) ] and
gaussian simd fast column() for GBcolumn [Iy,x ] in the same order, using the vectorizer
technique. However, aside of the bottleneck of synchronizing results after the row-wise
calculations, SIMD Fast kernel also has another obstacle, which is the impossibility of
conducting the column-wise calculation without having to trade-off something, such as
memory.








Chapter 5

Performance Measurements

The performance of these four algorithms is measured by running them with a monochrome
image (../common/baboon.pgm) of size 512×512 as input. The 2-dimensional Gaussian
Filter is a 5 × 5 filter with σ = 2, and is circularly symmetric and convolution separable.

The table symbols are explained as follows:

   • KERNEL: The type of kernel (Scalar, Scalar Fast, SIMD, or SIMD Fast)

   • DEVICE: The device used to run the code (CPU or GPU)

   • WORKITEMS: Number of workitems inside a workgroup (local workitem)

   • T KERNEL[ms]: Time needed to run kernel, in milliseconds

   • SPEEDUP: Speedup proportion compared to the native code

       → Scalar and SIMD are compared to c99_scalar; Scalar Fast and SIMD Fast
         are compared to c99_fast




5.1   Native Code Performance
                                TABLE 1
        Speed Comparison of Sequential (Native) Code Performance

                              KERNEL        T KERNEL[ms]
                              c99_scalar       27.303
                              c99_fast         13.430




5.2   OpenCL Code Performance

                               TABLE 2
             Speed Comparison of OpenCL Code Performance

                KERNEL            DEVICE     T KERNEL[ms]   SPEEDUP
              gaussian_scalar          CPU          9.333        2.93
              gaussian_scalar_fast     CPU          6.449        2.08
              gaussian_simd            CPU          1.497       18.24
              gaussian_simd_fast       CPU          2.413        5.57
              gaussian_scalar          GPU          0.828       32.97
              gaussian_scalar_fast     GPU          0.706       19.02
              gaussian_simd            GPU          2.313       11.80
              gaussian_simd_fast       GPU          2.146        6.26






5.3     Workgroup Size Considerations
OpenCL enables the size of a workgroup within the range of 1 to 1024 workitems; however,
looking at Table 3 below, the best performance is achieved with 256 workitems per
workgroup, and this applies for both CPU and GPU. Meanwhile, giving only 1 workitem
per workgroup for GPU worsens the performance by a ratio of nearly 30 times.
                                    TABLE 3
                       OpenCL Workgroup Size Comparison Table

              KERNEL             DEVICE    WORKITEMS    T KERNEL[ms]    SPEEDUP
            gaussian_scalar       GPU         512          0.816      33.45955882
            gaussian_scalar       GPU         128          0.824      33.13470874
            gaussian_scalar       GPU         256          0.828      32.97463768
            gaussian_scalar       CPU         512          6.908      3.952374059
            gaussian_scalar       CPU         128          9.117       2.99473511
            gaussian_scalar       CPU         256          9.333      2.925425908
            gaussian_scalar       CPU           1         10.049      2.716986765
            gaussian_scalar       GPU           1         19.019      1.435564436




5.4     Benefits of Using Vector Data Types
Utilizing vector data types, such as float8, explicitly, gives us optimizations through
removing unnecessary branches, saving memory bandwidth and cache usage, and also
compressing instruction stacks. By looking at Table 4 below, this technique gives us a 6x
speedup from the scalar version.
                                     TABLE 4
                Using Vector Data Type Registers in SSE Architecture

                      KERNEL          DEVICE       T KERNEL[ms]     SPEEDUP
                    gaussian simd      CPU             1.497            18.24
                    gaussian scalar    CPU             9.333            2.93




References

[1] “Gaussian Blur,” Wikipedia, the free encyclopedia, 2012.
    http://en.wikipedia.org/wiki/Gaussian_blur

[2] “Gaussian Filter,” Wikipedia, the free encyclopedia, 2012.
    http://en.wikipedia.org/wiki/Gaussian_filter

[3] V. Podlozhnyuk, “Image Convolution with CUDA,” NVIDIA Corporation, June 2007.

[4] C. Lomont, “Introduction to Intel® Advanced Vector Extensions,” Intel Corporation,
    May 2011.

[5] Intel Corporation, “Writing Optimal OpenCL™ Code with Intel® OpenCL SDK,”
    Document Number: 325696-001US, Revision 1.3, 2011.

[6] P. Glaskowsky, “NVIDIA’s Fermi: The First Complete GPU Computing Architecture,”
    NVIDIA Corporation, September 2009.

[7] “GNU LGPL Version 3,” http://www.gnu.org/licenses/lgpl.html





				