GPGPU Programming

Dominik Göddeke
Overview



• Choices in GPGPU programming

• Illustrated CPU vs. GPU step by step example

• GPU kernels in detail




Choices in GPU Programming

[Figure: layered software stack for GPU programming]
• Application: e.g. in C/C++, Java, Fortran, Perl
• Window manager: e.g. GLUT, Qt, Win32, Motif
• Operating system: e.g. Windows, Unix, Linux, MacOS
• Graphics hardware: e.g. Radeon (ATI), GeForce (NV)
• Graphics API: e.g. OpenGL, DirectX
• Shader programs: e.g. in HLSL, GLSL, Cg
• Metaprogramming language (e.g. BrookGPU, Sh) OR self-written libGPU:
  hides the graphics details
Bottom lines

• This is not as difficult as it seems
  –   Similar choices to be made in all software projects
  –   Some options are mutually exclusive
  –   Some can be used without in-depth knowledge
  –   No direct access to the hardware, the driver does all the
      tedious thread-management anyway

• Advantages and disadvantages
  – Steeper learning curve vs. higher flexibility
  – Focus on algorithm, not on (unnecessary) graphics
  – Portable code vs. platform- and hardware-specific code


Libraries and Abstractions

• Some coding is required
  – no library available that you just link against
  – tremendously hard to massively parallelize existing
    complex code automatically

• Good news
  – much functionality can be added to applications in a
    minimally invasive way, no rewrite from scratch

• First libraries under development
  – Accelerator (Microsoft): linear algebra, BLAS-like
  – Glift (Lefohn et al.): abstract data structures, e.g. trees

Overview



• Choices in GPGPU programming

• Illustrated CPU vs. GPU step by step example

• GPU kernels in detail




Native Data Layout

• CPU: 1D array

• GPU: 2D array
  – Indices are floats, addressing array element centers (GL) or
    top-left corners (D3D). This will be important later
    (see the mapping sketch below).
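
A minimal sketch of the index mapping, assuming row-major layout and
GL-style addressing (width and i are hypothetical values):

          int   width = 256;                    // assumed array width
          int   i     = 1234;                   // 1D index on the CPU
          // GL texel centers sit at half-integer coordinates
          float x = (float)(i % width) + 0.5f;  // 2D index on the GPU
          float y = (float)(i / width) + 0.5f;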




Example Problem

• saxpy (from BLAS)
  – given two vectors x and y of size N and a scalar a
  – compute scaled vector-vector addition y = y + a*x
• CPU implementation
  – store each vector in one array, loop over all elements
           for (i=0; i<N; i++)
             y[i] = y[i] + a*x[i];

• Identify computation inside loop as kernel
  – no logic in this basic kernel, pure computation
  – logic and computation fully separated
Understanding GPU Limitations

• No simultaneous reads and writes into the same
  memory
  – Without read-modify-write buffers, no logic is needed to
    handle read-before-write hazards
  – Not a missing feature, but essential hardware design for
    good performance and throughput
  – saxpy: introduce an additional array: ynew = yold + a*x
    (see the sketch below)

• Coherent memory access
  – For a given output element, read in from the same index in
    the two input arrays
  – Trivially achieved in this basic example
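
A minimal CPU-side analogue of this GPU-friendly formulation, using
the names ynew and yold from above:

           // the output array is distinct from both inputs, so no
           // element is read and written in the same operation
           for (i=0; i<N; i++)
             ynew[i] = yold[i] + a*x[i];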

Performing Computations

• Load a kernel program
  – Detailed examples later on

• Specify the output and input arrays
  – Pseudocode (expanded in a sketch below):
          setInputArrays(yold, x);
          setOutputArray(ynew);

• Trigger the computation
  – GPU is after all a graphics processor
  – So just draw something appropriate
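
A hedged sketch of what the pseudocode above might expand to with
OpenGL and the Cg runtime (texYold, texX, texYnew, fbo and kernel are
assumed names; requires the EXT_framebuffer_object extension):

          // inputs: bind the arrays yold and x as textures of the kernel
          cgGLSetTextureParameter(cgGetNamedParameter(kernel, "arrayY"),
                                  texYold);
          cgGLSetTextureParameter(cgGetNamedParameter(kernel, "arrayX"),
                                  texX);
          // output: attach ynew to an offscreen framebuffer
          glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
          glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT,
                                    GL_COLOR_ATTACHMENT0_EXT,
                                    GL_TEXTURE_RECTANGLE_ARB, texYnew, 0);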


Computing = Drawing

• Specify input and output regions
  – Set up 1:1 mapping from graphics viewport to output array
    elements, set up input regions
  – saxpy: input and output regions coincide


• Generate data streams
  – Literally draw some geometry that covers all elements in
    the output array
  – In this example, a 4x4 filled quad from four vertices
  – GPU will interpolate output array indices from vertices
    across the output region
  – And generate the data stream flowing through the parallel PEs
    (see the sketch below)
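
A minimal sketch of the draw call for a 4x4 output region, assuming
the viewport and an orthogonal projection have already been set up
for the 1:1 mapping:

          glViewport(0, 0, 4, 4);     // one fragment per output element
          glBegin(GL_QUADS);          // quad covering the whole array
            glVertex2f(0.0f, 0.0f);
            glVertex2f(4.0f, 0.0f);
            glVertex2f(4.0f, 4.0f);
            glVertex2f(0.0f, 4.0f);
          glEnd();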
Example

[Figure: two 4x4 input arrays x and y stream through the kernel
 y + 0.5*x, producing the 4x4 output array]
Performing Computations

• High-level view
  – Kernel is executed simultaneously on all elements in the
    output region
  – Kernel knows its output index (and possibly additional input
    indices, more on that later)
  – Drawing replaces CPU loops, foreach-execution
  – Output array is write-only


• Feedback loop (ping-pong technique)
  – Output array can be used read-only as input for the next
    operation (see the sketch below)
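
A minimal ping-pong sketch (tex, paramX, drawQuad and maxIter are
assumed names); two arrays simply alternate between the read and the
write role on every pass:

          GLuint tex[2];           // tex[rd] is input, tex[wr] is output
          int rd = 0, wr = 1;
          for (int iter = 0; iter < maxIter; iter++) {
            glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT,
                                      GL_COLOR_ATTACHMENT0_EXT,
                                      GL_TEXTURE_RECTANGLE_ARB,
                                      tex[wr], 0);
            cgGLSetTextureParameter(paramX, tex[rd]);  // read-only input
            drawQuad();            // trigger the kernel
            int tmp = rd; rd = wr; wr = tmp;           // swap roles
          }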

Overview



• Choices in GPGPU programming

• Illustrated CPU vs. GPU step by step example

• GPU kernels in detail




GPU Kernels: saxpy

• Kernel on the CPU
           y[i] = y[i] + a*x[i]

• Written in Cg for the GPU
        float saxpy(float2 coords : WPOS,        // array index
                    uniform samplerRECT arrayX,  // input arrays
                    uniform samplerRECT arrayY,
                    uniform float a) : COLOR
        {
          float y = texRECT(arrayY, coords);     // gather
          float x = texRECT(arrayX, coords);
          return y + a*x;                        // compute
        }
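
As a usage note, loading and activating such a kernel with the Cg
runtime might look like this (the file name saxpy.cg is an
assumption):

          CGcontext ctx    = cgCreateContext();
          CGprofile prof   = cgGLGetLatestProfile(CG_GL_FRAGMENT);
          CGprogram kernel = cgCreateProgramFromFile(ctx, CG_SOURCE,
                               "saxpy.cg", prof, "saxpy", NULL);
          cgGLLoadProgram(kernel);
          cgGLEnableProfile(prof);
          cgGLBindProgram(kernel);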

GPU Kernels: Jacobi Iteration

• Good news
  – A simple linear system solver can be built with exactly these
    basic techniques!

• Example: Finite Differences
  – x: vector of unknowns, sampled with a 5-point stencil (offsets)
  – b: right-hand side
  – regular, equidistant grid
  – 'solved' with Jacobi iteration (see the update formula below)
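
Written out for a grid point (i,j), one Jacobi step then reads (with
the stencil scaled by one_over_h, matching the kernel on the next
slide):

          Ax(i,j)    = one_over_h * ( 4*x(i,j) - x(i-1,j) - x(i+1,j)
                                               - x(i,j-1) - x(i,j+1) )
          x_new(i,j) = x(i,j) + (b(i,j) - Ax(i,j)) / (4*one_over_h)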



GPU Kernels: Jacobi Iteration
        float jacobi(float2 center : WPOS,
                     uniform samplerRECT x,
                     uniform samplerRECT b,
                     uniform float one_over_h) : COLOR
        {
          // calculate offsets
          float2 left   = center - float2(1,0);
          float2 right  = center + float2(1,0);
          float2 bottom = center - float2(0,1);
          float2 top    = center + float2(0,1);

          // gather values
          float x_center = texRECT(x, center);
          float x_left   = texRECT(x, left);
          float x_right  = texRECT(x, right);
          float x_bottom = texRECT(x, bottom);
          float x_top    = texRECT(x, top);
          float rhs      = texRECT(b, center);

          // matrix-vector product
          float Ax = one_over_h *
                     ( 4.0 * x_center - x_left - x_right
                                      - x_bottom - x_top );

          // inverse of the diagonal D = 4*one_over_h
          float inv_diag = 1.0 / (4.0 * one_over_h);

          // Jacobi step
          return x_center + inv_diag * (rhs - Ax);
        }
Maximum of an Array

• Entirely different operation
  – Output is single scalar, input is array of length N

• Naive approach
  – Use 1x1 array as output, gather all N values in one step
  – Doomed: will only use one PE, no parallelism at all
  – Runs into all sorts of other troubles

• Solution: parallel reduction
  – Idea based on global communication in parallel computing
  – Smart interplay of output and input regions
  – Same technique applies to dot products, norms, etc.
    (see the driver sketch below)
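
A host-side driver sketch in the pseudocode notation from before
(drawQuad, current and next are assumed names); each pass halves the
output region until a single element remains:

          int size = N;
          while (size > 1) {
            size = size / 2;          // output region is size x size
            setInputArrays(current);
            setOutputArray(next);
            drawQuad(size, size);     // kernel gathers one 2x2 region
            current = next;           // result feeds the next pass
          }
          // current is now a 1x1 array holding the maximum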
Maximum of an Array

[Figure: reduction passes from the input array over intermediate
 arrays of decreasing size to the final result]

• Kernel adjusts its input index to gather a 2x2 region of the input
  for each output element; the output array is N/2 x N/2

        float maximum(float2 coords : WPOS,
                      uniform samplerRECT array) : COLOR
        {
          // top-left input index of this output element's 2x2 region
          float2 topleft = ((coords - 0.5) * 2.0) + 0.5;
          float val1 = texRECT(array, topleft);
          float val2 = texRECT(array, topleft + float2(1,0));
          float val3 = texRECT(array, topleft + float2(1,1));
          float val4 = texRECT(array, topleft + float2(0,1));
          return max(val1, max(val2, max(val3, val4)));
        }
Multigrid Transfers

• Restriction
  – Interpolate values from fine into coarse array
  – Local neighborhood weighted gather on both CPU and GPU
    (see the sketch below)



[Figure: coarse node i gathers from fine nodes 2i-1, 2i, 2i+1; the
 kernel adjusts its index to read the fine neighbors, and the output
 region is the coarse array]
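
A CPU sketch of the restriction gather in 1D, matching the figure
above (the 1/4, 1/2, 1/4 full-weighting stencil is an assumption):

          // each coarse node i gathers from fine nodes 2i-1, 2i, 2i+1
          for (int i = 1; i < Nc - 1; i++)
            coarse[i] = 0.25f * fine[2*i-1]
                      + 0.5f  * fine[2*i]
                      + 0.25f * fine[2*i+1];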
Multigrid Transfers

• Prolongation
  – Scatter values from the coarse into the fine array with a
    weighting stencil
  – Typical CPU implementation: loop over the coarse array with
    stride-2 daxpys (see the sketch below)
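
A CPU sketch of this scatter in 2D (bilinear interpolation weights
and the array widths Nc, Nf are assumptions; boundaries simplified):

          // each coarse value is distributed to the fine node on top
          // of it and, with smaller weights, to its fine neighbors
          for (int j = 1; j < Nc - 1; j++)
            for (int i = 1; i < Nc - 1; i++) {
              float c = coarse[j*Nc + i];
              fine[(2*j  )*Nf + 2*i  ] += c;          // copy
              fine[(2*j  )*Nf + 2*i-1] += 0.5f  * c;  // edge neighbors
              fine[(2*j  )*Nf + 2*i+1] += 0.5f  * c;
              fine[(2*j-1)*Nf + 2*i  ] += 0.5f  * c;
              fine[(2*j+1)*Nf + 2*i  ] += 0.5f  * c;
              fine[(2*j-1)*Nf + 2*i-1] += 0.25f * c;  // element centers
              fine[(2*j-1)*Nf + 2*i+1] += 0.25f * c;
              fine[(2*j+1)*Nf + 2*i-1] += 0.25f * c;
              fine[(2*j+1)*Nf + 2*i+1] += 0.25f * c;
            }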




Multigrid Transfers

•    Three cases
    1) Fine node lies in the center of an element (4 interpolants)
    2) Fine node lies on the edge of an element (2 interpolants)
    3) Fine node lies on top of a coarse node (copy)
•    Reformulate scatter as gather for the GPU
    – Set fine array as output region
    – Sample the coarse array with an index offset of 0.25
    – The offset snaps back to the center for case 3 and to the
      neighbors for cases 1 and 2, so the same code handles all
      three cases with no conditionals or red-black map
Conclusions

• This is not as complicated as it might seem
  – Course notes online:
    http://www.mathematik.uni-dortmund.de/~goeddeke/iccs
  – GPGPU community site: http://www.gpgpu.org
     • Developer information, lots of useful references
     • Paper archive
     • Help from real people in the GPGPU forums



