Particle Swarm Methods for Parameter Optimization on GPU

					CUDA Programming Model Overview
Hardware Hierarchy and Optimization
                  Yukai Hung
             a0934147@gmail.com
          Department of Mathematics
           National Taiwan University
CUDA Development Environment
CUDA Development Environment




       for developers who want low-level APIs  /  for developers who want high-level APIs
       both share the same back-end compiler and optimization technology




                              3
CUDA Development Environment




                          4
CUDA Development Environment

     CUDA source files must be compiled by nvcc
     - create host object code directly
     - create device object code directly
     - create Parallel Thread eXecution source code directly
     - compile into device emulation mode for host simulation

     Executable CUDA code requires two libraries
     - dynamic linking with cuda core library
     - dynamic linking with cuda runtime library




                                     5
CUDA Development Environment

     CUDA emulation mode
     - use host to simulate kernel execution
     - no need for any device or CUDA driver
     - each device thread is emulated with a host thread

     Running in device emulation mode can
     - use host native debugger support
     - call any host function from device code and vice-versa
     - access any device-specific data from host code and vice-versa
     - detect deadlock situations caused by improper synchronization




                                     6
CUDA Development Environment


      Linux and Windows environment




                                  7
CUDA Development Environment
                                   C/C++ CUDA application, for example:
                                       float4 me=gx[gtid];
                                       me.x=me.x+me.y*me.z;

                 Virtual Layer:   NVCC separates the source into CPU code
                                  and PTX code

                 Physical Layer:  the PTX-to-target compiler translates the PTX
                                  into target code for the actual GPU (G80, ...),
                                  for example:
                                       ld.global.v4.f32 {$f1,$f3,$f5,$f7},[$r9+0];
                                       mad.f32          $f1,$f5,$f3,$f1;
                                                8
CUDA Development Environment

     How to install CUDA?

     How to compile CUDA file?

     How to compile CUDA file into PTX?
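     A minimal set of example command lines answering the questions above
     (file names are placeholders; -deviceemu exists only in older toolkits):
     - nvcc example.cu -o example              compile and link an executable
     - nvcc -c example.cu -o example.o         create object code only
     - nvcc -ptx example.cu -o example.ptx     generate PTX source code
     - nvcc -deviceemu example.cu -o example   build for device emulation mode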




                                    9
CUDA Development Environment


      Windows Nexus IDE
      - first IDE for massively thread-parallel applications
      - accelerate coprocessing application development
      - complete Visual Studio integrated development environment




                                   10
CUDA Development Environment


      Linux and Mac Visual Profiler




                                      11
Start from Matrix-Matrix Multiplication
           naïve and tiled algorithm
Naïve Matrix-Matrix Multiplication


      Each thread calculates one result element
      - read one row and one column from global memory
      - perform inner product and store back to global memory
                 [figure: matrices M (WIDTH x WIDTH) and N (WIDTH x WIDTH) multiplied into P (WIDTH x WIDTH)]
                                    13
Naïve Matrix-Matrix Multiplication

  //compute matrix matrix multiplication P=M*N
  __global__ void SGEMM(float* Mmatrix,float* Nmatrix,float* Pmatrix)
  {
     float value;
     float Melement;
     float Nelement;

      //calculate each thread global index
      int xidx=blockIdx.x*blockDim.x+threadIdx.x;
      int yidx=blockIdx.y*blockDim.y+threadIdx.y;

      value=0.0f;
      //each thread performs its inner product
      for(int loop=0;loop<WIDTH;loop++)
      {
         Melement=Mmatrix[yidx*WIDTH+loop];
         Nelement=Nmatrix[loop*WIDTH+xidx];
         value=value+Melement*Nelement;
      }

      //store the final result into the P matrix
      Pmatrix[yidx*WIDTH+xidx]=value;
  }




                                      14
Naïve Matrix-Matrix Multiplication


      Analyze the global memory usage
      - each thread computes one element of the result matrix
      - loads one row and one column of elements from matrices M and N
      - performs one multiply and one addition for each pair of elements
      - the compute to global memory access ratio is close to 1:1
      Analyze the global memory bandwidth
      - 4 bytes x 346.5 GFLOP/s = 1386 GB/s is required to reach peak
      - 86.4 GB/s limits the actual performance to about 21.6 GFLOP/s
      - the actual code for the naïve algorithm runs at about 15 GFLOP/s




                                     15
Naïve Matrix-Matrix Multiplication


      Need to drastically cut down global memory accesses
      - each row of matrix M is read many times
      - each column of matrix N is read many times
                    [figure: the same row of M and column of N are reloaded for every element of P]
                                    16
Tiled Matrix-Matrix Multiplication


        Use the tiled algorithm for matrix data reuse
        - each block computes one square sub-matrix of the result
        - each thread computes one element of the sub-matrix




       [figure: M, N, and P partitioned into TILE_WIDTH x TILE_WIDTH sub-matrices;
        each block computes one sub-matrix Pdsub of the WIDTH x WIDTH result]
                                              17
Tiled Matrix-Matrix Multiplication


       Each block should have many threads
       - tile size 16 creates 16x16=256 threads per block
       - a 1024x1024 matrix needs 64x64=4096 blocks on the grid

       Each thread block loads two sub-matrices and computes results
       - load 2x256=512 floats from global memory to shared memory
       - perform 256x(2x16)=512x16 operations on the shared memory
       - global memory bandwidth is no longer a limiting factor
       - the compute to global memory access ratio is close to 16:1




                                     18
Tiled Matrix-Matrix Multiplication

  //host wrapper: compute matrix matrix multiplication P=M*N
  void SGEMM_host(float* h_A,float* h_B,float* h_C)
  {
     dim3 gridSize;
     dim3 blockSize;

      //allocate d_A,d_B,d_C on the device and copy h_A,h_B into them
      ...

      //setup the execution configuration
      blockSize.x=TILED_WIDTH;
      blockSize.y=TILED_WIDTH;

      gridSize.x=WIDTH/TILED_WIDTH;
      gridSize.y=WIDTH/TILED_WIDTH;

      //launch device matrix-matrix multiplication kernel
      SGEMM<<<gridSize,blockSize>>>(d_A,d_B,d_C);

      //copy the result d_C back into h_C and free the device memory
      ...
  }




                                      19
Tiled Matrix-Matrix Multiplication

  //compute matrix matrix multiplication P=M*N
  __global__ void SGEMM(float* d_A,float* d_B,float* d_C)
  {
     //setup the block index and thread index
     int bx=blockIdx.x;
     int by=blockIdx.y;
     int tx=threadIdx.x;
     int ty=threadIdx.y;

     float value=0.0f;

     //declare the shared memory per block
     __shared__ float Mshared[TILED_WIDTH][TILED_WIDTH];
     __shared__ float Nshared[TILED_WIDTH][TILED_WIDTH];

    ...




                                     20
Tiled Matrix-Matrix Multiplication


     //loop over all sub-matrices on matrix M and N
     //required to compute block sub-matrix results
     for(int loop=0;loop<WIDTH/TILED_WIDTH;loop++)
     {
        //get the pointer to the current sub-matrix M and N
        float* Msub=GetSubMatrix(M,loop,bx,by,tx,ty,WIDTH);
        float* Nsub=GetSubMatrix(N,loop,bx,by,tx,ty,WIDTH);

        //each thread loads one element into shared memory
        Mshared[ty][tx]=GetMatrixElement(Msub,tx,ty);
        Nshared[ty][tx]=GetMatrixElement(Nsub,tx,ty);

        //synchronize to make sure the sub-matrices are loaded
        //into shared memory before computing the partial result
       __syncthreads();




                                     21
Tiled Matrix-Matrix Multiplication


        void __syncthreads() function
        - synchronizes all executing threads in the same block
        - execution resumes only when all threads have reached the same point
        - use it to avoid RAW/WAR/WAW hazards when accessing memory (illustrated below)
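
        A minimal illustration (the kernel name and a block size of 256 are assumptions):
        each thread stages one element in shared memory and then reads its neighbor's element;
        without the barrier a thread in one warp may read before a thread in another warp has written.

        __global__ void rotate(float* ivector,float* ovector)
        {
           __shared__ float buffer[256];

           int gidx=blockIdx.x*blockDim.x+threadIdx.x;

           //each thread writes one element into shared memory
           buffer[threadIdx.x]=ivector[gidx];

           //barrier: without it the read below may see stale data (RAW hazard)
           __syncthreads();

           //each thread reads the element written by its neighbor thread
           ovector[gidx]=buffer[(threadIdx.x+1)%blockDim.x];
        }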




                                     22
Tiled Matrix-Matrix Multiplication


          //each thread computes one element of the sub-matrix
          for(int loope=0;loope<TILED_WIDTH;loope++)
             value=value+Mshared[ty][loope]*Nshared[loope][tx];

          //synchronize to make sure that the preceding
          //computation is done before loading two new
          //sub-matrices of M and N in the next iteration
          __syncthreads();
      }

      //get the pointer to the block sub-matrix P
      float* Psub=GetSubMatrix(P,loop,bx,by,tx,ty,WIDTH);

      //save the block sub-matrix result to global memory
      SetMatrixElement(Psub,value,bx,by,tx,ty,WIDTH);

      return;
  }




                                       23
Tiled Matrix-Matrix Multiplication




                               24
Tiled Matrix-Matrix Multiplication


        What is the limitation on the matrix tile size?
        - the memory access to compute ratio depends on the tile size
        - a higher memory reuse ratio comes from a larger tile size
        - the ratio is limited by the shared memory size and block size

        What is the performance difference between the two algorithms?
        - naïve matrix-matrix multiplication runs at about 15 GFLOP/s
        - tiled matrix-matrix multiplication runs at about 90 GFLOP/s

        What kind of tiled algorithm is used in the Intel Math Kernel Library?




                                      25
Matrix-Matrix Multiplication



     GPU version                        CPU version
     device memory                      host memory
     shared memory                      L3 cache
                                        L2 cache




                                  26
Matrix-Matrix Multiplication




                               27
Matrix-Matrix Multiplication


        What about the memory access pattern?

                 [figure: matrices M, N, and P (WIDTH x WIDTH) with the access pattern highlighted]




                                                28
Tiled Matrix-Matrix Multiplication


        How to accelerate the current tiled algorithm?
        - prefetch the next sub-matrix data while computing the current one
        - overlap the loading and computing times to hide latency (see the sketch below)
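
        A minimal sketch of the prefetching idea inside the tile loop; the GetSubMatrix and
        GetMatrixElement helpers, the Mshared/Nshared arrays, and the indices follow the
        earlier listings and are assumed, so this is only a sketch, not the deck's exact code:

        //prefetch the first tiles of M and N into registers
        float* Msub=GetSubMatrix(M,0,bx,by,tx,ty,WIDTH);
        float* Nsub=GetSubMatrix(N,0,bx,by,tx,ty,WIDTH);
        float Mreg=GetMatrixElement(Msub,tx,ty);
        float Nreg=GetMatrixElement(Nsub,tx,ty);

        for(int loop=0;loop<WIDTH/TILED_WIDTH;loop++)
        {
           //move the prefetched elements from registers into shared memory
           Mshared[ty][tx]=Mreg;
           Nshared[ty][tx]=Nreg;
           __syncthreads();

           //prefetch the next tiles while the current ones are consumed
           if(loop+1<WIDTH/TILED_WIDTH)
           {
              Msub=GetSubMatrix(M,loop+1,bx,by,tx,ty,WIDTH);
              Nsub=GetSubMatrix(N,loop+1,bx,by,tx,ty,WIDTH);
              Mreg=GetMatrixElement(Msub,tx,ty);
              Nreg=GetMatrixElement(Nsub,tx,ty);
           }

           //compute the partial inner product on the current tiles
           for(int k=0;k<TILED_WIDTH;k++)
              value=value+Mshared[ty][k]*Nshared[k][tx];
           __syncthreads();
        }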




                                      29
Recommend Program Strategy


      Global memory resides in device memory
      - much slower access than shared memory
      A profitable way of programming the device is to tile the data
      - divide the data into subsets that fit into shared memory
      - handle each data subset with one thread block by (skeleton below):
       1: load the subset from global memory to shared memory,
          using multiple threads to exploit memory-level parallelism
       2: perform the computation on the subset in shared memory
       3: store the final results from shared memory back to global memory
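
      A minimal generic skeleton of this pattern (the kernel name, a block size of 256,
      and the neighbor-average computation in step 2 are placeholder assumptions):

      __global__ void process(float* idata,float* odata)
      {
         //1: load one data subset from global memory into shared memory
         __shared__ float tile[256];
         int gidx=blockIdx.x*blockDim.x+threadIdx.x;
         tile[threadIdx.x]=idata[gidx];
         __syncthreads();

         //2: perform the computation on the subset held in shared memory
         float result=0.5f*(tile[threadIdx.x]+tile[(threadIdx.x+1)%blockDim.x]);

         //3: store the final result back to global memory
         odata[gidx]=result;
      }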




                                     30
Same Idea for Matrix Transpose
       naïve and tiled algorithm
Naïve Matrix Transpose


      Each thread handles one element of the matrix
      - load one row element from the original matrix
      - store one column element into the transposed matrix (a sketch follows)
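
      A minimal sketch of such a kernel (the kernel name, a square WIDTH x WIDTH matrix,
      and a 2D launch with one thread per element are assumptions):

      __global__ void transpose(float* idata,float* odata)
      {
         int xidx=blockIdx.x*blockDim.x+threadIdx.x;
         int yidx=blockIdx.y*blockDim.y+threadIdx.y;

         //read one row element and store it as one column element
         odata[xidx*WIDTH+yidx]=idata[yidx*WIDTH+xidx];
      }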




                 [figure: a WIDTH x WIDTH matrix and its WIDTH x WIDTH transpose]




                                    32
Naïve Matrix Transpose


               load data from global memory with stride = 1
               store data into global memory with stride = dimension
               [figure: row-wise reads and column-wise writes on a WIDTH x WIDTH matrix]




                                      33
Tiled Matrix Transpose




               Consider the tiled algorithm on the matrix!
               load data from memory / store data into memory
               [figure: a 3x3 grid of tiles (0,0)...(2,2) and its transposed layout]




                                    34
Tiled Matrix Transpose


               Consider one of the thread blocks!
               load data from global memory to shared memory
               transpose the tile inside shared memory
               store data from shared memory to global memory
               [figure: one 3x3 tile copied from global memory into shared memory,
                transposed in shared memory, and written back to global memory]
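
               A minimal sketch of this tiled version (the kernel name, a square
               WIDTH x WIDTH matrix, and TILED_WIDTH x TILED_WIDTH thread blocks
               are assumptions):

               __global__ void transpose(float* idata,float* odata)
               {
                  __shared__ float tile[TILED_WIDTH][TILED_WIDTH];

                  int xidx=blockIdx.x*TILED_WIDTH+threadIdx.x;
                  int yidx=blockIdx.y*TILED_WIDTH+threadIdx.y;

                  //load one tile with row-wise (stride 1) reads
                  tile[threadIdx.y][threadIdx.x]=idata[yidx*WIDTH+xidx];
                  __syncthreads();

                  //swap the block indices so that the writes are also row-wise
                  xidx=blockIdx.y*TILED_WIDTH+threadIdx.x;
                  yidx=blockIdx.x*TILED_WIDTH+threadIdx.y;

                  //store the transposed tile with row-wise (stride 1) writes
                  odata[yidx*WIDTH+xidx]=tile[threadIdx.x][threadIdx.y];
               }

               Padding the shared array to TILED_WIDTH+1 columns additionally avoids
               shared memory bank conflicts on the column-wise reads of the tile.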




                                         35
Hardware Hierarchy Architecture Details
Hardware Hierarchy Architecture


                     the work distribution unit distributes all blocks onto the shader multiprocessors
                     [figure: GPU block diagram: host, input assembler, vertex/pixel thread issue,
                      work distribution, shader processors (SP), texture filter units (TF),
                      L1/L2 caches, and frame buffer (FB) partitions]




                                                                       37
Hardware Hierarchy Architecture


       all blocks get scheduled round-robin based on the number of shaders;
       this means you need sufficient blocks to fill all the shader pipelines




                                 38
Hardware Hierarchy Architecture

                              [figure: the same GPU block diagram]
                              the TF units perform graphic texture operations:
                              texture filtering and pixel interpolations




                                                                       39
Hardware Hierarchy Architecture

                                     shader processor array (SPA) on the chip;
                               the shader number depends on the particular hardware
                              [figure: the same GPU block diagram with the SPA highlighted]




                                                                        40
Hardware Hierarchy Architecture

                                    shader multiprocessor (SM) on the SPA;
                                  a thread block is scheduled onto one SM
                              [figure: the same GPU block diagram with one SM highlighted]




                                                                       41
Hardware Hierarchy Architecture


                  shader processor (SP) for regular arithmetic;
                 a thread is scheduled onto one shader processor




                                     43
Hardware Hierarchy Architecture


                  low precision special function unit SFU
                   hardware supports intrinsic functions




                                   44
Hardware Hierarchy Architecture


      Intrinsic functions are supported by the hardware SFU
      - intrinsic functions come from traditional usage
      - the functions are less accurate but faster versions
      - the function names are prefixed with __ as in __function()
      Selectable operation rounding modes
      - functions suffixed with _rn use round-to-nearest-even
      - functions suffixed with _rz use round-towards-zero
      - functions suffixed with _rd use round-down
      - functions suffixed with _ru use round-up
      The SFU and SP units are two independent execution units
      - SFU and SP operations can be performed simultaneously (example below)
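
      A small sketch contrasting the standard and intrinsic/rounded versions
      (the kernel name and the inputs are placeholders):

      __global__ void intrinsic_demo(float* x,float* y,float* out)
      {
         int idx=blockIdx.x*blockDim.x+threadIdx.x;

         float accurate=sinf(x[idx])/y[idx];            //standard, more accurate versions
         float fast=__fdividef(__sinf(x[idx]),y[idx]);  //faster, less accurate intrinsic versions

         float sum=__fadd_rn(accurate,fast);            //addition with round-to-nearest-even
         out[idx]=__fmul_rz(sum,0.5f);                  //multiplication with round-towards-zero
      }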




                                     45
Hardware Hierarchy Architecture


        on-chip shared memory on each SM shares data within the same thread block;
        the instruction unit fetches one instruction and dispatches it to all threads in the same warp




                                         46
Warp Architecture


      Warps are purely a hardware concept, unrelated to software design
      - designed to handle variable block sizes
      - designed to handle divergence efficiently
      - designed to hide memory access latency
      Each block is divided into several warps
      - each warp contains 32 threads on current shaders
      - the warp size is decided by hardware considerations
       Each shader executes only one warp at any instant
       - each shader can monitor and switch between several warps
      - synchronization is not necessary among threads in the same warp




                                      47
Warp Architecture


                               block size 256
                     1    2    3   4        5   6   7   8
                               warp number 8

                               block size 280
                     1    2    3   4        5   6   7   8   9
                              warp number 9
                    8 threads on the last warp are idle




                                       48
Warp Architecture


      Shader multiprocessor scheduling

                            Shader Multiprocessor
                         block 0             block 1

                     block 0   block 0        block 1   block 1
                     warp 0    warp 1         warp 0    warp 1

                     block 0   block 0        block 1   block 1
                     warp 2    warp 3         warp 2    warp 3




                                         49
Warp Architecture


      Shader multiprocessor synchronization

                            Shader Multiprocessor
                         block 0             block 1

                     block 0   block 0        block 1   block 1
                     warp 0    warp 1         warp 0    warp 1

                     block 0   block 0        block 1   block 1
                     warp 2    warp 3         warp 2    warp 3




                                         50
Warp Architecture


      The warp scheduler on each shader is very smart!!




                                    51
Warp Architecture


      SM hardware implements zero-overhead warp scheduling
      - warps whose next instruction has its operands ready for
        consumption are eligible for execution via scoreboarding
      - eligible warps are selected for execution by a prioritized
        scheduling policy via the warp scheduling priority table
      Each thread block is partitioned into several warps
      - the partitioning is always the same on different devices
      - the warp size may change from generation to generation
      - thread indices within a warp are consecutive and increasing

      All threads in the same warp execute the same instruction
      - synchronization is not necessary for threads in the same warp




                                      52
Warp Architecture


            [figure: twelve thread blocks distributed round-robin by the block scheduler onto the shaders]




                                    53
                             [figure: the warp scheduler selects ready warps via scoreboarding]




                                                  54
Warp Architecture




                     [figure: warps waiting in the warp scheduler]
Warp Architecture


      Why is the warp size equal to 32 threads?
      - it depends on the shader instruction dispatch hardware
      - dispatching the same instruction to all threads takes 4 clock cycles
      - each warp thread needs 4 clock cycles to get its next instruction
      - 8 concurrent threads x 4 clock cycles = 32 threads in one warp




                                     55
Warp Architecture



                      [figure: one warp instruction issued over clock cycles 1-4;
                       instruction k finishes before instruction k+1 is issued]
                         56
Warp Architecture


       How to hide the global memory access latency?
       - suppose one global access is needed for every n instructions
       - more than 200/4n warps are needed to fully tolerate a latency
         of about 200 cycles (a worked example follows)
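       For example, with n = 4 (one global access every four instructions)
       about 200/(4x4) ≈ 13 warps per shader are needed; with n = 2 the
       requirement grows to 200/(4x2) = 25 warps.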

            [figure: six warps interleaved so that memory latency is overlapped with execution]




                                     57
Warp Architecture


       How to hide the global memory access latency?
       - suppose one global access is needed for every n instructions
       - more than 200/4n warps are needed to fully tolerate a latency of about 200 cycles

            [figure: the same scheme with only four warps interleaved]




                                     58
Control Flow
Control Flow


      Each instruction is issued per 32 threads, i.e. per warp

      The main performance concern with branching is divergence
      - threads within a single warp take different control paths
      - different execution paths within a warp are serialized
      - the control paths taken by the threads in the same warp
        are traversed one at a time until there are no more

      Different warps are independent with respect to divergence




                                      60
Control Flow

                            //divergent condition
                            if(threadIdx.x<2)
                            {......}
                            else
                            {......}

     - this creates two different control paths for threads in a block
     - the branch granularity is less than the warp size (only the first warp diverges)
     - threads 0 and 1 follow a different path than the rest of the threads in warp 0
                       warp 0   warp 1        warp 2   warp 3




                                         61
Control Flow

                        //no divergent condition
                        if(threadIdx.x/WARP_SIZE<2)
                        {......}
                        else
                        {......}

     - this creates two different control paths for threads in a block
     - the branch granularity is a whole multiple of the warp size
     - all threads in any given warp follow the same control path
                        warp 0   warp 1        warp 2   warp 3




                                          62
Control Flow


      The profiler counts execution events per shader multiprocessor
      - counts branches and divergent branches
      - counts warp serializations and instructions




                                    63
Parallel Vector Reduction
Parallel Vector Reduction


        Given any array, reduce it to a single value in parallel

       Some examples:
       - sum reduction: sum of all values in the array
       - max reduction: maximum of all values in the array




                                       65
Parallel Vector Reduction


        Assume an in-place reduction using shared memory
        - the original vector is in device global memory
        - shared memory is used to hold the partial sum results
        - each iteration brings the partial sums closer to the final sum
        - the final partial result in shared memory will be in element 0




                                      66
Parallel Vector Reduction




                            67
Parallel Vector Reduction

  //naïve parallel vector sum reduction

  //raw data is loaded into shared memory
  __shared__ float array[size];

  ....

  //outer loop for parallel reduction
  for(int stride=1;stride<blockDim.x;stride=stride*2)
  {
     //check the thread is active or not
     if(threadIdx.x%(2*stride)==0)
        array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+stride];

      //make sure all active threads are done
      __syncthreads();
  }




                                      68
Parallel Vector Reduction

                       [results table: time and bandwidth of the naïve kernel]




                              69
Parallel Vector Reduction


        Each warp will traverse both control paths sequentially
        - threads that do not perform the addition may still cost extra
          cycles because of how divergence is implemented

        No more than half of the threads are active at any time
        - all odd-index threads are disabled right from the beginning
        - on average, fewer than 1/4 of the threads are active across all warps
        - after the 5th iteration, entire warps in each block are
          disabled; this is poor resource utilization but no divergence




                                       70
Parallel Vector Reduction

  //naïve parallel vector summation reduction
  //BAD: divergence due to interleaved branch decisions

  //raw data is loaded into shared memory
  __shared__ float array[size];

  //outer loop for parallel reduction
  for(int stride=1;stride<blockDim.x;stride=stride*2)
  {
     //check whether the thread is active or not
     if(threadIdx.x%(2*stride)==0)
        array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+stride];

      //make sure all active threads are done
      __syncthreads();
  }




                                      71
Parallel Vector Reduction

  //naïve parallel vector summation reduction
  //improvement: take out the mod operation

  //raw data is loaded into shared memory
  __shared__ float array[size];

  //outer loop for parallel reduction
  for(int stride=1;stride<blockDim.x;stride=stride*2)
  {
     int index=threadIdx.x*stride*2;

      if(index<blockDim.x)
         array[index]=array[index]+array[index+stride];

      //make sure all active threads are done
      __syncthreads();
  }




                                      72
Parallel Vector Reduction
                       [results table: time, bandwidth, step speedup, and cumulative speedup]




                              73
Parallel Vector Reduction




                            74
Parallel Vector Reduction

  //parallel vector summation reduction: sequential addressing

  //raw data is loaded into shared memory
  __shared__ float array[size];

  //outer loop for parallel reduction
  for(int stride=blockDim.x/2;stride>0;stride=stride>>1)
  {
     //check the thread is active or not
     if(threadIdx.x<stride)
        array[threadIdx.x]=array[threadIdx.x]
                          +array[threadIdx.x+stride];

      //make sure all active threads are done
      __syncthreads();
  }




                                      75
Parallel Vector Reduction
                       [results table: time, bandwidth, step speedup, and cumulative speedup]




                              76
Parallel Vector Reduction


        Idle thread problem
        - half of the threads are idle on the first loop iteration!!

        //raw data is loaded into shared memory
        __shared__ float array[size];
        //Idle: only half of the threads are ever active - this is wasteful!!

        //outer loop for parallel reduction
        for(int stride=blockDim.x/2;stride>0;stride=stride>>1)
        {
           //check whether the thread is active or not
           if(threadIdx.x<stride)
              array[threadIdx.x]+=array[threadIdx.x+stride];

            //make sure all active threads are done
            __syncthreads();
        }




                                         77
Parallel Vector Reduction


        Halve the block size
        - replace the original single load with two loads
        - add the first two elements as a first-level reduction during the load

                   [figure: two block-sized input segments summed into one during loading]




                                        78
Parallel Vector Reduction


       Original data loading
          //raw data is loaded into shared memory
          __shared__ float array[size];

          //load data from global memory to shared memory
          array[threadIdx.x]= ivector[threadIdx.x];


       New data loading: halve original block size
           //raw data is loaded into shared memory
          __shared__ float array[size];

          //load data from global memory to shared memory
          array[threadIdx.x]= ivector[threadIdx.x]
                            + ivector[threadIdx.x+blockDim.x];




                                       79
Parallel Vector Reduction
                       [results table: time, bandwidth, step speedup, and cumulative speedup]




                              80
Parallel Vector Reduction


        At 17 GB/s we are still far from bandwidth bound
        - we know reduction has low arithmetic intensity
        - therefore a likely bottleneck is instruction overhead
        - consider the address arithmetic and loop overhead

        The number of active threads decreases as the reduction proceeds
       - only one warp is active when the stride is less than 32
       - instructions are synchronous within the same warp
         (remove the block synchronization for the last warp)

        Strategy: unroll the last 6 iterations of the inner loop




                                        81
Parallel Vector Reduction

  //outer loop for parallel reduction
  for(int stride=blockDim.x/2; stride>32 ;stride=stride>>1)
  {
     //check whether the thread is active or not
     //(this does not cause a divergent branch in the last active warp)
     if(threadIdx.x<stride)
        array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+stride];

      //make sure all active threads are done
      __syncthreads();
  }

  //unroll the last 6 iterations and skip the synchronization
  //(removes the loop instructions and the block synchronization)
  if(threadIdx.x<32)
  {
     array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+32];
     array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+16];
     array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+ 8];
     array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+ 4];
     array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+ 2];
     array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+ 1];
  }




                                       82
Parallel Vector Reduction
                       [results table: time, bandwidth, step speedup, and cumulative speedup]

           Note: this saves useless work in all warps, not just the last one!!
           Without unrolling, all warps execute all of the for-loop and if instructions




                                           83
Parallel Vector Reduction


        Completely unroll the reduction loop
        - the block size is limited to 512 threads
        - we stick to power-of-2 block sizes
        - suppose we know the number of iterations in advance

        It is easy to unroll for a fixed block size
       - we need a smart method to unroll all possible sizes
       - how can we unroll all possible block sizes at compile time?

        Templates to the rescue!!
        - CUDA supports C++ template parameters on device functions




                                      84
Parallel Vector Reduction


       Specify block size as a function template parameter
      //use template to unroll all possible block size
      template<int blockSize>
      __global__ void reduce(float* ivector,float* ovector)


       How to invoke template kernel in different block size?
      //invoke template kernel in different block size
      reduce<512><<<blockNumber,blockSize>>>(ivector,ovector);
      reduce<256><<<blockNumber,blockSize>>>(ivector,ovector);
      reduce<128><<<blockNumber,blockSize>>>(ivector,ovector);
      reduce< 64><<<blockNumber,blockSize>>>(ivector,ovector);
      reduce< 32><<<blockNumber,blockSize>>>(ivector,ovector);
      reduce< 16><<<blockNumber,blockSize>>>(ivector,ovector);
      reduce< 8><<<blockNumber,blockSize>>>(ivector,ovector);
      reduce< 4><<<blockNumber,blockSize>>>(ivector,ovector);
      reduce< 2><<<blockNumber,blockSize>>>(ivector,ovector);
       reduce<  1><<<blockNumber,blockSize>>>(ivector,ovector);
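
       In practice the block size is known on the host at launch time, so the calls above
       are usually wrapped in a switch so that only one instantiation runs; a minimal
       host-side sketch (assuming blockSize is a power of two):

       switch(blockSize)
       {
          case 512: reduce<512><<<blockNumber,blockSize>>>(ivector,ovector); break;
          case 256: reduce<256><<<blockNumber,blockSize>>>(ivector,ovector); break;
          case 128: reduce<128><<<blockNumber,blockSize>>>(ivector,ovector); break;
          case  64: reduce< 64><<<blockNumber,blockSize>>>(ivector,ovector); break;
          case  32: reduce< 32><<<blockNumber,blockSize>>>(ivector,ovector); break;
          case  16: reduce< 16><<<blockNumber,blockSize>>>(ivector,ovector); break;
          case   8: reduce<  8><<<blockNumber,blockSize>>>(ivector,ovector); break;
          case   4: reduce<  4><<<blockNumber,blockSize>>>(ivector,ovector); break;
          case   2: reduce<  2><<<blockNumber,blockSize>>>(ivector,ovector); break;
          case   1: reduce<  1><<<blockNumber,blockSize>>>(ivector,ovector); break;
       }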




                                      85
Parallel Vector Reduction

  //unroll the first 3 iterations conditionally
  if(blockSize>=512)
  {
     if(threadIdx.x<256)
        array[threadIdx.x]+=array[threadIdx.x+256];
     __syncthreads();
  }

  if(blockSize>=256)
  {
     if(threadIdx.x<128)
        array[threadIdx.x]+=array[threadIdx.x+128];
     __syncthreads();
  }

  if(blockSize>=128)
  {
     if(threadIdx.x<64)
        array[threadIdx.x]+=array[threadIdx.x+64];
     __syncthreads();
  }




                                     86
Parallel Vector Reduction

  //unroll the last active warp conditionally
  if(threadIdx.x<32)
  {
     if(blockSize>=64) array[threadIdx.x]+=array[threadIdx.x+32];
     if(blockSize>=32) array[threadIdx.x]+=array[threadIdx.x+16];
     if(blockSize>=16) array[threadIdx.x]+=array[threadIdx.x+ 8];
     if(blockSize>= 8) array[threadIdx.x]+=array[threadIdx.x+ 4];
     if(blockSize>= 4) array[threadIdx.x]+=array[threadIdx.x+ 2];
     if(blockSize>= 2) array[threadIdx.x]+=array[threadIdx.x+ 1];
  }

                 Note: all of the blockSize comparisons are evaluated at compile time!!
                This results in a very efficient inner loop with no loop instructions




                                            87
Parallel Vector Reduction
                       [results table: time, bandwidth, step speedup, and cumulative speedup]




                              88
Parallel Vector Reduction


        Combine both sequential and parallel reduction
        - each thread loads and sums multiple elements into shared memory
        - the last parallel part performs a tree-based reduction in shared memory

                       [figure: several block-sized input segments summed sequentially
                        per thread before the tree-based reduction]




                                      89
Parallel Vector Reduction


          //previous version: load and add two elements per thread
          array[threadIdx.x]=vector[threadIdx.x]
                            +vector[threadIdx.x+blockDim.x];

          //new version: each thread sums multiple elements sequentially
          int index=threadIdx.x;
          array[threadIdx.x]=0.0f;
          while(index<maxn)
          {
             array[threadIdx.x]+=vector[index]
                                +vector[index+blockDim.x];
             index=index+(blockSize<<1);
          }




                                   90
Parallel Vector Reduction
                       [results table: time, bandwidth, step speedup, and cumulative speedup]




                              91
Parallel Vector Reduction


       Why is this also suitable for traditional parallel architectures?




                                        92
Matrix-Matrix Multiplication

       How about unrolling the matrix-matrix multiplication?
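
       A minimal sketch of one option: since TILED_WIDTH is a compile-time constant,
       the tiled kernel's inner product loop can be unrolled with #pragma unroll
       (a sketch of the idea, not measured code):

          //each thread computes one element of the sub-matrix;
          //the compiler replicates the body TILED_WIDTH times, removing loop overhead
          #pragma unroll
          for(int loope=0;loope<TILED_WIDTH;loope++)
             value=value+Mshared[ty][loope]*Nshared[loope][tx];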




                                      93
Reference
- Mark Harris   http://www.markmark.net/
- Wei-Chao Chen http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu http://impact.crhc.illinois.edu/people/current/hwu.php




                                       94

				