
CUDA Lecture 8
CUDA Memories
Prepared 8/9/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Hardware Implementation of CUDA
Memories

[Figure: the CUDA memory model — a grid of thread blocks; each block has its
own shared memory, each thread its own registers, and the host plus all
blocks access global and constant memory.]

 Each thread can:
    Read/write per-thread registers
    Read/write per-thread local memory
    Read/write per-block shared memory
    Read/write per-grid global memory
    Read-only per-grid constant memory

                                                                CUDA Memories – Slide 2
CUDA Variable Type Qualifiers
 Variable declaration                        Memory Scope         Lifetime
                         int LocalVar;        register   thread      thread

 __device__ __local__    int LocalVar;         local     thread      thread
                         int ArrayVar[10];
 __device__ __shared__   int SharedVar;       shared     block       block

 __device__              int GlobalVar;       global      grid    application

 __device__ __constant__ int ConstantVar;    constant     grid    application




 __device__ is optional when used with
  __local__, __shared__ or __constant__


                                                            CUDA Memories – Slide 3
CUDA Variable Type Qualifiers (cont.)
  Variable declaration                        Memory Scope         Lifetime
                          int LocalVar;        register   thread      thread

  __device__ __local__    int LocalVar;         local     thread      thread
                          int ArrayVar[10];
  __device__ __shared__   int SharedVar;       shared     block       block

  __device__              int GlobalVar;       global      grid    application

  __device__ __constant__ int ConstantVar;    constant     grid    application


 Automatic scalar variables without any qualifier
  reside in a register
    The compiler may spill them to thread-local memory
 Automatic array variables without any qualifier
  reside in thread-local memory
                                                             CUDA Memories – Slide 4
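
For concreteness, a minimal sketch of the distinction (a hypothetical kernel;
the names and sizes are illustrative, not from the lecture):

__global__ void example_kernel(float *out)
{
  float s = threadIdx.x * 0.5f;   // automatic scalar → register
  float buf[10];                  // automatic array → thread-local memory (per the table above)

  for (int k = 0; k < 10; ++k)    // the loop counter is another automatic scalar
    buf[k] = s + k;

  out[threadIdx.x] = buf[9];
}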
CUDA Variable Type Performance
    Variable declaration                        Memory      Penalty
                            int LocalVar;        register        1x

    __device__ __local__    int LocalVar;         local         100x
                            int ArrayVar[10];
    __device__ __shared__   int SharedVar;       shared          1x

    __device__              int GlobalVar;       global         100x

    __device__ __constant__ int ConstantVar;    constant         1x


 scalar variables reside in fast, on-chip registers
 shared variables reside in fast, on-chip memories
 thread-local arrays and global variables reside in
  uncached off-chip memory
 constant variables reside in cached off-chip memory
                                                            CUDA Memories – Slide 5
CUDA Variable Type Scale
   Variable declaration                        Instances Visibility
                           int LocalVar;        100,000s             1

   __device__ __local__    int LocalVar;        100,000s             1
                           int ArrayVar[10];
   __device__ __shared__   int SharedVar;         100s             100s

   __device__              int GlobalVar;          1             100,000s

   __device__ __constant__ int ConstantVar;        1             100,000s



 100Ks per-thread variables, R/W by 1 thread
 100s shared variables, each R/W by 100s of threads
 1 global variable is R/W by 100Ks threads
 1 constant variable is readable by 100Ks threads
                                                           CUDA Memories – Slide 6
Where to declare variables?

 Can the host access it?
   Yes → declare outside of any function:
     __constant__ int ConstantVar;
     __device__   int GlobalVar;
   No → declare in the kernel:
     int LocalVar;
     int ArrayVar[10];
     __shared__ int SharedVar;

                                                        CUDA Memories – Slide 7
Example: Thread-local Variables
// motivate per-thread variables with
// Ten Nearest Neighbors application
__global__ void ten_nn(float2 *result, float2 *ps, float2 *qs,
                       size_t num_qs)
{
  // p goes in a register
  float2 p = ps[threadIdx.x];

  // per-thread heap goes in off-chip memory
  float2 heap[10];

  // read through num_qs points, maintaining
  // the nearest 10 qs to p in the heap
  ...
  // write out the contents of heap to result
  ...
}


                                                   CUDA Memories – Slide 8
Example: Shared Variables
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // each thread loads two elements from global memory
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}


                                                     CUDA Memories – Slide 9
Example: Shared Variables (cont.)
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] – input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread’s global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i > 0)
    {
      // what are the bandwidth requirements of this kernel?
      int x_i = input[i];
      int x_i_minus_one = input[i-1]; Two loads

        result[i] = x_i – x_i_minus_one;
    }
}


                                                     CUDA Memories – Slide 10
Example: Shared Variables (cont.)
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // how many times does this kernel load input[i]?
    int x_i = input[i];              // once by thread i
    int x_i_minus_one = input[i-1];  // again by thread i+1

    result[i] = x_i - x_i_minus_one;
  }
}


                                                    CUDA Memories – Slide 11
Example: Shared Variables (cont.)
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i > 0)
  {
    // idea: eliminate redundancy by sharing data
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}


                                                      CUDA Memories – Slide 12
Example: Shared Variables (cont.)
// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
  // shorthand for threadIdx.x
  int tx = threadIdx.x;
  // allocate a __shared__ array, one element per thread
  __shared__ int s_data[BLOCK_SIZE];
  // each thread reads one element to s_data
  unsigned int i = blockDim.x * blockIdx.x + tx;
  s_data[tx] = input[i];

  // avoid race condition: ensure all loads complete before continuing
  __syncthreads();

  if (tx > 0)
    result[i] = s_data[tx] - s_data[tx-1];
  else if (i > 0)
  {
    // handle thread block boundary
    result[i] = s_data[tx] - input[i-1];
  }
}

                                                              CUDA Memories – Slide 13
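
BLOCK_SIZE here is a compile-time constant equal to the number of threads per
block. A possible launch, assuming a hypothetical value of 256, device pointers
r and i obtained from cudaMalloc, and an element count N that is a multiple of
BLOCK_SIZE:

#define BLOCK_SIZE 256

// one thread per element of the input/result arrays
adj_diff<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(r, i);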
Example: Shared Variables (cont.)
// when the size of the array isn’t known at compile time...
__global__ void adj_diff(int *result, int *input)
{
  // use extern to indicate a __shared__ array will be
  // allocated dynamically at kernel launch time
  extern __shared__ int s_data[];
  ...
}
// pass the size of the per-block array, in bytes, as the third
// argument to the triple chevrons
adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r,i);




                                                      CUDA Memories – Slide 14
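
For completeness, a minimal host-side sketch that allocates the arrays,
launches the dynamically sized version, and copies the result back; the element
count, block size, and variable names are illustrative assumptions, and N is
assumed to be a multiple of the block size.

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
  const int N = 1 << 20;                 // element count (illustrative)
  const size_t bytes = N * sizeof(int);

  int *h_input  = (int*)malloc(bytes);
  int *h_result = (int*)malloc(bytes);
  // ... fill h_input ...

  int *d_input, *d_result;
  cudaMalloc(&d_input,  bytes);
  cudaMalloc(&d_result, bytes);
  cudaMemcpy(d_input, h_input, bytes, cudaMemcpyHostToDevice);

  int block_size = 256;
  int num_blocks = N / block_size;       // N assumed a multiple of block_size

  // third launch argument: per-block __shared__ allocation in bytes
  adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(d_result, d_input);

  cudaMemcpy(h_result, d_result, bytes, cudaMemcpyDeviceToHost);

  cudaFree(d_input);  cudaFree(d_result);
  free(h_input);      free(h_result);
  return 0;
}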
Optimization Analysis
     Implementation         Original       Improved
     Global loads              2N       N + N/BLOCK_SIZE
     Global stores             N               N
     Throughput             36.8 GB/s       57.5 GB/s
     Source lines of code      18              35
     (SLOCs)
     Relative improvement      1x             1.57x
     Improvement/SLOC          1x             0.81x
 Experiment performed on a GT200 chip
   Improvement likely better on an older architecture
   Improvement likely worse on a newer architecture
 Optimizations tend to come with a development cost
                                                      CUDA Memories – Slide 15
Variable Type Restrictions
 Pointers can only point to memory allocated or
  declared in global memory:

   Allocated on the host and passed to the kernel:
    __global__ void KernelFunc(float* ptr)

   Obtained as the address of a global variable:
     float* ptr = &GlobalVar;




                                               CUDA Memories – Slide 16
Variable Type Restrictions (cont.)
 So you can use pointers to point at any memory space:

         __device__ int my_global_variable;
         __constant__ int my_constant_variable = 13;

         __global__ void foo(void)
         {
           __shared__ int my_shared_variable;

           int *ptr_to_global = &my_global_variable;
           const int *ptr_to_constant = &my_constant_variable;
           int *ptr_to_shared = &my_shared_variable;
           ...
           *ptr_to_global = *ptr_to_shared;
         }

                                                         CUDA Memories – Slide 17
Variable Type Restrictions (cont.)
 Pointers aren’t typed on memory space
                __shared__ int *ptr;
   Where does ptr point?
   ptr is a __shared__ pointer variable, not a pointer to
    a __shared__ variable!




                                             CUDA Memories – Slide 18
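
A minimal sketch of the distinction (a hypothetical kernel; the names and the
array size are illustrative, and blockDim.x is assumed to be at most 64):

__global__ void pointer_example(int *input)
{
  // a pointer variable that itself lives in __shared__ memory;
  // it can point anywhere (here, at global memory)
  __shared__ int *ptr;
  if (threadIdx.x == 0) ptr = input;
  __syncthreads();

  // by contrast, a pointer to data that lives in __shared__ memory
  __shared__ int s_data[64];
  s_data[threadIdx.x] = *ptr;            // every thread reads input[0]
  int *ptr_to_shared = &s_data[threadIdx.x];
  *ptr_to_shared += threadIdx.x;
}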
Don’t confuse the compiler!
   __device__ int my_global_variable;
   __global__ void foo(int *input)
   {
     __shared__ int my_shared_variable;

     int *ptr = 0;
     if (input[threadIdx.x] % 2)
       ptr = &my_global_variable;
     else
       ptr = &my_shared_variable;

     // where does ptr point?
   }
                                      CUDA Memories – Slide 19
Advice
 Prefer dereferencing pointers in simple, regular access
  patterns
 Avoid propagating pointers
 Avoid pointers to pointers
   The GPU would rather not pointer chase
   Linked lists will not perform well
 Pay attention to compiler warning messages
    Warning: Cannot tell what pointer points
    to, assuming global memory space
   Crash waiting to happen

                                             CUDA Memories – Slide 20
A Common Programming Strategy
 Global memory resides in device memory (DRAM)
   Much slower access than shared memory
 So, a profitable way of performing computation on
  the device is to tile data to take advantage of fast
  shared memory:
   Generalize from adjacent_difference example
   Divide and conquer




                                             CUDA Memories – Slide 21
A Common Programming Strategy
(cont.)




 Partition data into subsets that fit into shared memory
                                                CUDA Memories – Slide 22
A Common Programming Strategy
(cont.)




 Handle each data subset with one thread block as follows:
                                                CUDA Memories – Slide 23
A Common Programming Strategy
(cont.)




 Load the subset from global memory to shared memory,
  using multiple threads to exploit memory-level parallelism
                                               CUDA Memories – Slide 24
A Common Programming Strategy
(cont.)




 Perform the computation on the subset from shared
  memory; each thread can efficiently multi-pass over any
  data element
                                                CUDA Memories – Slide 25
A Common Programming Strategy
(cont.)




 Copy the results from shared memory back to global
  memory
                                          CUDA Memories – Slide 26
A Common Programming Strategy
(cont.)
 Constant memory also resides in device memory
  (DRAM)
   Much slower access than shared memory
   But…cached!
   Highly efficient access for read-only data




                                                 CUDA Memories – Slide 27
A Common Programming Strategy
(cont.)
 Carefully partition data according to access patterns
   Read-only → __constant__ memory (very fast if in cache)
   R/W & shared within block → __shared__ memory (very fast)
   R/W within each thread → registers (very fast)
   Indexed R/W within each thread → local memory (slow)
   R/W inputs/results → cudaMalloc'ed global memory (very slow)


                                            CUDA Memories – Slide 28
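
A minimal sketch that touches each of these spaces, using a hypothetical
3-point stencil; the coefficient values, names, and tile size are illustrative
assumptions, and the block size is assumed to equal TILE:

#define TILE 256

// read-only coefficients → __constant__ memory
__constant__ float coeff[3] = { 0.25f, 0.5f, 0.25f };

__global__ void stencil3(const float *input, float *output, int n)
{
  // R/W, shared within the block → __shared__ memory
  __shared__ float s_in[TILE];

  int tx = threadIdx.x;
  int i  = blockIdx.x * blockDim.x + tx;   // blockDim.x assumed equal to TILE

  // inputs/results → cudaMalloc'ed global memory
  if (i < n) s_in[tx] = input[i];
  __syncthreads();

  // per-thread scalar → register
  float acc = 0.0f;

  // interior threads of the tile only, to keep the sketch simple
  if (tx > 0 && tx < TILE - 1 && i < n - 1) {
    acc = coeff[0]*s_in[tx-1] + coeff[1]*s_in[tx] + coeff[2]*s_in[tx+1];
    output[i] = acc;
  }
}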
Communication through Memory
    __global__ void race(void)
    {
      __shared__ int my_shared_variable;
      my_shared_variable = threadIdx.x;
    }

 This is a race condition; the result is undefined
 The order in which threads access the variable is
  undefined without explicit coordination
 Two ways to enforce well-defined semantics

                                            CUDA Memories – Slide 29
Communication through Memory
(cont.)
 Use barriers (e.g., __syncthreads) to ensure
  data is ready for access

    __global__ void share_data(int *input)
    {
      __shared__ int data[BLOCK_SIZE];
      data[threadIdx.x] = input[threadIdx.x];
      __syncthreads();
    }

 The state of the entire data array is now well-defined
  for all threads in this block.

                                             CUDA Memories – Slide 30
Communication through Memory
(cont.)
 Use atomic operations (e.g., atomicAdd) to ensure
  exclusive access to a variable
    // assume *result is initialized to 0

    __global__ void sum(int *input, int *result)
    {
      atomicAdd(result, input[threadIdx.x]);
    }

 After this kernel exits, the value of *result will be
  the sum of the inputs

                                             CUDA Memories – Slide 31
Resource Contention
 Atomic operations aren’t cheap; they imply serialized
  access to a variable.
    __global__ void sum(int *input, int *result)
    {
      atomicAdd(result, input[threadIdx.x]);
    }


 How many threads will contend for exclusive access
  to result?

             sum<<<B,N/B>>>(input,result);


                                           CUDA Memories – Slide 32
Hierarchical Atomics

[Figure: per-block partial sums S0, S1, …, Si are combined into the total sum S.]

 Divide and Conquer
   Per-thread atomicAdd to a __shared__ partial sum
   Per-block atomicAdd to the total sum
                                          CUDA Memories – Slide 33
Hierarchical Atomics (cont.)
 __global__ void sum(int *input, int *result)
 {
   __shared__ int partial_sum;

   // thread 0 is responsible for initializing partial_sum
   if (threadIdx.x == 0) partial_sum = 0;
   __syncthreads();

   // each thread updates the partial sum
   atomicAdd(&partial_sum, input[threadIdx.x]);
   __syncthreads();

   // thread 0 updates the total sum
   if (threadIdx.x == 0) atomicAdd(result, partial_sum);
 }



                                                    CUDA Memories – Slide 34
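
A minimal host-side sketch of driving this kernel; a fragment assuming
<cuda_runtime.h> and <stdlib.h> are included, with an illustrative element
count and block size, N a multiple of block_size, and *result starting at zero.

const int N = 1 << 16;                   // element count (illustrative)
const int block_size = 256;              // N assumed a multiple of block_size

int *h_input = (int*)calloc(N, sizeof(int));   // ... fill with data ...

int *d_input, *d_result;
cudaMalloc(&d_input, N * sizeof(int));
cudaMalloc(&d_result, sizeof(int));
cudaMemcpy(d_input, h_input, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemset(d_result, 0, sizeof(int));    // *result must start at 0

sum<<<N / block_size, block_size>>>(d_input, d_result);

int h_result = 0;
cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);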
Advice
 Use barriers such as __syncthreads to wait until
  __shared__ data is ready
 Prefer barriers to atomics when data access patterns
  are regular or predictable
 Prefer atomics to barriers when data access patterns
  are sparse or unpredictable
 Atomics to __shared__ variables are much faster
  than atomics to global variables
 Don’t synchronize or serialize unnecessarily


                                            CUDA Memories – Slide 35
Example: Matrix Multiplication using
Shared Memory

[Figure: matrices A and B multiplied to produce AB.]

 Generalize the adjacent_difference example
 AB = A * B
   Each element ABij = dot(row(A,i), col(B,j))
 Parallelization strategy
   Thread → ABij
   2D kernel
                                       CUDA Memories – Slide 36
First Try: Matrix Multiply Kernel using
Multiple Blocks
   __global__ void mat_mul(float *a, float *b,
                           float *ab, int width)
   {
     // calculate the row & col index of the element
     int row = blockIdx.y * blockDim.y + threadIdx.y;
     int col = blockIdx.x * blockDim.x + threadIdx.x;

     float result = 0;

     // do dot product between row of a and col of b
     for (int k = 0; k < width; ++k)
       result += a[row * width + k] * b[k * width + col];

     ab[row * width + col] = result;
   }



                                                   CUDA Memories – Slide 37
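
A plausible launch for this kernel, assuming a 16x16 thread block, device
pointers a, b and ab, and a width that is a multiple of 16 (the block size is
an illustrative choice, not specified on the slide):

dim3 block(16, 16);
dim3 grid(width / 16, width / 16);   // one thread per element of ab
mat_mul<<<grid, block>>>(a, b, ab, width);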
How will this perform?
 How many loads per term of the dot product?  2 (one from a, one from b) = 8 bytes
 How many floating point (FP) operations?  2 (multiply and addition)
 Global memory access to FLOP ratio (GMAC)?  8 bytes / 2 ops = 4 B/op
 What is the peak FP performance of the GeForce GTX 260?  805 GFLOPS
 Lower bound on bandwidth required to reach peak FP performance?
  GMAC * peak FLOPS = 4 * 805 ≈ 3.2 TB/s
 What is the actual memory bandwidth of the GeForce GTX 260?  112 GB/s
 Then what is an upper bound on the performance of our implementation?
  Actual BW / GMAC = 112 / 4 = 28 GFLOPS

                                                          CUDA Memories – Slide 38
How will this perform? (cont.)
[Figure: the CUDA memory model again; every thread fetches its matrix elements
from global memory.]

 All threads access global memory for their input matrix elements
 The actual code runs at about 15 GFLOPS
 Need to drastically cut down memory accesses to get closer to the
  peak 805 GFLOPS
                                                                 CUDA Memories – Slide 39
Idea: Use __shared__ memory to
reuse global data

[Figure: matrices A, B and AB, each of dimension width.]

 Each input element is read by width threads
 Load each element into __shared__ memory and have several threads
  use the local version to reduce the memory bandwidth
                                        CUDA Memories – Slide 40
Tiled Multiply

[Figure: A, B and AB partitioned into TILE_WIDTH × TILE_WIDTH tiles.]

 Partition the kernel loop into phases so that the data accesses in
  each phase are focused on one subset (tile) of A and B
 Load a tile of both matrices into __shared__ each phase
                                 CUDA Memories – Slide 41
Tiled Multiply (cont.)

[Figure: each block computes one TILE_WIDTH × TILE_WIDTH tile ABsub of AB.]

 Each phase:
   each block computes one square sub-matrix ABsub of size
    TILE_WIDTH
   each phase, each thread computes a partial result, one element
    of ABsub
                                      CUDA Memories – Slide 42
A Small Example

[Figure: the first two columns of B (B0,0 … B1,3), the first two rows of A
(A0,0 … A3,1), and the 4×4 product AB (AB0,0 … AB3,3).]

                                                        CUDA Memories – Slide 43
A Small Example (cont.)
 Every A and B element is used exactly twice in
  generating a 2-by-2 tile of AB
           AB0,0          AB1,0         AB0,1         AB1,1
         thread0,0      thread1,0     thread0,1     thread1,1
Access   A0,0 * B0,0   A0,0 * B1,0   A0,1 * B0,0   A0,1 * B1,0
order    A1,0 * B0,1   A1,0 * B1,1   A1,1 * B0,1   A1,1 * B1,1
         A2,0 * B0,2   A2,0 * B1,2   A2,1 * B0,2   A2,1 * B1,2
         A3,0 * B0,3   A3,0 * B1,3   A3,1 * B0,3   A3,1 * B1,3


                                                   CUDA Memories – Slide 44
Breaking A and B into Tiles

[Figure: the same small example, with A, B and AB partitioned into 2×2 tiles.]

                                                        CUDA Memories – Slide 45
Breaking A and B into Tiles (cont.)
 Each phase of a thread block uses one tile from A
  and one from B
                        Phase 1                             Phase 2
    T0,0   A0,0     B0,0     AB0,0 +=          A2,0     B0,2     AB0,0 +=
           ↓        ↓        s_a0,0*s_b0,0 +   ↓        ↓        s_a0,0*s_b0,0 +
           s_a0,0   s_b0,0   s_a1,0*s_b0,1     s_a0,0   s_b0,0   s_a1,0*s_b0,1

    T1,0   A1,0     B1,0     AB1,0 +=          A3,0     B1,2     AB1,0 +=
           ↓        ↓        s_a0,0*s_b1,0 +   ↓        ↓        s_a0,0*s_b1,0 +
           s_a1,0   s_b1,0   s_a1,0*s_b1,1     s_a1,0   s_b1,0   s_a1,0*s_b1,1

    T0,1   A0,1     B0,1     AB0,1 +=          A2,1     B0,3     AB0,1 +=
           ↓        ↓        s_a0,1*s_b0,0 +   ↓        ↓        s_a0,1*s_b0,0 +
           s_a0,1   s_b0,1   s_a1,1*s_b0,1     s_a0,1   s_b0,1   s_a1,1*s_b0,1

    T1,1   A1,1     B1,1     AB1,1 +=          A3,1     B1,3     AB1,1 +=
           ↓        ↓        s_a0,1*s_b1,0 +   ↓        ↓        s_a0,1*s_b1,0 +
           s_a1,1   s_b1,1   s_a1,1*s_b1,1     s_a1,1   s_b1,1   s_a1,1*s_b1,1

                         time
                                                                       CUDA Memories – Slide 46
Tiled Multiply (cont.)

[Figure: each block computes one TILE_WIDTH × TILE_WIDTH tile ABsub of AB.]

 Each phase:
   each block computes one square sub-matrix ABsub of size
    TILE_WIDTH
   each phase, each thread computes a partial result, one element
    of ABsub
                                      CUDA Memories – Slide 47
Better Implementation
 Set up the execution configuration

       dim3 dimBlock (TILE_WIDTH, TILE_WIDTH);

       dim3 dimGrid (Width / TILE_WIDTH,
                     Width / TILE_WIDTH);




                                            CUDA Memories – Slide 48
Better Implementation (cont.)
    __global__ void mat_mul(float *a, float *b,
                            float *ab, int width)
    {
      // shorthand
      int tx = threadIdx.x, ty = threadIdx.y;
      int bx = blockIdx.x, by = blockIdx.y;
      // allocate tiles in __shared__ memory
      __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
      __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];
      // calculate the row & col index
      int row = by * blockDim.y + ty;
      int col = bx * blockDim.x + tx;

      float result = 0;




                                               CUDA Memories – Slide 49
Better Implementation (cont.)
      // loop over the tiles of the input in phases
      for (int p = 0; p < width/TILE_WIDTH; ++p)
      {
        // collaboratively load tiles into __shared__
        s_a[ty][tx] = a[row * width + (p * TILE_WIDTH + tx)];
        s_b[ty][tx] = b[(p * TILE_WIDTH + ty) * width + col];
        __syncthreads();

        // dot product between row of s_a and col of s_b
        for (int k = 0; k < TILE_WIDTH; ++k)
          result += s_a[ty][k] * s_b[k][tx];
        __syncthreads();
      }

      ab[row * width + col] = result;
    }


                                                     CUDA Memories – Slide 50
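
This kernel assumes TILE_WIDTH is a compile-time constant and that width is a
multiple of TILE_WIDTH. A minimal launch sketch under those assumptions (the
value 16 and the device pointer names are illustrative):

#define TILE_WIDTH 16

// d_a, d_b, d_ab are width x width float matrices allocated with cudaMalloc
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(width / TILE_WIDTH, width / TILE_WIDTH);
mat_mul<<<dimGrid, dimBlock>>>(d_a, d_b, d_ab, width);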
Use of Barriers in mat_mul
 Two barriers per phase:
   __syncthreads after all data is loaded into
    __shared__ memory
   __syncthreads after all data is read from
    __shared__ memory
   Note that second __syncthreads in phase p guards
    the load in phase p+1

 Use barriers to guard data
   Guard against using uninitialized data
   Guard against bashing live data
                                             CUDA Memories – Slide 51
First Order Size Considerations
 Each thread block should have many threads
   TILE_WIDTH = 16 → 16*16 = 256 threads
 There should be many thread blocks
   1024-by-1024 matrices → 64*64 = 4096 thread blocks
   TILE_WIDTH = 16 → gives each SM 3 blocks, 768 threads
   Full occupancy
 Each thread block performs 2 * 256 = 512 32-bit loads
  from global memory for 256 * (2 * 16) = 8,192 FP
  operations
   Memory bandwidth no longer a limiting factor
                                            CUDA Memories – Slide 52
Optimization Analysis
   Implementation          Original          Improved
   Global Loads              2N^3        2N^2 * (N/TILE_WIDTH)
   Throughput             10.7 GFLOPS      183.9 GFLOPS
   SLOCs                      20                44
   Relative Improvement       1x               17.2x
   Improvement/SLOC           1x               7.8x


 Experiment performed on a GT200
 This optimization was clearly worth the effort
 Better performance still possible in theory

                                                 CUDA Memories – Slide 53
Memory Resources as Limit to
Parallelism
     Resource             Per GT200   Full Occupancy on
                             SM             GT200
     Registers              16384     ≤ 16384 / 768 threads
                                      = 21 per thread
     __shared__ Memory      16 KB     ≤ 16 KB / 8 blocks
                                      = 2 KB per block


 Effective use of different memory resources reduces
  the number of accesses to global memory
 These resources are finite!
 The more memory locations each thread requires, the
  fewer threads an SM can accommodate
                                                  CUDA Memories – Slide 54
GT200 Shared Memory and Threading
 Each SM in GT200 has 16 KB of shared memory
   Shared memory size is implementation dependent!
   For TILE_WIDTH = 16, each thread block uses 2*256*4 B = 2 KB of
    shared memory.
   Can potentially have up to 8 thread blocks actively executing
       This allows up to 8*512 = 4,096 pending loads (2 per thread,
        256 threads per block)
   The next TILE_WIDTH, 32, would lead to 2*32*32*4 B = 8 KB of shared
    memory usage per thread block, allowing at most two thread blocks
    to be active at the same time
 Using 16x16 tiling, we reduce the accesses to the global
  memory by a factor of 16
   The 112 GB/s bandwidth can now support (112/4)*16 = 448 GFLOPS!

                                                              CUDA Memories – Slide 55
TILE_SIZE Effects

[Chart: measured performance for different TILE_SIZE values (figure omitted).]

                    CUDA Memories – Slide 56
Summary: Typical Structure of a CUDA
Program
 Global variables declaration
    __host__
    __device__... __global__, __constant__, __texture__
 Function prototypes
    __global__ void kernelOne(…)
    float handyFunction(…)
 Main ()
    allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
    transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
    execution configuration setup
    kernel call – kernelOne<<<execution configuration>>>( args… );
    transfer results from device to host – cudaMemcpy(h_GlblVarPtr,…)
    optional: compare against golden (host computed) solution
    (the transfer / kernel call / transfer-back steps repeat as needed)
 Kernel – void kernelOne(type args,…)
    variables declaration - __local__, __shared__
        automatic variables transparently assigned to registers or local memory
    __syncthreads()…
 Other functions
    float handyFunction(int inVar…);
                                                                   CUDA Memories – Slide 57
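
A minimal skeleton that follows this structure; the kernel body, sizes, and
names are illustrative placeholders, not the course's code.

#include <cuda_runtime.h>
#include <stdlib.h>

__constant__ float scale = 2.0f;                 // global (constant) declaration

__global__ void kernelOne(float *d_out, const float *d_in, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x; // automatic variable → register
  if (i < n) d_out[i] = scale * d_in[i];
}

int main(void)
{
  const int n = 1 << 20;
  const size_t bytes = n * sizeof(float);
  float *h_in = (float*)malloc(bytes), *h_out = (float*)malloc(bytes);
  // ... fill h_in ...

  float *d_in, *d_out;
  cudaMalloc(&d_in, bytes);                      // allocate memory on the device
  cudaMalloc(&d_out, bytes);
  cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);     // host → device

  int block = 256, grid = (n + block - 1) / block;           // execution configuration
  kernelOne<<<grid, block>>>(d_out, d_in, n);                // kernel call

  cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);   // device → host
  // optional: compare h_out against a host-computed golden solution

  cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
  return 0;
}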
Final Thoughts
 Effective use of CUDA memory hierarchy decreases
  bandwidth consumption to increase throughput
 Use __shared__ memory to eliminate redundant
  loads from global memory
   Use __syncthreads barriers to protect __shared__
    data
   Use atomics if access patterns are sparse or
    unpredictable
 Optimization comes with a development cost
 Memory resources ultimately limit parallelism

                                          CUDA Memories – Slide 58
End Credits
 Reading: Chapter 5, “Programming Massively Parallel
  Processors” by Kirk and Hwu.
 Based on original material from
   The University of Illinois at Urbana-Champaign
      David Kirk, Wen-mei W. Hwu
   Stanford University
      Jared Hoberock, David Tarjan
 Revision history: last updated 8/9/2011.




                                             CUDA Memories – Slide 59

								