Compilers, Parallel Computing, and Grid Computing by HC121001032737


        Programming Model
These notes will introduce:

• Basic GPU programming model
• CUDA kernel
• Simple CUDA program to add two vectors together
• Compiling the code on a Linux system

 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 20, 2011   1
             Programming Model

GPUs historically designed for creating image data for

That application involves manipulating image pixels
(picture elements) and often the same operation each

SIMD (single instruction multiple data) model - An
efficient mode of operation in which the same operation
is done on each data element at the same time

 SIMD (Single Instruction Multiple Data)
Also know as data parallel computation.
One instruction specifies the operation:
                        a[] = a[] + k


       a[0]      a[1]                   a[n-2]      a[n-1]

Very efficient of this is what you want to do. One program.
Can design computers to operate this way.
   Single Instruction Multiple Thread
         Programming Model

A version of SIMD used in GPUs.

GPUs use a thread model to achieve very high parallel
performance and to hide memory latency

Multiple threads, each execute the same instruction sequence.

On a GPU, a very large number of threads (10,000’s) possible.

Threads mapped onto available processors on GPU (100’s of
processors all executing same program sequence)

       Programming applications
          using SIMT model
Matrix operations -- very amenable to SIMT
• Same operations done on different elements of matrices

Some “embarassingly” parallel computations such as
Monte Carlo calculations
• Monte Carlo calculations use random selections
    Random selections are independent of each other

Data manipulations
• Some sorting can be done quite efficiently

           CUDA kernel routine
To write a SIMT program, one needs to write a code
sequence that all the threads on the GPU will do.

In CUDA, this code sequence is called a Kernel routine

Kernal code will be regular C except one typically needs
to use the thread ID in expressions to ensure each thread
accesses different data:
                                        All theads do this
     index = ThreadID;
     A[index] = B[index] + C[index];
           CPU and GPU memory
• Program once compiled has code
  executed on CPU and (kernel) code
  executed on GPU                                    CPU

                                              CPU main memory
• Separate memories on CPU and GPU
                                         Copy from         Copy from
                                         CPU to            GPU to
Need to                                  GPU               CPU
• Explicitly transfer data from CPU to
  GPU for GPU computation, and               GPU global memory

• Explicitly transfer results in GPU
  memory copied back to CPU memory

Basic CUDA program structure
int main (int argc, char **argv ) {

    1. Allocate memory space in device (GPU) for data
    2. Allocate memory space in host (CPU) for data

    3. Copy data to GPU

    4. Call “kernel” routine to execute on GPU
    (with CUDA syntax that defines no of threads and their physical structure)

    5. Transfer results from GPU to CPU

    6. Free memory space in device (GPU)
    7. Free memory space in host (CPU)

     1. Allocating memory space in
         “device” (GPU) for data

Use CUDA malloc routines:

int size = N *sizeof( int);                   // space for N integers

int *devA, *devB, *devC;                     // devA, devB, devC ptrs

cudaMalloc( (void**)&devA, size) );
cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );

Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.
    2. Allocating memory space in
         “host” (CPU) for data
Use regular C malloc routines:
int *a, *b, *c;
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);

or statically declare variables:
#define N 256
int a[N], b[N], c[N];

  3. Transferring data from host
      (CPU) to device (GPU)
Use CUDA routine cudaMemcpy
           Destination Source

cudaMemcpy( devA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy( dev_B, B, size, cudaMemcpyHostToDevice);

devA and devB are pointers to destination in device
A and B are pointers to host data
   4. Declaring “kernel” routine to
       execute on device (GPU)
CUDA introduces a syntax addition to C:
Triple angle brackets mark call from host code to device code.
Contains organization and number of threads in two parameters:

                 myKernel<<< n, m >>>(arg1, … );

n and m will define organization of thread blocks and threads in a

For now, we will set n = 1, which say one block and m = N, which
says N threads in this block.

arg1, … , -- arguments to routine myKernel typically pointers to
device memory obtained previously from cudaMallac.
              Declaring a Kernel Routine
A kernel defined using CUDA specifier __global__                             underscores
                                                                             each side

               Example – Adding to vectors A and B
#define N 256
__global__ void vecAdd(int *A, int *B, int *C) { // Kernel definition

    int i = threadIdx.x;                CUDA structure that provides thread ID in block
    C[i] = A[i] + B[i];
}                                              Each of the N threads performs one pair-
                                               wise addition:
int main() {                                   Thread 0:   devC[0] = devA[0] + devB[0];
  // allocate device memory &                  Thread 1:   devC[1] = devA[1] + devB[1];
 // copy data to device
                                               Thread N-1: devC[N-1] = devA[N-1]+devB[N-1];
 // device mem. ptrs devA,devB,devC

    vecAdd<<<1, N>>>(devA,devB,devC); // Grid of one block, N threads in block
            Loosely derived from CUDA C programming guide, v 3.2 , 2010, NVIDIA
  5. Transferring data from device
        (GPU) to host (CPU)

Use CUDA routine cudaMemcpy
           Destination Source

cudaMemcpy( C, devC, size, cudaMemcpyDeviceToHost);

devC is a pointer in device and C is a pointer in host.

6. Free memory space in “device”

 Use CUDA cudaFree routine:

 cudaFree( dev_a);
 cudaFree( dev_b);
 cudaFree( dev_c);

7. Free memory space in (CPU) host
       (if CPU memory allocated with malloc)

 Use regular C free routine to deallocate memory if
 previously allocated with malloc:

 free( a );
 free( b );
 free( c );

                   #define N 256
                   __global__ void vecAdd(int *A, int *B, int *C) {
CUDA                 int i = threadIdx.x;
                     C[i] = A[i] + B[i];
program            }

                   int main (int argc, char **argv ) {

                     int size = N *sizeof( int);
Adding two           int a[N], b[N], c[N], *devA, *devB, *devC;
vectors, A and       cudaMalloc( (void**)&devA, size) );
B                    cudaMalloc( (void**)&devB, size );
                     cudaMalloc( (void**)&devC, size );
N elements in A
and B, and           cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
                     cudaMemcpy( devB, b size, cudaMemcpyHostToDevice);
N threads            vecAdd<<<1, N>>>(devA, devB, devC);

(without code to     cudaMemcpy( c, devC size, cudaMemcpyDeviceToHost);
load arrays with
data)                cudaFree( dev_a);
                     cudaFree( dev_b);
                     cudaFree( dev_c);

                     return (0);                                      17
                                               int main(int argc, char *argv[]) {
                                                  int T = 10, B = 1;              // threads per block/blocks per grid
     Complete, with                               int a[N],b[N],c[N];
                                                  int *dev_a, *dev_b, *dev_c;
     keyboard input for                            printf("Size of array = %d\n", N);
     blocks/threads                                do {
                                                      printf("Enter number of threads per block: ");
                                                      printf("\nEnter nuumber of blocks per grid: ");
     (without timing execution,                       scanf("%d",&B);
                                                      if (T * B < N) printf("Error T x B < N, try again");
     see later)                                    } while (T * B < N);

                                                   cudaMalloc((void**)&dev_a,N * sizeof(int));
                                                   cudaMalloc((void**)&dev_b,N * sizeof(int));
                                                   cudaMalloc((void**)&dev_c,N * sizeof(int));
#include <stdio.h>
#include <cuda.h>                                  for(int i=0;i<N;i++) {   // load arrays with some numbers
#include <stdlib.h>                                   a[i] = i;
#include <time.h>                                     b[i] = i*1;
#define N 4096        // size of array
                                                   cudaMemcpy(dev_a, a , N*sizeof(int),cudaMemcpyHostToDevice);
                                                   cudaMemcpy(dev_b, b , N*sizeof(int),cudaMemcpyHostToDevice);
__global__ void add(int *a,int *b, int *c) {       cudaMemcpy(dev_c, c , N*sizeof(int),cudaMemcpyHostToDevice);
  int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if(tid < N){
            c[tid] = a[tid]+b[tid];                cudaMemcpy(c,dev_c,N*sizeof(int),cudaMemcpyDeviceToHost);
}                                                  for(int i=0;i<N;i++) {

                                                   cudaFree(dev_a);                 // clean up
                                                   return 0;
      Compiling CUDA programs
NVIDIA provides nvcc -- the NVIDIA CUDA “compiler

Will separate out code for host and for device

Regular C/C++ compiler used for host (needs to be

Programmer simply uses nvcc instead of gcc/cc compiler
on a Linux system

Command line options include for GPU features
                Compiling code - Linux

Command line:                                    Directories for #include files

nvcc –O3 –o <exe> <source_file> -I/usr/local/cuda/include
                                –L/usr/local/cuda/lib –lcuda –lcudart
 Optimization level if
 you want optimized                   Directories for libraries   Libraries to be linked

CUDA source file that includes device code has the extension .cu
nvcc separates code for CPU and for GPU and compiles code.
Need regular C compiler installed for CPU.
Make file convenient – see next.

   See “The CUDA Compiler Driver NVCC” from NVIDIA for more details                 20
       Very simple sample Make file
NVCC = /usr/local/cuda/bin/nvcc
CUDAPATH = /usr/local/cuda

LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm

prog1:                                                      A regular C program
    cc -o prog1 prog1.c –lm
                                                    A C program with X11 graphics
    cc -I/usr/openwin/include -o prog2 prog2.c -L/usr/openwin/lib -L/usr/X11R6/lib
-lX11 –lm
                                                               A CUDA program
    $(NVCC) $(NVCCFLAGS) $(LFLAGS) -o prog3
                                                  A CUDA program with X11 graphics
    $(NVCC) $(NVCCFLAGS) $(LFLAGS) -I/usr/openwin/include -o prog4 -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm
           Compilation process
nvcc “wrapper” divides   nvcc –o prog –I/includepath -L/libpath
code into host and
device parts.
Host part compiled by
regular C compiler
                           ptxas                           gcc
Device part compiled
by NVIDIA “ptxas”                       Combine
assembler                                               Object file

Two compiled parts                     executable
combined into one
executable                 Executable file a “fat” binary” with
                              both host and device code            22
               Executing Program

Simple type name of executable created by nvcc:


File includes all the code for host and for device in a “fat binary” file

Host code starts running

When first encounter device kernel, GPU code physically sent to
GPU and function launched on GPU
Hence first launch will be slow!!

Run time environment (cudart) controls memcpy timing and

To top