G80 CUDA

nVidia G80 GPU & CUDA
Brent Oster
Associate Director, Allosphere, CNSI
PhD studies in Computational Nanotechnology
Previously: Technical Director for Video Games
  (Bioware, EA, Alias|Wavefront, LucasArts)

Modern GPU is More General Purpose – Lots of ALUs
        The nVidia G80 GPU
► 128 streaming floating-point processors @ 1.5 GHz
► 1.5 GB shared RAM with 86 GB/s bandwidth
► 500 Gflop on one chip (single precision)
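  (Roughly: 128 processors × 1.5 GHz × 2 to 3 single-precision flops per cycle,
  a multiply-add plus a co-issued multiply, works out to about 400 to 575 Gflop,
  consistent with the 500 Gflop figure.)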
What Has Driven the Evolution of These Chips?

Males age 15-35 buy $10B in video games / year

       Crysis Demo
Are GPUs Useful for Scientific Computing?

   Finite Element Modeling
   Electronic Structure (DFT)
   Molecular Dynamics & Monte Carlo
nVidia G80 GPU Architecture Overview
• 16 multiprocessor blocks
• Each MP block has:
   • 8 streaming processors (IEEE 754 spfp compliant)
   • 16 KB shared memory
   • 64 KB constant cache
   • 8 KB texture cache
• Each processor can access all of the memory at 86 GB/s,
  but with different latencies:
   • Shared – 2 cycle latency
   • Device – 300 cycle latency
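These figures can also be read back at run time; below is a minimal sketch (not from the slides) using the CUDA runtime's cudaGetDeviceProperties:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main()
  {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);                         // properties of device 0
      printf("Device:                %s\n",        prop.name);
      printf("Multiprocessor blocks: %d\n",        prop.multiProcessorCount);
      printf("Shared mem per block:  %zu bytes\n", prop.sharedMemPerBlock);
      printf("Constant memory:       %zu bytes\n", prop.totalConstMem);
      printf("Core clock:            %d kHz\n",    prop.clockRate);
      printf("Global memory:         %zu bytes\n", prop.totalGlobalMem);
      return 0;
  }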
       Programming Interface
► Interface to GPU via nVidia’s proprietary API
  – CUDA (very C-like)
► Looks a lot like UPC (simplified CUDA below)

  void AddVectors(float *r, float *a, float *b)
  {
    int tx = threadIdx.x;     //~processor rank
    r[tx] = a[tx] + b[tx];    //executed in parallel
  }
               Actual CUDA Code
#define MAX_THREADS 512

__global__ void AddVectorsKernel(float *r, float *a, float *b, int n);

extern "C" void AddVectors(float *r, float *a, float *b, int n)
{
   int nThreads = MAX_THREADS / 2;                      // threads per block
   int nBlocks  = n / nThreads;                         // assumes n is a multiple of nThreads
   AddVectorsKernel<<<nBlocks, nThreads>>>(r, a, b, n); // r, a, b must be device pointers
}

__global__ void AddVectorsKernel(float *r, float *a, float *b, int n)
{
   int tx = threadIdx.x;              // thread index within the block
   int bx = blockIdx.x;               // block index within the grid
   int i  = tx + bx * blockDim.x;     // global element index
   r[i] = a[i] + b[i];
} // This would be extremely slow and inefficient code – more later
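The slide leaves out the host-side memory management; here is a minimal sketch (not from the original) of the allocation, copy, and launch steps, assuming the corrected kernel above and that n is a multiple of nThreads:

#include <cuda_runtime.h>

// Hypothetical helper (not in the slides): allocate device memory, copy the
// inputs over, launch AddVectorsKernel, and copy the result back.
extern "C" void AddVectorsOnDevice(float *r, const float *a, const float *b, int n)
{
   size_t bytes = n * sizeof(float);
   float *d_r, *d_a, *d_b;
   cudaMalloc((void**)&d_r, bytes);                    // device (global) memory
   cudaMalloc((void**)&d_a, bytes);
   cudaMalloc((void**)&d_b, bytes);
   cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);  // host -> device copies
   cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

   int nThreads = MAX_THREADS / 2;                     // 256 threads per block
   int nBlocks  = n / nThreads;                        // assumes n divides evenly
   AddVectorsKernel<<<nBlocks, nThreads>>>(d_r, d_a, d_b, n);

   cudaMemcpy(r, d_r, bytes, cudaMemcpyDeviceToHost);  // copy the result back (synchronizes)
   cudaFree(d_r);  cudaFree(d_a);  cudaFree(d_b);
}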
      Still A Specialized Processor
► Very Efficient For
   Fast Parallel Floating Point Processing
   Single Instruction Multiple Data Operations
   High Computation per Memory Access

► Not As Efficient For
     Double Precision (need to test performance)
     Logical Operations on Integer Data
     Branching-Intensive Operations
     Random Access, Memory-Intensive Operations
__global__ void NxNGenericOp_Kernel(float *r, float *a, float *b, int n) // r[i] = SUMj(a[i]*b[j])
{
   __shared__ float r_sh[MAX_THREADS];           //Allocate in fast 16 KB shared memory
   __shared__ float a_sh[MAX_THREADS];
   __shared__ float b_sh[MAX_THREADS];

    int tx = threadIdx.x;                           //Rank of thread within the block
    int bx = blockIdx.x;                            //Rank of multiprocessor block
    int i = tx + bx * MAX_THREADS;                  //Compute index from tx, bx
    a_sh[tx] = a[i];                                //Each thread loads a value for a_sh
    r_sh[tx] = 0;                                   //Each thread zeros a value for r_sh

    __syncthreads();                                //Wait until all threads reach this point
    for(int J = 0; J < n; J += MAX_THREADS)         //Loop over blocks in b
    {
           b_sh[tx] = b[J+tx];                      //Each thread loads a value for b_sh
           __syncthreads();                         //Sync before reading b_sh
           for(int j = 0; j < MAX_THREADS; j++)     //For each b_sh
               r_sh[tx] += a_sh[tx] * b_sh[j];      //Compute product a_sh*b_sh, add to r_sh
           __syncthreads();                         //Sync before b_sh is overwritten
    }
    r[i] = r_sh[tx];                                 //Write results to r
}
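As with the earlier example, this kernel is launched from the host with device pointers; a hypothetical call (the names d_r, d_a, d_b are not from the slides), assuming n is a multiple of MAX_THREADS:

   NxNGenericOp_Kernel<<<n / MAX_THREADS, MAX_THREADS>>>(d_r, d_a, d_b, n);

The payoff over the naive version is that each value of b is read from device memory once per block and then reused MAX_THREADS times out of 2-cycle shared memory instead of 300-cycle device memory.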
Making Optimal Use of 16 KB Shared Memory

16 KB shared memory is allocated in 16 banks

Array data is allocated across the banks:
B[0] -> bank 0
B[1] -> bank 1
…
B[n] -> bank mod(n, nBanks)

No bank conflicts if each thread indexes a different bank

Bank conflicts if threads access the same bank
(results in a data stall)
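A hypothetical kernel (not from the slides) contrasting the two access patterns just described, assuming 16 banks of 32-bit words and a 16-thread (half-warp) block:

__global__ void BankExampleKernel(float *out)
{
   __shared__ float data[16][16];            // a 16x16 tile of words spread across the 16 banks
   int tx = threadIdx.x;                     // assume blockDim.x == 16

   for (int i = 0; i < 16; i++)
       data[i][tx] = (float)(i * 16 + tx);   // each thread fills one column of the tile
   __syncthreads();

   // Conflict-free: consecutive threads read consecutive words -> 16 different banks.
   float good = data[0][tx];

   // 16-way conflict: a stride of 16 words maps every thread to bank 0,
   // so the 16 reads are serialized (the data stall noted above).
   float bad = data[tx][0];

   out[tx] = good + bad;                     // keep the reads from being optimized away
}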
More Detail on GPU Architecture
 Exploiting the Texture Samplers
► Designed  to map textures onto 3D polygons
► Specialty hardware pipelines for:
     Fast data sampling from 1D, 2D, 3D arrays
     Swizzling of 2D, 3D data for optimal access
     Bilinear filtering in zero cycles
     Image compositing & blending operations
► Arrays indexed by u,v,w coordinates – easy
  to program
► Extremely well suited for multigrid & finite
  difference methods – example later
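As a sketch only, using the texture-reference API of that CUDA generation (since deprecated), here is a 1D central-difference stencil read through the texture cache; texA, FDTexKernel, and inv_h2 are illustrative names, not from the slides:

texture<float, 1, cudaReadModeElementType> texA;        // texture reference bound to linear memory

__global__ void FDTexKernel(float *d2a, int n, float inv_h2)
{
   int i = threadIdx.x + blockIdx.x * blockDim.x;
   if (i > 0 && i < n - 1)
   {
       float left   = tex1Dfetch(texA, i - 1);           // neighbors fetched through the texture cache
       float center = tex1Dfetch(texA, i);
       float right  = tex1Dfetch(texA, i + 1);
       d2a[i] = (left - 2.0f * center + right) * inv_h2; // central-difference second derivative
   }
}
// Host side: cudaBindTexture(0, texA, d_a, n * sizeof(float)); then launch as usual.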
Experiments in Computational Nanotech on GPU

   Finite Element Modeling
   Electronic Structure (DFT)
   Molecular Dynamics & Monte Carlo
 HP XW9400 with Quad AMD CPU
& Dual nVidia Quadro 5600 GPUs
    = A Teraflop Workstation?
    Molecular Dynamics Trial
► Lennard-Jones inter-atomic potential
► Verlet integration
► Normalized coordinates
► FCC lattice in an NxNxN simulation cell
► Periodic boundary conditions
► Trials with Rc = ∞ and Rc = 3.0
► Tested nVidia 8800 GPU vs a 3.0 GHz Intel P4
► OpenGL used to implement MD on GPU (a CUDA-style sketch follows)
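The slides note the GPU version was written in OpenGL; what follows is only a CUDA-flavored sketch of the brute-force NxN Lennard-Jones force sum in reduced units (epsilon = sigma = 1), with the cutoff and periodic boundaries omitted and all names illustrative:

#define MD_THREADS 256

__global__ void LJForceKernel(float3 *pos, float3 *force, int n)  // assumes n is a multiple of MD_THREADS
{
   __shared__ float3 pos_sh[MD_THREADS];           // tile of positions staged in shared memory
   int i = threadIdx.x + blockIdx.x * blockDim.x;  // particle owned by this thread
   float3 pi = pos[i];
   float3 f  = make_float3(0.0f, 0.0f, 0.0f);

   for (int J = 0; J < n; J += MD_THREADS)         // loop over tiles of the other particles
   {
       pos_sh[threadIdx.x] = pos[J + threadIdx.x];
       __syncthreads();
       for (int j = 0; j < MD_THREADS; j++)
       {
           if (J + j == i) continue;               // skip self-interaction
           float dx = pi.x - pos_sh[j].x;
           float dy = pi.y - pos_sh[j].y;
           float dz = pi.z - pos_sh[j].z;
           float r2     = dx*dx + dy*dy + dz*dz;
           float inv_r2 = 1.0f / r2;
           float inv_r6 = inv_r2 * inv_r2 * inv_r2;
           float s = 24.0f * inv_r2 * inv_r6 * (2.0f * inv_r6 - 1.0f); // = -(dV/dr)/r for V = 4(r^-12 - r^-6)
           f.x += s * dx;  f.y += s * dy;  f.z += s * dz;
       }
       __syncthreads();
   }
   force[i] = f;
}

A Verlet step would then update positions and velocities from force[i] in a separate, trivially parallel kernel.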
MD Timing Tests (NxN brute force)

# Cells    # Atoms    Time/Step    Time/Step    Performance
(X,Y,Z)    (Total)    GPU (s)      CPU (s)      Differential
2          32         0.000308     0.000429     139%
3          108        0.00039      0.004513     1157%
4          256        0.000391     0.025295     6469%
5          500        0.000596     0.092766     15565%
6          864        0.001274     0.27681      21728%
7          1372       0.002845     0.689375     24231%
8          2048       0.005665     1.547        27308%
MD Timing Results (bins & Rc = 9 Å)

# Cells    # Atoms    Time/Step    Timesteps
(X,Y,Z)    (Total)    GPU (ms)     Per sec
8          2,048      0.532        1879.7
16         16,384     1.984        504.03
32         131,072    16.157       61.89
40         256,000    36.515       27.38
50         500,000    70.985       14.08
Hardware Accelerated DFT Test
   • Real-space grid method (Beck, Bryant, …)
   • LDA, localized basis functions
   • Iterative solution of the Kohn-Sham equations
   • Finite difference methods
   • Multigrid with FMG-FAS
   • Weighted Jacobi relaxation (sketched below)
   • Mehrstellen discretization
   • Gram-Schmidt orthogonalization on the low-res grid
   • 64x64x64 grid x 4 orbitals
   • 8 H nuclei, 8 electrons
   • >1M data elements
   • 81 ms computation time!
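As a sketch of the smoother only (not the authors' code), one weighted-Jacobi sweep for a 7-point finite-difference Laplacian on a 3D grid; omega and all names here are illustrative assumptions:

__global__ void WeightedJacobiKernel(const float *u, float *u_new, const float *f,
                                     int nx, int ny, int nz, float h2, float omega)
{
   int x = threadIdx.x + blockIdx.x * blockDim.x;
   int y = threadIdx.y + blockIdx.y * blockDim.y;
   int z = blockIdx.z;                                // one z-slice per grid index in z

   if (x <= 0 || x >= nx - 1 || y <= 0 || y >= ny - 1 || z <= 0 || z >= nz - 1)
       return;                                        // leave boundary values untouched

   int i = x + nx * (y + ny * z);
   float nbrs = u[i - 1]       + u[i + 1]             // +/- x neighbors
              + u[i - nx]      + u[i + nx]            // +/- y neighbors
              + u[i - nx * ny] + u[i + nx * ny];      // +/- z neighbors
   float jacobi = (nbrs - h2 * f[i]) / 6.0f;          // plain Jacobi update for laplacian(u) = f
   u_new[i] = (1.0f - omega) * u[i] + omega * jacobi; // weighted (damped) Jacobi, e.g. omega = 2/3
}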
                    Where Next?
 G90 Double-Precision GPU in Spring 2008

   G90 GPU             nVidia Quadro       nVidia QuadroPlex
 Double precision      PC Workstation      Cluster – 16 PC Nodes
   1 Teraflop            4 Teraflops          64 Teraflops
     ~$2500               ~$15,000              ~$300,000
   NanoCAD in the Allosphere
California Nano Systems Institute
        How to Find out More
► Download   CUDA and docs from nVidia
   http://developer.nvidia.com/object/cuda.html
► Buy a $600 nVidia GeForce 8800GTX
► Get one free through their developer
  program (talk to me after class)
► CUDA Programming Course through CS
   Fall 07 or Winter 08
   Tobias Hollerer & Myself
► NanoCAD collaborative development –
   www.powerofminus9.net

				