					 High Performance Computing on
Graphics Processing Units (GPUs)

               David Kaeli
            ECE Department
          Northeastern University
               Boston, MA

               June 17, 2008
About Dave Kaeli
Past
 BS in EE from Rutgers University, 1981
 Started working for IBM on the mainframe
 MS @ Syracuse while working full-time
 Moved to IBM Research @ Yorktown
 PhD in 1992 while working full-time
 Moved to NU in 1993
Today
 Full Prof and Undergraduate Chair for ECE
 Married to a teacher and have 2 talented
  daughters Melissa and Emma
 I coach 2 soccer teams, and play on a
  third
 I serve on the CRA CCC Council – the subject of
  this afternoon’s talk
Motivation
      Explore the use of new parallel computing
       technology for addressing critical societal issues
       that require large computational resources
    “Hundreds of millions of people depend upon reefs for all or part of
     their livelihood” (International Coral Reef Action Network)
     [Image credit: National Oceanic and Atmospheric Administration, USA]

    “Every three minutes a woman is diagnosed with breast cancer”
     (American Cancer Society)
     [Image credit: National Cancer Institute, USA]
    GPGPU
 Graphics Processing Units (NVIDIA, ATI, IBM Cell) provide massive
  parallelism at low cost
 Being applied to the following imaging applications
    Hyperspectral Image Reconstruction – Reef health imaging
    Breast Cancer Tomosynthesis Reconstruction – High
     resolution/contrast cancer screening
    Cardiac CT Reconstruction – Identify vulnerable plaque in the heart
    Phase Unwrapping for In-vitro Fertilization Studies – Reduce the
     number of birth defects
    Trauma-Pod – Searching for shrapnel from roadside bombs in Iraq
GPU Unified Architecture approach
 GPGPU: General Purpose (GP) applications
  programmed on GPU using Graphics APIs (OpenGL,
  DirectX)
 GPU Unified Architecture
    Pixel and Vertex Processors became one
     programmable unit
    Implementation does not require us to use
     Graphics APIs
    New APIs aim to popularize the use of GPUs
     as highly multi-threaded CPU coprocessors
     NVIDIA CUDA
        C language extension (deeper SW layer than CTM)
        Libraries provided by NVIDIA (CUBLAS, CUFFT, more soon)
     ATI/AMD CTM
        Set of commands – GPU is programmed almost directly in assembly
         (steeper learning curve than CUDA)
        Libraries to be developed by the development community
Tesla GPU - Architecture Overview

  16 Multiprocessors (MPs), sharing 1.5GB of RAM (Global Memory)
  Per MP:
     8 Functional units (FUs)
     8192 Registers
     16KB shared cache
NVIDIA’s CUDA Programming Environment


   Runs C code (with a subset of the C standard library) on
    NVIDIA GPUs

   Uses an SPMD (Single Program, Multiple Data) model

  Coordinates execution of large numbers of
   threads on SIMD GPU hardware
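
As a concrete illustration of the SPMD style (an assumed minimal sketch, not code from the slides), every thread below executes the same kernel and selects its own array element from its block and thread indices; the kernel name, the scale factor, and the 256-thread launch are illustrative choices only.

// Minimal SPMD sketch (illustrative, not from the slides): every thread runs
// the same kernel body, but on a different element of the array.
__global__ void scale(float *data, float factor, int n)
{
    // Each thread derives a unique global index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                     // guard threads that fall past the end
        data[i] *= factor;
}

// Host-side launch: enough 256-thread blocks to cover all n elements, e.g.
//     scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);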
CUDA Programming Model/Microarchitecture

  Threads are organized into blocks, and blocks into grids (GRID 1 … GRID n’);
   a configuration sketch follows this slide
     Maximum threads per block: 512
     Threads in a block are executed in SIMD groups of 32
     Maximum 65,535 blocks per grid dimension
     Grids are executed sequentially
  A block is processed by only one multiprocessor
     Maximum threads running concurrently per multiprocessor: 768
  Per multiprocessor: 8 streaming processors (SP1–SP8), 8192 registers,
   16KB shared memory organized in 16 banks
  Constant and Texture Memory: 64KB each
  Global Memory: 768 MB – 1.5 GB
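
To make the block and grid limits above concrete, here is a hedged sketch (an assumed example, not the slides' code) of a 2-D launch configuration sized for a detector-resolution image; the kernel body and dimensions are placeholders.

// Illustrative 2-D launch configuration (assumption, not the authors' code).
// A 16x16 block holds 256 threads, safely under the 512 threads/block limit.
__global__ void touchPixel(float *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height)
        img[y * width + x] += 1.0f;                  // placeholder per-pixel work
}

void launchExample(float *d_img)
{
    const int width = 1196, height = 2304;           // detector-sized image
    dim3 block(16, 16);                              // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,      // 75 blocks in x
              (height + block.y - 1) / block.y);     // 144 blocks in y
    // Both grid dimensions stay far below the 65,535 blocks-per-dimension limit.
    touchPixel<<<grid, block>>>(d_img, width, height);
}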
 2 Example Applications


 3-D Tomosynthesis Image
  Reconstruction

 4-D Cardiac Computed Tomography
Tomosynthesis Image Reconstruction          Provided by MGH Breast Imaging

  An X-ray source acquires 15 views; each view is recorded as an X-ray
   projection on a 1196x2304 detector
  Iterative reconstruction loop (sketched in code below):
     Set 3D volume (initial guess)
     Compute projections from the volume (Forward)
     Correct the 3D volume against the measured projections (Backward)
  Output: 3D volume (1196x2304x45)

  [Image: 5018 Conventional MLO mammogram]
  [Image: Tomosynthesis of 5018-8mm]
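
The loop above can be summarized as a host-side skeleton. This is only a structural sketch under assumed names: forwardProjectKernel, backProjectKernel, the launch configuration, and the buffer parameters are placeholders, not the project's actual implementation.

// Structural sketch of the iterative reconstruction loop (all names and the
// empty kernel bodies are placeholders for the real forward/backward steps).
__global__ void forwardProjectKernel(const float *volume, float *projections)
{ /* placeholder: project the current volume estimate into the 15 views */ }

__global__ void backProjectKernel(const float *projections,
                                  const float *measured, float *volume)
{ /* placeholder: correct the volume from measured vs. simulated projections */ }

void reconstruct(float *d_volume,          // 1196 x 2304 x 45 volume (initial guess)
                 float *d_projections,     // simulated projections (15 views)
                 const float *d_measured,  // measured X-ray projections
                 int iterations)
{
    dim3 block(16, 16), grid(75, 144);     // illustrative launch configuration
    for (int it = 0; it < iterations; ++it) {
        forwardProjectKernel<<<grid, block>>>(d_volume, d_projections);          // Forward
        backProjectKernel<<<grid, block>>>(d_projections, d_measured, d_volume); // Backward
    }
    cudaDeviceSynchronize();               // wait for the last iteration to finish
}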
     Tomosynthesis acceleration on a GPU
                 Speedup and $/sec of Breast Tomosynthesis
                   Reconstruction* on a NVIDIA GTX8800

  [Chart, left panel: execution time (sec) vs. number of iterations
   (1, 4, 8, 16) for the GTX8800, Cluster A (2, 4, 8 servers),
   Cluster B (4, 8, 16 servers), and a serial workstation]
  [Chart, right panel: cost/performance (dollars/sec) vs. number of
   iterations (1, 4, 8, 16) for the GTX8800, Cluster A (8 servers),
   and Cluster B (16 servers)]

                 * Collaboration with Chanelle Green (Spelman), Waleed Meleis (NU),
                   Richard Moore (MGH) and Dr. Daniel Kopans (MGH)
Impacting heart disease with GPUs


 Currently, coronary heart disease (CHD) is the
single leading cause of death in America
 CT imaging can be used to identify vulnerable
plaque buildup
     Speedup of Algebraic Reconstruction* of
     a Cardiac CT image on a NVIDIA GTX8800
                                       Forward         Backward
                                      Projection       Projection
 Serial C Code – 3GHz Xeon             337 secs         65 secs
 GPU Code – GTX8800                     14 secs          2 secs
 Speedup                                  24X            32.5X
 * Collaboration with Clem Karl (BU) and Homer Pien (MGH)
CUDA-Strengths

 Easy to program (small learning curve)

 Success with several complex
  applications
   At least 7X faster than CPU stand-alone
    implementations

 Allows us to read and write data at any
  location in the device memory

 More fast memory close to the processors
  (registers + shared memory)
  CUDA-Limitations

 Some hardwired graphics components are
  hidden from the programmer
 Better tools are needed
   Profiling
   Memory blocking and layout
   Binary Translation
 Difficult to find optimal values for CUDA
  execution parameters (see the timing sketch below)
      Number of threads per block
      Dimension and orientation of blocks and grids
      Use of on-chip memory resources, including registers and
       shared memory
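
Since there is no closed-form way to choose these parameters, a common practical approach is to time the kernel at several block sizes and keep the fastest. The sketch below is an assumed example (the kernel and problem size are placeholders) using CUDA event timing.

// Hedged sketch of an empirical block-size sweep (kernel and sizes are placeholders).
#include <cstdio>

__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 1.5f + 2.0f;          // stand-in for real work
}

int main()
{
    const int n = 1 << 22;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int candidates[] = {64, 128, 192, 256, 384, 512};    // 512 is the per-block limit
    for (int c = 0; c < 6; ++c) {
        int threads = candidates[c];
        int blocks  = (n + threads - 1) / threads;

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        dummyKernel<<<blocks, threads>>>(d_data, n);     // time one launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3d threads/block: %.3f ms\n", threads, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(d_data);
    return 0;
}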
*P: A Semi-Automatic Parallel Approach

 Parallel utility developed by ISC and built on MPI
 Runs transparently on supported high level languages –
  MATLAB, Python
 Semi-automated data distribution and automatic parallel
  execution
 Utilizes parallel libraries: ScaLAPACK, PBLAS, etc.
 Transparent
    Minimal code modifications
    Very small learning curve in comparison with MPI

       Star-P Client – Matlab or Python serial source code
       Client Services – secure connection, data distribution,
        automatic parallelization
       Star-P Server
       Compute Engine – task parallel and data parallel, built on MPI
      *P Programming Model

 Data parallelism
          Automatic data management (distribution of data using *P command)
 Loop parallelism (a thread per iteration)
          mechanism for loop parallelism: ppeval




      DATA PARALLEL: matrices A and B on the local CPU are partitioned by
      *P into pieces A1/B1 … AN/BN, one per processor P1 … PN
      LOOP PARALLEL: iterations 1 … N on the local CPU are distributed by
      *P (ppeval) across processors P1 … PN
*P applied to Tomosynthesis Reconstruction

    Speedup - 11.5X over serial MATLAB while running on a
    4-node Xeon-based cluster

    Programming Effort
       Vectorization was challenging (~8 hours)
           Algorithm was initially designed for C
       ppeval required 6 lines of code modification
       7 *P commands were required to distribute data


   Lower performance than CUDA, but a lot less
   programming effort needed!
   Lot’s of interesting work to do

 CUDA and *P technologies provide different
  performance/programming effort tradeoffs
 How can we combine multiple GPUs with *P to obtain
  efficiency in both processing speed and development
  speed?
    We are presently exploring ways to improve parallelization
     automation both for *P and CUDA
    Preliminary work has begun on using *P to exploit GPU
     nodes on a distributed cluster

      Matrix Multiply on 3D matrices (8192x8192 slices), time per slice:
         MATLAB:                          300 sec.
         *P (16 dual-core processors):     32 sec.
         2 G80 GPUs with *P:              6.6 sec.
What research is being done on GPUs?

  Binary translation to reduce programming
   effort
  New GPUs by Intel and AMD
    Intel Larrabee
    AMD Fusion
  NVIDIA continues to push performance and
   cores
    240 cores @ 1.3 GHz in the 280 GTX
  IBM looking at how to best combine GPUs,
   FPGAs and CPUs on a single platform
  Check out www.gpgpu.org for the latest
   developments in GPGPU computing
  How do you get started?

 Download the NVIDIA GTX/CUDA emulator
 Buy a GPU (~$500) – supercomputing
  performance at a fraction of the cost
 Talk to someone from science or engineering
  about a large computational task that they
  want help on
 Allows for many interdisciplinary projects
 You can also use any other GPU for the
  implementation (AMD/ATI and IBM/Cell have
  their own programming environments)

				