An overview of GPU Computing





Betatesting Group
Centre for Development of Advanced Computing (C-DAC),
Pune University
vcvrao@cdac.in

International Workshop on High Performance Computing in Observational
Astronomy: Requirements and Challenges ( IUCAA – Pune – Oct 13, 2009)


    VCV.Rao. Betatesting Group, C-DAC, Pune                             1
       An Overview of GPU Computing


Lecture Outline
The following topics will be discussed:

• Part-I : Historical Perspective
• Part-II : An Overview of GPU Computing
• Part-III : GPGPUs / GPU Computing Software Products
• Part-IV : Applications - Performance Issues

  Source : NVIDIA, AMD, Intel & References given in the presentation


                            GPU Computing
                            History - An Overview




    GPU : Massively parallel compute offload


• Performance
• Power
• Price
• Programming Models

Moore’s Law → 2x < 18 Months
Frequency / Power / Complexity Wall
Parallel → Opportunity for growth

The GPU is the first successful massively parallel COMMODITY
architecture with a programming model that manages to use
1000’s of parallel threads in hardware to perform useful work
efficiently. With innovative algorithm design, applications show
accelerations of 5x, 10x, or 20x compared to CPU performance.



What is a GPU?
Graphics Processing Unit

• A GPU is also occasionally called a visual processing unit (VPU).

• It is a dedicated graphics rendering device for a personal computer,
  workstation, or game console.

• A GPU is viewed as a compute device that:
   • Is a coprocessor to the CPU or host machine
   • Has its own DRAM (on the device)
   • Runs many threads in parallel
• A GPU is a dedicated, super-threaded, massively data-parallel co-processor.

[Figure: "Without GPU" vs. "With GPU": an Application running on the CPU and on the CPU plus GPU]

Source : References
                              Why Are GPUs So Fast?


• GPUs were originally specialized for math-intensive, highly parallel
  computation
• So, more transistors can be devoted to data processing rather than data
  caching and flow control

[Figure: CPU vs. GPU chip layout (Control, ALU, Cache, DRAM); vendor labels: AMD, NVIDIA]

• Commodity industry: provides economies of scale
• Competitive industry: fuels innovation


 Computer Graphics GPU – Programmable Pipeline
• Dealing with a complex graphics API
• Sequential flow of execution
• Limited communication

Pipeline stages: Application → Command → Geometry → Rasterization → Texture → Fragment → Display

[Figure: Fragment Program with Input Registers, Texture, Constants, Temp Registers, Output Registers and FB Memory; legend: per thread, per Shader, per Context]
Computer Graphics GPU – Programmable Pipeline
Pipeline stages: Application → Command → Geometry → Rasterization → Texture → Fragment → Display

[Figure: the 3D Application and 3D API run on the CPU; on the GPU, the GPU Front End feeds Primitive Assembly, Rasterization, Raster Operations and the Frame Buffer; the Shaders are the Vertex Processor and the Fragment Processor]


Computer Graphics GPU Programmable Pipeline

• Hardware mimicked graphics APIs
• It is possible to formulate many problems in this framework
   • Uses graphics APIs
   • Classical GPGPU

[Figure: pipeline CPU → Geometry → Rasterize → Shade pixels → Display]

DO NOT DO THIS ANYMORE! (Unless for graphics)



                                An Overview of
                             GPU Computing




                  GPU Computing : Think in Parallel

Some Design Goals

• Scale to 100’s of cores, 1000’s of parallel threads
• Let programmers focus on parallel algorithms & re-writing the code
   • Not on the mechanics of a parallel programming language
• Enable heterogeneous systems (i.e. CPU + GPU)
   • CPU and GPU are separate devices with separate DRAMs

Threads 0 1 2 3 4 5 6 7 each execute:

……
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…

                       GPU Computing : Think in Parallel


• Performance = parallel hardware + scalable parallel program

• GPU Computing drives new applications
   • Reducing “Time to Discovery”
   • 100x speedup changes science & research methods
• New applications drive the future of GPUs
   • Drives new GPU capabilities
   • Drives hunger for more performance

                GPU Computing : Think in Parallel

• Speedups of 8x to 30x are quite common for certain classes of applications
• The GPU is a data-parallel processor
   • Thousands of parallel threads
   • Thousands of data elements to process
   • All data processed by the same program
        – SPMD computation model
   • Contrast with task parallelism and ILP
• Best results when you “Think Data Parallel”
   • Design your algorithm for data-parallelism
   • Understand parallel algorithmic complexity and efficiency
   • Use data-parallel algorithmic primitives as building blocks
                       GPU Computing : Think in Parallel


Why Are GPUs So Fast?

• Optimized for structured parallel execution
   • Extensive ALU counts & memory bandwidth
   • Cooperative multi-threading hides latency
• Shared instruction resources
• Fixed-function units for parallel workload dispatch
• Extensive exploitation of locality

• Performance / (Cost/Watt); Power for Core
• Structured parallelism enables more flops for fewer watts

                      GPU Computing : Think in Parallel


GPU Computing : Optimise Algorithms for the GPU

• Maximize independent parallelism
• Maximize arithmetic intensity (math/bandwidth)
• Sometimes it’s better to recompute than to cache
   • The GPU spends its transistors on ALUs, not memory
• Do more computation on the GPU to avoid costly data transfers
   • Even low-parallelism computations can sometimes be faster than
     transferring back and forth to the host


                     GPU Computing : Think in Parallel



 GPU Computing : Use Parallelism Efficiently

• Partition your computation to keep the GPU multiprocessors equally busy
   • Many threads, many thread blocks
• Keep resource usage low enough to support multiple active thread blocks
  per multiprocessor
   • Registers, shared memory




                   GPU Computing : Think in Parallel

GPU Computing : Take Advantage of Shared Memory
• Hundreds of times faster than global memory
• Threads can cooperate via shared memory
• Use one / a few threads to load / compute data shared by all threads
• Use it to avoid non-coalesced access
   • Stage loads and stores in shared memory to re-order non-coalesceable
     addressing
   • Matrix transpose example later
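
The staging idea can be sketched as a tiled matrix transpose. This is a minimal illustration, not the example the lecture refers to: the names `transpose_tiled` and `transpose_on_gpu`, the 16 x 16 tile size, and the one-column padding are all assumptions chosen for the sketch.

```cuda
#include <cuda_runtime.h>
#include <cassert>

#define TILE_DIM 16   // assumed tile edge; one thread per tile element

// Transposes a rows x cols row-major matrix into a cols x rows matrix.
__global__ void transpose_tiled(const float* in, float* out, int rows, int cols)
{
    // One extra column of padding keeps column-wise reads of the tile
    // from hitting the same shared-memory bank.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;   // input column
    int y = blockIdx.y * TILE_DIM + threadIdx.y;   // input row
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];   // row-wise read

    __syncthreads();   // the whole tile must be staged before it is re-read

    // Swap the block indices so neighbouring threads write neighbouring addresses.
    x = blockIdx.y * TILE_DIM + threadIdx.x;   // output column (an input row)
    y = blockIdx.x * TILE_DIM + threadIdx.y;   // output row (an input column)
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // row-wise write
}

// Hypothetical host wrapper: copy in, launch, copy out. Returns false on any CUDA error.
bool transpose_on_gpu(const float* h_in, float* h_out, int rows, int cols)
{
    assert(h_in != nullptr && h_out != nullptr && rows > 0 && cols > 0);
    size_t bytes = static_cast<size_t>(rows) * cols * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess
           && cudaMalloc(&d_out, bytes) == cudaSuccess
           && cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(TILE_DIM, TILE_DIM);
        dim3 grid((cols + TILE_DIM - 1) / TILE_DIM, (rows + TILE_DIM - 1) / TILE_DIM);
        transpose_tiled<<<grid, block>>>(d_in, d_out, rows, cols);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);    // cudaFree(nullptr) is a no-op
    cudaFree(d_out);
    return ok;
}
```

Both the global read and the global write walk along rows, so the strided access pattern is confined to the shared-memory tile.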

          GPU Programming : Two Main Challenges

GPU Challenges with regard to Scientific Computing

Challenge 1 : Programmability
• Example : Matrix Computations
   • To port an existing scientific application to a GPU
• GPU memory exists on the card itself
   • Must send matrix arrays over the PCI-Express bus
      – Send A, B, C to the GPU over PCIe
      – Perform GPU-based computations on A, B, C
      – Read result C from the GPU over PCIe
• The user must focus considerable effort on optimizing performance by
  manually orchestrating data movement and managing thread-level
  parallelism on the GPU.
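
The three PCIe steps listed above might look as follows for the update C = A * B + C on square matrices. This is a sketch under assumptions: the slide does not say which matrix computation is meant, the kernel is deliberately naive (no tiling), and `matmul_accumulate` and `gpu_matmul_accumulate` are illustrative names.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Naive kernel: one thread computes one element of C = A * B + C (n x n, row-major).
__global__ void matmul_accumulate(const float* A, const float* B, float* C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = C[row * n + col];
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

// Hypothetical host wrapper showing the explicit data movement.
// C is updated in place; returns false on any CUDA error.
bool gpu_matmul_accumulate(const float* A, const float* B, float* C, int n)
{
    assert(A != nullptr && B != nullptr && C != nullptr && n > 0);
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    // Step 1: send A, B, C to the GPU over PCIe.
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess
           && cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dC, C, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // Step 2: perform the GPU-based computation on A, B, C.
        dim3 block(16, 16);
        dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
        matmul_accumulate<<<grid, block>>>(dA, dB, dC, n);
        // Step 3: read result C from the GPU over PCIe.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA);   // cudaFree(nullptr) is a no-op
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}
```

Four of the host calls are transfers and only one is computation, which is why keeping data resident on the GPU between kernels matters.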
         GPU Programming : Two Main Challenges


Challenge 2 : Accuracy
     • Example : Non-Scientific Computation - Video Games (Frames)
       (A single-bit difference in a rendered pixel in a real-time graphics
       program may be discarded when generating subsequent frames)

     • Scientific Computing : A single-bit error propagates into the overall
       error

     • Past History : Most GPUs support single-precision, 32-bit
       floating-point operations, and not all GPUs have necessarily
       implemented the full IEEE Standard for Binary Floating-Point
       Arithmetic (IEEE 754)
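
How rounding error accumulates can be seen by adding the same small value many times in single and in double precision. The sketch below is an illustration only: `repeated_sum`, `sum_kernel` and `compare_precisions` are hypothetical names, and the kernel uses a single thread on purpose so that only precision, not parallel ordering, affects the result.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Adds `value` to an accumulator `count` times, in the precision of T.
template <typename T>
__host__ __device__ T repeated_sum(T value, int count)
{
    T sum = 0;
    for (int i = 0; i < count; ++i)
        sum += value;   // each addition rounds to the precision of T
    return sum;
}

// One GPU thread computes the same sum in float and in double.
__global__ void sum_kernel(float* f_out, double* d_out, int count)
{
    *f_out = repeated_sum<float>(0.1f, count);
    *d_out = repeated_sum<double>(0.1, count);
}

// Hypothetical host wrapper; returns false on any CUDA error.
bool compare_precisions(int count, float* f_sum, double* d_sum)
{
    assert(count > 0 && f_sum != nullptr && d_sum != nullptr);
    float* dev_f = nullptr;
    double* dev_d = nullptr;
    bool ok = cudaMalloc(&dev_f, sizeof(float)) == cudaSuccess
           && cudaMalloc(&dev_d, sizeof(double)) == cudaSuccess;
    if (ok) {
        sum_kernel<<<1, 1>>>(dev_f, dev_d, count);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(f_sum, dev_f, sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(d_sum, dev_d, sizeof(double), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev_f);   // cudaFree(nullptr) is a no-op
    cudaFree(dev_d);
    return ok;
}
```

Calling `compare_precisions(1000000, &f, &d)` and comparing both results with the exact value 100000 shows a much larger deviation for the single-precision sum than for the double-precision one.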




         GPU Programming : Two Main Challenges

GPU Challenges with regard to Scientific Computing
• Most recent GPUs improve IEEE compliance and add essential
  double-precision support

• The computational power offered by GPUs is excellent

• Clusters of GPUs for certain classes of applications
   • Issues – Mixed programming environment (MPI, NVIDIA CUDA programming,
     AMD FireStream Brook+ programming, OpenCL programming)
   • Issues – Memory issues, reliability, error detection and correction
     techniques
• A wide range of applications can be mapped to GPUs for performance
  benefits.
                       GPU Computing : Think in Parallel

• Good strategies for extracting high performance from individual
  subsystems on the GPU
• NVIDIA CUDA / AMD FireStream Brook+ (plenty of opportunities for further
  optimizations) & Intel Larrabee

[Figure: vendor logos: Intel, AMD, NVIDIA, OpenCL]




                     GPGPUs / GPU Computing
                        Software Products
                     NVIDIA – GPU Computing




         NVIDIA – GPU computing Products - History

• Finance Applications
   • Pricing and risk
   • Higher accuracy, faster
• Oil & Industry / Seismic
   • Energy discovery
   • Broad adoption
• Defense
   • Signal analysis
   • Very high need for computer resource
• Supercomputing
   • World-class science
   • Top 500
• Universities
   • Desk supercomputing
   • Thousands of customers


Nvidia estimates that the total available market (TAM) for GPU computing is at
least half as large as the desktop-PC market for GPUs. The upside potential
looks greater. Whereas the PC market is maturing, GPU computing barely
existed four years ago and is growing fast.

   NVIDIA - GPU Computing CUDA Kernels and Threads

Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
   • All threads run the same code
   • Each thread has an ID that it uses to compute memory addresses and
     make control decisions


            threadID            0   1   2    3   4   5   6    7



                                ……
                                float x = input[threadID];
                                float y = func(x);
                                output[threadID] = y;
                                ……
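
The fragment above could be fleshed out into a complete kernel and launch along these lines. This is a minimal sketch: the slide does not say what `func` computes, so the squaring used here is an arbitrary placeholder, and `apply_func` and `run_apply_func` are illustrative names.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Placeholder per-element function (assumption: the slide leaves func unspecified).
__device__ float func(float x) { return x * x; }

// Every thread runs this same code on the element selected by its thread ID.
__global__ void apply_func(const float* input, float* output, int n)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {               // control decision based on the ID
        float x = input[threadID];    // memory address computed from the ID
        float y = func(x);
        output[threadID] = y;
    }
}

// Hypothetical host wrapper: copy in, launch, copy out. Returns false on any CUDA error.
bool run_apply_func(const float* h_in, float* h_out, int n)
{
    assert(h_in != nullptr && h_out != nullptr && n > 0);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess
           && cudaMalloc(&d_out, bytes) == cudaSuccess
           && cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        apply_func<<<blocks, threadsPerBlock>>>(d_in, d_out, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);    // cudaFree(nullptr) is a no-op
    cudaFree(d_out);
    return ok;
}
```

With eight input elements a single block is launched and threads 0 to 7 each process one element, as in the diagram; the remaining threads of the block fail the bounds check and do nothing.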




NVIDIA : CUDA - Quick terminology review

• Thread: concurrent code and associated state executed on the CUDA device
  (in parallel with other threads)
   • The unit of parallelism in CUDA
   • Note the difference from CPU threads: the creation cost, resource
     usage, and switching cost of GPU threads are much smaller

• Warp: a group of threads executed physically in parallel (SIMD)
• Thread Block: a group of threads that are executed together and can
  share memory on a single multiprocessor
• Grid: a group of thread blocks that execute a single CUDA program
  logically in parallel
• Device: GPU; Host: CPU; SM: Streaming Multiprocessor
   NVIDIA GPU Computing - CUDA Kernels and Threads

• Parallel portions of an application are executed on the device as kernels
    • One kernel is executed at a time
    • Many threads execute each kernel
• Differences between CUDA and CPU threads
    • CUDA threads are extremely lightweight
        – Very little creation overhead
        – Instant switching
    • CUDA uses 1000’s of threads to achieve efficiency
        – Multi-core CPUs can use only a few

Definitions
Device = GPU; Host = CPU
Kernel = function that runs on the device
     NVIDIA GPU Computing - CUDA Kernels and Threads

• NEW: GPU Computing with CUDA
    • CUDA = Compute Unified Device Architecture
    • Co-designed hardware & software for direct GPU computing
• Hardware: fully general data-parallel architecture
    • General thread launch; global load-store
    • Parallel data cache
• Software: program the GPU in C / C++
    • Scalable data-parallel execution / memory model; single / double
      precision
• Shared memory (the parallel data cache) is hundreds of times faster than
  global memory
• Use one / a few threads to load / compute data shared by all threads

[Figure: Compute Unified Device Architecture software stack: the Application on the CPU sits above the CUDA Libraries, CUDA Runtime and CUDA Driver, which sit above the GPU]
NVIDIA GPU Computing - CUDA Kernels and Threads


CUDA Software Development
CUDA 2.3 with Tesla C1060 is used for GPU Computations



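[Figure: CUDA compilation flow. Integrated CPU + GPU C source code, together with the CUDA optimized libraries (math.h, FFT, BLAS, …), goes to the NVIDIA C Compiler. The compiler emits NVIDIA Assembly for Computing (PTX), which reaches the GPU through the CUDA Driver (with Profile), and CPU host code, which goes through a standard C compiler to the CPU]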


 NVIDIA GPU Computing - CUDA Kernels and Threads


Thread Cooperation
 • The Missing Piece: threads may need to cooperate
 • Thread co-operation is valuable
    • Share results to avoid redundant computation
    • Share memory accesses
          – Drastic bandwidth reduction
 • Thread co-operation is a powerful feature of CUDA
 • Cooperation between a monolithic array of threads is not scalable
    • Cooperation within smaller batches of threads is scalable


NVIDIA GPU Computing - CUDA Kernels and Threads

Thread Batching
  • Kernel launches a grid of thread blocks
       • Threads within a block cooperate via shared memory
       • Threads within a block can synchronize
       • Threads in different blocks cannot cooperate
  • Allows programs to transparently scale to different GPUs

   Grid
          Thread Block 0         Thread Block 1                 Thread Block n


                                                    ...
          Shared Memory         Shared Memory                   Shared Memory
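
A block-wise sum is one way to see these rules in action. In the sketch below, threads inside a block cooperate through shared memory and synchronize with `__syncthreads()`, while the per-block partial sums are combined on the host because blocks cannot cooperate within a launch. The names `block_sum` and `gpu_sum` and the block size of 256 are assumptions made for the illustration.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

#define BLOCK_SIZE 256   // assumed threads per block; must be a power of two

// Each block cooperatively sums up to BLOCK_SIZE elements into one partial result.
__global__ void block_sum(const float* in, float* partial, int n)
{
    __shared__ float cache[BLOCK_SIZE];        // visible to this block only
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                           // all loads done before reducing

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                       // barrier between steps
    }
    if (tid == 0)
        partial[blockIdx.x] = cache[0];        // one result per block
}

// Hypothetical host wrapper; returns false on any CUDA error.
bool gpu_sum(const float* h_data, int n, float* result)
{
    assert(h_data != nullptr && result != nullptr && n > 0);
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    std::vector<float> partial(blocks, 0.0f);
    float *d_in = nullptr, *d_partial = nullptr;
    bool ok = cudaMalloc(&d_in, n * sizeof(float)) == cudaSuccess
           && cudaMalloc(&d_partial, blocks * sizeof(float)) == cudaSuccess
           && cudaMemcpy(d_in, h_data, n * sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        block_sum<<<blocks, BLOCK_SIZE>>>(d_in, d_partial, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);        // cudaFree(nullptr) is a no-op
    cudaFree(d_partial);
    if (!ok) return false;
    *result = 0.0f;
    for (float p : partial) *result += p;      // blocks are combined on the host
    return true;
}
```

Because no block depends on another, the same code runs unchanged whether the GPU executes the blocks a few at a time or all at once.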




         NVIDIA CUDA GPU Programming Toolkit

Key Characteristics
• Decomposes work into a grid of thread blocks, commonly executed by a pool
  of SIMT multiprocessors

• Each thread block normally contains 64-512 threads, which are executed by
  the processing units within a single multiprocessor.

• Each SIMT multiprocessor executes a group of threads, known as a “warp”,
  in lock step.
• CUDA kernels are written as a serial code sequence, with barrier
  synchronization to enforce ordering of operations among peer threads
  within a thread block

          NVIDIA CUDA GPU Programming Toolkit

Key Characteristics

•   Where programmers can ensure high data locality, CUDA kernels are
    often an excellent starting point, similar to code for cache-based
    multi-core processors.

•   Effective use of GPU global memory is required

•   From a performance point of view, give the CPU useful work to overlap
    with asynchronous GPU kernel execution

•   Intelligent division of computations - CPU & GPU almost equally
    occupied (optimal balance of workload between CPU & GPU)
•   Speedup: it is better to under-load the CPU than to overload it
         NVIDIA CUDA GPU Programming Toolkit

Key Characteristics
• The expansion of a kernel into a grid of thread blocks is determined by
  the kernel launch parameters specified at runtime, and may be varied
  dynamically according to problem size or other attributes

• Handles data dependencies among instructions

• Handles data alignment requirements for high-performance global memory
  operations in CUDA

• The virtualization of processing resources provided by the CUDA
  programming model allows applications written with existing GPUs to
  scale up with future hardware designs.
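
Choosing the launch parameters at run time can be sketched as follows. The grid size is derived from the problem size by ceiling division, so the same code covers any `n`. The names `blocks_for`, `scale` and `scale_on_gpu` are illustrative, and the default of 128 threads per block is simply a value inside the 64-512 range quoted earlier.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Number of blocks needed to cover n elements (ceiling division).
inline int blocks_for(int n, int threadsPerBlock)
{
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Multiplies every element by `factor`; threads past the end do nothing.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Hypothetical host wrapper: the grid is sized from n when the call is made.
// Returns false on any CUDA error.
bool scale_on_gpu(float* h_data, int n, float factor, int threadsPerBlock = 128)
{
    assert(h_data != nullptr && n > 0);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_data = nullptr;
    bool ok = cudaMalloc(&d_data, bytes) == cudaSuccess
           && cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int blocks = blocks_for(n, threadsPerBlock);   // launch parameter chosen at run time
        scale<<<blocks, threadsPerBlock>>>(d_data, factor, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_data);   // cudaFree(nullptr) is a no-op
    return ok;
}
```

A GPU with more multiprocessors simply runs more of these blocks at the same time, which is the scaling property described in the last bullet.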
        NVIDIA – GPU computing Products - History

November 2006


GeForce 8800 - No. of programmable shaders : 112


G80 : (16 SMs - Streaming Multiprocessors)
    128 CUDA Cores per chip (16 SMs x 8 CUDA Cores per SM)

     The cores share common resources such as local memory, register
     files, load/store units and thread schedulers.
     With time slicing and fast thread switching, an SM can run
     thousands of parallel threads on these cores.

     > 350 Gflops – Single Precision
          NVIDIA – GPU Computing Products - History

NVIDIA GeForce 8800 GTX Block Diagram




Number of Stream Processors : 112
Peak : > 300 GFlops
Memory : 512 MB DDR3
Memory Clock freq. : 900 MHz
Memory Bandwidth : 57.6 GB/s
Memory Interface : 256-bit
Interface Type : PCI Express 2.0 x16
-DIMM @ 800 MHz
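
The quoted memory bandwidth follows from the interface width and the memory clock. The host-side sketch below reproduces the arithmetic; `peak_bandwidth_gbs` is a hypothetical helper, and the factor of two transfers per clock is an assumption based on the memory being double data rate.

```cuda
#include <cassert>

// Peak bandwidth in GB/s =
//   (bus width in bytes) x (memory clock in Hz) x (transfers per clock) / 1e9
inline double peak_bandwidth_gbs(int busWidthBits, double memClockMHz, int transfersPerClock)
{
    assert(busWidthBits > 0 && busWidthBits % 8 == 0);
    assert(memClockMHz > 0.0 && transfersPerClock > 0);
    double bytesPerTransfer = busWidthBits / 8;
    return bytesPerTransfer * memClockMHz * 1e6 * transfersPerClock / 1e9;
}
```

With the figures listed above, `peak_bandwidth_gbs(256, 900.0, 2)` gives 32 bytes x 900 MHz x 2 = 57.6 GB/s.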

         NVIDIA – GPU Computing Products - History

NVIDIA G80 Block Diagram




A total of 128 stream processing units (CUDA Cores) are distributed
across sixteen Streaming Multiprocessors (SMs), each with shared
memory, cache and registers

          NVIDIA – GPU Computing Products - History

June 2008                             Vital for Scientific and Engineering Programs

  Retained 8 cores per SM, but the number of SMs is increased to 30.



 GT200 : (30 SMs - Streaming Multiprocessors)
    240 CUDA Cores per chip (30 SMs x 8 CUDA Cores per SM)

       The cores share common resources such as local memory, register
       files, load/store units and thread schedulers.
       With time slicing and fast thread switching, an SM can run
       thousands of parallel threads on these cores.

       GTX280, Quadro; HPC : Tesla T10, C1060
       Supports double precision
                       NVIDIA – GPU Computing Products - History
NVIDIA GT200 GPU Block Diagram
GT200 : Tesla architecture incorporated in Tesla C1060 & S1070 products.

[Figure: GT200 block diagram. The Streaming Processor Array consists of 10 Texture Processor Clusters (TPCs). Each TPC contains a Texture Unit (1/2/3-D interpolation; 8 KB read-only spatial cache) and three Streaming Multiprocessors (SMs). Each SM contains Instruction L1 and Data L1 caches, Instruction Fetch/Dispatch, Shared Memory, a Constant Cache (64 KB, read-only), eight Streaming Processors (SP: ADD, SUB, MAD, etc.), two Special Function Units (SFU: SIN, EXP, RSQRT, etc.) and an FP64 Unit (double precision)]
NVIDIA GT200 GPU Features and CUDA Prog Models


The GT200 architecture supports high memory bandwidth (140 GB/s) and
handles global memory latency effectively.

Warp : The instruction unit in the streaming multiprocessor drives the
eight scalar streaming processors with a single instruction stream that
is 32 threads wide.


SIMT : GT200 and its predecessors implement a single-instruction
multiple-thread (SIMT) instruction unit.


GT200 supports double-precision floating-point arithmetic, improved
hardware for coalescing global memory accesses, and bi-directional
overlapping of asynchronous I/O and GPU kernel execution.

NVIDIA GT200 GPU Features and CUDA Prog Models


  CUDA Prog. Model decomposes work into a grid of thread
  blocks that are concurrently executed by a pool of SIMT
  multiprocessors.

  Each thread block normally contains 64-512 threads, which are
  executed by the processing units within a single multiprocessor.


  Each SIMT multiprocessor executes a group of threads, known as a
  “warp”, in lock step.


  CUDA Kernels are written as a serial code sequence, with
  barrier synchronization to enforce ordering of operations
  among peer threads within a thread block.

NVIDIA GT200 GPU Features and CUDA Prog Models

  The expansion of a kernel into a grid of thread blocks is
  determined by the kernel launch parameters specified at
  runtime, and may be varied dynamically according to the
  problem size or other attributes.


  The virtualization of processing resources provided by the
  CUDA programming model allows applications written with
  existing GPUs to scale up with future hardware designs.



  The key issue that must be considered when designing
  CUDA kernels involves the decomposition of work into tiles
  that can be mapped to thread blocks, further decomposed
  into warps and individual threads.


            NVIDIA – GPU Computing Products - History


Tesla C1060 : The NVIDIA® Tesla™ C1060 GPU (PCI Express x16 interface)

 • # of Tesla GPUs                         • 1
 • # of Streaming Processor Cores          • 240
   (CUDA Cores)
 • Frequency of processor cores            • 1.3 GHz
 • Single-precision floating-point         • 936 Gflops
   performance (peak)
 • Double-precision floating-point
   performance (peak)
 • Floating-point precision                • IEEE 754 single & double
 • Total dedicated memory                  • 4 GB
 • Memory interface                        • 512-bit
 • Memory bandwidth                        • 102 GB/sec
 • Max power consumption                   • 187.8 W
 • System interface                        • PCIe x16
 • Software development tools              • C-based CUDA Toolkit
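
The single-precision peak in this table can be reproduced from the core count and clock. The host-side sketch below is an illustration under one assumption that the slides do not state: each core is counted as issuing a multiply-add plus a multiply per cycle, i.e. 3 floating-point operations. `peak_gflops` is a hypothetical helper.

```cuda
#include <cassert>

// Peak GFLOPS = (number of units) x (clock in GHz) x (flops per unit per cycle)
inline double peak_gflops(int units, double clockGHz, int flopsPerUnitPerCycle)
{
    assert(units > 0 && clockGHz > 0.0 && flopsPerUnitPerCycle > 0);
    return units * clockGHz * flopsPerUnitPerCycle;
}
```

For example, `peak_gflops(240, 1.3, 3)` gives 240 x 1.3 GHz x 3 = 936 Gflops, matching the single-precision figure listed for the C1060.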



            NVIDIA – GPU Computing Products - History

NVIDIA – Tesla S1070
Tesla S1070 : The NVIDIA® Tesla™ S1070 Computing System (dual PCI Express
2.0 cable connections) is a four-teraflop 1U system (four one-teraflop processors)

 • # of Tesla GPUs                         • 4
 • # of Streaming Processor Cores          • 960 (240 per processor)
 • Frequency of processor cores            • 1.296 to 1.44 GHz
 • Single-precision floating-point         • 3.73 to 4.14 Tflops
   performance (peak)
 • Double-precision floating-point         • 311 to 345 Gflops
   performance (peak)
 • Floating-point precision                • IEEE 754 single & double
 • Total dedicated memory                  • 16 GB
 • Memory interface                        • 512-bit
 • Memory bandwidth                        • 408 GB/sec
 • Max power consumption                   • 800 W
 • System interface                        • PCIe x16 or x8
 • Software development tools              • C-based CUDA Toolkit

[Figure: an application running across the CPU and the GPU]
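
The peak figures quoted for the Tesla C1060 and S1070 can be reproduced with simple arithmetic. The sketch below is only an illustration and rests on two assumptions that are not stated on the slides: each GT200 core can issue up to 3 single-precision flops per clock (a multiply-add plus a multiply), and each of the 30 streaming multiprocessors per GPU has one double-precision unit that retires one fused multiply-add (2 flops) per clock. The function names are hypothetical.

```c
#include <stdio.h>

/* Peak single-precision Gflops = cores x clock (GHz) x flops per core per clock.
   The value 3 (one multiply-add plus one multiply per clock) is an assumption
   about GT200-class cores; it is not stated on the slides. */
double peak_sp_gflops(int cores, double clock_ghz)
{
    const double flops_per_clock = 3.0;
    return cores * clock_ghz * flops_per_clock;
}

/* Peak double-precision Gflops, assuming one double-precision unit per
   streaming multiprocessor (SM) that retires one fused multiply-add
   (2 flops) per clock. */
double peak_dp_gflops(int sms, double clock_ghz)
{
    return sms * clock_ghz * 2.0;
}

/* Prints the estimates for both products (hypothetical helper). */
void print_tesla_peaks(void)
{
    printf("C1060 SP: %.0f Gflops\n", peak_sp_gflops(240, 1.3));
    printf("C1060 DP: %.0f Gflops\n", peak_dp_gflops(30, 1.3));
    printf("S1070 SP: %.2f to %.2f Tflops\n",
           peak_sp_gflops(960, 1.296) / 1000.0,
           peak_sp_gflops(960, 1.44) / 1000.0);
    printf("S1070 DP: %.0f to %.0f Gflops\n",
           peak_dp_gflops(120, 1.296), peak_dp_gflops(120, 1.44));
}
```

Calling print_tesla_peaks() from a main function gives values that agree with the tables to within rounding, for example 240 x 1.3 x 3 = 936 Gflops for the C1060.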


       Source : NVIDIA
        Fermi - NVIDIA GPU Computing Products

October 2009                          Vital for Scientific and Engineering Programs


 32 CUDA cores per streaming multiprocessor (four times as
 many as in the GT200 and G80)


 Fermi : (initially 16 SMs - Streaming Multiprocessors)
    512 CUDA cores per chip (16 SMs x 32 CUDA cores per SM)

    More cores boost throughput, but additional enhancements
    deliver even more performance
     CUDA programming model – performance enhancement with the
      warp, a bundle of 32 threads (the 32 cores are designed to work
      in parallel on 32 instructions at a time, one from each thread)


     Source : NVIDIA, References
           Fermi - NVIDIA GPU Computing Products

Feature                       Nvidia Fermi              Nvidia GT200              Nvidia G80

Introduction                  2009                      2008                      2006
Single-Precision FP           512 fused multiply-add    240 multiply-add          128 multiply-add
                              ops per cycle             ops per cycle             ops per cycle
Double-Precision FP           256 fused multiply-add    30 fused multiply-add     -
                              ops per cycle             ops per cycle
IEEE 754-2008 Compliance      Full, SP and DP           DP only                   -
Streaming Multiprocessors     16                        30                        16
CUDA Cores per SM             32                        8                         8
Warp Schedulers per SM        2                         1                         1
Special Function Units        4                         2                         2
per SM
Shared Memory per SM          48KB or 16KB              16KB                      16KB
                              (configurable)
L1 Instruction Cache per SM   Yes                       -                         -
      Source : NVIDIA, References
               Fermi - NVIDIA GPU Computing Products

Feature                       Nvidia Fermi              Nvidia GT200              Nvidia G80

L1 Data Cache per SM*         16KB or 48KB              -                         -
                              (configurable)
Total CUDA Processor Cores    512                       240                       128
L2 Cache                      768KB, shared             -                         -
Error-Correction Codes        DRAM, shared memories,    -                         -
                              L2 cache, registers
Concurrent Kernels            Up to 16                  -                         -
Fully Predicated ISA          Yes                       -                         -
Memory Addressing             40 bits (1TB), unified    32 bits (4GB), divided    32 bits (4GB), divided
Memory I/O Interfaces         6 x 64 bits, 6GB DRAM     8 x 64 bits, 1GB          6 x 64 bits, 768MB
                                                        GDDR3 DRAM                GDDR3
C++ Programmable              Yes                       -                         -
Transistors                   3.0 billion               1.4 billion               681 million


Nvidia’s three CUDA-capable GPU architectures : the G80, GT200 and Fermi architectures
* The shared memory and L1 data cache in each Fermi streaming multiprocessor together occupy 64KB of SRAM.



          Source : NVIDIA, References
                     GPGPUs / GPU Computing
                        Software Products
                         AMD - GPGPUs




Source : AMD, References
 AMD – ATI GPGPUs :: Processing Efficiency




Source : AMD, References
     AMD Stream SDK (Software Development Kit)
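[Figure: AMD Stream SDK software stack, from top to bottom]

 • Applications
 • Compilers, libraries and 3rd-party tools: Brook+ & other
   application-specific compilers, OpenCL, ACML, RapidMind
 • Graphics APIs: DirectX, OpenGL
 • AMD Runtime: Compute Abstraction Layer (CAL)
 • Hardware: multi-core AMD CPUs and AMD Stream Processors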




Source : AMD, References
                  AMD Stream Processing Strategy

Single Programming Environment




     Source : AMD, References
                AMD GPGPUs :: Brook+ Stream Programming Language

v Stream Programming Model : Brook is an extension to the C language for
  stream programming, originally developed at Stanford University
   • Enforces data-parallel computing
   • Encourages arithmetic intensity
   • Provides fundamental operations for stream computing


v Brook+ is AMD’s implementation of the Brook GPU specification on AMD’s
  Compute Abstraction Layer (CAL), with some enhancements

v Asynchronous CPU -> GPU transfers (GPU -> CPU transfers are still
  synchronous)

v Linux & Windows environments


     Source : AMD, References
                              AMD – FireStream™ 9250
v AMD’s Second Generation Stream Computing Product
v Single PCI Slot
v Computational Power
   • One TFLOPS single-precision floating point
   • 200 GFLOPS double precision
v 1 GB GDDR3
v Under 150 Watts → up to 8 GFLOPS/Watt
v Familiar 32- and 64-bit Linux® and Windows® environments
v Stream software supports multiple GPUs per system
v Brook+ (open-source C-level language & compiler)
   • GPU Shader Analyzer
   • AMD CodeAnalyst
v AMD’s Compute Abstraction Layer (CAL)
     Source : AMD, References
            OpenCL - Processor Parallelism

[Figure: CPUs and GPUs converging on heterogeneous computing]

 • CPUs: multiple cores driving performance increases
   (multi-processor programming, e.g. OpenMP)
 • GPUs: increasingly general-purpose data-parallel computing with
   improving numerical precision (graphics APIs and shading languages)
 • Emerging intersection: OpenCL for heterogeneous computing

                       OpenCL – Open Computing Language
                Open, royalty-free standard for portable, parallel programming of
               heterogeneous parallel computing systems: CPUs, GPUs, and other processors



Source : Khronos, References [43], [45]
                    OpenCL Working Group




Source : Khronos, References [43], [45]
                               Applications -
                             Performance Issues




Application -1 :: Quantum Monte Carlo Simulation using GPUs

Dynamic Cluster Approximation (DCA)
v Condensed Matter Physics - the study of
  high-temperature superconductivity
v One approach – formulate the problem on
  a regular lattice and perform
  simulations on the lattice.
     • Matrix multiplication totally dominates
       each step (in the Green’s function computation)

v 90 % of the total runtime is spent within
  the QMC update step. On the GPU, efficient
  matrix computations are required.

Application -1 :: Quantum Monte Carlo Simulation using GPUs
Dynamic Cluster Approximation (DCA)
v Updates (single/large) on each Quantum Monte Carlo (QMC) step
    • Approximations are required to handle the complexity
    • Large lattice / small lattices
    • Large-lattice approximation – cluster

[Figure: HTSC, ~10^23 interacting electrons -> 2D Hubbard model for CuO planes ->
 dynamic cluster approximation: map the Hubbard model onto an embedded cluster]

v A statistical formulation / Green’s function is used in the computation
v Computation drivers on GPUs: vector product; dense
  complex matrix multiplication; eigenvalue computation
Application -2 :: Multilevel Simulation of Electrostatic Potentials using GPUs

Molecular Dynamics Simulation
v Biomolecular systems
  • Molecular dynamics simulations of systems containing from over
    1 million up to 100 million atoms are required.
v Compute the electrostatic potential based on lattice
  point positions, atomic coordinates and charges
v Long-range & short-range parts on the lattices
v Example : NAMD

[Figure: the long-range part is computed as lattice cutoff calculations on
 several lattice levels (GPU); the short-range part is computed on GPU + CPU]
Application -2 :: Multilevel Simulation of Electrostatic Potentials using GPUs
v Calculate an approximation to the O(N^2) pairwise interactions with O(N)
  computational work
v GPU - parallelizing the short-range part; parallelizing the lattice cutoff part
v Many-to-many communications – GPU : 3-D FFT computations
v Multi-level operators : interpolation, restriction, prolongation, lattice
  cutoff, top level

[Figure: schematic of the lattice cutoff calculations]
  • A sub-cube of lattice potentials is assigned to each thread block.
  • The radius, measured in lattice points, is the same for each level.
  • Each lattice point potential is the sum of the enclosed distance-weighted
    lattice charges.

v Efficient data structures for GPU computation – thread cooperation
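
The lattice cutoff calculation on this slide (each lattice point potential is a distance-weighted sum of the charges within a fixed radius) can be sketched as a serial C reference. This is only an illustration, not the implementation used in the cited work: the cubic weight stencil, the non-periodic boundaries and the names lattice_cutoff and idx3 are assumptions. On a GPU, the three outer loops would be spread over thread blocks (one sub-cube of potentials each) and threads (one lattice point each).

```c
#include <stddef.h>

/* Flat index of lattice point (i, j, k) in an n x n x n lattice. */
static size_t idx3(int i, int j, int k, int n)
{
    return ((size_t)i * n + j) * n + k;
}

/* potential[p] = sum over lattice points q within `radius` lattice points of p
   (per axis) of weight[p - q] * charge[q].
   `weight` is a (2*radius+1)^3 stencil; entries outside the cutoff sphere
   can simply be set to zero. Boundaries are non-periodic (assumption). */
void lattice_cutoff(const float *charge, float *potential, int n,
                    const float *weight, int radius)
{
    int w = 2 * radius + 1;                     /* side of the weight stencil */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k) {
                float sum = 0.0f;
                for (int di = -radius; di <= radius; ++di) {
                    int qi = i + di;
                    if (qi < 0 || qi >= n) continue;
                    for (int dj = -radius; dj <= radius; ++dj) {
                        int qj = j + dj;
                        if (qj < 0 || qj >= n) continue;
                        for (int dk = -radius; dk <= radius; ++dk) {
                            int qk = k + dk;
                            if (qk < 0 || qk >= n) continue;
                            sum += weight[idx3(di + radius, dj + radius,
                                               dk + radius, w)]
                                 * charge[idx3(qi, qj, qk, n)];
                        }
                    }
                }
                potential[idx3(i, j, k, n)] = sum;
            }
}
```

Because the stencil has a fixed size, the work per lattice point is constant, which is what keeps the overall cost linear in the number of lattice points.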

Application -2 :: Multilevel Simulation of Electrostatic Potentials using GPUs
  v GPU – arrangement of data in matrix blocks & mapping to thread blocks

[Figure: schematic of the optimal reading of matrix data from GPU constant memory]
  • Each sub-matrix is assigned to a thread block.
  • A sliding block moves across the matrix.
  • Sub-blocks of the matrix data are stored in GPU shared memory.
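
The idea of assigning sub-matrices to thread blocks and staging sub-blocks in fast memory can be illustrated with a blocked matrix multiplication written in plain C. This is a sketch under assumptions, not the scheme from the cited work: the tile size, the function name matmul_tiled and the use of small local arrays to stand in for GPU shared memory are all illustrative choices.

```c
#define TILE 4   /* tile edge; a hypothetical value chosen for illustration */

/* C = A * B for n x n row-major matrices (n need not be a multiple of TILE).
   Each (bi, bj) pair plays the role of one thread block computing one
   sub-matrix of C; As and Bs stand in for tiles held in shared memory;
   the bk loop is the sliding block that moves along a row of A and a
   column of B. */
void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    for (int bi = 0; bi < n; bi += TILE)
        for (int bj = 0; bj < n; bj += TILE) {
            float acc[TILE][TILE] = {{0}};
            for (int bk = 0; bk < n; bk += TILE) {
                float As[TILE][TILE], Bs[TILE][TILE];
                /* Load one tile of A and one tile of B, padding with zeros. */
                for (int i = 0; i < TILE; ++i)
                    for (int j = 0; j < TILE; ++j) {
                        int ar = bi + i, ac = bk + j;
                        int br = bk + i, bc = bj + j;
                        As[i][j] = (ar < n && ac < n) ? A[ar * n + ac] : 0.0f;
                        Bs[i][j] = (br < n && bc < n) ? B[br * n + bc] : 0.0f;
                    }
                /* Multiply the two tiles and accumulate. */
                for (int i = 0; i < TILE; ++i)
                    for (int j = 0; j < TILE; ++j)
                        for (int k = 0; k < TILE; ++k)
                            acc[i][j] += As[i][k] * Bs[k][j];
            }
            /* Write the finished sub-matrix back. */
            for (int i = 0; i < TILE; ++i)
                for (int j = 0; j < TILE; ++j)
                    if (bi + i < n && bj + j < n)
                        C[(bi + i) * n + (bj + j)] = acc[i][j];
        }
}
```

Each element of A and B is loaded into a tile once per block and then reused TILE times, which is the reason for staging sub-blocks in shared memory on a GPU.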
Application -3 :: Molecular Dynamics
•   GPU implementation issues - mesh partition / distribution is complex;
    short-range & long-range interactions; efficient data structures
•   The GPU may require sparse matrix computations and FFT computations

[Figure: molecular dynamics schematic showing bonded terms (bend, stretch)
 and non-bonded interactions within the cutoff radius rcut]

Employing a finite cutoff reduces the problem from O(N^2) to O(N).


Application -3 :: Molecular Dynamics
GPU implementation issues –
•   A schematic 2-D representation of the link-cell neighbour search pattern used in the domain
    decomposition.
•   The thick black lines mark the boundaries between the regions for which different processors are
    responsible. In this example, all 16 link-cells that are resident on a processor are searched using the
    pattern shown. In addition, all the particles in the received cell 16 interact with those in cell 22.

[Figure: 5 x 5 grid of link-cells]

          22          23          24         25         26
          16          17          18         19         20
          11          12          13         14         15
          6           7           8          9          10
          1           2           3          4          5

Legend: domain boundaries; link-cells receiving coordinate data;
link-cells interacting with 2, 4, 5 or 6 neighbours.
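
The link-cell idea behind the search pattern above can be sketched in serial C: particles are binned into cells whose side is at least the cutoff radius, so each particle only needs to be tested against particles in its own and adjacent cells. This is a minimal 2-D illustration under assumptions (a square, non-periodic box with all coordinates in [0, box), and the hypothetical function name count_pairs_linkcell); it does not include the domain decomposition across processors.

```c
#include <stdlib.h>

/* Counts the pairs (i < j) of particles closer than rcut.
   Assumes 0 <= x[i], y[i] < box. Returns -1 if memory allocation fails. */
int count_pairs_linkcell(const double *x, const double *y, int n,
                         double box, double rcut)
{
    if (n <= 0) return 0;
    int nc = (int)(box / rcut);            /* cells per side */
    if (nc < 1) nc = 1;
    double side = box / nc;                /* cell side, >= rcut when box >= rcut */

    int *head = malloc((size_t)nc * nc * sizeof *head);  /* first particle in cell */
    int *next = malloc((size_t)n * sizeof *next);        /* next particle in same cell */
    if (!head || !next) { free(head); free(next); return -1; }

    for (int c = 0; c < nc * nc; ++c) head[c] = -1;
    for (int i = 0; i < n; ++i) {          /* build one linked list per cell */
        int cx = (int)(x[i] / side), cy = (int)(y[i] / side);
        if (cx >= nc) cx = nc - 1;
        if (cy >= nc) cy = nc - 1;
        next[i] = head[cy * nc + cx];
        head[cy * nc + cx] = i;
    }

    int pairs = 0;
    double r2 = rcut * rcut;
    for (int cy = 0; cy < nc; ++cy)
        for (int cx = 0; cx < nc; ++cx)
            for (int i = head[cy * nc + cx]; i != -1; i = next[i])
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx) {
                        int ny = cy + dy, nx = cx + dx;
                        if (ny < 0 || ny >= nc || nx < 0 || nx >= nc) continue;
                        for (int j = head[ny * nc + nx]; j != -1; j = next[j]) {
                            if (j <= i) continue;     /* count each pair once */
                            double ddx = x[i] - x[j], ddy = y[i] - y[j];
                            if (ddx * ddx + ddy * ddy < r2) ++pairs;
                        }
                    }
    free(head);
    free(next);
    return pairs;
}
```

For a roughly uniform particle density, each particle is compared with a bounded number of neighbours, so the total work grows as O(N) rather than O(N^2).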

Application -3 :: Molecular Dynamics
Force decomposition scheme
GPU implementation issues - a schematic 2-D representation of force decomposition; data
structures and block re-arrangement of neighbour information; movement of data from host
to GPU.

[Figure: force decomposition schematic. Atom lists 1 to 16 are mapped in blocks onto
 an array of processors; a marker indicates transposed blocks.]

Application -4 :: Parallel Finite Element Computations (Unstructured Grids) using GPUs
 GPU implementation issues – typical finite element computations; mesh partitioning
 using Metis; hybrid computing - MPI & GPU implementation

Mesh Partition/Distribution - Complex
• Elements (evenly distributed) / processor-level
  nodes / global nodes (aligned with
  processor-level nodes as much as possible)
• Sparse matrix computations
Application -4 :: Parallel Finite Element Computations (Unstructured Grids) using GPUs
 GPU implementation of sparse matrix computations in an iterative solver

Parallel matrix solution techniques
 • Iterative / direct methods
 • Ordering of the sparse linear system; symbolic factorization; numeric
   factorization; solving a triangular system; sparse matrix-vector
   multiplications; matrix preconditioning
 • Graph partitioning algorithms

[Figure: Domains 1 to 4 of the partitioned mesh, shown together with
 thread blocks and their shared memory]

Application -4 :: Parallel Finite Element Computations (Unstructured Grids) using GPUs
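    Sparse matrix computations : compressed sparse row (CSR) storage,
    and a basic CSR-based implementation for GPU programming

[Figure: a sparse matrix A with m rows is stored as three arrays: val and ind,
 each of length k (the non-zero values and their column indices), and ptr of
 length m+1 (row pointers); shown with a thread block and its shared memory]

// Basic implementation,
// y <- y + A*x, where A is in CSR
  for (i = 0; i < m; ++i) {
     double y0 = y[i];
     // the non-zeros of row i are val[ptr[i]] .. val[ptr[i+1]-1]
     for (k = ptr[i]; k < ptr[i+1]; ++k)
       y0 += val[k] * x[ind[k]];
     y[i] = y0;
    }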
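
As a small worked example of the CSR layout, the sketch below wraps the loop above in a function and applies it to a 3 x 3 matrix. The matrix, the vector and the names csr_spmv and csr_example are made up for illustration; only the loop body comes from the slide.

```c
/* y <- y + A*x for a sparse matrix A with m rows stored in CSR form. */
void csr_spmv(int m, const int *ptr, const int *ind, const double *val,
              const double *x, double *y)
{
    /* The rows are independent, so on a GPU each row could be given to
       one thread (or one warp). */
    for (int i = 0; i < m; ++i) {
        double y0 = y[i];
        for (int k = ptr[i]; k < ptr[i + 1]; ++k)
            y0 += val[k] * x[ind[k]];
        y[i] = y0;
    }
}

/* Worked example (hypothetical data):

           | 10  0  2 |
       A = |  0  3  0 |      x = (1, 2, 3)
           |  1  0  4 |

   val holds the non-zeros row by row, ind their column indices, and
   ptr[i] .. ptr[i+1]-1 is the range of row i inside val and ind. */
void csr_example(double y[3])
{
    static const double val[5] = {10.0, 2.0, 3.0, 1.0, 4.0};
    static const int    ind[5] = {0, 2, 1, 0, 2};
    static const int    ptr[4] = {0, 2, 3, 5};
    static const double x[3]   = {1.0, 2.0, 3.0};

    y[0] = y[1] = y[2] = 0.0;
    csr_spmv(3, ptr, ind, val, x, y);   /* y = A*x = (16, 6, 13) */
}
```

Only the five non-zeros are stored and visited, instead of all nine matrix entries; for large finite element matrices this difference dominates both memory use and run time.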
Application -4 :: Parallel Finite Element Computations (Unstructured Grids) using GPUs

[Figure: schematic of the optimal reading of sparse matrix data from GPU constant memory]
  • Each sub-matrix is assigned to a thread block.
  • A sliding block moves across the matrix.
  • Sub-blocks of the matrix data are stored in GPU shared memory.
  • Sparse matrix data is re-arranged in blocks for ease of computation
    on GPUs as well as for data movement.
Application -5 :: Parallel Finite Difference Computations (Structured Grids) using GPUs
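Case Study - Poisson Equation Solver

\[
\frac{\partial U}{\partial t} - \left( \frac{\partial^2 U}{\partial x^2} + \frac{\partial^2 U}{\partial y^2} \right) = f(x,y);
\qquad \Omega \subseteq \Re^2, \quad t \in [t_0, t_f]
\]
\[
U(x,y,t_0) = g \ \text{on} \ \partial\Omega
\]

[Figure: the grid is partitioned among Rank = 0, Rank = 1 and Rank = 2]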

Application -5 :: Parallel Finite Difference Computations (Structured Grids) using GPUs
Case Study - Poisson Equation Solver
    v Initially, four arrays are required for the computation:

         •   one to store the old values of U at each grid point, i.e. Uold,
         •   one to store the new values of U, i.e. Unew, and
         •   one each for storing the difference between Uold & Unew and
             the index values of the interior points.

    v Memory for these arrays is allocated on the host and they are
      initialized.

         •   Four arrays are allocated on the GPU (device).
         •   The values of the arrays on the host machine are copied into
             the arrays allocated on the device.

         •   The solution is computed on the GPU in three steps.
Application -5 :: Parallel Finite Difference Computations (Structured Grids) using GPUs
Case Study- Poisson Equation Solver : GPU Implementation
       v Step 1: We begin the computation with an initial solution vector Uold
         and apply the boundary conditions to both the Uold and Unew vectors by
         setting the boundary values at the corresponding boundary points.

       v Step 2: We compute the Unew solution vector at each interior grid point:

            \[ U^{new}_{i,j} = \tfrac{1}{4} \left( U^{old}_{i-1,j} + U^{old}_{i,j+1} + U^{old}_{i,j-1} + U^{old}_{i+1,j} \right) \]

       v Step 3: We compute the difference between the Unew and Uold solution
         vectors and check the convergence criterion to decide whether to stop
         the computation.
          • If the convergence criterion is not satisfied, we assign the Unew
             solution vector to Uold and repeat the computation
             until the convergence criterion is satisfied.
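
The three steps can be sketched as a serial C reference before moving them to the GPU. This is a minimal illustration, not the code behind the reported timings: the function names jacobi_sweep and jacobi_solve, the square n x n row-major grid and the maximum-difference convergence test are assumptions, and the update is the four-point average given in Step 2 (no source term).

```c
/* Step 2: one Jacobi sweep over the interior of an n x n row-major grid.
   Boundary entries of unew are left untouched.
   Returns the largest |Unew - Uold| over the interior points. */
double jacobi_sweep(const double *uold, double *unew, int n)
{
    double maxdiff = 0.0;
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < n - 1; ++j) {
            double v = 0.25 * (uold[(i - 1) * n + j] + uold[i * n + j + 1]
                             + uold[i * n + j - 1] + uold[(i + 1) * n + j]);
            double d = v - uold[i * n + j];
            if (d < 0.0) d = -d;
            if (d > maxdiff) maxdiff = d;
            unew[i * n + j] = v;
        }
    return maxdiff;
}

/* Steps 1 to 3. Step 1 is the caller's job: both uold and unew must already
   hold the boundary values. Returns the number of sweeps performed, or -1
   if max_iter sweeps were not enough. The solution ends up in unew. */
int jacobi_solve(double *uold, double *unew, int n, double tol, int max_iter)
{
    for (int it = 1; it <= max_iter; ++it) {
        double diff = jacobi_sweep(uold, unew, n);     /* Step 2 */
        if (diff < tol)                                /* Step 3: converged? */
            return it;
        for (int i = 1; i < n - 1; ++i)                /* assign Unew to Uold */
            for (int j = 1; j < n - 1; ++j)
                uold[i * n + j] = unew[i * n + j];
    }
    return -1;
}
```

Every interior point in a sweep depends only on Uold, so all points can be updated independently; this is what makes the Jacobi method a natural fit for one GPU thread per grid point.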
Application -5 :: Parallel Finite Difference Computations (Structured Grids) using GPUs
Case Study- Poisson Equation Solver : GPU Implementation
   The CUDA APIs used are given below.
  - Allocates memory on the device
    cudaMalloc(void** array, size_t size)
  - Frees memory allocated on the device
    cudaFree(void* array)
  - Copies from host to device
    cudaMemcpy((void*)device_array, (void*)host_array, size,
    cudaMemcpyHostToDevice)
  - Copies from device to host
    cudaMemcpy((void*)host_array, (void*)device_array, size,
    cudaMemcpyDeviceToHost)
  - Multiplies vector x by scalar alpha, adds the result to vector y and stores the result
    in vector y
    void cublasSaxpy (int n, float alpha, const float *x, int incx, float *y, int incy)
  - Finds the smallest index of the maximum-magnitude element of vector x
    int cublasIsamax (int n, const float *x, int incx)
  - Copies the vector x to the vector y
    void cublasScopy (int n, const float *x, int incx, float *y, int incy)
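
To make clear what the three BLAS routines compute, the sketch below gives plain C reference versions with the same argument lists, plus one possible way to combine them into the convergence check of Step 3. These are host-side stand-ins written for illustration, not the CUBLAS implementations; the ref_ names and the helper max_change are hypothetical, only positive increments are handled, and the 1-based return value of ref_isamax follows the convention of the legacy CUBLAS interface.

```c
/* y <- alpha * x + y */
void ref_saxpy(int n, float alpha, const float *x, int incx,
               float *y, int incy)
{
    for (int i = 0; i < n; ++i)
        y[i * incy] += alpha * x[i * incx];
}

/* Smallest index of the element of x with the largest magnitude.
   The index is 1-based; 0 is returned for n <= 0 or incx <= 0. */
int ref_isamax(int n, const float *x, int incx)
{
    if (n <= 0 || incx <= 0) return 0;
    int best = 0;
    float best_mag = x[0] < 0.0f ? -x[0] : x[0];
    for (int i = 1; i < n; ++i) {
        float v = x[i * incx];
        float mag = v < 0.0f ? -v : v;
        if (mag > best_mag) { best_mag = mag; best = i; }
    }
    return best + 1;
}

/* y <- x */
void ref_scopy(int n, const float *x, int incx, float *y, int incy)
{
    for (int i = 0; i < n; ++i)
        y[i * incy] = x[i * incx];
}

/* Largest |Uold - Unew|. The array uold is overwritten with the
   difference, so it must be refreshed (e.g. with ref_scopy) afterwards. */
float max_change(int n, const float *unew, float *uold)
{
    ref_saxpy(n, -1.0f, unew, 1, uold, 1);   /* uold <- uold - unew */
    int k = ref_isamax(n, uold, 1);          /* 1-based index of max |.| */
    if (k == 0) return 0.0f;
    float d = uold[k - 1];
    return d < 0.0f ? -d : d;
}
```

After the check, ref_scopy(n, unew, 1, uold, 1) would assign Unew to Uold for the next iteration.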

     Source : NVIDIA
Application -5 :: Parallel Finite Difference Computations (Structured Grids) using GPUs
    Case Study - Poisson Equation Solver : GPU Implementation
v   Use a global 1D array
v   Use a block of the grid (partition)
v   Data transfer from host to GPU
v   Jacobi method employed

Performance results for the Poisson equation based on the
finite difference method on NVIDIA Tesla C1060 *
Computation time in seconds (penta-diagonal formulation; CPU code compiled with -O3)

  Grid size (n)    No. of iterations    Timings on CPU (sec)    Timings on GPU (sec)

       1000              1000                  1.39                    0.12
       8100             19900                   -                     12.59
      10000             21000                   -                     18.67
      14400             29000                   -                     35.97

     GPU : efficient matrix (grid) decomposition & matrix mapping, which would
     avoid frequent data transfers from host to GPU & vice versa, has not been done.
Case Study :: Dense Matrix Computations using GPU & CPU
Matrix Computations - CUDA API – Naive implementation
Multiple Iterations - Multiple Passes as per Application
v The CUDA API does have support for BLAS Calls - CUBLAS
   (Overheads incurred about using GPU can be estimated.)
     •   Require allocation of GPU based memory for matrix inputs
     •   Transfer of the matrices to the GPU.
     •   Transfer of the result back to Host GPU
     •   Deallocation of the GPU local memory

v The data transfer overheads (Initially retain the overheads)
v Amortize the matrix allocation /deallocation overheads
    (Allocate space in GPU-local memory only once and when each BLAS
    call occurs, re-use the same local memory to receive data transfers.)
v   Transfer all input matrices to and all result matrices from the GPU to
    each BLAS Call (Initial implementation)
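The benefit of amortizing allocation can be written as a simple cost model. This is an illustrative sketch, not a measurement from the talk: $k$ is an assumed number of BLAS calls, and $T_{\mathrm{alloc}}$, $T_{\mathrm{in}}$, $T_{\mathrm{compute}}$, $T_{\mathrm{out}}$ and $T_{\mathrm{free}}$ are assumed per-call costs of allocation, host-to-GPU transfer, the BLAS computation, GPU-to-host transfer and deallocation.

```latex
\begin{align*}
T_{\mathrm{initial}}   &= k \left( T_{\mathrm{alloc}} + T_{\mathrm{in}} + T_{\mathrm{compute}} + T_{\mathrm{out}} + T_{\mathrm{free}} \right) \\
T_{\mathrm{amortized}} &= T_{\mathrm{alloc}} + T_{\mathrm{free}} + k \left( T_{\mathrm{in}} + T_{\mathrm{compute}} + T_{\mathrm{out}} \right) \\
T_{\mathrm{initial}} - T_{\mathrm{amortized}} &= (k - 1) \left( T_{\mathrm{alloc}} + T_{\mathrm{free}} \right)
\end{align*}
```

Under this model the saving grows linearly with the number of BLAS calls, while the data transfer terms remain in both variants, matching the slide's note that transfer overheads are retained initially.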
    Source - NVIDIA
   Case Study :: Dense Matrix Computations using GPU & CPU
• Matrix – Matrix computations – Matrix Data
      • GeForce; Tesla C1070; gcc compiler, nvcc compiler

[Figure: Performance (speed-up) of CPU and GPU versus matrix size (1K, 2K, 4K, 8K); vertical axis from 1 to 40]

• Optimization of Matrix – Matrix Multiplication algorithms and their
  performance on single-core / multi-core processors is in progress.
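As a reference point for the matrix-matrix multiplication being benchmarked, a naive triple-loop version can be sketched as follows. This is an illustrative CPU baseline, not the implementation measured in the chart; the function name `matmul_naive` and the flat row-major storage are assumptions.

```python
def matmul_naive(a, b, n):
    """Return C = A * B for n x n matrices stored as flat row-major lists."""
    c = [0.0] * (n * n)
    for i in range(n):
        for k in range(n):
            aik = a[i * n + k]        # reuse A[i][k] across the inner loop
            for j in range(n):
                c[i * n + j] += aik * b[k * n + j]
    return c


if __name__ == "__main__":
    a = [1.0, 2.0,
         3.0, 4.0]
    b = [5.0, 6.0,
         7.0, 8.0]
    print(matmul_naive(a, b, 2))      # prints [19.0, 22.0, 43.0, 50.0]
```

The i-k-j loop order walks rows of B and C contiguously, which is one of the simplest cache-friendly orderings on a CPU. The work is 2n³ floating-point operations on 3n² matrix elements, which helps explain why larger matrices leave more room for GPU speed-up once the transfer cost is paid.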

          GPU Computing :: Performance of Scientific Applications

Conclusions
   v An overview of GPU Computing
   v GPGPUs / GPU Computing Software Products
      and Applications - Performance issues
Future Work

v GPU : Implementation of Finite Element / Finite Volume
  Computations – Sparse Matrix Iterative Conjugate Gradient
  Solver

                                    Acknowledgements : NVIDIA, AMD, GPU researchers

    Source - NVIDIA
                                     Any Questions ?



