An overview of GPU Computing
Betatesting Group
Centre for Development of Advanced Computing (C-DAC),
Pune University
vcvrao@cdac.in
International Workshop on High Performance Computing in Observational
Astronomy: Requirements and Challenges ( IUCAA – Pune – Oct 13, 2009)
VCV.Rao. Betatesting Group, C-DAC, Pune 1
Lecture Outline
The following topics will be discussed:
❖ Part-I : Historical Perspective
❖ Part-II : An Overview of GPU Computing
❖ Part-III : GPGPUs / GPU Computing Software Products
❖ Part-IV : Applications - Performance Issues
Source : NVIDIA, AMD, Intel & References given in the presentation
GPU Computing
History - An Overview
GPU : Massively parallel compute offload
❖ Performance
Moore’s Law → 2x < 18 Months
❖ Power
Frequency / Power / Complexity Wall
Parallel → Opportunity for growth
❖ Price
❖ Programming Models
The GPU is the first successful massively parallel COMMODITY
architecture with a programming model that manages to use
1000’s of parallel threads in hardware to perform useful work
efficiently. Combined with innovative algorithm design for
applications, it has shown accelerations of 5x, 10x, and 20x
compared to CPU performance.
What is a GPU ?
Graphics Processing Unit
❖ A GPU is also occasionally called a visual processing unit or VPU
❖ Dedicated graphics rendering device for a personal computer, workstation, or game console
❖ The GPU is viewed as a compute device that :
• Is a coprocessor to the CPU or host machine
• Has its own DRAM (on the device)
• Runs many threads in parallel
❖ The GPU is a dedicated super-threaded, massively data parallel co-processor
[Figure: an application on the CPU alone (without GPU) and on CPU + GPU (with GPU)]
Why Are GPUs So Fast?
❖ The GPU was originally specialized for math-intensive, highly parallel computation
❖ So, more transistors can be devoted to data processing rather than data caching and flow control
[Figure: CPU versus GPU transistor allocation among Control, Cache, ALUs, and DRAM (AMD, NVIDIA)]
❖ Commodity industry: provides economies of scale
❖ Competitive industry: fuels innovation
Computer Graphics : GPU – Programmable Pipeline
❖ Dealing with a complex graphics API
❖ Sequential flow of execution
❖ Limited communication per thread
[Figure: graphics pipeline (Application, Command, Geometry, Rasterization, Texture, Fragment, Display) and the fragment program's resources: Input Registers, Constants, Temp Registers, Texture, Output Registers, FB Memory; per thread, per Shader, per Context]
Computer Graphics : GPU – Programmable Pipeline
[Figure: programmable graphics pipeline. Labels: 3D Application, 3D API, CPU, GPU Front End, Command, Vertex Processor, Geometry, Primitive Assembly, Rasterization, Fragment Processor, Texture, Raster Operations, Frame Buffer, Display, GPU Shaders]
Computer Graphics : GPU Programmable Pipeline
❖ Hardware mimicked graphics APIs
❖ It is possible to formulate many problems in this framework
• Uses graphics APIs
• Classical GPGPU
[Figure: pipeline stages CPU, Geometry, Rasterize, Shade pixels, Display]
DO NOT DO THIS ANYMORE!
(Unless for graphics)
An Overview of
GPU Computing
GPU Computing : Think in Parallel
Some Design Goals
❖ Scale to 100’s of cores, 1000’s of parallel threads
❖ Let programmers focus on parallel algorithms & re-writing the code
• Not on the mechanics of a parallel programming language
❖ Enable heterogeneous systems (i.e. CPU + GPU)
• CPU and GPU are separate devices with separate DRAMs
threadID 0 1 2 3 4 5 6 7
……
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
GPU Computing : Think in Parallel
❖ Performance = parallel hardware + scalable parallel program
❖ GPU Computing drives new applications
• Reducing “Time to Discovery”
• 100x speedup changes science & research methods
❖ New applications drive the future of GPUs
• Drives new GPU capabilities
• Drives hunger for more performance
GPU Computing : Think in Parallel
❖ Speedups of 8x to 30x are quite common for certain classes of applications
❖ The GPU is a data-parallel processor
• Thousands of parallel threads
• Thousands of data elements to process
• All data processed by the same program
➢ SPMD computation model
• Contrast with task parallelism and ILP
❖ Best results when you “Think Data Parallel”
• Design your algorithm for data-parallelism
• Understand parallel algorithmic complexity and efficiency
• Use data-parallel algorithmic primitives as building blocks
GPU Computing : Think in Parallel
Why Are GPUs So Fast?
❖ Optimized for structured parallel execution
• Extensive ALU counts & memory bandwidth
• Cooperative multi-threading hides latency
❖ Shared instruction resources
❖ Fixed-function units for parallel workload dispatch
❖ Extensive exploitation of locality
• Performance / (Cost/Watt); Power for Core
• Structured parallelism enables more flops for fewer watts
GPU Computing : Think in Parallel
GPU Computing : Optimise Algorithms for the GPU
❖ Maximize independent parallelism
❖ Maximize arithmetic intensity (math/bandwidth)
❖ Sometimes it’s better to recompute than to cache
• The GPU spends its transistors on ALUs, not memory
❖ Do more computation on the GPU to avoid costly data transfers
• Even low-parallelism computations can sometimes be faster than transferring back and forth to the host
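Arithmetic intensity (math/bandwidth) can be estimated before any code is ported. The sketch below is illustrative only: the function names are hypothetical, and the byte counts assume single-precision data with ideal reuse (each matrix or vector touched once in memory).

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic intensity: floating-point operations per byte moved to or from memory. */
static double arithmetic_intensity(double flops, double bytes)
{
    assert(bytes > 0.0);
    return flops / bytes;
}

/* SAXPY (y = a*x + y) on n floats: 2 flops per element; reads x and y, writes y. */
static double saxpy_intensity(size_t n)
{
    double flops = 2.0 * (double)n;
    double bytes = 3.0 * sizeof(float) * (double)n;
    return arithmetic_intensity(flops, bytes);
}

/* Dense n x n matrix multiply: 2*n^3 flops; assumed ideal traffic reads A and B
 * once and writes C once. */
static double sgemm_intensity(size_t n)
{
    double d = (double)n;
    double flops = 2.0 * d * d * d;
    double bytes = 3.0 * sizeof(float) * d * d;
    return arithmetic_intensity(flops, bytes);
}
```

Under these assumptions SAXPY stays at about 0.17 flops per byte for any n, while the matrix multiply grows linearly with n, which is why dense matrix kernels tend to map well to GPUs.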
GPU Computing : Think in Parallel
GPU Computing : Use Parallelism Efficiently
❖ Partition your computation to keep the GPU multiprocessors equally busy
• Many threads, many thread blocks
❖ Keep resource usage low enough to support multiple active thread blocks per multiprocessor
• Registers, shared memory
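How register and shared-memory usage limits the number of active thread blocks per multiprocessor can be sketched as a simple minimum over resource limits. This is a simplified model, not the vendor's occupancy calculation: the function name is hypothetical, every hardware limit is a caller-supplied parameter, and allocation granularity is ignored.

```c
#include <assert.h>

/* Smaller of two ints. */
static int min_int(int a, int b) { return a < b ? a : b; }

/*
 * Upper bound on thread blocks resident on one multiprocessor.
 * All *_per_sm limits are assumptions supplied by the caller; they differ
 * between GPU generations.
 */
static int active_blocks_per_sm(int threads_per_block, int regs_per_thread,
                                int smem_per_block,
                                int regs_per_sm, int smem_per_sm,
                                int max_threads_per_sm, int max_blocks_per_sm)
{
    assert(threads_per_block > 0 && regs_per_thread > 0);
    int by_threads = max_threads_per_sm / threads_per_block;
    int by_regs = regs_per_sm / (regs_per_thread * threads_per_block);
    int by_smem = smem_per_block > 0 ? smem_per_sm / smem_per_block
                                     : max_blocks_per_sm;
    return min_int(min_int(by_threads, by_regs),
                   min_int(by_smem, max_blocks_per_sm));
}
```

With assumed limits of 16384 registers, 16384 bytes of shared memory, 1024 threads, and 8 blocks per multiprocessor, a 256-thread block using 16 registers per thread and 4096 bytes of shared memory allows 4 active blocks; doubling the registers per thread to 32 halves that to 2.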
GPU Computing : Think in Parallel
GPU Computing : Take Advantage of Shared Memory
❖ Hundreds of times faster than global memory
❖ Threads can cooperate via shared memory
❖ Use one / a few threads to load/compute data shared by all threads
❖ Use it to avoid non-coalesced access
• Stage loads and stores in shared memory to re-order non-coalesceable addressing
• Matrix transpose example later
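The staging idea behind the transpose example can be sketched on the CPU. This is a serial illustration, not GPU code: the local array `tile` stands in for shared memory, and `TILE` and `transpose_tiled` are hypothetical names chosen for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 16  /* hypothetical tile width; one tile per "thread block" */

/*
 * Transpose a rows x cols row-major matrix `in` into the cols x rows matrix `out`.
 * Each tile is read row by row into a small staging buffer (the role shared
 * memory plays on the GPU) and then written out row by row, so both the reads
 * from `in` and the writes to `out` walk consecutive addresses.
 */
static void transpose_tiled(const float *in, float *out, size_t rows, size_t cols)
{
    float tile[TILE][TILE];
    assert(in != NULL && out != NULL);
    for (size_t r0 = 0; r0 < rows; r0 += TILE) {
        for (size_t c0 = 0; c0 < cols; c0 += TILE) {
            size_t h = rows - r0 < TILE ? rows - r0 : TILE;
            size_t w = cols - c0 < TILE ? cols - c0 : TILE;
            /* Stage: consecutive reads along each input row. */
            for (size_t r = 0; r < h; ++r)
                for (size_t c = 0; c < w; ++c)
                    tile[r][c] = in[(r0 + r) * cols + (c0 + c)];
            /* Drain: consecutive writes along each output row. */
            for (size_t c = 0; c < w; ++c)
                for (size_t r = 0; r < h; ++r)
                    out[(c0 + c) * rows + (r0 + r)] = tile[r][c];
        }
    }
}
```

On a GPU the two inner loop nests would be executed by the threads of one block with a barrier between them; the re-ordering through the tile is what turns a strided write into a coalesced one.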
GPU Programming : Two Main Challenges
GPU Challenges with regard to Scientific Computing
Challenge 1 : Programmability
❖ Example : Matrix Computations
• To port an existing scientific application to a GPU
❖ GPU memory exists on the card itself
• Must send matrix arrays over the PCI-Express bus
− Send A, B, C to the GPU over PCIe
− Perform GPU-based computations on A, B, C
− Read result C from the GPU over PCIe
❖ The user must focus considerable effort on optimizing performance by manually orchestrating data movement and managing thread-level parallelism on the GPU.
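The send / compute / read-back sequence can be costed with a back-of-the-envelope model. The sketch below is illustrative: the function names are hypothetical, the matrices are assumed to be n x n single precision, and the sustained PCIe bandwidth is a caller-supplied assumption, not a measured figure.

```c
#include <assert.h>

/* Seconds needed to move `bytes` across a link sustaining `bytes_per_sec`. */
static double transfer_seconds(double bytes, double bytes_per_sec)
{
    assert(bytes_per_sec > 0.0);
    return bytes / bytes_per_sec;
}

/*
 * Total offload time for n x n single-precision matrices:
 * send A, B, C to the GPU, compute there, read C back.
 */
static double offload_seconds(double n, double gpu_compute_seconds,
                              double pcie_bytes_per_sec)
{
    double matrix_bytes = n * n * sizeof(float);
    double upload = transfer_seconds(3.0 * matrix_bytes, pcie_bytes_per_sec);
    double download = transfer_seconds(matrix_bytes, pcie_bytes_per_sec);
    return upload + gpu_compute_seconds + download;
}

/* Offloading pays off only if transfers plus GPU compute beat the CPU time. */
static int offload_is_worthwhile(double n, double cpu_seconds,
                                 double gpu_compute_seconds,
                                 double pcie_bytes_per_sec)
{
    return offload_seconds(n, gpu_compute_seconds, pcie_bytes_per_sec) < cpu_seconds;
}
```

For example, with n = 1024 and an assumed 4 GB/s link, the four matrix transfers cost roughly 4 ms; a kernel that needs 10 ms on the GPU is worth offloading when the CPU takes 100 ms, but not when it takes 12 ms.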
GPU Programming : Two Main Challenges
Challenge 2 : Accuracy
❖ Example : Non-Scientific Computation - Video Games (Frames)
(A single-bit difference in a rendered pixel in a real-time graphics program may be discarded when generating subsequent frames)
❖ Scientific Computing : a single-bit error propagates into the overall error
❖ Past History : most GPUs supported only single-precision, 32-bit floating-point operations, and not all GPUs implemented the full IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754)
GPU Programming : Two Main Challenges
GPU Challenges with regard to Scientific Computing
❖ Most recent GPUs – improved IEEE compliance and essential double-precision support
❖ The computational power offered by GPUs is excellent
❖ Clusters of GPUs for certain classes of applications
• Issues – Mixed programming environment (MPI, NVIDIA CUDA programming, AMD FireStream Brook+ programming, OpenCL programming, ...)
• Issues – Memory issues, reliability, error detection and correction techniques
❖ A wide range of applications map well to the GPU and gain performance benefits from it.
GPU Computing : Think in Parallel
❖ Good strategies for extracting high performance from individual subsystems on the GPU
❖ NVIDIA CUDA / AMD FireStream Brook+ (plenty of opportunities for further optimizations) & Intel Larrabee
[Figure: Intel, AMD, NVIDIA, and OpenCL]
GPGPUs / GPU Computing
Software Products
NVIDIA – GPU Computing
NVIDIA – GPU Computing Products - History
Finance
- Pricing and risk
- Higher accuracy, faster
Oil & Gas Industry / Seismic
- Energy discovery
- Broad adoption
Defense
- Signal analysis
- Very high need for computer resources
Supercomputing
- World-class science
- Top 500
Universities
- Desk supercomputing
- Thousands of customers
Nvidia estimates that the total available market (TAM) for GPU computing is at
least half as large as the desktop-PC market for GPUs. The upside potential
looks greater. Whereas the PC market is maturing, GPU computing barely
existed four years ago and is growing fast.
NVIDIA - GPU Computing CUDA Kernels and Threads
Arrays of Parallel Threads
❖ A CUDA kernel is executed by an array of threads
• All threads run the same code
• Each thread has an ID that it uses to compute memory addresses and make control decisions
threadID 0 1 2 3 4 5 6 7
……
float x = input[threadID];
float y = func(x);
output[threadID] = y;
……
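The "same code, different threadID" model can be imitated on the CPU with a loop. This is a serial stand-in, not CUDA: `kernel_body` and `launch` are hypothetical names, and `func` is an arbitrary placeholder for the per-element function on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-element function standing in for func() on the slide. */
static float func(float x) { return 2.0f * x + 1.0f; }

/* Body of the kernel: what one thread does with its own threadID. */
static void kernel_body(size_t threadID, const float *input, float *output)
{
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}

/* Serial stand-in for a launch: the GPU would run these iterations as
 * parallel threads, in no guaranteed order. */
static void launch(size_t num_threads, const float *input, float *output)
{
    assert(input != NULL && output != NULL);
    for (size_t threadID = 0; threadID < num_threads; ++threadID)
        kernel_body(threadID, input, output);
}
```

Because no iteration reads what another iteration writes, the loop order does not matter; that independence is what lets the hardware run the body as an array of threads.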
NVIDIA : CUDA - Quick terminology review
❖ Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads)
• The unit of parallelism in CUDA
• Note the difference from CPU threads: the creation cost, resource usage, and switching cost of GPU threads are much smaller
❖ Warp: a group of threads executed physically in parallel (SIMD)
❖ Thread Block: a group of threads that execute together and can share memory on a single multiprocessor
❖ Grid: a group of thread blocks that execute a single CUDA program logically in parallel
❖ Device: GPU; Host: CPU; SM: Streaming Multiprocessor
NVIDIA GPU Computing - CUDA Kernels and Threads
❖ Parallel portions of an application are executed on the device as kernels
• One kernel is executed at a time
• Many threads execute each kernel
❖ Differences between CUDA and CPU threads
• CUDA threads are extremely lightweight
– Very little creation overhead
– Instant switching
• CUDA uses 1000’s of threads to achieve efficiency
– Multi-core CPUs can use only a few
Definitions
Device = GPU; Host = CPU
Kernel = function that runs on the device
NVIDIA GPU Computing - CUDA Kernels and Threads
❖ NEW: GPU Computing with CUDA
• CUDA = Compute Unified Device Architecture
• Co-designed hardware & software for direct GPU computing
❖ Hardware: fully general data-parallel architecture
• General thread launch; Global load-store
• Parallel data cache
❖ Software: program the GPU in C / C++
• Scalable data-parallel execution / memory model; Single/Double precision
❖ Hundreds of times faster than global memory
❖ Use one / a few threads to load/compute data shared by all threads
[Figure: Compute Unified Device Architecture software stack: Application and CUDA Libraries on the CPU, then CUDA Runtime, CUDA Driver, and the GPU]
NVIDIA GPU Computing - CUDA Kernels and Threads
CUDA Software Development
CUDA 2.3 with Tesla C1060 is used for GPU Computations
[Figure: CUDA software development flow. CUDA Optimized Libraries (math.h, FFT, BLAS, …) and Integrated CPU + GPU C Source Code go to the NVIDIA C Compiler, which produces NVIDIA Assembly for Computing (PTX) and CPU Host Code; other labels: CUDA Driver, Profile, Standard C Compiler, GPU, CPU]
NVIDIA GPU Computing - CUDA Kernels and Threads
Thread Cooperation
❖ The Missing Piece: threads may need to cooperate
❖ Thread co-operation is valuable
• Share results to avoid redundant computation
• Share memory accesses
– Drastic bandwidth reduction
❖ Thread co-operation is a powerful feature of CUDA
❖ Cooperation between a monolithic array of threads is not scalable
• Cooperation within smaller batches of threads is scalable
NVIDIA GPU Computing - CUDA Kernels and Threads
Thread Batching
❖ A kernel launches a grid of thread blocks
• Threads within a block cooperate via shared memory
• Threads within a block can synchronize
• Threads in different blocks cannot cooperate
❖ Allows programs to transparently scale to different GPUs
[Figure: a Grid containing Thread Block 0, Thread Block 1, ..., Thread Block n, each with its own Shared Memory]
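How a problem of n elements is cut into a grid of thread blocks comes down to two small index calculations. The sketch below uses hypothetical helper names and a one-dimensional grid; it shows the arithmetic only and launches nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Number of thread blocks needed to cover n elements (ceiling division). */
static size_t blocks_for(size_t n, size_t threads_per_block)
{
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Global element index handled by thread `thread` of block `block`. */
static size_t global_index(size_t block, size_t threads_per_block, size_t thread)
{
    return block * threads_per_block + thread;
}

/* Count the (block, thread) pairs that map to a valid element; threads whose
 * index falls at or beyond n must skip their work. */
static size_t count_valid_threads(size_t n, size_t threads_per_block)
{
    size_t valid = 0;
    size_t blocks = blocks_for(n, threads_per_block);
    for (size_t b = 0; b < blocks; ++b)
        for (size_t t = 0; t < threads_per_block; ++t)
            if (global_index(b, threads_per_block, t) < n)
                ++valid;
    return valid;
}
```

For n = 1000 and 256 threads per block this gives 4 blocks and 1024 threads, of which 1000 pass the bounds check. Since blocks do not cooperate, the same grid runs unchanged whether the GPU executes 2 blocks at a time or 30.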
NVIDIA CUDA GPU Programming Toolkit
Key Characteristics
❖ Decomposes work into a grid of thread blocks, concurrently executed by a pool of SIMT multiprocessors
❖ Each thread block normally contains 64-512 threads, which are executed by the processing units within a single multiprocessor.
❖ Each SIMT multiprocessor executes a group of threads, known as a “warp”, in lock step.
❖ CUDA kernels are written to handle barrier synchronization efficiently and to order operations among peer threads within a thread block
NVIDIA CUDA GPU Programming Toolkit
Key Characteristics
• When programmers ensure high data locality, CUDA kernels are often an excellent starting point, similar to cache-based multi-core processors.
• Effective use of GPU global memory is required
• From a performance point of view, give the CPU useful work to overlap with asynchronous GPU kernel execution
• Intelligent division of computations - CPU & GPU almost equally occupied. (Optimal balance of workload between CPU & GPU)
• Speedup : it is better to under-load the CPU than to overload it
NVIDIA CUDA GPU Programming Toolkit
Key Characteristics
❖ The expansion of a kernel into a grid of thread blocks is determined by the kernel launch parameters specified at runtime, and may be varied dynamically according to problem size or other attributes
❖ Handles - Data Dependency – Instructions
❖ Handles - Data alignment requirements for high-performance global memory operations in CUDA
❖ The virtualization of processing resources provided by the CUDA programming model allows applications written with existing GPUs to scale up with future hardware designs.
NVIDIA – GPU Computing Products - History
November 2006
GeForce 8800 - No. of programmable shaders : 112
G80 : (16 SMs - Streaming Multiprocessors)
128 CUDA Cores per chip (16 SMs x 8 CUDA Cores per SM)
The cores of an SM share common resources such as local memory, register
files, load/store units and thread schedulers
With time slicing and fast thread switching, an SM can run
thousands of parallel threads on these cores
> 350 Gflops – Single Precision
NVIDIA – GPU Computing Products - History
NVIDIA GeForce 8800 GTX Block Diagram
Number of Stream Processors : 112; Peak : > 300 GFlops
Memory : 512 MB GDDR3; Memory clock freq. : 900 MHz
Memory Bandwidth : 57.6 GB/s; Memory Interface : 256-bit
Interface Type : PCI Express 2.0 x16; DIMM @ 800 MHz
NVIDIA – GPU Computing Products - History
NVIDIA G80 Block Diagram
Total of 128 stream processing units (CUDA Cores) distributed
across sixteen Streaming Multiprocessors (SMs), each with shared
memory, cache and registers
NVIDIA – GPU Computing Products - History
June 2008 : Vital for Scientific and Engineering Programs
GT200 : (30 SMs - Streaming Multiprocessors)
Retained 8 cores per SM, but the number of SMs is increased to 30.
240 CUDA Cores per chip (30 SMs x 8 CUDA Cores per SM)
The cores of an SM share common resources such as local memory, register
files, load/store units and thread schedulers
With time slicing and fast thread switching, an SM can run
thousands of parallel threads on these cores
GTX280, Quadro : support double precision
HPC : Tesla T10, C1060
NVIDIA – GPU Computing Products - History
NVIDIA GT200 GPU Block Diagram
GT200 : Tesla architecture incorporated in Tesla C1060 & S1070 products.
[Figure: GT200 block diagram. Streaming Processor Array of ten Texture Processor Clusters (TPC). Constant Cache: 64 KB, read-only. Texture Unit: 1/2/3-D interpolation; 8 KB spatial cache, read-only. Each Streaming Multiprocessor (SM) has Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, Streaming Processors (SP: ADD, SUB, MAD, etc.), Special Function Units (SFU: SIN, EXP, RSQRT, etc.), and an FP64 Unit (double precision)]
NVIDIA GT200 GPU Features and CUDA Prog Models
The GT200 architecture supports high memory bandwidth
(140 GB/s) and handles global memory latency effectively
Warp : The instruction unit in the streaming multiprocessor
drives the eight scalar streaming processors with a single
instruction stream that is 32 threads wide
SIMT : GT200 & its predecessors implement a
single-instruction multiple-thread (SIMT) instruction unit
Supports double-precision floating-point arithmetic,
improved hardware for coalescing global memory accesses, and
bi-directional overlapping of asynchronous I/O and GPU
kernel execution
NVIDIA GT200 GPU Features and CUDA Prog Models
The CUDA programming model decomposes work into a grid of thread
blocks that are concurrently executed by a pool of SIMT
multiprocessors.
Each thread block normally contains 64-512 threads,
which are executed by the processing units within a single
multiprocessor.
Each SIMT multiprocessor executes a group of threads,
known as a “warp”, in lock step.
CUDA kernels are written as a serial code sequence, with
barrier synchronization to enforce ordering of operations
among peer threads within a thread block.
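The relation between block size and warps is simple arithmetic and worth checking when a block size is chosen. The sketch below assumes the 32-thread warp width stated for GT200; the helper names are hypothetical.

```c
#include <assert.h>

#define WARP_SIZE 32  /* warp width as stated for GT200 */

/* Warps needed to hold one thread block (the last warp may be partly empty). */
static int warps_per_block(int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Lanes of the last warp that carry no thread and so do no useful work. */
static int idle_lanes(int threads_per_block)
{
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block;
}
```

Blocks of 64 and 512 threads occupy 2 and 16 full warps; a block of 100 threads needs 4 warps and leaves 28 lanes idle, which is one reason block sizes are usually chosen as multiples of the warp width.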
NVIDIA GT200 GPU Features and CUDA Prog Models
The expansion of a kernel into a grid of thread blocks is
determined by the kernel launch parameters specified at
runtime, and may be varied dynamically according to the
problem size or other attributes.
The virtualization of processing resources provided by the
CUDA programming model allows applications written with
existing GPUs to scale up with future hardware designs.
The key issue that must be considered when designing
CUDA kernels is the decomposition of work into tiles
that can be mapped to thread blocks, further decomposed
into warps and individual threads.
NVIDIA – GPU Computing Products - History
Tesla C1060 : The NVIDIA® Tesla™ C1060 GPU (PCI Express x16 Interface)
• # of Tesla GPUs : 1
• # of Streaming Processor Cores (CUDA Cores) : 240
• Frequency of processor cores : 1.3 GHz
• Single-precision floating-point performance (peak) : 936 Gflops
• Double-precision floating-point performance (peak) : -
• Floating-Point Precision : IEEE 754 single & double
• Total Dedicated Memory : 4 GB
• Memory Interface : 512-bit
• Memory Bandwidth : 102 GB/sec
• Max Power Consumption : 187.8 W
• System Interface : PCIe x16
• Software Development Tools : C-based CUDA Toolkit
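The peak figures in the specification can be cross-checked with two one-line formulas. The sketch below rests on assumptions that are not stated on the slide: 3 floating-point operations per core per cycle (a multiply-add plus a multiply) and double-data-rate memory clocked at 800 MHz. The function names are hypothetical.

```c
#include <assert.h>

/* Peak GFLOPS = cores x clock (GHz) x floating-point operations per core per cycle. */
static double peak_gflops(int cores, double clock_ghz, double flops_per_cycle)
{
    assert(cores > 0 && clock_ghz > 0.0);
    return cores * clock_ghz * flops_per_cycle;
}

/* Peak bandwidth (GB/s) = bus width in bytes x memory clock (GHz) x transfers per clock. */
static double peak_bandwidth_gbs(int bus_bits, double mem_clock_ghz,
                                 double transfers_per_clock)
{
    assert(bus_bits > 0 && mem_clock_ghz > 0.0);
    return (bus_bits / 8.0) * mem_clock_ghz * transfers_per_clock;
}
```

Under those assumptions, peak_gflops(240, 1.3, 3.0) gives about 936 and peak_bandwidth_gbs(512, 0.8, 2.0) gives about 102.4, matching the single-precision and bandwidth rows. The same bandwidth formula with a 256-bit bus at 900 MHz gives 57.6 GB/s.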
NVIDIA – GPU Computing Products - History
NVIDIA – Tesla S1070
Tesla S1070 : The NVIDIA® Tesla™ S1070 Computing System (dual PCI Express
2.0 cable connections) is a four-teraflop 1U system (one-teraflop processors)
• # of Tesla GPUs : 4
• # of Streaming Processor Cores : 960 (240 per processor)
• Frequency of processor cores : 1.296 to 1.44 GHz
• Single-precision floating-point performance (peak) : 3.73 to 4.14 Tflops
• Double-precision floating-point performance (peak) : 311 to 345 Gflops
• Floating-Point Precision : IEEE 754 single & double
• Total Dedicated Memory : 16 GB
• Memory Interface : 512-bit
• Memory Bandwidth : 408 GB/sec
• Max Power Consumption : 800 W
• System Interface : PCIe x16 or x8
• Software Development Tools : C-based CUDA Toolkit
Fermi - NVIDIA GPU Computing Products
October 2009 : Vital for Scientific and Engineering Programs
32 CUDA Cores per streaming multiprocessor (four times as
many as the GT200 and G80)
Fermi : (Initially 16 SMs - Streaming Multiprocessors)
512 CUDA Cores per chip (16 SMs x 32 CUDA Cores per SM)
More cores boost throughput, but additional enhancements deliver
even more performance
CUDA programming model – performance enhancement with
warps (32 cores are designed to work in parallel on 32
instructions at a time from a bundle of 32 threads)
Fermi - NVIDIA GPU Computing Products
Feature | Nvidia Fermi | Nvidia GT200 | Nvidia G80
Introduction | 2009 | 2008 | 2006
Single-Precision FP | 512 fused multiply-add ops per cycle | 240 multiply-add ops per cycle | 128 multiply-add ops per cycle
Double-Precision FP | 256 fused multiply-add ops per cycle | 30 fused multiply-add ops per cycle | -
IEEE 754-2008 Compliance | Full, SP and DP | DP only | -
Streaming Multiprocessors | 16 | 30 | 16
CUDA Cores per SM | 32 | 8 | 8
Warp Schedulers per SM | 2 | 1 | 1
Special Function Units per SM | 4 | 2 | 2
Shared Memory per SM | 48KB or 16KB (configurable) | 16KB | 16KB
L1 Instruction Cache per SM | Yes | - | -
L1 Data Cache per SM* | 16KB or 48KB (configurable) | - | -
Total CUDA Processor Cores | 512 | 240 | 128
L2 Cache | 768KB, shared | - | -
Error-Correction Codes | DRAM, shared memories, L2 cache, registers | - | -
Concurrent Kernels | Up to 16 | - | -
Fully Predicated ISA | Yes | - | -
Memory Addressing | 40 bits (1TB), unified | 32 bits (4GB), divided | 32 bits (4GB), divided
Memory I/O Interfaces | 6 x 64 bits, 6GB DRAM | 8 x 64 bits, 1GB GDDR3 DRAM | 6 x 64 bits, 768MB GDDR3
C++ Programmable | Yes | - | -
Transistors | 3.0 billion | 1.4 billion | 681 million
Nvidia’s three CUDA-capable GPU architectures : the G80, GT200 and Fermi architectures
*The shared memory and L1 data cache in each streaming multiprocessor share 64KB of SRAM.
GPGPUs / GPU Computing
Software Products
AMD - GPGPUs
AMD – ATI GPGPUs ::Processing Efficiency
AMD Stream SDK Software Development Kit
[Figure: AMD Stream SDK software stack. Applications sit on Compilers (Brook+, OpenCL & other app-specific), Libraries (ACML), 3rd Party Tools (Rapidmind), and Graphics APIs (DirectX, OpenGL); below them is the AMD Runtime, the Compute Abstraction Layer (CAL); at the bottom are AMD Multi-Core CPUs and AMD Stream Processors]
AMD Stream Processing Strategy
Single Programming Environment
AMD GPGPUs : Brook+ Stream Programming Language
❖ Stream Programming Model : extension to the C language for stream programming, originally developed at Stanford University
• Enforce data-parallel computing
• Encourage arithmetic intensity
• Provide fundamental ops for stream computing
❖ Brook+ is an implementation by AMD of the Brook GPU spec on AMD’s Compute Abstraction Layer (CAL) with some enhancements
❖ Asynchronous CPU -> GPU transfers (GPU -> CPU still synchronous)
❖ Linux & Windows environment
AMD – FireStream™ 9250
❖ AMD’s second-generation stream computing product
❖ Single PCI slot
❖ Computational power
• One TFLOPS single-precision float
• 200 GFLOPS double precision
❖ 1 GB GDDR3
❖ 150 Watts → 8 GFLOPS/Watt
❖ Familiar 32- and 64-bit Linux® and Windows® environment
❖ Stream software supports multiple GPUs per system
❖ Brook+ (open-source C-level language & compiler)
• GPU Shader Analyzer
• AMD CodeAnalyst
❖ AMD’s Compute Abstraction Layer (CAL)
OpenCL - Processor Parallelism
[Figure: CPUs (multiple cores driving performance increases; multi-processor programming, e.g. OpenMP) and GPUs (increasingly general-purpose data-parallel computing, improving numerical precision; graphics APIs and shading languages) meet at an emerging intersection: OpenCL, heterogeneous computing]
OpenCL – Open Computing Language
Open, royalty-free standard for portable, parallel programming of
heterogeneous parallel computing CPUs, GPUs, and other processors
Source : Khronos, References [43], [45]
OpenCL Working Group
Source : Khronos, References [43], [45]
Applications -
Performance Issues
Application -1 : Quantum Monte Carlo Simulation using GPUs
Dynamic Cluster Approximation (DCA)
❖ Condensed Matter Physics - the study of high-temperature superconductivity (HTSC)
❖ One approach – formulate the problem on a regular lattice and perform simulations on the lattice.
• Matrix multiplication totally dominates each step, in the Green’s function computation
❖ 90 % of the total runtime is spent within the QMC update step. On the GPU, efficient matrix computations are required.
Application -1 : Quantum Monte Carlo Simulation using GPUs
Dynamic Cluster Approximation (DCA)
❖ Updates (Single/Large) on each Quantum Monte-Carlo (QMC) step
• Approximations are required – to handle complexity
• Large Lattice / Small Lattices
• Large Lattice approximation – Cluster
[Figure: HTSC: 10^23 interacting electrons; 2D Hubbard model for CuO planes; Dynamic cluster approximation: map Hubbard model onto embedded cluster]
❖ Statistical formulation / Green’s function is used in the computation
❖ Computation drivers on GPUs : Vector Product; Dense Complex Matrix Multiplication; Eigenvalue Computation
Application -2 : Multilevel Simulation of Electrostatic Potentials using GPUs
Molecular Dynamics Simulation
❖ Biomolecular systems containing from over 1 million to 100 million atoms; molecular dynamics simulation is required.
❖ Compute the electrostatic potential based on lattice point positions, atomic coordinates, and charges
❖ Long-range & short-range parts on the lattices
❖ Ex : NAMD
[Figure: long-range part: lattice cutoff at each level (GPU); short-range part (GPU + CPU)]
Application -2 : Multilevel Simulation of Electrostatic Potentials using GPUs
❖ Calculate an approximation to the O(N^2) pairwise interactions with O(N) computational work
❖ GPU - Parallelizing the short-range part; parallelizing the lattice cutoff part
❖ Many-to-many communications – GPU : 3-D FFT computations
❖ Multi-Level : Operators - Interpolation; Restriction; Prolongation; Lattice Cut-off; Top-level
[Figure: schematic of the lattice cutoff calculations. A sub-cube of lattice potentials is assigned to each thread block. The radius, measured in lattice points, is the same for each level. Each lattice point potential is the sum of the enclosed distance-weighted lattice charges.]
❖ Efficient data structures for GPU computation – thread cooperation
Application -2 : Multilevel Simulation of Electrostatic Potentials using GPUs
❖ GPU – Arrangement of data in matrix blocks & mapping to thread blocks
[Figure: schematic of the optimal reading of matrix data from GPU constant memory. Each sub-matrix is assigned to a thread block; a sliding block moves across the matrix; sub-block matrix data is stored in GPU shared memory]
Application -3 : Molecular Dynamics
• GPU implementation issues - Mesh Partition / Distribution – Complex; Short Range & Long Range; Efficient Data Structures
• GPU may require Sparse Matrix Computations; FFT Computations
[Figure: molecular dynamics schematic showing bend and stretch terms and non-bonded interactions within the cutoff radius rcut]
Employing a finite cutoff reduces the problem from O(N^2) to O(N)
Application -3 : Molecular Dynamics
GPU implementation issues –
• A schematic 2-D representation of the link-cell neighbour search pattern used in the domain decomposition.
• The thick black lines mark the boundaries between the regions for which different processors are responsible. In this example all the 16 link-cells that are resident on are searched using the pattern shown. In addition, all the particles in the received cell 16 interact with those in 22.
[Figure: grid of numbered link-cells showing domain boundaries, link-cells receiving coordinate data, and link-cells interacting with 2, 4, 5, or 6 neighbours]
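The link-cell idea can be shown in one dimension: particles are binned into cells at least as wide as the cutoff, so each particle is tested only against its own cell and one neighbouring cell. This is a simplified serial sketch, not the scheme of any particular MD code; the names, the capacity limits, and the non-periodic box [0, box) are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CELLS 64       /* hypothetical capacity limits for this sketch */
#define MAX_PARTICLES 256

/* 1 if particles at positions a and b lie closer than the cutoff. */
static int within(double a, double b, double rcut)
{
    double d = a - b;
    if (d < 0.0) d = -d;
    return d < rcut;
}

/* Brute force: test all N*(N-1)/2 pairs. */
static size_t count_pairs_brute(const double *x, size_t n, double rcut)
{
    size_t pairs = 0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            pairs += (size_t)within(x[i], x[j], rcut);
    return pairs;
}

/* Link-cell search on [0, box): cells are at least rcut wide, so interacting
 * particles always sit in the same cell or in adjacent cells. */
static size_t count_pairs_linkcell(const double *x, size_t n, double box, double rcut)
{
    int head[MAX_CELLS];      /* first particle of each cell, -1 if empty */
    int next[MAX_PARTICLES];  /* next particle in the same cell, -1 at end */
    size_t pairs = 0;

    assert(n <= MAX_PARTICLES && rcut > 0.0 && box >= rcut);
    int ncells = (int)(box / rcut);
    if (ncells > MAX_CELLS) ncells = MAX_CELLS;  /* wider cells stay correct */
    double width = box / ncells;

    for (int c = 0; c < ncells; ++c) head[c] = -1;
    for (size_t i = 0; i < n; ++i) {
        int c = (int)(x[i] / width);
        if (c >= ncells) c = ncells - 1;
        next[i] = head[c];
        head[c] = (int)i;
    }
    for (int c = 0; c < ncells; ++c) {
        for (int i = head[c]; i != -1; i = next[i]) {
            /* Pairs inside the same cell, each counted once. */
            for (int j = next[i]; j != -1; j = next[j])
                pairs += (size_t)within(x[i], x[j], rcut);
            /* Pairs with the right-hand neighbour cell. */
            if (c + 1 < ncells)
                for (int j = head[c + 1]; j != -1; j = next[j])
                    pairs += (size_t)within(x[i], x[j], rcut);
        }
    }
    return pairs;
}
```

Both functions return the same pair count, but the link-cell version does work proportional to N when the particle density per cell is bounded, which is the O(N^2) to O(N) reduction a finite cutoff makes possible. In a domain decomposition, the cells on a boundary are the ones whose coordinate data must be received from the neighbouring processor.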
Application -3 : Molecular Dynamics
Force decomposition scheme
GPU implementation issues - A schematic 2-D representation of force decomposition; data structures and block re-arrangement of neighbour information; movement of data from host to GPU
[Figure: force decomposition of atom lists 1 to 16 across 16 processors; arrows indicate transposes between processors]
Application -4 :: Parallel Finite Element
Application -4 Parallel Finite Element
Computations (Unstructured Grids) using GPUs
Computations (Unstructured Grids) using GPUs
GPU implementation Issues – Typical Finite Element Computations – Mesh Partitioning using Metis; Hybrid Computing - MPI & GPU Implementation
• Elements (evenly distributed) / Processor-level nodes / Global nodes (aligned with processor-level nodes as much as possible)
• Mesh Partition/Distribution – Complex
• Sparse Matrix Computations
Source : References
Application -4 :: Parallel Finite Element Computations (Unstructured Grids) using GPUs
GPU implementation of Sparse Matrix Computations in an iterative solver
Parallel Matrix Solution Techniques
• Iterative / Direct Methods
• Ordering of Sparse linear system
• Symbolic Factorization, Numeric Factorization, Solving a triangular system
• Sparse Matrix-Vector Multiplications; Matrix Preconditioning
• Graph partitioning algorithms
[Figure: Domains 1 to 4 of the partitioned problem, shown with thread blocks and their shared memory.]
Source : References
Application -4 :: Parallel Finite Element Computations (Unstructured Grids) using GPUs
Sparse Matrix Computations : Compressed sparse row (CSR) storage, and a basic CSR-based implementation for GPU Programming
The matrix A is stored in three arrays: val (length k), ind (length k) and ptr (length m+1).
[Figure: Thread Block 1 with Shared Memory]
// Basic implementation,
// y <- y + A*x, where A is in CSR
for (i = 0; i < m; ++i) {
    double y0 = y[i];
    for (k = ptr[i]; k < ptr[i+1]; ++k)
        y0 += val[k] * x[ind[k]];
    y[i] = y0;
}
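To make the CSR layout concrete, the following sketch wraps the loop above in a function and applies it to a small 3 x 3 matrix. The function name `csr_spmv` and the example matrix are assumptions chosen for illustration; they do not come from the slides. The matrix is

    [ 4 0 1 ]
    [ 0 3 0 ]
    [ 2 0 5 ]

so val holds the five non-zeros row by row, ind holds their column indices, and ptr[i] marks where row i starts in val.

```c
#include <assert.h>

/* y <- y + A*x for an m-row matrix A stored in CSR form.
   ptr has m+1 entries; val and ind have ptr[m] entries each. */
void csr_spmv(int m, const int *ptr, const int *ind, const double *val,
              const double *x, double *y) {
    assert(m >= 0);
    for (int i = 0; i < m; ++i) {
        double y0 = y[i];
        for (int k = ptr[i]; k < ptr[i + 1]; ++k)
            y0 += val[k] * x[ind[k]];
        y[i] = y0;
    }
}

/* Hypothetical 3 x 3 example matrix in CSR form. */
static const double example_val[] = {4.0, 1.0, 3.0, 2.0, 5.0};
static const int    example_ind[] = {0, 2, 1, 0, 2};
static const int    example_ptr[] = {0, 2, 3, 5};
```

With x = (1, 2, 3) and y initialised to zero, the call `csr_spmv(3, example_ptr, example_ind, example_val, x, y)` gives y = (7, 6, 17). On a GPU, one common mapping (not necessarily the one used in this work) assigns each row i of the outer loop to its own thread.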
Source : References
Application -4 :: Parallel Finite Element Computations (Unstructured Grids) using GPUs
[Figure: Schematic of the optimal reading of sparse matrix data from GPU constant memory. A sub-matrix is assigned to each thread block, with sliding block movement. Sparse matrix data is re-arranged in blocks for ease of computation on GPUs as well as data movement. Sub-block matrix data is stored in GPU shared memory.]
Source : References
Application -5 :: Parallel Finite Difference Computations (structured Grids) using GPUs
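Case Study- Poisson Equation Solver
\[ \frac{\partial U}{\partial t} - \left( \frac{\partial^2 U}{\partial x^2} + \frac{\partial^2 U}{\partial y^2} \right) = f(x,y); \quad \Omega \subseteq \mathbb{R}^2; \; t \in [t_0, t_f] \]
\[ U(x,y,t_0) = g \ \text{on} \ \partial\Omega \]
[Figure: grid partitioned among processes Rank = 0, Rank = 1 and Rank = 2.]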
Source : References
Application -5 :: Parallel Finite Difference Computations (structured Grids) using GPUs
Case Study- Poisson Equation Solver
v Initially, four arrays are required for the computation:
• One to store the old values of U at each grid point, i.e. Uold,
• One to store the new values of U, i.e. Unew, and
• One each for storing the difference between Uold and Unew and for storing the index values of the interior points.
v Memory for these arrays is allocated on the host and they are initialized.
• Four arrays are allocated on the GPU (device).
• The values of the arrays on the host machine are copied into the arrays allocated on the device.
• The solution is computed on the GPU in three steps.
Application -5 :: Parallel Finite Difference Computations (structured Grids) using GPUs
Case Study- Poisson Equation Solver : GPU Implementation
v Step 1: We begin the computation with an initial solution vector Uold and apply the boundary conditions to both the Uold and Unew vectors by setting the boundary values at the corresponding boundary points.
v Step 2: We compute the Unew solution vector at each interior grid point (the values on the right-hand side are taken from Uold)
\[ U_{i,j} = \frac{1}{4} \left( U_{i-1,j} + U_{i,j+1} + U_{i,j-1} + U_{i+1,j} \right) \]
v Step 3: We compute the difference between the Unew and Uold solution vectors and check the convergence criterion to decide whether to stop the computation.
• If the convergence criterion is not satisfied, we assign the Unew solution vector to Uold and repeat the computation until the criterion is satisfied.
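The three steps can be sketched in plain C on the host. This is a minimal serial illustration, not the GPU implementation described in the slides: it omits the index array and the device transfers, assumes an n x n grid stored row-major in 1-D arrays, and uses the maximum absolute difference as the convergence measure. The function name `jacobi_solve` and its parameters are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Jacobi iteration for the 5-point stencil on an n x n grid (row-major).
   Step 1 is the caller's job: boundary values must already be set in
   BOTH u_old and u_new. Returns the number of iterations used, or -1
   if max_iter sweeps did not reach the tolerance. */
int jacobi_solve(double *u_old, double *u_new, int n, double tol, int max_iter) {
    assert(n >= 3 && tol > 0.0);
    for (int it = 1; it <= max_iter; ++it) {
        double max_diff = 0.0;
        /* Step 2: update every interior point from the old values. */
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < n - 1; ++j) {
                int k = i * n + j;
                u_new[k] = 0.25 * (u_old[k - n] + u_old[k + 1] +
                                   u_old[k - 1] + u_old[k + n]);
                double d = u_new[k] - u_old[k];
                if (d < 0.0) d = -d;
                if (d > max_diff) max_diff = d;
            }
        /* Step 3: assign Unew to Uold, then test convergence. */
        memcpy(u_old, u_new, (size_t)n * (size_t)n * sizeof(double));
        if (max_diff < tol) return it;
    }
    return -1;
}
```

For example, on a 5 x 5 grid with the top boundary row set to 1 and all other boundary values set to 0, the centre value converges to 0.25, which follows from the symmetry of the four sides. In a GPU version the double loop of Step 2 is the natural candidate for one thread per grid point, while the copy in Step 3 can be replaced by swapping the two array pointers.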
Application -5 :: Parallel Finite Difference Computations (structured Grids) using GPUs
Case Study- Poisson Equation Solver : GPU Implementation
The CUDA APIs used are given below.
- Allocates memory on the device
cudaMalloc(void** array, size_t size)
- Frees memory allocated on the device
cudaFree(void* array)
- Copies from host to device
cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
- Copies from device to host
cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
- Multiplies the vector x by the scalar alpha, adds the result to the vector y, and stores the result in y
void cublasSaxpy (int n, float alpha, const float *x, int incx, float *y, int incy)
- Finds the smallest index of the maximum-magnitude element of the vector x
int cublasIsamax(int n, const float *x, int incx)
- Copies the vector x to the vector y
void cublasScopy (int n, const float *x, int incx, float *y, int incy)
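One plausible way to combine the three BLAS routines above into the convergence check of Step 3 is: copy Unew into a work vector, subtract Uold with an axpy using alpha = -1, then locate the largest-magnitude entry. The sketch below shows this with plain C stand-ins that mimic the documented semantics on the host. They are not CUBLAS: the names `ref_scopy`, `ref_saxpy`, `ref_isamax` and `max_abs_diff` are hypothetical, only positive increments are handled, and the 1-based return value of `ref_isamax` follows the Fortran BLAS convention.

```c
#include <assert.h>

/* Host-side stand-ins for the BLAS semantics; NOT the CUBLAS library.
   Assumption: incx and incy are positive. */
void ref_scopy(int n, const float *x, int incx, float *y, int incy) {
    for (int i = 0; i < n; ++i) y[i * incy] = x[i * incx];
}

/* y <- alpha * x + y */
void ref_saxpy(int n, float alpha, const float *x, int incx,
               float *y, int incy) {
    for (int i = 0; i < n; ++i) y[i * incy] += alpha * x[i * incx];
}

/* Smallest 1-based index of the element with the largest magnitude;
   returns 0 for n <= 0 or incx <= 0. */
int ref_isamax(int n, const float *x, int incx) {
    if (n <= 0 || incx <= 0) return 0;
    int best = 0;
    float best_abs = x[0] < 0.0f ? -x[0] : x[0];
    for (int i = 1; i < n; ++i) {
        float a = x[i * incx];
        if (a < 0.0f) a = -a;
        if (a > best_abs) { best_abs = a; best = i; }
    }
    return best + 1;
}

/* max_i |u_new[i] - u_old[i]|, using work (length n) as scratch space. */
float max_abs_diff(int n, const float *u_new, const float *u_old, float *work) {
    assert(n > 0);
    ref_scopy(n, u_new, 1, work, 1);          /* work <- u_new        */
    ref_saxpy(n, -1.0f, u_old, 1, work, 1);   /* work <- work - u_old */
    float d = work[ref_isamax(n, work, 1) - 1];
    return d < 0.0f ? -d : d;
}
```

The iteration would stop once `max_abs_diff` falls below the chosen tolerance. With the real library the vectors would reside in device memory, so only the single result of the check needs to travel back to the host.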
Source - NVIDIA
Application -5 :: Parallel Finite Difference Computations (structured Grids) using GPUs
Case Study- Poisson Equation Solver : GPU Implementation
v Use Global 1D Array
v Use block of Grid (Partition)
v Data Transfer from Host to GPU
v Jacobi Method Employed
Performance results for Poisson Equation based on finite difference method on NVIDIA Tesla C1060 *
Computation Time (Seconds) (Penta Diagonal Formulation)
Grid Size (n) (No. of Iterations) | Timings on CPU (Seconds), -O3 used | Timings on GPU (sec)
1000 (1000)                       | 1.39                               | 0.12
8100 (19900)                      | -                                  | 12.59
10000 (21000)                     | -                                  | 18.67
14400 (29000)                     | -                                  | 35.97
GPU : Efficient matrix (grid) decomposition and matrix mapping, which would avoid frequent data transfers between host and GPU, have not been done.
Case Study :: Dense Matrix Computations using GPU & CPU
Matrix Computations - CUDA API – Naive implementation
Multiple Iterations - Multiple Passes as per Application
v The CUDA API supports BLAS calls through CUBLAS
(The overheads incurred in using the GPU can be estimated.)
• Allocation of GPU memory for the matrix inputs
• Transfer of the matrices to the GPU
• Transfer of the result back to the host
• Deallocation of the GPU local memory
v The data transfer overheads (initially, retain the overheads)
v Amortize the matrix allocation / deallocation overheads
(Allocate space in GPU-local memory only once and, when each BLAS call occurs, re-use the same local memory to receive data transfers.)
v Transfer all input matrices to the GPU and all result matrices from the GPU for each BLAS call (initial implementation)
Source - NVIDIA
Case Study :: Dense Matrix Computations using GPU & CPU
• Matrix – Matrix computations – Matrix Data
• GeForce; Tesla C1070, gcc compiler, nvcc Compiler
[Figure: Performance (speed-up) of GPU versus CPU; vertical axis from 1 to 40, horizontal axis matrix size 1K, 2K, 4K, 8K.]
• Optimization of Matrix – Matrix Multiplication algorithms and their performance on Single Core / Multi-Core processors is in progress.
References
1. Randi J. Rost, OpenGL – shading Language, Second Edition, Addison Wesley 2006
2. GPGPU Reference http://www.gpgpu.org
3. NVIDIA http://www.nvidia.com
4. NVIDIA tesla http://www.nvidia.com/object/tesla_computing_solutions.html
5. NVIDIA CUDA Reference http://www.nvidia.com/object/cuda_home.html
6. CUDA sample source code: http://www.nvidia.com/object/cuda_get_samples.html
7. List of NVIDIA GPUs compatible with CUDA: http://www.nvidia.com/object/cuda_learn_products.html
8. Download the CUDA SDK: www.nvidia.com/object/cuda_get.html
9. Specifications of nVIDIA GeForce 8800 GPUs:
10. RAPIDMIND http://www.rapidmind.net
11. Peak Stream - Parallel Processing (Acquired by Google in 2007) http://www.google.com
12. guru3d.com http://www.guru3d.com/news/sandra-2009-gets-gpgpu-support/
ATI & AMD http://ati.amd.com/products/radeon9600/radeon9600pro/index.html
13. AMD http://www.amd.com
14. AMD Stream Processors http://ati.amd.com/products/streamprocessor/specs.html
15. RAPIDMIND & AMD http://www.rapidmind.net/News-Aug4-08-SIGGRAPH.php
16. Merrimac - Stream Architecture; Stanford Brook for GPUs
http://www-graphics.stanford.edu/projects/brookgpu/
17. Stanford : Merrimac - Stream Architecture http://merrimac.stanford.edu/
18. ATI RADEON - AMD http://www.canadacomputers.com/amd/radeon/
19. ATI & AMD - Technology Products http://ati.amd.com/products/index.html
20. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid, by Jeff Bolz, Ian Farmer, Eitan
Grinspun, Peter Schröder, Caltech Report (2003); supported in part by NSF, nVIDIA, etc.
21. Scan Primitives for GPU Computing, by Shubhabrata Sengupta, Mark Harris*, Yao Zhang and John D.
Owens, University of California Davis & *nVIDIA Corporation, Graphics Hardware (2007).
22. Horn D., Stream reduction operations for GPGPU applications, in GPU Gems 2, Pharr M. (Ed.), Addison
Wesley, March 2005, Chapter 36, pp. 573-589.
23. Bolz J., Farmer I., Grinspun E., Schröder P.: Sparse Matrix Solvers on the GPU: Conjugate Gradients
and Multigrid, ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH 2003) 22, 3 (July 2003), pp.
917-924.
24. NVIDIA CUDA Compute Unified Device Architecture - Programming Guide - Version 1.1 November 2007
25. Tom R. Halfhill, Number Crunching with GPUs: PeakStream Math API Exploits Parallelism in Graphics
Processors, October 2006; Microprocessor Report http://www.mdronline.com
26. Tom R. Halfhill, Parallel Processing with CUDA: Nvidia's High-Performance Computing Platform Uses
Massive Multithreading; Microprocessor Report, Volume 22, Archive 1, January 2008
http://www.mdronline.com
27. J. Tölke, M. Krafczyk, Towards three-dimensional teraflop CFD computing on a desktop PC using
graphics hardware, Institute for Computational Modeling in Civil Engineering, TU Braunschweig (2008)
28. I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan, Brook for GPUs:
Stream Computing on Graphics Hardware, ACM Trans. Graph. (SIGGRAPH) 2004
29. Z. Fan, F. Qiu, A.E. Kaufman, S. Yoakum-Stover, GPU Cluster for High Performance Computing, in:
Proceedings of ACM/IEEE Supercomputing Conference 2004, pp. 47-59.
30. J. Krüger, R. Westermann, Linear algebra operators for GPU implementation of numerical algorithms,
ACM Trans. Graph. (SIGGRAPH) 22 (3), pp. 908-916 (2003)
31. Tutorial SC 2007 SC05 : High Performance Computing with CUDA
32. FASTRA http://www.fastra.ua.ac.be/en/faq.html
33. AMD Stream Computing software Stack ; http://www.amd.com
34. BrookGPU : http://graphics.stanford.edu/projects/brookgpu/index.html
35. FFT – Fast Fourier Transform www.fftw.org
36. BLAS – Basic Linear Algebra Subprograms – www.netlib.org/blas
37. LAPACK : Linear Algebra Package – www.netlib.org/lapack
38. Dr. Larry Seiler, Senior Principal Engineer; Larrabee : A Many-core Intel Architecture for Visual
Computing, Intel Developer Forum 2008
39. Tom R Halfhill, Intel’s Larrabee Redefines GPUs – Fully Programmable Many core Processor Reaches
Beyond Graphics, Microprocessor Report September 29, 2008
40. Tom R Halfhill AMD’s Stream Becomes a River – Parallel Processing Platform for ATI GPUs Reaches
More Systems, Microprocessor Report December 2008
41. AMD’s ATI Stream Platform http://www.amd.com/stream
42. General-purpose computing on graphics processing units (GPGPU)
http://en.wikipedia.org/wiki/GPGPU
43. Khronos Group, OpenGL 3, December 2008 URL : http://www.khronos.org/opencl
44. Mary Fetcher and Vivek Sarkar, Introduction to GPGPUs – Seminar on Heterogeneous Processors,
Dept. of computer Science, Rice University, October 2007
45. OpenCL - The open standard for parallel programming of heterogeneous systems URL :
http://www.khronos.org/opencl
46. Tom R. Halfhill, Parallel Processing with CUDA: Nvidia's High-Performance Computing Platform Uses
Massive Multithreading; Microprocessor Report, Volume 22, Archive 1, January 2008 http://www.mdronline.com
47. Matt Pharr (Author), Randima Fernando, GPU Gems 2: Programming Techniques for High-Performance
Graphics and General-Purpose Computation ,Addison Wesley , August 2007
48. NVIDIA GPU Programming Guide http://www.nvidia.com
49. Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, EXOCHI: Architecture and
Programming Environment for A Heterogeneous Multi-core Multithreaded System, PLDI’07
50. Karl E. Hillesland, Anselmo Lastra GPU Floating-Point Paranoia, University of North Carolina at Chapel Hill
51. KARPINSKI, R. 1985. Paranoia: A floating-point benchmark. Byte Magazine 10, 2 (Feb.), 223–235.
52. GPGPU Web site : http://www.gpgpu.org
53. Graphics Processing Unit Architecture (GPU Arch) With a focus on NVIDIA GeForce - 6800 GPU, Ajit Datar, Apurva
Padhye Computer Architecture
54. Nvidia 6800 chapter from GPU Gems 2
http://download.nvidia.com/developer/GPU_Gems_2/GPU_Gems2_ch30.pdf
55. OpenGL design http://graphics.stanford.edu/courses/cs448a-01-fall/design_opengl.pdf
56. OpenGL programming guide (ISBN: 0201604582)
57. Real time graphics architectures lecture notes http://graphics.stanford.edu/courses/cs448a-01-fall/
58. GeForce 256 overview http://www.nvnews.net/reviews/geforce_256/gpu_overviews.html
59. GPU Programming Languages http://www.cis.upenn.edu/~suvenkat/700/
60. Programming the GPU and a brief intro to the OPENGL shading language – Marcel Cohan & VVR Talk
61. Johan Seland, GPU Programming and Computing, Workshop on High-Performance and Parallel
Computing Simula Research Laboratory October 24, 2007
62. Daniel Weiskopf, Basics of GPU-Based Programming, Institute of Visualization and Interactive
Systems, Interactive Visualization of Volumetric Data on Consumer PC Hardware: Basics of Hardware-Based
Programming University of Stuttgart, VIS 2003
63. J.S. Meredith, G. Alvarez, T.A. Maier, T.C. Schulthess and J.S. Vetter; Accuracy and performance of
graphics processors: A quantum Monte Carlo application case study, Parallel Computing Systems &
Applications; Volume 35, Issue No. 3, March 2009, pp. 138-150
64. D.J. Hardy, J.E. Stone and K. Schulten; Multilevel summation of electrostatic potentials using graphics
processing units; Volume 35, Issue No. 3, March 2009, pp. 151-163
65. S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick and J. Demmel, Optimization of sparse matrix-vector
multiplication on emerging multi-core platforms, Volume 35, Issue No. 3, March 2009, pp. 178-194
66. Nathan Bell, Michael Garland, Implementing Sparse Matrix-Vector Multiplication on Throughput-
Oriented Processors NVIDIA Research,
67. N.Bell and M. Garland CUSP : Generic Parallel algorithms for sparse matrix and graph computations
http://code.google.com/p/cusp-library/
68. Looking Beyond Graphics – NVIDIA’s New GPU Architecture Energizes High-Performance Computing
Tom R Halfhill, October 5, 2009, Microprocessor Report
69. NVIDIA – Fermi - http://www.nvidia.com/object/fermi_architecture.html
70. NVIDIA – Fermi – http://www.nvidia.com/fermi
71. "Fermi is the world's first complete GPU computing architecture." Peter Glaskowsky, Technology
Analyst, Envisioneering Group
72. C-DAC OPECG-2009 workshop proceedings (NVIDIA & AMD Presentation)
http://www.cdac.in/opecg2009/
74. C-DAC PEEP-2008 workshop proceedings (NVIDIA & AMD Presentation)
http://www.cdac.in/html/events/beta-test/PEEP-2008-web-page/peep2008-index.html
75. NVIDIA’s Fermi: The first Complete GPU Computing Architecture; A white paper by Peter
N.Glaskowsky (Prepared under Contract with NVIDIA Corporation)
76. http://www.in.stat.com
77. Metis – Graph Partitioning Software – Prof. Vipin Kumar, Prof. George Karypis, Department of
Computer Science, University of Minnesota, Minneapolis.
78. In-Stat White Paper, Looking Beyond Graphics, NVIDIA’s Next-Generation CUDA Compute and
Graphics Architecture Code-Named Fermi, Adds Muscle for Parallel Processing http://www.in.stat.com/
79. Intel Larrabee http://www.intel.com/
80. http://www.product-reviews.net/2009/09/23/intel-larrabee-graphics-processor-first-demo/
81. Michael Mantor, Senior GPU Computer Architect / Fellow, AMD Graphics Product Group, AMD
82. AMD FireStream 9170 & AMD FireStream 9250 Stream Processor :
http://ati.amd.com/products/streamprocessor/specs.html
GPU Computing :: Performance of Scientific Applications
Conclusions
v An overview of GPU Computing
v GPGPUs / GPU Computing Software Products
and Applications - Performance issues
Future Work
v GPU : Implementation of Finite Element / Finite Volume Computations – Sparse Matrix Iterative Conjugate Gradient Solver
Acknowledgements : NVIDIA, AMD, GPU researchers
Source - NVIDIA
Any Questions ?