# CUDA Lecture 1: Introduction to Massively Parallel Computing

Prepared 6/4/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
- A quiet revolution and potential buildup
  - Computation: TFLOPs vs. 100 GFLOPs
  - [Chart: peak performance over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3 GHz Xeon, Duo, Westmere)]
- GPU in every PC – massive volume and potential impact

Introduction to Massively Parallel Processing – Slide 2
- Solve computation-intensive problems faster
- Make infeasible problems feasible
- Reduce design time
- Solve larger problems in the same amount of time
Introduction to Massively Parallel Processing – Slide 4
- Another factor: the modern scientific method
  - [Diagram: Nature → Observation → Theory, tested by both Physical Experimentation and Numerical Simulation]

Introduction to Massively Parallel Processing – Slide 5
- A growing number of applications in science, engineering, business and medicine require computing speeds that cannot be achieved by current conventional computers because
  - the calculations are too computationally intensive,
  - the problems are too complicated to model, or
  - the experiments are too hard, expensive, slow or dangerous for the laboratory.
- Parallel processing is an approach that helps make these computations feasible.
Introduction to Massively Parallel Processing – Slide 6
- One example: grand challenge problems that cannot be solved in a reasonable amount of time with today's computers
  - Modeling large DNA structures
  - Global weather forecasting
  - Modeling motion of astronomical bodies
- Another example: real-time applications, which have time constraints
  - If data are not processed within a certain time, the results become meaningless.
Introduction to Massively Parallel Processing – Slide 7
- The atmosphere is modeled by dividing it into 3-dimensional cells. The calculations for each cell are repeated many times to model the passage of time.
- For example:
  - Suppose the whole global atmosphere is divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high) – about 5 × 10^8 cells.
  - If each calculation requires 200 floating point operations, 10^11 floating point operations are necessary in one time step.
Introduction to Massively Parallel Processing – Slide 8
- Example continued
  - To forecast the weather over 7 days using 1-minute intervals (about 10^4 time steps, hence about 10^15 operations in total), a computer operating at 100 megaflops (10^8 floating point operations per second) would take 10^7 seconds, or over 100 days.
  - To perform the calculation in 10 minutes would require a computer operating at 1.7 teraflops (1.7 × 10^12 floating point operations per second).

Introduction to Massively Parallel Processing – Slide 9
- Two main types
  - Parallel computing: using a computer with more than one processor to solve a problem
  - Distributed computing: using more than one computer to solve a problem
- Motives
  - Usually faster computation
  - Very simple idea: n computers operating simultaneously can achieve the result n times faster
    - It will not be n times faster in practice, for various reasons
  - Other motives include fault tolerance, larger amount of memory available, …

Introduction to Massively Parallel Processing – Slide 10
“... There is therefore nothing new in the idea of parallel
programming, but its application to computers. The
author cannot believe that there will be any insuperable
difficulty in extending it to computers. It is not to be
expected that the necessary programming techniques will
be worked out overnight. Much experimenting remains to
be done. After all, the techniques that are commonly used
in programming today were only won at considerable toil
several years ago. In fact the advent of parallel
programming may do something to revive the pioneering
spirit in programming which seems at the present to be
degenerating into a rather dull and routine occupation …”
- S. Gill, “Parallel Programming”, The Computer Journal, April 1958

Introduction to Massively Parallel Processing – Slide 11
- Applications are typically written from scratch (or manually adapted from a sequential program) assuming a simple model, in a high-level language (e.g. C) with explicit parallel computing primitives.
- Which means
  - Components are difficult to reuse, so similar components get coded again and again
  - Code is difficult to develop, maintain and debug
  - We are not really developing a long-term software solution

Introduction to Massively Parallel Processing – Slide 12
[Diagram: the memory hierarchy – processors and registers at the highest level (small, fast, expensive), then memory levels 1, 2 and 3, with main memory at the lowest level (large, slow, cheap)]
Introduction to Massively Parallel Processing – Slide 13
- Execution speed relies on exploiting data locality
  - Temporal locality: a data item just accessed is likely to be used again in the near future, so keep it in the cache
  - Spatial locality: neighboring data is also likely to be used soon, so load it into the cache at the same time using a 'wide' bus (like a multi-lane motorway)
  - From the programmer's point of view this is all handled automatically – good for simplicity, but maybe not for best performance
Introduction to Massively Parallel Processing – Slide 14
- Useful work (e.g. floating point operations) can only be done at the top of the hierarchy
  - So data stored lower down must be transferred to the registers before we can work on it, possibly displacing other data already there
- This transfer between levels is slow
  - It then becomes the bottleneck in most computations: we spend more time moving data than doing useful work
- The result: speed depends to a large extent on problem size
Introduction to Massively Parallel Processing – Slide 15
- Good algorithm design then requires
  - Keeping active data at the top of the hierarchy as long as possible
  - Minimizing movement between levels
- We need enough work at the top of the hierarchy to mask the transfer time at the lower levels
  - The more processors you have to work with, the larger the problem has to be to accomplish this

Introduction to Massively Parallel Processing – Slide 16
Introduction to Massively Parallel Processing – Slide 17
- Each cycle, the CPU takes data from registers, performs an operation, and puts the result back
- Load/store operations (memory ↔ registers) also take one cycle
- The CPU can do different operations each cycle
- The output of one operation can be the input to the next
- CPUs haven't been this simple for a long time!
Introduction to Massively Parallel Processing – Slide 18
"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer."
- Gordon E. Moore, co-founder, Intel, Electronics Magazine, 4/19/1965

[Photo of Gordon Moore; credit: Wikipedia]

Introduction to Massively Parallel Processing – Slide 19
- Translation: the number of transistors on an integrated circuit doubles every 1½–2 years.

[Chart: transistor count vs. year; data credit: Wikipedia]
Introduction to Massively Parallel Processing – Slide 20
- As the chip size (amount of circuitry) continues to double every 18-24 months, the speed of a basic microprocessor also doubles over that time frame.
- This happens because, as we develop tricks, manufacturers build them into the newest generation of CPUs.

Introduction to Massively Parallel Processing – Slide 21
- Instruction-level parallelism (ILP)
  - Out-of-order execution, speculation, …
  - Vanishing opportunities in a power-constrained world
- Data-level parallelism
  - Vector units, SIMD execution, …
  - Increasing opportunities: SSE, AVX, Cell SPE, Clearspeed, GPU, …
- Thread-level parallelism
  - Increasing opportunities: multithreading, multicore, …
  - Intel Core2, AMD Phenom, Sun Niagara, STI Cell, NVIDIA Fermi, …
Introduction to Massively Parallel Processing – Slide 22
- Most processors have multiple pipelines that can start a number of different operations each cycle.
- For example, consider the Intel Core architecture.

Introduction to Massively Parallel Processing – Slide 23
- Each core in an Intel Core 2 Duo chip has
  - a 14-stage pipeline
  - 3 integer units (ALUs)
  - 1 floating-point addition unit (FPU)
  - 1 floating-point multiplication unit (FPU)
    - FP division is very slow
- In principle each core is capable of producing 3 integer and 2 FP results per cycle

Introduction to Massively Parallel Processing – Slide 24
- Technical challenges
  - A compiler that extracts the best performance, reordering instructions if necessary
  - Out-of-order CPU execution to avoid delays while waiting for data
  - Branch prediction to minimize delays due to conditional branching (loops, if-then-else)
  - A memory hierarchy that delivers data to the registers fast enough to feed the processor
- These all limit the number of pipelines that can be used and increase chip complexity
  - 90% of the Intel chip is devoted to control and data
Introduction to Massively Parallel Processing – Slide 25
[Charts: power and performance trends over time (courtesy Mark Horowitz and Kevin Skadron)]
Introduction to Massively Parallel Processing – Slide 26
- CPU clock speeds have been stuck at about 3 GHz since 2006 due to problems with power consumption (up to 130 W per chip)
- Chip circuitry is still doubling every 18-24 months
  - More on-chip memory and memory management units (MMUs)
  - Specialized hardware (e.g. multimedia, encryption)
  - Multi-core (multiple CPUs on one chip)
- Thus the peak performance of a chip is still doubling every 18-24 months

Introduction to Massively Parallel Processing – Slide 27
[Diagram: a few processors, each with nearby memory, sharing a global memory]

- A handful of processors, each supporting ~1 hardware thread
- On-chip memory near the processors (cache, RAM or both)
- A shared global memory space (external DRAM)
Introduction to Massively Parallel Processing – Slide 28
[Diagram: many processors, each with nearby memory, sharing a global memory]

- Many processors, each supporting many hardware threads
- On-chip memory near the processors (cache, RAM or both)
- A shared global memory space (external DRAM)
Introduction to Massively Parallel Processing – Slide 29
- Four 3.33 GHz cores, each of which can run 2 threads
- Integrated MMU
Introduction to Massively Parallel Processing – Slide 30
- We cannot continue to scale processor frequencies
  - No 10 GHz chips
- We cannot continue to increase power consumption
  - Can't melt the chips
- We can continue to increase transistor density
  - As per Moore's Law

Introduction to Massively Parallel Processing – Slide 31
- Exciting applications in future mass computing markets have traditionally been considered "supercomputing applications"
  - Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, virtual reality products, …
  - These "super apps" represent and model the physical, concurrent world
- Various granularities of parallelism exist, but…
  - The programming model must not hinder parallel implementation
  - Data delivery needs careful management
Introduction to Massively Parallel Processing – Slide 32
- Computers no longer get faster, just wider
- You must rethink your algorithms to be parallel!
- Data-parallel computing is the most scalable solution
  - Otherwise you refactor code for 2 cores, 4 cores, 8 cores, 16 cores, …
  - You will always have more data than cores – build the computation around the data

Introduction to Massively Parallel Processing – Slide 33
- Traditional parallel architectures cover some super apps
  - DSP, GPU, network and scientific applications, …
- The game is to grow mainstream architectures "out" or domain-specific architectures "in"
  - CUDA is the latter

[Diagram: current architecture coverage growing "out" toward new applications, domain-specific architecture coverage growing "in", with obstacles in between]
Introduction to Massively Parallel Processing – Slide 34
Introduction to Massively Parallel Processing – Slide 35
- Massive economies of scale
- Massively parallel

Introduction to Massively Parallel Processing – Slide 36
- Produced in vast numbers for computer graphics
- Increasingly being used for
  - Computer game "physics"
  - Video (e.g. HD video decoding)
  - Audio (e.g. MP3 encoding)
  - Computational finance
  - Oil and gas
  - Medical imaging
  - Computational science

Introduction to Massively Parallel Processing – Slide 37
- The GPU sits on a PCIe graphics card inside a standard PC/server with one or two multicore CPUs

Introduction to Massively Parallel Processing – Slide 38
- Up to 448 cores on a single chip
- Simplified logic (no out-of-order execution, no branch prediction) means much more of the chip is devoted to computation
- Arranged as multiple units, each unit being effectively a vector unit with all of its cores doing the same thing at the same time
- Very high bandwidth (up to 140 GB/s) to graphics memory (up to 4 GB)
- Not general purpose – for parallel applications like graphics and Monte Carlo simulations
- Can also build big clusters out of GPUs
Introduction to Massively Parallel Processing – Slide 39
- Four major vendors:
  - NVIDIA
  - AMD: bought ATI several years ago
  - IBM: co-developed the Cell processor with Sony and Toshiba for the Sony PlayStation, but has now dropped it for high-performance computing
  - Intel: was developing the "Larrabee" chip as a GPU, but has now aimed it at high-performance computing

Introduction to Massively Parallel Processing – Slide 40
- A quiet revolution and potential buildup
  - Computation: TFLOPs vs. 100 GFLOPs
  - [Chart: peak performance over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3 GHz Xeon, Duo, Westmere)]
- GPU in every PC – massive volume and potential impact

Introduction to Massively Parallel Processing – Slide 41
- A quiet revolution and potential buildup
  - Bandwidth: ~10×
  - [Chart: memory bandwidth over time – NVIDIA GPUs (NV30, G70, G80, GT200, T12) vs. Intel CPUs (3 GHz Xeon, Duo)]
- GPU in every PC – massive volume and potential impact

Introduction to Massively Parallel Processing – Slide 42
- High throughput computation
  - GeForce GTX 280: 933 GFLOPs/sec
- High bandwidth memory
  - GeForce GTX 280: 140 GB/sec
- High availability to all
  - 180M+ CUDA-capable GPUs in the wild

Introduction to Massively Parallel Processing – Slide 43
[Chart: NVIDIA GPU transistor counts, 1995–2010 – RIVA 128 (3M), GeForce® 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), "Fermi" (3B)]
Introduction to Massively Parallel Processing – Slide 44
- Throughput is paramount
  - Must paint every pixel within the frame time
  - Scalability
- Create, run and retire lots of threads very rapidly
  - Measured 14.8 Gthreads/sec on an increment() kernel
- Use multithreading to hide latency
  - One stalled thread is OK if 100 are ready to run

Introduction to Massively Parallel Processing – Slide 45
- Different goals produce different designs
  - The GPU assumes the workload is highly parallel
  - The CPU must be good at everything, parallel or not
- CPU: minimize the latency experienced by one thread
  - Big on-chip caches, sophisticated control logic
- GPU: maximize the throughput of all threads
  - The number of threads in flight is limited by resources → lots of resources (registers, bandwidth, etc.)
  - Multithreading can hide latency → skip the big caches
  - Control logic is shared across many threads

Introduction to Massively Parallel Processing – Slide 46
- GeForce 8800 (2007)

[Diagram: host and input assembler feeding many processing units, each with a parallel data cache and texture unit, all sharing global memory]

Introduction to Massively Parallel Processing – Slide 47
- 16 highly threaded streaming multiprocessors (SMs)
- More than 128 floating point units (FPUs)
- 367 GFLOPs peak performance (25-50 times that of current high-end microprocessors)
- 265 GFLOPs sustained for applications such as visual molecular dynamics (VMD)
- 768 MB DRAM
- 86.4 GB/sec memory bandwidth
- 4 GB/sec bandwidth to CPU

Introduction to Massively Parallel Processing – Slide 48
- Massively parallel: 128 cores, 90 watts
- 30-100 times speedup over high-end microprocessors on scientific and media applications
  - Medical imaging, molecular dynamics, …

Introduction to Massively Parallel Processing – Slide 49
- Fermi GF100 (2010)
  - ~1.5 TFLOPs (single precision), ~800 GFLOPs (double precision)
  - 230 GB/sec DRAM bandwidth

Introduction to Massively Parallel Processing – Slide 50
Introduction to Massively Parallel Processing – Slide 51
- 32 CUDA cores per SM (512 total)
  - 8× the peak FP64 performance (50% of peak FP32 performance)
- Global memory: the usual linear sequence of bytes, with high bandwidth (hundreds of GB/sec)
- 64 KB of fast, on-chip RAM per SM
  - Software or hardware managed
  - Shared amongst the CUDA cores

Introduction to Massively Parallel Processing – Slide 52
- SIMT (Single Instruction Multiple Thread) execution
  - Threads run in groups of 32 called warps: a minimum of 32 threads all doing the same thing at (almost) the same time
  - Threads in a warp share an instruction unit (IU): all cores execute the same instructions simultaneously, but with different data
  - Similar to vector computing on CRAY supercomputers
  - Hardware automatically handles divergence
- Natural for graphics processing and much scientific computing; also a natural choice for many-core chips to simplify each core
Introduction to Massively Parallel Processing – Slide 53
- Hardware resource allocation and thread scheduling
  - The hardware relies on threads to hide latency
  - Threads have all the resources needed to run
  - Any warp not waiting for something can run
  - Context switching is (basically) free

Introduction to Massively Parallel Processing – Slide 54
- Lots of active threads is the key to high performance:
  - No "context switching"; each thread has its own registers, which limits the number of active threads
  - Threads on each SM execute in warps – execution alternates between "active" warps, with warps becoming temporarily "inactive" when waiting for data
  - For each thread, one operation completes long before the next starts – this avoids the complexity of pipeline overlaps which can limit the performance of modern processors
  - Memory access from device memory has a latency of 400-600 cycles; with 40 threads this is equivalent to 10-15 operations, so hopefully there's enough computation to hide the latency
Introduction to Massively Parallel Processing – Slide 55
- Intel Core 2 / Xeon / i7
  - 4-6 MIMD (Multiple Instruction / Multiple Data) cores
  - Each core operates independently; each can be working with a different code, performing different operations with entirely different data
  - Few registers, multilevel caches
  - 10-30 GB/s bandwidth to main memory

Introduction to Massively Parallel Processing – Slide 56
- NVIDIA GTX 280:
  - 240 cores, arranged as 30 units, each with 8 SIMD (Single Instruction / Multiple Data) cores
    - All cores execute the same instruction at the same time, but work on different data
    - Only one instruction decoder is needed to control all the cores
    - Functions like a vector unit
  - Lots of registers, almost no cache
  - 5 GB/s bandwidth to the host processor (PCIe x16 gen 2)
  - 140 GB/s bandwidth to graphics memory

Introduction to Massively Parallel Processing – Slide 57
- One issue with SIMD / vector execution: what happens with if-then-else branches?
- Standard vector treatment: execute both branches, but only store the results for the correct branch.
- NVIDIA treatment: works with warps of length 32. If a particular warp only goes one way, then that is all that is executed.

Introduction to Massively Parallel Processing – Slide 58
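A minimal sketch of how this plays out in a CUDA kernel (the kernel name and data are illustrative, not from the slides; compiling and running requires nvcc and a GPU). If every thread in a warp takes the same side of a branch, only that side is executed; if the warp splits, the hardware runs both sides in turn with the inactive threads masked off.

```cuda
// Illustrative kernel: each thread handles one array element.
__global__ void abs_or_double(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Data-dependent branch. If all 32 threads of a warp agree,
    // only one side is executed. If they disagree, the warp
    // executes both sides, with non-participating threads masked.
    if (x[i] < 0.0f)
        x[i] = -x[i];          // taken only by threads with negative x[i]
    else
        x[i] = 2.0f * x[i];

    // By contrast, branching on something uniform across the warp
    // (e.g. blockIdx.x) never diverges: the whole warp goes one way.
}
```

Sorting or partitioning the data so that warps tend to agree is a common way to reduce the cost of divergence.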
- How do NVIDIA GPUs deal with the same architectural challenges as CPUs?
- Heavy multithreading, with many threads per core:
  - Solves the pipeline problem, because the output of one calculation can be used as an input for the next calculation of the same thread
  - The GPU switches from one set of threads to another when waiting for data from graphics memory

Introduction to Massively Parallel Processing – Slide 59
Introduction to Massively Parallel Processing – Slide 60
- Scalable programming model
- Minimal extensions to a familiar C/C++ environment
- Heterogeneous serial-parallel computing

Introduction to Massively Parallel Processing – Slide 61
[Figure: reported application speedups on GPUs – 45×, 17×, 100×, 110-240×, 13-457×, 35×]
Introduction to Massively Parallel Processing – Slide 62
- The big breakthrough in GPU computing has been NVIDIA's development of the CUDA programming environment
  - Initially driven by the needs of computer game developers
  - Now being driven by new markets (e.g. HD video decoding)
- C plus some extensions and some C++ features (e.g. templates)

Introduction to Massively Parallel Processing – Slide 63
- Host code runs on the CPU, CUDA code runs on the GPU
- Explicit movement of data across the PCIe connection
- Very straightforward for Monte Carlo applications, once you have a random number generator
- Harder for finite difference applications

Introduction to Massively Parallel Processing – Slide 64
- At the top level, we have a master process which runs on the CPU and performs the following steps:
  1. initialize card
  2. allocate memory on the host and on the device
  3. repeat as needed:
     1. copy data from host to device memory
     2. launch multiple copies of the execution "kernel" on the device
     3. copy data from device memory back to the host
  4. de-allocate all memory and terminate

Introduction to Massively Parallel Processing – Slide 65
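The steps above can be sketched in CUDA host code. This is a minimal sketch with a hypothetical `scale` kernel and no error checking; it needs nvcc and a CUDA-capable GPU to run.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *x, int n)      // the execution "kernel"
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);    // step 2: host memory
    for (int i = 0; i < n; i++) h_x[i] = 1.0f;

    float *d_x;                             // step 2: device memory
    cudaMalloc((void **)&d_x, bytes);       // (step 1: the card is
                                            //  initialized implicitly
                                            //  on the first CUDA call)

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // step 3.1
    scale<<<(n + 255) / 256, 256>>>(d_x, n);               // step 3.2
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   // step 3.3

    printf("h_x[0] = %f\n", h_x[0]);

    cudaFree(d_x);                          // step 4: clean up
    free(h_x);
    return 0;
}
```

The `<<<blocks, threads>>>` launch starts one copy of the kernel per block; here 4096 blocks of 256 threads cover the 2^20 elements.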
- At a lower level, within the GPU:
  - Each copy of the execution kernel executes on an SM
  - If the number of copies exceeds the number of SMs, then more than one will run at a time on each SM if there are enough registers and shared memory; the others will wait in a queue and execute later
  - All threads within one copy can access local shared memory, but can't see what the other copies are doing (even if they are on the same SM)
  - There are no guarantees on the order in which the copies will execute

Introduction to Massively Parallel Processing – Slide 66
- Augment C/C++ with minimalist abstractions
  - Let programmers focus on parallel algorithms, not the mechanics of a parallel programming language
- Provide a straightforward mapping onto hardware
  - Good fit to GPU architecture
  - Maps well to multi-core CPUs too
- Scale to 100s of cores and 10,000s of parallel threads
  - GPU threads are lightweight – create/switch is free
  - The GPU needs 1000s of threads for full utilization

Introduction to Massively Parallel Processing – Slide 67
- Lightweight synchronization primitives
- Shared memory model for cooperating threads

Introduction to Massively Parallel Processing – Slide 68
- All threads execute the same sequential program

[Diagram: threads t0, t1, …, tB grouped into a block]

- Threads in the same block can cooperate

Introduction to Massively Parallel Processing – Slide 69
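A sketch of what "threads in the same block can cooperate" means in practice: a hypothetical kernel in which the threads of one block sum their values through on-chip shared memory, synchronizing with `__syncthreads()`. Illustrative only; requires nvcc and a GPU.

```cuda
// Each block of B threads cooperates through shared memory to
// produce one partial sum. Threads in different blocks cannot
// see each other's shared memory.
#define B 256

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float buf[B];                 // visible to this block only

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // wait until all loads finish

    // Tree reduction within the block: halve the active threads
    // each step, synchronizing between steps.
    for (int stride = B / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)                    // thread t0 writes the result
        out[blockIdx.x] = buf[0];
}
```

Launched as `block_sum<<<numBlocks, B>>>(in, out, n)`, this leaves one partial sum per block, which the host (or a second kernel) can then combine.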
- CUDA virtualizes the physical hardware
  - A thread is a virtualized scalar processor: registers, PC, state
  - A block is a virtualized multiprocessor: threads, shared memory
- Scheduled onto physical hardware without pre-emption
  - Threads and blocks launch and run to completion
  - Blocks should be independent

Introduction to Massively Parallel Processing – Slide 70
[Diagram: many blocks, each with its own memory, sharing a global memory]

Introduction to Massively Parallel Processing – Slide 71
- Cf. the PRAM (Parallel Random Access Machine) model

[Diagram: processors all accessing one global memory]

- Global synchronization isn't cheap
- Global memory access times are expensive

Introduction to Massively Parallel Processing – Slide 72
- Cf. the BSP (Bulk Synchronous Parallel) model, MPI

[Diagram: processor-memory pairs connected by an interconnection network]

- Distributed computing is a different setting

Introduction to Massively Parallel Processing – Slide 73
[Diagram: multicore CPU vs. manycore GPU]

Introduction to Massively Parallel Processing – Slide 74
- Philosophy: provide the minimal set of extensions necessary to expose the power
  - Function and variable qualifiers
  - Execution configuration
  - Built-in variables and functions valid in device code

Introduction to Massively Parallel Processing – Slide 75
Introduction to Massively Parallel Processing – Slide 76
| Application | Description | Source | Kernel | % time |
|---|---|---|---|---|
| H.264 | SPEC '06 version, change in guess vector | 34,811 | 194 | 35% |
| LBM | SPEC '06 version, change to single precision and print fewer reports | 1,481 | 285 | >99% |
| RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99% |
| FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99% |
| RPES | Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion | 1,104 | 281 | 99% |
| PNS | Petri Net simulation of a distributed system | 322 | 160 | >99% |
| SAXPY | Single-precision implementation of saxpy, used in Linpack's Gaussian elim. routine | 952 | 31 | >99% |
| TPACF | Two Point Angular Correlation Function | 536 | 98 | 96% |
| FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16% |
| MRI-Q | Computing a matrix Q, a scanner's configuration in MRI reconstruction | 490 | 33 | >99% |

Introduction to Massively Parallel Processing – Slide 77
- GeForce 8800 GTX vs. 2.2 GHz Opteron 248

Introduction to Massively Parallel Processing – Slide 78
- A 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
- 25× to 400× speedup if the function's data requirements and control flow suit the GPU and the application is optimized

Introduction to Massively Parallel Processing – Slide 79
- Parallel hardware is here to stay
- GPUs are massively parallel manycore processors
  - Easily available and fully programmable
- Parallelism and scalability are crucial for success
- This presents many important research challenges
  - Not to mention the educational challenges

Introduction to Massively Parallel Processing – Slide 80
- Reading: Chapter 1, "Programming Massively Parallel Processors" by Kirk and Hwu.
- Based on original material from
  - The University of Akron: Tim O'Neil, Kathy Liszka
  - Hiram College: Irena Lomonosov
  - The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
  - The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
  - Oregon State University: Michael Quinn
  - Oxford University: Mike Giles
  - Stanford University: Jared Hoberock, David Tarjan
- Revision history: last updated 6/4/2011.
Introduction to Massively Parallel Processing – Slide 81
