CUDA Lecture 1: Introduction to Massively Parallel Computing

Prepared 6/4/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
 A quiet revolution and potential buildup
   Computation: TFLOPs vs. 100 GFLOPs
   [Figure: peak performance over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3GHz Core2 Duo, 3GHz Xeon Quad, Westmere)]
   GPU in every PC – massive volume and potential impact

                                        Introduction to Massively Parallel Processing – Slide 2
• Solve computation-intensive problems faster
  • Make infeasible problems feasible
  • Reduce design time

• Solve larger problems in the same amount of time
  • Improve an answer’s precision
  • Reduce design time

• Gain competitive advantage


                                Introduction to Massively Parallel Processing – Slide 4
• Another factor: modern scientific method
   [Diagram: the modern scientific method – observation of Nature informs Theory, which is tested by both Physical Experimentation and Numerical Simulation]
                               Introduction to Massively Parallel Processing – Slide 5
 A growing number of applications in science, engineering, business and medicine require computing speeds that cannot be achieved by today’s conventional computers because
  • the calculations are too computationally intensive,
  • the systems are too complicated to model, or
  • the experiments are too hard, expensive, slow or dangerous for the laboratory.
 Parallel processing is an approach that helps make these computations feasible.
                                 Introduction to Massively Parallel Processing – Slide 6
 One example: grand challenge problems that cannot
 be solved in a reasonable amount of time with today’s
 computers.
  •   Modeling large DNA structures
  •   Global weather forecasting
  •   Modeling motion of astronomical bodies
 Another example: real-time applications which have time constraints.
   If the data are not processed within the time constraint, the results become meaningless.
                                   Introduction to Massively Parallel Processing – Slide 7
 The atmosphere is modeled by dividing it into three-dimensional cells; the calculations for each cell are repeated many times to model the passage of time.
 For example
   Suppose the whole global atmosphere is divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high) – about 5 × 10^8 cells.
   If each cell’s calculation requires 200 floating point operations, 10^11 floating point operations are necessary in one time step.

                                     Introduction to Massively Parallel Processing – Slide 8
 Example continued
   To forecast the weather over 10 days using 10-minute intervals, a computer operating at 100 megaflops (10^8 floating point operations per second) would take 10^7 seconds, or over 100 days.
   To perform the calculation in 10 minutes would require a computer operating at 1.7 teraflops (1.7 × 10^12 floating point operations per second).
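
A quick sanity check of the per-step arithmetic (a hypothetical C sketch, not part of the original slides; the full-forecast figures above then follow from the number of time steps):

    /* Check the per-time-step numbers from the weather example. */
    #include <stdio.h>

    int main(void)
    {
        double cells          = 5e8;    /* about 5 x 10^8 cells              */
        double flops_per_cell = 200.0;  /* operations per cell per time step */
        double machine_rate   = 1e8;    /* 100 megaflops                     */

        double flops_per_step = cells * flops_per_cell;
        printf("operations per time step: %g\n", flops_per_step);     /* 1e11   */
        printf("seconds per step at 100 megaflops: %g\n",
               flops_per_step / machine_rate);                        /* 1000 s */
        return 0;
    }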




                                   Introduction to Massively Parallel Processing – Slide 9
 Two main types
   Parallel computing: Using a computer with more than one
    processor to solve a problem.
   Distributed computing: Using more than one computer to
    solve a problem.
 Motives
  • Usually faster computation
     • Very simple idea – that n computers operating
       simultaneously can achieve the result n times faster
      • In practice it will not be n times faster, for various reasons.

  • Other motives include: fault tolerance, larger amount of memory available, …

                                         Introduction to Massively Parallel Processing – Slide 10
“... There is therefore nothing new in the idea of parallel
    programming, but its application to computers. The
    author cannot believe that there will be any insuperable
    difficulty in extending it to computers. It is not to be
    expected that the necessary programming techniques will
    be worked out overnight. Much experimenting remains to
    be done. After all, the techniques that are commonly used
    in programming today were only won at considerable toil
    several years ago. In fact the advent of parallel
    programming may do something to revive the pioneering
    spirit in programming which seems at the present to be
    degenerating into a rather dull and routine occupation …”
                - S. Gill, “Parallel Programming”, The Computer Journal, April 1958


                                            Introduction to Massively Parallel Processing – Slide 11
 Applications are typically written from scratch (or manually adapted from a sequential program) assuming a simple model, in a high-level language (e.g. C) with explicit parallel computing primitives.
 Which means
   Components difficult to reuse so similar components
    coded again and again
   Code difficult to develop, maintain or debug
   Not really developing a long-term software solution




                                 Introduction to Massively Parallel Processing – Slide 12
 [Diagram: the memory hierarchy]
   Highest level: processors and registers – small, fast, expensive
   Memory level 1
   Memory level 2
   Memory level 3
   Lowest level: main memory – large, slow, cheap
                               Introduction to Massively Parallel Processing – Slide 13
 Execution speed relies on exploiting data locality
   temporal locality: a data item just accessed is likely to be
    used again in the near future, so keep it in the cache
   spatial locality: neighboring data is also likely to be used
    soon, so load them into the cache at the same time using
    a ‘wide’ bus (like a multi-lane motorway)
   from a programmer’s point of view, all of this is handled automatically – good for simplicity but maybe not for best performance
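
For illustration (a hypothetical C sketch, not from the slides): traversing a matrix in the order it is laid out in memory exploits spatial locality, while striding across rows defeats it.

    /* C stores a matrix row by row, so the row-major loop below touches
       memory sequentially and reuses each cache line; the column-major
       loop jumps N doubles between accesses and misses far more often. */
    #include <stddef.h>

    #define N 4096

    double sum_row_major(const double a[N][N])
    {
        double sum = 0.0;
        for (size_t i = 0; i < N; i++)      /* good: consecutive addresses    */
            for (size_t j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    double sum_col_major(const double a[N][N])
    {
        double sum = 0.0;
        for (size_t j = 0; j < N; j++)      /* bad: stride of N between reads */
            for (size_t i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }

Both functions compute the same result; on typical hardware the first is several times faster purely because of cache behavior.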



                                   Introduction to Massively Parallel Processing – Slide 14
 Useful work (i.e. floating point ops) can only be done
 at the top of the hierarchy.
   So data stored lower must be transferred to the registers
    before we can work on it.
      Possibly displacing other data already there.
   This transfer among levels is slow.
      This then becomes the bottleneck in most computations.
      Spend more time moving data than doing useful work.

 The result: speed depends to a large extent on problem
 size.


                                      Introduction to Massively Parallel Processing – Slide 15
 Good algorithm design then requires
   Keeping active data at the top of the hierarchy as long as
    possible.
   Minimizing movement between levels.
 Need enough work to do at the top of the hierarchy to
 mask the transfer time at the lower levels.
   The more processors you have to work with, the larger
    the problem has to be to accomplish this.




                                  Introduction to Massively Parallel Processing – Slide 16
 Each cycle, CPU takes data from registers, does an
  operation, and puts the result back
 Load/store operations (memory ↔ registers) also take one cycle
 CPU can do different operations each cycle
 Output of one operation can be input to next




 CPUs haven’t been this simple for a long time!
                                Introduction to Massively Parallel Processing – Slide 18
“The complexity for minimum component
  costs has increased at a rate of roughly a
  factor of two per year... Certainly over the
  short term this rate can be expected to
  continue, if not to increase. Over the
  longer term, the rate of increase is a bit
  more uncertain, although there is no
  reason to believe it will not remain nearly
  constant for at least 10 years. That means
  by 1975, the number of components per
  integrated circuit for minimum cost will be
  65,000. I believe that such a large circuit
  can be built on a single wafer.”
              - Gordon E. Moore, co-founder, Intel
               Electronics Magazine, 4/19/1965
   [Photo credit: Wikipedia]




                                                 Introduction to Massively Parallel Processing – Slide 19
 Translation: The number of transistors on an integrated
  circuit doubles every 1½ - 2 years.




   [Figure: transistor counts over time; data credit: Wikipedia]
                                    Introduction to Massively Parallel Processing – Slide 20
 As the chip size (amount of circuitry) continues to
  double every 18-24 months, the speed of a basic
  microprocessor also doubles over that time frame.
 This happens because, as we develop tricks, they are
  built into the newest generation of CPUs by the
  manufacturers.




                                Introduction to Massively Parallel Processing – Slide 21
 Instruction-level parallelism (ILP)
   Out-of-order execution, speculation, …
   Vanishing opportunities in a power-constrained world
 Data-level parallelism
   Vector units, SIMD execution, …
   Increasing opportunities; SSE, AVX, Cell SPE,
    Clearspeed, GPU, …
 Thread-level parallelism
   Increasing opportunities; multithreading, multicore, …
   Intel Core2, AMD Phenom, Sun Niagara, STI Cell,
    NVIDIA Fermi, …
                                 Introduction to Massively Parallel Processing – Slide 22
 Most processors have multiple pipelines for different tasks and can start a number of different operations each cycle.
 For example consider the Intel Core architecture.




                              Introduction to Massively Parallel Processing – Slide 23
 Each core in an Intel Core 2 Duo chip has
   14 stage pipeline
   3 integer units (ALU)
   1 floating-point addition unit (FPU)
   1 floating-point multiplication unit (FPU)
       FP division is very slow
   2 load/store units
   In principle capable of producing 3 integer and 2 FP
    results per cycle



                                 Introduction to Massively Parallel Processing – Slide 24
 Technical challenges
   Compiler to extract best performance, reordering
    instructions if necessary
   Out-of-order CPU execution to avoid delays waiting for
    read/write or earlier operations
   Branch prediction to minimize delays due to conditional
    branching (loops, if-then-else)
   Memory hierarchy to deliver data to registers fast
    enough to feed the processor
 These all limit the number of pipelines that can be
 used and increase the chip complexity
   90% of Intel chip devoted to control and data
                                 Introduction to Massively Parallel Processing – Slide 25
 [Figure: power vs. performance across processor designs (courtesy Mark Horowitz and Kevin Skadron)]
                Introduction to Massively Parallel Processing – Slide 26
 CPU clock stuck at about 3 GHz since 2006 due to
  problems with power consumption (up to 130W per
  chip)
 Chip circuitry still doubling every 18-24 months
   More on-chip memory and memory management units
    (MMUs)
   Specialized hardware (e.g. multimedia, encryption)
   Multi-core (multiple CPUs on one chip)
 Thus peak performance of chip still doubling every 18-
 24 months

                                Introduction to Massively Parallel Processing – Slide 27
 [Diagram: a few processor + memory pairs above a shared global memory]
 Handful of processors each supporting ~1 hardware
  thread
 On-chip memory near processors (cache, RAM or
  both)
 Shared global memory space (external DRAM)
                                   Introduction to Massively Parallel Processing – Slide 28
 [Diagram: many processor + memory pairs (…) above a shared global memory]
 Many processors each supporting many hardware
  threads
 On-chip memory near processors (cache, RAM or
  both)
 Shared global memory space (external DRAM)
                                   Introduction to Massively Parallel Processing – Slide 29
 Four 3.33 GHz cores, each of which can run 2 threads
 Integrated MMU
                               Introduction to Massively Parallel Processing – Slide 30
 Cannot continue to scale processor frequencies
   No 10 GHz chips


 Cannot continue to increase power consumption
   Can’t melt chips


 Can continue to increase transistor density
   As per Moore’s Law




                                Introduction to Massively Parallel Processing – Slide 31
 Exciting applications in future mass computing
  market have been traditionally considered
  “supercomputing applications”
   Molecular dynamic simulation, video and audio coding
    and manipulation, 3D imaging and visualization,
    consumer game physics, virtual reality products, …
    These “super apps” represent and model the physical, concurrent world
 Various granularities of parallelism exist, but…
   Programming model must not hinder parallel
    implementation
   Data delivery needs careful management
                                 Introduction to Massively Parallel Processing – Slide 32
 Computers no longer get faster, just wider


 You must rethink your algorithms to be parallel!


 Data-parallel computing is the most scalable solution
   Otherwise: refactor code for 2 cores, 4 cores, 8 cores, 16
    cores, …
   You will always have more data than cores – build the
    computation around the data


                                   Introduction to Massively Parallel Processing – Slide 33
 Traditional parallel architectures cover some super
 apps
   DSP, GPU, network and scientific applications, …
 The game is to grow mainstream architectures “out” or
 domain-specific architectures “in”
   CUDA is the latter
   [Figure: traditional applications sit within current architecture coverage, new applications within domain-specific architecture coverage, with obstacles between the two]
                                 Introduction to Massively Parallel Processing – Slide 34
 Massive economies of scale
 Massively parallel




                               Introduction to Massively Parallel Processing – Slide 36
 Produced in vast numbers for computer graphics
 Increasingly being used for
   Computer game “physics”
   Video (e.g. HD video decoding)
   Audio (e.g. MP3 encoding)
   Multimedia (e.g. Adobe software)
   Computational finance
   Oil and gas
   Medical imaging
   Computational science


                                Introduction to Massively Parallel Processing – Slide 37
 The GPU sits on a PCIe graphics card inside a standard
 PC/server with one or two multicore CPUs




                               Introduction to Massively Parallel Processing – Slide 38
 Up to 448 cores on a single chip
 Simplified logic (no out-of-order execution, no branch
    prediction) means much more of the chip is devoted
    to computation
   Arranged as multiple units with each unit being
    effectively a vector unit, all cores doing the same thing
    at the same time
   Very high bandwidth (up to 140GB/s) to graphics
    memory (up to 4GB)
   Not general purpose – for parallel applications like
    graphics and Monte Carlo simulations
   Can also build big clusters out of GPUs
                                   Introduction to Massively Parallel Processing – Slide 39
 Four major vendors:
   NVIDIA
   AMD: bought ATI several years ago
   IBM: co-developed Cell processor with Sony and
    Toshiba for Sony Playstation, but now dropped it for
    high-performance computing
   Intel: was developing “Larrabee” chip as GPU, but now
    aimed for high-performance computing




                                Introduction to Massively Parallel Processing – Slide 40
 A quiet revolution and potential buildup
   Computation: TFLOPs vs. 100 GFLOPs
   [Figure: peak performance over time – NVIDIA NV30, NV40, G70, G80, GT200, T12 vs. Intel 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere]
   GPU in every PC – massive volume and potential impact

                                       Introduction to Massively Parallel Processing – Slide 41
 A quiet revolution and potential buildup
   Bandwidth: ~10x
   [Figure: memory bandwidth over time – NVIDIA NV30, NV40, G70, G80, GT200, T12 vs. Intel 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere]
   GPU in every PC – massive volume and potential impact

                                        Introduction to Massively Parallel Processing – Slide 42
 High throughput computation
   GeForce GTX 280: 933 GFLOPs/sec
 High bandwidth memory
   GeForce GTX 280: 140 GB/sec
 High availability to all
   180M+ CUDA-capable GPUs in the wild




                                  Introduction to Massively Parallel Processing – Slide 43
 [Figure: NVIDIA GPU transistor counts, 1995-2010 – RIVA 128 (3M), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), “Fermi” (3B)]
                                                      Introduction to Massively Parallel Processing – Slide 44
 Throughput is paramount
     Must paint every pixel within frame time
     Scalability


 Create, run and retire lots of threads very rapidly
     Measured 14.8 Gthreads/sec on increment() kernel

 Use multithreading to hide latency
     One stalled thread is o.k. if 100 are ready to run



                                     Introduction to Massively Parallel Processing – Slide 45
 Different goals produce different designs
   GPU assumes work load is highly parallel
   CPU must be good at everything, parallel or not
 CPU: minimize latency experienced by one thread
   Big on-chip caches, sophisticated control logic
 GPU: maximize throughput of all threads
   Number of threads in flight limited by resources → lots of resources (registers, bandwidth, etc.)
   Multithreading can hide latency → skip the big caches
   Share control logic across many threads


                                 Introduction to Massively Parallel Processing – Slide 46
 GeForce 8800 (2007)
   [Diagram: host feeds an input assembler and thread execution manager, which dispatch work to eight processor groups, each with its own parallel data cache and texture unit, all connected through load/store units to global memory]
                                                                         Introduction to Massively Parallel Processing – Slide 47
 16 highly threaded streaming multiprocessors (SMs)
 > 128 floating point units (FPUs)
 367 GFLOPs peak performance (25-50 times that of
 current high-end microprocessors)
   265 GFLOPs sustained for applications such as visual
    molecular dynamics (VMD)
 768 MB DRAM
 86.4 GB/sec memory bandwidth
 4 GB/sec bandwidth to CPU


                                Introduction to Massively Parallel Processing – Slide 48
 Massively parallel, 128 cores, 90 watts
 Massively threaded, sustains 1000s of threads per
  application
 30-100 times speedup over high-end microprocessors
  on scientific and media applications
   Medical imaging, molecular dynamics, …




                                 Introduction to Massively Parallel Processing – Slide 49
 Fermi GF100 (2010)




   ~ 1.5 TFLOPs (single precision), ~800 GFLOPs (double
    precision)
   230 GB/sec DRAM bandwidth

                                Introduction to Massively Parallel Processing – Slide 50
 32 CUDA cores per SM (512 total)
 8x peak FP64 performance
   50% of peak FP32 performance
 Direct load/store to memory
   Usual linear sequence of bytes
   High bandwidth (hundreds of GB/sec)
 64K of fast, on-chip RAM
   Software or hardware managed
   Shared amongst CUDA cores
   Enables thread communication
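
As a sketch of how threads use that on-chip RAM from CUDA (a hypothetical example, not from the slides, assuming 256-thread blocks): threads in a block stage data in __shared__ memory and synchronize before reading each other’s values.

    /* Threads in one block cooperate through __shared__ (on-chip) memory.
       Each thread loads one element, __syncthreads() makes all loads
       visible, then each thread reads its neighbour's element.           */
    __global__ void shift_left(const float *in, float *out, int n)
    {
        __shared__ float buf[256];          /* one block's slice of fast RAM */
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n)
            buf[threadIdx.x] = in[i];       /* stage data in shared memory   */
        __syncthreads();                    /* wait for the whole block      */

        /* Read the value loaded by the neighbouring thread; elements on a
           block boundary are left unwritten in this simplified sketch.     */
        if (i < n - 1 && threadIdx.x < blockDim.x - 1)
            out[i] = buf[threadIdx.x + 1];
    }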

                                 Introduction to Massively Parallel Processing – Slide 52
 SIMT (Single Instruction Multiple Thread) execution
   Threads run in groups of 32 called warps
      Minimum of 32 threads all doing the same thing at (almost)
       the same time
   Threads in a warp share instruction unit (IU)
      All cores execute the same instructions simultaneously, but
       with different data
      Similar to vector computing on CRAY supercomputers
   Hardware automatically handles divergence
   Natural for graphics processing and much scientific
    computing; also a natural choice for many-core chips to
    simplify each core
                                      Introduction to Massively Parallel Processing – Slide 53
 Hardware multithreading
   Hardware resource allocation and thread scheduling
   Hardware relies on threads to hide latency
 Threads have all resources needed to run
   Any warp not waiting for something can run
   Context switching is (basically) free




                                  Introduction to Massively Parallel Processing – Slide 54
 Lots of active threads is the key to high performance:
   no “context switching”; each thread has its own registers,
    which limits the number of active threads
   threads on each SM execute in warps – execution
    alternates between “active” warps, with warps becoming
    temporarily “inactive” when waiting for data
   for each thread, one operation completes long before the
    next starts – avoids the complexity of pipeline overlaps
    which can limit the performance of modern processors
   memory access from device memory has a delay of 400-
    600 cycles; with 40 threads this is equivalent to 10-15
    operations, so hopefully there’s enough computation to
    hide the latency
                                  Introduction to Massively Parallel Processing – Slide 55
 Intel Core 2 / Xeon / i7
   4-6 MIMD (Multiple Instruction / Multiple Data) cores
      each core operates independently
      each can be working with a different code, performing
       different operations with entirely different data
   few registers, multilevel caches
   10-30 GB/s bandwidth to main memory




                                  Introduction to Massively Parallel Processing – Slide 56
 NVIDIA GTX280:
   240 cores, arranged as 30 units each with 8 SIMD (Single
    Instruction / Multiple Data) cores
      all cores executing the same instruction at the same time, but
       working on different data
      only one instruction decoder needed to control all cores
      functions like a vector unit
   lots of registers, almost no cache
   5 GB/s bandwidth to host processor (PCIe x16 gen 2)
   140 GB/s bandwidth to graphics memory



                                      Introduction to Massively Parallel Processing – Slide 57
 One issue with SIMD / vector execution: what happens
 with if-then-else branches?
   Standard vector treatment: execute both branches but
    only store the results for the correct branch.
   NVIDIA treatment: works with warps of length 32. If a
    particular warp only goes one way, then that is all that is
    executed.
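
A hypothetical kernel making this concrete (illustrative, not from the slides): when the branch condition is uniform across a warp only one side is executed, while a condition that splits a warp forces it to execute both sides with inactive threads masked off.

    __global__ void branch_demo(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        /* Uniform within a warp: (i / 32) % 2 is identical for all 32
           threads of a warp, so each warp executes only one side.        */
        if ((i / 32) % 2 == 0)
            x[i] *= 2.0f;
        else
            x[i] += 1.0f;

        /* Divergent within a warp: i % 2 alternates between neighbouring
           threads, so every warp executes both sides, one after the
           other, with the non-participating threads masked off.          */
        if (i % 2 == 0)
            x[i] *= 2.0f;
        else
            x[i] += 1.0f;
    }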




                                   Introduction to Massively Parallel Processing – Slide 58
 How do NVIDIA GPUs deal with same architectural
 challenges as CPUs?
   Very heavily multi-threaded, with at least 8 threads per
    core:
      Solves pipeline problem, because output of one calculation
       can be used an input for next calculation for same thread
      Switch from one set of threads to another when waiting for
       data from graphics memory




                                     Introduction to Massively Parallel Processing – Slide 59
 Scalable programming model


 Minimal extensions to familiar C/C++ environment


 Heterogeneous serial-parallel computing




                               Introduction to Massively Parallel Processing – Slide 61
 [Figure: measured application speedups with CUDA – 45X, 17X, 100X, 110-240X, 13-457X, 35X]
                     Introduction to Massively Parallel Processing – Slide 62
 Big breakthrough in GPU computing has been
 NVIDIA’s development of CUDA programming
 environment
   Initially driven by needs of computer games developers
   Now being driven by new markets (e.g. HD video
    decoding)
   C plus some extensions and some C++ features (e.g.
    templates)




                                 Introduction to Massively Parallel Processing – Slide 63
 Host code runs on CPU, CUDA code runs on GPU
 Explicit movement of data across the PCIe connection
 Very straightforward for Monte Carlo applications,
  once you have a random number generator
 Harder for finite difference applications




                               Introduction to Massively Parallel Processing – Slide 64
 At the top level, we have a master process which runs on the CPU and performs the following steps (sketched in code below):
  1. initialize card
  2. allocate memory in host and on device
  3. repeat as needed:
     a. copy data from host to device memory
     b. launch multiple copies of execution “kernel” on device
     c. copy data from device memory to host
  4. de-allocate all memory and terminate
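
A minimal host-side sketch of those steps (the kernel, sizes and names here are illustrative assumptions, not from the slides; card initialization happens implicitly on the first CUDA runtime call):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n)   /* trivial illustrative kernel */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_x = (float *)malloc(bytes);            /* step 2: host memory   */
        for (int i = 0; i < n; i++) h_x[i] = 1.0f;

        float *d_x;
        cudaMalloc(&d_x, bytes);                        /* step 2: device memory */

        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   /* step 3a */
        scale<<<(n + 255) / 256, 256>>>(d_x, n);                /* step 3b */
        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   /* step 3c */

        printf("x[0] = %f\n", h_x[0]);                  /* expect 2.0 */

        cudaFree(d_x);                                  /* step 4 */
        free(h_x);
        return 0;
    }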



                                          Introduction to Massively Parallel Processing – Slide 65
 At a lower level, within the GPU:
   each copy of the execution kernel executes on an SM
   if the number of copies exceeds the number of SMs,
    then more than one will run at a time on each SM if
    there are enough registers and shared memory, and the
    others will wait in a queue and execute later
   all threads within one copy can access local shared
    memory but can’t see what the other copies are doing
    (even if they are on the same SM)
   there are no guarantees on the order in which the copies
    will execute

                                 Introduction to Massively Parallel Processing – Slide 66
 Augment C/C++ with minimalist abstractions
   Let programmers focus on parallel algorithms, not
    mechanics of parallel programming language
 Provide straightforward mapping onto hardware
   Good fit to GPU architecture
   Maps well to multi-core CPUs too
 Scale to 100s of cores and 10000s of parallel threads
   GPU threads are lightweight – create/switch is free
   GPU needs 1000s of threads for full utilization



                                   Introduction to Massively Parallel Processing – Slide 67
 Hierarchy of concurrent threads
 Lightweight synchronization primitives
 Shared memory model for cooperating threads




                              Introduction to Massively Parallel Processing – Slide 68
 Parallel kernels composed of many threads
   All threads execute the same sequential program
 Threads are grouped into thread blocks
   Threads in the same block can cooperate
 Threads/blocks have unique IDs
 [Diagram: a single thread t; a block b containing threads t0, t1, …, tB]
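
For instance (an illustrative sketch; saxpy is our choice of example, not the slides’): every thread runs the same program and uses its built-in IDs to pick out the one element it owns.

    __global__ void saxpy(float a, const float *x, float *y, int n)
    {
        /* blockIdx.x, blockDim.x and threadIdx.x combine into an ID that
           is unique across the whole launch.                             */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                      /* guard the tail of the array    */
            y[i] = a * x[i] + y[i];
    }

    /* Launched with enough 256-thread blocks to cover n elements:
       saxpy<<<(n + 255) / 256, 256>>>(2.0f, d_x, d_y, n);                */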




                                Introduction to Massively Parallel Processing – Slide 69
 CUDA virtualizes the physical hardware
   Thread is a virtualized scalar processor: registers, PC,
    state
   Block is a virtualized multiprocessor: threads, shared
    memory
 Scheduled onto physical hardware without pre-
 emption
   Threads/blocks launch and run to completion
   Blocks should be independent




                                   Introduction to Massively Parallel Processing – Slide 70
 [Diagram: many block + memory pairs (…) above a shared global memory]
                       Introduction to Massively Parallel Processing – Slide 71
 Cf. PRAM (Parallel Random Access Machine) model

   [Diagram: PRAM – processors connected directly to a single global memory]
 Global synchronization isn’t cheap
 Global memory access times are expensive

                                     Introduction to Massively Parallel Processing – Slide 72
 Cf. BSP (Bulk Synchronous Parallel) model, MPI

   [Diagram: processor + memory pairs (…) connected by an interconnection network]
 Distributed computing is a different setting



                                    Introduction to Massively Parallel Processing – Slide 73
 [Figure: a multicore CPU alongside a manycore GPU]
                Introduction to Massively Parallel Processing – Slide 74
 Philosophy: provide the minimal set of extensions necessary to expose power
   Function and variable qualifiers
   Execution configuration
   Built-in variables and functions valid in device code
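
A compact sketch of what those extensions look like (illustrative, using standard CUDA C syntax):

    /* Function qualifiers: __global__ marks a kernel callable from the
       host; __device__ marks a function callable only on the GPU.       */
    __device__ float square(float v) { return v * v; }

    __global__ void squares(float *out, const float *in, int n)
    {
        /* Built-in variables, valid only in device code. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = square(in[i]);
    }

    /* Execution configuration: <<<blocks, threads per block>>> on the
       host chooses how many copies of the kernel run:
       squares<<<64, 256>>>(d_out, d_in, n);                             */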




                                  Introduction to Massively Parallel Processing – Slide 75
Application  Description                                                                          Source lines  Kernel lines  % time in kernel
H.264        SPEC ’06 version, change in guess vector                                             34,811        194           35%
LBM          SPEC ’06 version, change to single precision and print fewer reports                 1,481         285           >99%
RC5-72       Distributed.net RC5-72 challenge client code                                         1,979         218           >99%
FEM          Finite element modeling, simulation of 3D graded materials                           1,874         146           99%
RPES         Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion                   1,104         281           99%
PNS          Petri Net simulation of a distributed system                                         322           160           >99%
SAXPY        Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine   952           31            >99%
TPACF        Two Point Angular Correlation Function                                               536           98            96%
FDTD         Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation        1,365         93            16%
MRI-Q        Computing a matrix Q, a scanner’s configuration in MRI reconstruction                490           33            >99%

                                                         Introduction to Massively Parallel Processing – Slide 77
 GeForce 8800 GTX vs. 2.2 GHz Opteron 248
   [Figure: measured kernel speedups for the applications in the preceding table]

                             Introduction to Massively Parallel Processing – Slide 78
 10x speedup in a kernel is typical, as long as the kernel
  can occupy enough parallel threads

 25x to 400x speedup if the function’s data
  requirements and control flow suit the GPU and the
  application is optimized




                                 Introduction to Massively Parallel Processing – Slide 79
 Parallel hardware is here to stay


 GPUs are massively parallel manycore processors
   Easily available and fully programmable


 Parallelism and scalability are crucial for success


 This presents many important research challenges
   Not to mention the educational challenges


                                 Introduction to Massively Parallel Processing – Slide 80
 Reading: Chapter 1, “Programming Massively Parallel
  Processors” by Kirk and Hwu.
 Based on original material from
   The University of Akron: Tim O’Neil, Kathy Liszka
   Hiram College: Irena Lomonosov
   The University of Illinois at Urbana-Champaign
      David Kirk, Wen-mei W. Hwu
   The University of North Carolina at Charlotte
      Barry Wilkinson, Michael Allen
   Oregon State University: Michael Quinn
   Oxford University: Mike Giles
   Stanford University: Jared Hoberock, David Tarjan
 Revision history: last updated 6/4/2011.
                                    Introduction to Massively Parallel Processing – Slide 81

				