


NVIDIA’S FERMI

Presented by: Ahmad Hammad
Course: CSE 661 - Fall 2011

Ø   Introduction
Ø   What is GPU Computing?
Ø   Fermi
    Ø The Programming Model
    Ø The Streaming Multiprocessor

    Ø The Cache and Memory Hierarchy

Ø   Conclusion
Introduction (1)
   Traditional microprocessor technology sees
    diminishing returns.
       Improvement in clock speeds and architectural
        sophistication is slowing.
   Focus has shifted to multicore designs.
       These too are reaching practical limits for personal
        computing.
Introduction (2)
   CPUs are optimized for applications where the work is
    done by a limited number of threads.
     Threads exhibit high data locality
     A mix of different operations
     A high percentage of conditional branches
   CPUs are inefficient for high-performance computing.
     The number of integer and floating-point execution units is small
     Most of the CPU’s die area, and the complexity and heat it
      generates, is devoted to:
           Caches, instruction decoders, branch predictors, and other features
            that enhance single-threaded performance.
Introduction (3)
   GPU design aims at applications with many threads
    dominated by long sequences of computational operations.
   CPUs are much better at thread handling, data
    caching, virtual memory management, flow control,
    and other CPU-like features.
   CPUs will never go away, but
       GPUs deliver more cost-effective and energy-efficient
        performance for the workloads that suit them.
Introduction (4)
   The key GPU design goal is to maximize floating-
    point throughput.
   Most of the circuitry within each core is dedicated to
    computation rather than to speculative features.
   Most of the power consumed by GPUs goes into the
    application’s actual algorithmic work.
What is GPU Computing?
   Use of a graphics processing unit to do general-purpose
    scientific and engineering computing.
   GPU computing is not a replacement for CPU computing.
       Each approach has advantages for certain kinds of workloads.
   CPU and GPU work together in a heterogeneous co-
    processing computing model.
     The sequential part of the application runs on the CPU.
     The computationally intensive part is accelerated by the GPU.
What is GPU Computing?

   From the user’s perspective, the application just runs
    faster because of using the GPU to boost performance. 
   GPU computing began with nonprogrammable 3D-
    graphics accelerators.
   Multi-chip 3D rendering engines were developed
    starting in the 1980s,
   By mid-1990s all the essential elements integrated
    onto a single chip.
   1994-2001, these chips progressed from the
    simplest pixel-drawing functions to implementing the
    full 3D pipeline
   NVIDIA’s GeForce 3 in 2001 introduced
    programmable pixel shading to the consumer market.
   The programmability of this chip was very limited.
   Later GeForce products became more flexible,
       adding separate programmable engines for vertex
        and geometry shading.
   This evolution culminated in the GeForce 7800
GeForce 7800
   Had three kinds of programmable engines for
    different stages of the 3D pipeline.
   Several additional stages of configurable and
    fixed-function logic.
   General-purpose programming evolved as a way to perform
    non-graphics processing on these graphics-optimized architectures
     By running carefully crafted shader code against data
      presented as vertex or texture information.
     Retrieving the results from a later stage in the pipeline.
   Managing three different programmable engines in a
    single 3D pipeline led to unpredictable bottlenecks.
       Too much effort went into balancing the throughput of each stage.
   In 2006, NVIDIA introduced the GeForce 8800.
   This design featured a “unified shader architecture”
    with 128 processing elements distributed among eight
    shader cores.
       Each shader core could be assigned to any shader task,
        eliminating the need for stage-by-stage balancing
           Greatly improving overall performance.
   To bring the advantages of the 8800 architecture and
    CUDA to new markets such as HPC, NVIDIA introduced the
    Tesla product line.
   Current Tesla products use the more recent GT200 architecture.
   The Tesla line begins with PCI Express add-in boards—
    essentially graphics cards without display outputs—and with
    drivers optimized for GPU computing instead of 3D rendering.
   With Tesla, programmers don’t have to worry about making
    tasks look like graphics operations;
       the GPU can be treated like a many-core processor.
   In 2006-2007 NVIDIA introduced its parallel architecture called “CUDA”.
       Consists of hundreds of processor cores that operate together.
   Programming is eased by the associated CUDA parallel programming model.
       Developers modify their application to take the compute-intensive kernels and
        map them to the GPU
           by adding “C” keywords.
       The rest of the application remains on the CPU.
       The developer launches tens of thousands of threads simultaneously.
       The GPU hardware manages the threads and does thread scheduling.
   Although GPU computing is only a few years old,
       there are already more programmers with direct GPU computing experience
        than have ever used a Cray supercomputer.
       Academic support for GPU computing is also growing quickly.
           Over 200 colleges and universities are teaching classes in CUDA programming.
   Fermi is the code name for NVIDIA’s next-generation CUDA
    architecture. It consists of:
       16 streaming multiprocessors (SMs)
           each consisting of 32 cores
                each can execute one floating-point or integer instruction per clock.
     The SMs are supported by a second-level cache
     Host interface

     GigaThread scheduler

     Multiple DRAM interfaces.
The Programming Model
   The complexity of the Fermi architecture is managed by a
    multi-level programming model
     that allows software developers to focus on algorithm design,
     with no need to know the details of mapping the algorithm to
      the hardware,
           improving productivity.
The Programming Model Kernels
   In NVIDIA’s CUDA software platform, the
    computational elements of algorithms are called kernels.
     Kernels can be written in the C language,
      extended with additional keywords to express
      parallelism directly.
     Once compiled, kernels consist of many threads that
      execute the same program in parallel.
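Running real kernels requires a CUDA toolchain, so the model above is sketched here in plain C: one function body (the “kernel”) is executed by many logical threads that differ only in their thread ID. The names and the serial launch loop are illustrative, not NVIDIA’s API.

```c
#include <assert.h>

/* Illustrative "kernel": every thread executes this same body;
 * only tid, the thread's ID, differs from thread to thread.    */
static void scale_kernel(int tid, int n, const float *in, float *out) {
    if (tid < n)                    /* bounds guard, as in real kernels */
        out[tid] = 2.0f * in[tid];
}

/* The GPU launches all threads in parallel; a serial loop over
 * thread IDs models the same semantics on a CPU.               */
static void launch(int num_threads, int n, const float *in, float *out) {
    for (int tid = 0; tid < num_threads; tid++)
        scale_kernel(tid, n, in, out);
}
```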
The Programming Model Thread Blocks
The Programming Model Warps
   Thread blocks are divided into warps of 32 threads.
   The warp is the fundamental unit of dispatch within
    a single SM.
   Two warps from different thread blocks can be
    issued and executed concurrently
       to increase hardware utilization and energy efficiency.
   Thread blocks are grouped into grids,
       each of which executes a unique kernel.
The Programming Model IDs
   Threads and thread blocks each have identifiers
     that specify their relationship to the kernel.
     Used by each thread as indexes into its input and
      output data and shared memory locations.
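The usual way a thread turns these identifiers into a unique data index can be sketched in C. The helper is modeled on CUDA’s built-in blockIdx, blockDim and threadIdx variables, but the function itself is an illustration, not NVIDIA’s API.

```c
#include <assert.h>

/* Unique global element index for a thread in a 1-D grid: each block
 * covers block_dim consecutive elements, and the thread's position
 * within its block selects one of them.                             */
static int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}
```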
The Programming Model
   At any one time, the entire Fermi device is
    dedicated to a single application.
     an application may include multiple kernels.
     Fermi supports simultaneous execution of multiple
      kernels from the same application
     Each kernel distributed to one or more SMs
           This capability increases the utilization of the device
The Programming Model GigaThread

   Switching from one application to another requires a
    context switch
       short enough to maintain high utilization even when
        running multiple applications.
   This switching is managed by GigaThread, the
    hardware thread scheduler,
       which manages 1,536 simultaneously active threads for each
        streaming multiprocessor across 16 kernels.
The Programming Model Languages

   Fermi supports
     the C language
     FORTRAN (with independent solutions from The
      Portland Group and NOAA)
     Java, Matlab, and Python

   Fermi brings new instruction-level support for C++,
     previously unsupported on GPUs,
     which will make GPU computing more widely accessible.
Supported software platforms
     NVIDIA’s own CUDA development environment
     The OpenCL standard managed by the Khronos Group

     Microsoft’s Direct Compute API.
The Streaming Multiprocessor
   Each SM comprises 32 cores,
       each of which can perform floating-point and integer operations,
   16 load-store units for memory operations,
   four special-function units, and
   64K of local SRAM split between L1 cache and shared memory.
The Streaming Multiprocessor core
   Floating-point operations follow the IEEE 754-2008
    floating-point standard.
   Each core can perform
     one single-precision fused multiply-add (FMA) operation in each
      clock period
     one double-precision FMA in two clock periods
           with no rounding of the intermediate result.
   Fermi performs more than 8× as many double-precision
    operations per clock as previous GPU generations.
The Streaming Multiprocessor core
   FMA support increases the accuracy and
    performance of other mathematical operations:
     division and square root
     extended-precision arithmetic
     interval arithmetic
     linear algebra.

   The integer ALU supports the usual mathematical
    and logical operations,
       including multiplication, on both 32-bit and 64-bit values.
The Streaming Multiprocessor Memory
   Memory operations are handled by a set of 16 load-store units in each SM.
   Load/store instructions refer to memory in terms of
    two-dimensional arrays,
       providing addresses in terms of x and y values.
   Data can be converted from one format to another
    as it passes between DRAM and the core registers
    at the full rate.
       These are examples of optimizations unique to GPUs.
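One way such a 2-D (x, y) reference can map onto linear memory is pitch-linear addressing, sketched below; the helper is purely illustrative, since the hardware’s actual address path is not documented at this level.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative pitch-linear addressing: a 2-D (x, y) reference is
 * flattened to a byte offset. pitch is the byte stride between
 * consecutive rows; elem_size is the size of one element.        */
static size_t addr_2d(size_t base, size_t pitch, size_t elem_size,
                      size_t x, size_t y) {
    return base + y * pitch + x * elem_size;
}
```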
The Streaming Multiprocessor four
Special Function Units
   Handle special operations such as sin, cos, and exp.
   Four of these operations can be issued per cycle in
    each SM.
The Streaming Multiprocessor execution
   Each Fermi SM has four execution blocks:
     the cores are divided into two execution blocks of 16 cores each,
     one block offers the 16 load-store units, and

     one block offers the four SFUs.

   In each cycle, 32 instructions can be dispatched from
    one or two warps to these blocks.
       It takes two cycles to execute the 32 instructions on the cores or
        load/store units.
   32 special-function instructions can be issued in a single cycle
       but take eight cycles to complete on the four SFUs (32/4 = 8).
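The dispatch arithmetic above reduces to one line of C; warp_cycles is an illustrative helper, not hardware terminology, and assumes the unit count divides the warp size.

```c
#include <assert.h>

/* Cycles for one 32-instruction warp to execute on an execution
 * block with `units` identical units (units must divide warp_size). */
static int warp_cycles(int warp_size, int units) {
    return warp_size / units;
}
```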
This figure shows how instructions are issued to the execution blocks.
ISA improvements (1)
   Fermi debuts the Parallel Thread eXecution (PTX)
    2.0 instruction-set architecture (ISA).
     Defines the instruction set and a new virtual machine.
     Compilers supporting NVIDIA GPUs provide PTX-
      compliant binaries that act as a hardware-neutral layer.
     Applications can thus be portable across GPU generations
      and implementations.
ISA improvements (2)
   All instructions support predication.
     Each instruction can be executed or skipped based on
      condition codes.
     Each thread can perform different operations as needed
      while execution continues at full speed.
     If predication isn’t sufficient, the usual if-then-else structure
      with branch statements is used.
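Predication can be sketched in C: every lane evaluates its condition code and both candidate values, and the predicate merely selects which result is written back, so no lane takes a branch. The function is an illustration of the idea, not GPU code.

```c
#include <assert.h>

/* Illustrative predicated execution across the n lanes of a warp:
 * all lanes do the same work; a per-lane predicate selects the
 * result, so control flow never diverges.                        */
static void predicated_abs(const int *in, int *out, int n) {
    for (int lane = 0; lane < n; lane++) {
        int pred = in[lane] < 0;               /* per-lane condition code */
        int negated = -in[lane];               /* executed by every lane  */
        out[lane] = pred ? negated : in[lane]; /* select, don't branch    */
    }
}
```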
The Cache and Memory Hierarchy L1

   The Fermi architecture provides local memory in each
    SM that can be split between
     shared memory and
     a first-level (L1) cache for global memory references.

   The local memory is 64K in size,
       split 16K/48K or 48K/16K between L1 cache and
        shared memory, depending on
         how much shared memory is needed and
         how predictable the kernel’s accesses to global memory are
          likely to be.
The Cache and Memory Hierarchy
   Shared memory provides low-latency access to
    moderate amounts of data.
   Because the access latency to this memory is also
    completely predictable
       algorithms can be written to interleave loads,
        calculations, and stores with maximum efficiency.
The Cache and Memory Hierarchy
   A larger shared-memory requirement argues for
    less cache;
        more frequent or unpredictable accesses to larger
        regions of DRAM argue for more cache.
The Cache and Memory Hierarchy L2

   Fermi comes with an L2 cache,
     768KB in size for a 512-core chip,
     that covers GPU local DRAM as well as system memory.

   The L2 cache subsystem implements:
       a set of memory read-modify-write atomic operations
         for managing access to data shared across thread blocks or kernels.
         Atomic operations are 5× to 20× faster than on previous
          GPUs using conventional synchronization methods.
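A CPU-side analogue of such a read-modify-write operation, written with C11 atomics: an atomic “max” update built from compare-and-swap. This illustrates the concept only; it is not Fermi’s implementation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative atomic read-modify-write: update *loc to val only if
 * val is larger, retrying whenever another thread changed the
 * location between our load and our exchange.                      */
static int atomic_max(atomic_int *loc, int val) {
    int cur = atomic_load(loc);
    while (cur < val &&
           !atomic_compare_exchange_weak(loc, &cur, val))
        ;          /* on failure, cur is refreshed; decide again */
    return cur;    /* value observed before any update we made   */
}
```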
The Cache and Memory Hierarchy
   The final stage of the local memory hierarchy.
   Fermi provides six 64-bit DRAM channels that
    support SDDR3 and GDDR5 DRAMs.
   Up to 6GB of GDDR5 DRAM can be connected to
    the chip.
Error Correcting Code ECC
   Fermi is the first GPU to provide ECC protection for
        DRAM, register files, shared memories, and the L1 and L2 caches.
   The level of protection is known as SECDED:
       single (bit) error correction, double error detection.
   Instead of each 64-bit memory channel carrying
    eight extra bits for ECC information,
       NVIDIA devised an undisclosed scheme for packing the ECC bits
        into reserved lines of memory.
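The SECDED principle can be shown at toy scale with a Hamming(7,4) code plus an overall parity bit. Fermi protects 64-bit words and NVIDIA’s packing scheme is undisclosed, so this sketch only illustrates how one flipped bit is corrected while two flipped bits are detected.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t parity8(uint8_t v) {
    v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return v & 1;
}

/* Encode 4 data bits into an 8-bit codeword. Bits 1..7 form the
 * Hamming code (parity at positions 1, 2, 4; data at 3, 5, 6, 7);
 * bit 0 is the overall parity over the other seven bits.          */
static uint8_t secded_encode(uint8_t data) {
    uint8_t c = 0;
    c |= (uint8_t)(((data >> 0) & 1) << 3);
    c |= (uint8_t)(((data >> 1) & 1) << 5);
    c |= (uint8_t)(((data >> 2) & 1) << 6);
    c |= (uint8_t)(((data >> 3) & 1) << 7);
    c |= (uint8_t)(parity8(c & 0xAA) << 1);   /* covers positions 1,3,5,7 */
    c |= (uint8_t)(parity8(c & 0xCC) << 2);   /* covers positions 2,3,6,7 */
    c |= (uint8_t)(parity8(c & 0xF0) << 4);   /* covers positions 4,5,6,7 */
    c |= parity8(c);                          /* overall parity in bit 0  */
    return c;
}

/* Returns 0 and writes the (corrected) data on success,
 * -1 when a double-bit error is detected.               */
static int secded_decode(uint8_t c, uint8_t *data) {
    uint8_t s = (uint8_t)( parity8(c & 0xAA)
                         | (parity8(c & 0xCC) << 1)
                         | (parity8(c & 0xF0) << 2));  /* error position */
    uint8_t overall = parity8(c);     /* 1 iff an odd number of flips */
    if (s != 0 && overall == 0)
        return -1;                    /* two flips: detect only */
    if (s != 0)
        c ^= (uint8_t)(1u << s);      /* one flip: correct it   */
    /* s == 0, overall == 1: the parity bit itself flipped; data intact */
    *data = (uint8_t)( ((c >> 3) & 1)        | (((c >> 5) & 1) << 1)
                     | (((c >> 6) & 1) << 2) | (((c >> 7) & 1) << 3));
    return 0;
}
```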
The Cache and Memory Hierarchy
   The GigaThread controller also provides a pair of
    streaming data-transfer engines,
   each of which can fully saturate Fermi’s PCI Express host interface.
     Typically, one will be used to move data from system
      memory to GPU memory when setting up a GPU computation,
     while the other will be used to move result data from
      GPU memory to system memory.
Conclusion
   CPUs are best for dynamic workloads with short
    sequences of computational operations and
    unpredictable control flow.
   Workloads dominated by computational work
    performed within a simpler control flow are best
    served by GPUs.