How GPUs Work

HOW THINGS WORK

David Luebke, NVIDIA Research
Greg Humphreys, University of Virginia

GPUs have moved away from the traditional fixed-function 3D graphics pipeline toward a flexible general-purpose computational engine.

In the early 1990s, ubiquitous interactive 3D graphics was still the stuff of science fiction. By the end of the decade, nearly every new computer contained a graphics processing unit (GPU) dedicated to providing a high-performance, visually rich, interactive 3D experience.

This dramatic shift was the inevitable consequence of consumer demand for videogames, advances in manufacturing technology, and the exploitation of the inherent parallelism in the feed-forward graphics pipeline. Today, the raw computational power of a GPU dwarfs that of the most powerful CPU, and the gap is steadily widening.

Furthermore, GPUs have moved away from the traditional fixed-function 3D graphics pipeline toward a flexible general-purpose computational engine. Today, GPUs can implement many parallel algorithms directly using graphics hardware. Well-suited algorithms that leverage all the underlying computational horsepower often achieve tremendous speedups. Truly, the GPU is the first widely deployed commodity desktop parallel computer.

THE GRAPHICS PIPELINE

The task of any 3D graphics system is to synthesize an image from a description of a scene, 60 times per second for real-time graphics such as videogames. This scene contains the geometric primitives to be viewed as well as descriptions of the lights illuminating the scene, the way that each object reflects light, and the viewer’s position and orientation.

GPU designers traditionally have expressed this image-synthesis process as a hardware pipeline of specialized stages. Here, we provide a high-level overview of the classic graphics pipeline; our goal is to highlight those aspects of the real-time rendering calculation that allow graphics application developers to exploit modern GPUs as general-purpose parallel computation engines.

Pipeline input

Most real-time graphics systems assume that everything is made of triangles, and they first carve up any more complex shapes, such as quadrilaterals or curved surface patches, into triangles. The developer uses a computer graphics library (such as OpenGL or Direct3D) to provide each triangle to the graphics pipeline one vertex at a time; the GPU assembles vertices into triangles as needed.

Model transformations

A GPU can specify each logical object in a scene in its own locally defined coordinate system, which is convenient for objects that are naturally defined hierarchically. This convenience comes at a price: before rendering, the GPU must first transform all objects into a common coordinate system. To ensure that triangles aren’t warped or twisted into curved shapes, this transformation is limited to simple affine operations such as rotations, translations, scalings, and the like.

As the “Homogeneous Coordinates” sidebar explains, by representing each vertex in homogeneous coordinates, the graphics system can perform the entire hierarchy of transformations simultaneously with a single matrix-vector multiply. The need for efficient hardware to perform floating-point vector arithmetic for millions of vertices each second has helped drive the GPU parallel-computing revolution.

The output of this stage of the pipeline is a stream of triangles, all expressed in a common 3D coordinate system in which the viewer is located at the origin, and the direction of view is aligned with the z-axis.

Lighting

Once each triangle is in a global coordinate system, the GPU can compute its color based on the lights in the scene. As an example, we describe the calculations for a single point light source (imagine a very small lightbulb). The GPU handles multiple lights by summing the contributions of each individual light. The traditional graphics pipeline supports the Phong lighting equation (B-T. Phong, “Illumination for Computer-Generated Images,” Comm. ACM, June 1975, pp. 311-317), a phenomenological appearance model that approximates the look of plastic. These materials combine a dull diffuse base with a shiny specular highlight. The Phong lighting equation gives the output color

$$C = K_d \times L_i \times (N \cdot L) + K_s \times L_i \times (R \cdot V)^S.$$

Table 1 defines each term in the equation. The mathematics here isn’t as important as the computation’s structure; to evaluate this equation efficiently, GPUs must again operate directly on vectors. In this case, we repeatedly evaluate the dot product of two vectors, performing a four-component multiply-and-add operation.

Camera simulation

The graphics pipeline next projects each colored 3D triangle onto the virtual camera’s film plane. As with the model transformations, the GPU does this using matrix-vector multiplication, again leveraging efficient vector operations in hardware. This stage’s output is a stream of triangles in screen coordinates, ready to be turned into pixels.

Homogeneous Coordinates (sidebar)

Points in three dimensions are typically represented as a triple (x,y,z). In computer graphics, however, it’s frequently useful to add a fourth coordinate, w, to the point representation. To convert a point to this new representation, we set w = 1. To recover the original point, we apply the transformation (x,y,z,w) → (x/w, y/w, z/w).

Although at first glance this might seem like needless complexity, it has several significant advantages. As a simple example, we can use the otherwise undefined point (x,y,z,0) to represent the direction vector (x,y,z). With this unified representation for points and vectors in place, we can also perform several useful transformations such as simple matrix-vector multiplies that would otherwise be impossible. For example, the multiplication

$$\begin{bmatrix} 1 & 0 & 0 & \Delta x \\ 0 & 1 & 0 & \Delta y \\ 0 & 0 & 1 & \Delta z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}$$

can accomplish translation by an amount Δx, Δy, Δz.

Furthermore, these matrices can encode useful nonlinear transformations such as perspective foreshortening.
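The sidebar's arithmetic can be tried out in a few lines of plain Python. The following is only a minimal sketch: the helper names (`mat_vec`, `mat_mul`, `translation`, `scaling`, `to_cartesian`) and the sample numbers are illustrative assumptions, and the `project` matrix is one simple, assumed way to obtain foreshortening (copying z into w), not the projection any particular GPU or graphics library uses.

```python
# Minimal sketch of the sidebar's homogeneous-coordinate arithmetic.
# Every name below is illustrative; none comes from a real graphics API.

def mat_vec(m, v):
    """Multiply a 4x4 matrix (a list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def mat_mul(a, b):
    """Return the 4x4 product a*b, which applies b first and then a."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def translation(dx, dy, dz):
    """The translation matrix shown in the sidebar."""
    return [[1.0, 0.0, 0.0, dx],
            [0.0, 1.0, 0.0, dy],
            [0.0, 0.0, 1.0, dz],
            [0.0, 0.0, 0.0, 1.0]]


def scaling(s):
    """Uniform scaling by s, another affine operation."""
    return [[s, 0.0, 0.0, 0.0],
            [0.0, s, 0.0, 0.0],
            [0.0, 0.0, s, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def to_cartesian(v):
    """Recover (x/w, y/w, z/w); only meaningful for points (w != 0)."""
    x, y, z, w = v
    return (x / w, y / w, z / w)


point = [1.0, 2.0, 3.0, 1.0]      # w = 1: a position
direction = [1.0, 2.0, 3.0, 0.0]  # w = 0: a direction

shift = translation(10.0, 0.0, -5.0)
moved_point = mat_vec(shift, point)          # translation moves the point
moved_direction = mat_vec(shift, direction)  # but leaves the direction alone

# A two-level hierarchy (scale, then translate) collapses into one matrix,
# so each vertex still costs a single matrix-vector multiply.
model = mat_mul(shift, scaling(2.0))
world_point = to_cartesian(mat_vec(model, point))

# Assumed toy projection: copying z into w makes the later divide by w
# shrink distant points, which is perspective foreshortening.
project = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
near = to_cartesian(mat_vec(project, [2.0, 4.0, 2.0, 1.0]))
far = to_cartesian(mat_vec(project, [2.0, 4.0, 8.0, 1.0]))

print(moved_point, moved_direction, world_point, near, far)
```

Because the translation lives in the fourth column, it only affects vectors whose w is nonzero, which is why the same matrix can be applied to points and directions alike.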

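The Phong lighting equation from the Lighting section can be evaluated for one point light with nothing more than dot products. The sketch below is an illustration under stated assumptions, not the hardware's implementation: the function names and material values are made up, colors are RGB triples multiplied component-wise, the reflected vector is computed as R = 2(N·L)N − L, and negative dot products are clamped to zero (a common convention that the equation in the article does not spell out).

```python
import math


def dot(a, b):
    """Dot product of two 3-component vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def reflect(l, n):
    """Mirror the unit vector-to-light l about the unit normal n."""
    scale = 2.0 * dot(n, l)
    return tuple(scale * nc - lc for nc, lc in zip(n, l))


def phong(kd, ks, li, n, l, v, s):
    """C = Kd*Li*(N.L) + Ks*Li*(R.V)^S for a single light, per color channel."""
    n, l, v = normalize(n), normalize(l), normalize(v)
    # Assumption: clamp so surfaces facing away from the light receive nothing.
    n_dot_l = max(dot(n, l), 0.0)
    r_dot_v = max(dot(reflect(l, n), v), 0.0) if n_dot_l > 0.0 else 0.0
    return tuple(kd_c * li_c * n_dot_l + ks_c * li_c * r_dot_v ** s
                 for kd_c, ks_c, li_c in zip(kd, ks, li))


# Hypothetical reddish plastic lit by a white light.
kd = (0.8, 0.1, 0.1)
ks = (0.5, 0.5, 0.5)
white = (1.0, 1.0, 1.0)
normal = (0.0, 0.0, 1.0)
to_camera = (0.0, 0.0, 1.0)

head_on = phong(kd, ks, white, normal, (0.0, 0.0, 1.0), to_camera, 32)
oblique = phong(kd, ks, white, normal, (1.0, 0.0, 1.0), to_camera, 32)
behind = phong(kd, ks, white, normal, (0.0, 0.0, -1.0), to_camera, 32)

print(head_on, oblique, behind)
```

With the light head-on, the highlight is at full strength; at 45 degrees the diffuse term falls to about 0.71 of its peak while the specular term all but vanishes, which is the "dull base, tight highlight" look the model is meant to capture. Multiple lights would be handled by summing the tuples returned for each light.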
Table 1. Phong lighting equation terms.

Term    Meaning
Kd      Diffuse color
Li      Light color
N       Surface normal
L       Vector to light
Ks      Specular color
R       Reflected light vector
V       Vector to camera
S       “Shininess”

Rasterization

Each visible screen-space triangle overlaps some pixels on the display; determining these pixels is called rasterization. GPU designers have incorporated many rasterization algorithms over the years, which all exploit one crucial observation: Each pixel can be treated independently from all other pixels. Therefore, the machine can handle all pixels in parallel; indeed, some exotic machines have had a processor for each pixel. This inherent independence has led GPU designers to build increasingly parallel sets of pipelines.

Texturing

The actual color of each pixel can be taken directly from the lighting calculations, but for added realism, images called textures are often draped over the geometry to give the illusion of detail. GPUs store these textures in high-speed memory, which each pixel calculation must access to determine or modify that pixel’s color.

In practice, the GPU might require multiple texture accesses per pixel to mitigate visual artifacts that can result when textures appear either smaller or larger on screen than their native resolution. Because the access pattern to texture memory is typically very regular (nearby pixels tend to access nearby texture image locations), specialized cache designs help hide the latency of memory accesses.

Hidden surfaces

In most scenes, some objects obscure other objects. If each pixel were simply written to display memory, the most recently submitted triangle would appear to be in front. Thus, correct hidden surface removal would require sorting all triangles from back to front for each view, an expensive operation that isn’t even always possible for all scenes.

All modern GPUs provide a depth buffer, a region of memory that stores the distance from each pixel to the viewer. Before writing to the display, the GPU compares a pixel’s distance to the distance of the pixel that’s already present, and it updates the display memory only if the new pixel is closer.

THE GRAPHICS PIPELINE, EVOLVED

GPUs have evolved from a hardwired implementation of the graphics pipeline to a programmable computational substrate that can support it. Fixed-function units for transforming vertices and texturing pixels have been subsumed by a unified grid of processors, or shaders, that can perform these tasks and much more. This evolution has taken place over several generations by gradually replacing individual pipeline stages with increasingly programmable units. For example, the NVIDIA GeForce 3, launched in February 2001, introduced programmable vertex shaders. These shaders provide units that the programmer can use for performing matrix-vector multiplication, exponentiation, and square root calculations, as well as a short default program that uses these units to perform vertex transformation and lighting.

GeForce 3 also introduced limited reconfigurability into pixel processing, exposing the texturing hardware’s functionality as a set of register combiners that could achieve novel visual effects such as the “soap-bubble” look demonstrated in Figure 1. Subsequent GPUs introduced increased flexibility, adding support for longer programs, more registers, and control-flow primitives such as branches and loops.

Figure 1. Programmable shading. The introduction of programmable shading in 2001 led to several visual effects not previously possible, such as this simulation of refractive chromatic dispersion for a “soap bubble” effect.

The ATI Radeon 9700 (July 2002) and NVIDIA GeForce FX (January 2003) replaced the often awkward register combiners with fully programmable pixel shaders. NVIDIA’s latest chip, the GeForce 8800 (November 2006), adds programmability to the primitive assembly stage, allowing developers to control how they construct triangles from transformed vertices. As Figure 2 shows, modern GPUs achieve stunning visual realism.

Figure 2. Unprecedented visual realism. Modern GPUs can use programmable shading to achieve near-cinematic realism, as this interactive demonstration shows, featuring actress Adrianne Curry on an NVIDIA GeForce 8800 GTX.

Increases in precision have accompanied increases in programmability. The traditional graphics pipeline provided only 8-bit integers per color channel, allowing values ranging from 0 to 255. The ATI Radeon 9700 increased the representable range of color to 24-bit floating point, and NVIDIA’s GeForce FX followed with both 16-bit and 32-bit floating point. Both vendors have announced plans to support 64-bit double-precision floating point in upcoming chips.

To keep up with the relentless demand for graphics performance, GPUs have aggressively embraced parallel design. GPUs have long used four-wide vector registers much like those that Intel’s Streaming SIMD Extensions (SSE) instruction sets now provide on Intel CPUs. The number of such four-wide processors executing in parallel has increased as well, from only four on GeForce FX to 16 on GeForce 6800 (April 2004) to 24 on GeForce 7800 (May 2005). The GeForce 8800 actually includes 128 scalar shader processors that also run on a special shader clock at 2.5 times the clock rate (relative to pixel output) of former chips, so the computational performance might be considered equivalent to 128 × 2.5/4 = 80 four-wide pixel shaders.

UNIFIED SHADERS

The latest step in the evolution from hardwired pipeline to flexible computational fabric is the introduction of unified shaders. Unified shaders were first realized in the ATI Xenos chip for the Xbox 360 game console, and NVIDIA introduced them to PCs with the GeForce 8800 chip.

[Figure 3 diagram labels: 3D geometric primitives; programmable unified processors (vertex programs, geometry programs, pixel programs, compute programs); rasterization; hidden surface; GPU memory (DRAM); final image.]

Instead of separate custom processors for vertex shaders, geometry shaders, and pixel shaders, a unified shader architecture provides one large grid of data-parallel floating-point processors general enough to run all these shader workloads. As Figure 3 shows, vertices, triangles, and pixels recirculate through the grid rather than flowing through a pipeline with stages of fixed width.
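The difference between stages of fixed width and a shared grid can be made concrete with a toy timing model. This is a deliberately idealized sketch: the work units, the two workload phases, and the 32/96 split of the fixed pipeline are hypothetical values chosen for illustration, and the unified pool is assumed to balance load perfectly with no scheduling overhead. Only the total of 128 processors comes from the article.

```python
# Toy model comparing a fixed-width pipeline with a unified processor pool.
# The work units, the phases, and the 32/96 fixed split are hypothetical;
# only the total of 128 processors is taken from the article.

def fixed_pipeline_time(phases, vertex_units, pixel_units):
    """Each stage runs only its own kind of work, so the slower stage sets the pace."""
    return sum(max(vertex_work / vertex_units, pixel_work / pixel_units)
               for vertex_work, pixel_work in phases)


def unified_pool_time(phases, total_units):
    """Any processor can run any work (idealized: perfect balance, no overhead)."""
    return sum((vertex_work + pixel_work) / total_units
               for vertex_work, pixel_work in phases)


def utilization(phases, total_units, elapsed):
    """Fraction of the available processor-time spent on useful work."""
    total_work = sum(vertex_work + pixel_work
                     for vertex_work, pixel_work in phases)
    return total_work / (total_units * elapsed)


# (vertex work, pixel work) for each phase of one frame, in abstract units.
frame = [
    (128, 1152),   # pixel-heavy phase: a few large triangles
    (1152, 128),   # vertex-heavy phase: many small triangles
]

fixed_time = fixed_pipeline_time(frame, vertex_units=32, pixel_units=96)
unified_time = unified_pool_time(frame, total_units=128)

fixed_util = utilization(frame, 128, fixed_time)
unified_util = utilization(frame, 128, unified_time)

print(fixed_time, unified_time, fixed_util, unified_util)
```

In this model the fixed pipeline is always waiting on whichever stage happens to be overloaded, so more than half of its processor-time is idle, while the pool simply reassigns processors as the mix of work changes. Real hardware falls somewhere between these two extremes.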
This configuration leads to better overall utilization because demand for the various shaders varies greatly between applications, and indeed even within a single frame of one application. For example, a videogame might begin an image by using large triangles to draw the sky and distant terrain. This quickly saturates the pixel shaders in a traditional pipeline, while leaving the vertex shaders mostly idle. One millisecond later, the game might use highly detailed geometry to draw intricate characters and objects. This behavior will swamp the vertex shaders and leave the pixel shaders mostly idle.

These dramatic oscillations in resource demands in a single image present a load-balancing nightmare for the game designer and can also vary unpredictably as the players’ viewpoint and actions change. A unified shader architecture, on the other hand, can allocate a varying percentage of its pool of processors to each shader type.

For this example, a GeForce 8800 might use 90 percent of its 128 processors as pixel shaders and 10 percent as vertex shaders while drawing the sky, then reverse that ratio when drawing a distant character’s geometry. The net result is a flexible parallel architecture that improves GPU utilization and provides much greater flexibility for game designers.

Figure 3. Graphics pipeline evolution. The NVIDIA GeForce 8800 GPU replaces the traditional graphics pipeline with a unified shader architecture in which vertices, triangles, and pixels recirculate through a set of programmable processors. The flexibility and computational power of these processors invites their use for general-purpose computing tasks.

GPGPU

The highly parallel workload of real-time computer graphics demands extremely high arithmetic throughput and streaming memory bandwidth but tolerates considerable latency in an individual computation since final images are only displayed every 16 milliseconds. These workload characteristics have shaped the underlying GPU architecture: Whereas CPUs are optimized for low latency, GPUs are optimized for high throughput.

The raw computational horsepower of GPUs is staggering: A single GeForce 8800 chip achieves a sustained 330 billion floating-point operations per second (330 Gflops) on simple benchmarks (... gpubench). The ever-increasing power, programmability, and precision of GPUs have motivated a great deal of research on general-purpose computation on graphics hardware, or GPGPU for short. GPGPU researchers and developers use the GPU as a computational coprocessor rather than as an image-synthesis device.

The GPU’s specialized architecture isn’t well suited to every algorithm. Many applications are inherently serial and are characterized by incoherent and unpredictable memory access. Nonetheless, many important problems require significant computational resources, mapping well to the GPU’s many-core arithmetic intensity, or they require streaming through large quantities of data, mapping well to the GPU’s streaming memory subsystem.

Porting a judiciously chosen algorithm to the GPU often produces speedups of five to 20 times over mature, optimized CPU codes running on state-of-the-art CPUs, and speedups of more than 100 times have been reported for some algorithms that map especially well.

Notable GPGPU success stories include Stanford University’s Folding@home project, which uses spare cycles that users around the world donate to study protein folding (http://folding. ...). A new GPU-accelerated Folding@home client contributed 28,000 Gflops in the month after its October 2006 release, more than 18 percent of the total Gflops that CPU clients contributed running on Microsoft Windows since October 2000.

In another GPGPU success story, researchers at the University of North Carolina and Microsoft used GPU-based code to win the 2006 Indy PennySort category of the TeraSort competition, a sorting benchmark testing price/performance for database operations (... GPUTERASORT). Closer to home for the GPU business, the HavokFX product uses GPGPU techniques to accelerate tenfold the physics calculations used to add realistic behavior to objects in computer games (www. ...).

Modern GPUs could be seen as the first generation of commodity data-parallel processors. Their tremendous computational capacity and rapid growth curve, far outstripping traditional CPUs, highlight the advantages of domain-specialized data-parallel computing.

We can expect increased programmability and generality from future GPU architectures, but not without limit; neither vendors nor users want to sacrifice the specialized architecture that made GPUs successful in the first place. Today, GPU developers need new high-level programming models for massively multithreaded parallel computation, a problem soon to impact multicore CPU vendors as well.

Can GPU vendors, graphics developers, and the GPGPU research community build on their success with commodity parallel computing to transcend their computer graphics roots and develop the computational idioms, techniques, and frameworks for the desktop parallel computing environment of the future? ■

David Luebke is a research scientist at NVIDIA Research. Contact him at ...

Greg Humphreys is a faculty member in the Computer Science Department at the University of Virginia. Contact him at ...

Computer welcomes your submissions to this bimonthly column. For additional information, or to suggest topics that you would like to see explained, contact column editor Alf Weaver at weaver@cs. ...

February 2007

130     Computer
