
A Multigrid Solver for Boundary Value Problems
Using Programmable Graphics Hardware

Nolan Goodnight, Cliff Woolley, Gregory Lewin,
David Luebke, Greg Humphreys
University of Virginia

Graphics Hardware 2003
July 26-27 – San Diego, CA

augmented by Klaus Mueller, Stony Brook University
General-Purpose GPU Programming

   Why do we port algorithms to the GPU?
   How much faster can we expect it to be, really?
   What is the challenge in porting?
Case Study

Problem: Implement a Boundary Value
  Problem (BVP) solver using the GPU

Could benefit an entire class of scientific and
   engineering applications, e.g.:
       Heat transfer
       Fluid flow
Related Work

   Krüger and Westermann: Linear Algebra Operators
    for GPU Implementation of Numerical Algorithms


   Bolz et al.: Sparse Matrix Solvers on the GPU:
    Conjugate Gradients and Multigrid
       Very similar to our system
           Developed concurrently
           Complementary approach
Driving problem: Fluid mechanics sim

Problem domain is a warped disc:




[Figure: the warped problem domain maps onto a regular grid]
BVPs: Background

   Boundary value problems are sometimes governed
    by PDEs of the form:

                        Lφ = f

            L is some operator
            Ω is the problem domain
            f is a forcing function (source term)


          Given L and f, solve for φ.
BVPs: Example

Heat Transfer
   Find a steady-state temperature distribution T in a
    solid of thermal conductivity k with thermal source S
   This requires solving a Poisson equation of the form:

                        k2T = -S

   This is a BVP where L is the Laplacian operator 2



All our applications require a Poisson solver.
BVPs: Solving

   Most such problems cannot be solved analytically
   Instead, discretize onto a grid to form a set of linear
    equations, then solve:
       Direct elimination
       Gauss-Seidel iteration
       Conjugate-gradient
       Strongly implicit procedures
       Multigrid method
Multigrid method

   Iteratively corrects an approximation to the solution
   Operates at multiple grid resolutions
   Low-resolution grids are used to correct higher-
    resolution grids recursively




   Very fast, especially for large grids: O(n)
Multigrid method

   Use coarser grid levels to recursively correct an
    approximation to the solution
       may converge slowly on fine grid -> restrict to
        coarse grid
           push out long wavelength errors quickly (single grid
            solvers only smooth out high frequency errors)
   Algorithm:
       smooth
       residual: r = Lφ - f
       restrict
           recurse
       interpolate
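The steps above can be sketched as a recursive V-cycle. Below is a minimal NumPy illustration of the control flow for a 1D Poisson model problem; the weighted-Jacobi smoother, full-weighting restriction, and linear interpolation are generic stand-ins for the paper's fragment programs, and the residual is written as r = f - Lφ (the opposite sign convention only flips the sign of the correction):

```python
import numpy as np

def smooth(phi, f, h, iters=4, w=2.0 / 3.0):
    """Weighted-Jacobi sweeps for -phi'' = f with Dirichlet boundaries."""
    for _ in range(iters):
        jac = 0.5 * (phi[:-2] + phi[2:] + h * h * f[1:-1])
        phi[1:-1] += w * (jac - phi[1:-1])
    return phi

def residual(phi, f, h):
    """Defect r = f - L phi (zero on the boundary)."""
    r = np.zeros_like(phi)
    r[1:-1] = f[1:-1] - (2.0 * phi[1:-1] - phi[:-2] - phi[2:]) / (h * h)
    return r

def restrict(r):
    """Full-weighting average onto the coarse grid."""
    coarse = r[::2].copy()
    coarse[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return coarse

def interpolate(e):
    """Linear interpolation of the coarse correction back to the fine grid."""
    fine = np.zeros(2 * len(e) - 1)
    fine[::2] = e                           # coincident points: copy
    fine[1::2] = 0.5 * (e[:-1] + e[1:])     # in-between points: average
    return fine

def v_cycle(phi, f, h):
    phi = smooth(phi, f, h)                 # pre-smooth
    if len(phi) > 3:
        r = residual(phi, f, h)             # residual
        e = v_cycle(np.zeros((len(phi) + 1) // 2), restrict(r), 2.0 * h)
        phi += interpolate(e)               # coarse-grid correction
    return smooth(phi, f, h)                # post-smooth
```

A few cycles drive the error down to the discretization limit, which is the O(n) behavior the slide refers to.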
Implementation - Overview

For each step of the algorithm:
   Bind as texture maps the buffers that contain the
    necessary data (current solution, residual, source
    terms, etc.)
   Set the target buffer for rendering
   Activate a fragment program that performs the
    necessary kernel computation (smoothing, residual
    calculation, restriction, interpolation)
   Render a grid-sized quad with multitexturing

       [Diagram: source buffer texture -> fragment program -> render target buffer]
Implementation - Overview
Input buffers

   Solution buffer: four-channel floating point pixel
    buffer (p-buffer)
       one channel each for solution, residual, source
        term, and a debugging term
       front and back surfaces toggle to hold the old
        and new solution
   Operator map: contains the discretized operator
    (e.g., Laplacian)
   Red-black map: accelerates odd-even tests (see later)
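The front/back toggling of the solution buffer can be modeled on the CPU as a simple ping-pong pair. This is only an illustrative sketch (`PingPong` and `step` are invented names, not the paper's API): each pass reads one surface and writes the other, then the roles swap.

```python
import numpy as np

class PingPong:
    """CPU-side model of the paper's front/back p-buffer surfaces."""

    def __init__(self, initial):
        self.front = initial          # surface currently holding the solution
        self.back = initial.copy()    # surface to render the next pass into

    def step(self, kernel):
        # A pass never reads and writes the same surface:
        # read `front`, write `back`, then swap the two.
        self.back = kernel(self.front)
        self.front, self.back = self.back, self.front
        return self.front
```

For example, `buf.step(lambda a: a + 1)` plays the role of one smoothing pass over the grid.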
Smoothing

   Jacobi method
       one matrix row of the linear system:

            sum_j a_ij x_j = b_i

       calculate new value for each solution vector
        element:

            x_i(new) = ( b_i - sum_{j≠i} a_ij x_j(old) ) / a_ii

       in our application, the a_ij are the Laplacian
        stencil coefficients (sparse matrix)
Smoothing

   Also factor in the source term
   Use Red-black map to update only half of the grid
    cells in each pass
       converges faster in practice
       known as red-black iteration
       requires two passes per iteration
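Updating only one color of a checkerboard per pass turns Jacobi into red-black Gauss-Seidel. A minimal NumPy sketch of the two-pass iteration for -∇²φ = f (the mask plays the role of the red-black map; `red_black_smooth` is an illustrative name):

```python
import numpy as np

def red_black_smooth(phi, f, h, iters=1):
    """One iteration = a red pass then a black pass, each updating
    half of the interior grid cells (red-black Gauss-Seidel)."""
    ii, jj = np.indices(phi.shape)
    for _ in range(iters):
        for color in (0, 1):                     # red pass, then black pass
            mask = (ii + jj) % 2 == color        # checkerboard of one color
            mask[0, :] = mask[-1, :] = False     # keep boundary values fixed
            mask[:, 0] = mask[:, -1] = False
            nb = (np.roll(phi, 1, 0) + np.roll(phi, -1, 0) +
                  np.roll(phi, 1, 1) + np.roll(phi, -1, 1))
            phi[mask] = 0.25 * (nb[mask] + h * h * f[mask])
    return phi
```

Because each black-pass update already sees the fresh red values, the iteration converges faster than plain Jacobi, at the cost of the second pass.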
Calculate residual

   Apply operator (Laplacian) and source term to the
    current solution

       residual r = k∇²T + S
   Store result in the target surface
   Use occlusion query to determine if all solution
    fragments are below threshold (|r| < threshold)
       occlusion query = true means all fragments are
        below threshold
       this is an L∞ norm, which may be too strict
       less strict norms L1, L2 would require a reduction or
        a fragment accumulation register (not available
        yet); could run on the CPU instead
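A CPU sketch of this step: compute the residual of the heat equation and test every "fragment" against the threshold, the way the occlusion query does in hardware (function names here are illustrative, not the paper's API):

```python
import numpy as np

def residual(T, S, k, h):
    """r = k * laplacian(T) + S on the interior; zero on the boundary ring."""
    r = np.zeros_like(T)
    lap = (T[:-2, 1:-1] + T[2:, 1:-1] + T[1:-1, :-2] + T[1:-1, 2:]
           - 4.0 * T[1:-1, 1:-1]) / (h * h)
    r[1:-1, 1:-1] = k * lap + S[1:-1, 1:-1]
    return r

def converged(r, eps):
    """Stand-in for the occlusion query: true only when every residual
    fragment is below the threshold (an L-infinity convergence test)."""
    return bool(np.all(np.abs(r) < eps))
```

On the GPU the fragment program kills fragments with |r| < eps, so a query result of zero rendered fragments signals steady state.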
Multigrid reduction and refinement

   Average (restrict) current residual into coarser grid

   Iterate/smooth on coarser grid, solving k∇²e = -r
    for the correction e (the restricted residual acts
    as the source term)
   Interpolate correction back into finer grid
       or restrict once more -> recursion
       use bilinear interpolation
   Update grid with this correction
   Iterate/smooth on fine grid
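The two grid-transfer kernels can be sketched in NumPy as follows: a full-weighting average for restriction and bilinear interpolation for the correction, matching the averaging and bilinear filtering described above (the function bodies are generic textbook stencils, not the paper's shaders):

```python
import numpy as np

def restrict(r):
    """Average (restrict) a fine-grid field into the coarser grid:
    full weighting at interior points, injection on the boundary."""
    c = r[::2, ::2].copy()
    c[1:-1, 1:-1] = (0.25 * r[2:-2:2, 2:-2:2]
                     + 0.125 * (r[1:-3:2, 2:-2:2] + r[3:-1:2, 2:-2:2]
                                + r[2:-2:2, 1:-3:2] + r[2:-2:2, 3:-1:2])
                     + 0.0625 * (r[1:-3:2, 1:-3:2] + r[1:-3:2, 3:-1:2]
                                 + r[3:-1:2, 1:-3:2] + r[3:-1:2, 3:-1:2]))
    return c

def interpolate(e):
    """Bilinearly interpolate a coarse-grid correction onto the fine grid."""
    n = e.shape[0]
    fine = np.zeros((2 * n - 1, 2 * n - 1))
    fine[::2, ::2] = e                                      # coincident points
    fine[1::2, ::2] = 0.5 * (e[:-1, :] + e[1:, :])          # vertical edges
    fine[::2, 1::2] = 0.5 * (e[:, :-1] + e[:, 1:])          # horizontal edges
    fine[1::2, 1::2] = 0.25 * (e[:-1, :-1] + e[1:, :-1]     # cell centers
                               + e[:-1, 1:] + e[1:, 1:])
    return fine
```

On the GPU the interpolation comes almost for free, since bilinear texture filtering does the same averaging in hardware.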
Boundary conditions

   Dirichlet (prescribed value):          U_k = c
   Neumann (prescribed derivative):       dU_k/dn_k = c
   Mixed (coupled value and derivative):  a·U_k + b·dU_k/dn_k = c

       U_k: value at grid point k
       n_k: normal at grid point k
   Periodic boundaries result in toroidal mapping
   Apply boundary conditions in smoothing pass
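A sketch of how the two simplest cases might be enforced after each smoothing pass, purely for illustration (the paper's general shader evaluates mixed conditions per boundary cell; `apply_boundaries` and its arguments are invented here):

```python
import numpy as np

def apply_boundaries(phi, dirichlet_value=None):
    """Illustrative boundary update for a rectangular grid:
    Dirichlet pins the top/bottom rows to a prescribed value;
    homogeneous Neumann (zero normal derivative) copies the
    adjacent interior value outward."""
    if dirichlet_value is not None:
        phi[0, :] = phi[-1, :] = dirichlet_value   # Dirichlet: U_k = c
    else:
        phi[0, :] = phi[1, :]                      # Neumann: dU/dn = 0
        phi[-1, :] = phi[-2, :]
    phi[:, 0] = phi[:, 1]                          # Neumann on left/right
    phi[:, -1] = phi[:, -2]
    return phi
```

Re-applying this after every smoothing pass keeps the iterate consistent with the boundary conditions, which is exactly when the paper folds them in.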
Boundary conditions

   Boundary conditions only need to be computed at
    the boundaries
       boundaries need significantly more computation
       restrict those computations to boundary cells
   GPUs of this generation do not allow branching in
    fragment programs
       instead, both branches are executed and the
        invalid fragment is discarded
       even more wasteful
   decompose domain into boundary and interior areas
       use general (boundary) and fastpath (interior)
        shaders
       run these in two separate passes, on respective
        domains
Optimizing the Solver

   Detect steady-state natively on GPU
   Minimize shader length
   Use special-case shaders whenever possible
   Limit context switches
Optimizing the Solver: Steady-state

   How to detect convergence?
       L1 norm - average error
       L2 norm - RMS error (common in visual sim)
       L∞ norm - max error (common in sci/eng apps)
            Can use occlusion query!




[Plot: secs to steady state vs. grid size]
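The three measures side by side, as a trivial CPU sketch; on the GPU only the L∞ test maps directly onto the occlusion query, while L1 and L2 would need a reduction:

```python
import numpy as np

def norms(r):
    """Convergence measures over a residual field r."""
    return {
        "L1": float(np.mean(np.abs(r))),       # average error
        "L2": float(np.sqrt(np.mean(r * r))),  # RMS error
        "Linf": float(np.max(np.abs(r))),      # max error (occlusion-query test)
    }
```

Since L∞ bounds the other two, declaring steady state on the max error is the strictest of the three stopping criteria.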
Optimizing the Solver: Shader length

   Minimize number of registers used
   Vectorize as much as possible
   Use the rasterizer to perform computations of
    linearly-varying values
   Pre-compute invariants on CPU
   Compute texture coordinate offsets in vertex shader
        shader        original fp   fastpath fp   fastpath vp
        smooth        79-6-1        20-4-1        12-2
        residual      45-7-0        16-4-0        11-1
        restrict      66-6-1        21-3-0        11-1
        interpolate   93-6-1        25-3-0        13-2
Optimizing the Solver: Special-case

   Fast-path vs. slow-path
       write several variants of each fragment program
        to handle boundary cases
       eliminates conditionals in the fragment program
       equivalent to avoiding CPU inner-loop branching


[Figure: fast path (no boundaries) vs. slow path (with boundaries)]
Optimizing the Solver: Special-case

    Fast-path vs. slow-path (as above)
[Plot: secs per v-cycle vs. grid size]
Optimizing the Solver: Context-switching

   Find the best packing of data from multiple grid
    levels into the p-buffer surfaces - many p-buffers
Optimizing the Solver: Context-switching

   Find the best packing of data from multiple grid
    levels into the p-buffer surfaces - two p-buffers
Optimizing the Solver: Context-switching

   Find the best packing of data from multiple grid
    levels into the p-buffer surfaces - a single p-buffer
   Still one front- and one back surface for iterative
    smoothing
Optimizing the Solver: Context-switching

   Remove context switching
       Can introduce operations with undefined results:
        reading and writing the same surface
       Why do we need to do this?
           there is a chance that we write and read from the same
            surface at the same time
       Can we get away with it?
           Yes - we just need to be careful to avoid these
            conflicts
       What about RGBA parallelism?
           was not used in this implementation; may give another
            boost of factor 4
 Data Layout

    Performance:




[Plot: secs to steady state vs. grid size]
Data Layout

   Possible additional vectorization: stacked domain
       Compute 4 values at a time
       Requires source, residual, and solution values
        to be in different buffers
       Complicates boundary calculations
       Adds setup and teardown overhead
 Results: CPU vs. GPU

    Performance:




[Plot: secs to steady state vs. grid size]
Applications – Flow Simulation
Applications – High Dynamic Range

[Figure: CPU result vs. GPU result]
Conclusions

What we need going forward:
      Superbuffers
          or: Universal support for multiple-surface pbuffers
          or: Cheap context switching
      Developer tools
          Debugging tools
          Documentation
      Global accumulator
      Ever increasing amounts of precision, memory
          Textures bigger than 2048 on a side
Acknowledgements

Hardware
       David Kirk
       Matt Papakipos
Driver Support
       Nick Triantos
       Pat Brown
       Stephen Ehmann
Fragment Programming
       James Percy
       Matt Pharr
General-purpose GPU
       Mark Harris
       Aaron Lefohn
       Ian Buck
Funding
       NSF Award #0092793
								