# A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware


```
A Multigrid Solver for Boundary Value Problems
Using Programmable Graphics Hardware
Nolan Goodnight Cliff Woolley Gregory Lewin
David Luebke Greg Humphreys
University of Virginia

Graphics Hardware 2003
July 26-27 – San Diego, CA

augmented by Klaus Mueller, Stony Brook University
General-Purpose GPU Programming

   Why do we port algorithms to the GPU?
   How much faster can we expect it to be, really?
   What is the challenge in porting?
Case Study

Problem: Implement a Boundary Value
Problem (BVP) solver using the GPU

Could benefit an entire class of scientific and
engineering applications, e.g.:
   Heat transfer
   Fluid flow
Related Work

   Krüger and Westermann: Linear Algebra Operators
for GPU Implementation of Numerical Algorithms

   Bolz et al.: Sparse Matrix Solvers on the GPU:
Conjugate Gradients and Multigrid
   Very similar to our system
   Developed concurrently
   Complementary approach
Driving problem: Fluid mechanics sim

Problem domain is a warped disc:

[figure: warped disc mapped onto a regular grid]
BVPs: Background

   Boundary value problems are sometimes governed
by PDEs of the form:

L = f

L is some operator
 is the problem domain
f is a forcing function (source term)

Given L and f, solve for .
BVPs: Example

Heat Transfer
   Find a steady-state temperature distribution T in a
solid of thermal conductivity k with thermal source S
   This requires solving a Poisson equation of the form:

k2T = -S

   This is a BVP where L is the Laplacian operator 2

All our applications require a Poisson solver.
BVPs: Solving

   Most such problems cannot be solved analytically
   Instead, discretize onto a grid to form a set of linear
equations, then solve:
   Direct elimination
   Gauss-Seidel iteration
   Strongly implicit procedures
   Multigrid method
Multigrid method

   Iteratively corrects an approximation to the solution
   Operates at multiple grid resolutions
   Low-resolution grids are used to correct higher-
resolution grids recursively

   Very fast, especially for large grids: O(n)
Multigrid method

   Use coarser grid levels to recursively correct an
approximation to the solution
   may converge slowly on fine grid -> restrict to
coarse grid
   push out long wavelength errors quickly (single grid
solvers only smooth out high frequency errors)
   Algorithm:
   smooth
   compute residual ε = Lφᵢ - f
   restrict
   recurse
   interpolate
Implementation - Overview

For each step of the algorithm:
   Bind as texture maps the buffers that contain the
necessary data (current solution, residual, source
terms, etc.)
   Set the target buffer for rendering
   Activate a fragment program that performs the
necessary kernel computation (smoothing, residual
calculation, restriction, interpolation)
   Render a grid-sized quad with multitexturing

[diagram: source buffer (texture) -> fragment program -> render target buffer]
Implementation - Overview
Input buffers

   Solution buffer: four-channel floating point pixel
buffer (p-buffer)
   one channel each for solution, residual, source
term, and a debugging term
   toggle front and back surfaces used to hold old
and new solution
   Operator map: contains the discretized operator
(e.g., Laplacian)
   Red-black map: accelerate odd-even tests (see later)
Smoothing

   Jacobi method
   one matrix row:  sum_j a_ij * x_j = b_i
   calculate new value for each solution vector
element:  x_i(new) = (b_i - sum_{j != i} a_ij * x_j) / a_ii
   in our application, the a_ij form the discretized
Laplacian (a sparse matrix)
Smoothing

   Also factor in the source term
   Use Red-black map to update only half of the grid
cells in each pass
   converges faster in practice
   known as red-black iteration
   requires two passes per iteration
Calculate residual

   Apply operator (Laplacian) and source term to the
current solution

   residual ε = k∇²T + S
   Store result in the target surface
   Use occlusion query to determine if all solution
fragments are below threshold (ε < threshold)
   occlusion query = true means all fragments are
below threshold
   this is an L∞ norm, which may be too strict
   less strict norms (L1, L2) would require a reduction
or a fragment accumulation register (not available
yet); these could run on the CPU instead
Multigrid reduction and refinement

   Average (restrict) current residual into coarser grid

   Iterate/smooth on coarser grid, solving k∇²φ = -S
   Interpolate correction back into finer grid
   or restrict once more -> recursion
   use bilinear interpolation
   Update grid with this correction
   Iterate/smooth on fine grid
Boundary conditions

   Dirichlet (prescribed)
   Neumann (prescribed derivative)
   Mixed (coupled value and derivative)

   U_k: value at grid point k
   n_k: normal at grid point k
   Periodic boundaries result in toroidal mapping
   Apply boundary conditions in smoothing pass
Boundary conditions

   Boundary conditions only need to be computed at the
boundaries
   boundary cells need significantly more computation
   so restrict the extra computation to the boundaries
   GPUs do not allow branching
   or rather: both branches are executed and one
result is discarded - even more wasteful
   decompose the domain into boundary and interior areas
   use a general (boundary) and a fastpath (interior)
fragment program
   run these in two separate passes, on their respective
domains
Optimizing the Solver

   Detect steady-state natively on GPU
   Use special-case whenever possible
   Limit context switches

   How to detect convergence?
   L1 norm - average error
   L2 norm – RMS error (common in visual sim)
   L norm – max error (common in sci/eng apps)
   Can use occlusion query!


   Minimize number of registers used
   Vectorize as much as possible
   Use the rasterizer to perform computations of
linearly-varying values
   Pre-compute invariants on CPU
   Compute texture coordinate offsets in the vertex shader

shader        original fp    fastpath fp    fastpath vp
smooth        79-6-1         20-4-1         12-2
residual      45-7-0         16-4-0         11-1
restrict      66-6-1         21-3-0         11-1
interpolate   93-6-1         25-3-0         13-2
Optimizing the Solver: Special-case

   Fast-path vs. slow-path
   write several variants of each fragment program
to handle boundary cases
   eliminates conditionals in the fragment program
   equivalent to avoiding CPU inner-loop branching

[figure: fast path (no boundaries) vs. slow path (with boundaries)]
Optimizing the Solver: Special-case

[chart: secs per v-cycle vs. grid size]
Optimizing the Solver: Context-switching

   Find the best packing of multiple grid levels
into the p-buffer surfaces - many p-buffers
Optimizing the Solver: Context-switching

   Find the best packing of multiple grid levels
into the p-buffer surfaces - two p-buffers
Optimizing the Solver: Context-switching

   Find the best packing of multiple grid levels
into the p-buffer surfaces - a single p-buffer
   Still one front- and one back surface for iterative
smoothing
Optimizing the Solver: Context-switching

   Remove context switching
   Can introduce operations with undefined results:
   Why do we need to do this?
   there is a chance that we write to and read from the
same surface at the same time
   Can we get away with it?
   Yes, we can - we just need to be careful to avoid
these conflicts
   this was not used in this implementation; it may give
another boost of a factor of 4
Data Layout

   Performance:

[chart: performance vs. grid size]
Data Layout

Stacked domain
   Compute 4 values at a time
   Requires source, residual, and solution values to be
in different buffers
   Complicates boundary calculations
Results: CPU vs. GPU

   Performance:

[chart: performance vs. grid size]
Applications – Flow Simulation
Applications – High Dynamic Range

[images: CPU result vs. GPU result]
Conclusions

What we need going forward:
   Superbuffers
   or: Universal support for multiple-surface pbuffers
   or: Cheap context switching
   Developer tools
   Debugging tools
   Documentation
   Global accumulator
   Ever increasing amounts of precision, memory
   Textures bigger than 2048 on a side
Acknowledgements

Hardware
     David Kirk
     Matt Papakipos
Driver Support
     Nick Triantos
     Pat Brown
     Stephen Ehmann
Fragment Programming
     James Percy
     Matt Pharr
General-purpose GPU
     Mark Harris
     Aaron Lefohn
     Ian Buck
Funding
     NSF Award #0092793
```
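The red-black smoothing pass from the "Smoothing" slides can be sketched in NumPy as a CPU stand-in for the fragment program (the function name and the unit grid spacing are assumptions, not the authors' code):

```python
import numpy as np

def redblack_smooth(phi, f, h=1.0):
    """One red-black smoothing iteration for the 2D Poisson problem
    laplacian(phi) = f on a grid with spacing h.  The two parity
    passes mirror the two render passes driven by the red-black map."""
    for parity in (0, 1):  # red pass, then black pass
        for i in range(1, phi.shape[0] - 1):
            for j in range(1, phi.shape[1] - 1):
                if (i + j) % 2 == parity:
                    phi[i, j] = 0.25 * (phi[i - 1, j] + phi[i + 1, j] +
                                        phi[i, j - 1] + phi[i, j + 1] -
                                        h * h * f[i, j])
    return phi
```

Updating only half the cells per pass lets each new value immediately use neighbors updated in the same iteration, which is why the slides note it converges faster in practice than plain Jacobi.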
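The residual step and the three convergence norms compared on the "Optimizing the Solver" slide might look as follows (another NumPy stand-in; `converged` plays the role of the occlusion-query L∞ test):

```python
import numpy as np

def residual(phi, f, h=1.0):
    """epsilon = L(phi) - f at interior points, using the 5-point
    discrete Laplacian; boundary entries are left at zero."""
    eps = np.zeros_like(phi)
    eps[1:-1, 1:-1] = (phi[:-2, 1:-1] + phi[2:, 1:-1] +
                       phi[1:-1, :-2] + phi[1:-1, 2:] -
                       4.0 * phi[1:-1, 1:-1]) / (h * h) - f[1:-1, 1:-1]
    return eps

def norms(eps):
    """L1 (average), L2 (RMS), and L-infinity (max) error norms."""
    n = eps.size
    return (np.abs(eps).sum() / n,
            np.sqrt((eps * eps).sum() / n),
            np.abs(eps).max())

def converged(eps, tol):
    # the occlusion-query analogue: true iff ALL values are below tol
    return np.abs(eps).max() < tol
```

As the slides note, the L1 and L2 norms need a global reduction, which is exactly what the occlusion-query trick avoids for L∞.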
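The restriction and bilinear-interpolation kernels from the "Multigrid reduction and refinement" slide could be sketched like this on grids of size 2^k + 1 (full weighting is assumed for the "average" step; the slides do not specify the stencil):

```python
import numpy as np

def restrict(fine):
    """Average a (2n+1)x(2n+1) grid down to (n+1)x(n+1) with the
    standard full-weighting stencil; boundary values are injected."""
    coarse = fine[::2, ::2].copy()
    coarse[1:-1, 1:-1] = (
        4.0 * fine[2:-2:2, 2:-2:2] +
        2.0 * (fine[1:-3:2, 2:-2:2] + fine[3:-1:2, 2:-2:2] +
               fine[2:-2:2, 1:-3:2] + fine[2:-2:2, 3:-1:2]) +
        fine[1:-3:2, 1:-3:2] + fine[1:-3:2, 3:-1:2] +
        fine[3:-1:2, 1:-3:2] + fine[3:-1:2, 3:-1:2]) / 16.0
    return coarse

def interpolate(coarse):
    """Bilinearly interpolate an (n+1)x(n+1) grid up to (2n+1)x(2n+1),
    matching the slides' choice of bilinear interpolation."""
    m = 2 * (coarse.shape[0] - 1) + 1
    fine = np.zeros((m, m))
    fine[::2, ::2] = coarse
    fine[1::2, ::2] = 0.5 * (coarse[:-1, :] + coarse[1:, :])
    fine[::2, 1::2] = 0.5 * (coarse[:, :-1] + coarse[:, 1:])
    fine[1::2, 1::2] = 0.25 * (coarse[:-1, :-1] + coarse[:-1, 1:] +
                               coarse[1:, :-1] + coarse[1:, 1:])
    return fine
```

On the GPU both kernels are just texture fetches with scaled coordinates, which is why they map so naturally to a fragment program.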
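Putting the pieces together, the smooth -> residual -> restrict -> recurse -> interpolate cycle can be sketched as a recursive V-cycle. This is a sketch under assumptions: damped Jacobi stands in for the slides' red-black smoother, and `v_cycle` and its parameters are illustrative names:

```python
import numpy as np

def smooth(phi, f, h, sweeps=3, w=0.8):
    # damped Jacobi smoothing (the slides use red-black iteration)
    for _ in range(sweeps):
        upd = 0.25 * (phi[:-2, 1:-1] + phi[2:, 1:-1] +
                      phi[1:-1, :-2] + phi[1:-1, 2:] -
                      h * h * f[1:-1, 1:-1])
        phi[1:-1, 1:-1] += w * (upd - phi[1:-1, 1:-1])
    return phi

def residual(phi, f, h):
    # r = f - L(phi); note the slides write epsilon = L(phi) - f,
    # i.e. the opposite sign convention
    r = np.zeros_like(phi)
    r[1:-1, 1:-1] = f[1:-1, 1:-1] - (
        phi[:-2, 1:-1] + phi[2:, 1:-1] + phi[1:-1, :-2] +
        phi[1:-1, 2:] - 4.0 * phi[1:-1, 1:-1]) / (h * h)
    return r

def restrict(fine):
    # full-weighting restriction, injection on the boundary
    coarse = fine[::2, ::2].copy()
    coarse[1:-1, 1:-1] = (
        4.0 * fine[2:-2:2, 2:-2:2] +
        2.0 * (fine[1:-3:2, 2:-2:2] + fine[3:-1:2, 2:-2:2] +
               fine[2:-2:2, 1:-3:2] + fine[2:-2:2, 3:-1:2]) +
        fine[1:-3:2, 1:-3:2] + fine[1:-3:2, 3:-1:2] +
        fine[3:-1:2, 1:-3:2] + fine[3:-1:2, 3:-1:2]) / 16.0
    return coarse

def interpolate(coarse):
    # bilinear prolongation
    m = 2 * (coarse.shape[0] - 1) + 1
    fine = np.zeros((m, m))
    fine[::2, ::2] = coarse
    fine[1::2, ::2] = 0.5 * (coarse[:-1, :] + coarse[1:, :])
    fine[::2, 1::2] = 0.5 * (coarse[:, :-1] + coarse[:, 1:])
    fine[1::2, 1::2] = 0.25 * (coarse[:-1, :-1] + coarse[:-1, 1:] +
                               coarse[1:, :-1] + coarse[1:, 1:])
    return fine

def v_cycle(phi, f, h):
    """smooth -> residual -> restrict -> recurse -> interpolate -> smooth"""
    phi = smooth(phi, f, h)
    if phi.shape[0] <= 3:
        return smooth(phi, f, h, sweeps=20)  # coarsest grid: just iterate
    r = restrict(residual(phi, f, h))
    e = v_cycle(np.zeros_like(r), r, 2.0 * h)
    phi += interpolate(e)  # coarse-grid correction
    return smooth(phi, f, h)
```

The coarse levels push out the long-wavelength errors that a single-grid smoother would take many iterations to remove, which is what gives the method its O(n) cost.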
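The special-casing idea — a specialized fastpath program on the interior and a general program on the boundary, run as two passes — can be mimicked on the CPU. The mirrored (Neumann-style) boundary used here is one illustrative choice, not necessarily the paper's:

```python
import numpy as np

def smooth_two_pass(phi, f, h=1.0):
    """One Jacobi sweep split into a 'fast path' (interior: fixed
    5-point stencil, fully vectorized, no conditionals) and a
    'slow path' (boundary: per-cell code with neighbor tests)."""
    new = phi.copy()
    # fast path: interior cells, branch-free stencil
    new[1:-1, 1:-1] = 0.25 * (phi[:-2, 1:-1] + phi[2:, 1:-1] +
                              phi[1:-1, :-2] + phi[1:-1, 2:] -
                              h * h * f[1:-1, 1:-1])
    # slow path: boundary cells, general code with branching
    # (mirrored neighbors give a zero-flux boundary)
    ni, nj = phi.shape
    for i in range(ni):
        for j in range(nj):
            if 0 < i < ni - 1 and 0 < j < nj - 1:
                continue  # interior already handled by the fast path
            s = (phi[max(i - 1, 0), j] + phi[min(i + 1, ni - 1), j] +
                 phi[i, max(j - 1, 0)] + phi[i, min(j + 1, nj - 1)])
            new[i, j] = 0.25 * (s - h * h * f[i, j])
    return new
```

Splitting the domain this way is the CPU analogue of eliminating conditionals from the fragment program: the common case pays nothing for the boundary logic.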