# project_presentation by wuzhengqin

```
Optimizing 3D Multigrid to Be Comparable to the FFT

Michael Maire and Kaushik Datta

Note: Several diagrams were taken from Kathy Yelick’s
CS267 lectures
Outline
   FFT and Multigrid Overview
   FFT and MG Running Times
   Multigrid Performance Model
   Multigrid Optimizations
   Results
3D Poisson’s Equation
   An elliptic PDE that arises in many
physical problems (e.g. electrostatic or
gravitational potential)
   Many different techniques for solving
are available
3D Poisson’s Equation
   The continuous version is:
d2/dx2 + d2/dy2 + d2/dz2 =  (or 
= )
   The discrete version is: T * x = b
   In 2D, the 9-point stencil looks like:
Algorithms for Solving 2D Poisson’s
Equation with N unknowns
Algorithm       Serial Flops   Memory
Dense LU        N^3            N^2
Band LU         N^2            N^(3/2)
Jacobi          N^2            N
Explicit Inv.   N^2            N^2
RB SOR          N^(3/2)        N
Sparse LU       N^(3/2)        N*log N
FFT             N*log N        N
Multigrid       N              N
Lower bound     N              N
Multigrid Overview
   Basic Algorithm:
   Replace problem on fine grid by an approximation
on a coarser grid
   Solve the coarse grid problem approximately, and
use the solution as a starting guess for the fine-
grid problem, which is then iteratively updated
   Solve the coarse grid problem recursively, i.e. by
using a still coarser grid approximation, etc.
   Success depends on the coarse-grid solution
being a good approximation to the fine-grid solution
Multigrid Sketch on a 2D Mesh
   Consider a (2^m+1) by (2^m+1) grid
   Let P(i) be the problem of solving the discrete Poisson equation on a
(2^i+1) by (2^i+1) grid in 2D
   Write the linear system as T(i) * x(i) = b(i)
   P(m), P(m-1), …, P(1) is the sequence of problems from finest to coarsest
Multigrid Convergence
[Figure: Multigrid Convergence Rate. x-axis: Number of MG V-cycles (0 to 20); y-axis: Log Base 10 (L2 Norm), from 0 down to -14]
Multigrid Operators
   The four operators that we examine are:
      evaluateResidual – calculates the residual of our current solution
      applySmoother – performs a Jacobi relaxation step
      coarsen – maps from a (2^m x 2^m x 2^m) grid to a (2^(m-1) x 2^(m-1) x 2^(m-1)) grid
      prolongate – maps from a (2^(m-1) x 2^(m-1) x 2^(m-1)) grid to a (2^m x 2^m x 2^m) grid
   All these operators perform nearest-neighbor
computations using a 27-point stencil
Multigrid V-cycle
   Just a picture of the call graph
   In time a V-cycle looks like the following:
[Figure: V-cycle diagram. Vertical axis: level (1 to 5); horizontal axis: time. Labeled operations: coarsen; prolongate, applySmoother, evaluateResidual]
Multigrid Performance Model
   Memory access is performance bottleneck
   Each pass over the 3D grid requires (per cell):
      27 integer operations (stencil coordinates)
      27 FP loads of surrounding grid locations
      27 (approx) FP operations
      1 FP store
   Traversing the grid consecutively in memory causes 9 cache misses every
(# doubles stored in a cache line) cells, i.e. 9/(# doubles per line) misses per cell
   Grid size prevents reuse of cached values
[Figure: Performance Model Running Time Predictions on Power3. x-axis: MG Operator (applySmoother, evaluateResidual, coarsen, prolongate); y-axis: Time (seconds), 0 to 4.5; series: lower bound, cse, upper bound]
[Figure: Performance Model for MFlops/s. x-axis: Operator (applySmoother, evaluateResidual, coarsen, prolongate); y-axis: MFlops/s, 0 to 450; series: Lower Bound, CSE, Upper Bound]
Multigrid Optimizations
   Optimizations possible in 2 areas:
      Reducing ALU operations per cell
         Reusing stencil coordinates between cells
         Reusing partial sums common to consecutive cells
      Improving memory behavior
         Reducing # of loads (register blocking)
         Reducing # of cache misses (cache blocking)
Multigrid Optimizations
   Common subexpression elimination
   Loop unrolling
   Memoization
   Cache blocking
   Memoization + cache blocking
Platforms
   Power3: 375 MHz, 64 KB L1 cache
   Itanium II: 900 MHz, 16 KB L1 cache
   Alphaserver: 1000 MHz, 64 KB L1 cache
   Opteron: 1600 MHz, 64 KB L1 cache
Optimizations – Loop Unrolling
   Reduces # stencil coordinates
computed per cell
   Exposes load reuse to compiler
   Allows compiler to use FP registers to
   Minimum number of loads is 9/grid
point (given generous # FP registers)
Optimizations - Memoization
   Traverses grid once to precompute partial
sums common to consecutive cells
   Traverses grid again to compute actual cell
values
   9 integer stencil operations/cell
   18 FP operations/cell
   Reduces FP register pressure by breaking
computation into two stages, but still uses 9
[Figures: four panels comparing Regular vs. Memoized for each MG Operator (applySmoother, evaluateResidual, coarsen, prolongate):
   The Effect of Memoization on Time on Power3 (y-axis: Time (seconds))
   The Effect of Memoization on FP Count on Power3 (y-axis: Millions of FP Operations)
   The Effect of Memoization on L1 Load Misses (y-axis: Millions of Misses)
   The Effect of Memoization on Number of …]
Optimizations – Cache Blocking
   Break 3D grid into blocks that fit within
cache
   Attempts to allow reuse between blocks
   Reduces memory traffic to 3 load
streams per cell
[Figures: Time for Different Cache Block Sizes on Power3, Alphaserver, Itanium II, and Opteron (one panel per platform). x-axis: Side Length of Square Cache Block (No Blocking, 10 to 100); y-axis: Time (seconds); series: applySmoother_cblk, evaluateResidual_cblk, applySmoother_mem_cblk, evaluateResidual_mem_cblk]
[Figure: L1 Load Misses for Different Cache Block Sizes on Power3. x-axis: Side Length of Square Cache Block (No Blocking, 10 to 100); y-axis: 0 to 8; series: applySmoother_cblk, evaluateResidual_cblk]
[Figure: Mispredicted Branches for Different Cache Block Sizes on Power3. x-axis: Side Length of Square Cache Block (No Blocking, 10 to 100); y-axis: Thousands of Mispredicted Branches, 0 to 200; series: applySmoother_cblk, evaluateResidual_cblk]
[Figures: Time for Different Optimizations on Power3, Alphaserver, Itanium II, and Opteron (one panel per platform). x-axis: MG Operator (applySmoother, evaluateResidual, coarsen, prolongate); y-axis: Time (seconds); series: baseline, cse, loop unrolling, cache blocking, memoization, memoization + cache blocking]
[Figure: MFlops/s Rate for MG Operators on Power3. x-axis: Operator (applySmoother, evaluateResidual, coarsen, prolongate); y-axis: MFlops/s, 0 to 250; series: Baseline, CSE, Cache Block, Memoize, Memoize + Cache Block]
One V-cycle Time
[Figure: V-cycle Time Across Platforms. x-axis: Platform (Power3, Itanium II, Alphaserver, Opteron); y-axis: Time (seconds), 0 to 10; series: baseline, cse, best]
FFT vs. MG on Power3
[Figure: Poisson Solve Time on the Power3 (Multigrid performs 15 V-cycles). x-axis: Solver (FFT, Multigrid); y-axis: Time (seconds), 0 to 100]
Summary & Continuing Work
   Overhead of cache-blocking is too large
for the small block sizes that fit in the
IBM Power3’s L1 cache
   Memoization offers greatest
performance benefit due to reduced FP
operations
[Figure: Fraction of Peak Memory Bandwidth for MG on a 375 MHz Power3 (for Level 8 of Multigrid). x-axis: Operator (applySmoother, evaluateResidual, coarsen, prolongate); y-axis: Fraction of Peak Bandwidth, 0 to 0.45; series: Baseline, CSE, Cache Block, Memoize, Memoize + Cache Block]
[Figure: L1 Load Misses for MG on a 375 MHz Power3. x-axis: MG Operator (applySmoother, evaluateResidual, coarsen, prolongate); y-axis: 0 to 8; series: Baseline, CSE, Cache Block, Memoize, Memoize + Cache Block]

```
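
The "Multigrid Overview" and "Multigrid Operators" slides describe the V-cycle only in words, so a small runnable sketch may help. The version below is an illustration under simplifying assumptions, not the authors' code. It solves the 1D Poisson problem -u'' = f with zero boundary values on a grid of 2^m + 1 points, rather than the 3D problem with a 27-point stencil used in the talk. The function names mirror the slide's operator names. The weighted-Jacobi factor `omega`, the number of sweeps, and the full-weighting and linear-interpolation transfer operators are assumed choices.

```python
import math

def evaluate_residual(u, b, h):
    """r = b - T*u for the 1D operator (T*u)_i = (2u_i - u_{i-1} - u_{i+1}) / h^2."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = b[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def apply_smoother(u, b, h, sweeps=2, omega=2.0 / 3.0):
    """Weighted Jacobi relaxation (omega and sweeps are assumed values)."""
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, len(u) - 1):
            jacobi = 0.5 * (u[i - 1] + u[i + 1] + h * h * b[i])
            new[i] = (1.0 - omega) * u[i] + omega * jacobi
        u = new
    return u

def coarsen(r):
    """Full-weighting restriction from 2^m + 1 points to 2^(m-1) + 1 points."""
    nc = (len(r) - 1) // 2 + 1
    rc = [0.0] * nc
    for i in range(1, nc - 1):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc

def prolongate(ec):
    """Linear interpolation from the coarse grid back to the fine grid."""
    ef = [0.0] * (2 * (len(ec) - 1) + 1)
    for i in range(len(ec)):
        ef[2 * i] = ec[i]
    for i in range(len(ec) - 1):
        ef[2 * i + 1] = 0.5 * (ec[i] + ec[i + 1])
    return ef

def v_cycle(u, b, h):
    """One V-cycle: smooth, restrict the residual, recurse, correct, smooth."""
    if len(u) == 3:
        # Coarsest grid has a single unknown: solve 2*u/h^2 = b exactly.
        return [0.0, b[1] * h * h / 2.0, 0.0]
    u = apply_smoother(u, b, h)
    rc = coarsen(evaluate_residual(u, b, h))
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)  # coarse-grid error equation
    u = [ui + ei for ui, ei in zip(u, prolongate(ec))]
    return apply_smoother(u, b, h)

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

if __name__ == "__main__":
    m = 6
    n, h = 2 ** m + 1, 1.0 / 2 ** m
    b = [0.0] + [1.0] * (n - 2) + [0.0]
    u = [0.0] * n
    for cycle in range(10):
        u = v_cycle(u, b, h)
        print(cycle + 1, math.log10(l2_norm(evaluate_residual(u, b, h))))
```

Printing the base-10 log of the residual norm after each cycle gives the same kind of curve as the "Multigrid Convergence Rate" plot, though the numbers for this 1D toy problem are not those of the talk.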
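
The memoization slide says partial sums common to consecutive cells are precomputed and then combined, but it does not show how. The sketch below is one possible reading, not the authors' implementation. It assumes the 27-point stencil has one weight per neighbor class (centre, face, edge, corner), passed as the hypothetical list `w`. For simplicity it memoizes one grid row at a time instead of making two passes over the whole grid. For each cell in a row it stores three partial sums over the 3x3 cross-section, and each output value then combines the partial sums of three consecutive cells.

```python
def naive_stencil(u, w):
    """Apply the 27-point stencil directly: 27 loads per interior cell."""
    n = len(u)
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        for dk in (-1, 0, 1):
                            # Weight class = number of nonzero offsets (0 to 3).
                            cls = abs(di) + abs(dj) + abs(dk)
                            acc += w[cls] * u[i + di][j + dj][k + dk]
                out[i][j][k] = acc
    return out

def memoized_stencil(u, w):
    """Same result, reusing per-cell partial sums along the k direction."""
    n = len(u)
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # Stage 1: partial sums over the 3x3 (i, j) cross-section at each k.
            centre = [u[i][j][k] for k in range(n)]
            faces = [u[i - 1][j][k] + u[i + 1][j][k]
                     + u[i][j - 1][k] + u[i][j + 1][k] for k in range(n)]
            corners = [u[i - 1][j - 1][k] + u[i - 1][j + 1][k]
                       + u[i + 1][j - 1][k] + u[i + 1][j + 1][k] for k in range(n)]
            # Stage 2: combine the sums at k-1, k, k+1; moving one step in k
            # raises each neighbor's weight class by one.
            for k in range(1, n - 1):
                out[i][j][k] = (
                    w[0] * centre[k]
                    + w[1] * (centre[k - 1] + centre[k + 1] + faces[k])
                    + w[2] * (faces[k - 1] + faces[k + 1] + corners[k])
                    + w[3] * (corners[k - 1] + corners[k + 1])
                )
    return out
```

In this sketch the second stage does far fewer additions per cell than the 27 of the direct form, because each partial sum is shared by three consecutive output cells. The exact operation counts quoted on the slide depend on the authors' formulation and are not reproduced here.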