Optimizing 3D Multigrid to Be Comparable to the FFT

    Michael Maire and Kaushik Datta

   Note: Several diagrams were taken from Kathy Yelick’s
                       CS267 lectures
Outline
   FFT and Multigrid Overview
   FFT and MG Running Times
   Multigrid Performance Model
   Multigrid Optimizations
   Results
3D Poisson’s Equation
   An elliptic PDE that arises in many
    physical problems (e.g. electrostatic or
    gravitational potential)
   Many different techniques for solving
    are available
3D Poisson’s Equation
   The continuous version is:
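    \frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2} + \frac{\partial^2 \phi}{\partial z^2} = \sigma
      (or \Delta \phi = \sigma)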
   The discrete version is: T * x = b
       In 2D, the 9-point stencil looks like:
Algorithms for Solving 2D Poisson’s Equation with N unknowns
   Algorithm       Serial Flops   Memory
   Dense LU        N^3            N^2
   Band LU         N^2            N^(3/2)
   Jacobi          N^2            N
   Explicit Inv.   N^2            N^2
   Conj. Grad.     N^(3/2)        N
   RB SOR          N^(3/2)        N
   Sparse LU       N^(3/2)        N log N
   FFT             N log N        N
   Multigrid       N              N
   Lower bound     N              N
Multigrid Overview
   Basic Algorithm:
       Replace problem on fine grid by an approximation
        on a coarser grid
       Solve the coarse grid problem approximately, and
        use the solution as a starting guess for the fine-
        grid problem, which is then iteratively updated
       Solve the coarse grid problem recursively, i.e. by
        using a still coarser grid approximation, etc.
   Success depends on coarse grid solution
    being a good approximation to the fine grid
Multigrid Sketch on a 2D Mesh
   Consider a (2^m + 1) by (2^m + 1) grid
   Let P(i) be the problem of solving the discrete Poisson equation on a
    (2^i + 1) by (2^i + 1) grid in 2D
       Write linear system as T(i) * x(i) = b(i)
   P(m) , P(m-1) , … , P(1) is sequence of problems from finest to coarsest
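The hierarchy above can be tabulated directly. The following is a minimal sketch (the function name `grid_hierarchy` and the choice m = 5 are illustrative assumptions, not taken from the slides) that lists the grid side and point count for P(m) down to P(1). It shows that the point counts shrink geometrically, so all levels together hold only a small constant multiple of the finest grid's points, which is consistent with the O(N) flop count listed for multigrid.

```python
def grid_hierarchy(m, dim=2):
    """Return [(level, side, points)] for P(m), ..., P(1), finest first."""
    levels = []
    for i in range(m, 0, -1):
        side = 2**i + 1              # (2^i + 1) points along each axis
        levels.append((i, side, side**dim))
    return levels


levels = grid_hierarchy(5)           # m = 5 is an arbitrary example size
finest_points = levels[0][2]
total_points = sum(points for _, _, points in levels)

for level, side, points in levels:
    print(f"P({level}): {side} x {side} grid, {points} points")
print(f"all levels / finest level = {total_points / finest_points:.3f}")
```

Passing `dim=3` gives the same table for the 3D case, where each coarsening step removes roughly 7/8 of the points.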
Multigrid Convergence
   [Figure: "Multigrid Convergence Rate". x-axis: number of MG V-cycles (0 to 20); y-axis: log base 10 of the L2 norm (0 down to -14).]
Multigrid Operators
   The four operators that we examine are:
       evaluateResidual – calculates the residual of our
        current solution
       applySmoother – performs a Jacobi relaxation step
       coarsen – maps from a (2^m x 2^m x 2^m) grid to a
        (2^(m-1) x 2^(m-1) x 2^(m-1)) grid
       prolongate – maps from a (2^(m-1) x 2^(m-1) x 2^(m-1)) grid
        to a (2^m x 2^m x 2^m) grid
   All these operators perform nearest-neighbor
    computations using a 27-point stencil
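To make the nearest-neighbor structure concrete, here is a minimal sketch of a generic 27-point stencil sweep. The function name `apply_stencil27`, the 3 x 3 x 3 `weights` layout, and the choice to copy boundary cells through unchanged are all assumptions made for illustration; the slides do not give the actual stencil coefficients of any of the four operators.

```python
def apply_stencil27(grid, weights):
    """Apply a 27-point stencil to the interior of a cubic grid.

    grid    -- n x n x n nested lists of floats
    weights -- 3 x 3 x 3 nested lists, indexed weights[di+1][dj+1][dk+1]
    Boundary cells are copied through unchanged (an assumption of this sketch).
    """
    n = len(grid)
    out = [[row[:] for row in plane] for plane in grid]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                acc = 0.0
                # 27 loads and 27 multiply-adds per interior cell
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        for dk in (-1, 0, 1):
                            acc += (weights[di + 1][dj + 1][dk + 1]
                                    * grid[i + di][j + dj][k + dk])
                out[i][j][k] = acc
    return out


# Hypothetical averaging weights: every neighbor contributes 1/27.
avg_weights = [[[1.0 / 27.0] * 3 for _ in range(3)] for _ in range(3)]
demo_grid = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
smoothed = apply_stencil27(demo_grid, avg_weights)
```

Written this way, every interior cell reloads all 27 neighbors, which is the naive baseline that the later optimizations try to improve on.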
Multigrid V-cycle
   Just a picture of the call graph
   In time a V-cycle looks like the following:
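   [Figure: V-cycle over time. Vertical axis: level (5 at top, 1 at bottom); horizontal axis: time. Left leg labeled coarsen; right leg labeled prolongate, applySmoother, evaluateResidual.]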
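The recursion behind a V-cycle can be sketched on a 1D model problem, which keeps the code short. This is a textbook V-cycle under stated assumptions, not the authors' 3D implementation: it solves -u'' = f with zero boundary values, uses weighted Jacobi with two pre- and two post-smoothing sweeps, full-weighting restriction, and linear interpolation. The function names mirror the four operators from the slides, but their bodies, the `omega` and `sweeps` parameters, and the exact ordering of calls are illustrative choices.

```python
def apply_smoother(x, b, h, sweeps=2, omega=2.0 / 3.0):
    """Weighted Jacobi sweeps for (-x[i-1] + 2*x[i] - x[i+1]) / h^2 = b[i]."""
    for _ in range(sweeps):
        old = x[:]
        for i in range(1, len(x) - 1):
            jacobi = 0.5 * (old[i - 1] + old[i + 1] + h * h * b[i])
            x[i] = (1.0 - omega) * old[i] + omega * jacobi
    return x


def evaluate_residual(x, b, h):
    """Residual r = b - T*x on interior points (zero on the boundary)."""
    r = [0.0] * len(x)
    for i in range(1, len(x) - 1):
        r[i] = b[i] - (-x[i - 1] + 2.0 * x[i] - x[i + 1]) / (h * h)
    return r


def coarsen(fine):
    """Full-weighting restriction: 2^m + 1 points -> 2^(m-1) + 1 points."""
    nc = (len(fine) - 1) // 2 + 1
    coarse = [0.0] * nc
    for i in range(1, nc - 1):
        coarse[i] = (0.25 * fine[2 * i - 1] + 0.5 * fine[2 * i]
                     + 0.25 * fine[2 * i + 1])
    return coarse


def prolongate(coarse):
    """Linear interpolation: 2^(m-1) + 1 points -> 2^m + 1 points."""
    nf = 2 * (len(coarse) - 1) + 1
    fine = [0.0] * nf
    for i in range(len(coarse)):
        fine[2 * i] = coarse[i]
    for i in range(1, nf, 2):
        fine[i] = 0.5 * (fine[i - 1] + fine[i + 1])
    return fine


def v_cycle(x, b, h):
    """One V-cycle; recurses until a single interior unknown remains."""
    if len(x) == 3:                                # coarsest grid: solve exactly
        x[1] = 0.5 * (x[0] + x[2] + h * h * b[1])
        return x
    apply_smoother(x, b, h)                        # pre-smooth
    r = evaluate_residual(x, b, h)
    n_coarse = (len(x) - 1) // 2 + 1
    e = v_cycle([0.0] * n_coarse, coarsen(r), 2.0 * h)
    correction = prolongate(e)
    for i in range(1, len(x) - 1):
        x[i] += correction[i]
    return apply_smoother(x, b, h)                 # post-smooth


def l2_norm(v):
    return sum(t * t for t in v) ** 0.5


m = 6
n = 2**m + 1
h = 1.0 / (n - 1)
b = [1.0] * n                                      # right-hand side f = 1
x = [0.0] * n                                      # zero initial guess
history = [l2_norm(evaluate_residual(x, b, h))]
for _ in range(10):
    v_cycle(x, b, h)
    history.append(l2_norm(evaluate_residual(x, b, h)))
```

Printing `history` shows the residual norm falling by a roughly constant factor per cycle, the same qualitative behavior as a straight line on a log-scale convergence plot.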
Multigrid Performance Model
   Memory access is performance bottleneck
   Each pass over 3D grid requires (per cell):
       27 integer operations (stencil coordinates)
       27 FP loads of surrounding grid locations
       27 (approx) FP operations
       1 FP store
   Traversing grid consecutively in memory
    causes 9 cache misses every (# doubles
    stored in a cache line) cells, i.e.
    9 / (# doubles per cache line) misses per cell
   Grid size prevents reuse of cached values
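The miss-rate arithmetic above can be worked through in a few lines. This is a rough sketch, not the authors' performance model: the function names, the 128-byte cache line, the 1 GB/s bandwidth, and the 128^3 grid are hypothetical inputs chosen only to make the numbers concrete.

```python
def misses_per_cell(line_bytes, streams=9, double_bytes=8):
    """Cache misses per cell for a unit-stride sweep with `streams` load streams."""
    doubles_per_line = line_bytes // double_bytes
    return streams / doubles_per_line


def memory_bound_seconds(n, line_bytes, bandwidth_bytes_per_s, streams=9):
    """Rough time to fetch every missed line for one pass over an n^3 grid."""
    cells = n**3
    bytes_moved = cells * misses_per_cell(line_bytes, streams) * line_bytes
    return bytes_moved / bandwidth_bytes_per_s


# Hypothetical machine parameters (assumptions, not measurements).
LINE_BYTES = 128
BANDWIDTH = 1.0e9

print(misses_per_cell(LINE_BYTES))                 # 9 streams, 16 doubles per line
print(misses_per_cell(LINE_BYTES, streams=3))      # same sweep with 3 streams
print(memory_bound_seconds(128, LINE_BYTES, BANDWIDTH))
```

Because each missed line is fetched in full, the bytes moved per cell reduce to `streams * 8`, so cutting the number of load streams is the lever that lowers this bound.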
   [Figure: "Performance Model Running Time Predictions on Power3". x-axis: MG operator (applySmoother, evaluateResidual, coarsen, prolongate); y-axis: time (seconds), 0 to 4.5; series: lower bound, cse, upper bound.]
   [Figure: "Performance Model for MFlops/s". x-axis: operator (applySmoother, evaluateResidual, coarsen, prolongate); y-axis: MFlops/s, 0 to 450; series: lower bound, CSE, upper bound.]
Multigrid Optimizations
   Optimizations possible in 2 areas
   Reducing ALU operations per cell
       Reusing stencil coordinates between cells
       Reusing partial sums common to consecutive cells
   Improving memory behavior
       Reducing # of loads (register blocking)
       Reducing # of cache misses (cache blocking)
Multigrid Optimizations
   Common subexpression elimination
   Loop unrolling
   Memoization
   Cache blocking
   Memoization + cache blocking
Platforms
   Power3: 375 MHz, 64 KB L1 cache
   Itanium II: 900 MHz, 16 KB L1 cache
   Alphaserver: 1000 MHz, 64 KB L1 cache
   Opteron: 1600 MHz, 64 KB L1 cache
Optimizations – Loop Unrolling
   Reduces # stencil coordinates
    computed per cell
   Exposes load reuse to compiler
   Allows compiler to use FP registers to
    store grid values, reducing loads
   Minimum number of loads is 9/grid
    point (given generous # FP registers)
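The 9 loads per grid point figure can be checked by counting. The sketch below assumes unrolling along the unit-stride axis and an unlimited supply of registers; the function name `loads_per_cell` is illustrative. It counts the distinct grid values touched when `unroll` consecutive cells are computed together with a 27-point stencil.

```python
def loads_per_cell(unroll):
    """Distinct grid values touched per cell when `unroll` consecutive cells
    along the unit-stride axis share loaded values (27-point stencil)."""
    touched = set()
    for cell in range(unroll):
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                for dk in (-1, 0, 1):
                    # offsets are relative to the first cell of the unrolled group
                    touched.add((di, dj, cell + dk))
    return len(touched) / unroll


for u in (1, 2, 4, 8):
    print(u, loads_per_cell(u))
```

A group of U cells touches 9 * (U + 2) values, so the per-cell count is 9 * (U + 2) / U, which starts at 27 with no unrolling and approaches 9 as U grows. In practice the number of FP registers caps how far this can go.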
Optimizations - Memoization
   Traverses grid once to precompute partial
    sums common to consecutive cells
   Traverses grid again to compute actual cell
    values
       9 integer stencil operations/cell
       18 FP operations/cell
   Reduces FP register pressure by breaking
    computation into two stages, but still uses 9
    load streams per cell
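One way to realize the two-pass scheme is sketched below. It assumes a symmetric stencil whose weights depend only on the neighbor class (center, face, edge, corner), passed as the hypothetical tuple `w`; the slides do not state the authors' exact decomposition. Pass 1 stores, for each grid position, the sum of its 4 in-plane face neighbors (`B`) and 4 in-plane corner neighbors (`C`). Pass 2 combines those sums from three consecutive positions along the unit-stride axis.

```python
def naive27(g, w):
    """Reference 27-point stencil; w = (center, face, edge, corner) weights."""
    n = len(g)
    out = {}
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        for dk in (-1, 0, 1):
                            cls = abs(di) + abs(dj) + abs(dk)  # neighbor class
                            acc += w[cls] * g[i + di][j + dj][k + dk]
                out[(i, j, k)] = acc
    return out


def memoized27(g, w):
    """Same result in two passes using precomputed partial sums."""
    n = len(g)
    B, C = {}, {}
    # Pass 1: partial sums over the (i, j) neighborhood at every k (6 adds each).
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(n):
                B[(i, j, k)] = (g[i - 1][j][k] + g[i + 1][j][k]
                                + g[i][j - 1][k] + g[i][j + 1][k])
                C[(i, j, k)] = (g[i - 1][j - 1][k] + g[i - 1][j + 1][k]
                                + g[i + 1][j - 1][k] + g[i + 1][j + 1][k])
    # Pass 2: combine sums from positions k-1, k, k+1 (8 adds, 4 multiplies).
    out = {}
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                p, lo, hi = (i, j, k), (i, j, k - 1), (i, j, k + 1)
                out[p] = (w[0] * g[i][j][k]
                          + w[1] * (B[p] + g[i][j][k - 1] + g[i][j][k + 1])
                          + w[2] * (C[p] + B[lo] + B[hi])
                          + w[3] * (C[lo] + C[hi]))
    return out


n = 5
weights = (1.0, 0.5, 0.25, 0.125)        # hypothetical class weights
grid = [[[float(i * i + 2 * j + 3 * j * k) for k in range(n)]
         for j in range(n)] for i in range(n)]
reference = naive27(grid, weights)
fast = memoized27(grid, weights)
max_diff = max(abs(reference[p] - fast[p]) for p in reference)
```

In this particular decomposition the two passes together cost 18 FP operations per cell, compared with roughly 27 multiply-adds in the naive loop.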
   [Figure, four panels, each with x-axis: MG operator (applySmoother, evaluateResidual, coarsen, prolongate) and series: Regular, Memoized:
      "The Effect of Memoization on Time on Power3" (y-axis: time in seconds)
      "The Effect of Memoization on FP Count on Power3" (y-axis: millions of FP operations)
      "The Effect of Memoization on L1 Load Misses on Power3" (y-axis: millions of misses)
      "The Effect of Memoization on Number of Loads on Power3" (y-axis: millions of loads)]
Optimizations – Cache Blocking
   Break 3D grid into blocks that fit within
    cache
   Attempts to allow reuse between
    adjacent 2D-slices
   Reduces memory traffic to 3 load
    streams per cell
   Overhead when switching between
    blocks
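The traversal order that cache blocking imposes can be sketched as follows. The assumptions here are mine, not the slides': square tiles of side `block` are taken in the two fastest-varying dimensions (j, k), the sweep streams through i inside each tile, and a tile is treated as fitting when three adjacent (block + 2) x (block + 2) slices of doubles fit in the cache. The names `sweep_blocked`, `average27`, and `max_block_side` are illustrative.

```python
import math


def average27(g, i, j, k):
    """Hypothetical kernel: plain average of the 27 neighbors."""
    return sum(g[i + di][j + dj][k + dk]
               for di in (-1, 0, 1)
               for dj in (-1, 0, 1)
               for dk in (-1, 0, 1)) / 27.0


def sweep_blocked(g, block, kernel):
    """Apply `kernel` to every interior cell, visiting (j, k) tiles one at a time."""
    n = len(g)
    out = [[row[:] for row in plane] for plane in g]
    for jj in range(1, n - 1, block):              # tile origin in j
        for kk in range(1, n - 1, block):          # tile origin in k
            for i in range(1, n - 1):              # stream through 2D slices
                for j in range(jj, min(jj + block, n - 1)):
                    for k in range(kk, min(kk + block, n - 1)):
                        out[i][j][k] = kernel(g, i, j, k)
    return out


def max_block_side(cache_bytes, double_bytes=8):
    """Largest tile side such that three (side + 2)^2 slices fit in the cache."""
    return math.isqrt(cache_bytes // (3 * double_bytes)) - 2


n = 7
grid = [[[float(i + 2 * j + 3 * k) for k in range(n)]
         for j in range(n)] for i in range(n)]
unblocked = sweep_blocked(grid, n, average27)      # one tile covers everything
blocked = sweep_blocked(grid, 2, average27)

print(max_block_side(64 * 1024))                   # 64 KB cache
print(max_block_side(16 * 1024))                   # 16 KB cache
```

Blocking changes only the visiting order, so the blocked and unblocked sweeps produce identical results. The extra loop nests and the partial tiles at the edges are one source of the overhead when switching between blocks.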
   [Figure, four panels: "Time for Different Cache Block Sizes" on Power3, Alphaserver, Itanium II, and Opteron. x-axis: side length of square cache block (10 to 100, plus No Blocking); y-axis: time (seconds); series: applySmoother_cblk, evaluateResidual_cblk, applySmoother_mem_cblk, evaluateResidual_mem_cblk.]
   [Figure: "L1 Load Misses for Different Cache Block Sizes on Power3". x-axis: side length of square cache block (10 to 100, plus No Blocking); y-axis: millions of L1 load misses; series: applySmoother_cblk, evaluateResidual_cblk.]
   [Figure: "Mispredicted Branches for Different Cache Block Sizes on Power3". x-axis: side length of square cache block (10 to 100, plus No Blocking); y-axis: thousands of mispredicted branches; series: applySmoother_cblk, evaluateResidual_cblk.]
[Figure: Time for Different Optimizations, four panels (Power3, Alphaserver, Itanium II, Opteron). x-axis: MG Operator (applySmoother, evaluateResidual, coarsen, prolongate); y-axis: Time (seconds); series: baseline, cse, loop unrolling, cache blocking, memoization, memoization + cache blocking]
[Figure: MFlops/s Rate for MG Operators on Power3. x-axis: Operator (applySmoother, evaluateResidual, coarsen, prolongate); y-axis: MFlops/s; series: Baseline, CSE, Cache Block, Memoize, Memoize + Cache Block]
One V-cycle Time
[Figure: V-cycle Time Across Platforms. x-axis: Platform (Power3, Itanium II, Alphaserver, Opteron); y-axis: Time (seconds); series: baseline, cse, best]
FFT vs. MG on Power3
[Figure: Poisson Solve Time on the Power3 (Multigrid performs 15 V-cycles). x-axis: Solver (FFT, Multigrid); y-axis: Time (seconds)]
Summary & Continuing Work
   The overhead of cache blocking is too large
    for the small block sizes that fit in the
    IBM Power3’s L1 cache
   Memoization offers the greatest
    performance benefit because it reduces the
    number of floating-point (FP) operations
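
The memoization point above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes a 27-point stencil whose weights depend only on the neighbor class (center, face, edge, corner). The weights in `W` and the names `make_grid`, `stencil_naive`, and `stencil_memoized` are hypothetical. The memoized version stores two partial sums per position along a grid row and reuses each of them for three neighboring output points, which is one common way to cut additions in such stencils.

```python
import random

# Hypothetical weights per neighbor class: center, face, edge, corner.
W = (-3.0, 0.5, 0.25, 0.125)


def make_grid(n, seed=0):
    """Return an n x n x n nested list of random values (illustrative data)."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(n)] for _ in range(n)]
            for _ in range(n)]


def stencil_naive(u, n):
    """Visit all 27 neighbors of every interior point."""
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    adds = 0
    for k in range(1, n - 1):
        for j in range(1, n - 1):
            for i in range(1, n - 1):
                sums = [0.0, 0.0, 0.0, 0.0]
                for dk in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        for di in (-1, 0, 1):
                            # Class = number of nonzero offsets (0..3).
                            cls = abs(dk) + abs(dj) + abs(di)
                            sums[cls] += u[k + dk][j + dj][i + di]
                # Folding 27 values into 4 class sums costs 23 additions.
                adds += 23
                out[k][j][i] = sum(w * s for w, s in zip(W, sums))
    return out, adds


def stencil_memoized(u, n):
    """Same stencil, reusing per-row partial sums s1 and s2."""
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    adds = 0
    for k in range(1, n - 1):
        for j in range(1, n - 1):
            # s1[i]: the 4 axis neighbors of (k, j) in the plane at position i.
            s1 = [u[k][j - 1][i] + u[k][j + 1][i]
                  + u[k - 1][j][i] + u[k + 1][j][i] for i in range(n)]
            # s2[i]: the 4 diagonal neighbors of (k, j) in the same plane.
            s2 = [u[k - 1][j - 1][i] + u[k - 1][j + 1][i]
                  + u[k + 1][j - 1][i] + u[k + 1][j + 1][i] for i in range(n)]
            adds += 6 * n  # 3 additions for each s1[i] and each s2[i]
            for i in range(1, n - 1):
                face = u[k][j][i - 1] + u[k][j][i + 1] + s1[i]
                edge = s2[i] + s1[i - 1] + s1[i + 1]
                corner = s2[i - 1] + s2[i + 1]
                adds += 5
                out[k][j][i] = (W[0] * u[k][j][i] + W[1] * face
                                + W[2] * edge + W[3] * corner)
    return out, adds


if __name__ == "__main__":
    n = 6
    u = make_grid(n)
    _, naive_adds = stencil_naive(u, n)
    _, memo_adds = stencil_memoized(u, n)
    print("additions (naive, memoized):", naive_adds, memo_adds)
```

Both functions produce the same interior values up to rounding. In this sketch the memoized variant needs about 11 additions per point on long rows instead of 23, at the cost of two temporary row buffers.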
[Figure: Fraction of Peak Memory Bandwidth for MG on a 375 MHz Power3 (for Level 8 of Multigrid). x-axis: Operator (applySmoother, evaluateResidual, coarsen, prolongate); y-axis: Fraction of Peak Bandwidth; series: Baseline, CSE, Cache Block, Memoize, Memoize + Cache Block]
[Figure: L1 Load Misses for MG on a 375 MHz Power3. x-axis: MG Operator (applySmoother, evaluateResidual, coarsen, prolongate); y-axis: Millions of L1 Load Misses; series: Baseline, CSE, Cache Block, Memoize, Memoize + Cache Block]

								