                 Automatic Performance Tuning of
          Sparse-Matrix-Vector-Multiplication (SpMV)
                and Iterative Sparse Solvers


                James Demmel

    www.cs.berkeley.edu/~demmel/cs267_Spr09
Outline

• Motivation for Automatic Performance Tuning
• Results for sparse matrix kernels
   – Sparse Matrix Vector Multiplication (SpMV)
   – Sequential and Multicore results
• OSKI = Optimized Sparse Kernel Interface
• Tuning Entire Sparse Solvers
Berkeley Benchmarking and OPtimization (BeBOP)

  • Prof. Katherine Yelick
  • Current members
     – Kaushik Datta, Mark Hoemmen, Marghoob Mohiyuddin,
       Shoaib Kamil, Rajesh Nishtala, Vasily Volkov, Sam Williams, …
  • Previous members
     – Hormozd Gahvari, Eun-Jin Im, Ankit Jain, Rich Vuduc,
       many undergrads, …
  • Many results here from current, previous students
  • bebop.cs.berkeley.edu
Automatic Performance Tuning
 • Goal: Let machine do hard work of writing fast code
 • What does tuning of dense BLAS, FFTs, and signal processing
   have in common?
    – Can do the tuning off-line: once per architecture, algorithm
    – Can take as much time as necessary (hours, a week…)
    – At run-time, algorithm choice may depend only on a few
      parameters (matrix dimensions, size of FFT, etc.)
 • Can’t always do tuning off-line
    – Algorithm and implementation may strongly depend on data
      only known at run-time
    – Ex: Sparse matrix nonzero pattern determines both best data
      structure and implementation of Sparse-matrix-vector-
      multiplication (SpMV)
    – Part of the search for the best algorithm must be done (very quickly!)
      at run-time
Source: Accelerator Cavity Design Problem (Ko via Husbands)
Linear Programming Matrix




                            …
A Sparse Matrix You Encounter Every Day
SpMV with Compressed Sparse Row (CSR) Storage




Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)*x(j)

for each row i
  for k = ptr[i] to ptr[i+1]-1 do
      y[i] = y[i] + val[k]*x[ind[k]]

Only 2 flops per 2 mem_refs: limited by getting data from memory
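A minimal C rendering of this kernel, for concreteness (the function name csr_spmv and the argument order are ours; ptr, ind, val are the CSR arrays shown above):

/* Minimal CSR SpMV sketch: y = y + A*x, with A stored in the three
 * CSR arrays from this slide (ptr, ind, val).
 * n = number of rows; ptr has n+1 entries. */
void csr_spmv(int n, const int *ptr, const int *ind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            yi += val[k] * x[ind[k]];
        y[i] = yi;
    }
}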
Example: The Difficulty of Tuning

                          • n = 21200
                          • nnz = 1.5 M
                          • kernel: SpMV

                          • Source: NASA
                            structural analysis
                            problem
Example: The Difficulty of Tuning

                              • n = 21200
                              • nnz = 1.5 M
                              • kernel: SpMV

                              • Source: NASA
                                structural analysis
                                problem
                              • 8x8 dense
                                substructure:
                                exploit this to
                                limit #mem_refs
  Speedups on Itanium 2: The Need for Search
   [Heatmap of SpMV performance (Mflop/s) over all r x c register block sizes.
    Best: 4x2.   Reference: unblocked 1x1 CSR.]
Register Profile: Itanium 2
   [Register-blocking profile: best 1190 Mflop/s, reference 190 Mflop/s]
   Register Profiles: IBM and Intel IA-64

   [Four register-blocking profiles; the % is the best SpMV Mflop/s as a
    fraction of machine peak, with the max and min of each profile's scale:]
      Power3    - 17%:  252 Mflop/s (max), 122 Mflop/s (min)
      Power4    - 16%:  820 Mflop/s (max), 459 Mflop/s (min)
      Itanium 1 -  8%:  247 Mflop/s (max), 107 Mflop/s (min)
      Itanium 2 - 33%:  1.2 Gflop/s (max), 190 Mflop/s (min)
   Register Profiles: Sun and Intel x86

   [Four register-blocking profiles; the % is the best SpMV Mflop/s as a
    fraction of machine peak, with the max and min of each profile's scale:]
      Ultra 2i      - 11%:   72 Mflop/s (max),  35 Mflop/s (min)
      Ultra 3       -  5%:   90 Mflop/s (max),  50 Mflop/s (min)
      Pentium III   - 21%:  108 Mflop/s (max),  42 Mflop/s (min)
      Pentium III-M - 15%:  122 Mflop/s (max),  58 Mflop/s (min)
Another example of tuning challenges

                         • More complicated
                           non-zero structure
                           in general

                         • N = 16614
                         • NNZ = 1.1M
Zoom in to top corner

                        • More complicated
                          non-zero structure
                          in general

                        • N = 16614
                        • NNZ = 1.1M
3x3 blocks look natural, but…

                         • More complicated non-zero
                           structure in general

                         • Example: 3x3 blocking
                             – Logical grid of 3x3 cells


                         • But would lead to lots of "fill-in"
Extra Work Can Improve Efficiency!

                         • More complicated non-zero
                           structure in general

                         • Example: 3x3 blocking
                            –   Logical grid of 3x3 cells
                            –   Fill-in explicit zeros
                            –   Unroll 3x3 block multiplies
                            –   "Fill ratio" = 1.5



                         • On Pentium III: 1.5x speedup!
                            – Actual Mflop rate is 1.5² = 2.25x
                              higher
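A sketch of the unrolled 3x3 block multiply, assuming A has already been converted to a block CSR (BCSR) layout; the array names (b_ptr, b_ind, b_val) and the function are ours, and the filled-in explicit zeros simply appear as zeros inside each stored 3x3 block:

/* Register-blocked (3x3 BCSR) SpMV sketch: y = y + A*x.
 * nb    = number of block rows (3*nb scalar rows)
 * b_ptr = block-row pointers (nb+1 entries)
 * b_ind = scalar column index of the first column of each 3x3 block
 * b_val = blocks stored contiguously, row-major, 9 doubles each */
void bcsr3x3_spmv(int nb, const int *b_ptr, const int *b_ind,
                  const double *b_val, const double *x, double *y)
{
    for (int I = 0; I < nb; I++) {
        double y0 = y[3*I], y1 = y[3*I+1], y2 = y[3*I+2];
        for (int k = b_ptr[I]; k < b_ptr[I+1]; k++) {
            const double *b  = &b_val[9*k];
            const double *xp = &x[b_ind[k]];
            double x0 = xp[0], x1 = xp[1], x2 = xp[2];
            y0 += b[0]*x0 + b[1]*x1 + b[2]*x2;   /* unrolled 3x3 multiply */
            y1 += b[3]*x0 + b[4]*x1 + b[5]*x2;
            y2 += b[6]*x0 + b[7]*x1 + b[8]*x2;
        }
        y[3*I] = y0; y[3*I+1] = y1; y[3*I+2] = y2;
    }
}

Relative to CSR, one index is stored per 3x3 block rather than per nonzero, and the unrolled multiply keeps the x and y entries in registers; that is where the speedup comes from, at the cost of the fill ratio.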
  Automatic Register Block Size Selection

• Selecting the r x c block size
   – Off-line benchmark
      • Precompute Mflops(r,c) using dense A for each r x c
      • Once per machine/architecture
   – Run-time “search”
      • Sample A to estimate Fill(r,c) for each r x c
   – Run-time heuristic model
      • Choose r, c to minimize time ~ Fill(r,c) / Mflops(r,c)
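In code, this heuristic is a small table search; a sketch under the assumption that Mflops(r,c) comes from the off-line dense benchmark and Fill(r,c) from run-time sampling (the table sizes and names are ours):

/* Run-time heuristic sketch: pick (r,c) minimizing
 * estimated time ~ Fill(r,c) / Mflops(r,c).
 * mflops[r][c]: off-line benchmark data; fill[r][c]: sampled fill ratio.
 * Both tables are 1-based (element 0 unused). */
#define RMAX 8
#define CMAX 8

void choose_block_size(const double mflops[RMAX+1][CMAX+1],
                       const double fill[RMAX+1][CMAX+1],
                       int *r_best, int *c_best)
{
    double best = 1e300;
    *r_best = 1; *c_best = 1;
    for (int r = 1; r <= RMAX; r++)
        for (int c = 1; c <= CMAX; c++) {
            double est_time = fill[r][c] / mflops[r][c];
            if (est_time < best) {
                best = est_time;
                *r_best = r; *c_best = c;
            }
        }
}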
Accurate and Efficient Adaptive Fill Estimation
• Idea: Sample matrix
   – Fraction of matrix to sample: s ∈ [0,1]
   – Control cost = O(s * nnz ) by controlling s
       • Search at run-time: the constant matters!
• Control s automatically by computing statistical confidence
  intervals, by monitoring variance
• Cost of tuning
   – Lower bound: converting the matrix costs the equivalent of 5 to 40 unblocked SpMVs
   – Heuristic: 1 to 11 SpMVs
• Tuning only useful when we do many SpMVs
   – Common case, e.g., in sparse solvers
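A sketch of the sampling step for one (r,c), assuming whole block rows are sampled with probability s; all names are ours, and a production version would drive s from the confidence-interval test described above:

#include <stdlib.h>

/* Estimate Fill(r,c) = (entries stored after r x c blocking, explicit
 * zeros included) / nnz, by sampling a fraction s of the block rows. */
double estimate_fill(int n, int ncols, const int *ptr, const int *ind,
                     int r, int c, double s)
{
    int n_brows = (n + r - 1) / r;
    int n_bcols = (ncols + c - 1) / c;
    int *last_seen = malloc((size_t)n_bcols * sizeof(int));
    for (int j = 0; j < n_bcols; j++) last_seen[j] = -1;

    long nnz_sampled = 0, blocks = 0;
    for (int I = 0; I < n_brows; I++) {
        if ((double)rand() / RAND_MAX > s) continue;   /* skip unsampled block rows */
        for (int i = I*r; i < (I+1)*r && i < n; i++) {
            for (int k = ptr[i]; k < ptr[i+1]; k++) {
                int J = ind[k] / c;
                nnz_sampled++;
                if (last_seen[J] != I) {   /* first nonzero seen in this r x c block */
                    last_seen[J] = I;
                    blocks++;
                }
            }
        }
    }
    free(last_seen);
    if (nnz_sampled == 0) return 1.0;      /* nothing sampled: fall back to 1x1 */
    return (double)blocks * r * c / (double)nnz_sampled;
}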
   Accuracy of the Tuning Heuristics (1/4)




See p. 375 of Vuduc’s thesis for matrices
NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”)
Accuracy of the Tuning Heuristics (2/4)
          DGEMV

Upper Bounds on Performance for blocked SpMV
  • P = (flops) / (time)
      – Flops = 2 * nnz(A)
  • Upper bound on speed: Two main assumptions
      – 1. Count memory ops only (streaming)
      – 2. Count only compulsory, capacity misses: ignore conflicts
          • Account for line sizes
          • Account for matrix size and nnz
  • Charge minimum access "latency" α_i at cache level L_i, and α_mem at memory
      – e.g., Saavedra-Barrera and PMaC MAPS benchmarks

    Time ≥ Σ_{i=1..κ} α_i · Hits_i  +  α_mem · Hits_mem
         = α_1 · Loads + Σ_{i=1..κ-1} (α_{i+1} − α_i) · Misses_i + (α_mem − α_κ) · Misses_κ

    (κ = number of cache levels)
Example: Bounds on Itanium 2
Example: Bounds on Itanium 2
Example: Bounds on Itanium 2
Summary of Other Performance Optimizations
 • Optimizations for SpMV
    –   Register blocking (RB): up to 4x over CSR
    –   Variable block splitting: 2.1x over CSR, 1.8x over RB
    –   Diagonals: 2x over CSR
    –   Reordering to create dense structure + splitting: 2x over CSR
    –   Symmetry: 2.8x over CSR, 2.6x over RB
    –   Cache blocking: 2.8x over CSR
    –   Multiple vectors (SpMM): 7x over CSR
    –   And combinations…
 • Sparse triangular solve
    – Hybrid sparse/dense data structure: 1.8x over CSR
 • Higher-level kernels
     – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB (see sketch below)
     – A²·x: 2x over CSR, 1.5x over RB
     – [A·x, A²·x, A³·x, .. , A^k·x] …. more to say later
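The A^T·A·x speedup listed above comes from interleaving the two multiplies so that each row of A is read once and used twice; a minimal CSR sketch of that idea (the function name is ours):

/* Locality-aware kernel y = A^T * (A * x) in CSR: each row a_i of A is
 * read once, used for t_i = a_i . x and then for y += t_i * a_i,
 * instead of making two separate passes over A.
 * y has length ncols and must be zeroed by the caller. */
void csr_atax(int n, const int *ptr, const int *ind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)   /* t = (A*x)(i) */
            t += val[k] * x[ind[k]];
        for (int k = ptr[i]; k < ptr[i+1]; k++)   /* y += t * A(i,:)^T */
            y[ind[k]] += val[k] * t;
    }
}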
Potential Impact on Applications: Omega3P
• Application: accelerator cavity design [Ko]
• Relevant optimization techniques
   – Symmetric storage
   – Register blocking
   – Reordering, to create more dense blocks
      • Reverse Cuthill-McKee ordering to reduce bandwidth
          – Do Breadth-First-Search, number nodes in reverse order visited
      • Traveling Salesman Problem-based ordering to create blocks
          –   Nodes = columns of A
          –   Weights(u, v) = no. of nz u, v have in common
          –   Tour = ordering of columns
          –   Choose maximum weight tour
          –   See [Pinar & Heath ’97]

• 2.1x speedup on IBM Power 4
Source: Accelerator Cavity Design Problem (Ko via Husbands)
Post-RCM Reordering
100x100 Submatrix Along Diagonal
“Microscopic” Effect of RCM Reordering



                      Before: Green + Red
                      After: Green + Blue
“Microscopic” Effect of Combined RCM+TSP Reordering



                              Before: Green + Red
                              After: Green + Blue
(Omega3P)
Optimized Sparse Kernel Interface - OSKI


• Provides sparse kernels automatically tuned for user's
  matrix & machine
   – BLAS-style functionality: SpMV, A·x & A^T·y, TrSV
   – Hides complexity of run-time tuning
   – Includes new, faster locality-aware kernels: A^T·A·x, A^k·x
• Faster than standard implementations
   – Up to 4x faster matvec, 1.8x trisolve, 4x A^T·A·x
• For "advanced" users & solver library writers
   – Available as stand-alone library (OSKI 1.0.1h, 6/07)
   – Available as PETSc extension (OSKI-PETSc .1d, 3/06)
   – Bebop.cs.berkeley.edu/oski
  How the OSKI Tunes (Overview)
     Library Install-Time (offline):
       1. Build for target arch.   →  generated code variants
       2. Benchmark                →  benchmark data

     Application Run-Time:
       Inputs: the user's matrix, a workload from program monitoring, tuning history
       1. Evaluate heuristic models (using the benchmark data)
       2. Select data structure & code  →  to user: matrix handle for kernel calls

Extensibility: Advanced users may write & dynamically add “Code variants” and “Heuristic models” to system.
How to Call OSKI: Basic Usage

   • May gradually migrate existing apps
        – Step 1: "Wrap" existing data structures
       – Step 2: Make BLAS-like kernel calls
int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */




/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, α, x, β, y );
How to Call OSKI: Basic Usage

   • May gradually migrate existing apps
        – Step 1: "Wrap" existing data structures
       – Step 2: Make BLAS-like kernel calls
int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */
/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
   num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
   my_matmult( ptr, ind, val, α, x, β, y );
How to Call OSKI: Basic Usage

   • May gradually migrate existing apps
        – Step 1: "Wrap" existing data structures
       – Step 2: Make BLAS-like kernel calls
int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */
/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
   num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
   oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view); /* Step 2 */
How to Call OSKI: Tune with Explicit Hints

   • User calls "tune" routine
       – May provide explicit tuning hints (OPTIONAL)

oski_matrix_t A_tunable = oski_CreateMatCSR( … );
   /* … */
/* Tell OSKI we will call SpMV 500 times (workload hint) */
oski_SetHintMatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view, 500);
/* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */
oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);

oski_TuneMat(A_tunable); /* Ask OSKI to tune */

for( i = 0; i < 500; i++ )
   oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);
How the User Calls OSKI: Implicit Tuning

   • Ask library to infer workload
       – Library profiles all kernel calls
       – May periodically re-tune



oski_matrix_t A_tunable = oski_CreateMatCSR( … );
   /* … */

for( i = 0; i < 500; i++ ) {
   oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);
   oski_TuneMat(A_tunable); /* Ask OSKI to tune */
}
Multicore SMPs Used for Tuning SpMV
    Intel Xeon E5345 (Clovertown)    AMD Opteron 2356 (Barcelona)




    Sun T2+ T5140 (Victoria Falls)       IBM QS20 Cell Blade
Multicore SMPs with Conventional cache-based memory hierarchy
       Intel Xeon E5345 (Clovertown)    AMD Opteron 2356 (Barcelona)




       Sun T2+ T5140 (Victoria Falls)       IBM QS20 Cell Blade
Multicore SMPs with local store-based memory hierarchy
     Intel Xeon E5345 (Clovertown)    AMD Opteron 2356 (Barcelona)




     Sun T2+ T5140 (Victoria Falls)

                                          IBM QS20 Cell Blade
Multicore SMPs with CMT = Chip-MultiThreading
     Intel Xeon E5345 (Clovertown)    AMD Opteron 2356 (Barcelona)




     Sun T2+ T5140 (Victoria Falls)      IBM QS20 Cell Blade
Multicore SMPs: Number of threads
     Intel Xeon E5345 (Clovertown)    AMD Opteron 2356 (Barcelona)




             8 threads                        8 threads



     Sun T2+ T5140 (Victoria Falls)       IBM QS20 Cell Blade




       128 threads                           16* threads


                                                           *SPEs     only
Multicore SMPs: peak double precision flops
     Intel Xeon E5345 (Clovertown)     AMD Opteron 2356 (Barcelona)




         75 GFlop/s                        74 Gflop/s

      Sun T2+ T5140 (Victoria Falls)       IBM QS20 Cell Blade




         19 GFlop/s                       29* GFlop/s

                                                             *SPEs    only
Multicore SMPs: total DRAM bandwidth
     Intel Xeon E5345 (Clovertown)    AMD Opteron 2356 (Barcelona)



         21 GB/s (read)
         10 GB/s (write)
                                            21 GB/s

     Sun T2+ T5140 (Victoria Falls)       IBM QS20 Cell Blade



         42 GB/s (read)
         21 GB/s (write)
                                             51 GB/s

                                                            *SPEs    only
Multicore SMPs with Non-Uniform Memory Access - NUMA
     Intel Xeon E5345 (Clovertown)    AMD Opteron 2356 (Barcelona)




     Sun T2+ T5140 (Victoria Falls)       IBM QS20 Cell Blade




                                                           *SPEs     only
Set of 14 test matrices
  • All bigger than the caches of our SMPs



     Dense: 2K x 2K dense matrix stored in sparse format

     Well structured (sorted by nonzeros/row):
        Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD,
        FEM/Ship, Economics, Epidemiology

     Poorly structured (hodgepodge):
        FEM/Accelerator, Circuit, webbase

     Extreme aspect ratio (linear programming):
        LP
SpMV Performance: Naive parallelization




                                          •   Out-of-the-box SpMV
                                              performance on a suite of
                                              14 matrices
                                          •   Scalability isn't great:
                                              compare to # threads
                                              (8, 8, 128, 16)




                                               Naïve Pthreads

                                               Naïve
SpMV Performance: NUMA and Software Prefetching


                                      NUMA-aware allocation is
                                       essential on NUMA SMPs.
                                      Explicit software
                                       prefetching can boost
                                       bandwidth and change
                                       cache replacement
                                       policies

                                      used exhaustive search
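A sketch of what explicit software prefetching looks like in a CSR loop, using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance PF_DIST is an assumption here, and in the tuned codes the distance is itself a search parameter:

/* CSR SpMV with explicit software prefetching (sketch). */
#define PF_DIST 64   /* prefetch distance in nonzeros; found by search */

void csr_spmv_prefetch(int n, const int *ptr, const int *ind,
                       const double *val, const double *x, double *y)
{
    const int nnz = ptr[n];
    for (int i = 0; i < n; i++) {
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i+1]; k++) {
            if (k + PF_DIST < nnz) {
                __builtin_prefetch(&val[k + PF_DIST], 0, 0);  /* read, streaming */
                __builtin_prefetch(&ind[k + PF_DIST], 0, 0);
            }
            yi += val[k] * x[ind[k]];
        }
        y[i] = yi;
    }
}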
SpMV Performance: "Matrix Compression"


                                          Compression includes
                                             register blocking
                                             other formats
                                             smaller indices
                                          Use heuristic rather
                                           than search
SpMV Performance: cache and TLB blocking




                                           +Cache/LS/TLB Blocking

                                           +Matrix Compression

                                           +SW Prefetching

                                           +NUMA/Affinity

                                           Naïve Pthreads

                                           Naïve
SpMV Performance: Architecture specific optimizations




                                              +Cache/LS/TLB Blocking

                                              +Matrix Compression

                                              +SW Prefetching

                                              +NUMA/Affinity

                                              Naïve Pthreads

                                              Naïve
SpMV Performance: max speedup

                                •   Fully auto-tuned SpMV
                                    performance across the suite
                                    of matrices
                                •   Included SPE/local store
                                    optimized version
                                •   Why do some optimizations
                                    work better on some
                                    architectures?

      [Max speedups over naïve: 2.7x (Clovertown), 4.0x (Barcelona),
       2.9x (Victoria Falls), 35x (Cell)]

                                     +Cache/LS/TLB Blocking

                                     +Matrix Compression

                                     +SW Prefetching

                                     +NUMA/Affinity

                                     Naïve Pthreads

                                     Naïve
  Avoiding Communication in Sparse Linear Algebra
• k-steps of typical iterative solver for Ax=b or Ax=λx
   – Does k SpMVs with starting vector (e.g., with b, if solving Ax=b)
   – Finds "best" solution among all linear combinations of these k+1 vectors
   – Many such "Krylov Subspace Methods"
       • Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication in Krylov Subspace Methods
   – Assume matrix "well-partitioned," with modest surface-to-volume ratio
   – Parallel implementation
      • Conventional: O(k log p) messages, because k calls to SpMV
      • New: O(log p) messages - optimal
   – Serial implementation
      • Conventional: O(k) moves of data from slow to fast memory
      • New: O(1) moves of data – optimal
• Lots of speed up possible (modeled and measured)
   – Price: some redundant computation
           Locally Dependent Entries for
         [x,Ax], A tridiagonal, 2 processors
            Proc 1                         Proc 2
A8x
A7x
A6x
A5x
A4x
A3x
A2x
Ax
x

      Can be computed without communication
            Locally Dependent Entries for
        [x,Ax,A2x], A tridiagonal, 2 processors
             Proc 1                        Proc 2
A8x
A7x
A6x
A5x
A4x
A3x
A2x
Ax
x

      Can be computed without communication
            Locally Dependent Entries for
       [x,Ax,…,A3x], A tridiagonal, 2 processors
             Proc 1                        Proc 2
A8x
A7x
A6x
A5x
A4x
A3x
A2x
Ax
x

      Can be computed without communication
           Locally Dependent Entries for
      [x,Ax,…,A4x], A tridiagonal, 2 processors
             Proc 1                        Proc 2
A8x
A7x
A6x
A5x
A4x
A3x
A2x
Ax
x

      Can be computed without communication
           Locally Dependent Entries for
      [x,Ax,…,A8x], A tridiagonal, 2 processors
             Proc 1                        Proc 2
A8x
A7x
A6x
A5x
A4x
A3x
A2x
Ax
x

      Can be computed without communication
                k=8 fold reuse of A
                  Remotely Dependent Entries for
              [x,Ax,…,A8x], A tridiagonal, 2 processors
                    Proc 1                            Proc 2
     A8x
     A7x
     A6x
     A5x
     A4x
     A3x
     A2x
     Ax
      x
One message to get data needed to compute remotely dependent entries, not k=8
                 Minimizes number of messages = latency cost
                 Price: redundant work ∝ "surface/volume ratio"
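A sequential C sketch of the same idea for a tridiagonal A: each chunk of rows is extended by k ghost rows, so that chunk's part of all of [x, Ax, …, A^k x] is computed from one contiguous piece of the data, at the price of redundant work near the chunk boundaries (in the 2-processor picture above, the ghost rows are the single message). The chunking scheme and all names are ours:

#include <stdlib.h>
#include <string.h>

/* W[j] = A^j * x for j = 0..k, A tridiagonal with diagonals lo/di/up
 * (A(i,i-1)=lo[i], A(i,i)=di[i], A(i,i+1)=up[i]).  Each W[j] has length n. */
void powers_tridiag(int n, int k, int chunk,
                    const double *lo, const double *di, const double *up,
                    const double *x, double **W)
{
    memcpy(W[0], x, (size_t)n * sizeof(double));
    double *cur = malloc((size_t)(chunk + 2*k) * sizeof(double));
    double *nxt = malloc((size_t)(chunk + 2*k) * sizeof(double));

    for (int s = 0; s < n; s += chunk) {
        int e      = (s + chunk < n) ? s + chunk : n;   /* chunk rows [s, e) */
        int lo_ext = (s - k > 0) ? s - k : 0;           /* widen by k ghosts */
        int hi_ext = (e + k < n) ? e + k : n;
        memcpy(cur, x + lo_ext, (size_t)(hi_ext - lo_ext) * sizeof(double));

        for (int j = 1; j <= k; j++) {
            /* rows whose level-(j-1) neighbors are all still valid */
            int lo_j = (s - k + j > 0) ? s - k + j : 0;
            int hi_j = (e + k - j < n) ? e + k - j : n;
            for (int i = lo_j; i < hi_j; i++) {
                double v = di[i] * cur[i - lo_ext];
                if (i > 0)     v += lo[i] * cur[i - lo_ext - 1];
                if (i < n - 1) v += up[i] * cur[i - lo_ext + 1];
                nxt[i - lo_ext] = v;
            }
            for (int i = s; i < e; i++)            /* keep this chunk's rows */
                W[j][i] = nxt[i - lo_ext];
            double *t = cur; cur = nxt; nxt = t;   /* level j becomes current */
        }
    }
    free(cur); free(nxt);
}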
Remotely Dependent Entries for [x,Ax,A2x,A3x],
      A irregular, multiple processors
      Sequential [x,Ax,…,A4x], with memory hierarchy








    One read of matrix from slow memory, not k=4
      Minimizes words moved = bandwidth cost
                 No redundant work
Performance results on 8-Core Clovertown
Optimizing Communication Complexity of Sparse Solvers

 • Example: GMRES for Ax=b on "2D Mesh"
    – x lives on n-by-n mesh
    – Partitioned on p½ -by- p½ grid of p processors
    – A has "5 point stencil" (Laplacian)
       • (Ax)(i,j) = linear_combination(x(i,j), x(i,j±1), x(i±1,j))
    – Ex: 18-by-18 mesh on 3-by-3 grid of 9 processors
Minimizing Communication of GMRES
• What is the cost = (#flops, #words, #mess)
  of k steps of standard GMRES?

   GMRES, ver. 1:
    for i = 1 to k
      w = A * v(i-1)
      MGS(w, v(0), …, v(i-1))
      update v(i), H
    endfor
    solve LSQ problem with H

    [each processor owns an (n/p^(1/2)) x (n/p^(1/2)) block of the mesh]

    • Cost(A * v) = k * ( 9n²/p , 4n/p^(1/2) , 4 )
    • Cost(MGS) = k²/2 * ( 4n²/p , log p , log p )
    • Total cost ~ Cost(A * v) + Cost(MGS)
    • Can we reduce the latency?
Minimizing Communication of GMRES
• Cost(GMRES, ver.1) = Cost(A*v) + Cost(MGS)
     = ( 9kn²/p , 4kn/p^(1/2) , 4k ) + ( 2k²n²/p , k² log p / 2 , k² log p / 2 )
• How much latency cost from A*v can you avoid? Almost all

   GMRES, ver. 2:
     W = [ v, Av, A²v, … , A^k·v ]
    [Q,R] = MGS(W)
    Build H from R, solve LSQ problem

                                                                         k=3
 • Cost(W) = ( ~ same, ~ same , 8 )
     • Latency cost independent of k – optimal
 • Cost (MGS) unchanged
 • Can we reduce the latency more?
Minimizing Communication of GMRES
• Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
     = ( 9kn²/p , 4kn/p^(1/2) , 8 ) + ( 2k²n²/p , k² log p / 2 , k² log p / 2 )
• How much latency cost from MGS can you avoid? Almost all
   GMRES, ver. 3:
     W = [ v, Av, A²v, … , A^k·v ]
     [Q,R] = TSQR(W) … "Tall Skinny QR"
    Build H from R, solve LSQ problem
         W1             R1          R12
     W = W2             R2                        R1234
         W3             R3          R34
         W4             R4

 • Cost(TSQR) = ( ~ same, ~ same , log p )
    • Latency cost independent of k – optimal
Minimizing Communication of GMRES
• Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
     = ( 9kn²/p , 4kn/p^(1/2) , 8 ) + ( 2k²n²/p , k² log p / 2 , k² log p / 2 )
• How much latency cost from MGS can you avoid? Almost all
   GMRES, ver. 3:
     W = [ v, Av, A²v, … , A^k·v ]
     [Q,R] = TSQR(W) … "Tall Skinny QR"
    Build H from R, solve LSQ problem
         W1             R1          R12
     W = W2             R2                        R1234
         W3             R3          R34
         W4             R4

 • Cost(TSQR) = ( ~ same, ~ same , log p )
 • Oops
Minimizing Communication of GMRES
• Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
     = ( 9kn²/p , 4kn/p^(1/2) , 8 ) + ( 2k²n²/p , k² log p / 2 , k² log p / 2 )
• How much latency cost from MGS can you avoid? Almost all
   GMRES, ver. 3:
     W = [ v, Av, A²v, … , A^k·v ]
     [Q,R] = TSQR(W) … "Tall Skinny QR"
    Build H from R, solve LSQ problem
         W1             R1          R12
     W = W2             R2                        R1234
         W3             R3          R34
         W4             R4

 • Cost(TSQR) = ( ~ same, ~ same , log p )
 • Oops – W from power method, precision lost!
Minimizing Communication of GMRES
• Cost(GMRES, ver. 3) = Cost(W) + Cost(TSQR)
     = ( 9kn²/p , 4kn/p^(1/2) , 8 ) + ( 2k²n²/p , k² log p / 2 , log p )
• Latency cost independent of k, just log p – optimal
• Oops – W from power method, so precision lost – What to do?


 • Use a different polynomial basis
 • Not the monomial basis W = [v, Av, A²v, …], instead …
 • Newton basis W_N = [v, (A – θ₁I)v, (A – θ₂I)(A – θ₁I)v, …], or
 • Chebyshev basis W_C = [v, T₁(A)v, T₂(A)v, …]
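A sketch of building the Newton basis column by column, one SpMV plus one shift per column, reusing the csr_spmv kernel sketched earlier (theta[j-1] holds θ_j, e.g., a Ritz value of A; all names are ours):

/* Prototype of the CSR kernel sketched earlier: y += A*x. */
void csr_spmv(int n, const int *ptr, const int *ind,
              const double *val, const double *x, double *y);

/* Newton basis: W[0] = v, W[j] = (A - theta_j I) * W[j-1], j = 1..k. */
void newton_basis(int n, int k, const int *ptr, const int *ind,
                  const double *val, const double *theta,
                  const double *v, double **W)
{
    for (int i = 0; i < n; i++) W[0][i] = v[i];
    for (int j = 1; j <= k; j++) {
        for (int i = 0; i < n; i++) W[j][i] = 0.0;
        csr_spmv(n, ptr, ind, val, W[j-1], W[j]);      /* W[j] = A * W[j-1] */
        for (int i = 0; i < n; i++)
            W[j][i] -= theta[j-1] * W[j-1][i];         /* subtract the shift */
    }
}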
Speed ups on 8-core Clovertown
Conclusions
 • Fast code must minimize communication
    – Especially for sparse matrix computations because
      communication dominates
 • Generating fast code for a single SpMV
    – Design space of possible algorithms must be searched at
      run-time, when sparse matrix available
    – Design space should be searched automatically
 • Biggest speedups from minimizing communication in
   an entire sparse solver
    – Many more opportunities to minimize communication in
      multiple SpMVs than in one
    – Requires transforming entire algorithm
    – Lots of open problems
EXTRA SLIDES
Quick-and-dirty Parallelism: OSKI-PETSc

     • Extend PETSc's distributed memory SpMV (MATMPIAIJ)

                                   • PETSc
p0
                                      – Each process stores diag
                                        (all-local) and off-diag
                                        submatrices
p1
                                   • OSKI-PETSc:
                                      – Add OSKI wrappers
p2                                    – Each submatrix tuned
                                        independently

p3
OSKI-PETSc Proof-of-Concept Results

 • Matrix 1: Accelerator cavity design (R. Lee @ SLAC)
    – N ~ 1 M, ~40 M non-zeros
    – 2x2 dense block substructure
    – Symmetric
 • Matrix 2: Linear programming (Italian Railways)
    – Short-and-fat: 4k x 1M, ~11M non-zeros
    – Highly unstructured
    – Big speedup from cache-blocking: no native PETSc format
 • Evaluation machine: Xeon cluster
    – Peak: 4.8 Gflop/s per node
Accelerator Cavity Matrix
OSKI-PETSc Performance: Accel. Cavity
Linear Programming Matrix




                            …
OSKI-PETSc Performance: LP Matrix
Performance Results
•   Measured Multicore (Clovertown) speedups up to 6.4x
•   Measured/Modeled sequential OOC speedup up to 3x
•   Modeled parallel Petascale speedup up to 6.9x
•   Modeled parallel Grid speedup up to 22x

• Sequential speedup due to bandwidth, works for many
  problem sizes
• Parallel speedup due to latency, works for smaller problems
  on many processors
Speedups on Intel Clovertown (8 core)
Extensions

• Other Krylov methods
   – Arnoldi, CG, Lanczos, …
• Preconditioning
   – Solve MAx=Mb where preconditioning matrix M chosen to make
     system "easier"
       • M approximates A^(-1) somehow, but cheaply, to accelerate convergence
   – Cheap as long as contributions from "distant" parts of the system
     can be compressed
       • Sparsity
       • Low rank
Design Space for [x,Ax,…,Akx] (1/3)

  • Mathematical Operation
     – Keep all vectors
         • Krylov Subspace Methods
     – Improving conditioning of basis
         • W = [x, p1(A)x, p2(A)x,…,pk(A)x]
         • pi(A) = degree i polynomial chosen to reduce cond(W)
      – Preconditioning (Ay=b → MAy=Mb)
          • [x, Ax, MAx, AMAx, MAMAx, …, (MA)^k·x]
      – Keep last vector A^k·x only
         • Jacobi, Gauss Seidel
Design Space for [x,Ax,…,Akx] (2/3)
 • Representation of sparse A
    – Zero pattern may be explicit or implicit
    – Nonzero entries may be explicit or implicit
        • Implicit → save memory, communication



                       Explicit pattern         Implicit pattern

  Explicit nonzeros    General sparse matrix Image segmentation

  Implicit nonzeros    Laplacian(graph),        "Stencil matrix"
                       for graph partitioning   Ex: tridiag(-1,2,-1)

 • Representation of dense preconditioners M
    – Low rank off-diagonal blocks (semiseparable)
Design Space for [x,Ax,…,Akx] (3/3)
 • Parallel implementation
    – From simple indexing, with redundant flops ∝ surface/volume ratio
    – To complicated indexing, with fewer redundant flops
 • Sequential implementation
    – Depends on whether vectors fit in fast memory
 • Reordering rows, columns of A
    – Important in parallel and sequential cases
    – Can be reduced to a pair of Traveling Salesman Problems
 • Plus all the optimizations for one SpMV!
Summary
• Communication-Avoiding Linear Algebra (CALA)
• Lots of related work
   – Some going back to the 1960's
   – Reports discuss this comprehensively, not here
• Our contributions
   – Several new algorithms, improvements on old ones
   – Unifying parallel and sequential approaches to avoiding
     communication
   – Time for these algorithms has come, because of growing
     communication costs
   – Why avoid communication just for linear algebra motifs?
Possible Class Projects

  • Come to BEBOP meetings (T 9 – 10:30, 606 Soda)
  • Incorporate multicore optimizations into OSKI
  • Experiment with SpMV on GPU
     – Which optimizations are most effective?
  • Try to speed up particular matrices of interest
     – Data mining
  • Experiment with new [x,Ax,…,Akx] kernel
     – GPU, multicore, distributed memory
     – On matrices of interest
     – Experiment with new solvers using this kernel
Extra Slides
Tuning Higher Level Algorithms
  • So far we have tuned a single sparse matrix kernel
      • y = A^T·A·x motivated by higher level algorithm (SVD)
  • What can we do by extending tuning to a higher level?
  • Consider Krylov subspace methods for Ax=b, Ax = λx
      • Conjugate Gradients (CG), GMRES, Lanczos, …
      • Inner loop does y=A*x, dot products, saxpys, scalar ops
      • Inner loop costs at least O(1) messages
      • k iterations cost at least O(k) messages
  • Our goal: show how to do k iterations with O(1) messages
      • Possible payoff – make Krylov subspace methods much
        faster on machines with slow networks
      • Memory bandwidth improvements too (not discussed)
      • Obstacles: numerical stability, preconditioning, …
Krylov Subspace Methods for Solving Ax=b

• Compute a basis for a subspace V by doing y = A*x
  k times
• Find “best” solution in that Krylov subspace V
• Given starting vector x1, V spanned by x2 = A*x1, x3
  = A*x2 , … , xk = A*xk-1
• GMRES finds an orthogonal basis of V by
  Gram-Schmidt, so it actually does a different set of
  SpMVs than in last bullet
Example: Standard GMRES

r = b - A*x1,  β = length(r),  v1 = r / β    … length(r) = sqrt(Σ ri²)
for m= 1 to k do
   w = A*vm … at least O(1) messages
   for i = 1 to m do    … Gram-Schmidt
      h(i,m) = dotproduct(v(i), w) … at least O(1) messages, or log(p)
      w = w – h(i,m) * v(i)
   end for
   h(m+1,m) = length(w) … at least O(1) messages, or log(p)
   v(m+1) = w / h(m+1,m)
end for
find y minimizing length( Hk * y – e1 ) … small, local problem
new x = x1 + Vk * y     … Vk = [v1 , v2 , … , vk ]
           O(k²), or O(k² log p), messages altogether
Latency-Avoiding GMRES (1)

r = b - A*x1,  β = length(r),  v1 = r / β … O(log p) messages
W_{k+1} = [ v1, A*v1, A²*v1, … , A^k*v1 ] … O(1) messages
[Q, R] = qr(W_{k+1}) … QR decomposition, O(log p) messages
H_k = R(:, 2:k+1) * (R(1:k,1:k))^(-1) … small, local problem
find y minimizing length( Hk * y – e1 ) … small, local problem
new x = x1 + Qk * y … local problem



                O(log p) messages altogether
                     Independent of k
Latency-Avoiding GMRES (2)

[Q, R] = qr(W_{k+1})    … QR decomposition,   O(log p) messages

Easy, but not so stable way to do it:
   X(myproc) = W_{k+1}(myproc)^T * W_{k+1}(myproc)
               … local computation
   Y = sum_reduction(X(myproc)) … O(log p) messages
                                  … Y = W_{k+1}^T * W_{k+1}
   R = (cholesky(Y))^T … small, local computation
   Q(myproc) = W_{k+1}(myproc) * R^(-1) … local computation
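A sequential sketch of this CholeskyQR step (the sum_reduction is omitted; in the parallel version the Gram-matrix loop becomes a local product plus an all-reduce). W is stored as m = k+1 column pointers and is overwritten with Q; names are ours, and as the slide says this variant is simple but can lose precision:

#include <math.h>

/* Given W (n x m, W[j] = pointer to column j), form G = W^T W, factor
 * G = R^T R (Cholesky), and overwrite W with Q = W R^{-1}.
 * Only the upper triangle of R (row-major, m x m) is written/used. */
void cholesky_qr(int n, int m, double **W, double *R)
{
    for (int i = 0; i < m; i++)               /* G = W^T W, upper triangle */
        for (int j = i; j < m; j++) {
            double s = 0.0;
            for (int t = 0; t < n; t++) s += W[i][t] * W[j][t];
            R[i*m + j] = s;
        }
    for (int i = 0; i < m; i++) {             /* Cholesky: G = R^T R */
        for (int p = 0; p < i; p++)
            for (int j = i; j < m; j++)
                R[i*m + j] -= R[p*m + i] * R[p*m + j];
        double d = sqrt(R[i*m + i]);
        for (int j = i; j < m; j++) R[i*m + j] /= d;
    }
    for (int j = 0; j < m; j++) {             /* Q = W R^{-1}, column by column */
        for (int p = 0; p < j; p++)
            for (int t = 0; t < n; t++)
                W[j][t] -= R[p*m + j] * W[p][t];
        for (int t = 0; t < n; t++)
            W[j][t] /= R[j*m + j];
    }
}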
Numerical example (1)
     Diagonal matrix with n=1000, A(i,i) from 1 down to 10^(-5)
         Instability as k grows, after many iterations
Numerical Example (2)
        Partial remedy: restarting periodically (every 120 iterations)
Other remedies: high precision, different basis than [v, A*v, …, A^k*v]
Operation Counts for [Ax, A²x, A³x, …, A^k·x] on p procs

Problem            Per-proc cost   Standard                      Optimized

1D mesh            #messages       2k                            2
(tridiagonal)      #words sent     2k                            2k
                   #flops          5kn/p                         5kn/p + 5k²
                   memory          (k+4)n/p                      (k+4)n/p + 8k

3D mesh            #messages       26k                           26
(27-point stencil) #words sent     6kn²p^(-2/3) + 12knp^(-1/3)   6kn²p^(-2/3) + 12k²np^(-1/3)
                                   + O(k)                        + O(k³)
                   #flops          53kn³/p                       53kn³/p + O(k²n²p^(-2/3))
                   memory          (k+28)n³/p + 6n²p^(-2/3)      (k+28)n³/p + 168kn²p^(-2/3)
                                   + O(np^(-1/3))                + O(k²np^(-1/3))
Summary and Future Work

 • Dense
    – LAPACK
    – ScaLAPACK
 • Communication primitives
 • Sparse
    – Kernels, Stencils
    – Higher level algorithms
 • All of the above on new architectures
    – Vector, SMPs, multicore, Cell, …
 • High level support for tuning
    – Specification language
    – Integration into compilers
Extra Slides
A Sparse Matrix You Encounter Every Day
             Who am I?

                                      I am a
                                  Big Repository
                                     Of useful
                                   And useless
                                    Facts alike.

                                   Who am I?

                               (Hint: Not your e-mail
                                       inbox.)
What about the Google Matrix?

 • Google approach
    – Approx. once a month: rank all pages using connectivity structure
        • Find dominant eigenvector of a matrix
    – At query-time: return list of pages ordered by rank
 • Matrix: A = aG + (1-a)(1/n)uu^T
    – Markov model: surfer follows a link with probability a, jumps to a
      random page with probability 1-a
    – G is n x n connectivity matrix [n ≈ billions]
        • g(i,j) is non-zero if page i links to page j
        • Normalized so each column sums to 1
        • Very sparse: about 7—8 non-zeros per row (power-law dist.)
    – u is a vector of all 1 values
    – Steady-state probability x(i) of landing on page i is solution to x = Ax
 • Approximate x by power method: x = A^k·x0
    – In practice, k ≈ 25
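A sketch of the corresponding power iteration, applying the sparse part G with the CSR kernel from earlier and handling the rank-one teleportation term as a scalar correction (u is all ones, so u^T·x is just the sum of x); function and argument names are ours:

/* Prototype of the CSR kernel from earlier: y += G*x. */
void csr_spmv(int n, const int *ptr, const int *ind,
              const double *val, const double *x, double *y);

/* k steps of x <- A*x for A = a*G + (1-a)*(1/n)*u*u^T, G in CSR.
 * tmp is scratch space of length n. */
void pagerank_power(int n, const int *ptr, const int *ind, const double *val,
                    double a, int k, double *x, double *tmp)
{
    for (int it = 0; it < k; it++) {
        double s = 0.0;
        for (int i = 0; i < n; i++) { s += x[i]; tmp[i] = 0.0; }
        csr_spmv(n, ptr, ind, val, x, tmp);              /* tmp = G*x */
        for (int i = 0; i < n; i++)
            x[i] = a * tmp[i] + (1.0 - a) * s / n;       /* teleportation term */
    }
}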
Current Work

 • Public software release
 • Impact on library designs: Sparse BLAS, Trilinos, PETSc, …
 • Integration in large-scale applications
     – DOE: Accelerator design; plasma physics
     – Geophysical simulation based on Block Lanczos (ATA*X; LBL)
 • Systematic heuristics for data structure selection?
 • Evaluation of emerging architectures
     – Revisiting vector micros
 • Other sparse kernels
     – Matrix triple products, Ak*x
 • Parallelism
 • Sparse benchmarks (with UTK) [Gahvari & Hoemmen]
 • Automatic tuning of MPI collective ops [Nishtala, et al.]
Summary of High-Level Themes

 • "Kernel-centric" optimization
    – Vs. basic block, trace, path optimization, for instance
    – Aggressive use of domain-specific knowledge
 • Performance bounds modeling
    – Evaluating software quality
    – Architectural characterizations and consequences
 • Empirical search
    – Hybrid on-line/run-time models
 • Statistical performance models
    – Exploit information from sampling, measuring
Related Work

 • My bibliography: 337 entries so far
 • Sample area 1: Code generation
    – Generative & generic programming
    – Sparse compilers
    – Domain-specific generators
 • Sample area 2: Empirical search-based tuning
    – Kernel-centric
       • linear algebra, signal processing, sorting, MPI, …
    – Compiler-centric
       • profiling + FDO, iterative compilation, superoptimizers, self-
         tuning compilers, continuous program optimization
Future Directions (A Bag of Flaky Ideas)

  • Composable code generators and search spaces
  • New application domains
     – PageRank: multilevel block algorithms for topic-sensitive search?
  • New kernels: cryptokernels
     – rich mathematical structure germane to performance; lots of
       hardware
  • New tuning environments
      – Parallel, Grid, "whole systems"
  • Statistical models of application performance
     – Statistical learning of concise parametric models from traces for
       architectural evaluation
     – Compiler/automatic derivation of parametric models
Possible Future Work
  • Different Architectures
      – New FP instruction sets (SSE2)
      – SMP / multicore platforms
      – Vector architectures
  • Different Kernels
  • Higher Level Algorithms
       – Parallel versions of kernels, with optimized communication
       – Block algorithms (e.g., Lanczos)
      – XBLAS (extra precision)
  • Produce Benchmarks
      – Augment HPCC Benchmark
  • Make it possible to combine optimizations easily for any kernel
  • Related tuning activities (LAPACK & ScaLAPACK)
Review of Tuning by Illustration

                (Extra Slides)
Splitting for Variable Blocks and Diagonals

  • Decompose A = A1 + A2 + … At
      –   Detect "canonical" structures (sampling)
     –   Split
     –   Tune each Ai
     –   Improve performance and save storage
  • New data structures
     – Unaligned block CSR
          • Relax alignment in rows & columns
     – Row-segmented diagonals
Example: Variable Block Row (Matrix #12)


                                      2.1x
                                        over CSR
                                      1.8x
                                        over RB
Example: Row-Segmented Diagonals




                                   2x
                                     over CSR
Mixed Diagonal and Block Structure
Summary

 • Automated block size selection
    – Empirical modeling and search
    – Register blocking for SpMV, triangular solve, ATA*x


 • Not fully automated
    – Given a matrix, select splittings and transformations
 • Lots of combinatorial problems
    – TSP reordering to create dense blocks (Pinar ‘97; Moon, et
      al. ‘04)
Extra Slides
A Sparse Matrix You Encounter Every Day
             Who am I?

                                      I am a
                                  Big Repository
                                     Of useful
                                   And useless
                                    Facts alike.

                                   Who am I?

                               (Hint: Not your e-mail
                                       inbox.)
Problem Context

 • Sparse kernels abound
    – Models of buildings, cars, bridges, economies, …
    – Google PageRank algorithm
 • Historical trends
    –   Sparse matrix-vector multiply (SpMV): 10% of peak
     –   2x faster with "hand-tuning"
    –   Tuning becoming more difficult over time
    –   Promise of automatic tuning: PHiPAC/ATLAS, FFTW, …
 • Challenges to high-performance
    – Not dense linear algebra!
         • Complex data structures: indirect, irregular memory access
         • Performance depends strongly on run-time inputs
Key Questions, Ideas, Conclusions

  • How to tune basic sparse kernels automatically?
     – Empirical modeling and search
        •   Up to 4x speedups for SpMV
        •   1.8x for triangular solve
        •   4x for ATA*x; 2x for A2*x
        •   7x for multiple vectors
  • What are the fundamental limits on performance?
     – Kernel-, machine-, and matrix-specific upper bounds
        • Achieve 75% or more for SpMV, limiting low-level tuning
        • Consequences for architecture?
  • General techniques for empirical search-based tuning?
     – Statistical models of performance
Road Map

 •   Sparse matrix-vector multiply (SpMV) in a nutshell
 •   Historical trends and the need for search
 •   Automatic tuning techniques
 •   Upper bounds on performance
 •   Statistical models of performance
Compressed Sparse Row (CSR) Storage




Matrix-vector multiply kernel: y(i)      y(i) + A(i,j)*x(j)

for each row i
  for k=ptr[i] to ptr[i+1] do
      y[i] = y[i] + val[k]*x[ind[k]]
Road Map

 •   Sparse matrix-vector multiply (SpMV) in a nutshell
 •   Historical trends and the need for search
 •   Automatic tuning techniques
 •   Upper bounds on performance
 •   Statistical models of performance
Historical Trends in SpMV Performance

  • The Data
    –   Uniprocessor SpMV performance since 1987
     –   "Untuned" and "Tuned" implementations
    –   Cache-based superscalar micros; some vectors
    –   LINPACK
SpMV Historical Trends: Mflop/s
Example: The Difficulty of Tuning

                              • n = 21216
                              • nnz = 1.5 M
                              • kernel: SpMV

                              • Source: NASA
                                structural analysis
                                problem
Still More Surprises

                       • More complicated non-zero
                         structure in general
Still More Surprises

                       • More complicated non-zero
                         structure in general

                       • Example: 3x3 blocking
                          – Logical grid of 3x3 cells
Historical Trends: Mixed News

  • Observations
      –   Good news: Moore's-law-like behavior
      –   Bad news: "Untuned" is 10% of peak or less, worsening
      –   Good news: "Tuned" roughly 2x better today, and improving
     –   Bad news: Tuning is complex

     – (Not really news: SpMV is not LINPACK)


  • Questions
     – Application: Automatic tuning?
     – Architect: What machines are good for SpMV?
Road Map

 • Sparse matrix-vector multiply (SpMV) in a nutshell
 • Historical trends and the need for search
 • Automatic tuning techniques
    – SpMV [SC‘02; IJHPCA ‘04b]
    – Sparse triangular solve (SpTS) [ICS/POHLL ‘02]
    – ATA*x [ICCS/WoPLA ‘03]
 • Upper bounds on performance
 • Statistical models of performance
SPARSITY: Framework for Tuning SpMV

 • SPARSITY: Automatic tuning for SpMV [Im & Yelick ‘99]
    – General approach
        • Identify and generate implementation space
        • Search space using empirical models & experiments
    – Prototype library and heuristic for choosing register block size
        • Also: cache-level blocking, multiple vectors
 • What‘s new?
    – New block size selection heuristic
        • Within 10% of optimal — replaces previous version
    – Expanded implementation space
        • Variable block splitting, diagonals, combinations
     – New kernels: sparse triangular solve, A^T·A·x, A^r·x
Automatic Register Block Size Selection

  • Selecting the r x c block size
     – Off-line benchmark: characterize the machine
         • Precompute Mflops(r,c) using dense matrix for each r x c
         • Once per machine/architecture
     – Run-time “search”: characterize the matrix
         • Sample A to estimate Fill(r,c) for each r x c
     – Run-time heuristic model
         • Choose r, c to maximize Mflops(r,c) / Fill(r,c)
  • Run-time costs
     – Up to ~40 SpMVs (empirical worst case)
                     DGEMV

  Accuracy of the Tuning Heuristics (1/4)




NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”)
          DGEMV

Accuracy of the Tuning Heuristics (2/4)
          DGEMV

Accuracy of the Tuning Heuristics (3/4)
          DGEMV

Accuracy of the Tuning Heuristics (4/4)
Road Map

 •   Sparse matrix-vector multiply (SpMV) in a nutshell
 •   Historical trends and the need for search
 •   Automatic tuning techniques
 •   Upper bounds on performance
      – SC’02
 • Statistical models of performance
Motivation for Upper Bounds Model

 • Questions
    – Speedups are good, but what is the speed limit?
       • Independent of instruction scheduling, selection
     – What machines are "good" for SpMV?
Upper Bounds on Performance: Blocked SpMV

 • P = (flops) / (time)
     – Flops = 2 * nnz(A)
 • Lower bound on time: Two main assumptions
     – 1. Count memory ops only (streaming)
     – 2. Count only compulsory, capacity misses: ignore conflicts
         • Account for line sizes
         • Account for matrix size and nnz
  • Charge min access "latency" α_i at cache level L_i, and α_mem at memory
      – e.g., Saavedra-Barrera and PMaC MAPS benchmarks

    Time ≥ Σ_{i=1..κ} α_i · Hits_i  +  α_mem · Hits_mem
         = α_1 · Loads + Σ_{i=1..κ-1} (α_{i+1} − α_i) · Misses_i + (α_mem − α_κ) · Misses_κ

    (κ = number of cache levels)
Example: Bounds on Itanium 2
Example: Bounds on Itanium 2
Example: Bounds on Itanium 2
Fraction of Upper Bound Across Platforms
Achieved Performance and Machine Balance
 • Machine balance [Callahan ‘88; McCalpin ‘95]
      – Balance = Peak Flop Rate / Bandwidth (flops / double)
 • Ideal balance for mat-vec: ≈ 2 flops / double
      – For SpMV, even less

   Time ≥ α_1 · Loads + Σ_{i=1..κ-1} (α_{i+1} − α_i) · Misses_i + (α_mem − α_κ) · Misses_κ



 • SpMV ~ streaming
      – 1 / (avg load time to stream 1 array) ~ (bandwidth)
      – "Sustained" balance = peak flops / model bandwidth
Where Does the Time Go?

                          
                Time ≥ Σ_{i=1..κ} α_i · Hits_i + α_mem · Hits_mem




 • Most time assigned to memory
 • Caches "disappear" when line sizes are equal
    – Strictly increasing line sizes
Execution Time Breakdown: Matrix 40
Speedups with Increasing Line Size
Summary: Performance Upper Bounds

 • What is the best we can do for SpMV?
    – Limits to low-level tuning of blocked implementations
    – Refinements?
 • What machines are good for SpMV?
    – Partial answer: balance characterization
 • Architectural consequences?
    – Example: Strictly increasing line sizes
Road Map

 •   Sparse matrix-vector multiply (SpMV) in a nutshell
 •   Historical trends and the need for search
 •   Automatic tuning techniques
 •   Upper bounds on performance
 •   Tuning other sparse kernels
 •   Statistical models of performance
      – FDO ’00; IJHPCA ’04a
Statistical Models for Automatic Tuning

  • Idea 1: Statistical criterion for stopping a search
     – A general search model
         • Generate implementation
         • Measure performance
         • Repeat
     – Stop when probability of being within ε of optimal falls below
       threshold
         • Can estimate distribution on-line
  • Idea 2: Statistical performance models
     – Problem: Choose 1 among m implementations at run-time
     – Sample performance off-line, build statistical model
Example: Select a Matmul Implementation
Example: Support Vector Classification
Road Map

 •   Sparse matrix-vector multiply (SpMV) in a nutshell
 •   Historical trends and the need for search
 •   Automatic tuning techniques
 •   Upper bounds on performance
 •   Tuning other sparse kernels
 •   Statistical models of performance
 •   Summary and Future Work
Summary of High-Level Themes

 • "Kernel-centric" optimization
    – Vs. basic block, trace, path optimization, for instance
    – Aggressive use of domain-specific knowledge
 • Performance bounds modeling
    – Evaluating software quality
    – Architectural characterizations and consequences
 • Empirical search
    – Hybrid on-line/run-time models
 • Statistical performance models
    – Exploit information from sampling, measuring
Related Work

 • My bibliography: 337 entries so far
 • Sample area 1: Code generation
    – Generative & generic programming
    – Sparse compilers
    – Domain-specific generators
 • Sample area 2: Empirical search-based tuning
    – Kernel-centric
       • linear algebra, signal processing, sorting, MPI, …
    – Compiler-centric
       • profiling + FDO, iterative compilation, superoptimizers, self-
         tuning compilers, continuous program optimization
Future Directions (A Bag of Flaky Ideas)

  • Composable code generators and search spaces
  • New application domains
     – PageRank: multilevel block algorithms for topic-sensitive search?
  • New kernels: cryptokernels
     – rich mathematical structure germane to performance; lots of
       hardware
  • New tuning environments
      – Parallel, Grid, "whole systems"
  • Statistical models of application performance
     – Statistical learning of concise parametric models from traces for
       architectural evaluation
     – Compiler/automatic derivation of parametric models
Acknowledgements

 • Super-advisors: Jim and Kathy
 • Undergraduate R.A.s: Attila, Ben, Jen, Jin, Michael,
   Rajesh, Shoaib, Sriram, Tuyet-Linh
 • See pages xvi—xvii of dissertation.
TSP-based Reordering: Before




                               (Pinar '97;
                               Moon, et al. '04)
TSP-based Reordering: After




                              (Pinar '97;
                              Moon, et al. '04)

                              Up to 2x
                              speedups
                              over CSR
  Example: L2 Misses on Itanium 2




Misses measured using PAPI [Browne ’00]
Example: Distribution of Blocked Non-Zeros
Sparse/Dense Partitioning for SpTS

  • Partition L into sparse (L1, L2) and dense LD:

            [ L1        ] [ x1 ]   [ b1 ]
            [ L2    LD  ] [ x2 ] = [ b2 ]

  • Perform SpTS in three steps:

            (1)  L1 · x1 = b1
            (2)  b̂2 = b2 − L2 · x1
            (3)  LD · x2 = b̂2


  • Sparsity optimizations for (1)—(2); DTRSV for (3)
  • Tuning parameters: block size, size of dense triangle
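A sketch of the split solve, assuming L1 and L2 are kept in CSR (with each row's diagonal stored last in L1), LD is a dense column-major triangle, and a CBLAS dtrsv is available for step (3); the header name and all array/parameter names are ours:

#include <cblas.h>

/* Hybrid sparse/dense SpTS: n1 sparse rows, nd dense trailing rows. */
void spts_hybrid(int n1, int nd,
                 const int *ptr1, const int *ind1, const double *val1, /* L1, CSR */
                 const int *ptr2, const int *ind2, const double *val2, /* L2, CSR */
                 const double *LD /* nd x nd, column-major */,
                 const double *b, double *x)
{
    /* (1) Sparse forward solve L1 * x1 = b1 (row's diagonal stored last) */
    for (int i = 0; i < n1; i++) {
        double s = b[i];
        int kend = ptr1[i+1] - 1;
        for (int k = ptr1[i]; k < kend; k++)
            s -= val1[k] * x[ind1[k]];
        x[i] = s / val1[kend];
    }
    /* (2) b2_hat = b2 - L2 * x1 (ind2 indexes into x1) */
    for (int i = 0; i < nd; i++) {
        double s = b[n1 + i];
        for (int k = ptr2[i]; k < ptr2[i+1]; k++)
            s -= val2[k] * x[ind2[k]];
        x[n1 + i] = s;
    }
    /* (3) Dense solve LD * x2 = b2_hat with the tuned BLAS */
    cblas_dtrsv(CblasColMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                nd, LD, nd, x + n1, 1);
}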
SpTS Performance: Power3
Summary of SpTS and AAT*x Results

 • SpTS — Similar to SpMV
    – 1.8x speedups; limited benefit from low-level tuning
 • AATx, ATAx
    – Cache interleaving only: up to 1.6x speedups
    – Reg + cache: up to 4x speedups
        • 1.8x speedup over register only
    – Similar heuristic; same accuracy (~ 10% optimal)
    – Further from upper bounds: 60—80%
        • Opportunity for better low-level tuning a la PHiPAC/ATLAS
 • Matrix triple products? Ak*x?
    – Preliminary work
Register Blocking: Speedup
Register Blocking: Performance
Register Blocking: Fraction of Peak
Example: Confidence Interval Estimation
Costs of Tuning
Splitting + UBCSR: Pentium III
Splitting + UBCSR: Power4
Splitting+UBCSR Storage: Power4
Example: Variable Block Row (Matrix #13)
Preliminary Results (Matrix Set 2): Itanium 2




   [SpMV performance by matrix category: Dense, FEM, FEM (var), Bio, Econ, Stat, LP, Web/IR]
Multiple Vector Performance
What about the Google Matrix?

 • Google approach
    – Approx. once a month: rank all pages using connectivity structure
        • Find dominant eigenvector of a matrix
    – At query-time: return list of pages ordered by rank
 • Matrix: A = aG + (1-a)(1/n)uu^T
    – Markov model: surfer follows a link with probability a, jumps to a
      random page with probability 1-a
    – G is n x n connectivity matrix [n ≈ 3 billion]
        • g(i,j) is non-zero if page i links to page j
        • Normalized so each column sums to 1
        • Very sparse: about 7—8 non-zeros per row (power-law dist.)
    – u is a vector of all 1 values
    – Steady-state probability x(i) of landing on page i is solution to x = Ax
 • Approximate x by power method: x = A^k·x0
    – In practice, k ≈ 25
MAPS Benchmark Example: Power4
MAPS Benchmark Example: Itanium 2
Saavedra-Barrera Example: Ultra 2i
Execution Time Breakdown (PAPI): Matrix 40
Preliminary Results (Matrix Set 1): Itanium 2




    [SpMV performance by matrix category: Dense, FEM, FEM (var), Assorted, LP]
Tuning Sparse Triangular Solve (SpTS)

  • Compute x = L^(-1)*b where L sparse lower triangular, x
    & b dense
  • L from sparse LU has rich dense substructure
     – Dense trailing triangle can account for 20—90% of matrix
       non-zeros
  • SpTS optimizations
     – Split into sparse trapezoid and dense trailing triangle
     – Use tuned dense BLAS (DTRSV) on dense triangle
     – Use Sparsity register blocking on sparse part
  • Tuning parameters
     – Size of dense trailing triangle
     – Register block size
Sparse Kernels and Optimizations

  • Kernels
      –   Sparse matrix-vector multiply (SpMV): y = A*x
      –   Sparse triangular solve (SpTS): x = T^(-1)*b
      –   y = A·A^T·x, y = A^T·A·x
      –   Powers (y = A^k·x), sparse triple product (R·A·R^T), …
  • Optimization techniques (implementation space)
     –   Register blocking
     –   Cache blocking
     –   Multiple dense vectors (x)
     –   A has special structure (e.g., symmetric, banded, …)
     –   Hybrid data structures (e.g., splitting, switch-to-dense, …)
     –   Matrix reordering
  • How and when do we search?
     – Off-line: Benchmark implementations
     – Run-time: Estimate matrix properties, evaluate performance
       models based on benchmark data
Cache Blocked SpMV on LSI Matrix: Ultra 2i


                                   A
                                   10k x 255k
                                   3.7M non-zeros

                                   Baseline:
                                   16 Mflop/s

                                   Best block size
                                   & performance:
                                   16k x 64k
                                   28 Mflop/s
Cache Blocking on LSI Matrix: Pentium 4


                                   A
                                   10k x 255k
                                   3.7M non-zeros

                                   Baseline:
                                   44 Mflop/s

                                   Best block size
                                   & performance:
                                   16k x 16k
                                   210 Mflop/s
Cache Blocked SpMV on LSI Matrix: Itanium


                                 A
                                 10k x 255k
                                 3.7M non-zeros

                                 Baseline:
                                 25 Mflop/s

                                 Best block size
                                 & performance:
                                 16k x 32k
                                 72 Mflop/s
Cache Blocked SpMV on LSI Matrix: Itanium 2


                                  A
                                  10k x 255k
                                  3.7M non-zeros

                                  Baseline:
                                  170 Mflop/s

                                  Best block size
                                  & performance:
                                  16k x 65k
                                  275 Mflop/s
Inter-Iteration Sparse Tiling (1/3)

  x1       t1      y1    • [Strout, et al., '01]
                         • Let A be 6x6 tridiagonal
                         • Consider y = A²x
  x2       t2      y2
                            – t=Ax, y=At
                         • Nodes: vector elements
  x3       t3      y3    • Edges: matrix elements aij


  x4       t4      y4



  x5       t5      y5
Inter-Iteration Sparse Tiling (2/3)

   (Figure: three-column dependency graph with nodes x_i, t_i, y_i)
 • [Strout, et al., '01]
 • Let A be 6x6 tridiagonal
 • Consider y=A^2x
    – t=Ax, y=At
 • Nodes: vector elements
 • Edges: matrix elements a_ij
 • Orange = everything needed to compute y1
    – Reuse a11, a12
Inter-Iteration Sparse Tiling (3/3)

   (Figure: three-column dependency graph with nodes x_i, t_i, y_i)
 • [Strout, et al., '01]
 • Let A be 6x6 tridiagonal
 • Consider y=A^2x (a fused two-pass sketch follows this slide)
    – t=Ax, y=At
 • Nodes: vector elements
 • Edges: matrix elements a_ij
 • Orange = everything needed to compute y1
    – Reuse a11, a12
 • Grey = y2, y3
    – Reuse a23, a33, a43
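As a simplified illustration of the tiling idea on this tridiagonal example, the sketch below fuses t=Ax and y=At so that t[i] is produced just before the last row that needs it; the reuse of each matrix entry then hits cache rather than memory. The three-array tridiagonal storage (dl, d, du) is my own illustrative choice, not from the slides:

   /* y = A^2 x for an n x n tridiagonal A (n >= 3), with A(i,i)=d[i],
    * A(i+1,i)=dl[i], A(i,i+1)=du[i].  t=Ax and y=At are computed in one
    * streaming pass with a lag of one row. */
   void tridiag_A2x(int n, const double *dl, const double *d, const double *du,
                    const double *x, double *t, double *y)
   {
       t[0] = d[0]*x[0] + du[0]*x[1];
       for (int i = 1; i < n; i++) {
           t[i] = dl[i-1]*x[i-1] + d[i]*x[i] + (i+1 < n ? du[i]*x[i+1] : 0.0);
           /* t[i-2..i] are now available, so row i-1 of y can be finished. */
           y[i-1] = (i >= 2 ? dl[i-2]*t[i-2] : 0.0) + d[i-1]*t[i-1] + du[i-1]*t[i];
       }
       y[n-1] = dl[n-2]*t[n-2] + d[n-1]*t[n-1];
   }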
Inter-Iteration Sparse Tiling: Issues

   (Figure: three-column dependency graph with nodes x_i, t_i, y_i)
 • Tile sizes (colored regions) grow with no. of iterations
   and increasing out-degree
    – G likely to have a few nodes with high out-degree (e.g., Yahoo)
 • Mathematical tricks to limit tile size?
    – Judicious dropping of edges [Ng '01]
Summary and Questions

 • Need to understand matrix structure and machine
    – BeBOP: suite of techniques to deal with different sparse structures
      and architectures
 • Google matrix problem
    – Established techniques within an iteration
    – Ideas for inter-iteration optimizations
    – Mathematical structure of problem may help
 • Questions
    – Structure of G?
    – What are the computational bottlenecks?
    – Enabling future computations?
        • E.g., topic-sensitive PageRank → multiple-vector version
          [Haveliwala '02]
    – See www.cs.berkeley.edu/~richie/bebop/intel/google for more info,
      including more complete Itanium 2 results.
Exploiting Matrix Structure

  • Symmetry (numerical or structural)
     – Reuse matrix entries
     – Can combine with register blocking, multiple vectors, …
  • Matrix splitting
     – Split the matrix, e.g., into r x c and 1 x 1
     – No fill overhead
  • Large matrices with random structure
     – E.g., Latent Semantic Indexing (LSI) matrices
     – Technique: cache blocking
         • Store matrix as a grid of 2^i x 2^j sparse submatrices
         • Effective when x vector is large
         • Currently, search to find fastest size (a sketch follows below)
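A minimal sketch of cache-blocked SpMV under this storage: the matrix is kept as a grid of sparse submatrices (each in CSR here), so the slice of x touched by one submatrix stays in cache across its rows. The csr_t type and block layout are illustrative only:

   /* Illustrative storage: an nbr x nbc grid of CSR submatrices, each covering
    * R rows and C columns of the original matrix (indices are block-local). */
   typedef struct { int nrows; const int *ptr, *ind; const double *val; } csr_t;

   void spmv_cache_blocked(int nbr, int nbc, int R, int C,
                           const csr_t *blocks, const double *x, double *y)
   {
       for (int bi = 0; bi < nbr; bi++)
           for (int bj = 0; bj < nbc; bj++) {
               const csr_t *B = &blocks[bi*nbc + bj];   /* submatrix (bi, bj) */
               const double *xb = &x[(long)bj * C];     /* x slice reused across rows */
               double *yb = &y[(long)bi * R];
               for (int i = 0; i < B->nrows; i++)
                   for (int k = B->ptr[i]; k < B->ptr[i+1]; k++)
                       yb[i] += B->val[k] * xb[B->ind[k]];
           }
   }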
Symmetric SpMV Performance: Pentium 4
SpMV with Split Matrices: Ultra 2i
Cache Blocking on Random Matrices: Itanium


Speedup on four banded
random matrices.
Sparse Kernels and Optimizations

  • Kernels
     –   Sparse matrix-vector multiply (SpMV): y=A*x
      –   Sparse triangular solve (SpTS): x=T^{-1}*b
      –   y=AA^T*x, y=A^TA*x
      –   Powers (y=A^k*x), sparse triple-product (R*A*R^T), …
  • Optimization techniques (implementation space)
     –   Register blocking
     –   Cache blocking
     –   Multiple dense vectors (x)
     –   A has special structure (e.g., symmetric, banded, …)
     –   Hybrid data structures (e.g., splitting, switch-to-dense, …)
     –   Matrix reordering
  • How and when do we search?
     – Off-line: Benchmark implementations
     – Run-time: Estimate matrix properties, evaluate performance
       models based on benchmark data
Register Blocked SpMV: Pentium III
Register Blocked SpMV: Ultra 2i
Register Blocked SpMV: Power3
Register Blocked SpMV: Itanium
Possible Optimization Techniques

  • Within an iteration, i.e., computing (G+uu^T)*x once
     – Cache block G*x
         • On linear programming matrices and matrices with random
           structure (e.g., LSI), 1.5-4x speedups
         • Best block size is matrix and machine dependent
     – Reordering and/or splitting of G to separate dense structure
       (rows, columns, blocks)
  • Between iterations, e.g., (G+uu^T)^2x (see the sketch below)
     – (G+uu^T)^2x = G^2x + (Gu)u^Tx + u(u^TG)x + u(u^Tu)u^Tx
         • Compute Gu, u^TG, u^Tu once for all iterations
         • G^2x: Inter-iteration tiling to read G only once
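A sketch of one application of the identity above, with Gu = G*u, GTu = G^T*u, and utu = u^T*u precomputed once; the csr_t type and spmv() helper are the illustrative ones sketched earlier:

   /* Assumed helpers (illustrative): CSR type, y = G*x, and a dot product. */
   typedef struct { int n; const int *ptr, *ind; const double *val; } csr_t;
   void spmv(const csr_t *G, const double *x, double *y);  /* as sketched earlier */
   static double dot(const double *a, const double *b, int n) {
       double s = 0.0;
       for (int i = 0; i < n; i++) s += a[i]*b[i];
       return s;
   }

   /* One application of y = (G + u u^T)^2 x; t is scratch of length n. */
   void google_step(const csr_t *G, const double *u, const double *Gu,
                    const double *GTu, double utu,
                    const double *x, double *t, double *y)
   {
       int n = G->n;
       double utx  = dot(u,   x, n);    /* u^T x       */
       double utGx = dot(GTu, x, n);    /* (u^T G) x   */
       spmv(G, x, t);                   /* t = G x     */
       spmv(G, t, y);                   /* y = G^2 x   */
       for (int i = 0; i < n; i++)
           y[i] += Gu[i]*utx + u[i]*(utGx + utu*utx);
   }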
Multiple Vector Performance: Itanium
Sparse Kernels and Optimizations

  • Kernels
     –   Sparse matrix-vector multiply (SpMV): y=A*x
      –   Sparse triangular solve (SpTS): x=T^{-1}*b
      –   y=AA^T*x, y=A^TA*x
      –   Powers (y=A^k*x), sparse triple-product (R*A*R^T), …
  • Optimization techniques (implementation space)
     –   Register blocking
     –   Cache blocking
     –   Multiple dense vectors (x)
     –   A has special structure (e.g., symmetric, banded, …)
     –   Hybrid data structures (e.g., splitting, switch-to-dense, …)
     –   Matrix reordering
  • How and when do we search?
     – Off-line: Benchmark implementations
     – Run-time: Estimate matrix properties, evaluate performance
       models based on benchmark data
SpTS Performance: Itanium




(See POHLL '02 workshop paper, at ICS '02.)
Sparse Kernels and Optimizations

  • Kernels
     –   Sparse matrix-vector multiply (SpMV): y=A*x
      –   Sparse triangular solve (SpTS): x=T^{-1}*b
      –   y=AA^T*x, y=A^TA*x
      –   Powers (y=A^k*x), sparse triple-product (R*A*R^T), …
  • Optimization techniques (implementation space)
     –   Register blocking
     –   Cache blocking
     –   Multiple dense vectors (x)
     –   A has special structure (e.g., symmetric, banded, …)
     –   Hybrid data structures (e.g., splitting, switch-to-dense, …)
     –   Matrix reordering
  • How and when do we search?
     – Off-line: Benchmark implementations
     – Run-time: Estimate matrix properties, evaluate performance
       models based on benchmark data
Optimizing AA^T*x

 • Kernel: y=AA^T*x, where A is sparse, x & y dense
    – Arises in linear programming, computation of SVD
    – Conventional implementation: compute z=A^T*x, y=A*z
 • Elements of A can be reused:

       y = A·A^T·x = [a_1 … a_n]·[a_1^T; …; a_n^T]·x = Σ_{k=1..n} a_k·(a_k^T·x)

 • When the a_k represent blocks of columns, can apply register
   blocking.
Optimized AA^T*x Performance: Pentium III
Current Directions

  • Applying new optimizations
     – Other split data structures (variable block, diagonal, …)
     – Matrix reordering to create block structure
     – Structural symmetry
  • New kernels (triple product R·A·R^T, powers A^k, …)
  • Tuning parameter selection
  • Building an automatically tuned sparse matrix library
     – Extending the Sparse BLAS
     – Leverage existing sparse compilers as code generation
       infrastructure
     – More thoughts on this topic tomorrow
Related Work

 • Automatic performance tuning systems
    – PHiPAC [Bilmes, et al., ‘97], ATLAS [Whaley & Dongarra ‘98]
    – FFTW [Frigo & Johnson ‘98], SPIRAL [Pueschel, et al., ‘00],
      UHFFT [Mirkovic and Johnsson ‘00]
    – MPI collective operations [Vadhiyar & Dongarra ‘01]
 • Code generation
    – FLAME [Gunnels & van de Geijn, ‘01]
    – Sparse compilers: [Bik ‘99], Bernoulli [Pingali, et al., ‘97]
    – Generic programming: Blitz++ [Veldhuizen ‘98], MTL [Siek &
      Lumsdaine ‘98], GMCL [Czarnecki, et al. ‘98], …
 • Sparse performance modeling
    – [Temam & Jalby ‘92], [White & Saddayappan ‘97], [Navarro,
      et al., ‘96], [Heras, et al., ‘99], [Fraguela, et al., ‘99], …
More Related Work

 • Compiler analysis, models
    – CROPS [Carter, Ferrante, et al.]; Serial sparse tiling [Strout
      ‘01]
    – TUNE [Chatterjee, et al.]
    – Iterative compilation [O'Boyle, et al., '98]
    – Broadway compiler [Guyer & Lin, ‘99]
    – [Brewer ‘95], ADAPT [Voss ‘00]
 • Sparse BLAS interfaces
    –   BLAST Forum (Chapter 3)
    –   NIST Sparse BLAS [Remington & Pozo ‘94]; SparseLib++
    –   SPARSKIT [Saad ‘94]
    –   Parallel Sparse BLAS [Fillipone, et al. ‘96]
Context: Creating High-Performance Libraries

  • Application performance dominated by a few
    computational kernels
  • Today: Kernels hand-tuned by vendor or user
  • Performance tuning challenges
     – Performance is a complicated function of kernel,
       architecture, compiler, and workload
     – Tedious and time-consuming
  • Successful automated approaches
     – Dense linear algebra: ATLAS/PHiPAC
     – Signal processing: FFTW/SPIRAL/UHFFT
Cache Blocked SpMV on LSI Matrix: Itanium
Sustainable Memory Bandwidth
Multiple Vector Performance: Pentium 4
Multiple Vector Performance: Itanium
Multiple Vector Performance: Pentium 4
Optimized AA^T*x Performance: Ultra 2i
Optimized AA^T*x Performance: Pentium 4
High Precision GEMV (XBLAS)
  High Precision Algorithms (XBLAS)
• Double-double (high-precision word represented as a pair of doubles)
   – Many variations on these algorithms; we currently use Bailey's
• Exploiting Extra-wide Registers (a summation sketch follows below)
   – Suppose s(1), …, s(n) have f-bit fractions, SUM has an F>f bit fraction
   – Consider the following algorithm for S = Σ_{i=1..n} s(i)
      • Sort so that |s(1)| ≥ |s(2)| ≥ … ≥ |s(n)|
      • SUM = 0, for i = 1 to n SUM = SUM + s(i), end for, sum = SUM
   – Theorem (D., Hida): Suppose F < 2f (less than double precision)
      • If n ≤ 2^(F-f) + 1, then error ≤ 1.5 ulps
      • If n = 2^(F-f) + 2, then error ≤ 2^(2f-F) ulps (can be ≥ 1)
      • If n ≥ 2^(F-f) + 3, then error can be arbitrary (S ≠ 0 but sum = 0)
   – Examples
      • s(i) double (f=53), SUM double extended (F=64)
           – accurate if n ≤ 2^11 + 1 = 2049
      • Dot product of single precision x(i) and y(i)
           – s(i) = x(i)*y(i) (f=2*24=48), SUM double extended (F=64)
           – accurate if n ≤ 2^16 + 1 = 65537
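A minimal sketch of the sorted, extra-wide-accumulator summation above, assuming an x86 long double with a 64-bit fraction (F=64) and double inputs (f=53), so the result is accurate for n ≤ 2049:

   #include <stdlib.h>
   #include <math.h>

   static int cmp_desc_abs(const void *pa, const void *pb) {
       double a = fabs(*(const double *)pa), b = fabs(*(const double *)pb);
       return (a < b) - (a > b);              /* sort by decreasing |s[i]| */
   }

   double sum_wide(double *s, size_t n) {     /* accurate if n <= 2^(F-f)+1 = 2049 */
       qsort(s, n, sizeof(double), cmp_desc_abs);
       long double SUM = 0.0L;                /* F = 64 on x86 long double */
       for (size_t i = 0; i < n; i++) SUM += s[i];
       return (double)SUM;
   }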
More Extra Slides
Awards
• Best Paper, Intern. Conf. Parallel Processing, 2004
   – "Performance models for evaluation and automatic performance tuning of
     symmetric sparse matrix-vector multiply"
• Best Student Paper, Intern. Conf. Supercomputing, Workshop
  on Performance Optimization via High-Level Languages and
  Libraries, 2003
   – Best Student Presentation too, to Richard Vuduc
   – "Automatic performance tuning and analysis of sparse triangular solve"
• Finalist, Best Student Paper, Supercomputing 2002
   – To Richard Vuduc
   – "Performance Optimization and Bounds for Sparse Matrix-vector Multiply"

• Best Presentation Prize, MICRO-33: 3rd ACM Workshop on
  Feedback-Directed Dynamic Optimization, 2000
   – To Richard Vuduc
   – "Statistical Modeling of Feedback Data in an Automatic Tuning System"
Accuracy of the Tuning Heuristics (4/4)
         DGEMV

Can Match DGEMV Performance
Fraction of Upper Bound Across Platforms
Achieved Performance and Machine Balance
 • Machine balance [Callahan '88; McCalpin '95]
    – Balance = Peak Flop Rate / Bandwidth (flops / double)
    – Lower is better (i.e., can hope to get a higher fraction of peak
      flop rate)
 • Ideal balance for mat-vec: ≤ 2 flops / double (worked example below)
    – For SpMV, even less
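 • Worked example (hypothetical numbers, for illustration only): a machine with a
   4 Gflop/s peak and 8 GB/s = 1 Gdouble/s of sustainable bandwidth has balance
   4 flops/double; mat-vec does roughly 2 flops per double moved, so it can reach
   at most 2/4 = 50% of that machine's peak.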
Where Does the Time Go?

                          
                Time = Σ_i α_i·Hits_i + α_mem·Hits_mem

      (α_i = access latency at cache level i, Hits_i = hits satisfied at that
       level; a small sketch evaluating this model follows below)

 • Most time assigned to memory
 • Caches "disappear" when line sizes are equal
    – Strictly increasing line sizes
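A tiny sketch evaluating the model above from measured hit counts; the array layout and names are mine, not from the slides:

   /* Time = sum_i alpha[i]*hits[i] + alpha_mem*hits_mem, where alpha[i] is the
    * access latency of cache level i and hits[i] the hits satisfied there. */
   double model_time(int levels, const double *alpha, const double *hits,
                     double alpha_mem, double hits_mem)
   {
       double t = alpha_mem * hits_mem;
       for (int i = 0; i < levels; i++) t += alpha[i] * hits[i];
       return t;
   }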
Execution Time Breakdown: Matrix 40
Execution Time Breakdown (PAPI): Matrix 40
Speedups with Increasing Line Size
Summary: Performance Upper Bounds

 • What is the best we can do for SpMV?
    – Limits to low-level tuning of blocked implementations
    – Refinements?
 • What machines are good for SpMV?
    – Partial answer: balance characterization
 • Architectural consequences?
    – Help to have strictly increasing line sizes
Evaluating algorithms and machines for SpMV


 • Metrics
    – Speedups
    – Mflop/s ("fair" flops)
    – Fraction of peak
 • Questions
    – Speedups are good, but what is "the best"?
        • Independent of instruction scheduling, selection
        • Can SpMV be further improved or not?
    – What machines are "good" for SpMV?
Statistical Models for Automatic Tuning

  • Idea 1: Statistical criterion for stopping a search
     – A general search model
         • Generate implementation
         • Measure performance
         • Repeat
     – Stop when the probability of being within ε of optimal falls below a
       threshold
         • Can estimate distribution on-line
  • Idea 2: Statistical performance models
     – Problem: Choose 1 among m implementations at run-time
     – Sample performance off-line, build statistical model
SpMV Historical Trends: Fraction of Peak
Motivation for Automatic Performance Tuning of SpMV

 • Historical trends
    – Sparse matrix-vector multiply (SpMV): 10% of peak or less
 • Performance depends on machine, kernel, matrix
    – Matrix known at run-time
    – Best data structure + implementation can be surprising
 • Our approach: empirical performance modeling and
   algorithm search
 Sparse CALA Summary
• Optimizing one SpMV too limiting
• Take k steps of Krylov subspace method
   – GMRES, CG, Lanczos, Arnoldi
    – Assume matrix "well-partitioned," with modest surface-to-volume
      ratio
   – Parallel implementation
      • Conventional: O(k log p) messages
      • CALA: O(log p) messages - optimal
   – Serial implementation
      • Conventional: O(k) moves of data from slow to fast memory
      • CALA: O(1) moves of data – optimal
   – Can incorporate some preconditioners
       • Need to be able to "compress" interactions between distant i, j
       • Hierarchical, semiseparable matrices …
    – Lots of speedup possible (measured and modeled)
      • Price: some redundant computation
Motivation for Automatic Performance Tuning
• Writing high performance software is hard
   – Make programming easier while getting high speed
• Ideal: program in your favorite high level language
  (Matlab, Python, PETSc…) and get a high fraction of
  peak performance
• Reality: Best algorithm (and its implementation) can
  depend strongly on the problem, computer
  architecture, compiler,…
   – Best choice can depend on knowing a lot of applied
     mathematics and computer science
• How much of this can we teach?
• How much of this can we automate?
Examples of Automatic Performance Tuning (1)
 • Dense BLAS
     – Sequential
     – PHiPAC (UCB), then ATLAS (UTK) (used in Matlab)
     – math-atlas.sourceforge.net/
 • Fast Fourier Transform (FFT) & variations
     – Sequential and Parallel
     – FFTW (MIT)
     – www.fftw.org
 • Digital Signal Processing
     – SPIRAL: www.spiral.net (CMU)
 • Communication Collectives (UCB, UTK)
 • Rose (LLNL), Bernoulli (Cornell), Telescoping Languages (Rice), …
 • More projects, conferences, government reports, …
Potential Impact on Applications: T3P

  • Application: accelerator design [Ko]
  • 80% of time spent in SpMV
  • Relevant optimization techniques
     – Symmetric storage
     – Register blocking
  • On Single Processor Itanium 2
     – 1.68x speedup
        • 532 Mflops, or 15% of 3.6 GFlop peak
     – 4.4x speedup with multiple (8) vectors
        • 1380 Mflops, or 38% of peak
Examples of Automatic Performance Tuning (2)
 • What do dense BLAS, FFTs, signal processing, MPI reductions
   have in common?
    – Can do the tuning off-line: once per architecture, algorithm
    – Can take as much time as necessary (hours, a week…)
    – At run-time, algorithm choice may depend only on few parameters
        • Matrix dimension, size of FFT, etc.
   SpMV Performance (Matrix #2): Generation 2
   (Figure: register-profile panels; each shows best and reference Mflop/s)
      Ultra 2i - 9% of peak:      63 / 35 Mflop/s
      Ultra 3 - 5%:              109 / 53 Mflop/s
      Pentium III - 19%:          96 / 42 Mflop/s
      Pentium III-M - 15%:       120 / 58 Mflop/s
   SpMV Performance (Matrix #2): Generation 1
   (Figure: register-profile panels; each shows best and reference Mflop/s)
      Power3 - 13%:      195 / 100 Mflop/s
      Power4 - 14%:      703 / 469 Mflop/s
      Itanium 1 - 7%:    225 / 103 Mflop/s
      Itanium 2 - 31%:   1.1 Gflop/s / 276 Mflop/s
Taking advantage of block structure in SpMV

  • Bottleneck is time to get matrix from memory
  • Don't store each nonzero with index; instead store
    each nonzero r-by-c block with index
     – Storage drops by up to 2x, if r·c >> 1, all 32-bit quantities
     – Time to fetch matrix from memory decreases
  • Change both data structure and algorithm (a 2x2 BCSR kernel is
    sketched below)
     – Need to pick r and c
     – Need to change algorithm accordingly
  • In example, is r=c=8 best choice?
     – Minimizes storage, so looks like a good idea…
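A minimal sketch of the blocked kernel, hard-coded to r=c=2 for clarity; b_ptr/b_ind index 2x2 blocks and b_val stores each block contiguously in row-major order (layout and names are illustrative):

   /* SpMV with 2x2 register blocks (BCSR): y += A*x. */
   void spmv_bcsr_2x2(int n_brows, const int *b_ptr, const int *b_ind,
                      const double *b_val, const double *x, double *y)
   {
       for (int I = 0; I < n_brows; I++) {          /* block row I -> rows 2I, 2I+1 */
           double y0 = y[2*I], y1 = y[2*I + 1];
           for (int k = b_ptr[I]; k < b_ptr[I+1]; k++) {
               const double *b  = &b_val[4*k];      /* one 2x2 block, row-major */
               const double *xp = &x[2*b_ind[k]];   /* block column -> cols 2j, 2j+1 */
               y0 += b[0]*xp[0] + b[1]*xp[1];
               y1 += b[2]*xp[0] + b[3]*xp[1];
           }
           y[2*I] = y0; y[2*I + 1] = y1;
       }
   }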
  Example: L2 Misses on Itanium 2




Misses measured using PAPI [Browne ’00]
Example: Sparse Triangular Factor

                          • Raefsky4 (structural
                            problem) + SuperLU +
                            colmmd
                          • N=19779, nnz=12.6 M




                            Dense trailing triangle:
                            dim=2268, 20% of total
                            nz

                            Can be as high as 90+%!
                            1.8x over CSR
Cache Optimizations for AA^T*x

  • Cache-level: interleave multiplication by A and A^T (sketched below)
     – Only fetch A from memory once

       AA^T·x = [a_1 … a_n]·[a_1^T; …; a_n^T]·x = Σ_{i=1..n} a_i·(a_i^T·x)
                                                  "axpy"      dot product

  • Register-level: a_i^T taken to be an r×c block row, or a diagonal row
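A sketch of the cache-level interleaving: storing A^T in CSR makes each column a_i of A a stored row, which is read once, used for the dot product a_i^T·x, and immediately reused for the axpy. The csr_t type is the illustrative one from the earlier sketches:

   typedef struct { int n; const int *ptr, *ind; const double *val; } csr_t;

   /* y = A A^T x, with A^T in CSR so row i of the stored matrix is column a_i
    * of A (At->n = #columns of A; m = #rows of A = length of x and y). */
   void aat_x(const csr_t *At, const double *x, double *y, int m)
   {
       for (int j = 0; j < m; j++) y[j] = 0.0;
       for (int i = 0; i < At->n; i++) {
           double aitx = 0.0;                       /* dot product: a_i^T x */
           for (int k = At->ptr[i]; k < At->ptr[i+1]; k++)
               aitx += At->val[k] * x[At->ind[k]];
           for (int k = At->ptr[i]; k < At->ptr[i+1]; k++)
               y[At->ind[k]] += At->val[k] * aitx;  /* axpy: y += a_i*(a_i^T x) */
       }
   }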
Example: Combining Optimizations (1/2)

 • Register blocking, symmetry, multiple (k) vectors
    – Three low-level tuning parameters: r, c, v

     (Figure: Y += A·X, where A has r×c register blocks, A is symmetric, and
      the k vectors of X are processed v at a time)
Example: Combining Optimizations (2/2)

 • Register blocking, symmetry, and multiple vectors
   [Ben Lee @ UCB]
    – Symmetric, blocked, 1 vector
       • Up to 2.6x over nonsymmetric, blocked, 1 vector
    – Symmetric, blocked, k vectors
       • Up to 2.1x over nonsymmetric, blocked, k vecs.
        • Up to 7.3x over nonsymmetric, nonblocked, 1 vector
    – Symmetric Storage: up to 64.7% savings
How the OSKI Tunes (Overview)

 • At library build/install-time
     – Pre-generate and compile code variants into dynamic libraries
     – Collect benchmark data
         • Measures and records speed of possible sparse data structure and code
           variants on target architecture
     – Installation process uses standard, portable GNU AutoTools
 • At run-time
     – Library "tunes" using heuristic models (a typical call sequence is
       sketched below)
         • Models analyze user's matrix & benchmark data to choose optimized
           data structure and code
     – Non-trivial tuning cost: up to ~40 mat-vecs
         • Library limits the time it spends tuning based on estimated workload
              – provided by user or inferred by library
         • User may reduce cost by saving tuning results for application on future
           runs with same or similar matrix
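For context, a sketch of a typical tuned SpMV call sequence; the names follow my reading of the OSKI User's Guide and should be treated as assumptions rather than exact signatures:

   #include <oski/oski.h>

   /* Sketch: wrap an existing CSR matrix, declare the expected workload,
    * let OSKI tune, then call the tuned kernel. Error checking omitted. */
   void tuned_spmv(int nrows, int ncols, int *Aptr, int *Aind, double *Aval,
                   double *x, double *y, int num_calls)
   {
       oski_Init();
       oski_matrix_t A = oski_CreateMatCSR(Aptr, Aind, Aval, nrows, ncols,
                                           SHARE_INPUTMAT, 1, INDEX_ZERO_BASED);
       oski_vecview_t xv = oski_CreateVecView(x, ncols, STRIDE_UNIT);
       oski_vecview_t yv = oski_CreateVecView(y, nrows, STRIDE_UNIT);

       /* Hint: expect ~num_calls mat-vecs with this matrix, then tune. */
       oski_SetHintMatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv, num_calls);
       oski_TuneMat(A);

       for (int i = 0; i < num_calls; i++)
           oski_MatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv);   /* y = A*x */

       oski_DestroyVecView(xv); oski_DestroyVecView(yv);
       oski_DestroyMat(A);
       oski_Close();
   }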
Optimizations in OSKI, so far

   •   Fully automatic heuristics for
        – Sparse matrix-vector multiply
             • Register-level blocking
             • Register-level blocking + symmetry + multiple vectors
             • Cache-level blocking
         – Sparse triangular solve with register-level blocking and "switch-to-dense"
           optimization
         – Sparse A^TA*x with register-level blocking
    •   User may select other optimizations manually
         – Diagonal storage optimizations, reordering, splitting; tiled matrix powers
           kernel (A^k*x)
         – All available in dynamic libraries
         – Accessible via high-level embedded script language
    •   "Plug-in" extensibility
        – Very advanced users may write their own heuristics, create new data
          structures/code variants and dynamically add them to the system
Performance Results
• Measured
   – Sequential/OOC speedup up to 3x
• Modeled
   – Sequential/multicore speedup up to 2.5x
   – Parallel/Petascale speedup up to 6.9x
   – Parallel/Grid speedup up to 22x
• See bebop.cs.berkeley.edu/#pubs

				