The Future of LAPACK and ScaLAPACK
www.netlib.org/lapack-dev

       Jim Demmel
       UC Berkeley
       23 Feb 2007
                   Outline
•   Motivation for new Sca/LAPACK
•   Challenges (or research opportunities…)
•   Goals of new Sca/LAPACK
•   Highlights of progress
    – With some excursions …
                        Motivation
• LAPACK and ScaLAPACK are widely used
  – Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks,
    NAG, NEC, SGI, …
  – >68M web hits @ Netlib (incl. CLAPACK, LAPACK95)
       • 35K hits/day
• Many ways to improve them, based on
  –   Own algorithmic research
  –   Enthusiastic participation of research community
  –   User/vendor survey
  –   Opportunities and demands of new architectures,
      programming languages
• New releases planned (NSF support)
                      Participants
• UC Berkeley:
   – Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li,
     Osni Marques, Christof Voemel, David Bindel, Yozo Hida,
     Jason Riedy, undergrads…
• U Tennessee, Knoxville
   – Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek,
     Stan Tomov, Alfredo Buttari, Jakub Kurzak
• Other Academic Institutions
   – UT Austin, UC Davis, Florida IT, U Kansas, U Maryland,
     North Carolina SU, San Jose SU, UC Santa Barbara
   – TU Berlin, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid,
     U Manchester, U Umeå, U Wuppertal, U Zagreb
• Research Institutions
   – CERFACS, LBL
• Industrial Partners
   – Cray, HP, Intel, Interactive Supercomputing, MathWorks, NAG, SGI
        Challenges: Increasing Parallelism
                 In the Top500:
[Chart: number of Top500 systems by processor count (1 up to 64k–128k processors), 1993–2005; systems with ever larger processor counts take a growing share.]
       In your Laptop (Intel just announced an 80-core, 1 Teraflop chip)
                   Challenges
• Increasing parallelism
• Exponentially growing gaps between
  – Floating point time << 1/Memory BW << Memory Latency
     • Improving 59%/year vs 23%/year vs 5.5%/year
  – Floating point time << 1/Network BW << Network Latency
     • Improving 59%/year vs 26%/year vs 15%/year
• Heterogeneity (performance and semantics)
• Asynchrony
• Unreliability
         What do users want?
• High performance, ease of use, …
• Survey results at www.netlib.org/lapack-dev
  – Small but interesting sample
  – What matrix sizes do you care about?
     • 1000s: 34%
     • 10,000s: 26%
     • 100,000s or 1Ms: 26%
  – How many processors, on distributed memory?
     • >10: 34%,     >100: 31%,       >1000: 19%
  – Do you use more than double precision?
     • Sometimes or frequently: 16%
  – Would Automatic Memory Allocation help?
     • Very useful: 72%,   Not useful: 14%
 Goals of next Sca/LAPACK
1. Better algorithms
     – Faster, more accurate
2. Expand contents
     – More functions, more parallel implementations
3.   Automate performance tuning
4.   Improve ease of use
5.   Better software engineering
6.   Increased community involvement
   Goal 2 – Expanded Content
• Make content of ScaLAPACK mirror LAPACK as much
  as possible (possible class projects)
 Missing Routines in Sca/LAPACK
                                         LAPACK      ScaLAPACK
Linear          LU                       xGESV     PxGESV
Equations       LU + iterative refine    xGESVX    missing
                Cholesky                 xPOSV     PxPOSV
                LDLT                     xSYSV     missing
Least Squares   QR                       xGELS     PxGELS
(LS)            QR+pivot                 xGELSY    missing
                SVD/QR                   xGELSS    missing
                SVD/D&C                  xGELSD    missing (intent?)
                SVD/MRRR                 missing   missing
                QR + iterative refine.   missing   missing
Generalized LS  LS + equality constr.   xGGLSE    missing
                Generalized LM          xGGGLM    missing
                Above + iterative ref.  missing   missing
             More missing routines
                                            LAPACK       ScaLAPACK
Symmetric EVD      QR / Bisection+Invit   xSYEV / X   PxSYEV / X
                   D&C                    xSYEVD      PxSYEVD
                   MRRR                   xSYEVR      missing
Nonsymmetric EVD   Schur form             xGEES / X   missing (driver)
                   Vectors too            xGEEV /X    missing
SVD                QR                     xGESVD      PxGESVD
                   D&C                    xGESDD      missing (intent?)
                   MRRR                   missing     missing
                   Jacobi                 missing     missing
Generalized        QR / Bisection+Invit   xSYGV / X   PxSYGV / X
Symmetric EVD      D&C                    xSYGVD      missing (intent?)
                   MRRR                   missing     missing
Generalized        Schur form             xGGES / X   missing
Nonsymmetric EVD   Vectors too            xGGEV / X   missing
Generalized SVD    Kogbetliantz           xGGSVD      missing (intent)
                   MRRR                   missing     missing
    Goal 1: Better Algorithms
• Faster
  – But provide “usual” accuracy, stability
  – … Or accurate for an important subclass
• More accurate
  – But provide “usual” speed
  – … Or at any cost
  Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
   – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
   – Byers / Mathias / Braman
• Faster Hessenberg, tridiagonal, bidiagonal reductions:
   – van de Geijn/Quintana, Howell / Fulton, Bischof / Lang
• Extensions to QZ:
   – Kågström / Kressner
• Recursive blocked layouts for packed formats:
   – Gustavson / Kågström / Elmroth / Jonsson
  Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
  –   Parlett / Dhillon / Voemel / Marques / Willems
  –   Faster and more accurate than previous algorithms
  –   SIAM SIAG/LA Prize in 2006
  –   New sequential, first parallel versions out in 2006
Flop Counts of Eigensolvers
        (2.2 GHz Opteron + ACML)
Flop Count Ratios of Eigensolvers
           (2.2 GHz Opteron + ACML)
Run Time Ratios of Eigensolvers
          (2.2 GHz Opteron + ACML)
MFlop Rates of Eigensolvers
        (2.2 GHz Opteron + ACML)
                Exploiting GPUs
• Numerous emerging co-processors
  – Cell, SSE, Grape, GPU, “physics coprocessor,” …
• When can we exploit them?
  – Little help if memory is the bottleneck
  – Various attempts to use GPUs for dense linear algebra
• Bisection on GPUs for symmetric tridiagonal
  eigenproblem
  – Evaluate Count(x) = #(evals < x) for many x
  – Very little memory traffic
  – Speedups up to 100x (Volkov)
     • 43 Gflops on ATI Radeon X1900 vs running on 2.8 GHz Pentium 4
     • Overall eigenvalue solver 6.8x faster
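A minimal NumPy sketch of the counting kernel that bisection parallelizes (not Volkov's GPU code): the LDLᵀ Sturm-sequence count below evaluates Count(x) for many shifts x in lock-step, the data-parallel pattern a GPU exploits. The function name and test matrix are illustrative.

import numpy as np

def eig_count_below(d, e, shifts, tiny=1e-300):
    # d: (n,) diagonal and e: (n-1,) off-diagonal of a symmetric tridiagonal T.
    # Returns Count(x) = #(eigenvalues of T < x) for every shift x, using the
    # LDL^T (Sturm sequence) recurrence on T - x*I.  The loop over the matrix
    # is sequential, but all shifts advance together (vectorized over x).
    x = np.asarray(shifts, dtype=np.float64)
    q = d[0] - x
    count = (q < 0).astype(np.int64)
    for i in range(1, len(d)):
        q = np.where(np.abs(q) < tiny, -tiny, q)   # guard against division by zero
        q = d[i] - x - e[i - 1] ** 2 / q
        count += (q < 0)
    return count

# Example: 1-D Laplacian tridiagonal, eigenvalues in (0, 4), half of them below 2.
n = 1000
d = 2.0 * np.ones(n)
e = -np.ones(n - 1)
print(eig_count_below(d, e, [0.0, 2.0, 4.0]))   # -> [0 500 1000]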
Parallel Runtimes of Eigensolvers
         (2.4 GHz Xeon Cluster + Ethernet)
Goal 1b – More Accurate Algorithms
• Iterative refinement for Ax=b, least squares
  – “Promise” the right answer for O(n²) additional cost
• Jacobi-based SVD
  – Faster than QR, can be arbitrarily more accurate
• Arbitrary precision versions of everything
  – Using your favorite multiple precision package
Goal 1b – More Accurate Algorithms
• Iterative refinement for Ax=b, least squares
  – “Promise” the right answer for O(n²) additional cost
  – Iterative refinement with extra-precise residuals
     • Newton’s method applied to solving f(x) = Ax − b = 0
  – Extra-precise BLAS needed (LAWN#165)
           More Accurate: Solve Ax=b
[Plot: error of conventional Gaussian elimination vs. Gaussian elimination with extra-precise iterative refinement, on a scale running from ε up to 1/ε, where ε = √n · 2⁻²⁴.]
 Iterative Refinement: for speed
• What if double precision much slower than
  single?
  – Cell processor in Playstation 3
     • 256 GFlops single, 25 GFlops double

  – Pentium SSE2: single twice as fast as double
• Given Ax=b in double precision
  – Factor in single, do refinement in double
  – If κ(A) < 1/ε_single, runs at speed of single
• 1.9x speedup on Intel-based laptop
• Applies to many algorithms, if difference large
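A rough NumPy/SciPy sketch of the idea (not the LAPACK or Cell code): the O(n³) factorization runs in single precision, the O(n²) refinement in double. The stopping test and iteration cap are illustrative choices.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def solve_mixed_precision(A, b, max_iter=10):
    # Factor A once in float32, then refine the solution in float64.
    # Works when kappa(A) is small relative to 1/eps_single (~1.7e7).
    lu, piv = lu_factor(A.astype(np.float32))
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                        # residual in double
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += d
        if np.linalg.norm(d) <= np.finfo(float).eps * np.linalg.norm(x):
            break
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)
x = solve_mixed_precision(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))         # ~ double-precision level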
   Goal 2 – Expanded Content
• Make content of ScaLAPACK mirror LAPACK as much
  as possible
• New functions (highlights)
   – Updating / downdating of factorizations:
      • Stewart, Langou
   – More generalized SVDs:
      • Bai , Wang
   – More generalized Sylvester/Lyapunov eqns:
      • Kågström, Jonsson, Granat
   – Structured eigenproblems
      • O(n²) version of roots(p)
         – Gu, Chandrasekaran, Bindel et al
      • Selected matrix polynomials:
         – Mehrmann
      New algorithm for roots(p)
• To find roots of polynomial p
   – roots(p) does eig(C(p)), where C(p) is the companion matrix

              -p1  -p2  …  -pd
               1    0   …   0
      C(p) =   0    1   …   0
               …    …   …   …
               0    …   1   0

   – Costs O(n³), stable, reliable
• O(n²) alternatives
   – Newton, Jenkins-Traub, Laguerre, …
   – Stable? Reliable?
• New: Exploit “semiseparable” structure of C(p)
   – Low rank of any submatrix of the upper triangle of C(p)
     preserved under QR iteration
   – Complexity drops from O(n³) to O(n²), stable in practice
• Related work: Gemignani, Bini, Pan, et al
• Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David
  Bindel, David Garmire, Jim Demmel
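For concreteness, a small NumPy sketch of the O(n³) baseline above: build the companion matrix C(p) and hand it to a dense eigensolver (which is also what numpy.roots does); the O(n²) semiseparable-QR variant is not shown. The function name is illustrative.

import numpy as np

def roots_via_companion(p):
    # Roots of p(z) = p[0]*z^d + p[1]*z^(d-1) + ... + p[d] as eigenvalues
    # of the companion matrix C(p) shown above: O(n^3) via dense QR iteration.
    p = np.asarray(p, dtype=np.float64)
    p = p / p[0]                      # make the polynomial monic
    d = len(p) - 1
    C = np.zeros((d, d))
    C[0, :] = -p[1:]                  # first row: -p1 ... -pd
    C[1:, :-1] = np.eye(d - 1)        # ones on the subdiagonal
    return np.linalg.eigvals(C)

# (z-1)(z-2)(z-3) = z^3 - 6z^2 + 11z - 6
print(np.sort(roots_via_companion([1, -6, 11, -6])))   # ~ [1, 2, 3]
print(np.sort(np.roots([1, -6, 11, -6])))              # same companion-matrix approach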
 Goal 3 – Automate Performance Tuning

• Widely used in performance tuning of Kernels
  –   ATLAS (PhiPAC) – BLAS - www.netlib.org/atlas
  –   FFTW – Fast Fourier Transform – www.fftw.org
  –   Spiral – signal processing - www.spiral.net
  –   OSKI – Sparse BLAS – bebop.cs.berkeley.edu/oski
Optimizing blocksizes for mat-mul

[Plot of matrix-multiply performance over candidate block sizes omitted.]

Finding a Needle in a Haystack – So Automate
 Goal 3 – Automate Performance Tuning

• Widely used in performance tuning of Kernels
• 1300 calls to ILAENV() to get block sizes, etc.
  – Never been systematically tuned
• Extend automatic tuning techniques of ATLAS, etc.
  to these other parameters
  – Automation important as architectures evolve
• Convert ScaLAPACK data layouts on the fly
  – Important for ease-of-use too
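A toy Python sketch of the empirical-search idea behind ATLAS-style tuning, applied to a single block-size parameter; the kernel and the candidate list stand in for the many ILAENV parameters.

import time
import numpy as np

def blocked_matmul(A, B, nb):
    # Textbook blocked matrix multiply; nb plays the role of an ILAENV block size.
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(0, n, nb):
        for j in range(0, n, nb):
            for k in range(0, n, nb):
                C[i:i+nb, j:j+nb] += A[i:i+nb, k:k+nb] @ B[k:k+nb, j:j+nb]
    return C

def tune_block_size(n=768, candidates=(16, 32, 64, 128, 256)):
    # Empirical search: time each candidate on this machine and keep the fastest.
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    timings = []
    for nb in candidates:
        t0 = time.perf_counter()
        blocked_matmul(A, B, nb)
        timings.append((time.perf_counter() - t0, nb))
    return min(timings)[1]

print("best block size on this machine:", tune_block_size())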
        ScaLAPACK Data Layouts

[Diagrams of the four distributions: 1D Block, 1D Cyclic, 1D Block Cyclic, and 2D Block Cyclic.]
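A small Python sketch of the ownership rule behind these layouts, assuming the usual ScaLAPACK convention that the distribution starts at process (0, 0); the block sizes and grid shape in the example are made up.

def owner(i, j, mb, nb, prow, pcol):
    # Process coordinates owning global entry (i, j) in a 2-D block-cyclic
    # layout with mb x nb blocks on a prow x pcol process grid (0-based).
    return (i // mb) % prow, (j // nb) % pcol

# 2x3 process grid, 2x2 blocks: entry (5, 7) lies in block (2, 3),
# which block-cyclic wrapping assigns to process (2 % 2, 3 % 3) = (0, 0).
print(owner(5, 7, mb=2, nb=2, prow=2, pcol=3))   # -> (0, 0)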
                Execution time of PDGESV for various grid shapes

[Surface plot: time in seconds (0–100) vs. problem size (1000–10000) and process-grid shape (1x60, 2x30, 3x20, 4x15, 5x12, 6x10).]
   Speedups for using a 2D processor grid range from 2x to 8x
   Cost of redistributing from 1D to the best 2D layout: 1% – 10%
   Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster
   with Myrinet interconnect, 2 GB memory
                Conclusions
• Lots to do in Dense Linear Algebra
  – New numerical algorithms
  – Continuing architectural challenges
     • Parallelism, performance tuning
  – Ease of use, software engineering
• Grant support, but success depends on
  contributions from community

• www.netlib.org/lapack-dev
• www.cs.berkeley.edu/~demmel
Extra Slides
     Fast Matrix Multiplication (1)
              (Cohn, Kleinberg, Szegedy, Umans)
• Can think of fast convolution of polynomials p, q as
   – Map p (resp. q) into the group algebra Σ_i p_i z^i ∈ C[G] of the cyclic group G = { z^i }
   – Multiply elements of C[G] (use divide & conquer = FFT); see the sketch after this slide
   – Extract coefficients
• For matrix multiply, need a non-abelian group satisfying the triple
  product property
   – There are subsets X, Y, Z of G where xyz = 1 with
     x ∈ X, y ∈ Y, z ∈ Z  ⇒  x = y = z = 1
   – Map matrix A into the group algebra via Σ_{x,y} A_xy · x⁻¹y,
     and B into Σ_{y′,z} B_y′z · y′⁻¹z
   – Since x⁻¹y · y′⁻¹z = x⁻¹z iff y = y′, we get Σ_y A_xy B_yz = (AB)_xz
• Search for fast algorithms reduced to search for groups with
  certain properties
   – Fastest algorithm so far is O(n^2.38), same as Coppersmith/Winograd
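The promised sketch of the cyclic-group case, in NumPy: coefficient vectors are treated as elements of C[G] for a cyclic group G, multiplied via the FFT, and the product's coefficients read off; zero-padding avoids wrap-around so the result is the ordinary polynomial product. Function names are illustrative.

import numpy as np

def multiply_in_CG(p, q):
    # Product of two elements of C[G], G cyclic of order N = len(p):
    # the FFT diagonalizes multiplication in C[G] (the divide & conquer step).
    return np.fft.ifft(np.fft.fft(p) * np.fft.fft(q)).real

def poly_multiply(p, q):
    # Ordinary polynomial product via the cyclic group: zero-pad so the
    # wrap-around of the cyclic convolution never mixes coefficients.
    N = len(p) + len(q) - 1
    return multiply_in_CG(np.pad(p, (0, N - len(p))),
                          np.pad(q, (0, N - len(q))))

# (1 + 2z)(3 + 4z) = 3 + 10z + 8z^2
print(np.round(poly_multiply([1, 2], [3, 4]), 12))   # [ 3. 10.  8.]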
       Fast Matrix Multiplication (2)
            (Cohn, Kleinberg, Szegedy, Umans)

1.   Embed A, B in group algebra (exact)
2.   Perform FFT (roundoff)
3.   Reorganize results into new matrices (exact)
4.   Multiple new matrices recursively (roundoff)
5.   Reorganize results into new matrices (exact)
6.   Perform IFFT (roundoff)
7.   Extract C = AB from group algebra (exact)
    Fast Matrix Multiplication (3)
            (Demmel, Dumitriu, Holtz, Kleinberg)

•   Thm 1: Any algorithm of this class for C = AB
    is “numerically stable”
    –   || C_comp – C || ≤ c · n^d · ε · || A || · || B || + O(ε²)
    –   c and d are “modest” constants
    –   Like Strassen

•   Let ω be the exponent of matrix multiplication,
    i.e. no algorithm is faster than O(n^ω).

•   Thm 2: For all η > 0 there exists an algorithm
    with complexity O(n^(ω+η)) that is numerically
    stable in the sense of Thm 1.
     Commodity Processor Trends
                   Annual     Typical value       Predicted value     Typical value
                   increase   in 2006             in 2010             in 2020

 Single-chip
 floating-point    59%        4 GFLOP/s           32 GFLOP/s          3300 GFLOP/s
 performance

 Memory bus        23%        1 GWord/s           3.5 GWord/s         27 GWord/s
 bandwidth                    = 0.25 word/flop    = 0.11 word/flop    = 0.008 word/flop

 Memory latency    (5.5%)     70 ns               50 ns               28 ns
                              = 280 FP ops        = 1600 FP ops       = 94,000 FP ops
                              = 70 loads          = 170 loads         = 780 loads
     Will our algorithms run at a high fraction of peak?


Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222
   pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.
                  Challenges
• For all large scale computing, not just
  linear algebra!
• Example … your laptop
• Exponentially growing gaps between
  – Floating point time << 1/Memory BW << Memory Latency

  – Floating point time << 1/Network BW << Network Latency
          Parallel Processor Trends
                   Annual     Typical value       Predicted value
                   increase   in 2004             in 2010

 # Processors      20%        4,000               12,000

 Network           26%        65 MWord/s          260 MWord/s
 bandwidth                    = 0.03 word/flop    = 0.008 word/flop

 Network           (15%)      5 µs                2 µs
 latency                      = 20K FP ops        = 64K FP ops



     Will our algorithms scale up to more processors?


Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222
   pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
  – Parlett / Dhillon / Voemel / Marques / Willems
  – Faster and more accurate than previous algorithms
  – New sequential, first parallel versions out in 2006
Timing of Eigensolvers
(1.2 GHz Athlon, only matrices where time > .1 sec)
Timing of Eigensolvers
    (only matrices where time > .1 sec)
Accuracy Results (old vs new MRRR)
   max_i ||T q_i – λ_i q_i|| / (n ε)        ||Q Qᵀ – I|| / (n ε)
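A NumPy sketch of the two quantities plotted here, for any computed tridiagonal eigendecomposition; the test matrix and the use of numpy.linalg.eigh in place of MRRR are stand-ins, and the Frobenius norm is used for the orthogonality term.

import numpy as np

def accuracy_metrics(T, Q, lam):
    # Worst scaled eigenpair residual and departure from orthogonality,
    # both normalized by n * eps as in the plots above.
    n = T.shape[0]
    eps = np.finfo(T.dtype).eps
    resid = np.max(np.linalg.norm(T @ Q - Q * lam, axis=0)) / (n * eps)
    orth = np.linalg.norm(Q @ Q.T - np.eye(n)) / (n * eps)
    return resid, orth

# Stand-in: 1-D Laplacian tridiagonal, solved with LAPACK via numpy.
n = 200
T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
lam, Q = np.linalg.eigh(T)
print(accuracy_metrics(T, Q, lam))   # both should be modest for a stable solver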
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
   – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
   – Byers / Mathias / Braman
• Extensions to QZ:
   – Kågström / Kressner
• Faster Hessenberg, tridiagonal, bidiagonal reductions:
   – van de Geijn/Quintana, Howell / Fulton, Bischof / Lang
   – Full nonsymmetric eigenproblem: n=1500: 3.43x faster
      • HQR: 5x faster, Reduction: 14% faster
   – Bidiagonal Reduction (LAWN#174): n=2000: 1.32x faster
   – Sequential versions out in 2006
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
   – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
   – Byers / Mathias / Braman

• Faster Hessenberg, tridiagonal, bidiagonal reductions:
   – van de Geijn/Quintana, Howell / Fulton, Bischof / Lang
• Extensions to QZ:
   – Kågström / Kressner
   – LAPACK Working Note (LAWN) #173
   – On 26 real test matrices, speedups up to 11.9x,   4.4x average
 Goal 4: Improved Ease of Use
• Which do you prefer?
                    A\B
CALL PDGESV( N ,NRHS, A, IA, JA, DESCA,
 IPIV, B, IB, JB, DESCB, INFO)

CALL PDGESVX( FACT, TRANS, N ,NRHS, A,
 IA, JA, DESCA, AF, IAF, JAF, DESCAF, IPIV,
 EQUED, R, C, B, IB, JB, DESCB, X, IX, JX,
 DESCX, RCOND, FERR, BERR, WORK,
 LWORK, IWORK, LIWORK, INFO)
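The same contrast from the Python side, as an illustration: SciPy's high-level solve plays the role of A\B, while calling the underlying LAPACK driver directly exposes the factorization, pivots, and INFO flag.

import numpy as np
from scipy.linalg import solve, lapack

A = np.random.rand(4, 4)
b = np.random.rand(4)

# "A \ b" style: the wrapper picks the driver, allocates workspace,
# and checks errors for you.
x_easy = solve(A, b)

# Expert style: call the LAPACK driver yourself and handle its outputs.
lu, piv, x_expert, info = lapack.dgesv(A, b)
assert info == 0
print(np.allclose(x_easy, x_expert))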
   Goal 4: Improved Ease of Use
• Easy interfaces vs access to details
  – Some users want access to all details, because
     • Peak performance matters
     • Control over memory allocation
  – Other users want “simpler” interface
     • Automatic allocation of workspace
     • No universal agreement across systems on “easiest interface”
     • Leave decision to higher level packages
• Keep expert driver / simple driver / computational
  routines
• Add wrappers for other languages
  – Fortran95, Java, Matlab, Python, even C
  – Automatic allocation of workspace
• Add wrappers to convert to “best” parallel layout
 Goal 5: Better SW Engineering:
What could go into Sca/LAPACK?
For all linear algebra problems

     For all matrix structures

          For all data types

              For all architectures and networks

                   For all programming interfaces

                          Produce best algorithm(s) w.r.t.
                           performance and accuracy
                           (including condition estimates, etc)

               Need to prioritize, automate!
    Goal 5: Better SW Engineering
•   How to map multiple SW layers to emerging HW layers?
•   How much better are asynchronous algorithms?
•   Are emerging PGAS languages better?
•   Statistical modeling to limit performance tuning costs, improve use
    of shared clusters

• Only some things understood well enough for automation now
    – Telescoping languages, Bernoulli, Rose, FLAME, …

• Research Plan: explore above design space
• Development Plan to deliver code (some aspects)
    – Maintain core in F95 subset
    – Friendly wrappers for other programming environments
    – Use variety of source control, maintenance, development tools
 Goal 6: Involve the Community
• To help identify priorities
   – More interesting tasks than we are funded to do
   – See www.netlib.org/lapack-dev for list
• To help identify promising algorithms
   – What have we missed?
• To help do the work
   – Bug reports, provide fixes
   – Again, more tasks than we are funded to do
   – Already happening: thank you!
 Accuracy of Eigensolvers
max_i ||T q_i – λ_i q_i|| / (n ε)        ||Q Qᵀ – I|| / (n ε)
   Goal 2 – Expanded Content
• Make content of ScaLAPACK mirror LAPACK as much
  as possible
• New functions (highlights)
   – Updating / downdating of factorizations:
      • Stewart, Langou
   – More generalized SVDs:
      • Bai , Wang
                 New GSVD Algorithm
Given m x n A and p x n B, factor A = U Σ_a X and B = V Σ_b X




 Bai et al, UC Davis                    PSVD, CSD on the way
                      Motivation
• LAPACK and ScaLAPACK are widely used
  – Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks,
    NAG, NEC, SGI, …
  – >63M web hits @ Netlib (incl. CLAPACK, LAPACK95)
     • 35K hits/day
Impact (with NERSC, LBNL)

                 Cosmic Microwave
                Background Analysis,
              BOOMERanG collaboration,
             MADCAP code (Apr. 27, 2000).


                       ScaLAPACK
               Challenges
• For all large scale computing, not just
  linear algebra!
• Example …
               Challenges
• For all large scale computing, not just
  linear algebra!
• Example … your laptop
                                   CPU Trends
• Relative processing power will continue to double every
  18 months
• 256 logical processors per chip in late 2010
[Chart: projected hardware threads per chip and cores per processor chip, 2004–2010.]
                   Challenges
• For all large scale computing, not just
  linear algebra!
• Example … your laptop
• Exponentially growing gaps between
  – Floating point time << 1/Memory BW << Memory Latency
     • Improving 59%/year vs 23%/year vs 5.5%/year
Accuracy of Eigensolvers: Old vs New MRRR
     max_i ||T q_i – λ_i q_i|| / (n ε)        ||Q Qᵀ – I|| / (n ε)
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
  – Parlett / Dhillon / Voemel / Marques / Willems
  – Faster and more accurate than previous algorithms
  – New sequential, first parallel versions out in 2006
  – Numerical evidence shows DC faster if it “deflates” often,
    which is hard to predict in advance.
    So having both algorithms is important.
  – SVD still an open problem
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
   – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
   –   Byers / Mathias / Braman
   –   SIAM SIAG/LA Prize in 2003
   –   Sequential version out in 2006
   –   More on performance later
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
   – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
   – Byers / Mathias / Braman

• Faster Hessenberg, tridiagonal, bidiagonal reductions:
   – van de Geijn/Quintana, Howell / Fulton, Bischof / Lang
   – Full nonsymmetric eigenproblem: n=1500: 3.43x faster
      • HQR: 5x faster, Reduction: 14% faster
            Dense nonsymmetric eigenvalue problem

[Speedup plots, panels “No vectors” and “All vectors”.            Source: Julien Langou]
ARCH: Intel Pentium 4 ( 3.4 GHz )
F77 : GNU Fortran (GCC) 3.4.4
BLAS: libgoto_prescott32p-r1.00.so (one thread)
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
   – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
   – Byers / Mathias / Braman

• Faster Hessenberg, tridiagonal, bidiagonal reductions:
   – van de Geijn/Quintana, Howell / Fulton, Bischof / Lang
   – Full nonsymmetric eigenproblem: n=1500: 3.43x faster
      • HQR: 5x faster, Reduction: 14% faster
   – Bidiagonal Reduction (LAWN#174): n=2000: 1.32x faster
   – Sequential versions out in 2006
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
   – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
   – Byers / Mathias / Braman

• Faster Hessenberg, tridiagonal, bidiagonal reductions:
   – van de Geijn/Quintana, Howell / Fulton, Bischof / Lang
• Extensions to QZ:
   – Kågström / Kressner
   – LAPACK Working Note (LAWN) #173
   – On 26 real test matrices, speedups up to 11.9x,   4.4x average
   Comparison of ScaLAPACK QR
    and new parallel multishift QZ
Execution times in secs for
4096 x 4096 random problems
Ax = sx and Ax = sBx,
using processor grids including
1-16 processors (1x1, 2x1, 2x2, 4x2, 4x4).

[Bar chart: // QR vs. // QZ times, 0–6000 seconds, for each grid shape.]

Note:
work(QZ) > 2 * work(QR) but
Time(// QZ) << Time(// QR)!!

Times include cost for
computing eigenvalues and
transformation matrices.

   Adlerborn-Kågström-Kressner, SIAM PP’2006
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
   – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
   – Byers / Mathias / Braman
• Faster Hessenberg, tridiagonal, bidiagonal reductions:
   – van de Geijn/Quintana, Howell / Fulton, Bischof / Lang
• Extensions to QZ:
   – Kågström / Kressner
• Recursive blocked layouts for packed formats:
   – Gustavson / Kågström / Elmroth / Jonsson
   – SIAM Review Article 2004
Recursive Layouts and Algorithms
Still merges multiple elimination steps into a
few BLAS 3 operations




Best speedups for packed storage of symmetric
matrices
Goal 1b – More Accurate Algorithms
• Iterative refinement for Ax=b, least squares
  –   “Promise” the right answer for O(n²) additional cost
  –   Iterative refinement with extra-precise residuals
  –   Extra-precise BLAS needed (LAWN#165)
  –   “Guarantees” based on condition number estimates
        • Condition estimate < 1/(√n · ε)  ⇒  reliable answer
          and tiny error bounds
        • No bad bounds in 6.2M tests
        • Can condition estimators lie?
Can condition estimators lie?




• Yes, but rarely, unless they cost as much as
  matrix multiply = cost of LU factorization
   – Demmel/Diament/Malajovich (FCM2001)
• But what if matrix multiply costs O(n²)?
   –   More later
Goal 1b – More Accurate Algorithms
• Iterative refinement for Ax=b, least squares
  – “Promise” the right answer for O(n²) additional cost
  – Iterative refinement with extra-precise residuals
  – Extra-precise BLAS needed (LAWN#165)
  – “Guarantees” based on condition number estimates
  – Get tiny componentwise bounds too
      • Each xi accurate
      • Slightly different condition number
  – Extends to Least Squares
  – Release in 2006
  Goal 1b – More Accurate Algorithms
• Iterative refinement for Ax=b, least squares
  – Promise the right answer for O(n²) additional cost
• Jacobi-based SVD
  –   Faster than QR, can be arbitrarily more accurate
  –   LAWNs #169, 170
  –   Can be arbitrarily more accurate on tiny singular values
  –   Yet faster than QR iteration!
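A bare-bones one-sided Jacobi SVD sketch, just to show the flavor of the method; it omits the preconditioning, pivoting, and stopping refinements of the LAWN 169/170 algorithms, and the tolerance and sweep limit are illustrative.

import numpy as np

def jacobi_svd(A, tol=1e-12, max_sweeps=30):
    # One-sided Jacobi: rotate pairs of columns of U (initially A) until all
    # columns are mutually orthogonal; the column norms are then the singular
    # values, and the accumulated rotations give V, so A = U diag(s) V^T.
    U = np.array(A, dtype=np.float64)
    m, n = U.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for p in range(n - 1):
            for q in range(p + 1, n):
                app = U[:, p] @ U[:, p]
                aqq = U[:, q] @ U[:, q]
                apq = U[:, p] @ U[:, q]
                if abs(apq) <= tol * np.sqrt(app * aqq):
                    continue                        # pair already (nearly) orthogonal
                converged = False
                tau = (aqq - app) / (2.0 * apq)     # 2x2 Jacobi rotation
                t = 1.0 / (tau + np.copysign(np.sqrt(1.0 + tau * tau), tau))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                J = np.array([[c, s], [-s, c]])
                U[:, [p, q]] = U[:, [p, q]] @ J
                V[:, [p, q]] = V[:, [p, q]] @ J
        if converged:
            break
    sigma = np.linalg.norm(U, axis=0)
    return U / sigma, sigma, V

A = np.random.rand(6, 4)
U, s, V = jacobi_svd(A)
print(np.allclose(A, U @ np.diag(s) @ V.T))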
Goal 1b – More Accurate Algorithms
• Iterative refinement for Ax=b, least squares
  – Promise the right answer for O(n²) additional cost
• Jacobi-based SVD
  – Faster than QR, can be arbitrarily more accurate
• Arbitrary precision versions of everything
  – Using your favorite multiple precision package
  – Quad, Quad-double, ARPREC, MPFR, …
  – Using Fortran 95 modules
   Goal 2 – Expanded Content
• Make content of ScaLAPACK mirror LAPACK as much
  as possible
• New functions (highlights)
   – Updating / downdating of factorizations:
      • Stewart, Langou
   – More generalized SVDs:
      • Bai , Wang
   – More generalized Sylvester/Lyapunov eqns:
      • Kågström, Jonsson, Granat
   – Structured eigenproblems
      • O(n²) version of roots(p)
         – Gu, Chandrasekaran, Bindel et al
      • Selected matrix polynomials:
         – Mehrmann
• How should we prioritize missing functions?
The Difficulty of Tuning SpMV:
Sparse Matrix Vector Multiply
                // y <-- y + A*x
                for all A(i,j):
                 y(i) += A(i,j) * x(j)
The Difficulty of Tuning SpMV

                // y <-- y + A*x
                for all A(i,j):
                 y(i) += A(i,j) * x(j)


                // Compressed sparse row (CSR)
                for each row i:
                 t=0
                 for k=row[i] to row[i+1]-1:
                   t += A[k] * x[J[k]]
                 y[i] = t


                • Exploit 8x8 dense blocks
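A SciPy sketch of the block-exploiting idea: the same synthetic matrix, whose nonzeros come in 8x8 blocks, stored both in CSR and in block-CSR (BSR); an OSKI-style tuner would time both (and other block sizes) and keep the fastest.

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
nb, bs = 200, 8                         # 200x200 grid of 8x8 blocks
A = sp.lil_matrix((nb * bs, nb * bs))
for bi, bj in zip(rng.integers(0, nb, 3 * nb), rng.integers(0, nb, 3 * nb)):
    A[bi*bs:(bi+1)*bs, bj*bs:(bj+1)*bs] = rng.standard_normal((bs, bs))

A_csr = A.tocsr()                       # one column index per nonzero
A_bsr = A.tobsr(blocksize=(bs, bs))     # one index per 8x8 block: less index
                                        # traffic, dense register-friendly blocks
x = rng.standard_normal(nb * bs)
print(np.allclose(A_csr @ x, A_bsr @ x))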
             Speedups on Itanium 2:
              The Need for Search

[Register-blocking profile (Mflop/s heatmap): best block size 4x2 at 31.1%, reference CSR code at 7.6%.]
SpMV Performance—raefsky3
More Surprises tuning SpMV

              • More complex example

              • Example: 3x3 blocking
                 – Logical grid of 3x3 cells
Extra Work Can Improve
       Efficiency
            • More complex example

            • Example: 3x3 blocking
               – Logical grid of 3x3 cells
               – Pad with zeros
               – “Fill ratio” = 1.5



            • On Pentium III:
              1.5x speedup! (2/3 time)
       Tuning for Workloads
• BiCG - equal mix of A·x and Aᵀ·y
  – 3x1: Ax, Aᵀy = 1053, 343 Mflop/s
  – 3x3: Ax, Aᵀy = 806, 826 Mflop/s


• Higher-level operation - (Ax, Aᵀy) kernel
  – 3x1: 757 Mflop/s
  – 3x3: 1400 Mflop/s
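A plain-Python CSR sketch of why the combined kernel wins: one pass over the stored nonzeros serves both A·x and Aᵀ·y, so A is read once instead of twice; the loop is for clarity, not speed.

import numpy as np
import scipy.sparse as sp

def fused_ax_aty(A_csr, x, w):
    # One sweep over the CSR arrays: each stored A[i, j] is used for both
    # y[i] += A[i, j] * x[j]  and  z[j] += A[i, j] * w[i].
    y = np.zeros(A_csr.shape[0])
    z = np.zeros(A_csr.shape[1])
    indptr, indices, data = A_csr.indptr, A_csr.indices, A_csr.data
    for i in range(A_csr.shape[0]):
        wi = w[i]
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            y[i] += data[k] * x[j]
            z[j] += data[k] * wi
    return y, z

A = sp.random(200, 150, density=0.05, format="csr", random_state=0)
x, w = np.random.rand(150), np.random.rand(200)
y, z = fused_ax_aty(A, x, w)
print(np.allclose(y, A @ x), np.allclose(z, A.T @ w))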
  Optimizations Available in OSKI:
 Optimized Sparse Kernel Interface
• Optimizations for SpMV
   –   Register blocking (RB): up to 4x over CSR
   –   Variable block splitting: 2.1x over CSR, 1.8x over RB
   –   Diagonals: 2x over CSR
   –   Reordering to create dense structure + splitting: 2x over CSR
   –   Symmetry: 2.8x over CSR, 2.6x over RB
   –   Cache blocking: 3x over CSR
   –   Multiple vectors (SpMM): 7x over CSR
   –   And combinations…
• Sparse triangular solve
   – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
   – AAᵀx, AᵀAx: 4x over CSR, 1.8x over RB
   – A²x: 2x over CSR, 1.5x over RB
• Available stand alone or integrated into PETSc

				