The Future of LAPACK and ScaLAPACK

Jim Demmel
UC Berkeley
CScADS Autotuning Workshop
9 July 2007
                   Challenges
• Increasing parallelism
  – From supercomputers to multicore
• Exponentially growing gaps between
  – Floating point time << 1/Memory BW << Memory Latency
     • Improving 59%/year vs 23%/year vs 5.5%/year
  – Floating point time << 1/Network BW << Network Latency
     • Improving 59%/year vs 26%/year vs 15%/year
• Heterogeneity (performance and semantics)
• Asynchrony
• Unreliability
 Goals of next Sca/LAPACK
1. Better algorithms
     – Faster, more accurate
2.   Automate performance tuning
3.   Better software engineering
4.   Improve ease of use
5.   Expand contents
     – More functions, more parallel implementations
6. Increased community involvement
What could go into Sca/LAPACK?
For all linear algebra problems

     For all matrix structures

          For all data types

              For all architectures and networks

                   For all programming interfaces

                          Produce best algorithm(s) w.r.t.
                           performance and accuracy
                           (including condition estimates, etc)

               Need to prioritize, automate!
    How do we best explore this large tuning space?
•   Algorithm tuning space includes
     – Numerous block sizes, not just in underlying BLAS
     – Many possible layers of parallelism, many mappings to HW
     – Different traversals of underlying DAGs
        • Left- and right-looking are just two of many; also asynchronous algorithms
     – “Redundant” algorithms
     – Recursive layouts and algorithms
     – New “optimal” algorithms for variations on standard factorizations
     – New and old eigenvalue algorithms
     – Redundancy, mixed precision, …

•   Is there a concise set of abstractions to describe, generate tuning space?
     – Block matrices, (partial, tree) factorizations, DAGs, …
     – FLAME, CSS, Spiral, Sequoia, Telescoping languages, Bernoulli, Rose, …

•   Challenge: What fraction of dense linear algebra can be done?
     – Lots more than when we started
         • Parallel BLAS -> LU -> other factorizations -> …
     – Most of dense linear algebra?
         • Not eigenvalue algorithms (on compact forms)
         • What fraction of LAPACK can be done?
         • Rest of loop “for all linear algebra problems…”
     – For all interesting architectures…?
                Exploiting GPUs
• Numerous emerging co-processors
  – Cell, SSE, Grape, GPU, “physics coprocessor,” …
• When can we exploit them?
  – Little help if memory is the bottleneck
  – Various attempts to use GPUs for dense linear algebra
• Bisection on GPUs for symmetric tridiagonal
  eigenproblem
  – Evaluate Count(x) = #(eigenvalues < x) for many x (a sketch of this count follows this slide)
  – Very little memory traffic, but much redundant work
  – Speedups up to 100x (Volkov)
     • 43 Gflops on ATI Radeon X1900 vs running on 2.8 GHz Pentium 4
     • Overall eigenvalue solver 6.8x faster
• Port of CLAPACK to NVIDIA underway
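Count(x) in the bisection bullet above is the standard Sturm-sequence count for a symmetric tridiagonal matrix. A minimal sequential NumPy sketch is below (the function name and test data are illustrative; the GPU version evaluates this count for many shifts x in parallel):

    import numpy as np

    def eig_count_below(d, e, x):
        # Number of eigenvalues of the symmetric tridiagonal matrix T
        # (diagonal d, off-diagonal e) that are less than x: count the
        # negative pivots of the LDL^T factorization of T - x*I.
        count = 0
        q = d[0] - x
        if q < 0.0:
            count += 1
        for i in range(1, len(d)):
            if q == 0.0:
                q = np.finfo(float).tiny   # avoid division by zero
            q = d[i] - x - e[i - 1] ** 2 / q
            if q < 0.0:
                count += 1
        return count

    # Evaluate the count at many shifts (done in parallel on the GPU)
    rng = np.random.default_rng(0)
    d, e = rng.standard_normal(1000), rng.standard_normal(999)
    counts = [eig_count_below(d, e, x) for x in np.linspace(-4.0, 4.0, 100)]
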
Iterative Refinement: For Accuracy

[Figure: error of conventional Gaussian elimination vs. error with extra-precise iterative refinement as conditioning grows; axis labels 1/ε and ε, with ε = n^(1/2) · 2^(-24).]
Iterative Refinement: For Speed
• What if double precision much slower than
  single?
  – Cell processor in Playstation 3
     • 256 GFlops single, 25 GFlops double

  – Pentium SSE2: single twice as fast as double
• Given Ax=b in double precision
  – Factor in single, do refinement in double (sketched after this slide)
  – If κ(A) < 1/ε_single, runs at the speed of single
• 8x speedup on Cell, 1.9x on Intel-based laptop
• Applies to many algorithms, if difference large
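A minimal NumPy sketch of the single/double scheme above, assuming a reasonably well-conditioned A (illustration only: it re-solves against the single-precision matrix each step, whereas the real codes factor A once in single with LAPACK and reuse those factors for every refinement solve):

    import numpy as np

    def solve_mixed_precision(A, b, iters=5):
        # Solve in single precision, then refine with double-precision residuals.
        A32 = A.astype(np.float32)
        x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
        for _ in range(iters):
            r = b - A @ x                                     # residual in double
            d = np.linalg.solve(A32, r.astype(np.float32))    # correction in single
            x += d.astype(np.float64)
        return x

    rng = np.random.default_rng(1)
    A, b = rng.standard_normal((500, 500)), rng.standard_normal(500)
    x = solve_mixed_precision(A, b)
    print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))      # near double-precision level
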
New algorithm for roots(p)
• To find roots of polynomial p
  – roots(p) calls eig(C(p)), where C(p) is the companion matrix (a NumPy sketch follows this slide):

               [ -p1  -p2  …  -pd ]
               [   1    0  …    0 ]
        C(p) = [   0    1  …    0 ]
               [   …    …  …    … ]
               [   0    …   1    0 ]

  – Costs O(n^3), stable, reliable
• O(n^2) alternatives
  – Newton, Jenkins-Traub, Laguerre, …
   – Stable? Reliable?
• New: Exploit “semiseparable” structure of C(p)
   – Low rank of any submatrix of upper triangle of C(p)
     preserved under QR iteration
   – Complexity drops from O(n^3) to O(n^2); stable in practice
• Related work: Gemignani, Bini, Pan, et al
• Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David
  Bindel, David Garmire, Jim Demmel
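The O(n^3) baseline above can be written directly in NumPy: build the companion matrix C(p) and call a dense eigensolver, which is essentially what numpy.roots and MATLAB's roots do. A sketch, assuming the monic convention p(x) = x^d + p1 x^(d-1) + … + pd:

    import numpy as np

    def roots_via_companion(p):
        # Roots of the monic polynomial x^d + p[0] x^(d-1) + ... + p[d-1]
        # as eigenvalues of its companion matrix (O(d^3) dense eig).
        d = len(p)
        C = np.zeros((d, d))
        C[0, :] = -np.asarray(p, dtype=float)   # first row: -p1 ... -pd
        C[1:, :-1] = np.eye(d - 1)              # subdiagonal of ones
        return np.linalg.eigvals(C)

    # (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6  ->  roots 1, 2, 3
    print(np.sort(roots_via_companion([-6.0, 11.0, -6.0])))
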
    New optimal algorithms (1)
• Long known (Strassen) how to invert
  matrices as fast as matmul, but unstably:
     [ T11  T12 ]^(-1)     [ T11^(-1)   -T11^(-1) T12 T22^(-1) ]
     [  0   T22 ]        = [    0                 T22^(-1)     ]

• New results
  – Can make solving Ax=b, least squares,
    eigenvalue problems as fast as fastest
    matmul, and stable
• “Optimal” arithmetic complexity, but fast?
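The block formula above can be applied recursively, so that inverting a triangular matrix reduces to half-size inversions plus matrix multiplies. A small NumPy check of the identity (illustration of the recursion only; the slide's point is that the new results make such matmul-speed methods stable):

    import numpy as np

    def block_triangular_inverse(T):
        # Invert upper (block) triangular T = [[T11, T12], [0, T22]] recursively,
        # using only diagonal-block inversions and matrix multiplications.
        n = T.shape[0]
        if n == 1:
            return 1.0 / T
        k = n // 2
        T11, T12, T22 = T[:k, :k], T[:k, k:], T[k:, k:]
        S11 = block_triangular_inverse(T11)
        S22 = block_triangular_inverse(T22)
        top = np.hstack([S11, -S11 @ T12 @ S22])
        bot = np.hstack([np.zeros((n - k, k)), S22])
        return np.vstack([top, bot])

    T = np.triu(np.random.default_rng(2).standard_normal((8, 8))) + 4.0 * np.eye(8)
    print(np.linalg.norm(block_triangular_inverse(T) @ T - np.eye(8)))
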
      New optimal algorithms (2)
• Communication costs of current ScaLAPACK
  – LU & QR: O(n log p) messages
  – Cholesky: O(n/b log p) messages
• New “LU” and “QR” algorithms
  – As few messages as Cholesky
  – “QR” returns QR but represented differently
  – “LU” “equivalent” to LU in complexity, stability TBD
• “Optimal” communication complexity, but fast?
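One way to read "QR represented differently" above is a TSQR-style reduction tree: QR-factor each row block locally, stack the small R factors, and factor once more, keeping Q implicit as the tree of local factors. A two-level NumPy sketch for a tall-skinny matrix (block count and matrix sizes are illustrative):

    import numpy as np

    def tsqr_R(A, nblocks=4):
        # R factor of tall-skinny A via a flat reduction tree:
        # local QR of each row block, then one QR of the stacked R factors.
        local_Rs = [np.linalg.qr(B, mode='r') for B in np.array_split(A, nblocks)]
        return np.linalg.qr(np.vstack(local_Rs), mode='r')

    A = np.random.default_rng(3).standard_normal((10000, 50))
    R_tree, R_ref = tsqr_R(A), np.linalg.qr(A, mode='r')
    print(np.allclose(np.abs(R_tree), np.abs(R_ref)))   # R is unique up to row signs
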
 Goal 2 – Automate Performance Tuning


• 1300 calls to ILAENV() to get block sizes, etc.
  – Never been systematically tuned
• Extend automatic tuning techniques of ATLAS, etc.
  to these other parameters
  – Automation important as architectures evolve
• Convert ScaLAPACK data layouts on the fly
  – Important for ease-of-use too
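The flavor of the search is an empirical sweep in the spirit of ATLAS: time a block-size-dependent kernel over candidate block sizes and keep the fastest. A toy sketch (the blocked multiply and candidate sizes below are stand-ins, not LAPACK's actual ILAENV parameters):

    import time
    import numpy as np

    def blocked_matmul(A, B, nb):
        # Toy blocked matrix multiply; stands in for a block-size-dependent kernel.
        n = A.shape[0]
        C = np.zeros_like(A)
        for i in range(0, n, nb):
            for j in range(0, n, nb):
                for k in range(0, n, nb):
                    C[i:i+nb, j:j+nb] += A[i:i+nb, k:k+nb] @ B[k:k+nb, j:j+nb]
        return C

    def tune_block_size(n=512, candidates=(16, 32, 64, 128)):
        # Time each candidate block size and return the fastest.
        rng = np.random.default_rng(4)
        A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
        timings = {}
        for nb in candidates:
            t0 = time.perf_counter()
            blocked_matmul(A, B, nb)
            timings[nb] = time.perf_counter() - t0
        return min(timings, key=timings.get), timings

    print(tune_block_size())
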
ScaLAPACK Data Layouts

[Figure: the four ScaLAPACK data layouts: 1D block, 1D cyclic, 1D block cyclic, and 2D block cyclic.]
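For reference, the 2D block-cyclic layout assigns global block (I, J) to process (I mod Pr, J mod Pc). A tiny sketch of the ownership function (block size and process-grid shape below are illustrative):

    import numpy as np

    def owner_2d_block_cyclic(i, j, nb, p_rows, p_cols):
        # Process coordinates owning global entry (i, j) under a 2D
        # block-cyclic layout with square nb-by-nb blocks.
        return (i // nb) % p_rows, (j // nb) % p_cols

    # Ownership map of an 8x8 matrix, block size 2, on a 2x2 process grid
    n, nb, pr, pc = 8, 2, 2, 2
    owners = np.array([[owner_2d_block_cyclic(i, j, nb, pr, pc) for j in range(n)]
                       for i in range(n)])
    print(owners[:, :, 0] * pc + owners[:, :, 1])   # flat process id per entry
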
[Figure: execution time (seconds) of PDGESV vs. problem size (1000 to 10000) for process grid shapes 1x60, 2x30, 3x20, 4x15, 5x12, and 6x10.]

Speedups from using a 2D processor grid range from 2x to 8x.
Cost of redistributing from a 1D to the best 2D layout: 1% to 10%.

Times obtained on 60 processors: a cluster of dual 1.4 GHz AMD Opteron nodes with Myrinet interconnect and 2 GB memory.
Intel’s Clovertown Quad Core

Three implementations of LU factorization on a quad-core, two-socket board (8 threads):
  1. LAPACK (BLAS fork-join parallelism)
  2. ScaLAPACK (message passing using memory copy)
  3. DAG-based (dynamic scheduling)

[Figure: Mflop/s vs. problem size (1000 to 15000) for the three implementations; 8-core experiments.]

Source: Jack Dongarra
                     Motivation
• LAPACK and ScaLAPACK are widely used
  – Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks,
    NAG, NEC, SGI, …
  – >72M web hits @ Netlib (incl. CLAPACK, LAPACK95)
     • 35K hits/day
• Many ways to improve them, based on
  –   Own algorithmic research
  –   Enthusiastic participation of research community
  –   Opportunities and demands of new architectures
  –   Autotuning
• New releases planned (NSF support)
  – LAPACK 3.1.1 released Feb 2007
                      Participants
• UC Berkeley:
   – Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li,
     Osni Marques, Christof Voemel, David Bindel, Yozo Hida,
     Jason Riedy, undergrads…
• U Tennessee, Knoxville
   – Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek,
     Stan Tomov, Alfredo Buttari, Jakub Kurzak
• Other Academic Institutions
   – UT Austin, U Colorado, UC Davis, Florida IT, U Kansas, U Maryland,
     North Carolina SU, San Jose SU, UC Santa Barbara
   – TU Berlin, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid,
     U Manchester, U Umeå, U Wuppertal, U Zagreb
• Research Institutions
   – CERFACS, LBL, INRIA
• Industrial Partners
   – Cray, HP, Intel, Interactive Supercomputing, MathWorks, NAG, SGI