Adaptive Strassen and ATLAS’s DGEMM


               Paolo D’Alberto (CMU)
                        and
               Alexandru Nicolau (UCI)

12/2/2005               HPC Asia         1
The Problem: Matrix Computations

• The evolution of systems is modeled by matrix
  computations

• The prediction and evaluation of such models (of
  complex systems) is fundamental in scientific
  computing.

     – For example, the solution of systems of linear equations or of
       least-squares problems.

The Problem: BLAS
• The Basic Linear Algebra Subprograms (BLAS) are an interface
  describing a set of (basic) matrix and vector
  computations
     – Historically, the BLAS was a set of algorithms


• Libraries implementing the BLAS are the backbone
  of today’s high-performance computations
     – For example, ScaLAPACK is built on top of them


• Examples: ESSL, PHiPAC and ATLAS
The Problem: ATLAS
• Implementations of the BLAS-3 routines are based on
  matrix multiplication (MM)

• In practice, ATLAS automatically generates a
  custom-tailored MM:
     – It probes the system
     – It tailors a kernel of MM to a specific system
     – It uses the MM as a basic routine for the other
       BLAS-3 routines
Matrix Multiplication (basics)

[ C0  C1 ]     [ A0  A1 ]     [ B0  B1 ]
[        ]  =  [        ]  *  [        ]
[ C2  C3 ]     [ A2  A3 ]     [ B2  B3 ]

C0 = A0 B0 + A1 B2       C1 = A0 B1 + A1 B3

C2 = A2 B0 + A3 B2       C3 = A2 B1 + A3 B3
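
The four quadrant formulas above can be checked with a small sketch. It is plain Python on lists of rows, written only for illustration: the helper names (`matmul`, `add`, `quadrants`, `block_matmul`) are hypothetical, the size is assumed even, and none of this reflects how ATLAS implements its kernel.

```python
def matmul(X, Y):
    """Classic O(n^3) product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def quadrants(M):
    """Split an even-size square matrix into M0 (top-left), M1, M2, M3."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def block_matmul(A, B):
    """C = A * B computed quadrant by quadrant (8 half-size products)."""
    A0, A1, A2, A3 = quadrants(A)
    B0, B1, B2, B3 = quadrants(B)
    C0 = add(matmul(A0, B0), matmul(A1, B2))
    C1 = add(matmul(A0, B1), matmul(A1, B3))
    C2 = add(matmul(A2, B0), matmul(A3, B2))
    C3 = add(matmul(A2, B1), matmul(A3, B3))
    # Stitch the quadrants back together: top rows, then bottom rows.
    return [l + r for l, r in zip(C0, C1)] + [l + r for l, r in zip(C2, C3)]
```

The block formulation performs the same arithmetic as the classic algorithm; it only changes the order of the operations.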



The Problem: MM
• ATLAS uses this classic matrix multiply
     – For square matrices of size n x n, the algorithm takes O(n^3) operations
     – It achieves 80-90% of peak performance


• We apply Strassen’s algorithm to large problems
     – Because it reduces the number of computations (thus
       shortening the execution time)

• We investigate the effects on single-processor
  systems
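
To make the saving concrete: Strassen's algorithm computes the four quadrants of C from seven half-size products instead of the eight used by the classic block formulation. The sketch below is a textbook version in plain Python, not the authors' code. The function names and the `cutoff` parameter are hypothetical, and odd sizes simply fall back to the classic product.

```python
def matmul(X, Y):
    """Classic O(n^3) product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def quadrants(M):
    """Split an even-size square matrix into M0 (top-left), M1, M2, M3."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def strassen(A, B, cutoff=2):
    """Strassen recursion; classic product at or below `cutoff` or for odd sizes."""
    n = len(A)
    if n <= cutoff or n % 2:
        return matmul(A, B)
    A0, A1, A2, A3 = quadrants(A)
    B0, B1, B2, B3 = quadrants(B)
    # Seven recursive products instead of eight.
    M1 = strassen(add(A0, A3), add(B0, B3), cutoff)
    M2 = strassen(add(A2, A3), B0, cutoff)
    M3 = strassen(A0, sub(B1, B3), cutoff)
    M4 = strassen(A3, sub(B2, B0), cutoff)
    M5 = strassen(add(A0, A1), B3, cutoff)
    M6 = strassen(sub(A2, A0), add(B0, B1), cutoff)
    M7 = strassen(sub(A1, A3), add(B2, B3), cutoff)
    C0 = add(sub(add(M1, M4), M5), M7)
    C1 = add(M3, M5)
    C2 = add(M2, M4)
    C3 = add(add(sub(M1, M2), M3), M6)
    return [l + r for l, r in zip(C0, C1)] + [l + r for l, r in zip(C2, C3)]
```

The price of dropping one product is 18 half-size matrix additions per step, which is why the recursion is stopped at a cutoff instead of being carried down to scalars.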
The Problem: Strassen’s
• Strassen’s algorithm for power-of-two-size (2^k) matrices takes O(n^(log2 7)), about O(n^2.81), operations
• For even-size matrices, one recursive step is
  always applicable
• Otherwise
     – Dynamic and static padding
     – Peeling:
• For odd-size matrices [Hauss 97 & Luo 2004]:

Odd-Size Square Matrices

[Figure: two (2n+1) x (2n+1) operands A and B, each containing a 2n x 2n block (A0 and B0).]

A0 * B0 is an even-size problem.
Strassen is applied once more
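
Peeling for an odd size 2n+1 can be sketched as follows. The even-size block product A0 * B0 is delegated to a routine that could apply a Strassen step, and the peeled row and column are fixed up with O(n^2) work. This is a plain-Python illustration under stated assumptions: A0 and B0 are taken to be the leading blocks, and `peel_matmul` and its `even_mm` parameter are hypothetical names, not taken from the cited papers.

```python
def matmul(X, Y):
    """Classic O(n^3) product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def peel_matmul(A, B, even_mm=matmul):
    """C = A * B for odd-size square A, B via peeling.

    `even_mm` multiplies the even-size leading blocks; a Strassen step
    could be plugged in here. The default is the classic product.
    """
    m = len(A)
    k = m - 1                      # k = 2n, the even leading size
    A0 = [row[:k] for row in A[:k]]
    B0 = [row[:k] for row in B[:k]]
    C0 = even_mm(A0, B0)
    C = [[0] * m for _ in range(m)]
    # Leading block: A0 * B0 plus the outer product of A's last column
    # (restricted to the first k rows) and B's last row.
    for i in range(k):
        for j in range(k):
            C[i][j] = C0[i][j] + A[i][k] * B[k][j]
    # Last column of C: full dot products with B's last column.
    for i in range(k):
        C[i][k] = sum(A[i][t] * B[t][k] for t in range(m))
    # Last row of C, including the corner element.
    for j in range(m):
        C[k][j] = sum(A[k][t] * B[t][j] for t in range(m))
    return C
```

The fix-up work is cheap, but the even-size sub-problem may itself become odd after one halving, so peeling has to be repeated at each level of the recursion.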
Our Approach: balanced division

• For any matrix size, we apply a balanced
  Strassen’s division process
     – This reduces the number of computations further
       than peeling an odd-size problem to an even-size one, or padding it
• Balanced division = balanced workload
     – Thus, predictable performance
• Balanced-size operands
     – Better data cache utilization

Balanced Division Matrices
Near square: m = n + p with min |n - p|

[Figure: A and B, each of size m, are split into quadrants A0..A3 and B0..B3; the rows split as n and p, and the columns split as n and p.]

The quadrants are near square matrices.
At any step of the recursion, all sub-matrices are near square matrices
Balanced Matrices
(New matrix add and multiplication)


• The balanced division with Strassen’s
  recursion needs a new matrix addition (MA) definition
     – because it adds matrices of different sizes

• We generalize the operations such that:
     – The algorithm is correct
     – The extra control for the irregular sizes is
       negligible and needed only for matrix additions

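
One way to picture such generalized operations is to treat the smaller operand as if it were extended with zeros, without ever storing the padding. The sketch below applies a single balanced Strassen step to a square matrix of any size m, split as n = ceil(m/2) and p = m - n. The zero-extension semantics and all function names are assumptions made for illustration, and the authors' definitions may differ. The seven products use a classic multiply here, where the real algorithm would recurse or call ATLAS.

```python
def shape(X):
    """Rows and columns of a list-of-rows matrix (0 x 0 if empty)."""
    return len(X), (len(X[0]) if X else 0)

def gen_add(X, Y, sign=1):
    """X + sign * Y; the smaller operand is treated as zero-extended."""
    (rx, cx), (ry, cy) = shape(X), shape(Y)
    R = [[0] * max(cx, cy) for _ in range(max(rx, ry))]
    for i in range(rx):
        for j in range(cx):
            R[i][j] += X[i][j]
    for i in range(ry):
        for j in range(cy):
            R[i][j] += sign * Y[i][j]
    return R

def gen_mul(X, Y):
    """Classic product over the shared inner range (same zero-extension view)."""
    (rx, cx), (ry, cy) = shape(X), shape(Y)
    k = min(cx, ry)
    return [[sum(X[i][t] * Y[t][j] for t in range(k)) for j in range(cy)]
            for i in range(rx)]

def crop(X, r, c):
    """Keep the leading r x c part of X."""
    return [row[:c] for row in X[:r]]

def split_balanced(M):
    """Quadrants M0 (n x n), M1 (n x p), M2 (p x n), M3 (p x p)."""
    n = (len(M) + 1) // 2
    return ([r[:n] for r in M[:n]], [r[n:] for r in M[:n]],
            [r[:n] for r in M[n:]], [r[n:] for r in M[n:]])

def balanced_strassen_step(A, B):
    """One Strassen step on m x m operands, m even or odd."""
    m = len(A)
    n = (m + 1) // 2
    p = m - n
    A0, A1, A2, A3 = split_balanced(A)
    B0, B1, B2, B3 = split_balanced(B)
    M1 = gen_mul(gen_add(A0, A3), gen_add(B0, B3))
    M2 = gen_mul(gen_add(A2, A3), B0)
    M3 = gen_mul(A0, gen_add(B1, B3, -1))
    M4 = gen_mul(A3, gen_add(B2, B0, -1))
    M5 = gen_mul(gen_add(A0, A1), B3)
    M6 = gen_mul(gen_add(A2, A0, -1), gen_add(B0, B1))
    M7 = gen_mul(gen_add(A1, A3, -1), gen_add(B2, B3))
    C0 = crop(gen_add(gen_add(gen_add(M1, M4), M5, -1), M7), n, n)
    C1 = crop(gen_add(M3, M5), n, p)
    C2 = crop(gen_add(M2, M4), p, n)
    C3 = crop(gen_add(gen_add(gen_add(M1, M2, -1), M3), M6), p, p)
    return [C0[i] + C1[i] for i in range(n)] + [C2[i] + C3[i] for i in range(p)]
```

For even m the step reduces to the usual Strassen formulas. For odd m the result matches Strassen applied to operands padded by one zero row and column, but no padded copy is ever built.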

Experimental Results
• We considered 14 systems
     – We hand coded the MA for each specific system
• We measured the performance of ATLAS’s MM and MA
     – We specify an adaptive recursion point size for each system
     – We encode the recursion point in the algorithm
• We measured the relative performance of Strassen vs.
  ATLAS

• We report the details for three systems next
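
A rough illustration of how measured MM and MA rates could determine a recursion point is given below. It uses a deliberately simple cost model that is an assumption of this sketch, not the authors' formula. A classic MM of size n costs 2n^3 / mm_rate seconds, and one Strassen step costs seven MMs plus 18 MAs of size n/2, each MA costing (n/2)^2 / ma_rate seconds. Solving for the break-even size gives n > 18 * mm_rate / ma_rate. The function names and the rates used in the usage note are hypothetical.

```python
def classic_time(n, mm_rate):
    """Modeled time of a classic n x n MM at mm_rate flops per second."""
    return 2.0 * n ** 3 / mm_rate

def strassen_step_time(n, mm_rate, ma_rate):
    """Modeled time of one Strassen step: 7 half-size MMs + 18 half-size MAs."""
    h = n / 2.0
    return 7 * classic_time(h, mm_rate) + 18 * h * h / ma_rate

def recursion_point(mm_rate, ma_rate):
    """Break-even size: above it, one Strassen step beats the classic MM.

    From 2 n^3 / mm_rate > 7 * 2 (n/2)^3 / mm_rate + 18 (n/2)^2 / ma_rate,
    which simplifies to n > 18 * mm_rate / ma_rate.
    """
    return 18.0 * mm_rate / ma_rate
```

For example, with made-up rates of 4e9 flops/s for MM and 2e8 additions/s for MA, `recursion_point(4e9, 2e8)` puts the break-even size at 360. A slower MA relative to MM pushes the recursion point up, which fits the observation that some systems leave little room for Strassen's algorithm.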

[Figure: Opteron, Strassen + ATLAS. % time (-1 to 19) vs. matrix size N (1175 to 4550) for S-adaptive, S-1-unfold, S-2-unfold and S-3-unfold.]
[Figure: Opteron, ATLAS’s performance. % of peak (80 to 88) vs. N (1175 to 4775).]
(the higher the better)
[Figure: 8600 PA-RISC, Strassen + ATLAS. % time (0 to 16) vs. matrix size N (1175 to 4550) for S-adaptive, S-1-unfold, S-2-unfold and S-3-unfold.]
[Figure: 8600 PA-RISC, ATLAS’s performance. % of peak (66 to 72) vs. N (1175 to 4550).]
[Figure: Alpha, Strassen + ATLAS. % time (-3 to 32) vs. matrix size N (700 to 4550) for S-adaptive, S-1-unfold, S-2-unfold and S-3-unfold.]
[Figure: Alpha, ATLAS’s performance. % of peak (72 to 92) vs. N (950 to 5000).]
Conclusions

• Our approach uses a balanced division, as Strassen’s original algorithm does
• We tested performance exhaustively
     – Some architectures offer no practical opportunity for Strassen’s algorithm

• We use benchmarking of ATLAS’s MM and MA for specific
  code tuning.
     – In the spirit of adaptive software packages

• We speed up ATLAS’s MM without introducing any
  overhead due to data layout or extra control.

Future work
• The algorithm extends to rectangular matrices
     – We will characterize its performance
     – Parallel formulation and performance
• Power management
     – MM and MA together make up the application; however, they
       use the architecture differently
     – Adapting hardware configurations (e.g., XScale)

