# STRASSEN+ATLAS


```
Adaptive Strassen and ATLAS’s DGEMM

Paolo D’Alberto (CMU)
and
Alexandru Nicolau (UCI)

12/2/2005               HPC Asia         1
The Problem: Matrix Computations

• The evolution of systems is modeled by matrix
computations

• The prediction and evaluation of such models (of
complex systems) are fundamental in scientific
computing.

– For example, the solution of linear equations or the solution of
least-squares systems.

The Problem: BLAS
• The Basic Linear Algebra Subprograms (BLAS) are an interface
describing a set of (basic) matrix and vector
computations
– Historically, the BLAS was a set of algorithms

• Libraries implementing the BLAS are the backbone of higher-level libraries
– For ScaLAPACK

• Examples: ESSL, PHiPAC and ATLAS

The Problem: ATLAS
• Implementations of BLAS-3 routines are based on
matrix multiplication (MM)

• In practice, ATLAS automatically generates a
custom-tailored MM:
– It probes the system
– It tailors a kernel of MM to a specific system
– It uses the MM as a basic routine for the other
BLAS-3 routines

Matrix Multiplication (basics)

C0  C1       A0  A1       B0  B1
         =            *
C2  C3       A2  A3       B2  B3

C0 = A0B0 + A1B2       C1 = A0B1 + A1B3
C2 = A2B0 + A3B2       C3 = A2B1 + A3B3

The Problem: MM
• ATLAS uses this classic matrix multiply
– For square matrices of size n x n, the algorithm takes O(n^3) operations
– It achieves 80-90% of peak performance

• We apply Strassen’s algorithm to large problems
– Because it reduces the number of computations (thus
shortening the execution time)

• We investigate the effects on single-processor
systems

The Problem: Strassen’s
• Strassen’s algorithm for power-of-two-size matrices takes O(n^(log2 7)) ≈ O(n^2.81) operations
• For even-size matrices, one recursive step is
always applicable
• Otherwise (odd-size matrices)
– Peeling [Hauss 97 & Luo 2004]

Odd-Size Square Matrices
[Figure: A and B are (2n+1) x (2n+1) matrices; A0 and B0 are 2n x 2n sub-blocks.]
A0 * B0 is an even-size problem, so Strassen is applied once more.

Our Approach: balanced division

• For any matrix size, we apply a balanced
Strassen’s division process
– This reduces the number of computations further
than peeling (odd size to even size) or padding
• Balanced division = balanced workload
– Thus, predictable performance
• Balanced-size operands
– Better data cache utilization

Balanced Division Matrices
Near square: m = n + p, with |n - p| minimal

          n     p                n     p
     n   A0    A1           n   B0    B1
     p   A2    A3           p   B2    B3

The quadrants are near-square matrices.
At any step of the recursion, all sub-matrices are near-square matrices.

Balanced Matrices

• The balanced division with Strassen’s
recursion needs a new definition of matrix addition (MA)
– because it adds matrices of different sizes

• We generalize the operations such that:
– The algorithm is correct
– The extra control for the irregular sizes is
negligible and needed only for matrix additions

Experimental Results
• We considered 14 systems
– We hand-coded the MA for each specific system
• We measured the performance of ATLAS’s MM and MA
– We specify an adaptive recursion point size for each system
– We encode the recursion point in the algorithm
• We measured the relative performance of Strassen vs
ATLAS

• We report the details for three systems below

[Figure: Opteron. Top panel, Strassen + ATLAS: % time vs. matrix size N, with curves S-1-unfold, S-2-unfold and S-3-unfold. Bottom panel, ATLAS’s performance: % of peak vs. N (the higher the better).]

[Figure: 8600 PA-RISC. Top panel, Strassen + ATLAS: % time vs. matrix size N, with curves S-1-unfold, S-2-unfold and S-3-unfold. Bottom panel, ATLAS’s performance: % of peak vs. N.]

[Figure: ALPHA. Top panel, Strassen + ATLAS: % time vs. matrix size N, with curves S-1-unfold, S-2-unfold and S-3-unfold. Bottom panel, ATLAS’s performance: % of peak vs. N.]

Conclusions

• Our approach uses balanced division, as Strassen’s algorithm does
• We performed exhaustive performance testing
– Some architectures do not offer a practical opportunity for Strassen’s algorithm

• We use benchmarking of ATLAS’s MM and MA for system-specific
code tuning
– In the spirit of adaptive software packages

• We speed up ATLAS’s MM without introducing any overhead
– due to data layout or extra control

Future work
• The algorithm extends to rectangular matrices
– We will characterize its performance
– Parallel formulation and performance
• Power management
– MM and MA together make up the application, but they
utilize the architecture differently
– Adaptation of hardware configurations (e.g., XScale)


```
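The slides describe balanced division, a generalized matrix addition for operands of different sizes, and a per-system recursion point, but they do not give the definitions. The following is a minimal pure-Python sketch of how those three ideas could fit together, not the authors' implementation. Every name here (`classic_mm`, `gen_add`, `block`, `balanced_strassen`, `leaf`) is hypothetical. The sketch assumes that a smaller operand in an addition is treated as if it were zero-padded and that the result is cropped to a requested shape, so all size bookkeeping stays inside the additions. `classic_mm` is only a stand-in for a tuned kernel such as ATLAS's DGEMM, and `leaf` plays the role of the recursion point. The helper accepts rectangular operands because the sub-products of an odd-size problem are near square rather than exactly square.

```python
def classic_mm(A, B):
    """Classic O(n^3) multiply on lists of rows (stand-in for a tuned DGEMM)."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def gen_add(terms, rows, cols):
    """Generalized addition (an assumption of this sketch).

    Returns the rows x cols sum of sign * X over all (sign, X) terms.
    Each X is read as if zero-padded; entries outside rows x cols are dropped.
    """
    out = [[0] * cols for _ in range(rows)]
    for sign, X in terms:
        for i in range(min(rows, len(X))):
            for j in range(min(cols, len(X[0]))):
                out[i][j] += sign * X[i][j]
    return out


def block(X, r0, r1, c0, c1):
    """Copy of the sub-matrix X[r0:r1, c0:c1]."""
    return [row[c0:c1] for row in X[r0:r1]]


def balanced_strassen(A, B, leaf=2):
    """Strassen recursion with balanced (ceil/floor) division of each dimension."""
    m, k, n = len(A), len(A[0]), len(B[0])
    if min(m, k, n) <= max(leaf, 1):       # recursion point: use the classic kernel
        return classic_mm(A, B)
    m1, k1, n1 = (m + 1) // 2, (k + 1) // 2, (n + 1) // 2   # larger halves
    m2, k2, n2 = m - m1, k - k1, n - n1                     # smaller halves
    A0, A1 = block(A, 0, m1, 0, k1), block(A, 0, m1, k1, k)
    A2, A3 = block(A, m1, m, 0, k1), block(A, m1, m, k1, k)
    B0, B1 = block(B, 0, k1, 0, n1), block(B, 0, k1, n1, n)
    B2, B3 = block(B, k1, k, 0, n1), block(B, k1, k, n1, n)

    def mul(X, Y):
        return balanced_strassen(X, Y, leaf)

    # Seven products; the requested shapes keep every inner dimension consistent.
    M1 = mul(gen_add([(1, A0), (1, A3)], m1, k1), gen_add([(1, B0), (1, B3)], k1, n1))
    M2 = mul(gen_add([(1, A2), (1, A3)], m2, k1), B0)
    M3 = mul(A0, gen_add([(1, B1), (-1, B3)], k1, n2))
    M4 = mul(A3, gen_add([(1, B2), (-1, B0)], k2, n1))
    M5 = mul(gen_add([(1, A0), (1, A1)], m1, k2), B3)
    M6 = mul(gen_add([(1, A2), (-1, A0)], m2, k1), gen_add([(1, B0), (1, B1)], k1, n2))
    M7 = mul(gen_add([(1, A1), (-1, A3)], m1, k2), gen_add([(1, B2), (1, B3)], k2, n1))

    C0 = gen_add([(1, M1), (1, M4), (-1, M5), (1, M7)], m1, n1)
    C1 = gen_add([(1, M3), (1, M5)], m1, n2)
    C2 = gen_add([(1, M2), (1, M4)], m2, n1)
    C3 = gen_add([(1, M1), (-1, M2), (1, M3), (1, M6)], m2, n2)
    return [a + b for a, b in zip(C0, C1)] + [a + b for a, b in zip(C2, C3)]


# Usage: an odd-size (5 x 5) integer problem, checked against the classic multiply.
A = [[i + 2 * j for j in range(5)] for i in range(5)]
B = [[3 * i - j for j in range(5)] for i in range(5)]
assert balanced_strassen(A, B, leaf=2) == classic_mm(A, B)
```

In a tuned setting, `leaf` would be chosen per system from measurements of the multiply and addition kernels, in the spirit of the adaptive recursion point mentioned in the experimental-results slide; the value 2 above is only for demonstration.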