# Libraries and Program Performance

NERSC User Services Group

An Embarrassment of Riches: Serial

Parallel Libraries

Elementary Math Functions
• Three libraries provide elementary math functions:
  – C/Fortran intrinsics
  – MASS/MASSV (Mathematical Acceleration Subsystem, scalar and vector)
  – ESSL/PESSL (Engineering and Scientific Subroutine Library, serial and parallel)
• Language intrinsics are the most convenient, but not the best performers

Elementary Functions in Libraries
• MASS
  – sqrt rsqrt exp log sin cos tan atan atan2 sinh cosh tanh dnint x**y
• MASSV
  – cos dint exp log sin tan div rsqrt sqrt atan
  (see the sketch below comparing the intrinsic and MASSV forms)
See http://www.nersc.gov/nusers/resources/software/libs/math/MASS/
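To make the difference concrete, here is a minimal sketch (not from the original slides) contrasting the Fortran intrinsic with the corresponding MASSV vector routine. The array names and size are illustrative, and the program assumes the MASS vector library is linked (e.g., -lmassv):

```fortran
! Minimal sketch: intrinsic sqrt in a loop vs. the MASSV routine vsqrt.
! Illustrative only; array names and sizes are not from the slides.
program sqrt_compare
  implicit none
  integer, parameter :: n = 1000000
  real*8  :: x(n), y(n)
  integer :: i

  do i = 1, n
     x(i) = dble(i)
  end do

  ! Fortran intrinsic: one scalar sqrt per loop iteration
  do i = 1, n
     y(i) = sqrt(x(i))
  end do

  ! MASSV: a single call processes the whole vector,
  ! vsqrt(y, x, n) computes y(i) = sqrt(x(i)) for i = 1..n
  call vsqrt(y, x, n)
end program sqrt_compare
```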

Other Functions in Libraries
• ESSL
  – Linear Algebra Subprograms (BLAS; see the sketch below)
  – Matrix Operations
  – Linear Algebraic Equations
  – Eigensystem Analysis
  – Fourier Transforms, Convolutions, Correlations, and Related Computations
  – Sorting and Searching
  – Interpolation
  – Random Number Generation
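The linear algebra subprograms in ESSL follow the standard BLAS calling sequences, so existing BLAS code can simply be relinked against -lessl (or -lesslsmp). A minimal, illustrative sketch (not from the slides):

```fortran
! Minimal sketch: ESSL provides the standard BLAS entry points,
! so a routine such as daxpy (y = alpha*x + y) links directly with -lessl.
program axpy_example
  implicit none
  integer, parameter :: n = 1000
  real*8 :: x(n), y(n)

  x = 1.0d0
  y = 2.0d0
  ! daxpy(n, alpha, x, incx, y, incy)
  call daxpy(n, 3.0d0, x, 1, y, 1)
  print *, y(1)   ! expect 5.0
end program axpy_example
```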

Comparing Elementary Functions
• Loop schema for elementary functions:

 99   write(6,98)
 98   format( " sqrt: " )
      x = pi/4.0
      call f_hpmstart(1,"sqrt")
      do 100 i = 1, loopceil
        y = sqrt(x)
        x = y * y
100   continue
      call f_hpmstop(1)
      write(6,101) x
101   format( "    x = ", g21.14 )

Comparing Elementary Functions
• Execution schema for elementary functions:
setenv F1 "-qfixed=80 -qarch=pwr3 -qtune=pwr3 -O3 -qipa"
module list
setenv L1 "-Wl,-v,-bC:massmap"
xlf90_r masstest.F $F1 $L1 $MASS $HPMTOOLKIT -o masstest

timex masstest < input > mathout

Results Examined
• User execution time
• Number of floating-point and FMA instructions
• Operation rate in Mflip/s

Results Observed
• Best times/rates at -O3 or -O4
• ESSL no different from intrinsics
• MASS much faster than intrinsics

Comparing Higher Level Functions
• Several sources of a matrix-multiply function:
  – User-coded scalar computation
  – Fortran intrinsic matmul
  – Single-processor ESSL dgemm
  – Single-processor IMSL dmrrrr (32-bit)
  – Single-processor NAG f01ckf

Sample Problem
• Multiply dense matrices:
  – A(1:n,1:n), with a(i,j) = i + j
  – B(1:n,1:n), with b(i,j) = j - i
  – C(1:n,1:n) = A * B
• Output C to verify the result

Kernel of user matrix multiply
      do i=1,n
        do j=1,n
          a(i,j) = real(i+j)
          b(i,j) = real(j-i)
          c(i,j) = 0.0
        enddo
      enddo
      call f_hpmstart(1,"Matrix multiply")
      do j=1,n
        do k=1,n
          do i=1,n
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          enddo
        enddo
      enddo
      call f_hpmstop(1)
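For comparison with the hand-coded kernel above, the ESSL (BLAS) call that performs the same product is a single line. This is a minimal sketch, not from the original slides; it assumes a, b, and c are declared real*8 with leading dimension n (use sgemm for default-real arrays) and that the same HPM instrumentation style is wanted:

```fortran
! Minimal sketch: the same C = A*B computed by the ESSL/BLAS routine dgemm.
! Argument order: transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc.
! With beta = 0.0d0, c does not need to be zeroed beforehand.
call f_hpmstart(2,"dgemm multiply")
call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
call f_hpmstop(2)
```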

Comparison of Matrix Multiply (N1 = 5,000)

| Version     | Wall Clock (sec) | Mflip/s | Scaled Time (1 = fastest) |
|-------------|------------------|---------|---------------------------|
| User Scalar | 1,490            | 168     | 106 (slowest)             |
| Intrinsic   | 1,477            | 169     | 106 (slowest)             |
| ESSL        | 195              | 1,280   | 13.9                      |
| IMSL        | 194              | 1,290   | 13.8                      |
| NAG         | 195              | 1,280   | 13.9                      |
| ESSL-SMP    | 14               | 17,800  | 1.0 (fastest)             |
| NAG-SMP     | 14               | 17,800  | 1.0 (fastest)             |

Observations on Matrix Multiply
• The fastest times were obtained by the two SMP libraries, ESSL-SMP and NAG-SMP, which both reached about 74% of peak node performance
• All the single-processor library functions took about 14 times more wall-clock time than the SMP versions, each reaching about 85% of peak for a single processor (see the arithmetic below)
• The worst times were from the user code and the Fortran intrinsic, which took about 100 times more wall-clock time than the SMP libraries
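As a rough cross-check of those percentages (this arithmetic is not from the original slides and assumes a 16-processor POWER3 node with a per-processor peak of 1,500 Mflip/s, i.e. 24,000 Mflip/s per node):

$$\frac{17{,}800}{16 \times 1{,}500} \approx 0.74, \qquad \frac{1{,}280}{1{,}500} \approx 0.85 .$$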

Comparison of Matrix Multiply (N2 = 10,000)

| Version  | Wall Clock (sec) | Mflip/s | Scaled Time |
|----------|------------------|---------|-------------|
| ESSL-SMP | 101              | 19,800  | 1.01        |
| NAG-SMP  | 100              | 19,900  | 1.00        |

• Scaling with problem size (complexity increase ~8x)

| Version  | Wall Clock (N2/N1) | Mflip/s (N2/N1) |
|----------|--------------------|-----------------|
| ESSL-SMP | 7.2                | 1.10            |
| NAG-SMP  | 7.1                | 1.12            |

Both ESSL-SMP and NAG-SMP showed about a 10% performance gain at the larger problem size.

Observations on Scaling
• Scaling of the problem size was done only for the SMP libraries, to keep run times reasonable
• Doubling N increases the computational complexity of dense matrix multiplication roughly 8-fold (see below)
• Performance actually increased for both routines at the larger problem size
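As a quick check of the factor of 8 (not stated on the original slide): multiplying two dense N x N matrices takes roughly 2N^3 floating-point operations, so doubling N gives

$$\frac{2(2N)^3}{2N^3} = 8 .$$

Since the Mflip/s rate also rose by about 10%, the expected wall-clock ratio is roughly 8 / 1.10 ≈ 7.3, consistent with the measured ratios of 7.1 and 7.2.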

ESSL-SMP Performance vs. Number of Threads
• All for N = 10,000

[Chart: dgemm Mflip/s (0 to 20,000) versus number of threads (0 to 36)]

Parallelism: Choices Based on Problem Size

[Chart: matrix-matrix multiply performance versus problem size for three good choices: ESSL/LAPACK, ESSL-SMP, and ScaLAPACK]

• Only beyond a certain problem size is there any opportunity for parallelism
• At the large end, that can mean doing a month's work in a few minutes!

Larger Functions: FFTs
• ESSL, FFTW, NAG, IMSL
See http://www.nersc.gov/nusers/resources/software/libs/math/fft/

• We looked at ESSL, NAG, and IMSL
  – One-D, forward and reverse

One-D FFTs
• NAG
  – c06eaf - forward
  – c06ebf - inverse, conjugate needed
  – c06faf - forward, work-space needed
  – c06ebf - inverse, work-space & conjugate needed
• IMSL
  – z_fast_dft - forward & reverse, separate arrays
• ESSL
  – drcft - forward & reverse, work-space & initialization step needed
• All have size constraints on their data sets

One-D FFT Measurement
• 2^24 real*8 data points input (a synthetic signal)
• Each transform ran in a 20-iteration loop
• All performed both forward and inverse transforms on the same data
• Input and inverse outputs were identical
• Measured with HPMToolkit

      second1 = rtc()
      call f_hpmstart(33,"nag2 forward")
      do loopind = 1, loopceil
        w(1:n) = x(1:n)
        call c06faf( w, n, work, ifail )
      end do
      call f_hpmstop(33)
      second2 = rtc()

One-D FFT Performance

| Library | Routine    | Direction | Time (sec) | Mflip/s |
|---------|------------|-----------|------------|---------|
| NAG     | c06eaf     | forward   | 25.182     | 54.006  |
| NAG     | c06ebf     | inverse   | 24.465     | 40.666  |
| NAG     | c06faf     | forward   | 29.451     | 46.531  |
| NAG     | c06ebf     | inverse   | 24.469     | 40.663  |
| IMSL    | z_fast_dft | forward   | 71.479     | 46.027  |
| IMSL    | z_fast_dft | inverse   | 71.152     | 48.096  |
| ESSL    | drcft      | init      | 0.032      | 62.315  |
| ESSL    | drcft      | forward   | 3.573      | 274.009 |
| ESSL    | drcft      | init      | 0.058      | 96.384  |
| ESSL    | drcft      | inverse   | 3.616      | 277.650 |

(The NAG transforms required a data copy for each iteration of each transform.)

ESSL and ESSL-SMP: "Easy" Parallelism
( -qsmp=omp -qessl -lomp -lesslsmp )

• For simple problems you can simply dial in the local parallelism
• Cache reuse can lead to superlinear speedup
• An NH II node has 128 MB of cache!

Parallelism Beyond One Node: MPI

[Figure: distributed data decomposition of a 2D problem, shown across 4 tasks and across 20 tasks]

• Distributed parallelism (MPI) requires both local and global addressing contexts
• The dimensionality of the decomposition can have a profound impact on scalability
• Consider the surface-to-volume ratio: surface = communication (MPI), volume = local work (HPM); see the worked example below
• The decomposition is often a cause of load imbalance, which can reduce parallel efficiency
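To make the surface-to-volume argument concrete (an illustrative calculation, not from the original slides): for an N x N 2D grid distributed over P tasks, each task's local work scales with its volume, N^2/P. With a 1D (strip) decomposition each task exchanges on the order of 2N boundary points, while a 2D (block) decomposition exchanges only about 4N/sqrt(P), so the communication-to-work ratios are

$$\text{1D: } \frac{2N}{N^2/P} = \frac{2P}{N}, \qquad \text{2D: } \frac{4N/\sqrt{P}}{N^2/P} = \frac{4\sqrt{P}}{N} .$$

The higher-dimensional decomposition keeps this ratio smaller as P grows, which is one reason the dimensionality of the decomposition matters for scalability.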

Example FFTW: 3D
• Popular for its portability and performance
• Also consider PESSL's FFTs (not treated here)
• Uses a slab (1D) data decomposition
• Direct algorithms for transforms of dimensions of size:
  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64
• For parallel FFTW calls, transforms are done in place

Plan:
  plan = fftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny, nz, FFTW_FORWARD, flags);
  fftwnd_mpi_local_sizes(plan, &lnx, &lxs, &lnyt, &lyst, &lsize);

Transform:
  fftwnd_mpi(plan, 1, data, work, transform_flags);

What are these?

FFTW: Data Decomposition
Each MPI rank owns a portion of the problem: the global nx x ny x nz grid is split into slabs of lnx planes, starting at plane lxs on each rank.

Local (distributed) addressing:
  for(x=0; x<lnx; x++)
    for(y=0; y<ny; y++)
      for(z=0; z<nz; z++) {
        data[x][y][z] = f(x+lxs, y, z);
      }

Global (serial) addressing:
  for(x=0; x<nx; x++)
    for(y=0; y<ny; y++)
      for(z=0; z<nz; z++) {
        data[x][y][z] = f(x, y, z);
      }

FFTW: Parallel Performance
• FFT performance can be a complex function of problem size: the prime factors of the dimensions and the concurrency determine performance
• Consider the data decomposition and optimal local data sizes: cache use and prime factors

FFTW: Wisdom
• Runtime performance optimization; can be stored to a file
• Wise options: FFTW_MEASURE | FFTW_USE_WISDOM
• Unwise option: FFTW_ESTIMATE
• Wisdom works better for serial FFTs; any benefit for parallel FFTs must amortize the increase in planning time