



Libraries and Program Performance

NERSC User Services Group



An Embarrassment of Riches: Serial


Threaded Libraries (Threaded)

    Parallel Libraries
(Distributed & Threaded)



    Elementary Math Functions
• Three libraries provide elementary math
  functions:
  – C/Fortran intrinsics
  – MASS/MASSV (Math Acceleration Subroutine
    System)
  – ESSL/PESSL (Engineering and Scientific
    Subroutine Library)
• Language intrinsics are the most
  convenient, but not the best performers



  Elementary Functions in Libraries
• MASS
   – sqrt rsqrt exp log sin cos tan atan atan2 sinh
     cosh tanh dnint x**y
• MASSV (vector versions; see the calling sketch below)
   – cos dint exp log sin log tan div
     rsqrt sqrt atan
See http://www.nersc.gov/nusers/resources/software/libs/math/MASS/
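A minimal sketch (not from the original slides) of the calling difference
between the scalar intrinsic and the MASSV vector routine vexp; the array
names, length n, and test value are illustrative assumptions.

      program massv_sketch
      implicit none
      integer, parameter :: n = 1000000
      real(8) :: x(n), y(n)
      integer :: i

      x = 0.5d0

!     Scalar path: one intrinsic call per element
      do i = 1, n
         y(i) = exp(x(i))
      end do

!     Vector path: one MASSV call for the whole array;
!     vexp(y, x, n) sets y(i) = exp(x(i)) for i = 1..n
      call vexp(y, x, n)

      write(6,*) 'y(1) =', y(1)
      end program massv_sketch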



       Other Intrinsics in Libraries
• ESSL
  –   Linear Algebra Subprograms
  –   Matrix Operations
  –   Linear Algebraic Equations
  –   Eigensystem Analysis
  –   Fourier Transforms, Convolutions, Correlations,
      and Related Computations
  –   Sorting and Searching
  –   Interpolation
  –   Numerical Quadrature
  –   Random Number Generation



Comparing Elementary Functions
• Loop schema for elementary functions:

  99   write(6,98)
  98   format( " sqrt: " )
       x = pi/4.0
       call f_hpmstart(1,"sqrt")
!      Squaring the result feeds it back as the next input, so x
!      stays in range and the loop cannot be optimized away
       do 100 i = 1, loopceil
               y = sqrt(x)
               x = y * y
 100   continue
       call f_hpmstop(1)
       write(6,101) x
 101   format( "    x = ", g21.14 )



Comparing Elementary Functions
• Execution schema for elementary functions:
setenv F1 "-qfixed=80 -qarch=pwr3 -qtune=pwr3 -O3 -qipa"
module load hpmtoolkit
module load mass
module list
setenv L1 "-Wl,-v,-bC:massmap"
xlf90_r masstest.F $F1 $L1 $MASS $HPMTOOLKIT -o masstest

timex masstest < input > mathout



             Results Examined
•   Answers after 50e6 iterations
•   User execution time
•   Number of floating-point and FMA instructions
•   Operation rate in Mflip/sec



             Results Observed
•   No difference in answers
•   Best times/rates at -O3 or -O4
•   ESSL no different from intrinsics
•   MASS much faster than intrinsics



 Comparing Higher Level Functions
• Several sources of a matrix-multiply routine:
  –   User-coded scalar computation
  –   Fortran intrinsic matmul
  –   Single processor ESSL dgemm
  –   Multi-threaded SMP ESSL dgemm
  –   Single processor IMSL dmrrrr (32-bit)
  –   Single processor NAG f01ckf
  –   Multi-threaded SMP NAG f01ckf



             Sample Problem
• Multiply dense matrices:
  – A(1:n,1:n) = i + j
  – B(1:n,1:n) = j - i
  – C(1:n,1:n) = A * B
• Output C to verify the result



Kernel of user matrix multiply
  do i=1,n
     do j=1,n
        a(i,j) = real(i+j)
        b(i,j) = real(j-i)
     enddo
  enddo
! c is assumed to be zeroed before the timed kernel runs
  call f_hpmstart(1,"Matrix multiply")
  do j=1,n
     do k=1,n
        do i=1,n
           c(i,j) = c(i,j) + a(i,k)*b(k,j)
        enddo
     enddo
  enddo
  call f_hpmstop(1)
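For comparison, a minimal sketch (assumed, not from the slides) of replacing
the hand-coded triple loop with the Level-3 BLAS routine DGEMM, which ESSL
provides in both serial and SMP form; the matrix setup mirrors the kernel
above and n is illustrative.

      program dgemm_sketch
      implicit none
      integer, parameter :: n = 1000
      real(8) :: a(n,n), b(n,n), c(n,n)
      integer :: i, j

      do j = 1, n
         do i = 1, n
            a(i,j) = real(i+j)
            b(i,j) = real(j-i)
         end do
      end do
      c = 0.0d0

!     C := 1.0*A*B + 0.0*C, equivalent to the triple loop above
!     (link with -lessl, or -lesslsmp for the threaded version)
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

      write(6,*) 'c(1,1) =', c(1,1)
      end program dgemm_sketch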

     Comparison of Matrix Multiply
                       (N1 = 5,000)
Version        Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)
User Scalar           1,490           168      106   (slowest)
Intrinsic             1,477           169      106   (slowest)
ESSL                    195         1,280       13.9
IMSL                    194         1,290       13.8
NAG                     195         1,280       13.9
ESSL-SMP                 14        17,800        1.0  (fastest)
NAG-SMP                  14        17,800        1.0  (fastest)



Observations on Matrix Multiply
• The fastest times came from the two SMP libraries,
  ESSL-SMP and NAG-SMP, both reaching 74% of peak
  node performance
• The single-processor library routines all took about 14
  times more wall-clock time than the SMP versions, each
  reaching about 85% of single-processor peak
• The worst times came from the user code and the Fortran
  intrinsic, which took roughly 100 times more wall-clock
  time than the SMP libraries

      Comparison of Matrix Multiply
                 (N2 = 10,000)
Version     Wall Clock (sec)   Mflip/s   Scaled Time
ESSL-SMP          101           19,800      1.01
NAG-SMP           100           19,900      1.00

• Scaling with problem size (complexity increase ~8x)
Version     Wall Clock (N2/N1)   Mflip/s (N2/N1)
ESSL-SMP           7.2                1.10
NAG-SMP            7.1                1.12

  Both ESSL-SMP and NAG-SMP showed ~10% performance
  gains at the larger problem size.



       Observations on Scaling
• Scaling with problem size was measured only for the
  SMP libraries, to keep run times reasonable.
• Doubling N increases the work of dense matrix
  multiplication by a factor of (2N)^3 / N^3 = 8.
• Performance (Mflip/s) actually increased for both
  routines at the larger problem size.

   ESSL-SMP Performance vs. Number of Threads
• All for N=10,000
• Number of threads is controlled by the environment variable OMP_NUM_THREADS (see the sketch below)

  [Figure: dgemm Mflip/s (0 to 20,000) vs. number of threads (0 to 36), N=10,000]
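A minimal sketch (assumed, not from the slides) of checking the thread count
that OMP_NUM_THREADS establishes before the threaded library is called;
omp_get_max_threads is the standard OpenMP query routine.

      program thread_report
      use omp_lib
      implicit none
      integer :: nthreads

!     ESSL-SMP takes its thread count from the OpenMP runtime,
!     e.g. after  setenv OMP_NUM_THREADS 16
      nthreads = omp_get_max_threads()
      write(6,*) 'threaded dgemm may use up to', nthreads, 'threads'
      end program thread_report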



Parallelism: Choices Based on Problem Size

  [Figure: matrix-matrix multiply performance vs. problem size]

• Doing a month’s work in a few minutes!
• Three good choices:
  – ESSL / LAPACK
  – ESSL-SMP
  – SCALAPACK
• Only beyond a certain problem size is there any
  opportunity for parallelism.



        Larger Functions: FFTs
• ESSL, FFTW, NAG, IMSL
See http://www.nersc.gov/nusers/resources/software/libs/math/fft/



• We looked at ESSL, NAG, and IMSL
   – One-D, forward and reverse



                      One-D FFTs
• NAG
   –   c06eaf - forward
   –   c06ebf - inverse, conjugate needed
   –   c06faf - forward, work-space needed
   –   c06ebf - inverse, work-space & conjugate needed
• IMSL
   – z_fast_dft - forward & reverse, separate arrays
• ESSL
   – drcft - forward & reverse, work-space & initialization
     step needed
• All have size constraints on their data sets



      One-D FFT Measurement
• 2^24 real*8 data points input (a synthetic signal)
• Each transform ran in a 20-iteration loop
• All performed both forward and inverse transforms on
  same data
• Input and inverse outputs were identical
• Measured with HPMToolkit
       second1 = rtc()
       call f_hpmstart(33,"nag2 forward")
       do loopind=1, loopceil
!            c06faf transforms in place, so refresh w from the
!            original signal x on every pass
             w(1:n) = x(1:n)
             call c06faf( w, n, work, ifail )
       end do
       call f_hpmstop(33)
       second2 = rtc()



         One-D FFT Performance
NAG    c06eaf       fwd     25.182 sec.    54.006 Mflip/s
       c06ebf       inv     24.465 sec.    40.666 Mflip/s
       c06faf       fwd     29.451 sec.    46.531 Mflip/s
       c06ebf       inv     24.469 sec.    40.663 Mflip/s
       (required a data copy for each iteration for each transform)

IMSL   z_fast_dft   fwd     71.479 sec.    46.027 Mflip/s
       z_fast_dft   inv     71.152 sec.    48.096 Mflip/s

ESSL   drcft        init     0.032 sec.    62.315 Mflip/s
       drcft        fwd      3.573 sec.   274.009 Mflip/s
       drcft        init     0.058 sec.    96.384 Mflip/s
       drcft        inv      3.616 sec.   277.650 Mflip/s



ESSL and ESSL-SMP: “Easy” parallelism
  ( -qsmp=omp -qessl -lomp -lesslsmp )

•   For simple problems you can dial in the local data size
    by adjusting the number of threads
•   Cache reuse can lead to superlinear speedup
•   An NH II node has 128 MB of cache!


         Parallelism Beyond One Node: MPI

  [Figure: distributed data decomposition of a 2D problem
   across 4 tasks and across 20 tasks]

•    Distributed parallelism (MPI) requires both local and
     global addressing contexts (see the sketch below)
•    The dimensionality of the decomposition can have a
     profound impact on scalability
•    Consider the surface-to-volume ratio:
     surface = communication (MPI), volume = local work (HPM)
•    Decomposition is often a cause of load imbalance,
     which can reduce parallel efficiency
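A minimal sketch (assumed, not from the slides) of the local/global
addressing bookkeeping behind a 1D slab decomposition; nx and the
remainder-handling policy are illustrative.

      program slab_decomp
      use mpi
      implicit none
      integer, parameter :: nx = 1024
      integer :: ierr, rank, nprocs, lnx, lxs

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

!     Each rank owns lnx consecutive x-planes starting at global
!     offset lxs, so local index + lxs = global index
      lnx = nx / nprocs
      lxs = rank * lnx
      if (rank == nprocs - 1) lnx = nx - lxs

      write(6,*) 'rank', rank, 'owns x =', lxs + 1, 'to', lxs + lnx

      call MPI_Finalize(ierr)
      end program slab_decomp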



                Example FFTW: 3D
• Popular for its portability and performance
• Also consider PESSL’s FFTs (not treated here)
• Uses a slab (1D) data decomposition
• Direct algorithms for transforms of dimensions of size:
  1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64
• For parallel FFTW calls, transforms are done in place
Plan:
plan = fftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny, nz,
                              FFTW_FORWARD, flags);
fftwnd_mpi_local_sizes(plan, &lnx, &lxs, &lnyt, &lyst, &lsize);

Transform:
fftwnd_mpi(plan, 1, data, work, transform_flags);

What are lnx, lxs, lnyt, lyst, lsize?  (See the data
decomposition on the next slide.)



           FFTW: Data Decomposition
• Each MPI rank owns a slab of the global nx x ny x nz array:
  the local x-planes run from lxs to lxs+lnx-1

Local address context (what each rank stores):

  for(x=0;x<lnx;x++)
   for(y=0;y<ny;y++)
    for(z=0;z<nz;z++) {
      data[x][y][z] = f(x+lxs,y,z);
    }

Global address context (the logical whole array):

  for(x=0;x<nx;x++)
   for(y=0;y<ny;y++)
    for(z=0;z<nz;z++) {
      data[x][y][z] = f(x,y,z);
    }



FFTW : Parallel Performance
• FFT performance can be a complex function of problem
  size: the prime factors of the dimensions and the
  concurrency determine performance
• Consider data decompositions and paddings that lead to
  optimal local data sizes: cache use and prime factors
  (a padding sketch follows below)
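A minimal sketch (assumed, not from the slides) of one way to pick a padded
dimension: grow n until its only prime factors are small ones that FFT
libraries handle with direct algorithms. The factor set {2, 3, 5, 7} is an
illustrative assumption.

      program pad_sketch
      implicit none
      integer :: n

!     Example: a prime dimension of 1009 pads up to 1024
      n = 1009
      do while (.not. small_factors(n))
         n = n + 1
      end do
      write(6,*) 'padded size:', n

      contains

      logical function small_factors(m)
      integer, intent(in) :: m
      integer, parameter :: primes(4) = (/ 2, 3, 5, 7 /)
      integer :: r, i
      r = m
      do i = 1, 4
         do while (mod(r, primes(i)) == 0)
            r = r / primes(i)
         end do
      end do
      small_factors = (r == 1)
      end function small_factors

      end program pad_sketch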



   FFTW : Wisdom
• Runtime performance optimization that can be stored
  to a file and reused
• Wise options:
     FFTW_MEASURE | FFTW_USE_WISDOM
• Unwise options:
     FFTW_ESTIMATE
• Wisdom works better for serial FFTs; any benefit for
  parallel FFTs must amortize the increase in planning
  overhead.

				