Document Sample

ScaLAPACK Tutorial
Susan Blackford, University of Tennessee
http://www.netlib.org/scalapack/

Outline
- Introduction
- ScaLAPACK Project Overview
- Basic Linear Algebra Subprograms (BLAS)
- Linear Algebra PACKage (LAPACK)
- Basic Linear Algebra Communication Subprograms (BLACS)
- Parallel BLAS (PBLAS)
- Design of ScaLAPACK

Outline continued
- Contents of ScaLAPACK
- Applications
- Performance
- Example Programs
- Issues of Heterogeneous Computing
- HPF Interface to ScaLAPACK

Outline continued
- Future Directions
- Conclusions
- Hands-On Exercises

Introduction

High-Performance Computing Today
- In the past decade, the world has experienced one of the most exciting periods in computer development.
- Microprocessors have become smaller, denser, and more powerful.
- The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering.

Growth of Microprocessor Performance
[Figure: performance in Mflop/s, 0.01 to 10000 on a log scale, for supercomputers (Cray 1S, X-MP, Y-MP, Cray 2, C90, T90) versus microprocessors (8087, 6881, 80287, 80387, R2000, i860, RS6000/540, RS6000/590, Alpha); the microprocessor curve closes the gap on the supercomputers.]

The Maturation of Highly Parallel Technology
- Affordable parallel systems now out-perform the best conventional supercomputers.
- Performance per dollar is particularly favorable.
- The field is thinning to a few very capable systems.
- Reliability is greatly improved.
- Third-party scientific and engineering applications are appearing.
- Business applications are appearing.
- Commercial customers, not just research labs, are acquiring systems.

Architecture Alternatives
- Distributed, private memories using message passing to communicate among processors.
- Single address space implemented with physically distributed memories.
- Both approaches require:
  - Attention to locality of access.
  - Scalable interconnection technology.

Directions
- Move toward shared memory
  - SMPs and Distributed Shared Memory
  - Shared address space with deep memory hierarchy
- Clustering of shared memory machines for scalability
- Efficiency of message passing and data parallel programming
  - Helped by standards efforts such as MPI and HPF (PVM & OpenMP)

Challenges in Developing Distributed Memory Libraries
- How to integrate software?
  - Until recently, no standards
  - Many parallel languages
  - Various parallel programming models
  - Assumptions about the parallel environment: granularity, topology, overlapping of communication/computation, development tools
- Where is the data?
  - Who owns it?
  - Optimal data distribution
- Who determines the data layout?
  - Determined by the user?
  - Determined by the library developer?
  - Allow dynamic data distribution
  - Load balancing
ScaLAPACK Project Overview

ScaLAPACK Team
- Susan Blackford, UT; Jaeyoung Choi, Soongsil University, Korea (UT); Andy Cleary, LLNL (UT); Tony Chan, UCLA; Ed D'Azevedo, ORNL; Jim Demmel, UC-B; Inder Dhillon, IBM (UC-B); Jack Dongarra, UT/ORNL; Victor Eijkhout, UT; Sven Hammarling, NAG (UT); Mike Heath, UIUC; Greg Henry, Intel; Antoine Petitet, UT; Padma Raghavan, UT; Dan Sorensen, Rice U; Ken Stanley, UC-B; David Walker, Cardiff (ORNL); Clint Whaley, UT
- Plus others not funded by DARPA

Scalable Parallel Library for Numerical Linear Algebra
- ScaLAPACK: software library for dense, banded, sparse
  - University of Tennessee; Oak Ridge National Laboratory; University of California, Berkeley
- P_ARPACK: large sparse non-symmetric eigenvalues
  - Rice University
- CAPSS: direct sparse systems solvers
  - University of Illinois, Urbana-Champaign; University of Tennessee
- ParPre: parallel preconditioners for iterative methods

NLA - Software Development
- BLAS, LINPACK, EISPACK, LAPACK
  - From hand-held calculators to the most powerful vector supercomputers
  - RISC, vector, and SMP targets
  - In use by CRAY, IBM, SGI, HP-Convex, SUN, Fujitsu, Hitachi, NEC, NAG, IMSL, KAI
  - Still under development
  - High accuracy routines, parameterizing for memory hierarchies, annotated libraries

NLA - ScaLAPACK
- Follow-on to LAPACK, designed for distributed parallel computing (MPP & clusters)
- First math software package to do this
- Numerical software that will work on a heterogeneous platform
- Latency-tolerant algorithms on systems with a deep memory hierarchy
- "Out of core" and fault-tolerant implementations
- In use by Cray, IBM, HP-Convex, Fujitsu, NEC, NAG, IMSL
  - Tailor performance & provide support

Goals - Port LAPACK to Distributed-Memory Environments
- Efficiency
  - Optimized compute and communication engines
  - Block-partitioned algorithms (Level 3 BLAS) utilize the memory hierarchy and yield good node performance
- Reliability
  - Whenever possible, use LAPACK algorithms and error bounds.
- Scalability
  - As the problem size and number of processors grow
  - Replace LAPACK algorithms that did not scale; new ones go into LAPACK
- Portability
  - Isolate machine dependencies to the BLAS and the BLACS
- Flexibility
  - Modularity: build a rich set of linear algebra tools: BLAS, BLACS, PBLAS

ScaLAPACK Team
- Susan Blackford, UT; Jaeyoung Choi, Soongsil University; Andy Cleary, LLNL; Ed D'Azevedo, ORNL; Jim Demmel, UC-B; Inder Dhillon, IBM (UC-B); Jack Dongarra, UT/ORNL; Ray Fellers, UC-B; Sven Hammarling, NAG; Greg Henry, Intel; Osni Marques, LBNL/NERSC; Caroline Papadopoulos, UCSD; Antoine Petitet, UT; Ken Stanley, UC-B; Francoise Tisseur, U Manchester; David Walker, Cardiff U; Clint Whaley, UT

Programming Style
- SPMD Fortran 77 using an object-based design
- Built on various modules
  - PBLAS
  - Interprocessor communication: BLACS
    - PVM, MPI, IBM SP, CRI T3, Intel, TMC
    - Provides the right level of notation.
  - BLAS
- LAPACK software expertise/quality
  - software approach

Overall Structure of Software
- Each global data object is assigned an array descriptor.
- The array descriptor
  - Contains the information required to establish the mapping between a global array entry and its corresponding process and memory location.
  - Is differentiated by the DTYPE_ (first entry) in the descriptor.
  - Provides a flexible framework to easily specify additional data distributions or matrix types.
- Uses the concept of context.
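To make the descriptor concrete before the PBLAS discussion that follows, here is a minimal, hedged sketch using the ScaLAPACK tool routine DESCINIT (the same routine used in the PBLAS examples later in this tutorial); the grid, matrix size, and block size are illustrative assumptions.

      INTEGER            DESCA( 9 ), ICTXT, INFO, LLDA
*     Assume ICTXT already holds a BLACS context (grid), and LLDA is
*     the leading dimension of this process's local array (assumed
*     large enough for its share of the matrix).
*     Describe a 1000 x 1000 global matrix in 64 x 64 blocks, with
*     the first block owned by process (0,0) of the grid.
      CALL DESCINIT( DESCA, 1000, 1000, 64, 64, 0, 0, ICTXT,
     $               LLDA, INFO )
*     On success INFO = 0 and DESCA(1) = 1, the DTYPE_ for dense
*     matrices; the remaining entries record the mapping above.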
PBLAS
- Similar to the BLAS in functionality and naming.
- Built on the BLAS and BLACS.
- Provide a global view of the matrix:

      CALL DGEXXX ( M, N, A( IA, JA ), LDA, ... )
      CALL PDGEXXX( M, N, A, IA, JA, DESCA, ... )

ScaLAPACK Structure
- Global layer: ScaLAPACK, built on the PBLAS.
- Local layer: LAPACK and the BLAS for computation; the BLACS over PVM/MPI/... for communication.

Parallelism in ScaLAPACK
- Level 3 BLAS block operations
  - All the reduction routines
- Pipelining
  - QR algorithm, triangular solvers, classic factorizations
- Redundant computations
  - Condition estimators
  - QR-based eigensolvers
- Static work assignment
  - Bisection
- Task parallelism
  - Sign function eigenvalue computations
- Divide and conquer
  - Tridiagonal and band solvers, symmetric eigenvalue problem, and sign function
- Cyclic reduction
  - Reduced system in the band solver

Heterogeneous Computing
- Software intended to be used in this context
- Machine precision and other machine-specific parameters
- Communication of floating-point numbers between processors
- Repeatability
  - run-to-run differences, e.g. the order in which quantities are summed
- Coherency
  - within-run differences, e.g. arithmetic varies from processor to processor

Prototype Codes
- http://www.netlib.org/scalapack/prototype/
  - PBLAS (version 2.0 ALPHA)
  - Packed storage routines for LLT, SEP, GSEP
  - Out-of-core linear solvers for LU, LLT, and QR
  - Matrix sign function for eigenproblems
  - SuperLU and SuperLU_MT
  - HPF interface to ScaLAPACK
- Distributed memory SuperLU coming soon!

Out of Core Software Approach
- High-level I/O interface
- ScaLAPACK uses a `right-looking' variant for the LU, QR and Cholesky factorizations.
- A `left-looking' variant is used for out-of-core factorization to reduce I/O traffic.
- Requires two in-core ...

Out-of-Core Performance
[Figure: QR factorization time in seconds on 64 processors of an Intel Paragon, out-of-core versus in-core, for square problem sizes 5000 to 22400.]

HPF Version
- Interface provided for a subset of routines:
  - Linear solvers (general, pos. def., least squares)
  - Eigensolver (pos. def.)
  - Matrix multiply & triangular solve

HPF Version
- Given these declarations:

      REAL, DIMENSION(1000,1000) :: A, B, C
!HPF$ DISTRIBUTE (CYCLIC(64), CYCLIC(64)) :: A, B, C

- Calls could be made as simple as:

      CALL LA_GESV(A, B)
      CALL LA_POSV(A, B)
      CALL LA_GELS(A, B)
      CALL LA_SYEV(A, W, Z)
      CALL LA_GEMM(A, B, C)
      CALL LA_TRSM(A, B)

- The Fortran 90 version follows these ideas.

ScaLAPACK - Ongoing Work
- Increasing flexibility and usability
  - Algorithm redistribution in the PBLAS
  - Removal of alignment restrictions
  - Algorithmic blocking
- Increased functionality
  - Divide and conquer for SEP and SVD
  - "Holy Grail" for the Symmetric Eigenvalue Problem
  - Sparse direct solver

Direct Sparse Solvers
- CAPSS is a package to solve Ax = b on a message passing multiprocessor; the matrix A is SPD and associated with a mesh in 2-D or 3-D. (Version for Intel & MPI)
- MFACT -- multifrontal sparse solver
  - http://www.cs.utk.edu/~padma/mfact.html
- SuperLU -- sequential and parallel implementations of Gaussian elimination with partial pivoting.
Sparse Gaussian Elimination
- SuperLU_MT, supernodal SMP version
- Designed to exploit the memory hierarchy
  - Serial supernode-panel organization permits "BLAS 2.5" performance
  - Up to 40% of machine peak on large sparse matrices on the IBM RS6000/590 and MIPS R8000, 25% on the Alpha 21164

SuperLU
- Several data structures redefined and redesigned for parallel access
- Barrier-less, nondeterministic implementation
- Task queue for load balance (over 95% on most large problems)
- For an n = 16614 matrix from a 3-D flow calculation, 66 nonzeros/row:

  Machine    | Nprocs | Speedup | Gflop/s | % Peak
  Alpha 8400 |   8    |   7     |  0.78   |  17
  Cray J90   |   8    |   6.5   |  0.83   |  25
  Cray C90   |   8    |   6.6   |  2.58   |  25

Parallel Sparse Eigenvalue Solvers
- P_ARPACK (D. Sorensen et al.)
  - Designed to compute a few eigenvalues and corresponding vectors of a general matrix.
  - Appropriate for large sparse or structured matrices A.
  - Based on an Arnoldi process called the Implicitly Restarted Arnoldi Method.
  - Reverse communication interface.

ParPre
- Library of parallel preconditioners for iterative solution methods for linear systems of equations.

Netlib downloads for ScaLAPACK material
- ScaLAPACK -- 4230
- ScaLAPACK HPF version -- 228
- SuperLU -- 78
- ARPACK -- 1129
- PARPACK -- 292
- CAPSS -- 479
- PARPRE -- 102
- manpages -- 3292
- This is a count of the downloads of the archive, tar file, or prebuilt libraries.

Related Projects

Java
- Java is likely to be a dominant language.
- Provides for machine-independent code.
- C++-like language
- No pointers, goto's, overloading of arithmetic operators, or memory deallocation
- Portability achieved via an abstract machine
- Java is a convenient user interface

JAVA
- Investigating the suitability of Java for math software libraries.
  - http://math.nist.gov/javanumerics/
- JAMA (Java Matrix Package)
  - BLAS, LU, QR, eigenvalue routines, SVD
- LAPACK to Java translator
  - http://www.cs.utk.edu/f2j/download.html
- Involved in the community discussion

LAPACK to JAVA
- Allows Java programmers access to BLAS/LAPACK routines.
- Translator to go from LAPACK to Java byte code
  - f2j: formal compiler of a subset of Fortran 77 sufficient for the BLAS & LAPACK
- Plan to enable all of LAPACK
  - The compiler provides quick, reliable translation.
- Focus on LAPACK Fortran
  - Simple -- no COMMON, EQUIVALENCE, SAVE, or I/O

Parameterized Libraries
- Architecture features for performance
  - Cache size
  - Latency
  - Bandwidth
- Latency tolerant algorithms
  - Latency reduction
  - Granularity management
- High levels of concurrency
- Issues of precision

Motivation for Network Enabled Solvers (Grid aware)
- Design an easy-to-use tool to provide efficient and uniform access to a variety of scientific packages on UNIX platforms
- Client-server design: a client sends a request to an agent, which replies with a choice of computational resources
- Non-hierarchical system
- Load balancing
- Fault tolerance
- Heterogeneous environment

NetSolve -- References
- Software and documentation can be obtained via the WWW or anonymous ftp:
  - http://www.cs.utk.edu/netsolve/
- Comments/questions to netsolve@cs.utk.edu
ATLAS
atlas@cs.utk.edu

What is ATLAS
- A package that adapts to differing architectures via code generation + timing
- 1-2 hours install time per precision
- Initially, supply the BLAS
- Package contains:
  - Code generators
  - Sophisticated timers
  - Robust search
- Currently provided:
  - Level 3 BLAS
    - Generated GEMM
    - Recursive GEMM-based Level 3 BLAS (Antoine Petitet)
- Next release:
  - Real matvec routines
  - Various Level 1 routines
  - Threading?

Why ATLAS is needed
- BLAS require many man-hours per platform
  - Only done if the financial incentive is there
    - Many platforms will never have an optimal version
  - Lags behind hardware
  - May not be affordable by everyone
  - Improves vendor code
- Operations may be important, but not general enough for a standard

Algorithmic approach for Level 3
- The only generated code is the on-chip multiply (a blocking sketch follows at the end of this ATLAS section)
- All BLAS operations are written in terms of the generated on-chip multiply
- All transpose cases are coerced through data copy to one case of on-chip multiply
  - Only 1 case generated per platform
[Figure: an M x N block of C computed as an M x K panel of A times a K x N panel of B, in NB-sized blocks.]

Code generation strategy
- Code is generated & timed until the optimal case is found. We try:
  - Differing NBs
  - Breaking false dependencies
  - M, N and K loop unrolling
- The on-chip multiply iteratively optimizes for:
  - TLB access
  - L1 cache reuse
  - FP unit usage
  - Memory fetch
  - Register reuse
  - Loop overhead minimization

[Figure: 500x500 DGEMM in Mflop/s across various architectures (DEC Alpha 21164a, HP PA-8000, IBM Power2, IBM PowerPC 604e, Pentium MMX/Pro/II, SGI R4600/R5000/R8000/R10000, Sun Ultra2, Sun Darwin): vendor matrix multiply versus ATLAS matrix multiply versus Fortran 77 reference BLAS.]
[Figure: right-looking LU factorization + solve of a 500x500 double precision matrix on the same architectures: LU with vendor BLAS versus LU with ATLAS BLAS, in Mflop/s.]

Extensions to present work
- Level 2 BLAS
  - GEMV for next release
- Threading
- Specialization for user applications
- Level 1 BLAS
- GEMM tweaks
  - Better small-case performance
  - Prefetch
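To make the "on-chip multiply" idea concrete, here is a hedged Fortran 77 sketch of an NB-blocked C := C + A*B; it shows only the blocking structure (a fixed NB, with N assumed to be a multiple of NB) and none of the unrolling, register blocking, or copy-to-contiguous tuning that ATLAS actually searches over.

      SUBROUTINE BLKMM( N, NB, A, B, C, LDA )
*     C := C + A*B for N x N matrices; N a multiple of NB.
      INTEGER            N, NB, LDA
      DOUBLE PRECISION   A( LDA, * ), B( LDA, * ), C( LDA, * )
      INTEGER            I, J, K, II, JJ, KK
      DO 40 JJ = 1, N, NB
         DO 30 KK = 1, N, NB
            DO 20 II = 1, N, NB
*              One NB x NB block triple: this inner kernel is the
*              only piece ATLAS actually generates and tunes.
               DO 13 J = JJ, JJ + NB - 1
                  DO 12 K = KK, KK + NB - 1
                     DO 11 I = II, II + NB - 1
                        C( I, J ) = C( I, J ) + A( I, K )*B( K, J )
   11                CONTINUE
   12             CONTINUE
   13          CONTINUE
   20       CONTINUE
   30    CONTINUE
   40 CONTINUE
      RETURN
      END

The blocking keeps each NB x NB block of A, B, and C resident in the L1 cache while 2*NB**3 flops are performed on it, which is the effect the ATLAS search is trying to maximize.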
Sparse Linear Algebra Operations
- Reuse of present work
  - GEMV
  - Dense optimizations
  - Caching parameters
  - TLB info
- Runtime adaptation
  - Sparsity analysis
- Compile-time adaptations
  - Iterative code improvement
- Adaptive libraries
- Many open-ended questions

Java ATLAS
- Java wrappers around C native methods
  - Supply full performance for Java programs
  - Under development now
- Java reference implementation
  - Provides a BLAS interface for applets
  - Under development now
- Need a Java code generator for a pure-Java BLAS, usable by applets

Related Projects
- FFTW  www.theory.lcs.mit.edu/~fftw
  - Fastest FFT in the West
  - Automatically tuned FFTs in ANSI C
- PHiPAC  www.icsi.berkeley.edu/~bilmes/phipac
  - Portable High Performance ANSI C
  - Automatically tuned BLAS

The ATLAS team
- Jack Dongarra, Clint Whaley, Jeff Horner, Antoine Petitet

Basic Linear Algebra Subprograms (BLAS)

BLAS -- Introduction
- Clarity: code is shorter and easier to read,
- Modularity: gives the programmer larger building blocks,
- Performance: manufacturers will provide tuned machine-specific BLAS,
- Program portability: machine dependencies are confined to the BLAS.

Memory Hierarchy
- The key to high performance is effective use of the memory hierarchy:
  Registers / L1 Cache / L2 Cache / Local Memory / Remote Memory / Secondary Memory
- True on all architectures.

Level 1, 2 and 3 BLAS
- Level 1 BLAS: vector-vector operations
- Level 2 BLAS: matrix-vector operations
- Level 3 BLAS: matrix-matrix operations

Why Higher Level BLAS?
- We can only do arithmetic on data at the top of the hierarchy.
- Higher level BLAS lets us do this (illustrated in the sketch following the performance figure below):

  BLAS                  | Memory Refs | Flops  | Flops / Memory Refs
  Level 1 (y = y + ax)  |    3n       |  2n    |   2/3
  Level 2 (y = y + Ax)  |    n^2      |  2n^2  |   2
  Level 3 (C = C + AB)  |    4n^2     |  2n^3  |   n/2

BLAS for Performance
[Figure: IBM RS/6000-590 (66 MHz, 264 Mflop/s peak); Mflop/s versus order of vectors/matrices from 10 to 500; Level 3 BLAS well above Level 2 BLAS, which is above Level 1 BLAS.]
- Development of blocked algorithms is important for performance.
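The ratios in the table above can be seen directly in code: the two fragments below compute the same C := C + A*B, once as n**2 Level 1 calls and once as a single Level 3 call. The interfaces are the standard BLAS ones; the matrix size is an illustrative assumption, and in a real code one would of course use only one of the two versions.

      INTEGER            N
      PARAMETER          ( N = 500 )
      DOUBLE PRECISION   A( N, N ), B( N, N ), C( N, N )
      INTEGER            J, K
*     Level 1 version: each DAXPY does 2n flops against 3n memory
*     references, so the data must stream from memory n**2 times.
      DO 20 J = 1, N
         DO 10 K = 1, N
            CALL DAXPY( N, B( K, J ), A( 1, K ), 1, C( 1, J ), 1 )
   10    CONTINUE
   20 CONTINUE
*     Level 3 version: the same 2n**3 flops over O(n**2) data, so a
*     tuned DGEMM can block A, B and C into cache and reuse them.
      CALL DGEMM( 'N', 'N', N, N, N, 1.0D+0, A, N, B, N, 1.0D+0,
     $            C, N )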
BLAS -- References
- BLAS software and documentation can be obtained via:
  - WWW: http://www.netlib.org/blas,
  - (anonymous) ftp ftp.netlib.org: cd blas; get index
  - email netlib@www.netlib.org with the message: send index from blas
- Comments and questions can be addressed to lapack@cs.utk.edu

BLAS Papers
- C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, 5:308--325, 1979.
- J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, An Extended Set of Fortran Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 14(1):1--32, 1988.
- J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 16(1):1--17, 1990.

BLAS Technical Forum
http://www.netlib.org/blas/blast-forum/blast-forum.html
- Established a forum to consider expanding the BLAS in light of modern software, language, and hardware developments.
- Minutes available from each meeting
- Working proposals for the following:
  - Dense/Band BLAS
  - Sparse BLAS
  - Extended Precision BLAS
  - Interval BLAS
  - C interfaces to the Legacy BLAS
- Final document in preparation

Aims
- Expanding the BLAS in a number of ways in light of modern software, language, and hardware developments.
- Enable linear algebra libraries (both public domain and commercial) to interoperate efficiently and easily.
- Attendees are application programmers, library developers, and vendors.

Goals
- Stimulate thought and discussion
- Functionality and future development of a set of standards for basic matrix data structures, both dense and sparse,
- Calling sequences for a set of low-level computational kernels for the parallel ...

Language Bindings
- Model implementation and test suite for all routines
  - Fortran 95
  - Fortran 77
  - C
- Environmental enquiry routine
- Future forum for C++, Java, ...

Dense and Banded BLAS
- Extensions
  - LAPACK auxiliary routines
  - "Missing" BLAS routines
  - Combined operations -- DOT_AXPY, GEMVT, TRMT, GEMVER
  - Multiple instances -- generate and apply simultaneous Givens transformations

Sparse BLAS
- Small subset of BLAS functionality
  - Matrix multiply and triangular solve
  - Does not include operations where both operands are sparse
- Many storage schemes for sparse matrices
- Generic interface
  - Handle-based interface for matrices
  - Allows an implementation independent of the storage scheme

Extended and Mixed-Precision BLAS
- Extended precision is used internally; input and output arguments are the same
- Mixed precision refers to having some input/output arguments in single and some in double precision
- Simpler, faster, more accurate algorithms
- Different machines support extended precision in different ways

Interval BLAS
- Data are represented by machine-representable intervals to avoid truncation and round-off error in floating-point arithmetic
- Mathematical operations involving intervals, interval vectors, and dense, banded, and tridiagonal interval matrices
- Subset of the functionality in the Dense and Band BLAS chapter

Journal of Development
- Environmental routines
- Distributed Memory Parallel BLAS
- Lite/Thin BLAS interfaces
- Fortran 95 Thin BLAS interface

Interfaces to the Legacy BLAS
- C interface to the Legacy BLAS
- Fortran 95 interface to the Legacy BLAS

Conclusions
- Propose new standards for many flavors of the BLAS in light of software, language, and hardware developments
- Adopted by vendors, ISVs
- Users will be able to have more efficient applications, with machine dependencies isolated to the BLAS

Linear Algebra PACKage (LAPACK)

EISPACK and LINPACK
- EISPACK
  - Designed for the algebraic eigenvalue problem, $Ax = \lambda x$ and $Ax = \lambda Bx$.
  - Work of J. Wilkinson and colleagues in the 70's.
  - Fortran 77 software based on translation of ALGOL.
- LINPACK
  - Designed for solving systems of equations, $Ax = b$.
  - Fortran 77 software using the Level 1 BLAS.

History of Block-Partitioned Algorithms
- Early algorithms involved the use of a small main memory, using tapes as secondary storage.
- Recent work centers on the use of vector registers, level 1 and 2 cache, main memory, and "out of core" memory.
Block-Partitioned Algorithms
- LU factorization
- Cholesky factorization
- Symmetric indefinite factorization
- Matrix inversion
- QR, QL, RQ, LQ factorizations
- Form Q or Q^T C
- Orthogonal reduction to:
  - (upper) Hessenberg form
  - symmetric tridiagonal form
  - bidiagonal form
- Block QR iteration for nonsymmetric eigenvalue problems

LAPACK
- Linear algebra library in Fortran 77
  - Solution of systems of equations
  - Solution of eigenvalue problems
- Combines algorithms from LINPACK and EISPACK into a single package
- Efficient on a wide range of computers
  - RISC, vector, SMPs
- User interface similar to LINPACK
  - Single, Double, Complex, Double Complex
- Built on the Level 1, 2, and 3 BLAS

LAPACK
- Most of the parallelism is in the BLAS.
- Advantages of using the BLAS for parallelism:
  - Clarity
  - Modularity
  - Performance
  - Portability

Derivation of Blocked Algorithms
Cholesky factorization $A = U^T U$:
$$
\begin{pmatrix} A_{11} & a_j & A_{13} \\ a_j^T & a_{jj} & \nu_j^T \\ A_{13}^T & \nu_j & A_{33} \end{pmatrix}
=
\begin{pmatrix} U_{11}^T & 0 & 0 \\ u_j^T & u_{jj} & 0 \\ U_{13}^T & \mu_j & U_{33}^T \end{pmatrix}
\begin{pmatrix} U_{11} & u_j & U_{13} \\ 0 & u_{jj} & \mu_j^T \\ 0 & 0 & U_{33} \end{pmatrix}
$$
Equating coefficients of the $j$th column, we obtain
$$a_j = U_{11}^T u_j, \qquad a_{jj} = u_j^T u_j + u_{jj}^2 .$$
Hence, if $U_{11}$ has already been computed, we can compute $u_j$ and $u_{jj}$ from the equations
$$U_{11}^T u_j = a_j, \qquad u_{jj}^2 = a_{jj} - u_j^T u_j .$$

LINPACK Implementation
- Here is the body of the LINPACK routine SPOFA, which implements the method:

         DO 30 J = 1, N
            INFO = J
            S = 0.0E0
            JM1 = J - 1
            IF( JM1.LT.1 ) GO TO 20
            DO 10 K = 1, JM1
               T = A( K, J ) - SDOT( K-1, A( 1, K ), 1, A( 1, J ), 1 )
               T = T / A( K, K )
               A( K, J ) = T
               S = S + T*T
   10       CONTINUE
   20       CONTINUE
            S = A( J, J ) - S
C        ...EXIT
            IF( S.LE.0.0E0 ) GO TO 40
            A( J, J ) = SQRT( S )
   30    CONTINUE

LAPACK Implementation

      DO 10 J = 1, N
*        Solve the triangular system for column j (Level 2 BLAS)
         CALL STRSV( 'Upper', 'Transpose', 'Non-Unit', J-1, A, LDA,
     $               A( 1, J ), 1 )
         S = A( J, J ) - SDOT( J-1, A( 1, J ), 1, A( 1, J ), 1 )
         IF( S.LE.ZERO ) GO TO 20
         A( J, J ) = SQRT( S )
   10 CONTINUE

- This change by itself is sufficient to significantly improve the performance on a number of machines: from 72 to 251 Mflop/s for a matrix of order 500 on one processor of a CRAY Y-MP.
- However, only 378 Mflop/s on 8 processors of a CRAY Y-MP. This suggests further work is needed.

Derivation of Blocked Algorithms
$$
\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{12}^T & A_{22} & A_{23} \\ A_{13}^T & A_{23}^T & A_{33} \end{pmatrix}
=
\begin{pmatrix} U_{11}^T & 0 & 0 \\ U_{12}^T & U_{22}^T & 0 \\ U_{13}^T & U_{23}^T & U_{33}^T \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} & U_{13} \\ 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{pmatrix}
$$
Equating coefficients of the second block of columns, we obtain
$$A_{12} = U_{11}^T U_{12}, \qquad A_{22} = U_{12}^T U_{12} + U_{22}^T U_{22} .$$
Hence, if $U_{11}$ has already been computed, we can compute $U_{12}$ as the solution of $U_{11}^T U_{12} = A_{12}$ by a call to the Level 3 BLAS routine STRSM, and then obtain $U_{22}$ from $U_{22}^T U_{22} = A_{22} - U_{12}^T U_{12}$.

LAPACK Blocked Algorithms

      DO 10 J = 1, N, NB
*        JB is the width of the current block column
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-Unit', J-1,
     $               JB, ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A( 1, J ),
     $               LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE

- On the Y-MP, Level 3 BLAS squeezes a little more out of 1 processor, but makes a large improvement when using 8 processors.
  Cholesky, n = 500 (Mflop/s)  | CRAY Y-MP, 1 proc | CRAY Y-MP, 8 procs
  j-variant: LINPACK           |   72              |    72
  j-variant: Level 2 BLAS      |  251              |   378
  j-variant: Level 3 BLAS      |  287              |  1225
  k-variant: Level 3 BLAS      |  290              |  1414

LAPACK Contents
- Combines algorithms from LINPACK and EISPACK into a single package.
- User interface similar to LINPACK.
- Built on the Level 1, 2 and 3 BLAS, for high performance (manufacturers optimize the BLAS).
- LAPACK does not provide routines for structured problems or general sparse matrices (i.e. sparse storage formats such as compressed-row, -column, -diagonal, ...).

LAPACK -- Motivations
- Basic problems:
  - Linear systems: $Ax = b$
  - Least squares: $\min \|Ax - b\|_2$, $A = U \Sigma V^T$
  - Eigenvalues and vectors: $Ax = \lambda x$, $Ax = \lambda Bx$
- Uses (manufacturer-optimized) BLAS ==> high performance
- (Trans)portable Fortran 77 (+BLAS), 805K lines (386K library, 419K testing and timing)

LAPACK -- Release 3.0
- Evolved into a worldwide community effort
- Added functionality:
  - Holy Grail (SEP using Relatively Robust Representations)
  - divide and conquer SVD,
  - divide and conquer least squares,
  - new d&c and expert drivers for GSEP,
  - new simple and expert drivers for GNEP,
  - faster QRP,
  - faster solver for the rank-deficient LS (xGELSY)

LAPACK -- Release 3.0
- Performance enhancements to existing routines
- LAPACK Users' Guide, Third Edition

LAPACK Ongoing Work
- Add functionality
  - updating/downdating, bidiagonal bisection, bidiagonal inverse iteration, band SVD, Jacobi methods, ...
- Move to a new generation of high performance machines
  - IBM SPs, CRAY T3E, SGI Origin 2000, clusters of workstations
- New challenges
  - New languages: FORTRAN 90, HP FORTRAN, ...
  - Many flavors of message passing (CMMD, MPL, NX, ...); need standards (PVM, MPI): BLACS
  - Highly varying ratio of computational speed to communication speed
  - Many ways to lay out data
  - The fastest parallel algorithm is sometimes less stable numerically.

LAPACK -- Summary
- LAPACK success: from the HP-48G to the Cray C-90
- (Trans)portable Fortran 77 (+BLAS), 805K lines (386K library, 419K testing and timing)
- Targets workstations, vector machines, and shared memory computers
- Initial release February 1992
- 1,066,000 requests since then (as of 6/97)
- Linear systems, least squares, eigenproblems

LAPACK -- Summary contd
- LAPACK Users' Guide, Second Edition, available from SIAM
  - http://www.netlib.org/lapack/lug/lapack_lug.html
- Being translated into Japanese and Russian
- Third release and manual in Spring 1999

LAPACK -- References
- E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, Second Edition, SIAM, Philadelphia, PA, 1995.
- LAPACK software and documentation can be obtained via the WWW or anonymous ftp:
  - http://www.netlib.org/lapack,
  - http://www.netlib.org/lapack/lawns,
  - http://www.netlib.org/clapack,
  - http://www.netlib.org/c++/lapack++,
  - http://www.netlib.org/lapack90,
  - (anonymous) ftp ftp.netlib.org
- Comments and questions can be addressed to lapack@cs.utk.edu

Basic Linear Algebra Communication Subprograms (BLACS)

BLACS -- Introduction
- A design tool: a conceptual aid in design and coding.
- Associate widely recognized mnemonic names with communication operations; this improves
  - program readability,
  - the self-documenting quality of the code.
- Promote efficiency by identifying frequently occurring operations of linear algebra which can be optimized on various computers.

BLACS -- Intro contd.
- Program portability is improved through standardization of kernels, without sacrificing efficiency, since optimized versions are or will be available.

BLACS -- Basics
- Processes are embedded in a two-dimensional grid. Example, a 3 x 4 grid:

            0    1    2    3
       0    0    1    2    3
       1    4    5    6    7
       2    8    9   10   11

BLACS -- Basics
- An operation which involves more than one sender and one receiver is called a "scoped operation".
- Using a 2-D grid, there are 3 natural scopes:

  Scope  | Meaning
  Row    | All processes in a process row participate.
  Column | All processes in a process column participate.
  All    | All processes in the process grid participate.

BLACS -- Basics

      CALL BLACS_PINFO( IAM, NPROCS )
*     If the underlying system needs additional setup, do it now
      IF( NPROCS.LT.1 ) THEN
         IF( IAM.EQ.0 ) NPROCS = 4
         CALL BLACS_SETUP( IAM, NPROCS )
      END IF
*     Get default system context
      CALL BLACS_GET( 0, 0, ICTXT )

BLACS -- Basics

*     Define a 1 x (NPROCS/2+1) process grid
      NPROW = 1
      NPCOL = NPROCS / 2 + 1
      CALL BLACS_GRIDINIT( ICTXT, 'Row', NPROW, NPCOL )
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
*     If I'm not in the grid, go to the end of the program
      IF( MYROW.EQ.-1 ) GO TO 10
      ...
      CALL BLACS_GRIDEXIT( ICTXT )
   10 CONTINUE
      CALL BLACS_EXIT( 0 )
      END

BLACS -- Basics
- Operations are matrix-based and ID-less.
- Types of BLACS routines: point-to-point communication, broadcast, combine operations, and support routines.

BLACS -- Communication Routines
- Send/Receive: send a (sub)matrix from one process to another:
  - _xxSD2D( ICTXT, [UPLO, DIAG], M, N, A, LDA, RDEST, CDEST )
  - _xxRV2D( ICTXT, [UPLO, DIAG], M, N, A, LDA, RSRC, CSRC )

  _ (data type)        | xx (matrix type)
  I: Integer           | GE: general rectangular matrix
  S: Real              | TR: trapezoidal matrix
  D: Double Precision  |
  C: Complex           |
  Z: Double Complex    |

BLACS -- Point to Point

      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
         CALL DGESD2D( ICTXT, 5, 1, X, 5, 1, 0 )
      ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.0 ) THEN
         CALL DGERV2D( ICTXT, 5, 1, Y, 5, 0, 0 )
      END IF

BLACS -- Communication Routines
- Broadcast: send a (sub)matrix to all processes or a subsection of processes in SCOPE, using various distribution patterns (TOP):
  - _xxBS2D( ICTXT, SCOPE, TOP, [UPLO, DIAG], M, N, A, LDA )
  - _xxBR2D( ICTXT, SCOPE, TOP, [UPLO, DIAG], M, N, A, LDA, RSRC, CSRC )

  SCOPE    | TOP
  'Row'    | ' ' (default)
  'Column' | 'Increasing Ring'
  'All'    | '1-tree', ...
BLACS -- Broadcast

      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
      IF( MYROW.EQ.0 ) THEN
         IF( MYCOL.EQ.0 ) THEN
            CALL DGEBS2D( ICTXT, 'Row', ' ', 2, 2, A, 3 )
         ELSE
            CALL DGEBR2D( ICTXT, 'Row', ' ', 2, 2, A, 3, 0, 0 )
         END IF
      END IF

BLACS -- Combine Operations
- Global combine operations: perform element-wise SUM, |MAX|, |MIN| operations on triangular matrices:
  - _GSUM2D( ICTXT, SCOPE, TOP, M, N, A, LDA, RDEST, CDEST )
  - _GAMX2D( ICTXT, SCOPE, TOP, M, N, A, LDA, RA, CA, RCFLAG, RDEST, CDEST )
  - _GAMN2D( ICTXT, SCOPE, TOP, M, N, A, LDA, RA, CA, RCFLAG, RDEST, CDEST )
- RDEST = -1 indicates that the result of the operation should be left on all processes selected by SCOPE.

BLACS -- Combine operations
- For |MAX| and |MIN|: when RCFLAG = -1, RA and CA are not referenced; otherwise, these arrays are set on output with the coordinates of the process owning the corresponding maximum (resp. minimum) element in absolute value of A. Both arrays must be at least of size M x N, and their leading dimension is specified by RCFLAG.

BLACS -- Combine

      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
      IF( MYROW.EQ.0 ) THEN
*        Sum left only on process (0,0) of the row ...
         K00 = MYCOL
         CALL IGSUM2D( ICTXT, 'Row', ' ', 1, 1, K00, 1, 0, 0 )
*        ... versus left on every process of the row (RDEST = -1)
         KALL = MYCOL
         CALL IGSUM2D( ICTXT, 'Row', ' ', 1, 1, KALL, 1, -1, 0 )
      END IF

BLACS -- Combine

*     Compute a local epsilon, then agree on a grid-wide value by
*     taking the maximum over all processes
      EPS = 1.0D+0
   10 CONTINUE
      EPS = EPS / 2.0D+0
      IF( ( 1.0D+0 + EPS ).GT.1.0D+0 ) GO TO 10
      CALL DGAMX2D( ICTXT, 'All', ' ', 1, 1, EPS, 1, IRSRC, ICSRC,
     $              -1, -1, 0 )

BLACS -- Advanced Topics
- The BLACS context is the BLACS mechanism for partitioning communication space. A defining property of a context is that a message in a context cannot be sent or received in another context.
- The BLACS context includes the definition of a grid and the process coordinates within it. It allows the user to create arbitrary groups of processes, create multiple overlapping and/or disjoint grids, and isolate each process grid so that grids do not interfere with each other.
- MPI communicator <-> BLACS context

BLACS -- Advanced Topics
- The BLACS provide "globally-blocking" point-to-point receive, broadcasts and combines. The BLACS point-to-point send is "locally-blocking".
  - Globally blocking: the return from the operation implies that the buffer is available for re-use. The operation may not complete unless the complement of the operation is called.
  - Locally blocking: the return from the send implies that the buffer is available for re-use. The send will complete regardless of whether the corresponding receive is posted.

BLACS -- Advanced Topics
- This exchange is safe because the sends are locally blocking (both processes send before they receive):

      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
         CALL DGESD2D( ICTXT, 5, 1, X, 5, 1, 0 )
         CALL DGERV2D( ICTXT, 5, 1, X, 5, 1, 0 )
      ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.0 ) THEN
         CALL DGESD2D( ICTXT, 5, 1, X, 5, 0, 0 )
         CALL DGERV2D( ICTXT, 5, 1, X, 5, 0, 0 )
      END IF

BLACS -- Advanced Topics
- The topology parameter allows the user to vary the cost per process of the messages involved in a distributed (broadcast or combine) operation.
- There are two main classes of topologies:
  - Pipelining topologies (ring-based),
  - Non-pipelining topologies (tree-based).
- Topologies enhance performance.
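A hedged illustration of the topology parameter in use: the fragment below is the earlier row broadcast with TOP changed from the default to the 'Increasing Ring' topology listed above, which pipelines the message around the process row; only the TOP argument differs.

      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
      IF( MYCOL.EQ.0 ) THEN
*        Root of the broadcast along each process row
         CALL DGEBS2D( ICTXT, 'Row', 'Increasing Ring', 2, 2, A, 3 )
      ELSE
*        Receivers name the source process (MYROW, 0)
         CALL DGEBR2D( ICTXT, 'Row', 'Increasing Ring', 2, 2, A, 3,
     $                 MYROW, 0 )
      END IF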
BLACS -- Advanced Topics
- A routine is repeatable iff it is guaranteed to give the same answer if called multiple times with the same parallel configuration and input.
- A routine is coherent iff all processes selected to receive an answer get identical results.

BLACS -- Advanced Topics
- Repeatability and coherence do not affect correctness. A routine may be both incoherent and nonrepeatable and still give correct output; it is just that inaccuracies in floating point calculations may cause the routine to return differing values, all of which are equally valid.

BLACS -- Advanced Topics
- Examples:
  - Coherence: ring-based sum reduction,
  - Repeatability: race conditions in message-passing may influence the order of floating point computations.

BLACS -- Advanced Topics
- Homogeneous coherency: all processes selected to possess the result receive the exact same answer if:
  - Communication does not change the value of the communicated data,
  - All processes perform floating point arithmetic exactly the same.
- Heterogeneous coherency: all processes will receive the exact same answer if communication does not change the value of the communicated data.

BLACS -- Advanced Topics
- Examples:
  - One does not want to use a stopping criterion based on incoherent results,
  - Repeatability makes debugging easier.
- The BLACS assume the communication is coherent.
- For combine operations, the BLACS allow the user to set flags enforcing combines to be repeatable and/or heterogeneous coherent. Setting these flags may result in a loss of performance.

BLACS -- Example Programs
- HELLO is one of the simplest BLACS programs. This program takes the available processes, forms them into a process grid, and then has each process check in with the process at (0,0) in the process grid.
- PI calculates the value of pi, using numerical integration with parallel processing, and clocks the solution time.
- PROCMAP illustrates how to map processes to a grid using BLACS_GRIDMAP. This is a fundamental example using simultaneous multiple (possibly overlapping) grids.

BLACS -- References
- BLACS software and documentation can be obtained via WWW and anonymous ftp:
  - http://www.netlib.org/blacs/
  - (anonymous) ftp ftp.netlib.org
- Comments/questions to blacs@cs.utk.edu
- R. C. Whaley, Basic Linear Algebra Communication Subprograms: Analysis and Implementation Across Multiple Parallel Architectures, UT Tech Report CS-94-234, LAPACK Working Note #73, 1994.
- J. Dongarra and R. C. Whaley, A User's Guide to the BLACS v1.1, UT Tech Report CS-95-281, LAPACK Working Note #94, 1995 (updated May 1997).

Parallel Basic Linear Algebra Subprograms (PBLAS)

PBLAS -- Introduction
- Parallel Basic Linear Algebra Subprograms for distributed-memory MIMD computers
- Do both the communication and the computation.
- Simplification of the parallelization, especially when BLAS-based,
- Modularity: gives the programmer larger building blocks,
- Portability: machine dependencies are confined to the BLAS and BLACS.
Scope of the PBLAS
- Level 1 PBLAS: vector-vector operations
- Level 2 PBLAS: matrix-vector operations
- Level 3 PBLAS: matrix-matrix operations

Scope of the PBLAS
- No vector rotations,
- No dedicated subprograms for banded matrices,
- Matrix transposition available.
- Prototype version of packed storage PBLAS:
  - http://www.netlib.org/scalapack/prototype/

PBLAS -- Naming Conventions
- The PBLAS naming scheme follows the conventions of the BLAS, with the necessary adaptations: P_XXYYY

  _ (data type)        | XX (matrix type)
  S: Real              | GE: all matrix operands are general rectangular
  D: Double Precision  | HE: one of the matrix operands is Hermitian
  C: Complex           | SY: one of the matrix operands is symmetric
  Z: Double Complex    | TR: one of the matrix operands is triangular

PBLAS -- Naming Conventions
- The Level 1 routines use meaningful abbreviations, for example PSCOPY, PSDOT, ...
- The matrix transposition routine is called P_TRANx.

  YYY (operation)
  MM : matrix-matrix product
  MV : matrix-vector product
  R  : rank-1 update of a matrix
  R2 : rank-2 update of a matrix
  RK : rank-k update of a symmetric/Hermitian matrix
  R2K: rank-2k update of a symmetric/Hermitian matrix
  SM : solves a system of linear equations for a matrix of right-hand sides
  SV : solves a system of linear equations for a right-hand-side vector

PBLAS
- Similar to the BLAS in functionality and naming.
- Built on the BLAS and BLACS.
- Provide a global view of the matrix:

      CALL DGEXX ( M, N, A( IA, JA ), LDA, ... )
      CALL PDGEXX( M, N, A, IA, JA, DESCA, ... )

PBLAS -- Syntax
- Global view of the matrix operands, allowing global addressing of distributed matrices (hiding complex local indexing): the routines operate on the submatrix A(IA:IA+M-1, JA:JA+N-1) of an M_-by-N_ global matrix.

Data Distributions
- Global view of the matrix operands, allowing global addressing of distributed matrices (hiding complex local indexing),
- Fits with the distribution patterns of HPF,
- Load balance maintained.

PBLAS -- Storage Conventions
- An M_-by-N_ matrix is block-partitioned into MB_-by-NB_ blocks and distributed according to the two-dimensional block-cyclic scheme
  - load balanced computations, scalability
- Locally, the scattered columns are stored contiguously (Fortran "column major")
  - re-use of the BLAS (leading dimension LLD_)

Distribution and Storage
- The matrix is block-partitioned and blocks are mapped by the distributed 2-D block-cyclic scheme. A 5x5 matrix partitioned in 2x2 blocks, and the 2x2 process grid point of view:

  Global 5x5 matrix:            Per-process view (2x2 grid):
  A11 A12  A13 A14  A15         (0,0): A11 A12 A15   (0,1): A13 A14
  A21 A22  A23 A24  A25                A21 A22 A25          A23 A24
  A31 A32  A33 A34  A35                A51 A52 A55          A53 A54
  A41 A42  A43 A44  A45         (1,0): A31 A32 A35   (1,1): A33 A34
  A51 A52  A53 A54  A55                A41 A42 A45          A43 A44

- Routines are available to distribute/redistribute data.

PBLAS -- Storage Conventions
- Descriptor DESC_: a 9-integer array describing the matrix layout, containing DTYPE_, CTXT_, M_, N_, MB_, NB_, RSRC_, CSRC_, LLD_, where (RSRC_, CSRC_) are the coordinates of the process owning the first matrix entry in the grid specified by CTXT_.
- Example: M_ = N_ = 5, MB_ = NB_ = 2, RSRC_ = CSRC_ = 0, LLD_ >= 3 (in process row 0) and >= 2 (in process row 1).
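For reference, the block-cyclic map just described can be written out explicitly; this is a sketch of the standard convention (1-based global row index $I$, row block size $MB\_$, $P_r$ process rows, and $RSRC\_ = 0$), with columns treated identically:

$$p = \Big\lfloor \frac{I-1}{MB\_} \Big\rfloor \bmod P_r, \qquad
  l = \Big\lfloor \frac{I-1}{P_r \, MB\_} \Big\rfloor, \qquad
  x = \big( (I-1) \bmod MB\_ \big) + 1,$$

so global row $I$ is stored in local row $l \cdot MB\_ + x$ of process row $p$. In the $5 \times 5$ example above ($MB\_ = 2$, $P_r = 2$), global row 5 gives $p = 0$, $l = 1$, $x = 1$: it lands in local row 3 on process row 0, which is why $LLD\_ \ge 3$ there but only $\ge 2$ in process row 1.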
PBLAS -- Auxiliary Subprograms
- Using the BLACS topologies for efficiency and scalability:

      SUBROUTINE PTOPSET( ICTXT, OP, SCOPE, TOP )
      SUBROUTINE PTOPGET( ICTXT, OP, SCOPE, TOP )
      INTEGER            ICTXT
      CHARACTER*1        OP, SCOPE, TOP

- PTOPSET assigns the BLACS topology TOP to be used in the communication operations OP along the scope specified by SCOPE. PTOPGET retrieves the topology.
- These routines should be called by every process in the grid identified by the BLACS context ICTXT.

PBLAS -- Auxiliary Subprograms
- Disposing of the PBLAS buffer allocated in every process's dynamic memory:

      SUBROUTINE PBFREEBUF()

PBLAS -- Rationale
- Software reusability: the BLAS and PBLAS interfaces are similar,
- Level 1 PBLAS: the output scalar is only correct within the operand scope,
- The Level 3 PBLAS use a "shaped adaptive" procedure to perform the computations,
- The PBLAS routines follow an SPMD programming model (every process in the grid must call the routine).

PBLAS -- Rationale contd
- The BLACS context provides the ability to have separate "universes" of message passing: safe co-existence with other parallel libraries,
- The PBLAS do not perform "inter-context" operations.

PBLAS -- Examples

      INTEGER            IAM, ICTXT, INFO, LDA, LDB, LDC, NMAX, NPROCS
      PARAMETER          ( NMAX = 3, LDA = NMAX, LDB = NMAX,
     $                   LDC = NMAX )
      INTEGER            DESCA( 9 ), DESCB( 9 ), DESCC( 9 )
      DOUBLE PRECISION   A( NMAX, NMAX ), B( NMAX, NMAX ),
     $                   C( NMAX, NMAX )
      CALL BLACS_PINFO( IAM, NPROCS )
      IF( NPROCS.LT.1 ) THEN
         NPROCS = 4
         CALL BLACS_SETUP( IAM, NPROCS )
      END IF
      CALL BLACS_GET( -1, 0, ICTXT )
      CALL BLACS_GRIDINIT( ICTXT, 'Row', 2, 2 )

PBLAS -- Examples

      CALL DESCINIT( DESCA, 5, 5, 2, 2, 0, 0, ICTXT, LDA, INFO )
      CALL DESCINIT( DESCB, 5, 5, 2, 2, 0, 0, ICTXT, LDB, INFO )
      CALL DESCINIT( DESCC, 5, 5, 2, 2, 0, 0, ICTXT, LDC, INFO )
      ...
      CALL PDGEMM( 'No transpose', 'No transpose', 4, 4, 4, 1.0D+0,
     $             A, 1, 1, DESCA, B, 1, 1, DESCB, 0.0D+0,
     $             C, 1, 1, DESCC )
      ...
      CALL PBFREEBUF()
      CALL BLACS_GRIDEXIT( ICTXT )
      CALL BLACS_EXIT( 0 )

PBLAS -- Example Programs
- PBLAS shows how to set the matrix descriptors and call the Level 1, 2, and 3 PBLAS routines.
- SUMMA illustrates the use of the BLAS and the BLACS to implement an efficient and scalable matrix-matrix multiply routine.

PBLAS -- Example Programs
- This implementation of SUMMA was first suggested by:
  - R. Agarwal, F. Gustavson, and M. Zubair, A High Performance Matrix Multiplication Algorithm on a Distributed-Memory Parallel Computer, Using Overlapped Communication, IBM Journal of Research and Development, 38(6):673--681, 1994.
- For a scalability analysis of this algorithm see:
  - R. van de Geijn and J. Watts, SUMMA: Scalable Universal Matrix Multiplication Algorithm, UT Tech Report CS-95-286, LAPACK Working Note #96, 1995.

Features of PBLAS V2 ALPHA
- Software is backward compatible with the current version 1.5.
- Improved ease-of-use:
  - Previous alignment restrictions have all been removed. Re-alignment is performed on-the-fly and only when necessary.
  - General rectangular block-cyclic distribution of the operands is supported for all routines.
Features of PBLAS V2 ALPHA
- Support for replicated operands
  - The increased functionality is truly usable, as illustrated by the packed storage ScaLAPACK prototype routines
  - Enables the expression of currently missing algorithms in the ScaLAPACK dense library, such as constrained linear least squares solvers.
  - Simplifies the expression of complex algorithms such as divide and conquer algorithms (sign function).

Features of PBLAS V2 ALPHA
- Efficient for a wider range of possible distribution parameters:
  - Algorithmic blocking techniques used throughout the Level 2 and 3 PBLAS.
    - Enables the use of such techniques at the ScaLAPACK level.
- Testing and timing programs upgraded to test the above new functionalities.

Performance of PBLAS V2 ALPHA
- Specific (and more efficient) algorithms have been implemented for large numbers of right-hand sides, e.g., triangular solve, triangular matrix-multiply, symmetric or Hermitian matrix-multiply, rank-k and rank-2k updates.
- Efficiency improved on a wider range of problem sizes and machine configurations than version 1.5.

Performance of PBLAS V2 ALPHA
- Fine performance tuning is ongoing work; in particular, profiling and precise timing analysis of each component for various distributed-memory concurrent computers.
- Documentation as well as a software design description will be made available with the final release of PBLAS V2 (early 1998).
- PBLAS V2 will replace the current version.

PBLAS -- References
- J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley, A Proposal for a Set of Parallel Basic Linear Algebra Subprograms, UT Tech Report CS-95-292, LAPACK Working Note #100, 1995.
- PBLAS software and documentation can be obtained via the WWW and anonymous ftp:
  - http://www.netlib.org/scalapack
  - (anonymous) ftp ftp.netlib.org
- Comments/suggestions to scalapack@cs.utk.edu

Design of ScaLAPACK

ScaLAPACK Structure
- Global layer: ScaLAPACK, built on the PBLAS.
- Local layer: LAPACK and the BLAS for computation; the BLACS over PVM/MPI/... for communication.

Goals - Port LAPACK to Distributed-Memory Environments
- Efficiency
  - Optimized compute and communication engines
  - Block-partitioned algorithms (Level 3 BLAS) utilize the memory hierarchy and yield good node performance
- Reliability
  - Whenever possible, use LAPACK algorithms and error bounds.
- Scalability
  - As the problem size and number of processors grow
  - Replace LAPACK algorithms that did not scale; new ones go into LAPACK
- Portability
  - Isolate machine dependencies to the BLAS and the BLACS
- Flexibility
  - Modularity: build a rich set of linear algebra tools: BLAS, BLACS, PBLAS

Object-Based Design in Fortran 77
- Each global data object is assigned an array descriptor.
- The array descriptor
  - Contains the information required to establish the mapping between a global array entry and its corresponding process and memory location.
  - Is differentiated by the DTYPE_ (first entry) in the descriptor.
  - Provides a flexible framework to easily specify additional data distributions or matrix types.
- Uses the concept of context.

Array Descriptors
- Array descriptors are currently supported for:
  - dense matrices
  - band and tridiagonal matrices
  - out-of-core matrices
- The user must distribute all global arrays prior to the invocation of a ScaLAPACK routine.
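A hedged sketch of that distribution step: each process can size its local piece with the ScaLAPACK tool function NUMROC before building the descriptor. The grid is assumed to be already set up, and the matrix and block sizes are illustrative.

      INTEGER            NUMROC
      EXTERNAL           NUMROC
      INTEGER            DESCA( 9 ), ICTXT, INFO
      INTEGER            NPROW, NPCOL, MYROW, MYCOL, MLOC, NLOC
      INTEGER            M, N, MB, NB
      PARAMETER          ( M = 1000, N = 1000, MB = 64, NB = 64 )
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
*     Number of rows and columns of the global array owned locally,
*     under the block-cyclic map with first block on process (0,0).
      MLOC = NUMROC( M, MB, MYROW, 0, NPROW )
      NLOC = NUMROC( N, NB, MYCOL, 0, NPCOL )
*     The local array must be at least MLOC x NLOC; its leading
*     dimension becomes the LLD_ entry of the descriptor.
      CALL DESCINIT( DESCA, M, N, MB, NB, 0, 0, ICTXT,
     $               MAX( 1, MLOC ), INFO )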
Choosing a Data Distribution
- The two main issues in choosing a data layout for dense matrix computations are:
  - load balance, or splitting the work reasonably evenly among the processors throughout the algorithm, and
  - use of the Level 3 BLAS during computations on a single processor, to utilize the memory hierarchy on each processor.

Possible Data Layouts
- 1D block and cyclic column distributions
- 1D block-cyclic column and 2D block-cyclic distributions
- The 2D block-cyclic distribution is used in ScaLAPACK for dense matrices.

Two-dimensional Block-Cyclic Distribution
- 5x5 matrix partitioned in 2x2 blocks, mapped to a 2x2 process grid (see the layout shown in the PBLAS section above).
- Need redistribution routines to go from one distribution to another.

Two-dimensional Block-Cyclic Distribution
- Ensures good load balance --> performance and scalability,
- Encompasses a large number of (but not all) data distribution schemes,
- Need redistribution routines to go from one distribution to the other.

Array descriptor for Dense Matrices

  DESC_() | Symbolic name | Scope    | Definition
  1       | DTYPE_A       | (global) | Descriptor type; DTYPE_A = 1 for dense matrices.
  2       | CTXT_A        | (global) | BLACS context handle.
  3       | M_A           | (global) | Number of rows in the global array A.
  4       | N_A           | (global) | Number of columns in the global array A.
  5       | MB_A          | (global) | Blocking factor used to distribute the rows of array A.
  6       | NB_A          | (global) | Blocking factor used to distribute the columns of array A.
  7       | RSRC_A        | (global) | Process row over which the first row of the array A is distributed.
  8       | CSRC_A        | (global) | Process column over which the first column of the array A is distributed.
  9       | LLD_A         | (local)  | Leading dimension of the local array.

Narrow Band and Tridiagonal Matrices
- The ScaLAPACK routines solving narrow-band and tridiagonal linear systems assume
  - the narrow-band or tridiagonal coefficient matrix to be distributed in a block-column fashion, and
  - the dense matrix of right-hand-side vectors to be distributed in a block-row fashion.
- Divide-and-conquer algorithms have been implemented because they offer greater scope for exploiting parallelism than the corresponding adapted dense algorithms.

Array descriptor for Narrow Band Matrices

  DESC_() | Symbolic name | Scope    | Definition
  1       | DTYPE_A       | (global) | Descriptor type; DTYPE_A = 501 for a 1 x Pc process grid, for band and tridiagonal matrices block-column distributed.
  2       | CTXT_A        | (global) | BLACS context handle.
  3       | N_A           | (global) | Number of columns in the global array A.
  4       | NB_A          | (global) | Blocking factor used to distribute the columns of array A.
  5       | CSRC_A        | (global) | Process column over which the first column of the array A is distributed.
  6       | LLD_A         | (local)  | Leading dimension of the local array. For the tridiagonal subroutines, this entry is ignored.
  7       |               |          | Unused, reserved.

Array descriptor for Right Hand Sides for Narrow Band Linear Solvers

  DESC_() | Symbolic name | Scope    | Definition
  1       | DTYPE_B       | (global) | Descriptor type; DTYPE_B = 502 for a Pr x 1 process grid, for block-row distributed matrices.
  2       | CTXT_B        | (global) | BLACS context handle.
  3       | M_B           | (global) | Number of rows in the global array B.
  4       | MB_B          | (global) | Blocking factor used to distribute the rows of array B.
  5       | RSRC_B        | (global) | Process row over which the first row of the array B is distributed.
  6       | LLD_B         | (local)  | Leading dimension of the local array. For the tridiagonal subroutines, this entry is ignored.
  7       |               |          | Unused, reserved.
Error Handling
- Driver and computational routines perform global and local input error-checking.
  - Global checking -- synchronization
  - Local checking -- validity
- No input error-checking is performed on the auxiliary routines.
- If an error is detected in a PBLAS or BLACS routine, program execution is stopped.

Application Debugging Hints
- Look at the ScaLAPACK example programs.
- Always check the value of INFO on exit from a ScaLAPACK routine (a concrete sketch appears at the end of this section).
- Query for the size of workspace: LWORK = -1.
- Link to the Debug Level 1 BLACS (specified by BLACSDBGLVL=1 in Bmake.inc).

ScaLAPACK Implementation
- BLAS (performance and portability)
  - Blocked data access (Level 3 BLAS) yields good local performance,
  - Portability: standardization of efficient kernels.
- BLACS (performance and portability)
  - Correct level of notation: communication of matrices,
  - Efficiency: identify frequent linear algebra operations which can be optimized on various computers.

Functionality
- For Ax = b:

  Problem type        | SDrv | EDrv | Factor | Solve | Inv | Cond Est | Iter Refin
  Triangular          |      |      |        |  X    |  X  |    X     |    X
  SPD                 |  X   |  X   |   X    |  X    |  X  |    X     |    X
  SPD Banded          |  X   |      |   X    |  X    |     |          |
  SPD Tridiagonal     |  X   |      |   X    |  X    |     |          |
  General             |  X   |  X   |   X    |  X    |  X  |    X     |    X
  General Banded      |  X   |      |   X    |  X    |     |          |
  General Tridiagonal |  X   |      |   X    |  X    |     |          |
  Least squares       |  X   |      |   X    |  X    |     |          |
  GQR                 |      |      |   X    |       |     |          |
  GRQ                 |      |      |   X    |       |     |          |

- For Ax = λx or Ax = λBx ('+' denotes a prototype):

  Problem type     | SDrv | EDrv | Reduct | Solution
  Symmetric        |  X   |  X   |   X    |   X
  General          |  +   |      |   X    |   +
  Generalized BSPD |      |      |   X    |   X
  SVD              |  +   |      |   X    |   X

- Timing and testing routines exist for almost all routines; this is a large component of the package.
- Prebuilt libraries are available for the SP, PCA, O2K, PGON, ...

Functionality continued
- Orthogonal/unitary transformation routines
- Prototype codes:
  - PBLAS (version 2.0 ALPHA)
  - Packed storage routines for LLT, SEP, GSEP
  - Out-of-core linear solvers for LU, LLT, and QR
  - Matrix sign function for eigenproblems
  - SuperLU and SuperLU_MT
  - HPF interface to ScaLAPACK

Parallelism in ScaLAPACK
- Level 3 BLAS block operations
  - All the reduction routines
- Pipelining
  - QR algorithm, triangular solvers, classic factorizations
- Redundant computations
  - Condition estimators
- Static work assignment
  - Bisection
- Task parallelism
  - Sign function eigenvalue computations
- Divide and conquer
  - Tridiagonal and band solvers, symmetric eigenvalue problem, and sign function
- Cyclic reduction
  - Reduced system in the band solver
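As a concrete instance of the INFO checking recommended in the debugging hints above, here is a hedged sketch around the general solver from the functionality table; the descriptors and distributed arrays are assumed to have been set up as in the PBLAS example earlier, and IPIV is assumed large enough for the local rows plus one block.

      SUBROUTINE SOLVE1( N, NRHS, A, DESCA, IPIV, B, DESCB, INFO )
      INTEGER            N, NRHS, INFO
      INTEGER            DESCA( 9 ), DESCB( 9 ), IPIV( * )
      DOUBLE PRECISION   A( * ), B( * )
*     Solve A*X = B for the distributed N x N matrix A and the NRHS
*     distributed right-hand sides in B; the solution overwrites B.
      CALL PDGESV( N, NRHS, A, 1, 1, DESCA, IPIV, B, 1, 1, DESCB,
     $             INFO )
      IF( INFO.LT.0 ) THEN
*        Argument -INFO had an illegal value (caught by the global
*        and local input checking described above).
      ELSE IF( INFO.GT.0 ) THEN
*        The factor U is exactly singular at U(INFO,INFO), so no
*        solution was computed.
      END IF
      RETURN
      END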
Documentation, Test Suites, Example Programs, ...
- Documentation
  - ScaLAPACK Users' Guide: http://www.netlib.org/scalapack/slug/scalapack_slug.html
  - Installation Guide for ScaLAPACK
  - LAPACK Working Notes
- Test suites for ScaLAPACK, PBLAS, BLACS
- Example programs: http://www.netlib.org/scalapack/examples/
- Prebuilt ScaLAPACK libraries on netlib

Commercial Use
- ScaLAPACK has been incorporated into the following commercial packages:
  - Fujitsu
  - Hewlett-Packard/Convex
  - Hitachi
  - IBM Parallel ESSL
  - NAG Numerical PVM (and MPI) Library
  - Cray LIBSCI
  - NEC Scientific Software Library
  - Sun Scientific Software Library
  - Visual Numerics (IMSL)

User Applications

Impact -- Applications
- Large scale solvers are permitting a new range of problems to be addressed.
- Technology transfer of software to the commercial marketplace.
- New computations in electronic structures are having an impact in nanoscale technology.
- Open-region electromagnetic scattering is important to remote atmospheric sensing and aircraft cross-sections.
- Eigensolvers are important in many areas (data compression, reactive scattering, etc.).
- Sparse solvers are used in structural mechanics, electromagnetics, and optimization.

ScaLAPACK in ASCI Application Amounts to a Saving of $1.1M - $5.4M
- MP-Quest is a simulation code for materials modeling
  - Surface structures at SNL (Pb-Zn-Ti)
- MP-Quest will account for 10% of all cycles on ASCI-Red (presently at 20-30%)
  - The symmetric eigensolver accounts for 25% of cycles in MP-Quest
  - Hence 2.5% of ASCI Red cycles are in MP-Quest's symmetric eigensolver.
- ScaLAPACK's symmetric eigensolver is 10x faster than MP-Quest's
  - 90% savings in cycles by using ScaLAPACK
- ASCI Red cost roughly $50 million
  - $50M x 2.5% x 90% = $1.125M today (may be closer to $5 million)
- Right now large problems are too slow because of the ...

Interaction with ASCI at Caltech: Quantum Mechanical Calculations
- Caltech ASCI Level 1 Alliance Center, "A Virtual Shock Physics Test Facility"
- All Caltech materials properties ultimately are based upon QM
  - Compute materials properties for high explosives (e.g. HMX) and for solid dynamics (e.g. Ta, a BCC metal)
- Most time intensive level of theory; two time-intensive steps:
  - Formation of the Fock operator, O(N^4)
  - Diagonalization of the Fock operator, O(N^3)
- Linear algebra requirements: $Ax = \lambda Bx$

Ocean Circulation Model
- OCEANS
  - Uses second-order finite differences with an explicit leapfrog time-stepping scheme on a logically rectangular grid.
    - A. Wallcraft, H. Hurlburt, R. Rhodes, & J. Shriver, NRL - Stennis
- 540% speedup over the previous parallel software: (64 x 145.6 Mflop/s) = 9.3 Gflop/s
- Translated: a run that was taking 2 hours now takes 25 minutes.
- ScaLAPACK has also improved the accuracy of the results.

ScaLAPACK Performance

Target Machines for ScaLAPACK
- Dedicated distributed-memory machines (MIMD)
  - IBM SP series
  - Intel series (Paragon)
  - Cray T3 series
  - SGI Power Challenge Array
  - SGI Origin 2000
  - HP/Convex Exemplar
- Networks of workstations or PCs
- Any system for which MPI or PVM is available

Scalability -- Introduction
- How well does a parallel algorithm allow a large number of processors to be efficiently utilized?
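The parallel efficiency meant in the next two slides is the standard definition, stated here for reference: with $T(p, n)$ the time to solve a problem of size $n$ on $p$ processors,

$$E(p, n) \;=\; \frac{T(1, n)}{p \, T(p, n)},$$

and the scalability claims below say that $E$ stays roughly constant as $p$ grows, provided the memory use per node, which for dense matrices means $N^2/P$, is held fixed.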
Achieving High Performance
Chapter 5, ScaLAPACK Users' Guide
ScaLAPACK achieves high performance on distributed-memory computers, such as the SP2, T3E, O2K, and Paragon.
ScaLAPACK can achieve high performance on some networks of workstations or PCs.
http://www.netlib.org/scalapack 179
Achieving High Performance on a Distributed-Memory Computer
Use the right number of processors
– Rule of thumb: P = M x N / 1,000,000 for an M x N matrix. This provides a local matrix of size approximately 1000-by-1000.
– Do not try to solve a small problem on too many processors.
– Do not exceed physical memory.
http://www.netlib.org/scalapack 180
Achieving High Performance on a Distributed-Memory Computer
Use an efficient data distribution.
– Block size (i.e., MB, NB) = 64.
– Square processor grid, Pr = Pc.
Use efficient machine-specific BLAS (not the Fortran77 reference implementation from netlib) and BLACS (nondebug, BLACSDBGLVL=0 in Bmake.inc).
http://www.netlib.org/scalapack 181
Achieving High Performance on a Network of Workstations
The bandwidth per node, measured in megabytes per second, should be no less than one tenth of the peak floating-point rate, measured in Mflop/s per node.
The underlying network must allow simultaneous messages, that is, not standard ethernet and not FDDI.
Message latency should be no more than 500 microseconds.
http://www.netlib.org/scalapack 182
Achieving High Performance on a Network of Workstations
All processors should be similar in architecture and performance. ScaLAPACK will be limited by the slowest processor.
Data format conversion significantly reduces communication performance.
Dedicated use of processors: no more than one process should be executed per processor.
http://www.netlib.org/scalapack 183
Obtaining High Performance
Use the best BLAS and BLACS libraries available.
Start with a standard data distribution:
– a square processor grid (Pr = Pc) if P >= 9,
– a one-dimensional processor grid (Pr = 1, Pc = P) if P < 9,
– block size (NB) = 64.
Determine whether reasonable performance is being achieved.
Identify the performance bottlenecks, if any. (A sketch of these rules follows.)
http://www.netlib.org/scalapack 184
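These rules of thumb are easy to encode. A minimal sketch, assuming P comes from, e.g., BLACS_PINFO (the routine names PCOUNT and GRIDSHAPE are ours):

*     Rule of thumb: P = M*N/1,000,000 processes for an M-by-N
*     matrix, i.e., roughly a 1000-by-1000 local matrix each.
*     (M*N can overflow 32-bit integers for very large matrices.)
      INTEGER FUNCTION PCOUNT( M, N )
      INTEGER M, N
      PCOUNT = MAX( 1, ( M * N ) / 1000000 )
      END

*     Grid shape: 1 x P when P < 9, otherwise the most nearly
*     square Pr x Pc with Pr*Pc = P and Pr <= Pc; pair with NB = 64.
      SUBROUTINE GRIDSHAPE( P, NPROW, NPCOL )
      INTEGER P, NPROW, NPCOL
      IF( P.LT.9 ) THEN
         NPROW = 1
         NPCOL = P
      ELSE
         NPROW = INT( SQRT( DBLE( P ) ) )
   10    IF( MOD( P, NPROW ).NE.0 ) THEN
            NPROW = NPROW - 1
            GO TO 10
         END IF
         NPCOL = P / NPROW
      END IF
      END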
[Chart: LU factorization/solve on a PII 300 MHz cluster -- Mflop/s versus order of matrix (1000 to 17000), for 1, 2, 4, 6, 8, 10, and 12 processors; peak about 1400 Mflop/s]
http://www.netlib.org/scalapack 185
Details of Cluster timings
Fast ethernet, MPICH 1.1
NB = 36
DGEMM obtained from ATLAS
Peak DGEMM = 189 Mflop/s per processor
Obtain approximately 62% of DGEMM performance per processor
http://www.netlib.org/scalapack 186
[Chart: Performance of LU factorization + solve on IBM SP2 thin nodes -- Mflop/s versus problem size (1000 to 15000), for 4 to 64 nodes; peak about 8 Gflop/s]
http://www.netlib.org/scalapack 187
Details of SP2 timings
Performance increases with the number of nodes.
Double the number of nodes, double the performance (memory saturated).
DGEMM (thin node) = 180 Mflop/s per node.
8 Gflop/s on 64 nodes is 125 Mflop/s per node; 125/180 = 70% of DGEMM performance.
http://www.netlib.org/scalapack 188
[Chart: Performance of LU factorization + solve on the Intel Paragon -- Mflop/s versus problem size (1000 to 20000), for 8 to 128 nodes; peak about 4.5 Gflop/s]
http://www.netlib.org/scalapack 189
Details of Paragon timings
Performance increases with the number of nodes.
DGEMM = 45 Mflop/s per node.
Achieve 80% of DGEMM performance per node.
Double the number of nodes, double the performance (memory saturated).
http://www.netlib.org/scalapack 190
[Chart: Performance of LU factorization + solve on Sparc Ultra 1 (167 MHz) workstations connected via 155 Mbps switched ATM -- Mflop/s versus problem size (1000 to 12000), for 4, 8, and 12 nodes]
http://www.netlib.org/scalapack 191
[Chart: LU performance (Mflop/s) on 32 nodes of the Intel XP/S MP Paragon, problem sizes 1000 to 5000]
http://www.netlib.org/scalapack 192
[Chart: Performance of LU factorization + solve on the Intel XP/S MP Paragon (Mflop/s), 2 computational processors per node, process grids 1x4 to 4x16, problem sizes 5000 to 22500]
http://www.netlib.org/scalapack 193
ScaLAPACK Example Programs
http://www.netlib.org/scalapack 194
ScaLAPACK Example Program #1
http://www.netlib.org/scalapack/examples/example1.f
http://www.netlib.org/scalapack 195
ScaLAPACK Example Program #2
http://www.netlib.org/scalapack/examples/scaex.shar
http://www.netlib.org/scalapack 196
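In the same spirit as example1.f, here is a compressed sketch of the skeleton such a driver follows -- BLACS grid setup, descriptor initialization, solve, teardown. Matrix generation is elided, the sizes are illustrative, and at least NPROW x NPCOL = 4 processes are assumed.

      PROGRAM SKETCH
*     Skeleton ScaLAPACK driver: solve A x = b with PDGESV.
      INTEGER            N, NB, NRHS
      PARAMETER          ( N = 1000, NB = 64, NRHS = 1 )
      INTEGER            ICTXT, IAM, NPROCS, NPROW, NPCOL, MYROW,
     $                   MYCOL, NUMA, INFO
      INTEGER            DESCA( 9 ), DESCB( 9 ), IPIV( N+NB )
*     Local pieces of A and B, dimensioned generously for the sketch.
      DOUBLE PRECISION   A( N*N ), B( N )
      INTEGER            NUMROC
      EXTERNAL           NUMROC
*     1. Initialize the BLACS and a (here 2 x 2) process grid.
      CALL BLACS_PINFO( IAM, NPROCS )
      NPROW = 2
      NPCOL = 2
      CALL BLACS_GET( -1, 0, ICTXT )
      CALL BLACS_GRIDINIT( ICTXT, 'Row-major', NPROW, NPCOL )
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
*     2. Build the array descriptors for the block-cyclic layout.
      NUMA = NUMROC( N, NB, MYROW, 0, NPROW )
      CALL DESCINIT( DESCA, N, N, NB, NB, 0, 0, ICTXT,
     $               MAX( 1, NUMA ), INFO )
      CALL DESCINIT( DESCB, N, NRHS, NB, NB, 0, 0, ICTXT,
     $               MAX( 1, NUMA ), INFO )
*     3. ... fill the local pieces of A and B (elided) ...
*     4. Solve and check INFO.
      CALL PDGESV( N, NRHS, A, 1, 1, DESCA, IPIV, B, 1, 1, DESCB,
     $             INFO )
*     5. Release the grid and exit the BLACS.
      CALL BLACS_GRIDEXIT( ICTXT )
      CALL BLACS_EXIT( 0 )
      END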
Issues of Heterogeneous Computing
http://www.netlib.org/scalapack 197
Heterogeneous Computing
Software intended to be used in a cluster computing context.
Difficulties arise with the following issues:
– Communication of floating-point numbers between processors
– Repeatability and coherency
– Machine precision and other machine-specific parameters
– Different versions of compilers
– Checking global floating-point arguments
– Iterative convergence across clusters of processors
198
Heterogeneous Computing
Defensive programming is required to address these issues.
Details of ScaLAPACK experiences can be found in:
– http://www.netlib.org/lapack/lawns/lawn112.ps
http://www.netlib.org/scalapack 199
Homogeneous Versus Heterogeneous
Issue #1: Hardware
What is a homogeneous machine? A homogeneous machine is one which satisfies (1): it guarantees that the result of a particular sequence of floating-point operations is the same no matter which processor is used.
Is this sufficient?
http://www.netlib.org/scalapack 200
Homogeneous Versus Heterogeneous
Issue #2: Communication layer
What is a homogeneous network? A homogeneous network is a collection of machines which satisfy (1) and (2): floating-point numbers are communicated exactly between homogeneous machines.
Is this sufficient?
http://www.netlib.org/scalapack 201
Homogeneous Versus Heterogeneous
Issue #3: Software (operating system, compiler, ...)
What is a homogeneous computing environment? A homogeneous computing environment is a network which satisfies (1), (2), and (3): the operating system, compiler, and compiler options do not alter the representation of floating-point values.
If any of these requirements is not met, the system is heterogeneous.
http://www.netlib.org/scalapack 202
Heterogeneous Computing Issues
Issues to be considered:
– Communication of floating-point values between processors
– Machine precision and other machine parameters
– Checking global floating-point arguments
– Algorithmic integrity
– Deadlock
http://www.netlib.org/scalapack 203
Communicating on IEEE Machines
The IEEE standard specifies how machines should represent floating-point numbers.
XDR (External Data Representation) uses the IEEE representation.
PVM uses XDR and thus will communicate numbers without change on IEEE machines (denormalized numbers?).
MPI suggests, but does not mandate, the use of XDR.
http://www.netlib.org/scalapack 204
Machine Precision
The smallest floating-point number u such that 1 + u > 1.
Used in:
– Convergence tests -- has an iteration converged?
– Tolerance tests -- is a matrix numerically singular?
– Error analysis -- how does floating-point arithmetic affect results?
http://www.netlib.org/scalapack 205
Heterogeneous Machine Precision
Return the maximum relative machine precision over all the processors:
LAPACK --> DLAMCH
ScaLAPACK --> PDLAMCH
Note that even on IEEE machines the machine precision can vary slightly. For double precision:
e = 2**(-53) + q, where q = 2**(-105) or q = 2**(-64) + 2**(-105).
http://www.netlib.org/scalapack 206
Other Machine Parameters
A number of other parameters are used to characterize a machine, e.g.:
– underflow threshold
– overflow threshold
– smallest number that can be safely reciprocated
– . . .
For small numbers -- use the largest over all the processors.
For large numbers -- use the smallest over all the processors.
http://www.netlib.org/scalapack 207
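A minimal sketch of picking up grid-consistent parameters with PDLAMCH, which takes the BLACS context plus a selector character in the style of DLAMCH (the variable names are ours; the comments restate the rules from the slides above):

*     Grid-consistent machine parameters; ICTXT = BLACS context.
      DOUBLE PRECISION   EPS, SFMIN, RMAX
      DOUBLE PRECISION   PDLAMCH
      EXTERNAL           PDLAMCH
*     'E': relative machine precision -- the maximum over all
*     processes in the grid.
      EPS   = PDLAMCH( ICTXT, 'Epsilon' )
*     'S': safe minimum -- a small number, so take the largest
*     over the processes.
      SFMIN = PDLAMCH( ICTXT, 'Safe minimum' )
*     'O': overflow threshold -- a large number, so take the
*     smallest over the processes.
      RMAX  = PDLAMCH( ICTXT, 'Overflow' )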
Heterogeneous Networks -- Arithmetic Issues
Even on IEEE machines, results may differ between machines, compilers, and compiler switches:
– Global variables may have different values.
– Eigenproblem for a tridiagonal matrix -- run QR on each processor and let each processor find k eigenvectors. But each processor may compute a different QR sequence.
– Stopping criteria for iterative methods -- may not be satisfied on each processor simultaneously.
http://www.netlib.org/scalapack 208
Algorithmic Integrity -- Examples
Convergence of iterative methods
Scaling vectors and matrices
Bisection and inverse iteration for a symmetric tridiagonal matrix
Solving the eigenproblem for a symmetric tridiagonal matrix by the QR algorithm
http://www.netlib.org/scalapack 209
QR Algorithm for a Tridiagonal Matrix
Computing the eigenvalues is O(n**2) arithmetic, but computing the eigenvectors is O(n**3).
Run the QR algorithm on each processor; save the plane rotations on each processor; let each processor compute k of the eigenvectors.
QR is backward stable, but not forward stable, so the sequence of plane rotations may differ from processor to processor.
Remedy: have one processor run the QR algorithm and broadcast the plane rotations.
http://www.netlib.org/scalapack 210
Heterogeneous Conclusions
Defensive programming
Machine parameters
Communication and representation between processors
Controlling processor
Additional communication
Testing strategies
http://www.netlib.org/scalapack 211
HPF Interface to ScaLAPACK
http://www.netlib.org/scalapack 212
HPF Version
Interface provided for a subset of routines:
– linear solvers (general, pos. def., least squares),
– eigensolver (pos. def.),
– matrix multiply & triangular solve.
213
HPF Version
Given these declarations:
REAL, DIMENSION(1000,1000) :: A, B, C
!HPF$ DISTRIBUTE (CYCLIC(64), CYCLIC(64)) :: A, B, C
calls could be made as simple as:
CALL LA_GESV(A, B)
CALL LA_POSV(A, B)
CALL LA_GELS(A, B)
CALL LA_SYEV(A, W, Z)
CALL LA_GEMM(A, B, C)
CALL LA_TRSM(A, B)
The Fortran 90 version follows these ideas.
214
HPF Interface -- Note
No presently available compiler supports the full features of HPF.
We are determining the minimal features required to effectively support an HPF library which calls a process-local message-passing library.
http://www.netlib.org/scalapack 215
HPF -- Redistribution
Redistribution is expensive in both space (memory) and time;
users often utilize all available memory;
ScaLAPACK performance is not very sensitive for a large subset of all supported cyclic mappings.
http://www.netlib.org/scalapack 216
HPF -- Redistribution
Use !HPF$ INHERIT for most general-purpose libraries:
– Data will be left in place across subroutine calls,
– Whenever possible, data is passed to ScaLAPACK without redistribution,
– Otherwise, redistribute to an optimal distribution.
http://www.netlib.org/scalapack 217
Determining Distribution 1.
HPF_DISTRIBUTION from HPF_LIBRARY returns:
– Distribution type (CYCLIC, BLOCK, ...),
– Row and column blocking factors (MB, NB),
– Size of the process grid of the ultimate align target.
Use HPF_ALIGNMENT from HPF_LIBRARY to see how this information affects our matrix operand. (A query sketch follows.)
http://www.netlib.org/scalapack 218
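A hedged sketch of the query step (the keyword argument names follow the HPF_LIBRARY specification as best we recall, a 2-D align target is assumed, and the routine name QUERYDIST is ours):

SUBROUTINE QUERYDIST( A )
  USE HPF_LIBRARY
  REAL, DIMENSION(:,:) :: A
!HPF$ INHERIT :: A
  CHARACTER(LEN=9) :: AXTYPE( 2 )
  INTEGER          :: AXINFO( 2 ), PRANK, PSHAPE( 2 )
  ! AXIS_TYPE returns 'BLOCK', 'CYCLIC', ... per dimension;
  ! AXIS_INFO returns the blocking factors (MB, NB);
  ! PROCESSORS_RANK/_SHAPE describe the process grid of the
  ! ultimate align target.
  CALL HPF_DISTRIBUTION( A, AXIS_TYPE = AXTYPE,     &
                         AXIS_INFO = AXINFO,        &
                         PROCESSORS_RANK = PRANK,   &
                         PROCESSORS_SHAPE = PSHAPE )
  ! ... combine with HPF_ALIGNMENT to see how this maps onto
  ! the actual matrix operand ...
END SUBROUTINE QUERYDIST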
Determining Distribution 2.
Find on what processor particular blocks of the matrix reside:
– GLOBAL_TO_LOCAL from HPF_LOCAL_LIBRARY, or
– !HPF$ ALIGN and a mapping array.
If redistribution is necessary, use
!HPF$ PROCESSORS PROC(P,Q)
!HPF$ DISTRIBUTE (CYCLIC(MB),CYCLIC(NB)) ONTO PROC
http://www.netlib.org/scalapack 219
Calling Fortran77 From HPF
If Fortran77 is called directly from HPF, one cannot determine the leading dimension of the local arrays --> intermediate layer (HPF --> F77).
LAYER 1: user-callable wrapper functions and their HPF tools (written in strict HPF):
http://www.netlib.org/scalapack 220
Calling Fortran77 From HPF
Determine if the input data distribution is acceptable:
– All matrix operands can be expressed as some variant of CYCLIC(K),
– All matrix operands are distributed across the same process grid,
– The process grid is one- or two-dimensional,
– Certain routine-dependent alignment restrictions are met.
Otherwise, re-map to an acceptable distribution.
http://www.netlib.org/scalapack 221
Calling Fortran77 From HPF
LAYER 2: transition layer
– Extrinsic with a local view of the matrix
» HPF_LOCAL (layer 3 F77_LOCAL or F90_LOCAL)
» F90_LOCAL (layer 3 F77_LOCAL or F90_LOCAL)
– Translate HPF's assumed-shape arrays to Fortran77's assumed-size arrays,
– Determine the leading dimension of the assumed-size array so Fortran77 routines can do index computation.
LAYER 3: ScaLAPACK message-passing layer
– Extrinsic F77_LOCAL or F90_LOCAL
http://www.netlib.org/scalapack 222
HPF Interface -- Summary
!HPF$ INHERIT to avoid unneeded redistribution across subroutine calls.
HPF_DISTRIBUTION from HPF_LIBRARY to find out the distribution of the matrix's ultimate align target.
HPF_ALIGNMENT from HPF_LIBRARY to determine how the distribution of the ultimate align target affects the actual matrix distribution.
http://www.netlib.org/scalapack 223
HPF Interface -- Summary
!HPF$ PROCESSORS PROC(P,Q), where P and Q are dummy arguments to a routine: when one has determined that redistribution must occur, this declaration ensures that all matrices passed to the routine are distributed on the same process grid.
!HPF$ DISTRIBUTE A(CYCLIC(MB),CYCLIC(NB)) ONTO PROC, with MB and NB parameters to the routine and PROC from above,
– to redistribute data according to the two-dimensional block-cyclic scheme.
http://www.netlib.org/scalapack 224
HPF Interface -- Summary
GLOBAL_TO_LOCAL from HPF_LOCAL_LIBRARY, or
!HPF$ ALIGN M(I,J) WITH A(1+(I-1)*MB, 1+(J-1)*NB), with MB and NB parameters to the routine,
– to label the processes, allowing for system-independent process grid setup.
EXTRINSIC(F90_LOCAL) or EXTRINSIC(HPF_LOCAL) + EXTRINSIC(F77_LOCAL)
– to call Fortran77 from HPF. (A wrapper sketch follows.)
http://www.netlib.org/scalapack 225
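Putting the summary together, a skeletal Layer 1 wrapper might look as follows. This is a sketch only: the wrapper name is ours, the query and copy steps are left as comments, and it is not the actual ScaLAPACK HPF interface source.

SUBROUTINE HPF_GESV_SKETCH( A, B, P, Q, MB, NB )
  USE HPF_LIBRARY
  INTEGER, INTENT(IN) :: P, Q, MB, NB
  REAL, DIMENSION(:,:), INTENT(INOUT) :: A, B
!HPF$ PROCESSORS PROC( P, Q )
!HPF$ INHERIT :: A
!HPF$ INHERIT :: B
  ! A redistributed temporary, used only when the incoming
  ! distribution is not an acceptable CYCLIC(K) variant:
  REAL, DIMENSION( SIZE(A,1), SIZE(A,2) ) :: ATMP
!HPF$ DISTRIBUTE ATMP( CYCLIC(MB), CYCLIC(NB) ) ONTO PROC
  ! 1. Query the incoming distribution (HPF_DISTRIBUTION /
  !    HPF_ALIGNMENT, as sketched earlier).
  ! 2. If a remap is needed: ATMP = A  (the redistribution
  !    happens in the copy).
  ! 3. Call the EXTRINSIC(F77_LOCAL) ScaLAPACK layer on the
  !    local pieces; copy back on return.
END SUBROUTINE HPF_GESV_SKETCH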
[Chart: HPF performance for LU on a 12-node Sun Ultra cluster using the PGI compiler -- time in seconds versus matrix order, comparing the HPF call, ScaLAPACK, and the wrapper]
226
Future Directions
http://www.netlib.org/scalapack 227
ScaLAPACK -- Ongoing Work
Increased flexibility and usability
– Incorporate PBLAS (version 2.0 ALPHA) into the library
– Algorithmic blocking and removal of alignment restrictions at the top-level routines
Increased functionality
– Divide-and-conquer SEP and SVD
C++ and Java interfaces to ScaLAPACK
Iterative solvers
http://www.netlib.org/scalapack 228
Conclusions
http://www.netlib.org/scalapack 229
ScaLAPACK Summary
Follow-on to LAPACK -- designed for MPPs
Goals attained through the promotion and development of standards (BLAS, PBLAS)
Object-based design provides a framework for additional data distributions and matrix types
Software available today
http://www.netlib.org/scalapack 230
ScaLAPACK Summary
Portability across a wide range of architectures
– will work on a heterogeneous platform
Already in use by SGI/Cray, IBM, HP-Convex, Fujitsu, NEC, NAG, & Visual Numerics (IMSL)
– Tailor performance & provide support
Software, examples, prebuilt libraries available on netlib
ScaLAPACK Team
Susan Blackford, UT; Jaeyoung Choi, Soongsil University; Andy Cleary, LLNL; Ed D'Azevedo, ORNL; Jim Demmel, UC-B; Inder Dhillon, IBM (UC-B); Jack Dongarra, UT/ORNL; Ray Fellers, UC-B; Sven Hammarling, NAG; Greg Henry, Intel; Osni Marques, LBNL/NERSC; Caroline Papadopoulos, UCSD; Antoine Petitet, UT; Ken Stanley, UC-B; Francoise Tisseur, U Man; David Walker, Cardiff U; Clint Whaley, UT
http://www.netlib.org/scalapack 232
scalapack@cs.utk.edu
ScaLAPACK -- References
L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide, Second Edition, SIAM, Philadelphia, PA, 1997.
ScaLAPACK software and documentation can be obtained via WWW and anonymous ftp:
– http://www.netlib.org/scalapack
– http://www.netlib.org/lapack/lawns
– (anonymous) ftp ftp.netlib.org
Comments/suggestions to scalapack@cs.utk.edu
http://www.netlib.org/scalapack 233
