ScaLAPACK Tutorial
Susan Blackford
University of Tennessee
http://www.netlib.org/scalapack/

Outline
   Introduction
   ScaLAPACK Project Overview
   Basic Linear Algebra Subprograms (BLAS)
   Linear Algebra PACKage (LAPACK)
   Basic Linear Algebra Communication Subprograms (BLACS)
   Parallel BLAS (PBLAS)
   Design of ScaLAPACK
Outline continued
   Contents of ScaLAPACK
   Applications
   Performance
   Example Programs
   Issues of Heterogeneous Computing
   HPF Interface to ScaLAPACK
Outline continued
   Future Directions
   Conclusions
   Hands-On Exercises
Introduction




High-Performance Computing Today
   In the past decade, the world has experienced one of the most exciting periods in computer development.
   Microprocessors have become smaller, denser, and more powerful.
   The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference for attacking some of the most important problems of science and engineering.
Growth of Microprocessor Performance
[Chart: performance in Mflop/s, log scale from 0.01 to 10000, over time. Vector supercomputers (Cray 1S, Cray X-MP, Cray Y-MP, Cray 2, Cray C90, Cray T90) are plotted against microprocessors (8087, 80287, 6881, 80387, R2000, i860, RS6000/540, RS6000/590, Alpha); microprocessor performance has grown to rival the vector machines.]
The Maturation of Highly Parallel Technology
 Affordable parallel systems now out-perform the best conventional supercomputers.
 Performance per dollar is particularly favorable.
 The field is thinning to a few very capable systems.
 Reliability is greatly improved.
 Third-party scientific and engineering applications are appearing.
 Business applications are appearing.
 Commercial customers, not just research labs, are acquiring systems.
Architecture Alternatives
   Distributed, private memories using message passing to communicate among processors
   Single address space implemented with physically distributed memories
   Both approaches require:
     – Attention to locality of access.
     – Scalable interconnection technology.
Directions
   Move toward shared memory
     – SMPs and Distributed Shared Memory
     – Shared address space w/deep memory hierarchy
   Clustering of shared memory machines for scalability
   Efficiency of message passing and data parallel programming
     – Helped by standards efforts such as MPI and HPF (PVM & OpenMP)
Challenges in Developing Distributed Memory Libraries
   How to integrate software?
     – Until recently no standards
     – Many parallel languages
     – Various parallel programming models
     – Assumptions about the parallel environment
       » granularity
       » topology
       » overlapping of communication/computation
       » development tools
   Where is the data?
     – Who owns it?
     – Optimal data distribution
   Who determines data layout?
     – Determined by user?
     – Determined by library developer?
     – Allow dynamic data distribution
     – Load balancing
ScaLAPACK Project Overview




ScaLAPACK Team
   Susan Blackford, UT
   Jaeyoung Choi, Soongsil University, Korea (UT)
   Andy Cleary, LLNL (UT)
   Tony Chan, UCLA
   Ed D'Azevedo, ORNL
   Jim Demmel, UC-B
   Inder Dhillon, IBM (UC-B)
   Jack Dongarra, UT/ORNL
   Victor Eijkhout, UT
   Sven Hammarling, NAG (UT)
   Mike Heath, UIUC
   Greg Henry, Intel
   Antoine Petitet, UT
   Padma Raghavan, UT
   Dan Sorensen, Rice U
   Ken Stanley, UC-B
   David Walker, Cardiff (ORNL)
   Clint Whaley, UT
   Plus others not funded by DARPA
Scalable Parallel Library for Numerical Linear Algebra
 ScaLAPACK -- Software Library for Dense, Banded, Sparse
   – University of Tennessee
   – Oak Ridge National Laboratory
   – University of California, Berkeley
 P_ARPACK -- Large Sparse Non-Symmetric Eigenvalues
   – Rice University
 CAPSS -- Direct Sparse Systems Solvers
   – University of Illinois at Urbana-Champaign
   – University of Tennessee
 ParPre -- Parallel Preconditioners for Iterative Methods
NLA - Software Development
 BLAS, LINPACK, EISPACK
 LAPACK
   – From hand-held calculators to the most powerful vector supercomputers
   – RISC, Vector, and SMP targets
   – In use by CRAY, IBM, SGI, HP-Convex, SUN, Fujitsu, Hitachi, NEC, NAG, IMSL, KAI
   – Still under development
   – High accuracy routines, parameterizing for memory hierarchies, annotated libraries
NLA - ScaLAPACK
   Follow-on to LAPACK
   Designed for distributed parallel computing (MPP & Clusters)
   First math software package to do this
   Numerical software that will work on a heterogeneous platform
   Latency-tolerant algorithms on systems with deep memory hierarchy
   "Out of Core" and fault-tolerant implementations
   In use by Cray, IBM, HP-Convex, Fujitsu, NEC, NAG, IMSL
     – Tailor performance & provide support
Goals - Port LAPACK to Distributed-Memory Environments
    Efficiency
      – Optimized compute and communication engines
      – Block-partitioned algorithms (Level 3 BLAS) utilize memory hierarchy and yield good node performance
    Reliability
      – Whenever possible, use LAPACK algorithms and error bounds.
    Scalability
      – As the problem size and number of processors grow
      – Replaced LAPACK algorithms that did not scale; new ones go into LAPACK
    Portability
      – Isolate machine dependencies to the BLAS and the BLACS
    Flexibility
      – Modularity: build a rich set of linear algebra tools: BLAS, BLACS, PBLAS
ScaLAPACK Team
 Susan Blackford, UT
 Jaeyoung Choi, Soongsil University
 Andy Cleary, LLNL
 Ed D'Azevedo, ORNL
 Jim Demmel, UC-B
 Inder Dhillon, IBM (UC-B)
 Jack Dongarra, UT/ORNL
 Ray Fellers, UC-B
 Sven Hammarling, NAG
 Greg Henry, Intel
 Osni Marques, LBNL/NERSC
 Caroline Papadopoulos, UCSD
 Antoine Petitet, UT
 Ken Stanley, UC-B
 Francoise Tisseur, U Manchester
 David Walker, Cardiff U
 Clint Whaley, UT
Programming Style
 SPMD Fortran 77 using an object-based design
 Built on various modules
   – PBLAS
   – BLACS (interprocessor communication)
     » PVM, MPI, IBM SP, CRI T3, Intel, TMC
     » Provides right level of notation.
   – BLAS
 LAPACK software expertise/quality
   – software approach
Overall Structure of Software
    Each global data object is assigned an array descriptor.
    The array descriptor
      – Contains information required to establish the mapping between a global array entry and its corresponding process and memory location.
      – Is differentiated by the DTYPE_ (first entry) in the descriptor.
      – Provides a flexible framework to easily specify additional data distributions or matrix types.
    Uses the concept of context
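
For illustration, a descriptor for a dense matrix can be filled in with the ScaLAPACK TOOLS routine DESCINIT. This is a minimal sketch; the matrix size, block sizes, and source process coordinates are made-up example values:

      INTEGER            DESCA( 9 ), ICTXT, INFO, LLDA
*     Example: a 1000 x 1000 matrix distributed in 64 x 64 blocks,
*     starting on process (0,0) of the grid identified by ICTXT.
*     LLDA is the local leading dimension; it must be at least the
*     number of local rows, e.g. NUMROC( 1000, 64, MYROW, 0, NPROW ).
      CALL DESCINIT( DESCA, 1000, 1000, 64, 64, 0, 0, ICTXT,
     $               LLDA, INFO )
*     DESCA( 1 ) now holds DTYPE_ = 1, the dense matrix descriptor type.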
PBLAS
  Similar to the BLAS in functionality and naming.
  Built on the BLAS and BLACS
  Provide global view of matrix

   CALL DGEXXX ( M, N, A( IA, JA ), LDA, ... )

   CALL PDGEXXX( M, N, A, IA, JA, DESCA, ... )
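
For example, the Level 3 BLAS routine DGEMM has the PBLAS counterpart PDGEMM; each (matrix, leading dimension) pair becomes an (array, row index, column index, descriptor) quadruplet. The dimensions and descriptors below are illustrative:

      CALL DGEMM ( 'No transpose', 'No transpose', M, N, K, ALPHA,
     $             A, LDA, B, LDB, BETA, C, LDC )

      CALL PDGEMM( 'No transpose', 'No transpose', M, N, K, ALPHA,
     $             A, 1, 1, DESCA, B, 1, 1, DESCB, BETA,
     $             C, 1, 1, DESCC )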
ScaLAPACK Structure

[Diagram: software hierarchy. At the global level, ScaLAPACK calls the PBLAS. At the local level, ScaLAPACK and the PBLAS call LAPACK and the BLAS for computation, and the BLACS, which sit on top of PVM/MPI/..., for communication.]
Parallelism in ScaLAPACK
    Level 3 BLAS block operations
      – All the reduction routines
    Pipelining
      – QR algorithm, triangular solvers, classic factorizations
    Redundant computations
      – Condition estimators
      – QR-based eigensolvers
    Static work assignment
      – Bisection
    Task parallelism
      – Sign function eigenvalue computations
    Divide and conquer
      – Tridiagonal and band solvers, symmetric eigenvalue problem and sign function
    Cyclic reduction
      – Reduced system in the band solver
Heterogeneous Computing
 Software intended to be used in this context
 Machine precision and other machine-specific parameters
 Communication of floating-point numbers between processors
 Repeatability
   – run-to-run differences, e.g. order in which quantities are summed
 Coherency
   – within-run differences, e.g. arithmetic varies from processor to processor
Prototype Codes
  http://www.netlib.org/scalapack/prototype/
     – PBLAS (version 2.0 ALPHA)
     – Packed storage routines for LLT, SEP, GSEP
     – Out-of-core linear solvers for LU, LLT, and QR
     – Matrix sign function for eigenproblems
     – SuperLU and SuperLU_MT
     – HPF interface to ScaLAPACK
   Distributed memory SuperLU coming soon!
Out of Core Software Approach
 High-level I/O interface
 ScaLAPACK uses a `right-looking' variant for LU, QR and Cholesky factorizations.
 A `left-looking' variant is used for the out-of-core factorization to reduce I/O traffic.
 Requires two in-core panels.
Out-of-Core Performance
[Chart: QR factorization time in seconds on 64 processors of an Intel Paragon, for square problem sizes 5000, 8000, 10000, 14000, 16000, and 22400, comparing the out-of-core and in-core solvers.]
HPF Version
  Interface provided for a subset of routines
  Linear solvers (general, pos. def., least squares)
  Eigensolver (pos. def.)
  Matrix multiply & triangular solve
HPF Version
    Given these declarations:
      REAL, DIMENSION(1000,1000) :: A, B, C
      !HPF$ DISTRIBUTE (CYCLIC(64), CYCLIC(64)) :: A, B, C

    Calls could be made as simple as:
      CALL LA_GESV(A, B)
      CALL LA_POSV(A, B)
      CALL LA_GELS(A, B)
      CALL LA_SYEV(A, W, Z)
      CALL LA_GEMM(A, B, C)
      CALL LA_TRSM(A, B)

    Fortran 90 version follows these ideas
ScaLAPACK - Ongoing Work
 Increasing flexibility and usability
   – Algorithm redistribution in PBLAS
   – Removal of alignment restrictions
   – Algorithmic blocking
 Increased functionality
   – Divide and conquer for SEP and SVD
 "Holy Grail" for Symmetric Eigenvalue Problem
 Sparse Direct Solver
Direct Sparse Solvers
   CAPSS is a package to solve Ax=b on a message passing multiprocessor; the matrix A is SPD and associated with a mesh in 2 or 3D. (Version for Intel & MPI)
   MFACT -- multifrontal sparse solver
     http://www.cs.utk.edu/~padma/mfact.html
   SuperLU -- sequential and parallel implementations of Gaussian elimination with partial pivoting.
Sparse Gaussian Elimination
   SuperLU_MT, supernodal SMP version
   Designed to exploit memory hierarchy
     – Serial supernode-panel organization permits "BLAS 2.5" performance
     – Up to 40% of machine peak on large sparse matrices on IBM RS6000/590 and MIPS R8000; 25% on Alpha 21164
SuperLU
   Several data structures redefined and redesigned for parallel access
   Barrier-less, nondeterministic implementation
   Task queue for load balance (over 95% on most large problems)
   For an n=16614 matrix from a 3D flow calculation, 66 nonzeros/row:

   Machine      Nprocs   Speedup   Gflop/s   % Peak
   Alpha 8400      8       7         0.78      17
   Cray J90        8       6.5       0.83      25
   Cray C90        8       6.6       2.58      25
Parallel Sparse Eigenvalue Solvers
P_ARPACK (D. Sorensen et al)
 – Designed to compute a few eigenvalues and corresponding vectors of a general matrix.
 – Appropriate for large sparse or structured matrices A
 – This software is based on an Arnoldi process called the Implicitly Restarted Arnoldi Method.
 – Reverse Communication Interface.
ParPre
   Library of parallel preconditioners for iterative solution methods for linear systems of equations
Netlib downloads for ScaLAPACK material
 ScaLAPACK --              4230
 ScaLAPACK HPF version --   228
 SuperLU --                  78
 ARPACK --                 1129
 PARPACK --                 292
 CAPSS --                   479
 PARPRE --                  102
 manpages --               3292

   This is a count of the downloads for the archive, tar, or prebuilt libraries.
Related Projects




Java
   Java likely to be a dominant language.
   Provides for machine independent code.
   C++-like language
   No pointers, gotos, overloading of arithmetic operators, or memory deallocation
   Portability achieved via an abstract machine
   Java is a convenient user interface
JAVA
  Investigating the suitability of Java for math software libraries.
   – http://math.nist.gov/javanumerics/
  JAMA (Java Matrix Package)
   – BLAS, LU, QR, eigenvalue routines, SVD
  LAPACK to Java translator
   – http://www.cs.utk.edu/f2j/download.html
  Involved in the community discussion
LAPACK to JAVA
   Allows Java programmers to access BLAS/LAPACK routines.
   Translator to go from LAPACK to Java byte code
     – f2j: formal compiler of a subset of Fortran 77 sufficient for BLAS & LAPACK
     – Plan to enable all of LAPACK
     – Compiler provides quick, reliable translation.
   Focus on LAPACK Fortran
     – Simple - no COMMON, EQUIVALENCE, SAVE, I/O
Parameterized Libraries
 Architecture features for performance
   – Cache size
   – Latency
   – Bandwidth
 Latency tolerant algorithms
   – Latency reduction
   – Granularity management
 High levels of concurrency
 Issues of precision
 Grid aware
Motivation for Network Enabled Solvers
 Design an easy-to-use tool to provide efficient and uniform access to a variety of scientific packages on UNIX platforms
 Client-Server Design
 Non-hierarchical system
 Load Balancing
 Fault Tolerance
 Heterogeneous Environment
[Diagram: a Client sends a Request to an Agent, which makes a Choice among the Computational Resources and returns a Reply.]
NetSolve -- References
   Software and documentation can be obtained via the WWW or anonymous ftp:
     – http://www.cs.utk.edu/netsolve/
   Comments/questions to netsolve@cs.utk.edu
ATLAS
atlas@cs.utk.edu
What is ATLAS
 A package that adapts to differing architectures via code generation + timing
   – Initially, supply BLAS
 Package contains:
   – Code generators
   – Sophisticated timers
   – Robust search routines
 Currently provided
   – Level 3 BLAS
     » Generated GEMM
         1-2 hours install time per precision
     » Recursive GEMM-based L3 BLAS
         Antoine Petitet
 Next Release:
   – Real matvec
   – Various L1
   – Threading?
Why ATLAS is needed
   BLAS require many man-hours/platform
     – Only done if the financial incentive is there
       » Many platforms will never have an optimal version
     – Lags behind hardware
     – May not be affordable by everyone
     – Improves vendor code
   Operations may be important, but not general enough for a standard
Algorithmic approach for Level 3
   Only generated code is on-chip multiply
   All BLAS operations written in terms of generated on-chip multiply
   All transpose cases coerced through data copy to 1 case of on-chip multiply
     – Only 1 case generated per platform
[Diagram: C (M x N) is computed from A (M x K) times B (K x N), one NB x NB on-chip block at a time.]
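
The structure can be sketched as follows: a driver loops over NB x NB blocks and calls one fixed on-chip kernel for every block product. The kernel name ONCHIP_MM and its plain triple-loop body are hypothetical stand-ins for the unrolled, register-blocked code that ATLAS actually generates; the sketch assumes M, N and K are multiples of NB:

      SUBROUTINE BLKMM( M, N, K, NB, A, LDA, B, LDB, C, LDC )
*     Sketch of a GEMM built on a single on-chip multiply case.
      INTEGER            M, N, K, NB, LDA, LDB, LDC, I, J, L
      DOUBLE PRECISION   A( LDA, * ), B( LDB, * ), C( LDC, * )
      DO 30 J = 1, N, NB
         DO 20 I = 1, M, NB
            DO 10 L = 1, K, NB
*              C(I:I+NB-1,J:J+NB-1) += A(I.., L..) * B(L.., J..)
               CALL ONCHIP_MM( NB, A( I, L ), LDA, B( L, J ), LDB,
     $                         C( I, J ), LDC )
   10       CONTINUE
   20    CONTINUE
   30 CONTINUE
      END

      SUBROUTINE ONCHIP_MM( NB, A, LDA, B, LDB, C, LDC )
*     Hypothetical stand-in for the generated kernel: a plain
*     NB x NB x NB block multiply, C = C + A*B.
      INTEGER            NB, LDA, LDB, LDC, I, J, L
      DOUBLE PRECISION   A( LDA, * ), B( LDB, * ), C( LDC, * )
      DO 30 J = 1, NB
         DO 20 L = 1, NB
            DO 10 I = 1, NB
               C( I, J ) = C( I, J ) + A( I, L ) * B( L, J )
   10       CONTINUE
   20    CONTINUE
   30 CONTINUE
      END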
Code generation strategy
   Code is iteratively generated & timed until the optimal case is found. We try:
     – Differing NBs
     – Breaking false dependencies
     – M, N and K loop unrolling
   On-chip multiply optimizes for:
     – TLB access
     – L1 cache reuse
     – FP unit usage
     – Memory fetch
     – Register reuse
     – Loop overhead minimization
[Chart: 500x500 DGEMM across various architectures, performance in MFLOPS (0 to 700). For each machine (DEC Alpha 21164a-433 and 21164a-533, HP PA8000-180MHz, HP 9000/735/125, IBM PowerPC 604e-332, IBM Power2-135, Pentium MMX-150, Pentium Pro-200, Pentium II-266, SGI R4600, R5000, R8000ip21, R10000ip27, Sun microsparc II Model 70, Sun Darwin-270, Sun Ultra2 Model 2200), the ATLAS matrix multiply is compared against the vendor matrix multiply and the reference F77 BLAS.]
[Chart: right-looking LU factorization (old), 500 x 500 double precision, performance in MFLOPS (0 to 600) across the same range of architectures (DEC Alpha 21164a-433/533, HP PA8000, IBM PowerPC 604e-332, IBM Power2-135, Pentium Pro-200, Pentium II-266, SGI R5000, R10000ip27, Sun Darwin-270, Sun Ultra2 Model 2200), comparing LU with vendor BLAS against LU with ATLAS BLAS.]
Extensions to present work
   Level 2 BLAS
     – GEMV for next release
   Level 1 BLAS
   GEMM tweaks
     – Better small case performance
     – Prefetch
   Threading
   Specialization for user applications
Sparse Linear Algebra Operations
   Reuse of present work
     – GEMV
   Compile time adaptations
     – Dense opts
     – Caching parameters
     – TLB info
   Runtime adaptation
     – Sparsity analysis
     – Iterative code improvement
   Adaptive libraries
   Many open-ended questions
Java ATLAS
    Java wrappers around C native methods
      – supply full performance for Java programs
      – Under development now
    Java reference implementation
      – Provides BLAS interface for applets
      – Under development now
    Need Java code generator for pure-Java BLAS, usable by applets
Related Projects
   FFTW
     www.theory.lcs.mit.edu/~fftw
     – Fastest FFT in the West
     – Automatically tuned FFTs in ANSI C
   PHiPAC
     www.icsi.berkeley.edu/~bilmes/phipac
     – Portable High Performance ANSI C
     – Automatically tuned BLAS
The ATLAS team
 Jack Dongarra
 Clint Whaley
 Jeff Horner
 Antoine Petitet
Basic Linear Algebra
Subprograms (BLAS)




BLAS -- Introduction
   Clarity: code is shorter and easier to read,
   Modularity: gives programmer larger building blocks,
   Performance: manufacturers will provide tuned machine-specific BLAS,
   Program portability: machine dependencies are confined to the BLAS.
Memory Hierarchy

   Registers -> L1 Cache -> L2 Cache -> Local Memory -> Remote Memory -> Secondary Memory

   Key to high performance is effective use of the memory hierarchy
   True on all architectures
Level 1, 2 and 3 BLAS
 Level 1 BLAS: vector-vector operations
 Level 2 BLAS: matrix-vector operations
 Level 3 BLAS: matrix-matrix operations
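
In the double precision reference BLAS these three levels correspond, for example, to DAXPY, DGEMV and DGEMM; the arguments below are illustrative:

      CALL DAXPY( N, ALPHA, X, 1, Y, 1 )
      CALL DGEMV( 'No transpose', M, N, ALPHA, A, LDA, X, 1,
     $            BETA, Y, 1 )
      CALL DGEMM( 'No transpose', 'No transpose', M, N, K,
     $            ALPHA, A, LDA, B, LDB, BETA, C, LDC )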
Why Higher Level BLAS?
   Can only do arithmetic on data at the top of the hierarchy
   Higher level BLAS lets us do this

   BLAS                 Memory Refs   Flops   Flops/Memory Refs
   Level 1 (y = y+ax)   3n            2n      2/3
   Level 2 (y = y+Ax)   n^2           2n^2    2
   Level 3 (C = C+AB)   4n^2          2n^3    n/2
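
To see what the last column means in practice: for n = 1000, a Level 3 operation performs 2n^3 = 2 x 10^9 flops on 4n^2 = 4 x 10^6 memory references, i.e. n/2 = 500 flops per reference, so blocks can be held near the top of the hierarchy and reused; a Level 1 operation gets fewer than one flop per reference and is memory-bound.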
BLAS for Performance

[Chart: performance in Mflop/s on an IBM RS/6000-590 (66 MHz, 264 Mflop/s peak) for vector/matrix orders 10 to 500. Level 3 BLAS reach roughly 250 Mflop/s, Level 2 BLAS about 100 Mflop/s, Level 1 BLAS about 50 Mflop/s.]

   Development of blocked algorithms important for performance
BLAS -- References
   BLAS software and documentation can be obtained via:
     – WWW: http://www.netlib.org/blas,
     – (anonymous) ftp ftp.netlib.org: cd blas; get index
     – email netlib@www.netlib.org with the message: send index from blas
   Comments and questions can be addressed to: lapack@cs.utk.edu
BLAS Papers
     C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, Basic Linear
      Algebra Subprograms for Fortran Usage, ACM Transactions on
      Mathematical Software, 5:308--325, 1979.
     J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, An
      Extended Set of Fortran Basic Linear Algebra Subprograms,
      ACM Transactions on Mathematical Software, 14(1):1--32,
      1988.
     J. Dongarra, J. Du Croz, I. Duff, S. Hammarling, A Set of Level
      3 Basic Linear Algebra Subprograms, ACM Transactions on
      Mathematical Software, 16(1):1--17, 1990.




BLAS Technical Forum
http://www.netlib.org/blas/blast-forum/blast-forum.html

       Established a Forum to consider expanding the
        BLAS in light of modern software, language, and
        hardware developments.
       Minutes available from each meeting
       Working proposals for the following:
        –   Dense/Band BLAS
        –   Sparse BLAS
        –   Extended Precision BLAS
        –   Interval BLAS
        –   C interfaces to Legacy BLAS
       Final document in preparation
Aims
   Expanding   the BLAS in a number of
    ways in light of modern software,
    language, and hardware developments.
   Enable linear algebra libraries (both
    public domain and commercial) to
    interoperate efficiently and easily.
   Attendees are application
    programmers, library developers, and
    vendors
Goals
   Stimulate thought, discussion, and future development of a set of standards for:
     – basic matrix data structures, both dense and sparse,
     – calling sequences for a set of low-level computational kernels for the parallel and sequential settings.
Language Bindings
   Model implementation and test suite for all routines
     – Fortran 95
     – Fortran 77
     – C
   Environmental Enquiry routine
   Future forum for C++, Java, ...
Dense and Banded BLAS
   Extensions
     – LAPACK auxiliary routines
     – "Missing" BLAS routines
     – Combined operations -- DOT_AXPY, GEMVT, TRMT, GEMVER
     – Multiple instances -- generate and apply simultaneous Givens transformations
Sparse BLAS
   Small subset of BLAS functionality
     – matrix multiply and triangular solve
     – does not include operations where both operands are sparse
   Many storage schemes for sparse matrices
   Generic interface
     – handle-based interface for matrices
     – allows an implementation independent of the storage scheme
Extended and Mixed-Precision BLAS
   Extended precision is used internally; input and output arguments are the same
   Mixed precision refers to having some input/output arguments in single and some in double precision
   Simpler, faster, more accurate algorithms
   Different machines support extended precision in different ways
Interval BLAS
   Data are represented by machine-representable intervals to avoid truncation and round-off error in floating-point arithmetic
   Mathematical operations involving intervals, interval vectors, and dense, banded, and tridiagonal interval matrices
   Subset of the functionality in the Dense and Band BLAS chapter
Journal of Development
   Environmental  routines
   Distributed Memory Parallel BLAS
   Lite/Thin BLAS interfaces
   Fortran 95 Thin BLAS Interface




Interfaces to the Legacy BLAS
   C interface to the Legacy BLAS
   Fortran 95 interface to the Legacy BLAS
Conclusions
   Propose new standards for many flavors of the BLAS in light of software, language, and hardware developments
   To be adopted by vendors and ISVs
   Users will be able to have more efficient applications with machine dependencies isolated to the BLAS
Linear Algebra PACKage
(LAPACK)




EISPACK and LINPACK
  EISPACK
  – Designed for the algebraic eigenvalue problem, Ax = λx and Ax = λBx.
  – Work of J. Wilkinson and colleagues in the 70's.
  – Fortran 77 software based on translation of ALGOL.
  LINPACK
  – Designed for solving systems of equations, Ax = b.
  – Fortran 77 software using the Level 1 BLAS.
History of Block-Partitioned
Algorithms
   Early algorithms involved use of small
    main memory using tapes as secondary
    storage.

   Recent  work centers on use of vector
    registers, level 1 and 2 cache, main
    memory, and “out of core” memory.


Block-Partitioned Algorithms
 LU factorization
 Cholesky factorization
 Symmetric indefinite factorization
 Matrix inversion
 QR, QL, RQ, LQ factorizations
 Form Q or Q^T C
 Orthogonal reduction to:
   – (upper) Hessenberg form
   – symmetric tridiagonal form
   – bidiagonal form
 Block QR iteration for nonsymmetric eigenvalue problems
LAPACK
 Linear algebra library in Fortran 77
   – Solution of systems of equations
   – Solution of eigenvalue problems
 Combines algorithms from LINPACK and EISPACK into a single package
 Efficient on a wide range of computers
   – RISC, Vector, SMPs
 User interface similar to LINPACK
   – Single, Double, Complex, Double Complex
 Built on the Level 1, 2, and 3 BLAS
LAPACK
  Most  of the parallelism in the BLAS.
  Advantages of using the BLAS for
   parallelism:
   – Clarity
   – Modularity
   – Performance
   – Portability


Derivation of Blocked Algorithms
Cholesky Factorization A = U^T U

$$
\begin{pmatrix} A_{11} & a_j & A_{13} \\ a_j^T & a_{jj} & \alpha_j^T \\ A_{13}^T & \alpha_j & A_{33} \end{pmatrix} =
\begin{pmatrix} U_{11}^T & 0 & 0 \\ u_j^T & u_{jj} & 0 \\ U_{13}^T & \mu_j & U_{33}^T \end{pmatrix}
\begin{pmatrix} U_{11} & u_j & U_{13} \\ 0 & u_{jj} & \mu_j^T \\ 0 & 0 & U_{33} \end{pmatrix}
$$

Equating coefficients of the jth column, we obtain

$$ a_j = U_{11}^T u_j, \qquad a_{jj} = u_j^T u_j + u_{jj}^2 . $$

Hence, if $U_{11}$ has already been computed, we can compute $u_j$ and $u_{jj}$ from the equations:

$$ U_{11}^T u_j = a_j, \qquad u_{jj}^2 = a_{jj} - u_j^T u_j . $$
LINPACK Implementation
  Here is the body of the LINPACK routine SPOFA which implements the method:

         DO 30 J = 1, N
            INFO = J
            S = 0.0E0
            JM1 = J - 1
            IF( JM1.LT.1 ) GO TO 20
            DO 10 K = 1, JM1
               T = A( K, J ) - SDOT( K-1, A( 1, K ), 1, A( 1, J ), 1 )
               T = T / A( K, K )
               A( K, J ) = T
               S = S + T*T
   10       CONTINUE
   20       CONTINUE
            S = A( J, J ) - S
C        ...EXIT
            IF( S.LE.0.0E0 ) GO TO 40
            A( J, J ) = SQRT( S )
   30    CONTINUE
LAPACK Implementation

      DO 10 J = 1, N
         CALL STRSV( 'Upper', 'Transpose', 'Non-unit', J-1, A, LDA,
     $               A( 1, J ), 1 )
         S = A( J, J ) - SDOT( J-1, A( 1, J ), 1, A( 1, J ), 1 )
         IF( S.LE.ZERO ) GO TO 20
         A( J, J ) = SQRT( S )
   10 CONTINUE

 This change by itself is sufficient to significantly improve the performance on a number of machines.
 From 72 to 251 Mflop/s for a matrix of order 500 on one processor of a CRAY Y-MP.
 However, only 378 Mflop/s on 8 processors of a CRAY Y-MP.
 This suggests that further work is needed.
Derivation of Blocked Algorithms

$$
\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{12}^T & A_{22} & A_{23} \\ A_{13}^T & A_{23}^T & A_{33} \end{pmatrix} =
\begin{pmatrix} U_{11}^T & 0 & 0 \\ U_{12}^T & U_{22}^T & 0 \\ U_{13}^T & U_{23}^T & U_{33}^T \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} & U_{13} \\ 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{pmatrix}
$$

Equating coefficients of the second block of columns, we obtain

$$ A_{12} = U_{11}^T U_{12}, \qquad A_{22} = U_{12}^T U_{12} + U_{22}^T U_{22} . $$

Hence, if $U_{11}$ has already been computed, we can compute $U_{12}$ as the solution of the equations

$$ U_{11}^T U_{12} = A_{12} $$

by a call to the Level 3 BLAS routine STRSM, and then compute $U_{22}$ from

$$ U_{22}^T U_{22} = A_{22} - U_{12}^T U_{12} . $$
LAPACK Blocked Algorithms

      DO 10 J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-unit', J-1,
     $               JB, ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A( 1, J ),
     $               LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE

 On the Y-MP, the Level 3 BLAS squeeze a little more out of 1 processor, but make a large improvement when using all 8 processors.

   Cray Y-MP, Cholesky, n = 500   1 Proc   8 Procs
   j-variant: LINPACK                72        72
   j-variant: Level 2 BLAS          251       378
   j-variant: Level 3 BLAS          287      1225
   k-variant: Level 3 BLAS          290      1414
LAPACK Contents
 Combines algorithms from LINPACK and EISPACK into a single package. User interface similar to LINPACK.
 Built on the Level 1, 2 and 3 BLAS, for high performance (manufacturers optimize BLAS)
 LAPACK does not provide routines for structured problems or general sparse matrices (i.e. sparse storage formats such as compressed-row, -column, -diagonal, etc.)
LAPACK -- Motivations
     Basic Problems:
      – Linear systems: Ax = b
      – Least squares: min ||Ax-b||_2, A = UΣV^T
      – Eigenvalues and vectors: Ax = λx, Ax = λBx
   Uses (manufacturer optimized) BLAS ==> high performance
   (Trans)portable Fortran 77 (+BLAS), 805K lines (386K library, 419K testing and timing)
LAPACK -- Release 3.0
   Evolved into a worldwide community effort
   Add functionality
      – Holy Grail (SEP using Relatively Robust
        Representations)
      – divide and conquer SVD,
      – divide and conquer least squares,
      – new d&c and expert drivers for GSEP,
      – new simple and expert drivers for GNEP,
      – faster QRP,
      – faster solver for the rank-deficient LS (xGELSY)
LAPACK -- Release 3.0
   Performance enhancements to existing
    routines
   LAPACK Users’ Guide, Third Edition




LAPACK Ongoing Work
   Add functionality
     – updating/downdating, bidiagonal bisection, bidiagonal inverse iteration, band SVD, Jacobi methods, ...
   Move to new generation of high performance machines
     – IBM SPs, CRAY T3E, SGI Origin 2000, clusters of workstations
   New challenges
     – New languages: FORTRAN 90, HP FORTRAN, ...
     – Many flavors of message passing (CMMD, MPL, NX, ...); need a standard (PVM, MPI): BLACS
   Highly varying ratio of computational speed to communication speed
   Many ways to lay out data
   Fastest parallel algorithm sometimes less stable numerically.
LAPACK -- Summary
   LAPACK success from HP-48G to Cray C-90
   (Trans)portable Fortran77 (+BLAS), 805K
    lines (386K library, 419K testing and timing)
   Targets workstations, vector machines, and
    shared memory computers
   Initial release February 1992
   1,066,000 requests since then (as of 6/97)
   Linear systems, least squares,
    eigenproblems
LAPACK -- Summary contd
   LAPACK Users' Guide, Second Edition, available from SIAM
     http://www.netlib.org/lapack/lug/lapack_lug.html
   Being translated into Japanese and Russian
   Third release and manual in Spring 1999
LAPACK -- References
     E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, Second Edition, SIAM, Philadelphia, PA, 1995.
     LAPACK software and documentation can be obtained via the WWW or anonymous ftp:
       – http://www.netlib.org/lapack,
       – http://www.netlib.org/lapack/lawns,
       – http://www.netlib.org/clapack,
       – http://www.netlib.org/c++/lapack++,
       – http://www.netlib.org/lapack90,
       – (anonymous) ftp ftp.netlib.org
     Comments and questions can be addressed to lapack@cs.utk.edu
Basic Linear Algebra
Communication Subprograms
(BLACS)




BLACS -- Introduction
   A design tool, they are a conceptual aid in
    design and coding.
   Associate widely recognized mnemonic
    names with communication operations,
    improve
      – program readability,
      – self-documenting quality of the code.
     Promote efficiency by identifying frequently
      occurring operations of linear algebra which
      can be optimized on various computers.
BLACS -- Intro contd.
   Program portability is improved through standardization of kernels,
   without sacrificing efficiency, since optimized versions are or will be available.
BLACS -- Basics
Processes are embedded in a two-dimensional grid. Example: a 3 x 4 grid

            0    1    2    3
       0    0    1    2    3
       1    4    5    6    7
       2    8    9   10   11
BLACS -- Basics
  An operation which involves more than one sender and one receiver is called a "scoped operation". Using a 2D grid, there are 3 natural scopes:

   Scope    Meaning
   Row      All processes in a process row participate.
   Column   All processes in a process column participate.
   All      All processes in the process grid participate.
BLACS -- Basics

      CALL BLACS_PINFO( IAM, NPROCS )
*     If underlying system needs additional setup, do it now
      IF( NPROCS.LT.1 ) THEN
         IF( IAM.EQ.0 ) NPROCS = 4
         CALL BLACS_SETUP( IAM, NPROCS )
      END IF
*     Get default system context
      CALL BLACS_GET( 0, 0, ICTXT )
BLACS -- Basics

*     Define 1 x (NPROCS/2+1) process grid
      NPROW = 1
      NPCOL = NPROCS / 2 + 1
      CALL BLACS_GRIDINIT( ICTXT, 'Row', NPROW, NPCOL )
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
*     If I'm not in the grid, go to end of program
      IF( MYROW.EQ.-1 ) GO TO 10
      ...
      CALL BLACS_GRIDEXIT( ICTXT )
   10 CONTINUE
      CALL BLACS_EXIT( 0 )
      END
BLACS -- Basics
   Operations are matrix-based and ID-less
   Types of BLACS routines: point-to-point communication, broadcast, combine operations, and support routines.
BLACS -- Communication Routines
  Send/Receive
   Send (sub)matrix from one process to another:
     – _xxSD2D( ICTXT, [UPLO,DIAG], M, N, A, LDA, RDEST, CDEST )
     – _xxRV2D( ICTXT, [UPLO,DIAG], M, N, A, LDA, RSRC, CSRC )

   _ (Data type)             xx (Matrix type)
   I: Integer                GE: General rectangular matrix
   S: Real                   TR: Trapezoidal matrix
   D: Double Precision
   C: Complex
   Z: Double Complex
BLACS -- Point to Point
CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )

IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
  CALL DGESD2D( ICTXT, 5, 1, X, 5, 1, 0 )
ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.0 ) THEN
  CALL DGERV2D( ICTXT, 5, 1, Y, 5, 0, 0 )
END IF




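
In this example, process (0,0) sends the 5 x 1 vector X to process (1,0), which receives it into Y; the last two arguments of DGESD2D are the destination coordinates, and the last two of DGERV2D are the source coordinates.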
BLACS -- Communication Routines
  Broadcast
     Send (sub)matrix to all processes or a subsection of processes in SCOPE, using various distribution patterns (TOP):
      – _xxBS2D( ICTXT, SCOPE, TOP, [UPLO, DIAG], M, N, A, LDA )
      – _xxBR2D( ICTXT, SCOPE, TOP, [UPLO, DIAG], M, N, A, LDA, RSRC, CSRC )

      SCOPE       TOP
      'Row'       ' ' (default)
      'Column'    'Increasing Ring'
      'All'       '1-tree' ...
BLACS -- Broadcast

  CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
  IF( MYROW.EQ.0 ) THEN
    IF( MYCOL.EQ.0 ) THEN
      CALL DGEBS2D( ICTXT, 'Row', ' ', 2, 2, A, 3 )
    ELSE
      CALL DGEBR2D( ICTXT, 'Row', ' ', 2, 2, A, 3, 0, 0 )
    END IF
  END IF
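
Here process (0,0) broadcasts the leading 2 x 2 submatrix of A (stored with leading dimension 3) along process row 0: the sender calls DGEBS2D, while every other process in the row scope calls DGEBR2D, naming (0,0) as the source.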
BLACS -- Combine Operations
  Global combine operations
   Perform element-wise SUM, |MAX|, |MIN| operations on triangular matrices:
     – _GSUM2D( ICTXT, SCOPE, TOP, M, N, A, LDA, RDEST, CDEST )
     – _GAMX2D( ICTXT, SCOPE, TOP, M, N, A, LDA, RA, CA, RCFLAG, RDEST, CDEST )
     – _GAMN2D( ICTXT, SCOPE, TOP, M, N, A, LDA, RA, CA, RCFLAG, RDEST, CDEST )
   RDEST = -1 indicates that the result of the operation should be left on all processes selected by SCOPE.
BLACS -- Combine operations
     For |MAX| and |MIN|, when RCFLAG = -1, RA and CA are not referenced; otherwise, these arrays are set on output with the coordinates of the process owning the corresponding maximum (resp. minimum) element in absolute value of A. Both arrays must be at least of size M x N, and their leading dimension is specified by RCFLAG.
BLACS -- Combine

  CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
  IF( MYROW.EQ.0 ) THEN
    K00 = MYCOL
    CALL IGSUM2D( ICTXT, 'Row', ' ', 1, 1, K00, 1, 0, 0 )

    KALL = MYCOL
    CALL IGSUM2D( ICTXT, 'Row', ' ', 1, 1, KALL, 1, -1, 0 )
  END IF
BLACS -- Combine

     EPS = 1.0D+0
  10 CONTINUE
     EPS = EPS / 2.0D+0
     IF( ( 1.0D+0 + EPS ) .GT. 1.0D+0 ) GO TO 10
     CALL DGAMX2D( ICTXT, 'All', ' ', 1, 1, EPS, 1, IRSRC, ICSRC,
    $              -1, -1, 0 )
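
The loop computes the local machine epsilon by repeated halving; the DGAMX2D call with RDEST = -1 then leaves the largest epsilon over the whole grid on every process, giving a single value that is safe to use across a heterogeneous machine.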
BLACS -- Advanced Topics
  The BLACS context is the BLACS mechanism for partitioning communication space. A defining property of a context is that a message in a context cannot be sent or received in another context. The BLACS context includes the definition of a grid, and the process coordinates within it.
  It allows the user to
   create arbitrary groups of processes,
   create multiple overlapping and/or disjoint grids,
   isolate each process grid so that grids do not interfere with each other.

  BLACS context <=> MPI communicator
BLACS -- Advanced Topics
     The BLACS provide “globally-blocking”
      point-to-point receive, broadcasts and
      combines. The BLACS point-to-point send is
      “locally-blocking”.
      – Globally blocking: The return from the operation
        implies that the buffer is available for re-use. The
        operation may not complete unless the
        complement of the operation is called.
      – Locally blocking: The return from the send
        implies that the buffer is available for re-use. The
        send will complete regardless of whether the
        corresponding receive is posted.
BLACS -- Advanced Topics
  CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW,
    MYCOL )

  IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
    CALL DGESD2D( ICTXT, 5, 1, X, 5, 1, 0 )
    CALL DGERV2D( ICTXT, 5, 1, X, 5, 1, 0 )
  ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.0 ) THEN
    CALL DGESD2D( ICTXT, 5, 1, X, 5, 0, 0 )
    CALL DGERV2D( ICTXT, 5, 1, X, 5, 0, 0 )
  END IF



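
This code swaps the vector X between processes (0,0) and (1,0). It works because the BLACS send is locally blocking: each DGESD2D returns once its buffer may be reused, whether or not the matching receive has been posted. If the sends were globally blocking, both processes would wait in DGESD2D and the exchange would deadlock.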
BLACS -- Advanced Topics
     The topology parameter allows the user to vary the cost per process of the messages involved in a distributed (broadcast or combine) operation. There are two main classes of topologies:
      – Pipelining topologies (ring-based),
      – Non-pipelining topologies (tree-based).
     Topologies enhance performance.
BLACS -- Advanced Topics
  A  routine is repeatable iff it is
    guaranteed to give the same answer if
    called multiple times with the same
    parallel configuration and input.
   A routine is coherent iff all processes
    selected to receive an answer get
    identical results.


BLACS -- Advanced Topics
   Repeatability and coherence do not affect correctness. A routine may be both incoherent and nonrepeatable, and still give the correct output;
   it is just that inaccuracies in floating point calculations may cause the routine to return differing values, all of which are equally valid.
BLACS -- Advanced Topics
    Examples:
     – Coherence: a ring-based sum reduction may leave
       different (equally valid) results on different processes,
     – Repeatability: race conditions in message-passing
       may influence the order of floating point computations
       from run to run.




              http://www.netlib.org/scalapack   117
BLACS -- Advanced Topics
     Homogeneous Coherency: All processes selected
      to possess the result receive the exact same answer
      if:
        – Communication does not change the value of the
          communicated data,
        – All processes perform floating point arithmetic
          exactly the same.
     Heterogeneous Coherency: All processes will
      receive the exact same answer if communication
      does not change the value of the communicated
      data.
                   http://www.netlib.org/scalapack     118
BLACS -- Advanced Topics
      Examples:
       – One does not want to use a stopping criterion
         based on incoherent results,
       – Repeatability makes debugging easier.
      The BLACS assume the communication is
       coherent. For combine operations, the
       BLACS allow the user to set flags forcing
       combines to be repeatable and/or
       heterogeneous-coherent. Setting these flags
       may cost performance.
                  http://www.netlib.org/scalapack      119
BLACS -- Example Programs
     HELLO is one of the simplest BLACS programs.
      This program takes the available processes, forms
      them into a process grid, and then has each process
      check in with the process at (0,0) in the process grid.
     PI calculates the value of pi, using numerical
      integration with parallel processing, and clocks the
      solution time.
     PROCMAP illustrates how to map processes to a
      grid using BLACS_GRIDMAP. This is a fundamental
      example using simultaneous multiple (possibly
      overlapping) grids.

                    http://www.netlib.org/scalapack        120
BLACS -- References
     BLACS software and documentation can be obtained
      via WWW and anonymous ftp:
       – http://www.netlib.org/blacs/
       – (anonymous) ftp ftp.netlib.org

     Comments/questions to blacs@cs.utk.edu.
     R. C. Whaley, Basic Linear Algebra Communication Subprograms:
      Analysis and Implementation Across Multiple Parallel Architectures, UT
      Tech Report CS-94-234, LAPACK Working Note #73, 1994.
     J. Dongarra and R. C. Whaley, A User’s Guide to the BLACS v1.1, UT
      Tech Report CS-95-281, LAPACK Working Note #94, 1995 (updated
      May, 1997)



                        http://www.netlib.org/scalapack                  121
Parallel Basic Linear Algebra
Subprograms (PBLAS)




           http://www.netlib.org/scalapack   122
PBLAS -- Introduction
  Parallel Basic Linear Algebra
   Subprograms for distributed-memory MIMD
   computers
   Perform both the communication and the computation.
   Simplify parallelization, especially of
    BLAS-based codes,
   Modularity: gives the programmer larger building
    blocks,
   Portability: machine dependencies are
    confined to the BLAS and BLACS.
                http://www.netlib.org/scalapack   123
Scope of the PBLAS
  Level 1 PBLAS:
   Vector-Vector operations
  Level 2 PBLAS:
   Matrix-Vector operations
  Level 3 PBLAS:
   Matrix-Matrix operations
                 http://www.netlib.org/scalapack                     124
Scope of the PBLAS
   No vector rotations,
   No dedicated subprograms for banded
    matrices,
   Matrix transposition available.
   Prototype version of packed storage
    PBLAS
      – http://www.netlib.org/scalapack/prototype/


                 http://www.netlib.org/scalapack   125
PBLAS -- Naming Conventions
       The PBLAS naming scheme follows the
        conventions of the BLAS, with necessary
        adaptations: P_XXYYY
   _ (Data type):
   S: Real, D: Double Precision, C: Complex, Z: Double Complex.
   XX (Matrix type):
   GE: All matrix operands are GEneral rectangular,
   HE: One of the matrix operands is HErmitian,
   SY: One of the matrix operands is SYmmetric,
   TR: One of the matrix operands is TRiangular.
                         http://www.netlib.org/scalapack        126
PBLAS -- Naming Conventions
      The Level 1 routines use meaningful
       abbreviations, for example, PSCOPY,
       PSDOT, ...
      The matrix transposition routine is called P_TRANx.
      YYY (Operation):
   MM     Matrix-matrix product;
   MV     Matrix-vector product;
   R      Rank-1 update of a matrix;
   R2     Rank-2 update of a matrix;
   RK     Rank-k update of a symmetric / Hermitian matrix;
   R2K    Rank-2k update of a symmetric / Hermitian matrix;
   SM     Solves a system of linear eqns. for a matrix of rhs;
   SV     Solves a system of linear eqns. for a rhs vector.
                      http://www.netlib.org/scalapack            127
PBLAS
  Similar to the BLAS in functionality and
   naming.
  Built on the BLAS and BLACS
  Provide global view of matrix
   CALL DGEXX ( M, N, A( IA, JA ), LDA,... )

   CALL PDGEXX( M, N, A, IA, JA, DESCA,... )
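  To make the correspondence concrete, here is a hedged side-by-side
  sketch using the Level 2 matrix-vector product (the operands,
  descriptors, and index arguments are illustrative):

*     BLAS: y := alpha*A*x + beta*y on a local submatrix
      CALL DGEMV( 'N', M, N, ALPHA, A( IA, JA ), LDA, X, 1,
     $            BETA, Y, 1 )
*     PBLAS: the same operation on a distributed matrix; the
*     global indices (IA,JA) and the descriptor DESCA replace
*     local addressing
      CALL PDGEMV( 'N', M, N, ALPHA, A, IA, JA, DESCA, X, IX, JX,
     $             DESCX, 1, BETA, Y, IY, JY, DESCY, 1 )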


             http://www.netlib.org/scalapack   128
PBLAS -- Syntax
       Global view of the matrix operands, allowing
        global addressing of distributed matrices (hiding
        complex local indexing),

   [Diagram: the M-by-N submatrix A(IA:IA+M-1, JA:JA+N-1),
    located by its global starting indices (IA, JA) inside the
    M_-by-N_ global matrix A]
                     http://www.netlib.org/scalapack       129
Data Distributions
   Global  view of the matrix operands,
    allowing global addressing of distributed
    matrices (hiding complex local
    indexing),
   Fits with the distribution patterns for
    HPF
   Load balance maintained



              http://www.netlib.org/scalapack   130
PBLAS -- Storage Conventions
   An M_ -by-N_ matrix is block-partitioned
    into MB_-by-NB_ blocks and distributed
    according to the two-dimensional block-
     cyclic scheme --> load-balanced
     computations, scalability
   Locally, the scattered columns are stored
    contiguously (FORTRAN “Column Major”)
      – re-use of the BLAS (leading dimension LLD_)

                  http://www.netlib.org/scalapack     131
Distribution and Storage
          Matrix is block-partitioned and blocks are mapped to processes
          Distributed 2-D block-cyclic scheme

   5x5 matrix partitioned in 2x2 blocks     2x2 process grid point of view

   A11  A12  A13  A14  A15                  A11  A12  A15 | A13  A14
   A21  A22  A23  A24  A25                  A21  A22  A25 | A23  A24
   A31  A32  A33  A34  A35                  A51  A52  A55 | A53  A54
   A41  A42  A43  A44  A45                  --------------+---------
   A51  A52  A53  A54  A55                  A31  A32  A35 | A33  A34
                                            A41  A42  A45 | A43  A44

          Routines available to distribute/redistribute data.        132
PBLAS -- Storage Conventions
      Descriptor DESC_: 9-integer array describing the
       matrix layout, containing DTYPE_, CTXT_, M_, N_,
       MB_, NB_, RSRC_, CSRC_, LLD_, where
       (RSRC_, CSRC_) are the coordinates of the process
       owning the first matrix entry in the grid specified by
       CTXT_.
      Ex: M_ = N_ = 5, MB_ = NB_ = 2, RSRC_ = CSRC_ = 0,
       LLD_ >= 3 (in process row 0) and >= 2 (in process
       row 1).
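  A hedged sketch of setting up this descriptor with the ScaLAPACK
  TOOLS routines NUMROC and DESCINIT (the local variables MLOC and
  LLDA are illustrative):

      INTEGER            NUMROC
      EXTERNAL           NUMROC
*     Number of rows of the 5x5 global matrix stored locally:
*     3 in process row 0, 2 in process row 1
      MLOC = NUMROC( 5, 2, MYROW, 0, NPROW )
      LLDA = MAX( 1, MLOC )
      CALL DESCINIT( DESCA, 5, 5, 2, 2, 0, 0, ICTXT, LLDA, INFO )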



                    http://www.netlib.org/scalapack       133
PBLAS -- Auxiliary Subprograms
     Using the BLACS topologies for efficiency
      and scalability:
      SUBROUTINE PTOPSET(ICTXT, OP, SCOPE, TOP)
      SUBROUTINE PTOPGET(ICTXT, OP, SCOPE, TOP)
      INTEGER            ICTXT
       CHARACTER*1        OP, SCOPE, TOP
     PTOPSET assigns the BLACS topology TOP to be
      used in the communication operations OP along the
      scope specified by SCOPE. PTOPGET retrieves the
      topology.
     These routines should be called by every process in
      the grid identified by the BLACS context ICTXT.
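  A hedged usage sketch; the argument values are illustrative ('B' is
  assumed here to denote broadcast operations, and 'I' an
  increasing-ring topology) and should be checked against the PBLAS
  documentation:

*     Illustrative: ring-based broadcasts along process rows
      CALL PTOPSET( ICTXT, 'B', 'Row', 'I' )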
                   http://www.netlib.org/scalapack     134
PBLAS -- Auxiliary Subprograms
   Disposing of the PBLAS buffer
   allocated in every process’s dynamic
   memory:
   SUBROUTINE PBFREEBUF()




                http://www.netlib.org/scalapack   135
PBLAS -- Rationale
   Software reusability: BLAS and PBLAS
    interfaces similar,
   Level 1 PBLAS: Output scalar is only correct
    in operand scope,
   Level 3 PBLAS uses “shaped adaptive”
    procedure to perform the computations,
   The PBLAS routines follow an SPMD
    programming model (every process in the
    grid must call the routine).

               http://www.netlib.org/scalapack   136
PBLAS -- Rationale contd
   The  BLACS context provides the ability
    to have separate “universes” of
    message passing:
   Safe co-existence with other parallel
    libraries,
   The PBLAS do not perform “inter-
    context” operations.


              http://www.netlib.org/scalapack   137
PBLAS -- Examples
  INTEGER IAM, ICTXT, INFO, LDA, LDB, LDC, NMAX, NPROCS
  PARAMETER ( NMAX = 3, LDA = NMAX, LDB = NMAX, LDC = NMAX )
  INTEGER DESCA( 9 ), DESCB( 9 ), DESCC( 9 )
  DOUBLE PRECISION A( NMAX, NMAX ), B( NMAX, NMAX ), C( NMAX,
    NMAX )

  CALL BLACS_PINFO( IAM, NPROCS )
* If the processes have not yet been spawned (e.g., under
* PVM), start four; then create a 2x2 row-major process grid.
  IF( NPROCS.LT.1 ) THEN
   NPROCS = 4
   CALL BLACS_SETUP( IAM, NPROCS )
  END IF
  CALL BLACS_GET( -1, 0, ICTXT )
  CALL BLACS_GRIDINIT( ICTXT, 'Row', 2, 2 )
                    http://www.netlib.org/scalapack        138
PBLAS -- Examples
  . . .
  CALL DESCINIT( DESCA, 5, 5, 2, 2, 0, 0, ICTXT, LDA, INFO )
  CALL DESCINIT( DESCB, 5, 5, 2, 2, 0, 0, ICTXT, LDB, INFO )
  CALL DESCINIT( DESCC, 5, 5, 2, 2, 0, 0, ICTXT, LDC, INFO )

  . . .
* C(1:4,1:4) := 1.0 * A(1:4,1:4) * B(1:4,1:4)
  CALL PDGEMM( 'No transpose', 'No transpose', 4, 4, 4, 1.0D+0,
  $               A, 1, 1, DESCA, B, 1, 1, DESCB, 0.0D+0, C, 1, 1,
     DESCC )

  . . .
* Free the PBLAS internal buffer, release the grid, and exit
  CALL PBFREEBUF()
  CALL BLACS_GRIDEXIT( ICTXT )
  CALL BLACS_EXIT( 0 )
                     http://www.netlib.org/scalapack                 139
PBLAS -- Example Programs
   PBLAS     shows how to set the matrix
    descriptors and call the Level 1, 2, and
    3 PBLAS routines.
   SUMMA illustrates the use of the BLAS
    and the BLACS to implement an
    efficient and scalable matrix-matrix
    multiply routine.


              http://www.netlib.org/scalapack   140
PBLAS -- Example Programs
     This implementation of SUMMA was first
      suggested by:
      – R. Agarwal, F. Gustavson, and M. Zubair, A High
        Performance Matrix Multiplication Algorithm on a
        Distributed-Memory Parallel Computer, Using Overlapped
        Communication, IBM Journal of Research and
        Development, Vol. 38, No. 6, pp. 673--681, 1994.
     For a scalability analysis of this algorithm
      see:
      – R. van de Geijn, and J. Watts, SUMMA: Scalable Universal
        Matrix Multiplication Algorithm, UT Tech Report CS-95-286,
        LAPACK Working Note #96, 1995.
                     http://www.netlib.org/scalapack             141
Features of PBLAS V2
ALPHA
   Software  is backward compatible with
    current version 1.5.
   Improved ease-of-use:
    – Previous alignment restrictions have all
      been removed. Re-alignment is performed
      on-the-fly and only when necessary.
    – General rectangular block cyclic
      distribution of the operands is supported
      for all routines.
               http://www.netlib.org/scalapack   142
Features of PBLAS V2
ALPHA
   – Support for replicated operands
     » Increased functionality that is already usable, as
       illustrated by the packed storage ScaLAPACK
       prototype routines,
     » Enables the expression of currently missing
       algorithms in the ScaLAPACK dense library,
       such as constrained linear least squares
       solvers,
     » Simplifies the expression of complex algorithms
       such as divide and conquer algorithms (sign
       function).
              http://www.netlib.org/scalapack    143
Features of PBLAS V2
ALPHA
   Efficient for a wider range of possible
    distribution parameters:
     – Algorithmic blocking techniques used
       throughout the Level 2 and 3 PBLAS.
       » Enable the use of such techniques at the
         ScaLAPACK level.
   Testing  and timing programs upgraded
    to test the above new functionalities.

                http://www.netlib.org/scalapack     144
Performance of PBLAS V2
ALPHA
   Specific (and more efficient) algorithms
    have been implemented for large
     numbers of right-hand sides, e.g.,
    triangular solve, triangular matrix-
    multiply, symmetric or Hermitian matrix-
    multiply, rank-k and rank-2k updates.
   Efficiency improved on a wider range of
    problem sizes and machine
    configurations than version 1.5.
               http://www.netlib.org/scalapack   145
Performance of PBLAS V2
ALPHA
    Fine performance tuning is ongoing
     work, in particular, profiling and precise
     timing analysis of each component for
     various distributed-memory concurrent
     computers.
    Documentation as well as a software
     design description will be made
     available with the final release of
     PBLAS V2 (early 1998).
    PBLAS V2 will replace the current version.
               http://www.netlib.org/scalapack 146
PBLAS -- References
     J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D.
      Walker, and R. C. Whaley, A Proposal for a Set of
      Parallel Basic Linear Algebra Subprograms, UT Tech
      Report CS-95-292, LAPACK Working Note 100,
      1995.
   PBLAS    software and documentation can be
      obtained via the WWW and anonymous ftp:
      – http://www.netlib.org/scalapack
      – (anonymous) ftp ftp.netlib.org
     Comments/suggestions to
      scalapack@cs.utk.edu.
                   http://www.netlib.org/scalapack     147
Design of ScaLAPACK




         http://www.netlib.org/scalapack   148
ScaLAPACK Structure
   [Diagram: layered software structure. ScaLAPACK calls the
    PBLAS; these form the global (distributed) layer. Beneath
    them, LAPACK and the BLACS form the local layer, which
    rests on the BLAS and a message-passing library
    (PVM/MPI/...).]
                   http://www.netlib.org/scalapack             149
Goals - Port LAPACK to Distributed-
Memory Environments.
     Efficiency
       – Optimized compute and communication engines
       – Block-partitioned algorithms (Level 3 BLAS) utilize the memory
         hierarchy and yield good node performance
     Reliability
       – Whenever possible, use LAPACK algorithms and error bounds.
     Scalability
       – As the problem size and number of processors grow
       – Replace LAPACK algorithms that did not scale; fold new ones back
         into LAPACK
     Portability
       – Isolate machine dependencies to the BLAS and the BLACS
     Flexibility
       – Modularity: build a rich set of linear algebra tools: BLAS, BLACS,
         PBLAS
                          http://www.netlib.org/scalapack              150
Object-Based Design in Fortran77
    Each global data object is assigned an
     array descriptor.
    The array descriptor
      – Contains information required to establish
          mapping between a global array entry and its
          corresponding process and memory location.
        – Is differentiated by the DTYPE_ (first entry) in the
          descriptor.
        – Provides a flexible framework to easily specify
          additional data distributions or matrix types.
        – Uses the concept of a context.
                  http://www.netlib.org/scalapack           151
Array Descriptors
   Array   descriptors currently supported
   for:
    – dense matrices
    – band and tridiagonal matrices
    – out-of-core matrices
   User must distribute all global arrays
   prior to the invocation of a ScaLAPACK
   routine.
                http://www.netlib.org/scalapack   152
Choosing a Data Distribution
     The two main issues in choosing a data
      layout for dense matrix computations are:
      – load balance, or splitting the work reasonably
        evenly among the processors throughout the
        algorithm, and
      – use of the Level 3 BLAS during computations on
        a single processor to utilize the memory hierarchy
        on each processor.



                   http://www.netlib.org/scalapack      153
Possible Data Layouts
  1D block and cyclic column distributions

  1D block-cyclic column and 2D block-cyclic
   distribution

  2D block-cyclic is used in ScaLAPACK for dense
   matrices
                 http://www.netlib.org/scalapack   154
Two-dimensional Block-Cyclic Distribution


   5x5 matrix partitioned in 2x2 blocks     2x2 process grid point of view

   A11  A12  A13  A14  A15                  A11  A12  A15 | A13  A14
   A21  A22  A23  A24  A25                  A21  A22  A25 | A23  A24
   A31  A32  A33  A34  A35                  A51  A52  A55 | A53  A54
   A41  A42  A43  A44  A45                  --------------+---------
   A51  A52  A53  A54  A55                  A31  A32  A35 | A33  A34
                                            A41  A42  A45 | A43  A44

           Need redistribution routines to go from one
            distribution to another.                                 155
Two-dimensional Block-Cyclic Distribution

    Ensure good load balance -->
     Performance and scalability,
    Encompasses a large number of (but not all)
     data distribution schemes,
    Need redistribution routines to go from one
     distribution to the other.
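  For reference, the owner of global entry (I, J) under this scheme can
  be computed directly; this is the logic of the ScaLAPACK TOOLS routine
  INDXG2P, written out as a sketch with illustrative names:

*     Process row and column owning global entry (I, J), given
*     block sizes MB, NB and grid origin (RSRC, CSRC) on a
*     NPROW x NPCOL grid:
      IPROW = MOD( RSRC + ( I - 1 ) / MB, NPROW )
      IPCOL = MOD( CSRC + ( J - 1 ) / NB, NPCOL )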




                http://www.netlib.org/scalapack   156
    Array descriptor for Dense
    Matrices
 DESC_()  Symbolic  Scope     Definition
          Name
 1        DTYPE_A   (global)  Descriptor type DTYPE_A=1 for dense matrices.
 2        CTXT_A    (global)  BLACS context handle.
 3        M_A       (global)  Number of rows in the global array A.
 4        N_A       (global)  Number of columns in the global array A.
 5        MB_A      (global)  Blocking factor used to distribute the rows
                              of the array A.
 6        NB_A      (global)  Blocking factor used to distribute the
                              columns of the array A.
 7        RSRC_A    (global)  Process row over which the first row of the
                              array A is distributed.
 8        CSRC_A    (global)  Process column over which the first column
                              of the array A is distributed.
 9        LLD_A     (local)   Leading dimension of the local array.
                          http://www.netlib.org/scalapack                          157
Narrow Band and Tridiagonal
Matrices
     The ScaLAPACK routines solving narrow-
      band and tridiagonal linear systems assume
      – the narrow band or tridiagonal coefficient matrix to
        be distributed in a block-column fashion, and
      – the dense matrix of right-hand-side vectors to be
        distributed in a block-row fashion.
     Divide-and-conquer algorithms have been
      implemented because they offer greater
      scope for exploiting parallelism than the
      corresponding adapted dense algorithms.
                   http://www.netlib.org/scalapack       158
    Array descriptor for Narrow
    Band Matrices
 DESC_()  Symbolic  Scope     Definition
          Name
 1        DTYPE_A   (global)  Descriptor type DTYPE_A=501 for a 1 x Pc
                              process grid, for band and tridiagonal
                              matrices block-column distributed.
 2        CTXT_A    (global)  BLACS context handle.
 3        N_A       (global)  Number of columns in the global array A.
 4        NB_A      (global)  Blocking factor used to distribute the
                              columns of the array A.
 5        CSRC_A    (global)  Process column over which the first column
                              of the array A is distributed.
 6        LLD_A     (local)   Leading dimension of the local array. For
                              the tridiagonal subroutines, this entry is
                              ignored.
 7                            Unused, reserved.
                          http://www.netlib.org/scalapack                            159
    Array descriptor for Right Hand Sides
    for Narrow Band Linear Solvers
 DESC_()  Symbolic  Scope     Definition
          Name
 1        DTYPE_B   (global)  Descriptor type DTYPE_B=502 for a Pr x 1
                              process grid, for block-row distributed
                              matrices.
 2        CTXT_B    (global)  BLACS context handle.
 3        M_B       (global)  Number of rows in the global array B.
 4        MB_B      (global)  Blocking factor used to distribute the rows
                              of the array B.
 5        RSRC_B    (global)  Process row over which the first row of the
                              array B is distributed.
 6        LLD_B     (local)   Leading dimension of the local array. For
                              the tridiagonal subroutines, this entry is
                              ignored.
 7                            Unused, reserved.
                          http://www.netlib.org/scalapack                           160
Error Handling
   Driver and Computational routines
    perform global and local input error-
    checking.
    – Global checking -- synchronization
    – Local checking -- validity
   No input error-checking is performed on
    the auxiliary routines.
   If an error is detected in a PBLAS or
    BLACS routine, program execution is
    stopped.
                http://www.netlib.org/scalapack 161
Application Debugging Hints
   Look at the ScaLAPACK example
    programs.
   Always check the value of INFO on exit
    from a ScaLAPACK routine.
   Query for the size of workspace by calling
    the routine with LWORK = -1 (see the
    sketch below).
   Link to the Debug Level 1 BLACS
    (specified by BLACSDBGLVL=1 in
    Bmake.inc).
                http://www.netlib.org/scalapack 162
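  A hedged sketch of the LWORK = -1 workspace query, using PDSYEV as
  an example (the matrices, descriptors, and WORK array are assumed to
  be set up already):

*     First call: LWORK = -1 performs no computation and
*     returns the required workspace size in WORK( 1 )
      LWORK = -1
      CALL PDSYEV( 'V', 'U', N, A, 1, 1, DESCA, W, Z, 1, 1,
     $             DESCZ, WORK, LWORK, INFO )
      LWORK = INT( WORK( 1 ) )
*     ... verify WORK holds at least LWORK elements, then solve
      CALL PDSYEV( 'V', 'U', N, A, 1, 1, DESCA, W, Z, 1, 1,
     $             DESCZ, WORK, LWORK, INFO )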
ScaLAPACK Implementation
     BLAS (Performance and Portability)
      – Blocked data access (Level 3 BLAS) yields good
        local performance,
      – Portability: standardization of efficient kernels.
     BLACS (Performance and Portability)
      – Correct level of notation: communication of
        matrices,
      – Efficiency: identify frequent linear algebra
        operations which can be optimized on various
        computers.
                    http://www.netlib.org/scalapack          163
Functionality
 Problem type:          SDrv  EDrv  Factor  Solve  Inv  Cond  Iter
 Ax = b                                                 Est   Refin
 Triangular                                 X       X   X     X
 SPD                    X     X     X       X       X   X     X
 SPD Banded             X           X       X
 SPD Tridiagonal        X           X       X
 General                X     X     X       X       X   X     X
 General Banded         X           X       X
 General Tridiagonal    X           X       X
 Least squares          X           X       X
 GQR                                X
 GRQ                                X

 Ax = λx or Ax = λBx    SDrv  EDrv  Reduct  Solution
 Symmetric              X     X     X       X
 General                +           X       +
 Generalized BSPD                   X       X
 SVD                    +     X     X       +

 Timing and testing routines for almost all; this is a large
 component of the package. Prebuilt libraries are available
 for SP, PCA, O2K, PGON, ...
                                                               164
Functionality continued
   Orthogonal/unitary transformation routines
   Prototype Codes
      –   PBLAS (version 2.0 ALPHA)
      –   Packed Storage routines for LLT, SEP, GSEP
      –   Out-of-Core Linear Solvers for LU, LLT, and QR
      –   Matrix Sign Function for Eigenproblems
      –   SuperLU and SuperLU_MT
      –   HPF Interface to ScaLAPACK



                    http://www.netlib.org/scalapack        165
Parallelism in ScaLAPACK
   Level 3 BLAS block operations
     – All the reduction routines
   Pipelining
     – QR algorithm, triangular solvers, classic factorizations
   Redundant computations
     – Condition estimators
   Static work assignment
     – Bisection
   Task parallelism
     – Sign function eigenvalue computations
   Divide and conquer
     – Tridiagonal and band solvers, symmetric eigenvalue
       problem and sign function
   Cyclic reduction
     – Reduced system in the band solver
                      http://www.netlib.org/scalapack          166
Documentation, Test Suites,
Example Programs, ...
   Documentation
    – ScaLAPACK Users’ Guide
      http://www.netlib.org/scalapack/slug/scalapack_slug.html
    – Installation Guide for ScaLAPACK
    – LAPACK Working Notes
   TestSuites for ScaLAPACK, PBLAS,
    BLACS
   Example Programs
   http://www.netlib.org/scalapack/examples/

   Prebuilt ScaLAPACK libraries on netlib
                 http://www.netlib.org/scalapack   167
Commercial Use
     ScaLAPACK has been incorporated into the
      following commercial packages
      –   Fujitsu
      –   Hewlett-Packard/Convex
      –   Hitachi
      –   IBM Parallel ESSL
      –   NAG Numerical PVM (and MPI) Library
      –   Cray LIBSCI
      –   NEC Scientific Software Library
      –   Sun Scientific Software Library
      –   Visual Numerics (IMSL)
                    http://www.netlib.org/scalapack   168
User Applications




           http://www.netlib.org/scalapack   169
Impact -- Applications
   Large scale solvers are permitting new range of problems
    to be addressed
   Technology transfer of software to the commercial
    marketplace
    New computations in electronic structures are having an
     impact in nanoscale technology
    Open region electromagnetic scattering is important to
     remote atmospheric sensing and aircraft cross-sections
   Eigensolvers important in many areas (data compression,
    reactive scattering, etc.)
   Sparse solvers used in structural mechanics,
    electromagnetics, and optimization.                    170
ScaLAPACK in ASCI Application
Amounts To Saving of $1.1M -
$5.4M
   MP-Quest is a simulation code for materials modeling
    – Surface structures at SNL. (Pb-Zn-Ti)
   MP-Quest will account for 10% of all cycles on ASCI-Red
    (presently at 20-30%)
    – Symmetric Eigensolver accounts for 25% of cycles in MP-Quest
     – Hence 2.5% of ASCI Red cycles are in MP-Quest's symmetric
       eigensolver.
    ScaLAPACK’s symmetric eigensolver is 10x faster than MP-
     Quest's
     – 90% savings in cycles by using ScaLAPACK
    ASCI Red cost roughly $50 million
     – $50M x 2.5% x 90% = $1.125M today (may be closer to $5 million)
                                                                     171
    Right now large problems are too slow because of the ...

Interaction with ASCI at Caltech
   Caltech ASCI Level 1 Alliance Center:
    “A Virtual Shock Physics Test Facility”
   Quantum Mechanical Calculations
     – Most time-intensive level of theory
     – All Caltech materials properties are ultimately based upon QM
       [Figure: the HMX explosive molecule (N / NO2 groups) and
        Ta (BCC) metal; compute materials properties for high
        explosives and for solid dynamics]
     – Two time-intensive steps:
       » Formation of the Fock operator, O(N**4)
       » Diagonalization of the Fock operator, O(N**3)
     – Linear algebra requirements: Ax = λBx
                                                                 172
Ocean Circulation Model
  OCEANS
   – Uses second-order finite
     differences with an explicit
     leapfrog time stepping scheme
     on a logically rectangular grid.
      » A. Wallcraft, H. Hurlburt, R. Rhodes, & J. Shriver, NRL -
        Stennis
  540% speedup over the previous parallel
   software: 64 x 145.6 Mflop/s = 9.3 Gflop/s
  A run that used to take 2 hours now takes
   25 minutes.
  ScaLAPACK has also improved the accuracy of
   results.                                                     173
ScaLAPACK Performance




         http://www.netlib.org/scalapack   174
Target Machines for ScaLAPACK
     Dedicated distributed-memory machines
      (MIMD)
      –   IBM SP series
      –   Intel series (Paragon)
      –   Cray T3 series
      –   SGI Power Challenge Array
      –   SGI Origin 2000
      –   HP/Convex Exemplar
   Networks of workstations or PCs
    Any system for which MPI or PVM is
     available
                http://www.netlib.org/scalapack   175
Scalability -- Introduction
   How  well does a parallel algorithm
    allow a large number of processors
    to be efficiently utilized?




             http://www.netlib.org/scalapack   176
Scalability -- Introduction
   If the ratio Mem(n)/p is held constant,
    then an algorithm is scalable if the
    efficiency E(p, n(p)) remains constant.
   In other words, maintaining memory
    use per node constant allows efficiency
    to be maintained.
   In practice a slight degradation is
    acceptable.
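  For reference (this definition is supplied here, not stated on the
  slide), the parallel efficiency can be written in terms of the
  execution time T as

      E(p, n) = T(1, n) / ( p * T(p, n) )

  so that E = 1 corresponds to perfect speedup.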
                http://www.netlib.org/scalapack   177
Scalability
   For dense matrix computations, an
    implementation is said to be scalable if
    the parallel efficiency is an increasing
    function of N**2/P, the problem size per
    node.
   The algorithms implemented in
    ScaLAPACK are scalable in this sense.


              http://www.netlib.org/scalapack   178
Achieving High Performance
   Chapter 5, ScaLAPACK Users’ Guide
   ScaLAPACK achieves high
    performance on distributed-memory
    computers, such as the SP2, T3E, O2K,
    and Paragon.
   ScaLAPACK can achieve high
    performance on some networks of
    workstations or PCs.
              http://www.netlib.org/scalapack   179
Achieving High Performance on a
Distributed-Memory Computer
   Use   the right number of processors
    – Rule of thumb: P=MxN/1,000,000 for an
      MxN matrix. This provides a local matrix
      of size approximately 1000-by-1000.
    – Do not try to solve a small problem on too
      many processors.
    – Do not exceed physical memory.
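  As a quick worked example of the rule of thumb: a 20,000-by-20,000
  matrix suggests P = 20000 x 20000 / 1,000,000 = 400 processes (for
  example, a 20-by-20 process grid), so each process holds roughly a
  1000-by-1000 local piece.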



               http://www.netlib.org/scalapack   180
Achieving High Performance on a
Distributed-Memory Computer
   Use   an efficient data distribution.
     – Block size (i.e., MB, NB) = 64.
    – Square processor grid, Pr = Pc.
   Use  efficient machine-specific BLAS
    (not the Fortran77 reference
    implementation from netlib) and BLACS
    (nondebug, BLACSDBGLVL=0 in
    Bmake.inc)
                http://www.netlib.org/scalapack   181
Achieving High Performance on a
Network of Workstations
    The bandwidth per node, measured in
     Mbytes per second per node, should be
     no less than one tenth the peak floating-
     point rate measured in Mflop/s per node.
   The underlying network must allow
    simultaneous messages, that is, not standard
    ethernet and not FDDI.
   Message latency should be no more than
    500 microseconds.

                http://www.netlib.org/scalapack   182
Achieving High Performance on a
Network of Workstations
   All processors should be similar in
    architecture and performance. ScaLAPACK
    will be limited by the slowest processor. Data
    format conversion significantly reduces
    communication performance.
   Dedicated use of processors
   No more than one process should be
    executed per processor.


                http://www.netlib.org/scalapack   183
Obtaining High Performance
   Use the best BLAS and BLACS libraries
    available.
   Start with a standard data distribution.
      – A square processor grid (Pr=Pc) if P >= 9
      – A one-dimensional processor grid (Pr=1,Pc=P) if
        P<9
      – Block size (NB) = 64
   Determine whether reasonable performance
    is being achieved.
   Identify the performance bottlenecks, if any.
                  http://www.netlib.org/scalapack     184
LU/Solve on PII 300 MHz Cluster

   [Figure: performance in Mflop/s (0 to 1400) versus order of
    matrix (1000 to 17000), one curve per processor count:
    1, 2, 4, 6, 8, 10, 12]
                     http://www.netlib.org/scalapack          185
Details of Cluster timings
   Fast ethernet
   MPICH 1.1
   NB = 36
   DGEMM obtained from ATLAS
   Peak DGEMM = 189 Mflops/proc
   Obtain approximately 62% DGEMM
    performance per processor

            http://www.netlib.org/scalapack   186
Performance
   LU fact+solve on IBM SP2 thin nodes
   [Figure: performance in Mflops (0 to 8000) versus problem
    size (1000 to 15000), one curve per node count: 4, 8, 16,
    32, 64]
                     http://www.netlib.org/scalapack          187
Details of SP2 timings
   Performance increases with the number
    of nodes
   Double the number of nodes, double
    the performance -- until memory is
    saturated
   DGEMM (thin node) = 180 Mflops/node
   8 Gflop/s / 64 nodes = 125 Mflop/s per node;
    125/180 = 70% of DGEMM performance
             http://www.netlib.org/scalapack   188
Performance

            LU fact.+solve on Intel Paragon

   [Figure: performance in Mflops (0 to 4500) versus problem
    size (1000 to 20000), one curve per node count: 8, 16, 32,
    64, 128]
                     http://www.netlib.org/scalapack          189
Details of Paragon timings
   Performance increases with the number
    of nodes
   DGEMM = 45 Mflops/node
   Achieve 80% of DGEMM performance per
    node
   Double the number of nodes, double
    the performance -- until memory is
    saturated
            http://www.netlib.org/scalapack   190
Performance
   LU fact+solve on Sparc Ultra 1 (167 MHz)
    connected via 155 Mbps switched ATM

   [Figure: performance in Mflops (0 to 1200) versus problem
    size (1000 to 12000), one curve per node count: 4, 8, 12]
                     http://www.netlib.org/scalapack          191
LU Performance (Mflop/s) on
32 nodes Intel XP/S MP
Paragon
   [Figure: Mflop/s (0 to 300) versus matrix order
    (1000 to 5000)]
                  http://www.netlib.org/scalapack            192
Performance of LU fact. + solve
on the Intel MP XP/S Paragon
(Mflop/s) (2 computational processors
per node)
   [Figure: Mflop/s (0 to 4500) versus problem size (5000 to
    22500), one curve per process grid: 1x4, 1x8, 2x8, 4x8, 4x16]
                      http://www.netlib.org/scalapack                193
ScaLAPACK Example
Programs




        http://www.netlib.org/scalapack   194
ScaLAPACK Example
Program #1
    http://www.netlib.org/scalapack/examples/example1.f




                 http://www.netlib.org/scalapack         195
ScaLAPACK Example
Program #2
    http://www.netlib.org/scalapack/examples/scaex.shar




                 http://www.netlib.org/scalapack         196
Issues of Heterogeneous
Computing




          http://www.netlib.org/scalapack   197
Heterogeneous Computing
 Software intended to be used in a cluster
  computing context
 Difficulties arise with the following issues:
    – Communication of floating point numbers between
      processors
    – Repeatability and coherency
    – Machine precision and other machine-specific
      parameters
    – Different versions of compilers
    – Checking global floating-point arguments
    – Iterative convergence across clusters of processors
                                                          198
Heterogeneous Computing
   Defensive  programming required to
    address these issues.
   Details of ScaLAPACK experiences can
    be found in:
    – http://www.netlib.org/lapack/lawns/lawn112.ps




                http://www.netlib.org/scalapack   199
Homogeneous Versus
Heterogeneous
   Issue  #1: Hardware
   What is a homogeneous machine?
   A homogeneous machine is one which
    satisfies condition (1): the result of a
    particular sequence of floating point
    operations is the same no matter which
    processor is used.
   Is this sufficient?

              http://www.netlib.org/scalapack   200
Homogeneous Versus
Heterogeneous
   Issue  #2: Communication layer
   What is a homogeneous network?
   A homogeneous network is a collection
    of machines which satisfy (1), plus
    condition (2): floating point numbers are
    communicated exactly between
    homogeneous machines.
   Is this sufficient?

              http://www.netlib.org/scalapack   201
Homogeneous Versus
Heterogeneous
   Issue #3: Software (operating system,
    compiler, ...)
   What is a homogeneous computing
    environment?
     A homogeneous computing environment is a network
      which satisfies (1) and (2), plus condition (3): the
      operating system, compiler, and compiler options do
      not alter the representation of floating point values.
     If any of these requirements is not met, the
      system is heterogeneous.
                   http://www.netlib.org/scalapack    202
Heterogeneous Computing
Issues
   Issues   to be considered:
    – Communication of floating point values
      between processors
    – Machine precision and other machine
      parameters
    – Checking global floating-point arguments
    – Algorithmic Integrity
    – Deadlock

                http://www.netlib.org/scalapack   203
Communicating on IEEE
Machines
   The IEEE standard specifies how machines
    should represent floating point numbers.
    XDR (External Data Representation) uses
    the IEEE representation.
   PVM uses XDR and thus will communicate
    numbers without change on IEEE machines
    (Denormalized numbers?)
   MPI suggests, but does not mandate, the use
    of XDR.

               http://www.netlib.org/scalapack   204
Machine Precision
   Smallest  floating point number u such
    that 1 + u > 1.
   Used in:
    – Convergence tests -- has an iteration
      converged?
    – Tolerance tests -- is a matrix numerically
      singular?
    – Error analysis -- how does floating point
      arithmetic affect results?
               http://www.netlib.org/scalapack   205
Heterogeneous Machine
Precision
   Return  the maximum relative machine
    precision over all the processors
   LAPACK --> DLAMCH
   ScaLAPACK --> PDLAMCH
   Note that even on IEEE machines, the
    machine precision can vary slightly. For
    double precision: e = 2**-53 + q, where
    q = 2**-105 or q = 2**-64 + 2**-105.
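  A hedged usage sketch of the heterogeneous machine parameters
  (PDLAMCH is the ScaLAPACK analogue of DLAMCH and takes the BLACS
  context so that the value returned is valid across the whole grid):

      DOUBLE PRECISION   PDLAMCH
      EXTERNAL           PDLAMCH
*     Largest relative machine precision over all processes:
      EPS = PDLAMCH( ICTXT, 'Epsilon' )
*     A safe minimum that can be reciprocated on every process:
      SFMIN = PDLAMCH( ICTXT, 'Safe minimum' )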
              http://www.netlib.org/scalapack   206
Other Machine Parameters
     A number of other parameters are used to
      characterize a machine. E.g.,
      – underflow threshold
      – overflow threshold
      – smallest number that can be safely reciprocated
     – . . .
   For small numbers -- use the largest over all
    the processors.
   For large numbers -- use the smallest over all
    the processors.
                  http://www.netlib.org/scalapack         207
Heterogeneous Networks --
Arithmetic Issues
     Even on IEEE machines results may differ
      between machines, compilers and compiler
      switches
      – Global variables may have different values
      – Eigenproblem for a tridiagonal matrix -- run QR on
        each processor and each processor finds k
        eigenvectors. But each processor may compute a
        different QR sequence
      – Stopping criteria for iterative methods -- may not
        be satisfied on each processor simultaneously

                   http://www.netlib.org/scalapack     208
Algorithmic Integrity --
Examples
   Convergence   of iterative methods
   Scaling vectors and matrices
   Bisection and inverse iteration for a
    symmetric tridiagonal matrix
   Solving the eigenproblem for a
    symmetric tridiagonal matrix by the QR
    algorithm

              http://www.netlib.org/scalapack   209
QR Algorithm for a Tridiagonal
Matrix
     Computing the eigenvalues is O(n**2) arithmetic, but
      computing the eigenvectors is O(n**3)
     Run the QR algorithm on each processor
     Save the plane rotations on each processor
     Let each processor compute k of the eigenvectors
     QR is backward stable, but not forward stable, so
      sequence of plane rotations may be different
     Have one processor run the QR algorithm and
      broadcast the plane rotations


                   http://www.netlib.org/scalapack      210
Heterogeneous Conclusions
   Defensive  programming
   Machine parameters
   Communication and representation
    between processors
   Controlling processor
   Additional communication
   Testing strategies


                http://www.netlib.org/scalapack   211
HPF Interface to ScaLAPACK




          http://www.netlib.org/scalapack   212
HPF Version
  Interface   provided for a subset of
   routines
  Linear solvers (general, pos. def., least
   squares)
  Eigensolver (pos. def.)
  Matrix multiply and triangular solve




                                               213
HPF Version
    Given these declarations:
      REAL, DIMENSION(1000,1000) :: A, B, C
      !HPF$ DISTRIBUTE (CYCLIC(64), CYCLIC(64)) :: A, B, C

    Calls could be made as simple as:
       CALL LA_GESV(A, B)
       CALL LA_POSV(A, B)
       CALL LA_GELS(A, B)
       CALL LA_SYEV(A, W, Z)
       CALL LA_GEMM(A, B, C)
       CALL LA_TRSM(A, B)

    Fortran 90 version follows these ideas
                                                             214
HPF Interface -- Note
   No  presently available compiler
    supports the full features of HPF.
   We are determining the minimal features
    required to effectively support an HPF
    library which calls a process-local
    message passing library.



              http://www.netlib.org/scalapack   215
HPF -- Redistribution
   Expensive    in both Space (memory)
    and Time,
   Users often utilize all available memory,
   ScaLAPACK performance is not very
    sensitive for a large subset of all
    supported cyclic mappings.



              http://www.netlib.org/scalapack   216
HPF -- Redistribution
   Use !HPF$ INHERIT for most general
    purpose libraries:
     – Data will be left in place across
       subroutine calls,
    – Whenever possible data passed to
      ScaLAPACK without redistribution,
    – Otherwise, redistribute to an optimal
      distribution.

               http://www.netlib.org/scalapack   217
Determining Distribution
      1. HPF_DISTRIBUTION from HPF_LIBRARY
       returns, for the ultimate align target:
        – Distribution type (CYCLIC, BLOCK, ...),
        – Row and column blocking factors (MB, NB),
        – Size of the process grid.
      Then use HPF_ALIGNMENT from HPF_LIBRARY to
       see how this information affects our matrix
       operand.
                 http://www.netlib.org/scalapack    218
Determining Distribution
     2. Find on what processor particular blocks
      of the matrix reside:
      – GLOBAL_TO_LOCAL from
        HPF_LOCAL_LIBRARY, or
      – HPF$ ALIGN and a mapping array.
     If redistribution is necessary, use
     !HPF$ PROCESSORS PROC(P,Q)
     !HPF$ DISTRIBUTE(CYCLIC(MB),CYCLIC(NB))
      ONTO PROC


                  http://www.netlib.org/scalapack   219
Calling Fortran77 From HPF
   IfFortran77 is called directly from HPF,
    one cannot determine the leading
    dimension of the local arrays,
   --> Intermediate layer (HPF --> F77)
   LAYER 1: User callable wrapper
    functions, and their HPF tools (written in
    strict HPF):


              http://www.netlib.org/scalapack   220
Calling Fortran77 From HPF
     Determine if input data distribution is
      acceptable:
      – All matrix operands can be expressed as some
        variant of CYCLIC(K),
      – All matrix operands are distributed across the
        same process grid,
      – Process grid is one or two dimensional,
      – Certain routine-dependent alignment restrictions
        are met.
     Re-map to an acceptable redistribution.
                  http://www.netlib.org/scalapack      221
Calling Fortran77 From HPF
     LAYER 2: Transition layer
      – Extrinsic with local view of the matrix
         » HPF_LOCAL (layer 3 F77_LOCAL or F90_LOCAL)
         » F90_LOCAL (layer 3 F77_LOCAL or F90_LOCAL)
      – Translate HPF’s assumed shape array to
        Fortran77’s assumed size array,
      – Determine leading dimension of assumed size
        array so Fortran77 routines can do index
        computation.
   LAYER      3: ScaLAPACK message passing
      layer
      – Extrinsic F77_LOCAL or F90_LOCAL
                   http://www.netlib.org/scalapack      222
HPF Interface -- Summary
   !HPF$ INHERIT to avoid unneeded
    redistribution across subroutine calls,
   HPF_DISTRIBUTION from HPF_LIBRARY
    to find out the distribution of the matrix’s
    ultimate align target,
   HPF_ALIGNMENT from HPF_LIBRARY to
    determine how the distribution of the ultimate
    align target affects the actual matrix
    distribution,

                http://www.netlib.org/scalapack   223
HPF Interface -- Summary
     !HPF$ PROCESSORS PROC(P,Q), where P and Q
      are dummy arguments to a routine. When one has
      determined redistribution must occur, this command
      ensures that all matrices passed to the routine are
      distributed on the same process grid.
     !HPF$ DISTRIBUTE A(CYCLIC(MB),CYCLIC(NB))
      ONTO PROC, with MB and NB parameters to the
      routine and PROC from above,
       – to redistribute data according to the two-
         dimensional block cyclic scheme,


                   http://www.netlib.org/scalapack      224
HPF Interface -- Summary
     GLOBAL_TO_LOCAL from HPF_LOCAL_LIBRARY, or
     !HPF$ ALIGN M(I,J) WITH A(1+(I-1)*MB, 1+(J-1)*NB),
     with MB and NB parameters to the routine,
      – to label the processes allowing for system
        independent process grid setup,
     EXTRINSIC(F90_LOCAL) or,
     EXTRINSIC(HPF_LOCAL) + EXTRINSIC(F77_LOCAL)
      – to call Fortran77 from HPF.




                    http://www.netlib.org/scalapack        225
HPF Performance for LU on 12-node
Cluster Sun Ultra using PGI Compiler
   [Figure: time in seconds (0 to 1000) versus matrix order,
    three curves: HPF Call, ScaLAPACK, Wrapper]
                                                            226
Future Directions




           http://www.netlib.org/scalapack   227
ScaLAPACK -- Ongoing Work
   Increased   flexibility and usability
    – Incorporate PBLAS (version 2.0 ALPHA)
      into library
    – Algorithmic blocking and removal of
      alignment restrictions at top-level routines
   Increased   functionality
    – Divide-and-Conquer SEP and SVD
   C++   and Java interface to ScaLAPACK
   Iterative Solvers
                http://www.netlib.org/scalapack   228
Conclusions




          http://www.netlib.org/scalapack   229
ScaLAPACK Summary
  Follow-on   to LAPACK -- Designed for
   MPP’s
  Goals attained through the promotion
   and development of standards (BLAS,
   PBLAS)
  Object-based design provides a
   framework for additional data
   distributions and matrix types
  Software available today
               http://www.netlib.org/scalapack 230
ScaLAPACK Summary
 Portabilityacross wide range of
  architectures
  – will work on a heterogeneous platform
 Alreadyin use by SGI/Cray, IBM, HP-
  Convex, Fujitsu, NEC, NAG, & Visual
  Numerics (IMSL)
  – Tailor performance & provide support
 Software,     examples, prebuilts available on
  netlib
ScaLAPACK Team
   Susan Blackford, UT                   Sven Hammarling, NAG
   Jaeyoung Choi, Soongsil               Greg Henry, Intel
    University                            Osni Marques, LBNL/NERSC
   Andy Cleary, LLNL                     Caroline Papadopoulos,
   Ed D'Azevedo, ORNL                     UCSD
   Jim Demmel, UC-B                      Antoine Petitet, UT
   Inder Dhillon, IBM ( UC-B)            Ken Stanley, UC-B
   Jack Dongarra, UT/ORNL                Francoise Tisseur, UMan
   Ray Fellers, UC-B                     David Walker, Cardiff U
                                          Clint Whaley, UT
      http://www.netlib.org/scalapack
      scalapack@cs.utk.edu
                                                                    232
ScaLAPACK -- References
     L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I.
      Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K.
      Stanley, D. Walker, R. C. Whaley, ScaLAPACK Users’ Guide,
      Second Edition, SIAM, Philadelphia, PA, 1997.
     ScaLAPACK software and documentation can be
      obtained via WWW and anonymous ftp:
       – http://www.netlib.org/scalapack,
       – http://www.netlib.org/lapack/lawns,
       – (anonymous) ftp ftp.netlib.org
     Comments/suggestions to scalapack@cs.utk.edu

                       http://www.netlib.org/scalapack              233

				