MUMPS: a MUltifrontal Massively Parallel Solver


MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux

Workshop CIRA “Systèmes Linéaires”
November 23, 2010
Outline




   General presentation of MUMPS



   Recent Research and Developments



   Concluding Remarks




What is MUMPS
   Initially funded by the LTR (Long Term Research) European project
   PARASOL (1996-1999)

   MUMPS (MUltifrontal Massively Parallel sparse direct Solver)
   solves sparse systems of linear equations Ax = b in three phases
   (see the sketch below):
       1. Analysis: the matrix is preprocessed to improve its structural
          properties (the preprocessed system is Ãx̃ = b̃ with
          Ã = P_n P D_r A D_c Q P^t)
       2. Factorization: the matrix is factorized as A = LU or A = LDL^t
       3. Solve: the solution x is computed by means of forward and
          backward substitutions
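
   To make the three phases concrete, here is a minimal sketch of the
   calling sequence through the double-precision C interface (field names
   follow the dmumps_c.h header of the MUMPS 4.x releases; the tiny 2x2
   system is hypothetical, and the code is meant to run on a single MPI
   process):

       #include <mpi.h>
       #include "dmumps_c.h"           /* double-precision MUMPS C interface */

       #define JOB_INIT       -1
       #define JOB_END        -2
       #define USE_COMM_WORLD -987654  /* tells MUMPS to use MPI_COMM_WORLD */

       int main(int argc, char **argv) {
           MPI_Init(&argc, &argv);
           DMUMPS_STRUC_C id;

           /* Hypothetical 2x2 system in assembled (triplet) format */
           int    irn[] = {1, 2};      /* 1-based row indices               */
           int    jcn[] = {1, 2};      /* 1-based column indices            */
           double a[]   = {2.0, 3.0};  /* matrix entries                    */
           double rhs[] = {4.0, 9.0};  /* right-hand side, overwritten by x */

           id.comm_fortran = USE_COMM_WORLD;
           id.par = 1;                 /* host takes part in the work */
           id.sym = 0;                 /* unsymmetric matrix          */
           id.job = JOB_INIT;  dmumps_c(&id);

           id.n = 2;  id.nz = 2;       /* field is named "nz" in 4.x  */
           id.irn = irn;  id.jcn = jcn;  id.a = a;  id.rhs = rhs;

           id.job = 1;  dmumps_c(&id); /* phase 1: analysis      */
           id.job = 2;  dmumps_c(&id); /* phase 2: factorization */
           id.job = 3;  dmumps_c(&id); /* phase 3: solve         */
           /* (id.job = 6 would chain all three phases in one call) */

           id.job = JOB_END;  dmumps_c(&id);
           MPI_Finalize();
           return 0;
       }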




MUMPS (MUltifrontal Massively Parallel Solver)

   http://graal.ens-lyon.fr/MUMPS and http://mumps.enseeiht.fr
   Platform for research
   • Research projects
   • PhD thesis
   • Hybrid methods
   Competitive software package used worldwide
   • Co-developed by Lyon-Toulouse-Bordeaux
   • Latest release: MUMPS 4.9.2, Nov. 2009, ≈ 250 000 lines of C
     and Fortran code
   • 1000+ downloads per year from our website, half from industry:
     Boeing, EADS, EDF, petroleum industries, etc.
   • Integrated within commercial and academic packages (Samcef
     from Samtech, Actran from Free Field Technologies, Code Aster and
     Telemac from EDF, IPOPT, PETSc, Trilinos, ...)
Download Requests from the MUMPS website
   [Figure: world map of the user distribution]
   1000+ download requests per year




Main functionalities/features (I)

   Fortran, C, Matlab and Scilab interfaces
   Input formats (selection via control parameters is sketched below)

       • Assembled format
       • Distributed assembled format
       • Sum of elemental matrices
       • Sparse, multiple right-hand sides, distributed solution


   Type of matrices

       • Symmetric (positive definite, indefinite), Unsymmetric
       • Single/Double precision with Real/Complex arithmetic
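
   The formats above are selected through the ICNTL control array; a
   hedged sketch (parameter numbers as documented for the MUMPS 4.x
   releases, set before the analysis phase):

       #include "dmumps_c.h"
       #define ICNTL(I) icntl[(I)-1]  /* 1-based macro, as in the MUMPS C examples */

       /* Choose input format for the matrix, right-hand sides and solution. */
       void set_input_formats(DMUMPS_STRUC_C *id) {
           id->ICNTL(5)  = 0;  /* 0 = assembled (triplet) input, 1 = elemental */
           id->ICNTL(18) = 3;  /* 3 = distributed input: each process passes
                                  local triplets in irn_loc/jcn_loc/a_loc      */
           id->ICNTL(20) = 1;  /* 1 = sparse right-hand sides                  */
           id->ICNTL(21) = 1;  /* 1 = solution returned distributed (sol_loc)  */
       }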


Main functionalities/features (II)


   Preprocessing and postprocessing

       • Reduce fill-in: symmetric orderings interfaced: AMD, QAMD,
         AMF, PORD, (Par)METIS, (PT-)SCOTCH
       • Numerical preprocessing: unsymmetric orderings and scalings
       • Iterative refinement and backward error analysis


   Numerical pivoting

       • Partial pivoting and two-by-two pivots (symmetric case)
       • Static pivoting and “null” pivot detection, null-space basis
         (typical settings are sketched below)
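
   A hedged sketch of how these options surface through the control
   arrays (parameter numbers and values as documented for MUMPS 4.x; the
   particular choices are illustrative, not recommendations):

       #include "dmumps_c.h"
       #define ICNTL(I) icntl[(I)-1]
       #define CNTL(I)  cntl[(I)-1]

       /* Typical preprocessing and pivoting choices, set before the
          analysis or factorization phase as appropriate. */
       void set_numerical_options(DMUMPS_STRUC_C *id) {
           id->ICNTL(7)  = 7;    /* ordering: 7 = automatic choice (0 AMD,
                                    2 AMF, 3 SCOTCH, 4 PORD, 5 METIS, 6 QAMD) */
           id->ICNTL(8)  = 77;   /* scaling strategy: 77 = automatic choice   */
           id->ICNTL(10) = 2;    /* up to 2 steps of iterative refinement     */
           id->ICNTL(11) = 1;    /* return backward error analysis statistics */
           id->ICNTL(24) = 1;    /* enable "null" pivot detection             */
           id->CNTL(1)   = 0.01; /* relative threshold for partial pivoting   */
       }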



Main functionalities/features (III)



   Solving larger problems

    • Hybrid scheduling
    • Out-of-core
    • Parallel analysis


   Miscellaneous
    • Partial factorization and Schur complement (for hybrid solvers;
      set-up sketched below)
    • Inertia, determinant, entries of A^{-1}, discard factors, ...
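
   For instance, the Schur complement feature used by hybrid solvers is
   driven by ICNTL(19); a sketch (field names as in the 4.x C interface,
   listvar being a hypothetical list of interface variables):

       #include "dmumps_c.h"
       #define ICNTL(I) icntl[(I)-1]

       /* Request the Schur complement on a user-chosen interface. */
       void request_schur(DMUMPS_STRUC_C *id, int *listvar,
                          int size_schur, double *schur_buf) {
           id->ICNTL(19)     = 1;         /* centralized Schur on the host      */
           id->size_schur    = size_schur;
           id->listvar_schur = listvar;   /* 1-based interface variable indices */
           id->schur         = schur_buf; /* size_schur * size_schur entries    */
           /* After factorization, schur_buf holds S = A22 - A21 A11^{-1} A12,
              the matrix handed to the iterative part of a hybrid solver. */
       }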




MUMPS vs other sparse direct solvers

        Address wide classes of problems
        • Good numerical stability (dynamic pivoting, preprocessing,
          postprocessing, error analysis)
        • Wide range of numerical features


        Management of parallelism
        • MPI-based
        • Asynchronous approach (distributed dynamic schedulers)
   [Figure: execution traces of MPI processes 0-7 during factorization
   (time window around 8.9-9.3 s), MUMPS (left) vs SuperLU_DIST (right)]

Dynamic Scheduling
    • Task graph = tree (resulting from the matrix structure and the
      ordering heuristic)
    • Each task = partial factorization of a dense matrix
    • Most parallel tasks mapped at runtime (typically 80 %)

    [Figure: elimination tree mapped onto processes P0-P3 over time.
    Subtrees at the bottom are mapped statically; higher nodes use 1D
    pipelined factorizations whose helper processes are chosen dynamically
    at runtime (e.g. P3 and P0 chosen by P2); the root uses a 2D static
    decomposition]
MUMPS Team


    Permanent members in 2010
        Patrick Amestoy (N7-IRIT, Toulouse)

        Jean-Yves L’Excellent (INRIA-LIP, Lyon)

        Abdou Guermouche (LaBRI, Bordeaux)

        Bora Uçar (CNRS-LIP, Lyon)

        Alfredo Buttari (CNRS-IRIT, Toulouse)




MUMPS Team

    • Recent post-docs:
        Indranil Chowdhury (May 2009-March 2010)
        Alfredo Buttari (Jan. 2008-Oct. 2008)
        Bora Uçar (Jan. 2007-Dec. 2008)

    • Recent PhD students:
        Emmanuel Agullo (ENS Lyon, 2005-2008)
        Mila Slavova (CERFACS, Toulouse, 2005-2009)
        François-Henry Rouet (INPT-IRIT, Toulouse, 2009-)
        Clément Weisbecker (INPT-IRIT and EDF, Toulouse, 2010-)

    • Engineers:
        Aurélia Fèvre (INRIA, 2005-2007)
        Philippe Combes (CNRS, Dec. 2007-Dec. 2008)
        Maurice Brémond (INRIA, Oct. 2009-Oct. 2012)
        Guillaume Joslin (INRIA, Oct. 2009-Oct. 2011)
Recent projects and contracts


    • France-Berkeley project (2008-2009)
    • Collaboration with the SEISCOPE consortium (2006-2008)
    • Contracts with Samtech S.A. (2005-2006, then 2008-2010)
    • French-Israeli Multicomputing project (2009-2010)
    • ANR Solstice project (2007-2010); partners: INRIA, CERFACS,
      INPT-IRIT, CEA-CESTA, EADS IW, EDF, CNRS-CNRM-LA
    • INRIA “Action of Technological Development” (2009-2012)
    • EDF contract funding the PhD thesis of Clément Weisbecker
      (2010-2013)
    • Strong connections with the GridTLSE project,
      http://gridtlse.org

Outline




    General presentation of MUMPS



    Recent Research and Developments



    Concluding Remarks




Recent Research and Developments
  Out-of-core: use disks when memory is not large enough (activation
  sketched below)

  Two PhDs completed:
  • Emmanuel Agullo (ENS Lyon, 2005-2008), On the Out-of-core
    Factorization of Large Sparse Matrices
  • Mila Slavova (CERFACS, Toulouse, 2005-2009), Parallel triangular
    solution in the out-of-core multifrontal approach

  • Task scheduling, prefetching, low-level I/O, synchronous vs
    asynchronous approaches, ...
  • Many developments by the MUMPS team
  • Core memory requirements for the “Epicure” matrix (3D problem,
    1 million equations, EDF Code Aster):
     ◦ Total memory (in-core factors) = 20.8 GB
     ◦ Core memory (out-of-core factors) = 3.7 GB
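
  In the library this is exposed through a control parameter; a minimal
  sketch (field names as in recent MUMPS 4.x C headers; the scratch
  directory is an assumption for illustration):

      #include <string.h>
      #include "dmumps_c.h"
      #define ICNTL(I) icntl[(I)-1]

      /* Store the factors on disk instead of in core memory;
         must be set before the factorization phase. */
      void enable_out_of_core(DMUMPS_STRUC_C *id) {
          id->ICNTL(22) = 1;                        /* out-of-core factors  */
          strcpy(id->ooc_tmpdir, "/scratch/mumps"); /* factor files go here */
          strcpy(id->ooc_prefix, "epicure_");       /* file name prefix     */
      }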
Out-of-core related work: entries of A^{-1}

  Goal: compute a set of entries (A^{-1})_{ij} = e_i^T A^{-1} e_j
  PhD of F.-H. Rouet, 2009- (extension of the PhD work of M. Slavova)
  • x = L^{-1} e_j: exploit the sparsity of the right-hand sides (RHS)
  • (A^{-1})_{ij} = e_i^T U^{-1} x: exploit the sparsity of the requested
    entries of the solution
  • Combinatorial problem: how to partition the right-hand sides?
    (see the sketch below)
  • Illustration on an application from astrophysics (CESR, Toulouse):

        Without exploiting sparsity of RHS    43 396 s
        Natural ordering of the RHS columns      721 s
        Postordering of the RHS columns          199 s

        Time to compute ALL diagonal entries of A^{-1}, out-of-core
        factors, N = 148k

  • On-going work: hypergraph partitioning, in-core and parallel aspects
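
  The intuition behind the postordering gain: computing x = L^{-1} e_j
  only touches the factor blocks on the path from node j to the root of
  the elimination tree, so RHS columns that are adjacent in a postorder
  share most of their paths and can be solved in the same block with far
  fewer out-of-core factor loads. A hedged sketch of that grouping step
  (the postorder array and blocking are hypothetical inputs, not the
  actual MUMPS implementation):

      #include <stdlib.h>

      static const int *postorder_rank;  /* rank of each node in a postorder
                                            of the elimination tree */

      static int cmp_postorder(const void *pa, const void *pb) {
          return postorder_rank[*(const int *)pa]
               - postorder_rank[*(const int *)pb];
      }

      /* Sort the requested diagonal indices by postorder rank, so that
         consecutive blocks of columns e_j share most of their
         node-to-root paths in the tree. */
      void order_requested_entries(int *requested, int nreq, const int *rank) {
          postorder_rank = rank;
          qsort(requested, nreq, sizeof(int), cmp_postorder);
          /* then process requested[] in consecutive blocks, each block
             as one sparse multiple-RHS solve: the factors it touches
             are loaded from disk once per block */
      }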
Recent Research and Developments (cont.)

    • Parallel analysis (Buttari et al.)
      ◦ Critical when the graph of the matrix cannot be centralized on
        one processor
      ◦ Uses the parallel graph partitioning tools PT-SCOTCH (LaBRI,
        Bordeaux) or ParMETIS (Univ. Minnesota)

    • Parallel scalings (Ruiz, Uçar et al.; one sweep is sketched after
      the table below): iterate A^{(k+1)} = D_r^{(k)} A^{(k)} D_c^{(k)},
      where D_r^{(k)} = diag(||A^{(k)}_{i*}||_p)^{-1} and
            D_c^{(k)} = diag(||A^{(k)}_{*j}||_p)^{-1}

                       Flops (×10^6)    Entries in factors (×10^6)
                                        estimated       effective
          scaling      OFF      ON      OFF     ON      OFF     ON
          a0nsdsil     7.7      2.5     0.42    0.42    0.57    0.42
          C-54         281      209     1.42    1.42    1.76    1.58
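
    A sketch of one such scaling sweep (dense storage and the infinity
    norm for brevity; the actual code works on the distributed sparse
    pattern, and following Ruiz's algorithm the square roots of the norms
    are used so that row and column scalings converge together):

        #include <math.h>

        /* One scaling sweep on a dense n x n row-major matrix: divide
           row i by sqrt(max_j |a_ij|), then column j by sqrt(max_i |a_ij|),
           accumulating the scalings in dr and dc. Iterating this drives
           all row and column norms toward 1. */
        void ruiz_sweep(double *a, int n, double *dr, double *dc) {
            for (int i = 0; i < n; i++) {
                double rmax = 0.0;
                for (int j = 0; j < n; j++)
                    rmax = fmax(rmax, fabs(a[i*n + j]));
                double s = (rmax > 0.0) ? 1.0 / sqrt(rmax) : 1.0;
                dr[i] *= s;
                for (int j = 0; j < n; j++) a[i*n + j] *= s;
            }
            for (int j = 0; j < n; j++) {
                double cmax = 0.0;
                for (int i = 0; i < n; i++)
                    cmax = fmax(cmax, fabs(a[i*n + j]));
                double s = (cmax > 0.0) ? 1.0 / sqrt(cmax) : 1.0;
                dc[j] *= s;
                for (int i = 0; i < n; i++) a[i*n + j] *= s;
            }
        }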

Multicores: mixing MPI and OpenMP

  (work started with a 1-year postdoc, Indranil Chowdhury, funded by ANR)
  Main goal: understand the limits of MPI + threaded BLAS + OpenMP
  • Use the TAU and Intel VTune profilers
  • Insert OpenMP directives in critical kernels of MUMPS
  • Experiment with various matrices on various architectures

  Difficulties:
  • Understand speed-down problems (use OMP IF directives, as sketched
    below)
  • Side effects of threaded BLAS libraries (MKL OK)
  • Dependence on OpenMP implementations
  • Threaded BLAS within OpenMP parallel regions
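
  The IF clause makes a parallel region fall back to serial execution
  when the workload is too small to amortize the fork-join overhead,
  which is one way to fight such speed-downs (the loop and the threshold
  are illustrative, not actual MUMPS code):

      #include <omp.h>

      /* Scale a block of a frontal matrix; spawn threads only when the
         block is large enough for the thread start-up cost to pay off. */
      void scale_block(double *x, int n, double alpha) {
          #pragma omp parallel for if(n > 4096) schedule(static)
          for (int i = 0; i < n; i++)
              x[i] *= alpha;
      }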
Multicores: mixing MPI and OpenMP

  Initial results:
  • Good compromise: use 4 threads per MPI process
    (typical speed-up: 6 on 8 cores)
  • Unsymmetric and Cholesky versions more efficient than LDL^t
  • Huge gains in LDL^t obtained by rewriting the panel factorization
  • Work in progress

  Recent illustrative results:
  • 96-core machine, INRIA Bordeaux Sud-Ouest
  • Runs on a medium-size problem (ULTRASOUND80)
    ◦ Order: 531 441, nonzeros: 330 × 10^6
    ◦ Factors: 981 × 10^6 entries, 3.9 × 10^12 flops
Experimental results


   [Figure: execution time in seconds (0 to 800) vs number of cores
   (1, 4, 8, 16, 32, 64) for three configurations: MPI only;
   MPI + threaded MKL (4 threads per process);
   MPI + threaded MKL + OpenMP (4 threads per process)]
Remark on memory usage


   [Figure: total memory usage in GB (0 to 20) vs number of cores
   (1, 4, 8, 16, 32, 64) for two configurations: MPI only;
   MPI + threaded MKL + OpenMP (4 threads per process)]

   More threads per MPI process is good for memory.
Software issues



   • MUMPS: a 15-year-old research code
     ◦ Software engineering/validation/maintenance/support not so simple
     ◦ Combinations of options difficult to maintain (asynchronous
       pipelined factorizations + out-of-core + many pivoting strategies
       + various types of matrices and input formats + dynamic
       distributed data structures ...)
   • Recent initiatives:
     ◦ 1-year engineer funded by CNRS (2008): P. Combes
     ◦ Action of Technological Development funded by INRIA (2009-2012)
         • 2-year funding for a young engineer (G. Joslin)
         • part-time of a permanent INRIA engineer (M. Brémond)
         • Objective: ensure the durability and evolution of MUMPS




Outline




   General presentation of MUMPS



   Recent Research and Developments



   Concluding Remarks




Concluding remarks

  Theoretical limits on regular 2D and 3D problems:

                                   2D              3D
                                   n × n grid      n × n × n grid
   Nonzeros in original matrix     O(n^2)          O(n^3)
   Nonzeros in factors             O(n^2 log n)    O(n^4)
   Floating-point ops              O(n^3)          O(n^6)
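
  As a back-of-the-envelope illustration of these bounds (ignoring
  constants): a 3D grid with 10^8 equations has n ≈ 465, so the factors
  hold on the order of n^4 ≈ 4.7 × 10^10 entries (several hundred GB in
  double precision) and the factorization costs on the order of
  n^6 = 10^16 flops; this is why 3D problems hit memory and time limits
  long before 2D problems of the same size do.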

  2D problems: direct solvers are often preferred
  3D problems: current (and increasing) limit ≈ 100 million equations?
               Not yet exploiting the full power of emerging platforms!
  Research issues: parallelism, mapping of irregular data structures,
  scheduling for memory/for performance, memory scalability, out-of-core
  storage, reducing the cost of the solve phase, exploiting sparsity in
  RHS/solution, functionalities for new types of applications, ...
  Progress in sparse direct solver technology will naturally benefit
  hybrid direct-iterative solvers.
Concluding Remarks


  • Main original features of MUMPS:
    ◦ wide range of functionalities
    ◦ numerical stability in distributed-memory environments

  • Evolution of platforms and of application needs ⇒ considerable
    effort on algorithms and code development

  • More information:
      http://graal.ens-lyon.fr/MUMPS
      http://mumps.enseeiht.fr
      https://listes.ens-lyon.fr/sympa/arc/mumps-users