                       PaStiX: how to reduce memory overhead
                                  ASTER meeting
                          Bordeaux, Nov 12-14, 2007



PaStiX team
LaBRI, UMR CNRS 5800, Université Bordeaux I
Projet ScAlApplix, INRIA Futurs
PaStiX solver

• Current development team (SOLSTICE)
      •   P. Hénon (researcher, INRIA)
      •   F. Pellegrini (assistant professor, LaBRI/INRIA)
      •   P. Ramet (assistant professor, LaBRI/INRIA)
      •   J. Roman (professor, leader of the ScAlApplix INRIA project)
• PhD student & engineer
      • M. Faverge (NUMASIS project)
      • X. Lacoste (INRIA)
• Other contributors since 1998
      • D. Goudin (CEA-DAM)
      • D. Lecas (CEA-DAM)
• Main users
      • Electromagnetism & structural mechanics codes at CEA-DAM CESTA
      • MHD plasma instabilities for ITER at CEA-Cadarache (ASTER)
      • Fluid mechanics at MAB Bordeaux
PaStiX solver
• Functionalities
       •   LLt, LDLt, LU factorization (symmetric pattern) with a supernodal implementation
       •   Static pivoting (max. weight matching) + iterative refinement / CG / GMRES
       •   1D/2D block distribution + full BLAS3
       •   Supports external ordering libraries (Scotch ordering provided)
       •   MPI/Threads implementation (SMP node / cluster / multi-core / NUMA)
       •   Simple/double precision + real/complex operations
       •   Requires only C + MPI + POSIX threads
       •   Multiple RHS (direct factorization)
       •   Incomplete factorization ILU(k) preconditioner
• Available on INRIA Gforge
       •   All-in-one source code
       •   Easy to install on Linux or AIX systems
       •   Simple API (WSMP-like), see the call sketch below
       •   Thread safe (can be called from multiple threads in multiple MPI communicators)
• Current works
       •   Use of parallel ordering (PT-Scotch) and parallel symbolic factorization
       •   Dynamic scheduling inside SMP nodes (static mapping)
       •   Out-of-core implementation
       •   Generic finite element assembly (domain decomposition associated with the matrix distribution)
pastix.gforge.inria.fr

• Latest publication (to appear in Parallel Computing): "On finding
  approximate supernodes for an efficient ILU(k) factorization"
• For more publications, see http://www.labri.fr/~ramet/
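
To make the WSMP-like interface concrete, here is a minimal call sketch in C. It assumes the historical PaStiX 5 entry point pastix() with its iparm/dparm parameter arrays; the prototype, the IPARM_ and API_ constant names, and the calling sequence are recalled from memory of the PaStiX 5 release and should be checked against the pastix.h header of the installed version. Treat it as an illustration of the calling style, not as reference code.

/*
 * Minimal call sketch for the WSMP-like interface (NOT reference code).
 * The pastix() prototype and the IPARM_ / API_ constants below are
 * recalled from the PaStiX 5 distribution and may differ in detail from
 * the installed release: check pastix.h before relying on them.
 * Matrix symmetry, factorization type (LLt/LDLt/LU), number of threads,
 * etc. are selected through further iparm entries not shown here.
 */
#include <mpi.h>
#include "pastix.h"

int main(int argc, char **argv)
{
    pastix_data_t *pastix_data = NULL;      /* opaque solver state           */
    pastix_int_t   iparm[IPARM_SIZE];       /* integer parameters            */
    double         dparm[DPARM_SIZE];       /* floating-point parameters     */

    /* 1-based CSC storage (Fortran numbering) of a tiny 2x2 SPD matrix,
       lower triangular part only:  [ 4  1 ; 1  3 ] */
    pastix_int_t   n        = 2;
    pastix_int_t   colptr[] = { 1, 3, 4 };
    pastix_int_t   row[]    = { 1, 2, 2 };
    pastix_float_t avals[]  = { 4.0, 1.0, 3.0 };
    pastix_float_t b[]      = { 1.0, 2.0 }; /* RHS, overwritten by solution  */
    pastix_int_t   perm[2], invp[2];        /* permutation and its inverse   */

    MPI_Init(&argc, &argv);

    /* First call: fill iparm/dparm with default values */
    iparm[IPARM_MODIFY_PARAMETER] = API_NO;
    iparm[IPARM_START_TASK]       = API_TASK_INIT;
    iparm[IPARM_END_TASK]         = API_TASK_INIT;
    pastix(&pastix_data, MPI_COMM_WORLD, n, colptr, row, avals,
           perm, invp, b, 1, iparm, dparm);

    /* Second call: ordering, symbolic factorization, numerical
       factorization, solve and iterative refinement in one shot */
    iparm[IPARM_START_TASK] = API_TASK_ORDERING;
    iparm[IPARM_END_TASK]   = API_TASK_REFINE;
    pastix(&pastix_data, MPI_COMM_WORLD, n, colptr, row, avals,
           perm, invp, b, 1, iparm, dparm);

    /* Last call: release internal storage */
    iparm[IPARM_START_TASK] = API_TASK_CLEAN;
    iparm[IPARM_END_TASK]   = API_TASK_CLEAN;
    pastix(&pastix_data, MPI_COMM_WORLD, n, colptr, row, avals,
           perm, invp, b, 1, iparm, dparm);

    MPI_Finalize();
    return 0;
}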
MPI/Threads implementation for SMP clusters

•   Mapping by processor (MPI only)
    •   Static scheduling by processor
    •   Each processor owns its local part of the matrix (private user space)
    •   Message passing (MPI or MPI shared memory) between any pair of processors
    •   Aggregation of all contributions is done per processor
    •   Data coherency ensured by MPI semantics

•   Mapping by SMP node (MPI/Threads)
    •   Static scheduling by thread
    •   All the processors of a same SMP node share a local part of the matrix (shared user space)
    •   Message passing (MPI) between processors on different SMP nodes
    •   Direct access to shared memory (pthreads) between processors on the same SMP node
    •   Aggregation of non-local contributions is done per node
    •   Data coherency ensured by an explicit mutex (see the pthread sketch below)
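
The mapping-by-SMP-node items (shared user space, coherency through an explicit mutex) can be illustrated with plain POSIX threads. The code below is only a sketch, not PaStiX code: it shows threads of one SMP node adding their contributions directly into a shared block under a mutex, which is what replaces per-processor MPI aggregation inside a node.

/* Minimal sketch (not PaStiX code): threads of one SMP node accumulate
 * contributions directly into a shared block; an explicit mutex ensures
 * data coherency, instead of the MPI aggregation used in the
 * mapping-by-processor scheme. */
#include <pthread.h>
#include <stdio.h>

#define BLOCK_SIZE 4
#define NTHREADS   2

static double          shared_block[BLOCK_SIZE];   /* block shared by the node */
static pthread_mutex_t block_lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_contribution(void *arg)
{
    int    tid = *(int *)arg;
    double local[BLOCK_SIZE];

    for (int i = 0; i < BLOCK_SIZE; i++)    /* local update (computed with   */
        local[i] = (tid + 1) * 0.5;         /* BLAS kernels in the solver)   */

    pthread_mutex_lock(&block_lock);        /* explicit mutex: one thread    */
    for (int i = 0; i < BLOCK_SIZE; i++)    /* updates the shared block at a */
        shared_block[i] += local[i];        /* time, no MPI message needed   */
    pthread_mutex_unlock(&block_lock);
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    int       id[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        id[t] = t;
        pthread_create(&th[t], NULL, add_contribution, &id[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);

    printf("shared_block[0] = %g\n", shared_block[0]);  /* 0.5 + 1.0 = 1.5 */
    return 0;
}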
MPI only

•   Processors 1 and 2 belong to the same SMP node
•   Data exchanges when only MPI processes are used in the parallelization
MPI/Threads

•   Threads 1 and 2 are created by one MPI process
•   Data exchanges when there is one MPI process per SMP node and one thread
    per processor

    → far fewer MPI communications (only between SMP nodes)
    → no aggregation needed for modifications of blocks belonging to
      processors inside the same SMP node
AUDI test case: 943,103 unknowns (symmetric)

Requires MPI_THREAD_MULTIPLE!
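
Since several computation threads of one MPI process may call MPI concurrently in this mode, the library has to be initialized with full thread support. A minimal, PaStiX-independent sketch of that initialization:

/* Minimal sketch: requesting MPI_THREAD_MULTIPLE, as required when several
 * computation threads of one MPI process may call MPI concurrently
 * (the MPI/Threads mode with one MPI process per SMP node). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not supported (got level %d)\n",
                provided);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    /* ... one MPI process per SMP node, one thread per processor ... */

    MPI_Finalize();
    return 0;
}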
How to reduce memory resources
• Goal: we want to adapt a (supernodal) parallel direct solver (PaStiX) to
  build an incomplete block factorization and benefit from all the features
  that it provides:

   • The algorithms are based on linear algebra kernels (BLAS)

   • Load balancing and task scheduling are based on a fine modeling of
     computation and communication

   • Modern architecture management (SMP nodes): hybrid Threads/MPI
     implementation
Main steps of the incomplete factorization algorithm

1. find the partition P0 induced by the supernodes of A
2. compute the block symbolic incomplete factorization
   Q(G,P0)^k = Q(G^k,P0) (see the scalar level-of-fill sketch after these
   steps)
3. find the exact supernode partition in Q(G,P0)^k
4. given an extra fill-in tolerance α, construct an approximate supernode
   partition Pα to improve the block structure of the incomplete factors
5. apply a block incomplete factorization using the parallelization
   techniques developed for our direct solver PaStiX
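
Step 2 relies on the classical level-of-fill rule behind ILU(k): an entry (i,j) is kept when its level lev(i,j) = min over p of lev(i,p) + lev(p,j) + 1 (with lev = 0 for entries of A) does not exceed k. The scalar sketch below is only an illustration of that rule on a dense level matrix; the block symbolic factorization of step 2 applies the same idea to the quotient graph Q(G,P0).

/* Scalar illustration (not PaStiX code) of the ILU(k) level-of-fill rule.
 * A dense level matrix is used for clarity; the real symbolic
 * factorization works on sparse block structures. */
#include <limits.h>
#include <stdio.h>

#define N   5
#define INF (INT_MAX / 4)            /* "no entry" marker                    */

/* Propagate fill levels by symbolic elimination, then drop levels > k. */
static void symbolic_iluk(int lev[N][N], int k)
{
    for (int p = 0; p < N; p++)                  /* eliminate unknown p      */
        for (int i = p + 1; i < N; i++)
            for (int j = p + 1; j < N; j++) {
                int fill = lev[i][p] + lev[p][j] + 1;
                if (fill < lev[i][j])
                    lev[i][j] = fill;            /* candidate fill-in        */
            }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (lev[i][j] > k)
                lev[i][j] = INF;                 /* dropped by ILU(k)        */
}

int main(void)
{
    const int k = 1;
    int lev[N][N];

    /* Arrowhead pattern: diagonal plus first row and column.
       Eliminating unknown 0 creates level-1 fill everywhere, so
       ILU(0) keeps the arrow only while ILU(1) becomes dense. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            lev[i][j] = (i == j || i == 0 || j == 0) ? 0 : INF;

    symbolic_iluk(lev, k);

    for (int i = 0; i < N; i++) {                /* print the kept pattern   */
        for (int j = 0; j < N; j++)
            printf("%s", lev[i][j] <= k ? " x" : " .");
        printf("\n");
    }
    return 0;
}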

				