									                   AMA 2011
 Massively parallel architectures:
some open algorithmic questions

Luc Giraud (INRIA Bordeaux - Sud-Ouest, HiePACS team, Laboratoire

Xavier Vasseur (CERFACS/Parallel Algorithms and HiePACS team)

                      8 FEBRUARY 2011
    IESP Roadmap – main questions

      • Which programming models and tools to use on petascale
        and exascale heterogeneous platforms?

      • How to deal with huge amounts of data [I/O operations,
        visualization, scientific data management]?

      • How to cope with faults and errors [at both hardware
        and software levels]?

"The International Exascale Software Roadmap," Dongarra, J., Beckman, P. et al., to appear in Volume 25, Number 1,
              2011, International Journal of High Performance Computing Applications, ISSN 1094-3420.
      IESP Roadmap – challenges related to

                                  Develop algorithms that
    • are suited to heterogeneous architectures [architectural-
    • manage errors and faults [resilience]
    • minimize power use [energy-aware algorithms]
    • minimize communications [communication-avoiding]

"The International Exascale Software Roadmap," Dongarra, J., Beckman, P. et al., to appear in Volume 25, Number 1,
              2011, International Journal of High Performance Computing Applications, ISSN 1094-3420.
     New or improved algorithms for petascale and
    exascale platforms required at each level of the
               hierarchical architecture

•    Algorithms at the intra-node level
•    Algorithms at the inter-node level
•    Algorithms at the global level
•    Mid-term and long-term projects
      Intra-node level : open challenges
• The new generation of nodes can be
  seen as a heterogeneous architecture of
  multicore CPUs and accelerators

• The current hybrid programming model
  is based on MPI and OpenMP

  How to derive efficient algorithms with
  heterogeneous components in a given
  node?

  How to exploit all computing units
  simultaneously?
     Intra-node level: new methodology
• Represent the algorithm as a collection of
  multiple tasks and dependencies
• Associate each task with a high-level
  performance implementation [CPU, GPU]
• Properly schedule the tasks' execution over
  the available multicore and accelerator
  components using a runtime system
• Data management and coherency handled by
  the dynamic task scheduler

  Idea: dynamically distribute tasks to the
 most appropriate units according to priority
             and memory affinity
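
The methodology above can be illustrated with a minimal Python sketch. It is not the scheduler of any library named in these slides: the function `schedule`, its arguments, and the fixed `transfer` penalty (a crude stand-in for memory affinity) are all hypothetical. Each task has a priority, a set of prerequisites, and one cost per unit kind for which an implementation exists; the highest-priority ready task is placed on the unit where it would finish earliest.

```python
def schedule(priority, deps, cost, units, transfer=0.0):
    """Greedy list scheduling of a task graph on heterogeneous units.

    priority: dict task -> number (larger is scheduled first)
    deps:     dict task -> iterable of prerequisite tasks
    cost:     dict (task, unit_kind) -> run time of that implementation
    units:    dict unit_name -> unit_kind, e.g. {"cpu0": "cpu", "gpu0": "gpu"}
    transfer: assumed delay when a prerequisite ran on another unit
    Returns dict task -> (unit_name, start, end).
    """
    free_at = {u: 0.0 for u in units}
    placed = {}
    remaining = set(priority)
    while remaining:
        # A task is ready once all of its prerequisites have been placed.
        ready = [t for t in remaining
                 if all(d in placed for d in deps.get(t, ()))]
        if not ready:
            raise ValueError("dependency cycle or unknown prerequisite")
        task = min(ready, key=lambda t: (-priority[t], t))
        best = None
        for unit, kind in units.items():
            if (task, kind) not in cost:
                continue  # no implementation of this task for this unit kind
            data_ready = max(
                (placed[d][2] + (transfer if placed[d][0] != unit else 0.0)
                 for d in deps.get(task, ())),
                default=0.0,
            )
            start = max(free_at[unit], data_ready)
            end = start + cost[(task, kind)]
            if best is None or end < best[2]:
                best = (unit, start, end)
        if best is None:
            raise ValueError(f"no implementation available for task {task!r}")
        placed[task] = best
        free_at[best[0]] = best[2]
        remaining.remove(task)
    return placed
```

A real runtime system makes these decisions online, with measured or modelled costs, rather than from a static table as here.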
    Intra-node level: recent libraries using
       this hybridization methodology
•   Quark/DAGUE [U. Tennessee, U. Berkeley]
•   DPLASMA [U. Tennessee, U. Berkeley]
•   ThreadPool [Sandia, Trilinos project]
•   StarPU [INRIA, Runtime]
•   SMPsuperscalar [BSC]

      Resulting hybrid algorithms are often more
      efficient than homogeneous algorithms designed
      exclusively for either GPUs or homogeneous
      multicore CPUs
          Intra-node level: an example in
          dense linear algebra [Cholesky]

     8 Intel Nehalem X5550 CPU cores + 3 NVIDIA FX5800 GPUs
         INTEL MKL library on CPU + MAGMA kernel on GPU
                       StarPU runtime system
E. Agullo et al., Faster, cheaper, better – a hybridization methodology to develop linear algebra software for GPUs,
                                             “GPU Computing Gems” (2010)
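
To make the task decomposition behind such a Cholesky factorization concrete, here is a small NumPy sketch of a right-looking tiled algorithm. It is an illustration only, not the MKL/MAGMA/StarPU code used for the experiment: the function name `tiled_cholesky` and the recorded task list are hypothetical, and every tile operation runs sequentially on the CPU. The kernel names in the comments (POTRF, TRSM, SYRK, GEMM) follow the usual LAPACK/BLAS naming; in a hybrid setting each recorded task is what a runtime system would dispatch to a CPU core or a GPU.

```python
import numpy as np

def tiled_cholesky(A, nb):
    """Tiled Cholesky factorization A = L @ L.T with tile size nb.

    Returns (L, tasks) where tasks lists the tile kernels in execution order.
    """
    n = A.shape[0]
    if n % nb != 0:
        raise ValueError("matrix size must be a multiple of the tile size")
    T = n // nb
    W = np.array(A, dtype=float)  # working copy, overwritten tile by tile
    tasks = []

    def tile(i, j):
        return (slice(i * nb, (i + 1) * nb), slice(j * nb, (j + 1) * nb))

    for k in range(T):
        # POTRF: factor the diagonal tile.
        W[tile(k, k)] = np.linalg.cholesky(W[tile(k, k)])
        tasks.append(("POTRF", k))
        for i in range(k + 1, T):
            # TRSM: W_ik <- W_ik @ inv(L_kk).T, done with a linear solve.
            W[tile(i, k)] = np.linalg.solve(W[tile(k, k)], W[tile(i, k)].T).T
            tasks.append(("TRSM", i, k))
        for i in range(k + 1, T):
            # SYRK: update the trailing diagonal tile.
            W[tile(i, i)] -= W[tile(i, k)] @ W[tile(i, k)].T
            tasks.append(("SYRK", i, k))
            for j in range(k + 1, i):
                # GEMM: update the trailing off-diagonal tile.
                W[tile(i, j)] -= W[tile(i, k)] @ W[tile(j, k)].T
                tasks.append(("GEMM", i, j, k))
    return np.tril(W), tasks
```

Tasks that touch different tiles and have no pending dependency can run concurrently, which is what a dynamic scheduler exploits.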
Inter-node level: communication
      avoiding algorithms
Communication (data movement either between levels of
  a memory hierarchy or between processors) is costly,
                 computation is cheap

  Algorithms should minimize communication, even if that may
          require some redundant arithmetic operations
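
A common way to formalize this trade-off, given here as an assumed model rather than one stated on the slide, splits the running time into an arithmetic term, a bandwidth term, and a latency term. The symbols F (floating-point operations), W (words moved), S (messages sent) and the per-unit costs gamma, beta, alpha are notation introduced for this sketch.

```latex
T \;=\; \gamma \, F \;+\; \beta \, W \;+\; \alpha \, S,
\qquad \gamma \ll \beta \ll \alpha
```

Under this model, an algorithm that increases F slightly through redundant arithmetic can still be faster if it reduces W and S enough.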
  Inter-node level: communication avoiding
     algorithms (s-step Krylov methods)
                    GMRES algorithm: CA-GMRES

• Break data dependencies in the Krylov subspace construction
• Main goal: amortize the cost of each kernel over s steps
• Compute [A x, A^2 x, ..., A^s x],
  i.e. s matrix-vector products for the cost of one
  matrix-vector product [using overlapping ghost zones
  and redundant computation]
• Orthogonalize s+1 vectors for the cost of one
 Successfully applied to sparse linear algebra and dense linear
                        algebra in 2010

Chronopoulos and Gear, s-step iterative methods for symmetric linear systems, J. Comp. Appl. Math.,
                                       25, 1989, pp. 153-168

         Hoemmen, Communication-avoiding Krylov subspace methods, PhD thesis, 2010.
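
The two kernels of an s-step method can be sketched in NumPy as follows. This is a sequential illustration with hypothetical function names, so it shows what is computed, not the communication savings themselves: `krylov_block` builds the s+1 basis vectors in one pass, and `tsqr` orthogonalizes them with a tall-skinny QR in which each row block (standing in for one processor) does a local QR and only the small R factors are combined. The monomial basis used here is the simplest choice and is assumed only for illustration.

```python
import numpy as np

def krylov_block(A, v, s):
    """Return the n x (s+1) matrix [v, A v, ..., A^s v]."""
    V = np.empty((v.size, s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

def tsqr(V, nblocks):
    """Tall-skinny QR of V using one local QR per row block.

    Each block must have at least as many rows as V has columns.
    Returns (Q, R) with Q having orthonormal columns and Q @ R == V.
    """
    k = V.shape[1]
    blocks = np.array_split(V, nblocks, axis=0)
    local = [np.linalg.qr(B) for B in blocks]          # independent local QRs
    R_stack = np.vstack([R for _, R in local])         # only k x k factors move
    Q2, R = np.linalg.qr(R_stack)                      # one small combining QR
    Q = np.vstack([Qi @ Q2[i * k:(i + 1) * k]
                   for i, (Qi, _) in enumerate(local)])
    return Q, R
```

In a distributed setting the combining step would be a single reduction over the processors, instead of one reduction per vector as in a step-by-step Gram-Schmidt process.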
  Inter-node level: improve algorithms
     through interactions with tools

• Use advanced analysis and diagnostic tools to identify
  potential performance bottlenecks related to
  communication and synchronization
• Detect wait states due to load or communication imbalance
  [especially important on applications on irregular and/or
  dynamic domains]

• Use processor virtualization for dynamic balancing to
  obtain a better overlap between computation and
  communication [Charm++ runtime system,
M. Geimer et al., The Scalasca performance toolset architecture, Concurrency and Computation: practice
                                  and experience, 2010, 22, pp 702-719.
Rodrigues et al, Optimizing an MPI weather forecasting model via processor virtualization, Proceedings of
                     International Conference on High Performance Computing (2010)
        Global level : resilience

• Future platforms made of hundreds of millions of
  heterogeneous cores

• Mean time to failure is projected to be about 60 seconds on
  exascale computers: faults will be almost continuous!
     Developing new fault- and error-tolerant
     algorithms is thus a key issue

     Resilience is one of the most critical issues
     according to the IESP roadmap
  Global level : resilience – existing
• Rollback recovery aims at partly re-running the
  simulation from a certain point before the given failure(s)

• Rollback recovery is based on:
  checkpointing [memory dump]
  message logging [keep trace of all communications]

• Unfortunately a complete rollback recovery is too
  expensive on petascale or exascale platforms!

• Development of hierarchical rollback recovery (cluster of
  nodes by cluster of nodes) in progress
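
A rough back-of-the-envelope sketch shows why complete rollback recovery stops paying off. It relies on Young's first-order approximation for the checkpoint interval, which is an assumption brought in for this illustration and is only accurate when the checkpoint cost is much smaller than the mean time between failures. The function names and the 600-second checkpoint cost in the usage note are hypothetical; only the 60-second mean time to failure comes from these slides.

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Young's approximation of the optimal time between checkpoints (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def waste_fraction(checkpoint_cost, mtbf):
    """First-order estimate of the fraction of time lost to checkpointing
    and to re-executing work after a failure, capped at 1.0."""
    tau = young_interval(checkpoint_cost, mtbf)
    return min(1.0, checkpoint_cost / tau + tau / (2.0 * mtbf))
```

With a hypothetical 600-second global checkpoint, `waste_fraction(600, 86400)` (one failure per day) is about 0.12, whereas `waste_fraction(600, 60)` saturates at 1.0: with a failure every minute the machine would make no progress at all.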
         Global level : resilience – possible
            ideas at the algorithm level
        • Illustration when solving a large linear system of equations A x = b
          with iterative methods assuming that dynamic data have been lost on a
          given processor f
        • Recover partially lost data through local interpolation [Langou et al., 2007]

          Only approximate data on processor f are obtained, may impact the
        • Regenerate exactly the lost data on processor f through ABFT
          algorithms [Huang et al., 1984, Bosilca et al., 2009]
          Rigorous but costly !
Huang and Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. on Comp., 33, pp. 518-528, 1984.
Langou et al., Recovery patterns for iterative methods in a parallel unstable environment, SIAM SISC, 30-1, pp. 102-116, 2007.
Bosilca et al., Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, 69, pp. 410-416, 2009.
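
The checksum idea behind ABFT can be sketched for a matrix product in the spirit of Huang and Abraham. The function names are hypothetical and the sketch handles only the loss of a single, known row: A is extended with a row of column sums and B with a column of row sums, so their product carries checksums of C = A B from which the lost row can be regenerated exactly (up to rounding).

```python
import numpy as np

def checksum_product(A, B):
    """Return the full-checksum matrix of C = A @ B.

    The result has one extra row (column sums of C) and one extra
    column (row sums of C).
    """
    A_c = np.vstack([A, A.sum(axis=0)])                    # column-checksum A
    B_r = np.hstack([B, B.sum(axis=1, keepdims=True)])     # row-checksum B
    return A_c @ B_r

def recover_row(C_full, lost):
    """Rebuild data row `lost` of a full-checksum matrix from its checksum row."""
    m = C_full.shape[0] - 1                 # number of data rows
    others = [i for i in range(m) if i != lost]
    repaired = C_full.copy()
    repaired[lost] = C_full[m] - C_full[others].sum(axis=0)
    return repaired
```

The extra cost is one additional row and column carried through the computation, which is the price of the rigor noted above.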
     Global level : resilience – possible
        ideas at the algorithm level
    • Illustration when solving a large linear system of equations A x = b
      with iterative methods assuming that dynamic data have been lost on a
      given processor f

    •   Use of fixed-point schemes or classical overlapping Schwarz-type
        methods [domain decomposition paradigm]

    • Regenerate a local iterate from available data on neighbor interface
      and continue the iterative process [demo on the 1D-Laplace problem]

    • Extensions to nonoverlapping situations and nonlinear problems

New ideas to be implemented and tested into numerical libraries
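
A small NumPy sketch of this idea on the 1D Laplace problem follows. It is not the demo referred to above: the function names, the non-overlapping block Jacobi iteration (chosen as the simplest fixed-point scheme), and the fault injection are assumptions made for illustration. When the data of one subdomain are lost, the local iterate is regenerated by solving the subdomain problem with the current values held by the neighbours, and the iteration simply continues.

```python
import numpy as np

def laplace_1d(n):
    """Tridiagonal matrix of the 1D Laplace operator (stencil -1, 2, -1)."""
    return 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def local_solve(A, b, x, idx):
    """Solve the subdomain problem on idx, using current values of x elsewhere."""
    rest = np.setdiff1d(np.arange(len(b)), idx)
    rhs = b[idx] - A[np.ix_(idx, rest)] @ x[rest]
    return np.linalg.solve(A[np.ix_(idx, idx)], rhs)

def block_jacobi(A, b, parts, iters, fault_at=None, failed=None):
    """Block Jacobi iteration with an optional simulated data loss.

    At iteration `fault_at`, the entries of subdomain `failed` are lost
    and regenerated from neighbour data before the sweep continues.
    """
    x = np.zeros_like(b)
    for it in range(iters):
        if fault_at is not None and it == fault_at:
            x[parts[failed]] = np.nan                                  # data lost
            x[parts[failed]] = local_solve(A, b, x, parts[failed])     # regenerate
        x_new = x.copy()
        for idx in parts:
            x_new[idx] = local_solve(A, b, x, idx)
        x = x_new
    return x
```

Because a fixed-point scheme needs no history beyond the current iterate, the regenerated values act as a new starting guess on the failed subdomain and convergence to the same solution is preserved.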
Global level : resilience – idea at the
    programming model level
• Parallel programming models may have to take into
  account directly both faults and errors
• Each task is interpreted as a transaction and the correct
  completion of each task is checked

• Future integration into XcalableMP

• Future integration into Star superscalar
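
The task-as-transaction idea can be sketched in plain Python. This is not the XcalableMP or Star superscalar interface: `run_transaction`, `task`, and `check` are hypothetical names. The task runs on a private copy of the state; the result is committed only if the check passes, and otherwise the task is re-executed from the untouched original.

```python
import copy

def run_transaction(task, state, check, max_retries=3):
    """Run task(state_copy) and commit the copy only if check(copy) is true.

    Returns the committed state; the input state is never modified.
    Raises RuntimeError if no attempt passes the check.
    """
    for _ in range(max_retries):
        trial = copy.deepcopy(state)   # work on a private copy
        try:
            task(trial)
        except Exception:
            continue                   # treat a crash as a failed attempt
        if check(trial):
            return trial               # commit
    raise RuntimeError("task did not complete correctly within max_retries")
```

How to design cheap yet reliable checks for numerical tasks is left open in this sketch.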
    Summary and ongoing projects

 Strategy A: refine existing algorithms so that
 they suit heterogeneous architectures and
 can manage faults

 Strategy B: develop completely new algorithms
based on different mathematical models to reach
   the same goals and address the question of
reproducibility of results on exascale platforms
                    CERFACS HPC 2020 in climate modelling

     HPC technical impacts
Individual performance and scalability of component codes
      • Decadal prediction: large ensemble of short simulations at (very) high resolution
             Ensemble and space parallelisation
      • Long data assimilation windows to take into account large sets of observations:
             Parameterisation of model error
             Time parallelisation
      • Limited code scalability
             OpenMP/MPI hybrid parallelisation
             New dynamical cores, new grids (e.g. icosahedral)
      • Lack of robustness of long MPP simulations
             Resilient (fault tolerant) libraries

Overall performance of the multi-physics coupled system
     • Coupling of legacy codes developed by independent groups
            Compromise between integrated vs multi-executable approach
     • Load imbalance and multiscale component coupling
            Improve flexibility in the deployment of the components
     • Increase of the workflow complexity
            Improve integrity and flexibility in the execution of the different tasks on different platforms

Data management
     • Exponential increase in Input/Output
           Efficient parallel I/O
     • Exponential increase in the amount of data to pre/post-process
           On-line treatment, new techniques for data compression
            ANR BLANCHE - RESCUE

• Duration: 48 months, starting October 2010
• Partners: INRIA Saclay (GRAND LARGE), INRIA Rhône-Alpes
  (ROMA) and INRIA Bordeaux-Sud-Ouest (HiePACS)

• Main topics:
  Resilience for exascale applications
  Development of protocols for exascale fault-tolerance
  Development of performance and execution models
  Development of new fault-tolerant algorithms

• Access to Blue Waters at Urbana-Champaign (USA)
    G8 initiative-ECS: Enabling climate
        simulation at extreme scale
• Duration: xx months, starting yy 2011
• Partners: Academic partners in Canada, France,
  Germany, Japan, Spain and USA

• Main topics:
  Resilience for climate simulations at exascale
  Node-level performance optimization
  Climate codes involved: CESM (NCAR),
  NICAM (JAMSTEC, U. Tokyo)

•    Access to Blue Waters (USA), Jugene
    (Germany), Tsubame (Japan)
               EESI (European Exascale Software Initiative)

•   Industrial and engineering applications [P. Ricoux]
•   Weather, climatology and earth sciences [G. Aloisio]
•   Fundamental sciences [G. Sutmann]
•   Life science and health [M. Orozco]
•   Hardware roadmaps, links with vendors [H. Huber]
•   Software eco-system [F. Cappello]
•   Numerical libraries, solvers and algorithms [I. Duff]
•   Scientific software engineering [M. Ashworth]
     Related summer schools
• CEA/EDF/INRIA summer school:
 “Toward petaflop numerical simulation on parallel
 hybrid architectures”, June 6-10, 2011, Sophia-

• CEMRACS summer school:
 “Numerical methods and algorithms on petaflop
 platforms”, July 16 - August 24, 2012, Marseille
