                 Fault Tolerance & PetaScale Systems:
            Current Knowledge, Challenges and Opportunities

                   Franck Cappello
                           INRIA
                          fci@lri.fr


      Keynote @ EuroPVM/MPI, September 2008, Dublin, Ireland
             Agenda

•Why is Fault Tolerance Challenging?
•What are the main reasons behind failures?
•Rollback Recovery Protocols?
•Reducing Rollback Recovery Time?
•Rollback Recovery without stable storage?
•Alternatives to Rollback Recovery?
•Where are the opportunities?
 Why is Fault Tolerance in PetaScale
systems for HPC applications challenging?
A)   FT is a difficult research area:
•    Related to many other issues: scalability, programming models,
     environments, communication libs., storage, etc.
•    FT is not considered a first-class issue
     (many vendors simply do not provide software to tolerate the failure of their systems)

•    Software solutions have to work on a large variety of hardware
•    New results need strong efforts (old research discipline)
•    Etc.

B) We will soon reach a situation where the “classic” Rollback-Recovery
    will simply not work anymore!
    --> let's see why
                Classic approach for FT:
                  Checkpoint-Restart

Typical “Balanced Architecture” for PetaScale Computers
[Figure: compute nodes (total memory: 100-200 TB) connected through network(s)
to I/O nodes and a parallel file system (1 to 2 PB) at 40 to 200 GB/s;
photos: RoadRunner, TACC Ranger, LLNL BG/L]

                1000 sec. < Ckpt < 2500 sec.
      Systems           Perf.     Ckpt time    Source
      RoadRunner        1 PF      ~20 min.     Panasas
      LLNL BG/L         500 TF    >20 min.     LLNL
      Argonne BG/P      500 TF    ~30 min.     LLNL estimation
      Total SGI Altix   100 TF    ~40 min.
      IDRIS BG/P        100 TF    30 min.      IDRIS

            Without optimization, Checkpoint-Restart needs about 1h!
            (~30 minutes each)
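
Where the “1000 sec. < Ckpt < 2500 sec.” range comes from, as a back-of-envelope Python sketch (only the memory and bandwidth figures quoted above are used; nothing else is assumed):

# Checkpoint time = memory footprint to dump / parallel file system bandwidth.
# Values are the ones quoted on the slide (100-200 TB of memory, 40-200 GB/s).
TB = 1e12
GB = 1e9

for mem_tb, bw_gbs in [(200, 200), (100, 200), (200, 40), (100, 40)]:
    t = (mem_tb * TB) / (bw_gbs * GB)      # seconds to write the full footprint
    print(f"{mem_tb} TB at {bw_gbs} GB/s -> {t:6.0f} s (~{t/60:.0f} min)")

# 200 TB / 200 GB/s = 1000 s and 100 TB / 40 GB/s = 2500 s bracket the
# slide's range; the extreme combinations give 500 s and 5000 s.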
          Failure rate and #sockets
 In the Top500, machine performance doubles every year (see Jack Dongarra's slide on the Top500)
 --> more than Moore's law, driven by the increase of #cores per CPU
 If we consider #cores x2 every 18, 24 and 30 months AND a fixed socket
 MTTI:
                                  SMTTI ~ 1 / (1 - (1 - 1/MTTI)^n)

[Figures from Garth Gibson: projected system MTTI and Rollback-Recovery time
over the years, with the 1h. wall marked]

          We may reach the 1h. wall as soon as 2012-2013
          Another projection from CHARNG-DA LU gives similar results

        It is urgent to optimize Rollback-Recovery for PetaScale
        systems and to investigate alternatives.
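
A small Python sketch of this projection, assuming the reconstructed formula above (independent socket failures, per-hour failure probability 1/MTTI per socket); the per-socket MTTI of 10 years and the socket counts are illustrative values, not taken from the slide:

# Projected system MTTI (SMTTI) when the per-socket MTTI is fixed and the
# number of sockets n grows. Assumes independent socket failures with a
# per-hour failure probability of 1/MTTI per socket (illustrative values only).
def smtti(mtti_hours: float, n_sockets: int) -> float:
    p_fail_hour = 1.0 / mtti_hours                        # per-socket failure prob. per hour
    p_sys_fail_hour = 1.0 - (1.0 - p_fail_hour) ** n_sockets
    return 1.0 / p_sys_fail_hour                          # expected hours between system failures

socket_mtti = 10 * 365 * 24        # hypothetical per-socket MTTI: 10 years, in hours
for n in (10_000, 20_000, 40_000, 80_000, 160_000):       # #sockets doubling over time
    print(f"{n:7d} sockets -> SMTTI ~ {smtti(socket_mtti, n):5.2f} h")

# Doubling the socket count roughly halves the SMTTI while failures stay rare,
# which is how the projection runs into the 1h. wall.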
                Agenda

•Why is Fault Tolerance Challenging?
•What are the main reasons behind failures?
•Rollback Recovery Protocols?
•Reducing Rollback Recovery Time?
•Rollback Recovery without stable storage?
•Alternatives to Rollback Recovery?
•Where are the opportunities?
           Understanding Approach:
                 Failure logs
• The Computer Failure Data Repository
  (CFDR)
  http://cfdr.usenix.org/
  From 1996 until now...
  HPC systems + Google

•   Failure logs from LANL, NERSC, PNNL,
    ask.com, SNL, LLNL, etc.

    Ex: LANL released root-cause logs for
    23,000 events causing application stops on 22
    clusters (5,000 nodes), over 9 years
               Analysis of failure logs
•   In 2005 (Ph.D. of CHARNG-DA LU): “Software halts account for the most
    number of outages (59-84 percent), and take the shortest time to repair (0.6-1.5
    hours). Hardware problems, albeit rarer, need 6.3-100.7 hours on the average to
    solve.”
•   In 2007 (Garth Gibson, ICPP Keynote):
    [Figure: root-cause breakdown, with Hardware at about 50%]
•   In 2008 (Oliner and J. Stearley, DSN Conf.):
    [Figure: analysis of supercomputer system logs]

Conclusion 1: Both Hardware and Software failures have to be considered
Conclusion 2: Oliner: logging tools fail too, some key info is missing; better
filtering (correlation) is needed
     An FT system should cover all causes of failures
     (Rollback Recovery is consistent with this requirement*)
                Agenda

•Why is Fault Tolerance Challenging?
•What are the main reasons behind failures?
•Rollback Recovery Protocols?
•Reducing Rollback Recovery Time?
•Rollback Recovery without stable storage?
•Alternatives to Rollback Recovery?
•Where are the opportunities?
   Rollback-Recovery Protocols
Coordinated Checkpoint
(Chandy/Lamport)
Saves snapshots (consistent global states),
from which the distributed execution can be
restarted. It uses marker messages to
“flush” the network and coordinate the
processes (checkpoint the application when
there are no in-transit messages).
Needs global coordination and rollback.
[Diagram: Nodes -- Sync, Ckpt, failure, detection, global stop, restart]

Uncoordinated Checkpoint
No global coordination (scalable).
Nodes may checkpoint at any time
(independently of the others).
Needs to log non-deterministic events:
in-transit messages.
[Diagram: Nodes -- Ckpt, failure, detection, restart]
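
A minimal single-threaded Python sketch of the marker-based “network flush” described above (class and driver names are made up for illustration; production implementations of the Chandy-Lamport protocol live inside the MPI runtime):

# Chandy-Lamport sketch: on snapshot, record local state, send markers on all
# outgoing channels, and log messages that were still in transit ("flush").
from collections import deque

MARKER = object()                       # special marker message

class Process:
    def __init__(self, pid, peers):
        self.pid = pid
        self.peers = peers              # ids of the processes it exchanges messages with
        self.state = 0                  # stand-in for the application state
        self.snapshot = None            # recorded local state
        self.channel_log = {}           # peer -> in-transit messages recorded for that channel
        self.recording = set()          # channels still being recorded

    def start_snapshot(self, channels):
        self.snapshot = self.state
        self.recording = set(self.peers)
        self.channel_log = {p: [] for p in self.peers}
        for p in self.peers:            # propagate markers on every outgoing channel
            channels[(self.pid, p)].append(MARKER)

    def receive(self, src, msg, channels):
        if msg is MARKER:
            if self.snapshot is None:   # first marker seen: take the local snapshot
                self.start_snapshot(channels)
            self.recording.discard(src) # this channel is now flushed
            return
        if self.snapshot is not None and src in self.recording:
            self.channel_log[src].append(msg)   # message was in transit at snapshot time
        self.state += msg               # "apply" the application message

# Driver: 3 fully connected processes, FIFO channels modeled as deques.
pids = [0, 1, 2]
procs = {i: Process(i, [j for j in pids if j != i]) for i in pids}
channels = {(i, j): deque() for i in pids for j in pids if i != j}

channels[(0, 1)].append(5)              # an application message already in flight
procs[0].start_snapshot(channels)       # process 0 initiates the global snapshot
channels[(1, 0)].append(7)              # sent by process 1 before it has seen any marker

while any(channels.values()):           # deliver messages in FIFO order per channel
    for (src, dst), ch in list(channels.items()):
        if ch:
            procs[dst].receive(src, ch.popleft(), channels)

for p in procs.values():                # recorded state + logged in-transit messages
    print(p.pid, p.snapshot, p.channel_log)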
 Roll-Back Recovery Protocols:
Which one to choose (MPI)?

[Classification figure: automatic (checkpoint based; log based: optimistic, causal,
pessimistic) vs. semi-automatic / other approaches; vertical axis: level in the stack]

•   Cocheck - independent of MPI [Ste96]
•   Starfish - enrichment of MPI [AF99]
•   Clip - semi-transparent checkpoint [CLP97]
•   LAM/MPI
•   Optimistic recovery in distributed systems - n faults with coherent checkpoint [SY85]
•   Manetho - n faults [EZ92]
•   Sender based Mess. Log. - 1 fault, sender based [JZ87]
•   Pruitt 98 - 2 faults, sender based [PRU98]
•   Egida [RAV99]
•   MPICH-V - N faults, distributed logging
•   OpenMPI-V
•   MPI-FT - N faults, centralized server [LNLE00]
•   RADIC (Europar’08)
•   MPI/FT - redundancy of tasks [BNC01]
•   LA-MPI - communications rerouting
•   FT-MPI - modification of MPI routines, user fault treatment [FD00]

 The main research domains (protocols):
 a) Removing the blocking of processes in coordinated ckpt.
 b) Reducing the overhead of message logging protocols
    Improved Message Logging                              Fig. from Bouteiller

•Classic approach (MPICH-V):                  [Figure: bandwidth of OpenMPI-V compared to others]
implements Message Logging at the
device level: all messages are copied.
•High-speed MPI implementations use
Zero Copy and decompose Recv into:
a) Matching, b) Delivery.

•OpenMPI-V implements Mess. Log. within
MPI: different event types are managed
differently, with a distinction between deterministic and
non-deterministic events, and optimized memory copies.
                                              [Figure: OpenMPI-V overhead on NAS (Myri 10g)]

        Coordinated and message logging protocols have
        been improved --> improvements are probably still
        possible but very difficult to obtain!
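
A toy Python sketch of the deterministic / non-deterministic distinction exploited above: only the outcome of a non-deterministic reception (which source a wildcard receive matched) needs to be logged as a “determinant”, so that a restarted process replays to the same state (names and driver are illustrative, this is not the OpenMPI-V code):

# Event logging sketch: deterministic events need no log entry; the outcome of a
# wildcard (ANY_SOURCE-like) receive is non-deterministic and is therefore logged,
# so re-execution after a failure can be forced to match the original run.
import random

class EventLog:
    def __init__(self):
        self.determinants = []          # ordered outcomes of non-deterministic events
        self.pos = 0                    # replay cursor

    def record(self, outcome):
        self.determinants.append(outcome)
        return outcome

    def replay(self):
        outcome = self.determinants[self.pos]
        self.pos += 1
        return outcome

def recv_any(pending_sources, log, replaying=False):
    """Wildcard receive: which pending source gets matched is non-deterministic."""
    if replaying:
        return log.replay()                       # follow the logged determinant
    return log.record(random.choice(sorted(pending_sources)))

log = EventLog()
first_run = [recv_any({0, 1, 2}, log) for _ in range(3)]
# ...the process fails and restarts from its last checkpoint: replaying the
# determinants reproduces exactly the same matching order.
replayed = [recv_any({0, 1, 2}, log, replaying=True) for _ in range(3)]
assert replayed == first_run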
                Agenda

•Why is Fault Tolerance Challenging?
•What are the main reasons behind failures?
•Rollback Recovery Protocols?
•Reducing Rollback Recovery Time?
•Rollback Recovery without stable storage?
•Alternatives to Rollback Recovery?
•Where are the opportunities?
           Reduce the Checkpoint time
          --> reduce the checkpoint size
Typical “Balanced Architecture” for PetaScale Computers
[Figure: compute nodes (total memory: 100-200 TB) connected through network(s)
to the I/O nodes at 40 to 200 GB/s; annotations: 100-200 TB, 10-50 TB?]

                                Reduce the size of
                                the data saved and
                                restored from the
                                remote file system
         1)     Incremental Checkpointing
         2)     Application level Checkpointing
         3)     Compiler assisted application level Checkpointing
         4)     Restart from local storage
        Reducing Checkpoint size 1/2
• Incremental Checkpointing:
A runtime monitor detects memory regions that
have not been modified between two adjacent
ckpts. and omits them from the subsequent ckpt.
OS-level Incremental Checkpointing uses
the memory management subsystem
to decide which data changed
between consecutive checkpoints
(see the sketch below).
[Fig. from J.-C. Sancho: fraction of the memory footprint overwritten during the main
iteration (full memory footprint vs. below the full memory footprint), for Sage
(1000 MB, 500 MB, 100 MB, 50 MB), Sweep3D, SP, LU, BT and FT]
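
A user-level Python sketch of the same idea (illustrative only: a real OS-level implementation tracks dirty pages through the memory management subsystem instead of hashing):

# Incremental checkpointing sketch: split the process "memory" into fixed-size
# regions, hash each region, and write only the regions whose hash changed
# since the previous checkpoint.
import hashlib

REGION = 4096  # bytes per region

def region_hashes(memory: bytearray):
    return [hashlib.sha1(memory[i:i + REGION]).digest()
            for i in range(0, len(memory), REGION)]

def incremental_checkpoint(memory: bytearray, prev_hashes):
    new_hashes = region_hashes(memory)
    delta = {i: bytes(memory[i * REGION:(i + 1) * REGION])
             for i, h in enumerate(new_hashes)
             if prev_hashes is None or h != prev_hashes[i]}
    return delta, new_hashes            # delta is what actually goes to storage

mem = bytearray(16 * REGION)
delta, hashes = incremental_checkpoint(mem, None)      # first ckpt: everything
mem[5 * REGION] = 0xFF                                 # application touches one region
delta, hashes = incremental_checkpoint(mem, hashes)    # second ckpt: one region only
print(len(delta), "region(s) written")                 # -> 1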
• Application Level Checkpointing
“Programmers know what data to save and when to save the state of the execution.”
The programmer adds dedicated code in the application to save the state of the
execution (a minimal sketch follows below).
Few results available:
Bronevetsky 2008: MDCASK code of the ASCI Blue Purple Benchmark;
a hand-written checkpointer eliminates 77% of the application state.
Limitation: impossible to optimize the checkpoint interval (the interval should be well
chosen to avoid a large increase of the exec. time --> cooperative checkpointing)

         Challenge (not scientific): establish a base of codes with
         Application Level Checkpointing
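
A minimal Python sketch of what the programmer-inserted checkpoint code can look like for an iterative solver (file name, state layout and interval are all hypothetical choices made by the programmer):

# Application-level checkpointing sketch: the programmer decides that only `it`
# and `x` are needed to resume, and how often to save them.
import os, pickle

CKPT_FILE = "solver.ckpt"      # hypothetical checkpoint location
CKPT_EVERY = 100               # interval chosen by the programmer (see the limitation above)

def save_state(it, x):
    with open(CKPT_FILE + ".tmp", "wb") as f:
        pickle.dump((it, x), f)
    os.replace(CKPT_FILE + ".tmp", CKPT_FILE)      # atomic rename: never a half-written ckpt

def load_state():
    if os.path.exists(CKPT_FILE):
        with open(CKPT_FILE, "rb") as f:
            return pickle.load(f)
    return 0, [0.0] * 1000                         # fresh start

it, x = load_state()
while it < 10_000:
    x = [0.5 * (v + 1.0) for v in x]               # stand-in for the real iteration
    it += 1
    if it % CKPT_EVERY == 0:
        save_state(it, x)                          # everything else is recomputed or dead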
     Reducing Checkpoint size 2/2
Compiler assisted application level checkpoint
•From Plank (compiler assisted memory exclusion)
•The user annotates the code for checkpointing
•The compiler detects dead data (not
modified between 2 ckpts.) and omits it
from the second checkpoint.
•Latest result (static analysis of 1D arrays)
excludes live arrays containing dead data:
--> 45% reduction in ckpt size for
mdcask, one of the ASCI Purple
benchmarks
[Fig. from G. Bronevetsky: checkpoint size, with 100% and 22% marked]

•Inspector-Executor (trace based) checkpoint
(INRIA study)
[Figure: memory addresses vs. execution time (s)]
Ex: DGETRF (max gain 20% over Incremental Checkpointing)
Needs more evaluation

     Challenge: Reducing checkpoint size (probably one of
     the most difficult problems).
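
An illustration of the effect the static analysis aims for: at run time the checkpoint routine is handed the set of arrays that are live at this checkpoint location (here written by hand, standing in for the compiler's result) and omits everything else:

# What a compiler-assisted checkpoint amounts to at run time: save only the
# arrays the analysis proved live at this checkpoint; dead data is omitted.
import pickle

def checkpoint(path, variables, live_set):
    state = {name: value for name, value in variables.items() if name in live_set}
    with open(path, "wb") as f:
        pickle.dump(state, f)

A = [1.0] * 1_000_000     # still read after the checkpoint -> live
B = [2.0] * 1_000_000     # fully overwritten before it is read again -> dead here
live_at_ckpt = {"A"}      # stand-in for the result of the static analysis
checkpoint("step.ckpt", {"A": A, "B": B}, live_at_ckpt)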
                Agenda

•Why is Fault Tolerance Challenging?
•What are the main reasons behind failures?
•Rollback Recovery Protocols?
•Reducing Rollback Recovery Time?
•Rollback Recovery without stable storage?
•Alternatives to Rollback Recovery?
•Where are the opportunities?
  Remove bottleneck of the I/O nodes
   and file system --> Checkpointing
         without stable storage
Typical “Balanced Architecture” for PetaScale Computers
[Figure: compute nodes (total memory: 100-200 TB), network(s), I/O nodes,
40 to 200 GB/s; annotation: no checkpoint will cross this line (towards the I/O nodes)!]

       1) Add storage devices in compute nodes and/or as extra “non
          computing” nodes
       2) Diskless Checkpointing
      Store ckpt. on SSD (Flash mem.)
[Figure: compute nodes and I/O nodes equipped with SSDs, network(s),
parallel file system (1 to 2 PB) at 40 to 200 GB/s; total memory: 100-200 TB]

•Current practice --> checkpoint on local disk (1 min.) and then move the
ckpt. images asynchronously to persistent storage (see the sketch below)
 --> Checkpoint still needs 20-40 minutes
•Recent proposal: use SSDs (Flash mem.) in the nodes or attached to the network
--> increases the cost of the machine (100 TB of flash memory) + increases
power consumption + needs replication of the ckpt. image on remote nodes (if
the SSDs are on the nodes) OR adds a large # of components to the system.
      Challenge: Integrate the SSD technology at a
      reasonable cost and without reducing the MTTI
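
A Python sketch of the two-phase “current practice” above: write fast to node-local storage, then drain asynchronously to the parallel file system (both paths and the thread-based drain are illustrative):

# Two-phase checkpointing sketch: fast write to node-local storage (disk/SSD),
# then an asynchronous copy to the parallel file system.
import shutil, threading, pickle

LOCAL_DIR = "/local/ssd/ckpt"          # hypothetical node-local path
PFS_DIR = "/pfs/app/ckpt"              # hypothetical parallel file system path

def checkpoint(step, state):
    local = f"{LOCAL_DIR}/ckpt_{step}.pkl"
    with open(local, "wb") as f:       # phase 1: blocks the application only briefly
        pickle.dump(state, f)
    t = threading.Thread(target=shutil.copy, args=(local, PFS_DIR), daemon=True)
    t.start()                          # phase 2: drain to stable storage in the background
    return t                           # caller may join() before deleting the local copy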
Diskless Checkpointing 1/2
Principle: compute a checksum of the processes’ memory and
store it on spare processors.
Advantage: does not require ckpt. on stable storage.

[Images from George Bosilca]
P1  P2  P3  P4          4 computing processors
P1  P2  P3  P4  Pc      Add a fifth “non computing” processor
P1  P2  P3  P4  Pc      Start the computation
P1 + P2 + P3 + P4 = Pc  Perform a checkpoint:
                        A) every process saves a copy of its
                        local state in memory or on local disk;
                        B) perform a global bitstream or floating-
                        point operation on all saved local states
P1  P2  P3  P4  Pc      Continue the computation
                ....
P1  P2  P3  P4  Pc      Failure
P1      P3  P4  Pc      Ready for recovery: all processes restore their
                        local state from the one saved in memory or
                        on local disk
P2 = Pc - P1 - P3 - P4  Recover P2's data
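
A numerical sketch of the floating-point variant of this scheme, with numpy arrays standing in for the local states of the four computing processes:

# Diskless checkpointing with a floating-point checksum: Pc = P1 + P2 + P3 + P4.
# If one process is lost, its state is recovered from Pc and the survivors.
import numpy as np

rng = np.random.default_rng(1)
P = [rng.random(1_000) for _ in range(4)]   # local states of the 4 computing processes
Pc = P[0] + P[1] + P[2] + P[3]              # checkpoint: global sum stored on the spare

lost = 1                                    # process P2 (index 1) fails
recovered = Pc - sum(p for i, p in enumerate(P) if i != lost)
assert np.allclose(recovered, P[lost])      # equal up to floating-point round-off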
   Diskless Checkpointing 2/2                            Images from CHARNG-DA LU

•Can be done at the application and system levels
•Process data can be considered (and encoded)
either as bit-streams or as floating-point numbers.
Computing the checksum from bit-streams uses operations
such as parity; computing the checksum from floating-point
numbers uses operations such as addition.

•Can survive multiple failures of arbitrary patterns:
Reed-Solomon for bit-streams and weighted checksums for
floating-point numbers (sensitive to round-off errors).

•Works with incremental ckpt.
•Needs spare nodes and doubles the memory occupation (to survive failures
during ckpt.) --> increases the overall cost and the #failures
•Needs a coordinated checkpointing or message logging protocol
•Needs very fast encoding & reduction operations
•Needs an automatic ckpt. protocol or program modifications

       Challenge: experiment more with Diskless CKPT,
       and on very large machines (current results are for ~1000 CPUs)
                Agenda

•Why is Fault Tolerance Challenging?
•What are the main reasons behind failures?
•Rollback Recovery Protocols?
•Reducing Rollback Recovery Time?
•Rollback Recovery without stable storage?
•Alternatives to Rollback Recovery?
•Where are the opportunities?
              Avoid Rollback-Recovery

Typical “Balanced Architecture” for PetaScale Computers
[Figure: compute nodes (total memory: 100-200 TB), network(s), I/O nodes,
40 to 200 GB/s; annotation: no checkpoint at all!]

      •    System monitoring and Proactive Operations
                               Proactive Operations
•Principle: predict failures and trigger
preventive actions when a node is suspected.
•Much research on proactive operations
assumes that failures can be predicted;
only a few papers are based on actual data.
[Figure: more than 50 measurement points monitored per Cray XT5 system blade
(node cards, switch, memory, network, APP-IO)]

•Most of this research refers to 2 papers, published
in 2003 and 2005, on a 350-CPU cluster and
a BG/L prototype (100 days, 128K CPUs):
    traces from either a rather small system (350 CPUs) or
    the first 100 days of a large system not yet stabilized.

BG/L prototype [Graphs from R. Sahoo]:
a lot of fatal failures (up to >35 a day!), everywhere in
the system, from many sources.
              Proactive Migration
•Principle: predict failures and migrate processes before the failures occur.
•Prediction models are based on the analysis of correlations between
non-fatal and fatal errors, and on temporal and spatial correlations between
failure events (a toy sketch follows this slide).
•Results on the first 100 days of BlueGene/L demonstrate good failure
predictability: 50% of the I/O failures could have been predicted (based on
trace analysis). Note that memory failures are much less predictable!
•Proactive migration may help to significantly increase
the checkpoint interval.
•Bad prediction has a cost (false positives and negatives have an impact
on performance) --> false negatives impose the use of rollback-recovery.
•Migration has a cost (need to checkpoint, and to log or delay messages).
  Results are lacking concerning real-time prediction and the
  actual benefits of migration in real conditions.
•What to migrate?
    •Virtual Machine, process checkpoint?
    •Only the application state (user checkpoint)?
•What to do with predictable software failures?
    Migrate OR keep safe software and dynamically replace the
    software that is predicted to fail?
    Challenge: Analyze more traces, Identify more
    correlations, Improve predictive algorithms
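
A toy Python sketch of the kind of correlation such predictors exploit: many non-fatal errors in a window --> predict a fatal failure in the next window; false positives trigger useless migrations, false negatives fall back on rollback-recovery (the log is entirely synthetic):

# Toy failure predictor: "many non-fatal errors in the last window" => predict a
# fatal failure in the next window. Synthetic log; thresholds are illustrative.
import random

random.seed(0)
WINDOWS, THRESHOLD = 200, 5
nonfatal = [random.randint(0, 8) for _ in range(WINDOWS)]
# Fatal failures are made more likely after noisy windows, plus a few random ones.
fatal_next = [(n >= 6 and random.random() < 0.7) or random.random() < 0.02
              for n in nonfatal]

predicted = [n >= THRESHOLD for n in nonfatal]
tp = sum(p and f for p, f in zip(predicted, fatal_next))
fp = sum(p and not f for p, f in zip(predicted, fatal_next))   # useless migrations
fn = sum(f and not p for p, f in zip(predicted, fatal_next))   # must fall back to rollback-recovery
print(f"true positives={tp}  false positives={fp}  false negatives={fn}")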
                Agenda

•Why is Fault Tolerance Challenging?
•What are the main reasons behind failures?
•Rollback Recovery Protocols?
•Reducing Rollback Recovery Time?
•Rollback Recovery without stable storage?
•Alternatives to Rollback Recovery?
•Where are the opportunities?
                          Opportunities
May come from a strong modification of the problem statement:
                 Failures: from exceptions to normal events

From the system side: “Alternative FT Paradigms”:
    –Replication (mask the effect of failures),
    –Self-Stabilization (forward recovery: push the system towards a legitimate state),
    –Speculative Execution (commit only correct speculative state modifications).

From the applications & algorithms side: “Failure Aware Design”:
    –Application level fault management (FT-MPI: reorganize the computation),
    –Fault Tolerance Friendly Parallel Patterns (confine failure effects),
    –Algorithm Based Fault Tolerance (compute with redundant data),
    –Naturally Fault Tolerant Algorithms (algorithms resilient to failures).

Since these opportunities have received only little attention until recently,
they need further exploration in the context of PetaScale systems.
    Does Replication make sense?
[Slide from Garth Gibson]

Needs investigation of the process slowdown with high-speed networks.
Currently too expensive (doubles the hardware & power consumption).

•Design new parallel architectures with very cheap and low-power nodes
•Replicate only the nodes that are likely to fail --> failure prediction
               Fault Tolerance Friendly
                Parallel Patterns 1/2
The Chandy-Lamport algorithm assumes a « worst case » situation: all processes may
communicate with all other ones --> these communications influence all processes.
This is not necessarily true for all parallel programming/execution patterns.
[Ideas from E. ELNOZAHY]

1) Divide the system into recovery domains --> failures in one domain are confined to
the domain and do not force further failure effects across domains.
May need some message logging (interesting only if there are few inter-domain comms.).

2) Dependency-based recovery: limit the rollback to those nodes that have acquired
dependencies on the failed ones. In typical MPI applications, processes exchange
messages with a limited number of other processes (be careful: domino effect).

FT Friendly Parallel Patterns still need fault detection and correction.

Examples of Fault Tolerance Friendly Parallel Patterns: Master-Worker,
Divide&Conquer (Barnes-Hut, Raytracer, SAT solver, TSP, Knapsack)
-->SATIN (D&C framework of IBIS): transparent FT strategy dedicated to the D&C pattern
(a master-worker sketch follows this slide)
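
A minimal Python sketch of why the Master-Worker pattern is FT friendly: the master just re-issues the tasks held by a failed worker, so the failure effect is confined to re-executing those tasks (the failure model and the task body are placeholders):

# Master-worker with task re-execution: a worker failure only costs redoing the
# tasks it held; no global rollback. (Sequentially simulated for illustration.)
import random

def run_master(tasks, fail_prob=0.2):
    random.seed(2)
    results, pending = {}, list(tasks)
    while pending:
        task = pending.pop()
        if random.random() < fail_prob:    # the worker holding this task "fails"
            pending.append(task)           # failure confined: simply re-issue the task
            continue
        results[task] = task * task        # stand-in for the real work
    return results

print(run_master(range(10)))               # all 10 results despite simulated failures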
                  FT Friendly Parallel Patterns 2/2
  •Divide&Conquer
  Many non-trivial parallel
  applications:
  Barnes-Hut, Raytracer,
  SAT solver,
  Tsp, Knapsack
  (combinatorial optimization
  problems)...

  A simple example (IPDPS 2005): Fibonacci with SATIN (IBIS) [Figure by G. Wrzesińska]
  [Figure: a task tree (tasks 1-15) spread over processors 1, 2 and 3.
  Processor 3 is disconnected, leaving orphan jobs; it broadcasts its list of
  orphans ((9, cpu3), (15, cpu3)). When processor 1 re-computes task 2 and
  then task 4, it reconnects to processor 3 and the orphan jobs are recovered.]
 “Algorithm Based Fault Tolerance”
In 1984, Huang and Abraham proposed ABFT to detect and correct errors in
some matrix operations on systolic arrays.

ABFT encodes the data & redesigns the algorithm to operate on the encoded data.
Failures are detected and corrected off-line (after the execution). [From G. Bosilca]

ABFT variation for on-line recovery (the runtime detects failures + is robust to failures):
•Similar to Diskless ckpt., an extra processor, Pc, is
added to store the checksum of the data
(vectors X and Y in this case):
Xc = X1 + ... + Xp, Yc = Y1 + ... + Yp,
Xf = [X1, ..., Xp, Xc], Yf = [Y1, ..., Yp, Yc].
•Operations are performed on Xf and Yf
instead of X and Y: Zf = Xf + Yf

     P1    P2    P3    P4    Pc
     X1    X2    X3    X4    Xc
  +  Y1    Y2    Y3    Y4    Yc
  =  Z1    Z2    Z3    Z4    Zc

•Compared to diskless checkpointing, the memory
AND CPU of Pc take part in the computation.
•No global operation for the checksum!
•No local checkpoint!

Works for many Linear Algebra operations:
Matrix Multiplication:  A * B = C -> Ac * Br = Cf
LU Decomposition:       C = L * U -> Cf = Lc * Ur
Addition:               A + B = C -> Af + Bf = Cf
Scalar Multiplication:  c * Af = (c * A)f
Transpose:              Af^T = (A^T)f
Cholesky factorization & QR factorization
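
A small numpy sketch of the checksum encoding for matrix multiplication (Ac * Br = Cf) listed above, including detection and correction of a single corrupted entry (the fault injection is artificial):

# ABFT sketch: append a column-checksum row to A and a row-checksum column to B;
# the product then carries its own checksums, which locate and fix a single error.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3)); B = rng.random((3, 5))
Ac = np.vstack([A, A.sum(axis=0)])                  # column-checksum encoding of A
Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row-checksum encoding of B
Cf = Ac @ Br                                        # fully checksummed product

Cf[1, 2] += 7.0                                     # inject a fault into one data entry

row_err = Cf[:-1, :-1].sum(axis=1) - Cf[:-1, -1]    # per-row checksum discrepancy
col_err = Cf[:-1, :-1].sum(axis=0) - Cf[-1, :-1]    # per-column checksum discrepancy
i = int(np.argmax(np.abs(row_err)))                 # faulty row
j = int(np.argmax(np.abs(col_err)))                 # faulty column
Cf[i, j] -= row_err[i]                              # correct the faulty entry

assert np.allclose(Cf[:-1, :-1], A @ B)             # data part matches the true product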
   “Naturally fault tolerant algorithm”
Natural fault tolerance is the ability to tolerate failures through the mathematical
properties of the algorithm itself, without requiring notification or recovery.
The algorithm includes natural compensation for the lost information. [Figure from A. Geist]

For example, an iterative algorithm may require more iterations to converge, but it
still converges despite the lost information.

Assumes that a maximum of 0.1% of the tasks may fail.

Ex1: Meshless iterative methods + chaotic relaxation
(asynchronous iterative methods)
[Figure: meshless formulation of a 2-D finite difference application]
(a toy illustration follows this slide)

Ex2: Global MAX (used in iterative methods to determine convergence)
This algorithm shares some features
with Self-Stabilization algorithms:
detection of termination is very hard!
It provides the max « eventually »…
BUT it does not tolerate Byzantine
faults (Self-Stabilization does, for
transient failures + acyclic topology).
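
A toy Python illustration of Ex1: a Jacobi relaxation of a 1-D Laplace problem in which a few randomly chosen point updates are “lost” at every sweep; the method still converges to the same solution, just in more sweeps (the 20% loss rate is exaggerated for visibility, far above the 0.1% assumed above):

# Jacobi relaxation of u'' = 0 on [0,1], u(0)=0, u(1)=1 (exact solution u(x)=x).
# At each sweep a random subset of interior updates is dropped (stale values kept),
# mimicking lost task contributions; convergence just takes more sweeps.
import numpy as np

rng = np.random.default_rng(3)
n = 51
exact = np.linspace(0.0, 1.0, n)

def solve(drop_fraction, tol=1e-3):
    u = np.zeros(n); u[-1] = 1.0                    # boundary conditions
    sweeps = 0
    while np.max(np.abs(u - exact)) > tol:
        new_interior = 0.5 * (u[:-2] + u[2:])       # standard Jacobi update
        keep = rng.random(n - 2) >= drop_fraction   # "failed" updates keep stale values
        u[1:-1] = np.where(keep, new_interior, u[1:-1])
        sweeps += 1
    return sweeps

print("no losses :", solve(0.0), "sweeps")
print("20% losses:", solve(0.2), "sweeps")          # converges anyway, more slowly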
                     Wrapping-up
Fault tolerance is becoming a major issue for users of large
scale parallel systems.

Many Challenges:
•Reduce the cost of Checkpointing (checkpoint size & time)
•Design better logging and analysis tools
•Design less expensive replication approaches
•Integrate Flash mem. tech. while keeping the cost low and the MTTI high
•Investigate the scalability of Diskless Checkpointing
•Collect more traces, identify correlations, design new predictive algorithms


Opportunities may come from Failure Aware application
Design and from the investigation of Alternative FT Paradigms,
in the context of HPC applications.

								