

 Al Geist
 Corporate Fellow
 Oak Ridge National Laboratory

    Architectures I Workshop
    Palo Alto, CA
    August 3, 2011

            Research supported by DOE ASCR
     Monster in the Closet

     As a child, the monster in the closet was the thing you
     never saw but were afraid would come out in the dark
     and get you.

        Our monster in the closet
        is the resilience of an
        exascale system.

                                   Resilience and power
                                   are cornerstone gaps

    Yes, I can solve the exascale power problem: just ignore resilience.
    A machine that is “down” consumes no power.

2   Computer Science and Mathematics Division
Lessons Learned: Resilience will bite you
           The monster in the closet has peeked out

    • 2003: Virginia Tech builds “Big Mac” out of 1,100 Apple
      G5 machines. Claimed 3rd-fastest system in the world.
         – Reality – the system would not stay up during the day, only at night.
         – Reason – consumer motherboards don’t use ECC memory, and cosmic
           rays caused too many single-bit memory faults
         – After a few months it was decommissioned and sold for parts on eBay
    • This is not an isolated case
    • LANL, at over 5,000’ altitude, gets significantly more cosmic rays
      than sites at sea level.
         – Supercomputers there see more gates flip spontaneously in
           memory and processor chips
    • SNL has seen bits flip inside data as it flowed from node to node
      in Red (a hardware problem, not cosmic rays)
         – This rare, transient fault was very difficult to detect and track down

    Permanent Error – hard failure of HW or SW that is permanent; requires repair
    Transient Error – soft failure of HW or SW that is a blip, e.g. bit flips or mis-computes
    Undetected Error – silent errors, permanent or transient, that go undetected

    Holistic Fault Tolerance
    • Fault awareness exists throughout the stack, from hardware, OS,
      and runtime to the application
    • Fault detection built into each level of the stack
    • Fault notification available to any component in the stack
    • Coordinated recovery across the levels and components of the stack

    Continuous Failure
    • Faults are continuous and occur at all levels of the stack
    • Continuous in the sense that recovery cannot complete before a
      new fault occurs
    • Forces a paradigm shift: applications and system software must be
      designed to produce correct results despite continuous faults

Continuous Failure is already being seen
    Transient bit flips in memory are the first
    place we are experiencing continuous failure.
    Thankfully, ECC memory continuously corrects the
    single-bit errors, but not double-bit errors.

    Double-bit error rate
         – Jaguar has a lot of memory (362 TB)
         – It endures a constant stream of single-bit errors
         – It has a double-bit error about every 24 hours (better than
           published DRAM FIT rates would predict)
         – Chipkill ECC lets the system run through double-bit errors,
           which conventional DRAM ECC cannot correct
    What is the double-bit error rate of an exascale system?
         – The exascale target is 128 PB of memory (354 times Jaguar)
         – That translates into a double-bit error about every 4 minutes
         – Frequent enough to need something better than chipkill at exascale
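The projected interval above is simple linear scaling of Jaguar's observed rate to the exascale memory target. A minimal sketch of that back-of-envelope arithmetic, assuming the double-bit-error rate scales linearly with memory capacity:

```python
# Scale Jaguar's observed double-bit-error (DBE) interval to the exascale
# memory target, using the figures from the slide above. Linear scaling
# of error rate with memory capacity is a simplifying assumption.

JAGUAR_MEMORY_TB = 362            # Jaguar's DRAM capacity
EXASCALE_MEMORY_TB = 128_000      # 128 PB exascale target
JAGUAR_DBE_INTERVAL_HOURS = 24    # roughly one DBE per day observed

scale = EXASCALE_MEMORY_TB / JAGUAR_MEMORY_TB             # ~354x Jaguar
dbe_interval_minutes = JAGUAR_DBE_INTERVAL_HOURS * 60 / scale

print(f"memory scale factor: {scale:.0f}x")               # ~354x
print(f"projected DBE interval: {dbe_interval_minutes:.1f} minutes")  # ~4 minutes
```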

Nature has already had to address
continuous failure on billions of processors
    Human Brain
    • Laid out flat: 0.25 m² (not 3D), 2 mm thick, with 6 layers
    • Wire diameter 10 nm; neuron diameter 4 microns
    • 100 billion neurons (1 x 10^11)
    • 0.15 quadrillion connections (1.5 x 10^14)
    • 150,000 km of wiring
    • Weight 1.3 kg
    • Frequency 30 Hz
    • Power 25 W

    System architects, eat your heart out:
             • Continuous failure – 85,000 neurons die/day, millions misfire/sec
             • Errors in calculations – makes mistakes (example: visual illusions)
             • Self correction – sensor fusion, redundant calculation
             • Self healing – reprograms around physical damage
    What happened to all the doom and gloom
    about Petascale system failure rates?
    They came true: Faults on Petascale systems are already “continuous”

    Jaguar stays up about 6 days, but there are many faults during this
    time that the RAS system handles without a system interrupt.
    But some of those faults kill the application(s) running on the
    affected resources.

    ORNL developed a real-time event-stream
    monitor with root-cause analysis and fault-
    prediction modules (plus offline analysis of
    the archives). Some results:
    • Jaguar event streams over 72 hours (3 days)
      averaged over 20 faults/hour
        − A node heartbeat fault every 3 minutes
        − 861 machine exceptions
        − 12 kernel panics
    • Fault-prediction cluster analysis found no
      small sets of predictive events. The smallest
      had 14 events. Cause or effect?
    Resilience isn’t just about bit flips
       Failures can come from any hardware or software component.

    We often consider failures due to the growing complexity of
    multicore CPUs and GPUs with millions of transistors. Here is a
    cautionary tale showing it can be any little thing:
    Jaguar’s most recent resilience challenge was a simple voltage
    regulator module. A single failure takes out the interconnect.
            – There are over 18 thousand of these in Jaguar
    Analysis: the failure randomly occurs AFTER a job finishes, not during it.
           – Tried polling vs. idling to avoid temperature variation (cooling down of the part)
    This worked for a few weeks, but failures started again.
           – Tweaked to a lower voltage (a change of less than a volt)
           – Jaguar has been stable for over a month now
    Resilience is a constant struggle

    System Resilience and Usability
                             Exascale resilience target of 1-6 days MTBAI*
                                   Where did this range come from?

    • Stabilizing large systems takes days to weeks
         – Over the past three generations, systems have stabilized to around 6 days between reboots
         – Users told me uptime of less than a day is not worth running on
    • Japan tsunami and follow-on rolling power blackouts
         – Their supercomputers are seeing resilience problems – no time to stabilize

    [Charts: reboots per week. Typical behavior: declining 4, 3, 2, 1 as
    the system stabilizes. System not stabilizing: 7, 5, 3, 2.]
                                                * Mean time between application interrupt requiring user intervention
Today’s Fault Tolerance
         From the application’s perspective, why bother?

• Faults appear to be rare events (ORNL’s Jaguar stays up
  for about a week)
• Application is not notified when a fault is detected
• MPI is not perceived as being fault tolerant – so limited
  recourse to recover even if notified
• The application is happily running one second…
• The next thing it knows, several hours have passed and it
  finds itself starting up again, this time with the “restart”
  flag set.

               The application has no chance to adapt to faults and
               keep running, so developers don’t even try.
               They stick to checkpointing, with restart on failure.
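The checkpoint/restart pattern the slide describes can be sketched minimally as follows. The file name, state layout, and `restart` flag here are illustrative, not any particular batch system's convention:

```python
import os
import pickle

# Minimal sketch of application-level checkpoint/restart: the app
# periodically saves its state; after a failure the batch system
# relaunches it with a "restart" flag and it resumes from the last
# checkpoint, losing only the work since that checkpoint.

CHECKPOINT = "app.chkpt"   # illustrative file name

def load_or_init(restart):
    if restart and os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)          # resume from saved state
    return {"step": 0, "total": 0.0}       # fresh start

def run(steps, restart=False, checkpoint_every=10):
    state = load_or_init(restart)
    while state["step"] < steps:
        state["total"] += state["step"]    # stand-in for real computation
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            with open(CHECKPOINT, "wb") as f:
                pickle.dump(state, f)      # periodic checkpoint to "disk"
    return state["total"]
```

After a crash, relaunching with `restart=True` repeats only the steps since the last checkpoint; the rest of the work is preserved.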

     Exponential growth in scale is
     driving fault rates toward continuous
     • Fundamental assumptions of
       application and system software
       design did not anticipate exponential
       growth in parallelism
     • The number of system components is
       increasing faster than component
       reliability is improving
     • Mean time between failures of
       minutes or seconds for an exascale
       system built from today's components
     • Undetected error rates are increasing
     • Checkpoint/restart overhead is too large
       for future systems
     • Apps must change to run through faults
        Need to develop a rich methodology for
        apps to “run through” faults

      Taxonomy for Applications to Handle Faults
      (Our EASI Math/CS Institute is developing resilient
      algorithms using several of these methods.)

      Some state saved:
          – Restart from a checkpoint file stored on disk [large apps today]
          – Restart from a diskless checkpoint stored in memory
            [avoids stressing the I/O system and causing more faults]
          – Recalculate lost data from distributed in-memory info
            [e.g., RAID in a neighbor’s memory]
          – Specify reliable memory regions and execution sections

      No checkpoint:
          – Lossy recalculation of lost data
          – Recalculate lost data from the remaining data
          – Replicate computation across the system
          – Reassign lost work to another resource
          – Use naturally fault tolerant algorithms
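The "RAID in a neighbor's memory" idea in the taxonomy can be sketched with simple XOR parity: each node holds a checkpoint block in RAM, one extra node holds the XOR of all blocks, and any single lost block is rebuilt from the survivors plus the parity, with no disk I/O. Block layout and function names are illustrative, not any library's API:

```python
from functools import reduce

# Toy diskless-checkpoint sketch: XOR parity over in-memory blocks,
# as RAID does over disks. Losing any ONE block is recoverable.

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def make_parity(node_blocks):
    # Held by a spare node; updated whenever checkpoints are taken.
    return xor_blocks(node_blocks)

def recover_lost_block(surviving_blocks, parity):
    # parity XOR (all surviving blocks) == the one missing block
    return xor_blocks(surviving_blocks + [parity])
```

For example, with three nodes' blocks and one node lost, XORing the two survivors with the parity yields the lost block exactly.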
                       One Solution for Continuous Failure
         Naturally Fault Tolerant Algorithms
               Can parallel algorithms be developed that are scalable and
               naturally fault tolerant, i.e., the failure of tasks can be ignored
               (no monitoring, no notification, no recovery), yet the
               algorithm still gets the correct answer?

                   What classes of problems can be made naturally fault tolerant?

           Our approach to answering this question:
           – Develop an extreme-scale system simulator
           – Develop classes of non-trivial problems with
             natural fault tolerance
             (i.e., not the easy embarrassingly parallel case)

 Examples: Naturally Fault Tolerant algorithms

      Demonstrated that scale invariance and natural fault tolerance
      can exist for local and global algorithms where hundreds of
      failures happen across a million processes during execution

       Finite difference (Christian Engelmann)                  [local]
       – Demonstrated natural fault tolerance with a chaotic-
         relaxation, meshless, finite-difference solution of
         Laplace and Poisson problems
       Global maximum (Kasidit Chanchio)                        [global]
       – Demonstrated natural fault tolerance in the global
         max problem, based on random directed graphs
       Gridless multigrid (Ryan Adams)
       – Combines the fast convergence of multigrid with
         the natural fault tolerance property; a hierarchical
         implementation of the finite-difference approach above
       – Three different asynchronous updates explored

       Theoretical analysis (Jeffery Chen)
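The chaotic-relaxation idea behind the finite-difference example can be illustrated in a few lines: a 1-D Laplace solve by relaxation in which each interior point (a stand-in "task") randomly skips its update, as if it had failed for that sweep. There is no detection or recovery, yet the iteration still settles to the correct solution, only more slowly. Grid size, sweep count, and skip probability below are illustrative:

```python
import random

# Toy natural-fault-tolerance demo: in-place relaxation of the 1-D
# Laplace equation with boundary values u(0)=0, u(1)=1. Each sweep,
# every interior point "fails" (skips its update) with probability
# skip_prob. The fixed point is unchanged: the linear ramp i/(n-1).

def solve_laplace_1d(n=32, sweeps=5000, skip_prob=0.1, seed=1):
    rng = random.Random(seed)
    u = [0.0] * n
    u[-1] = 1.0                      # boundary conditions
    for _ in range(sweeps):
        for i in range(1, n - 1):
            if rng.random() < skip_prob:
                continue             # this task "failed" this sweep
            u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u
```

Skipped updates slow convergence but never corrupt the answer, which is the essence of ignoring task failures outright.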
                Projected Exponential Growth in
                     Failure Rate 2006-2018
                       Based on scale increase and circuit size decrease

     [Left chart: the expected growth in failure rate, assuming the number
     of cores per socket grows by a factor of 2 every 18, 24, and 30 months
     and the number of sockets increases to stay on the TOP500 curve.
     G. Gibson, Journal of Physics: Conference Series 78 (2007) 012022.]

     [Right chart: trends in VLSI components. The number of dopant atoms
     per transistor is dropping. Very little energy is needed to flip
     latches at these resolutions and voltages. Observed fault rates are
     increasing 8% per year in VLSI components.
     S. Borkar, IEEE Micro, vol. 25, no. 6, pp. 10-16, Dec. 2005.]
        Hardware Challenges
        Making the Resilience Worse
      The number of components, both memory and processors, will increase
      by an order of magnitude, which will increase hard and soft errors.

      Smaller circuit sizes, running at lower voltages to reduce power
      consumption, increase the probability of switches flipping spontaneously due
      to thermal and voltage variations as well as radiation, increasing transient errors.

      Power-management cycling significantly decreases component
      lifetimes due to thermal and mechanical stresses.

      There is resistance to adding additional HW detection and recovery logic
      right on the chips to detect silent errors, because it would increase power
      consumption by 15% and increase chip costs.

      Heterogeneous systems make error detection and recovery even harder; for
      example, detecting and recovering from an error in a GPU can involve hundreds
      of threads simultaneously on the GPU and hundreds of cycles to drain pipelines
      before recovery can begin.
     Software Challenges Making
     the Resilience Worse

     Existing fault tolerance techniques (global checkpoint/global restart)
     will be impractical at exascale.
     There is no standard fault model, nor a standard fault test suite or
     metrics to stress resilience solutions and compare them fairly.
     Errors, fault root causes, and propagation are not well
     understood. It is hard to solve something that isn't understood.
     Present programming models and languages do not offer a
     paradigm for resilient programming. A failure of a single task often
     leads to the killing of the entire MPI application.
     Present applications (and most system software) are neither fault
     tolerant nor fault aware, and are not designed to confine errors/faults, to
     avoid or limit their propagation, or to recover from them in a holistic
     fashion. (A perfect opportunity for co-design.)

     Paradigm Shift with Continuous Failures
            When an application experiences faults faster than it can recover,
            a completely new paradigm for fault tolerance must be created.
            It must be able to make progress even though faults cannot be fixed.

          Paradigm shift: put less effort into eliminating failure and more
          effort into software that can adapt and run through faults.
          Software at all levels of the stack, not just applications.

          [Diagram: failures occur at all levels of the stack, from the
          hardware up.] What is the first step?
        We need a Standard Fault Model

          A “Standard Fault Model” formally spells out exactly what
          detection, notification, and recovery features will be portably
          supported across exascale systems.

            Before application developers and runtime software developers can
            begin creating software that can dynamically adapt to faults in an
            exascale system, the developers need to understand:
            • What types of faults/degradations are likely?
            • What are the methods of notification about faults?
            • What system features are available to enable dynamic
              adaptation and recovery?
            The answers to these questions form a “Fault Model”
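To make the three questions concrete, here is a hypothetical sketch of what the application-facing side of such a fault model could look like: a catalog of fault types, a way to register for notification, and a hook the detection layer calls. None of these names come from a real standard; the slide's point is precisely that no standard exists yet.

```python
from enum import Enum, auto

# Hypothetical fault-model interface sketch (illustrative names only).

class FaultType(Enum):
    NODE_LOSS = auto()
    MEMORY_DBE = auto()               # uncorrectable double-bit error
    NETWORK_DEGRADED = auto()
    SILENT_DATA_CORRUPTION = auto()

class FaultRuntime:
    """Delivers fault notifications to registered application handlers."""

    def __init__(self):
        self._handlers = {}

    def on_fault(self, ftype, handler):
        # The application registers a recovery/adaptation callback.
        self._handlers.setdefault(ftype, []).append(handler)

    def notify(self, ftype, detail):
        # Called by the (hypothetical) system detection layer.
        for handler in self._handlers.get(ftype, []):
            handler(detail)
```

A standard would pin down which `FaultType`s are portably detectable, what `detail` carries, and what recovery features the system guarantees after notification.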

        We also need a framework to test the
        resilience of algorithms
                Much like we compare performance today, we need a
                “fault test suite” or standard metrics to stress resilience
                solutions and compare them fairly across systems.

           The behavior of self-healing algorithms during faults is harder to
           quantify than performance because:
           • Behavior can be a function of when the faults happen (time)
           • Where the fault occurs within a million-node system (space)
           • The rate at which failures occur and how many occur simultaneously (rate)
           • The dynamic adaptation method used by the algorithm
           • The size of the problem and the size of the computer

     To test scaling and resilience, ORNL has
     developed an exascale system simulator
      X-Sim was developed at ORNL as part of the DOE Institute for Advanced Architecture project
       – In June 2011 it simulated an MPI app running on a 100-million-processor system
             (breaking the previous record, a 38-million-task MPI simulation by Vinod Tipparaju on Jaguar)
      The simulator is a parallel application
        – Runs on a Linux cluster
      Adjustable topology
       – Configured at startup
      Provides very flexible simulation (injection) of failures
       – Single components or groups of components can be set to fail: distributed or local, at configurable rates
      Supports Fortran and C applications and MPI
       – To provide scientists an experimental platform
       – To explore the scalability of their algorithms
       – To consider fault tolerance
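The fault-injection idea behind such a simulator can be reduced to a tiny sketch: run a set of simulated tasks, fail each one independently with a configurable probability, and check whether the algorithm under test still reaches an acceptable answer. This illustrates the concept only; it is not the ORNL simulator's actual interface.

```python
import random

# Minimal fault-injection harness sketch (illustrative, not a real API):
# each simulated task fails independently with probability fail_prob;
# survivors contribute their results.

def run_with_faults(n_tasks, work_fn, fail_prob, seed=0):
    rng = random.Random(seed)
    alive = [rng.random() >= fail_prob for _ in range(n_tasks)]
    results = [work_fn(i) for i in range(n_tasks) if alive[i]]
    return results, alive.count(False)
```

An algorithm claimed to be naturally fault tolerant would be run through such a harness at varying failure rates, and its output compared against the fault-free answer.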

      Challenge: Validation in the Presence of Faults
          “Success from the perspective of the application is
          a correct solution in a timely fashion.” – John Daly

      If the solution is not correct:
           – The result could be caused by a bug in the “rewritten” application
           – The result could be an error in the more complex models
           – The result could be caused by numerical stability issues
             (e.g., Jaguar broke Linpack)

      In the presence of faults we must also consider:
           – The result was caused by an undetected fault
           – Fault recovery introduces perturbations
           – The result may depend on which nodes fail
           – The result may depend on when nodes fail
           – The result looks reasonable but is actually wrong
           – We can’t afford to run every job three times
             (“I’ll just keep running the job till I get the answer I want.”)
     Conclusion: A Paradigm Shift is Coming

     The fault rate is growing exponentially; therefore faults will
     eventually become continuous.
     Faults will be continuous and across all levels from HW to apps
     (no one level can solve the problem; the solution must be holistic).
     Expectations should be set accordingly with users and developers.
     Self-healing system software and application codes are needed.
     Development of such codes requires a fault model and a
     framework to test resilience at scale.
     ORNL has created a real-time fault-stream analyzer and an
     extreme-scale simulator to explore the failure behavior of algorithms.
     Validation in the presence of faults is critical for scientists to have
     faith in the results generated by exascale systems.
