Fault Tolerance & Failure Containment

Shared by: HC120918014149
Posted: 9/17/2012

• Even after extensive quality engineering activities
  (inspections, testing, formal correctness proofs, etc.),
  some defects may remain in the software system.

• To keep the software system operational we need
  to consider two strategies:
   – Tolerating “local” faults
   – Containing failure damage so that it does not spread
                  General Concept

• Both Fault Tolerance and Failure Containment are
  practiced in many other disciplines:

  – Mechanical systems use duplicate systems to back up
    and replace faulty equipment; the cost is the
    duplicate hardware that replaces the faulty hardware.

  – Chemical systems use containment mechanisms when
    there is leakage or melt-down damage; the cost is the
    containment “wall” or “closure” employed to limit the
    damage.
              Examples in Software Systems
• A network fault (e.g. a virus on a single node) may be contained/tolerated:
  – via network re-routing algorithms that provide alternative
    routes to the destination, albeit slower ones
     • Duplication of routes (multiple and possibly duplicative connections)
     • Containment of failure (slower/reduced performance, but traffic eventually
       gets there)
• A device control mechanism fault may be tolerated:
  – via a back-up mechanism that captures check-pointed
    information and restarts processing from the last check-point,
    albeit duplicating some processing
     • Duplication of processing (re-processing of data from the last
       check-point)
     • Containment of failure (everything processed prior to the check-point
       still stands)
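
The check-point/restart idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the names process_with_checkpoints, save_checkpoint, and demo, the JSON checkpoint file, and the "save every 3 items" policy are all hypothetical choices made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    # Write to a temporary file and then rename it, so a crash in the
    # middle of a write cannot leave a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def process_with_checkpoints(items, path, handle, every=3):
    """Sum handle(item) over items, check-pointing every `every` items."""
    state = {"next_index": 0, "total": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # recover the last check-pointed state
    for i in range(state["next_index"], len(items)):
        state["total"] += handle(items[i])
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state["total"]


def demo():
    """Simulate one fault on item 5, then restart from the checkpoint."""
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    calls = []
    fault = {"armed": True}

    def handle(x):
        calls.append(x)
        if x == 5 and fault["armed"]:
            fault["armed"] = False
            raise RuntimeError("simulated fault")
        return 2 * x

    items = [1, 2, 3, 4, 5, 6]
    try:
        process_with_checkpoints(items, path, handle)
    except RuntimeError:
        pass  # the run failed; restart from the last checkpoint
    return process_with_checkpoints(items, path, handle), calls
```

In demo(), the fault hits after the checkpoint taken at item 3. The restart keeps the work done on items 1 to 3 (containment) and repeats item 4, which was processed but not yet check-pointed (duplication of processing).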
             Two Important Assumptions
1. Rare Event: Failure is rare and the probability of failure
   is very low; such a failure is hard to anticipate specifically, and
   therefore calls for fault tolerance and failure containment
   considerations. (e.g. elevator control software goes into a stop
   state whenever “anything” goes wrong.)

2. Failure Independence: Different components of the system
   fail independently of one another and failures can be localized; thus
   the failing component may be replaced and/or the failure
   may be contained. (e.g. different valve control programs in
   process control software: if one fails, we may want to
   shut down the whole plant until the specific failing control
   program is fixed. If these control programs are all linked or
   coupled, then we may have a bigger problem.)

    ** Note that the second assumption is also why we
    promote loose coupling in software design.
               Techniques Classification
• Fault Tolerance techniques:
   1. Duplication:
      • We may run multiple versions in parallel and pick a consensus
        solution (n-version programming)
   2. Backup/Recovery:
      • Software (algorithm and database) runs with regular
        checkpoints that back up the information processed; when a
        problem is encountered, the software may recover by going
        back to the last check-pointed data and reprocessing
      • We have primary and secondary software (programs); when
        the primary fails, the secondary (less functional)
        program may be swapped in to continue processing in a
        degraded state.
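
The primary/secondary swap can be illustrated with a small sketch. Everything here is an assumption made for illustration: primary_sqrt stands in for a full-function program with a seeded fault on large inputs, secondary_sqrt is a deliberately less capable stand-in that returns only the integer part, and run_with_fallback is a hypothetical wrapper.

```python
import math


def primary_sqrt(x):
    """Full-function primary; the fault for large inputs is seeded on purpose."""
    if x > 1e6:
        raise OverflowError("simulated fault in primary")
    return math.sqrt(x)


def secondary_sqrt(x):
    """Less functional secondary: only the integer part of the square root."""
    return float(math.isqrt(int(x)))


def run_with_fallback(x, primary=primary_sqrt, secondary=secondary_sqrt):
    """Return (result, mode) so the caller knows when service is degraded."""
    try:
        return primary(x), "normal"
    except Exception:
        # The primary failed: swap in the secondary and keep going.
        return secondary(x), "degraded"


print(run_with_fallback(16.0))         # (4.0, 'normal')
print(run_with_fallback(2_000_000.0))  # (1414.0, 'degraded')
```

Returning the mode alongside the result is one possible design choice: it lets the rest of the system know that it is running on the less functional program.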
               Techniques Classification

• Failure Containment techniques:
  1. Failure analysis for containment:
     • We focus on analyzing the potential preconditions of
       failure/damage and set up different mechanisms for accident
       reduction/containment/control once a failure does occur
  2. Damage control:
     • We focus on how to limit the damage and severity of an
       accident once it has occurred; since damage and
       severity are domain specific, the containment mechanism is also
       domain specific (e.g. chemical hazards, mechanical safety, etc.)
Fault Tolerance Based on “Multiple Computation”
• The notion of fault tolerance via multiple
  computation (duplication) is used by both hardware
  and software systems. The general notion of multiple
  computation includes backup/recovery and touches
  upon 3 domains:
   – Time
   – Hardware
   – Software
• The general notation (from Avizienis) is nT/nH/nS:
   – We may repeat the processing n times (nT)
   – We may duplicate processing on n hardware units (nH)
   – We may use n versions of the software (nS)
                      More on nT/nH/nS
• Consider nT/1H/1S:
  – This is the situation where processing is performed several times
    on one hardware unit, using the same version of the software.
     • Backing up the information and recovering by reprocessing from the last
       checkpoint is an example of multiple processing over time.
• Consider 1T/nH/1S:
  – This is the situation where multiple, duplicative hardware is used
    with the same version of the software; it comes in two
    “flavors”:
     • replication, where two or more hardware units (not necessarily of the same
       kind) run the same software in parallel and some algorithm is used to pick
       the “correct” output, such as the majority vote in Triple Modular
       Redundancy
     • redundancy, where multiple identical instances of the same system are
       provided but only one is running; the system switches to another when the
       running one fails.
                 More on nT/nH/nS
• Consider 1T/nH/nS:
  – This is the case where we have multiple versions of software
    running on multiple hardware units, providing possibly different
    outputs. The key is the algorithm that determines which is
    the “right” output.
  – This case also needs a sophisticated operating system or
    runtime tool to gather the outputs from the multiple, possibly
    different, hardware and software.
• Consider 1T/1H/nS:
  – This is the case where we have multiple versions of software
    running on the same hardware, providing possibly different
    outputs. The key here, again, is the algorithm that
    determines the “correct” output.
  – This requires a compiler and runtime tool that
    facilitate multiple, parallel processing.
               N-version Programming

• N-version programming is a fault-tolerance technique
  introduced by Avizienis and Chen, based
  on the notion of “multiple computation.” The general
  scheme works as follows:
   1. There are multiple (n) independent versions of a program
      that provide identical functionality
   2. The same input is distributed to all n versions
   3. The individual outputs from all n versions are fed to a
      decision “box”
   4. The decision “box,” using some algorithm, chooses the
      appropriate answer as the output
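
The four steps above can be sketched as follows. This is an illustrative toy under stated assumptions, not the scheme as originally published: the three integer-square-root "versions" (one with a deliberately seeded rounding fault), the NoMajority exception, and the function names are all hypothetical, and the versions run sequentially here rather than in parallel.

```python
import math
from collections import Counter


class NoMajority(Exception):
    """Raised when no answer is backed by more than half of the versions."""


def isqrt_library(n):
    return math.isqrt(n)


def isqrt_binary_search(n):
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def isqrt_faulty(n):
    # Seeded fault: rounds to nearest instead of rounding down.
    return round(math.sqrt(n))


def decision_box(outputs, n):
    """Simple-majority rule over the outputs of n versions."""
    if outputs:
        answer, votes = Counter(outputs).most_common(1)[0]
        if 2 * votes > n:
            return answer
    raise NoMajority(f"no majority among {n} versions: {outputs}")


def n_version_run(versions, x):
    outputs = []
    for version in versions:  # step 2: same input to every version
        try:
            outputs.append(version(x))  # step 3: collect the outputs
        except Exception:
            pass  # a crashed version simply casts no vote
    return decision_box(outputs, n=len(versions))  # step 4


VERSIONS = [isqrt_library, isqrt_binary_search, isqrt_faulty]
print(n_version_run(VERSIONS, 15))  # 3: the faulty version (4) is outvoted
```

The majority is counted against n, the number of versions, rather than against the number of outputs received, so that a single surviving version cannot win by default when the others crash.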
                    N-version Programming


                 +--> Version 1 --+
                 |                |
                 +--> Version 2 --+
        Input ---+        .       +--> Decision “Box” --> Output
                 |        .       |
                 +--> Version n --+

        Note: this may be 1T/1H/nS or 1T/nH/nS
                    N-version Programming
• The decision “box” algorithm is an important factor in
  this approach.
   – The decision algorithm is often based on the assumption that
     the faults in the n versions are independent (the “failure
     independence” assumption mentioned earlier)
      • If the faults are independent, then any one fault is likely
        to be local to one version, and the other versions may
        process correctly with respect to that fault, even though they
        may have other faults of their own.
   – One popular algorithm is the “simple-majority” rule: use
     the answer given by the majority.
      • Note that the assumption that the majority is correct is not a
        “guarantee”: what if the majority were wrong!?
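
A rough way to put a number on that caveat: assume each of the n versions fails on a given input independently with the same probability p. Both the independence and the single value of p are idealizations, and the function name is hypothetical. The chance that no correct majority exists is then a binomial tail, which the sketch below computes.

```python
from math import comb


def prob_no_correct_majority(n, p):
    """Probability that at least half of n independent versions fail.

    With that many failures the correct answer cannot win a simple
    majority, so the decision box either reports no majority or outputs
    a wrong answer (if the failed versions happen to agree).
    """
    need = (n + 1) // 2  # smallest failure count that blocks a correct majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


for n in (1, 3, 5):
    print(n, prob_no_correct_majority(n, 0.1))
```

For p = 0.1 this gives about 0.1 for n = 1, 0.028 for n = 3, and 0.00856 for n = 5. The improvement depends entirely on the independence assumption: if the versions share faults, the real figure can be much closer to that of a single version.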
        Facilitating N-version Programming

• A way to make N-version programming
  more reliable is to achieve fault independence
  through version independence:

  1. Use diverse people to develop the different versions
  2. Use different development processes to develop the
     different versions
  3. Use different technologies, tools, programming languages,
     methodologies, etc. to develop the different versions
- So far, N-version programming has been found to be quite costly!
- What is a reasonable N?

*** Can we use N-version programming for security attack tolerance? ***
               Failure Containment

• Even with all the fault prevention and fault tolerance
  techniques, unfortunately, we will still have faults.
  In that case, can we (a) “prevent accidents” and
  (b) “reduce the damage of the accidents”?

• We already know that we cannot prevent all
  accidents. But we can analyze the hazards of an
  accident and hopefully contain or limit the damage.
                  Fault Tree Analysis
1. List the set of events that cause the “accident” or “failure”
2. Build an upside-down tree that logically connects the events
   to the failure

                  Security Break-in F1
                          |
                         AND
                   /             \
          Log-in Granted      Access to F1 “bug”
                |
                OR
            /        \
   Password        Log-in validation
   Exposed         “bug”

   - top event is the “accident”
   - circles (the leaf nodes) are primary events
   - AND/OR are logical conditions
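
The tree above can also be written down and evaluated mechanically. The sketch below is one possible encoding, not a standard tool: the nested-tuple representation, the event identifiers, and the brute-force search for minimal cut sets (the smallest combinations of primary events that cause the top event) are assumptions made for illustration.

```python
from itertools import combinations

# "Security Break-in F1" = (Password Exposed OR Log-in validation bug)
#                          AND Access-to-F1 bug
TREE = ("AND",
        ("OR", "password_exposed", "login_validation_bug"),
        "f1_access_bug")


def occurs(node, events):
    """True if `node` occurs, given the set of primary events that occurred."""
    if isinstance(node, str):
        return node in events
    gate, *children = node
    results = [occurs(child, events) for child in children]
    return all(results) if gate == "AND" else any(results)


def primary_events(node):
    if isinstance(node, str):
        return {node}
    return set().union(*(primary_events(child) for child in node[1:]))


def minimal_cut_sets(tree):
    """Brute force: fine for small trees, exponential in general."""
    names = sorted(primary_events(tree))
    found = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            candidate = set(combo)
            # Keep only sets that are not supersets of a smaller cut set.
            if occurs(tree, candidate) and not any(c <= candidate for c in found):
                found.append(candidate)
    return found


for cut in minimal_cut_sets(TREE):
    print(sorted(cut))
```

For this tree the two minimal cut sets both contain f1_access_bug, so removing that one primary event would eliminate the accident, while fixing only the log-in side would not.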
                     Containment
• We use the fault-tree to analyze and understand the
  cause of the “accident.” We may use it for:

   – Accident elimination
   – Accident reduction
   – Accident control


• The actual containment is a solution that is domain
  dependent and requires “domain specific”
  knowledge.

						