Fault Tolerance & Failure Containment
• Even after extensive quality engineering activities
(inspections, testing, formal correctness proofs, etc.),
some defects may still remain in the
software system.
• To keep the software systems operational we need
to consider the strategies of:
– Tolerating “local” faults
– Containing the failure damage from spreading
General Concept
• Both Fault Tolerance and Failure Containment are
practiced in many other disciplines:
– Mechanical systems use duplicate systems to back up
and replace any faulty equipment; the cost is the
duplicate hardware that replaces the faulty hardware
– Chemical systems use containment mechanisms when
there is leakage or melt-down damage; the cost is the
containment “wall” or “closure” employed to limit the
damage.
Examples in Software Systems
• A network fault (e.g. a virus on a single node) may be contained/tolerated:
– via network re-routing algorithms that provide alternative
routes to the destination, albeit slower ones
• Duplication of route (multiple and possibly duplicative connections)
• Containment of failure (slower, reduced performance, but traffic
eventually gets there)
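The re-routing idea can be sketched as a search for a route that avoids the failed node. This is only an illustration: the topology `LINKS`, the node names, and the function `find_route` are hypothetical, and breadth-first search stands in for whatever routing algorithm a real network would use.

```python
from collections import deque

def find_route(links, source, destination, failed=frozenset()):
    """Breadth-first search for a shortest route that avoids failed nodes."""
    if source in failed or destination in failed:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == destination:
            return path
        for neighbor in links.get(node, ()):
            if neighbor not in visited and neighbor not in failed:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no alternative route exists

# Hypothetical topology: A reaches D directly through B, or the long way via C and E.
LINKS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "E"],
    "E": ["C", "D"],
    "D": ["B", "E"],
}

normal_route = find_route(LINKS, "A", "D")             # goes through B
rerouted = find_route(LINKS, "A", "D", failed={"B"})   # longer, but still arrives
```

The duplicated connections are what make the detour possible; if both B and E fail, no route is left and the function returns `None`.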
• A device control mechanism fault may be tolerated:
– via a back-up mechanism that captures the check-pointed
information and restarts the processing from the last check-point,
albeit duplicating some processing
• Duplication of processing (re-processing of data from the last
check-point)
• Containment of failure (everything processed prior to the check-point
stands good)
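A minimal sketch of checkpoint and restart, under stated assumptions: the checkpoint is kept in memory (a real system would write it to stable storage), a fault shows up as a `RuntimeError`, and the names `process_with_checkpoints` and `flaky_double` are made up for illustration.

```python
def process_with_checkpoints(items, handle, checkpoint_every=3, max_restarts=5):
    """Process items in order, restarting from the last checkpoint on failure."""
    checkpoint = {"next_index": 0, "results": []}  # last known-good state
    restarts = 0
    while True:
        index = checkpoint["next_index"]
        results = list(checkpoint["results"])      # work on a copy of the saved state
        try:
            while index < len(items):
                results.append(handle(items[index]))
                index += 1
                if index % checkpoint_every == 0:
                    checkpoint = {"next_index": index, "results": list(results)}
            return results, restarts
        except RuntimeError:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up: the fault does not look transient

calls = []

def flaky_double(x):
    """Hypothetical handler that fails the first time it sees the value 4."""
    calls.append(x)
    if x == 4 and calls.count(4) == 1:
        raise RuntimeError("transient device fault")
    return 2 * x

results, restarts = process_with_checkpoints([0, 1, 2, 3, 4, 5], flaky_double)
```

Here the value 3 is processed twice (the duplicated processing), while the results for 0, 1 and 2, saved at the checkpoint, are never recomputed.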
Two Important Assumptions
1. Rare Event: The failure is rare and the probability of failure
is very low; it cannot be anticipated in advance and therefore calls for
fault tolerance and failure containment considerations. (e.g.
elevator control software goes into a stop state whenever
“anything” goes wrong.)
2. Failure Independence: Different components of the system
fail independently of one another and can be localized; thus
the localized mechanism may be replaced and/or the failure
may be contained. (e.g. different valve control programs in
process control software: if one fails, then we may want to
shut down the whole plant until the specific failing control
program is fixed. If these control programs are all linked or
coupled, then we may have a bigger problem.)
** Note that the second assumption is also why we
promote loose coupling in software design
Techniques Classification
• Fault Tolerance techniques:
1. Duplication:
• We may run multiple computations in parallel and pick a consensus
solution (n-version programming)
2. Backup/Recovery:
• Software (algorithm and db) runs with regular
checkpoints that back up the information processed; when a
problem is encountered, the software may recover by going
back to the last check-pointed data and reprocessing from there
• There is a primary and a secondary program; when
the primary fails, the secondary (less functional)
program may be swapped in to continue processing in a
degraded state.
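The primary/secondary arrangement might look like the sketch below. The pricing functions, the `DISCOUNTS` table, and the "normal"/"degraded" labels are all hypothetical, and catching every `Exception` is a simplification; a real system would decide which failures justify a swap.

```python
def run_with_fallback(primary, secondary, request):
    """Try the primary program; on failure, swap in the secondary one."""
    try:
        return primary(request), "normal"
    except Exception:
        # The secondary is less functional but keeps the service running.
        return secondary(request), "degraded"

DISCOUNTS = {"gold": 0.10, "silver": 0.05}  # hypothetical lookup table

def primary_pricing(order):
    """Full-function version: applies a tier discount (fails on an unknown tier)."""
    discount = DISCOUNTS[order["customer_tier"]]
    return round(order["amount"] * (1 - discount), 2)

def secondary_pricing(order):
    """Degraded version: no discount logic, just the list price."""
    return round(order["amount"], 2)

ok = run_with_fallback(
    primary_pricing, secondary_pricing,
    {"customer_tier": "gold", "amount": 200.0},
)
degraded = run_with_fallback(
    primary_pricing, secondary_pricing,
    {"customer_tier": "bronze", "amount": 200.0},
)
```

The second call hits a fault in the primary (an unknown tier) and still returns an answer, only without the discount.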
Techniques Classification
• Failure Containment techniques
1. Failure analysis for containment:
• We focus on analysis of the potential preconditions of
failure/damage and set up different mechanisms for accident
reduction/containment/control once a failure does occur
2. Damage control:
• We focus more on how to limit the damage and severity of an
accident once it has occurred; since the damage and
severity are domain specific, the containment mechanism is also
domain specific (e.g. chemical hazards, mechanical safety, etc.)
Fault Tolerance Based on “Multiple Computation”
• The notion of fault-tolerance via employing multiple-
computation (duplication) is used by both hardware
and software systems. The general notion of multiple
computation includes backup/recovery and touches
upon 3 domains:
– Time
– Hardware
– Software
• The general notation (from Avizienis) is nT/nH/nS:
– We may repeat the processing n-Times
– We may duplicate processing on n-Hardware
– We may use n-versions-of-Software
More on nT/nH/nS
• Consider nT/1H/1S:
– This is the situation where processing is performed several times
on one hardware, using the same version of software.
• Backing up the information and recovering by reprocessing from the last
checkpoint is an example of multiple processing over time.
• Consider 1T/nH/1S:
– This is the situation where multiple, duplicative hardware is used
with the same version of the software; it comes in two
“flavors”:
• replication, where two or more hardware units (not necessarily of the same
kind) run the same software in parallel and some algorithm is used to pick
the “correct” output, such as the “majority vote” in Triple Modular
Redundancy
• redundancy, where multiple identical instances of the same system are
provided but only one is running, with a switch to another when the
running one fails.
More on nT/nH/nS
• Consider 1T/nH/nS:
– This is the case where we have multiple versions of software
running on multiple hardware providing possibly different
outputs. The key is the algorithm that will determine what is
the “right” output.
– This case will also need a sophisticated operating system or
runtime tool to gather the outputs from the multiple, possibly
different hardware and software.
• Consider 1T/1H/nS:
– This is the case where we have multiple versions of software
running on the same hardware, providing possibly different
outputs. The key here, again, is the algorithm that
determines the “correct” output.
– This requires a compiler and runtime tools that
facilitate multiple, parallel processing
N-version Programming
• N-version programming is a fault-tolerance technique
that was introduced by Avizienis and Chen based
on the notion of “multiple computing.” The general
scheme works as follows:
1. There are n independent versions of a program
that perform identical functionality
2. The same input is distributed to all n versions
3. The individual outputs from all n versions are fed to a
decision “box”
4. The decision “box,” using some algorithm, chooses the
appropriate answer as the output
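The four steps above can be sketched as follows. Everything here is illustrative: the three integer-square-root versions are hypothetical (the third has a deliberate fault), the decision "box" uses the simple-majority rule, and the versions run one after another rather than in parallel.

```python
from collections import Counter
from math import isqrt

def decision_box(outputs, n):
    """Simple-majority rule: return the answer given by more than half of n versions."""
    if not outputs:
        raise RuntimeError("no version produced an output")
    answer, votes = Counter(outputs).most_common(1)[0]
    if 2 * votes <= n:
        raise RuntimeError("no majority among version outputs")
    return answer

def n_version_run(versions, value):
    """Feed the same input to every version and let the decision box choose."""
    outputs = []
    for version in versions:
        try:
            outputs.append(version(value))
        except Exception:
            pass  # a crashed version casts no vote
    return decision_box(outputs, len(versions))

def version_a(x):
    return isqrt(x)            # library routine

def version_b(x):
    r = 0
    while (r + 1) * (r + 1) <= x:
        r += 1
    return r                   # independent loop-based implementation

def version_c(x):
    r = 0
    while r * r < x:           # deliberate fault: rounds up instead of down
        r += 1
    return r

VERSIONS = [version_a, version_b, version_c]
answer = n_version_run(VERSIONS, 10)  # outputs are 3, 3 and 4
```

With input 10 the faulty version is outvoted; with input 16 all three versions agree.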
N-version Programming
[Figure: the Input is distributed to Version 1, Version 2, ..., Version n; their outputs feed the Decision “Box”, which produces the Output]
Note that this may be 1T/1H/nS or 1T/nH/nS
N-version Programming
• The Decision “box” algorithm is an important factor in
this approach.
– The decision algorithm is often based on the assumption that
the faults in n-versions are independent (the earlier
mentioned “failure independence”)
• This assumption says that if the faults are independent then it is likely
that any one fault is local to a version and the other versions may be
processing correctly with respect to this one fault, even though other
versions may have other faults of their own.
– One popular algorithm is the “simple-majority” rule: use
the answer given by the majority.
• Note that the assumption that the majority is correct is not a
“guarantee”: what if the majority were wrong!?
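How likely a wrong majority is can be estimated under the failure-independence assumption. The sketch below assumes each version is wrong on a given input with the same probability `p`, independently of the others; the function name and the value 0.1 are illustrative only. The result is the chance that more than half of the versions are faulty, which is the worst case for a majority vote (it assumes the faulty versions also agree with each other).

```python
from math import comb

def prob_majority_faulty(n, p):
    """Probability that more than half of n independent versions are wrong."""
    need = n // 2 + 1  # smallest number of wrong versions that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

single = prob_majority_faulty(1, 0.1)  # one version on its own
triple = prob_majority_faulty(3, 0.1)  # 3 * 0.1^2 * 0.9 + 0.1^3
```

With these numbers three versions lower the risk from 0.1 to about 0.028, but the benefit disappears if the versions share faults, which is why version independence matters.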
Facilitating N-version Programming
• A way to make N-version Programming
more reliable is to achieve fault independence
through version independence:
1. Use diverse people to develop the different versions
2. Use different development processes to develop the
different versions
3. Use different technology, tools, programming languages,
methodologies, etc. to develop the different versions
- So far, N-version Programming is found to be quite costly!
- What is a reasonable N?
***can we use N-version Programming for security attack tolerance?***
Failure Containment
• With all the fault prevention and fault tolerant
techniques, unfortunately, we will still have faults.
In that case, can we (a) “prevent accidents” and can
we (b) “reduce the damage of the accidents”?
• We already know that we cannot prevent all
accidents. But we can analyze the hazards of an
accident and hopefully contain or limit the damage.
Fault Tree Analysis
1. List the set of events that cause the “accident” or “failure”
2. Build an upside-down tree that logically connects the events
to the failure
Security Break in F1 (top event)
  AND
    Log-in Granted
      OR
        Log-in validation “bug”
        Password Exposed
    Access to F1 “bug”
- top event is the “accident”
- circles are primary events
- AND/OR are logical conditions
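A fault tree can be evaluated mechanically once it is written down. The sketch below encodes one reading of the example tree, in which the top event needs both a granted log-in (through a validation bug OR an exposed password) AND an access bug in F1; that reading, the event names, and the helper functions are assumptions made for illustration.

```python
def AND(*children):
    return ("AND", children)

def OR(*children):
    return ("OR", children)

def occurs(node, events):
    """Return True if the event at this node occurs, given the primary events."""
    if isinstance(node, str):           # primary event (a leaf of the tree)
        return events.get(node, False)
    gate, children = node
    results = [occurs(child, events) for child in children]
    return all(results) if gate == "AND" else any(results)

# Top event: security break in F1.
SECURITY_BREAK_IN_F1 = AND(
    OR("log-in validation bug", "password exposed"),  # log-in granted
    "access to F1 bug",
)

password_only = occurs(SECURITY_BREAK_IN_F1, {"password exposed": True})
both = occurs(
    SECURITY_BREAK_IN_F1,
    {"password exposed": True, "access to F1 bug": True},
)
```

Because the top gate is an AND, removing either branch (for example, fixing the access bug) is enough to eliminate this accident, which is the kind of conclusion the analysis is meant to support.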
Containment
• We use the fault-tree to analyze and understand the
cause of the “accident.” We may use it for:
– Accident elimination
– Accident reduction
– Accident control
• The actual containment is a solution that is domain
dependent and requires “domain specific”
knowledge.