Fault Tolerance Jenn Wei Lin


                   Jenn-Wei Lin
Department of Computer Science and Information Engineering
                Fu Jen Catholic University

        Motivation and Introduction
                     Lecture Set 1
            General Information
• Textbook
   – Martin L. Shooman: Reliability of Computer Systems and
     Networks: Fault Tolerance, Analysis, and Design, John Wiley and
     Sons, 2002.
   – D.P. Siewiorek and R.S. Swarz: Reliable Computer Systems:
     Design and Evaluation, 3rd ed. A. K. Peters, 1999.
   – D. K. Pradhan, editor, Fault-Tolerant Computer System Design,
     Prentice-Hall, 1996. The book is out of print
• Paper
   – Dependable Computing Conference
• Grading Policy
   – Exam. 20%
   – Presentation 40% (four)
   – Term report & Project 40%

                      ECE 753 Fault Tolerant Computing                 2
•   Motivation
•   Introduction
•   Terminology
•   Fundamental Principles
•   Fault-Error-Failure concept

•   Informal Definition
•   Key Attributes
•   Who, What and Why Study
•   Examples

• What is Fault-Tolerance?

 A “fault-tolerant system” is one that
 continues to perform at the desired level of
 service in spite of failures in some of the
 components that constitute the system.

         Motivation (contd.)
• Who is concerned about fault-tolerance?
  – System Users
• Who is concerned at design stages?
  – Universities
     • R, d, and a (Research, development,
  – Industry
     • r, D, and A (research, Development,
        Motivation (contd.)

• General Purpose Systems
  – PCs: RAMs with parity checks
  – Workstations: error detection (HW), occasional
    corrective action (SW), ECC (HW), keeping
    log (SW)

          Motivation (contd.)

• Reliable Systems
  –   Telephone systems
  –   Banking systems e.g. ATM
  –   Stock market
  –   Football games display/ticketing

         Motivation (contd.)

• Critical and Life Critical Systems
  –   Manned and unmanned space borne systems
  –   Aircraft control systems
  –   Nuclear reactor control systems
  –   Life support systems

        Motivation (contd.)

• Reliable -> Critical Systems
  – 911 telephone switching system
  – Traffic light control system
  – Automobile control system (ABS, Fuel
    injection system)

– Historical perspective and major push
– Goals of fault-tolerance
– Applications of fault-tolerance

          Introduction (contd.)
• Historical Perspective
   – not a new concept
   – first used by J. von Neumann, 1956
• Major push
   – Space program
   – HW Fault tolerance - then
   – SW Fault tolerance later
   – Merge the two

         Introduction (contd.)
• Applications
  – Space borne system
     • long life system
  – Airplane control system
     • critical system
  – Transaction processing system
     • high availability system
  – Switching system
     • high availability over certain level of performance
• Reliability and concept of probability
  – R(t): conditional probability that a system provides continuous
    proper service in the interval [0,t] given that it provided desired
    service at time 0.
• Availability
  – The probability that an item is up at any point in time
  – Uptime/(Uptime+Downtime)
• Dependability
  – Property of a computer system that allows
    reliance to be placed justifiably on the service it delivers
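
As a rough numerical illustration of the measures above, the sketch below computes availability as the uptime ratio and evaluates R(t) under an assumed constant failure rate, i.e. the exponential model R(t) = exp(-λt). The slide does not state that model; it is chosen here only because it is the simplest common case. The function names and sample figures are hypothetical.

```python
import math


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability = Uptime / (Uptime + Downtime), as defined on the slide."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observation time must be positive")
    return uptime_hours / total


def reliability_exponential(failure_rate_per_hour: float, t_hours: float) -> float:
    """R(t) under an ASSUMED constant failure rate: R(t) = exp(-lambda * t)."""
    if failure_rate_per_hour < 0 or t_hours < 0:
        raise ValueError("failure rate and time must be non-negative")
    return math.exp(-failure_rate_per_hour * t_hours)


if __name__ == "__main__":
    # Hypothetical figures: 10 hours of downtime in a year of 8760 hours.
    print(availability(8750.0, 10.0))
    # Hypothetical failure rate of 1e-4 per hour over a 1000-hour mission.
    print(reliability_exponential(1e-4, 1000.0))
```

Note the difference in what is measured: availability counts the fraction of time the system is up, while R(t) is the probability of no failure at all over the whole interval [0,t].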
        Fundamental Principles
• Dependability
• Impairments
  – Faults, errors, failures
• Means
  – Fault Avoidance, Fault Tolerance, Fault Removal, Fault Forecasting
• Measures
  – Reliability, Availability, Maintainability
 Fundamental Principles (contd.)
• A set of methods, tools and solutions that enable
  development of dependable systems.
  - Fault Prevention: how to prevent fault occurrence or introduction,
  - Fault Tolerance: how to ensure a service capable of fulfilling
  the system’s function in the presence of faults,
  - Fault Removal: how to reduce the presence (number,
  seriousness) of faults,
  - Fault Forecasting: how to estimate the present number,
  the future incidence, and the consequences of faults

 Fundamental Principles (contd.)
• Fault Avoidance: To prevent, by construction,
  fault occurrence. E.g., nearly fault-free
  components, shielding against electromagnetic interference
   – Drawbacks:
      - Cost of near-perfect components high
      - Cost of maintenance personnel
• Fault Tolerance: To provide, by redundancy,
  service complying with specification in spite of
  faults occurring
 Fundamental Principles (contd.)
• Fault Removal: To minimize, by
  verification, the presence of faults. E.g. Am
  I building the right system? Concepts of
• Fault Forecasting: To estimate, by
  evaluation, the presence, occurrence and
  consequences of faults. E.g. For how long
  will the system be right?

 Fundamental Principles (contd.)
• Reliability: A measure of continuous delivery of
  proper service (or equivalently, of the time to
  failure) from a reference initial time
• Availability: A measure of the delivery of the
  proper service with respect to the alternation of
  delivery of proper and improper service
• Maintainability: A measure of continuous delivery
  of improper service (time to restoration or repair)
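
The three measures can be estimated from an observation log that alternates between periods of proper and improper service. The sketch below assumes such a log is available as two lists of durations; the names MTTF (mean time to failure) and MTTR (mean time to restoration) and the steady-state ratio MTTF / (MTTF + MTTR) are standard conventions rather than something stated on the slide, and the function name is hypothetical.

```python
from statistics import mean


def summarize_service_log(up_periods, down_periods):
    """Return (MTTF, MTTR, steady-state availability).

    up_periods:   durations of proper service, each ending in a failure
    down_periods: durations of improper service, each ending in a restoration
    """
    if not up_periods or not down_periods:
        raise ValueError("need at least one up period and one down period")
    mttf = mean(up_periods)      # reliability measure: time to failure
    mttr = mean(down_periods)    # maintainability measure: time to restoration
    # Availability follows from the alternation of the two kinds of period.
    return mttf, mttr, mttf / (mttf + mttr)


if __name__ == "__main__":
    # Hypothetical log, in hours.
    print(summarize_service_log([90.0, 110.0], [5.0, 15.0]))
```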

 Fundamental Principles (contd.)
• Hardware redundancy
     • Low level
     • High level
• Software Redundancy
• Time Redundancy
• Information Redundancy

Fundamental Principles (contd.)
• Hardware Redundancy - Low level
  – logic level
     • Example 1 - Self checking circuits
     • Example 2 - Arithmetic code
          A modular adder using the mathematical principle
          (A+B) mod k = ((A mod k) + (B mod k)) mod k

• Hardware Redundancy - High level
  – Triplicate or 5-copies as in space shuttle
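
Both levels of hardware redundancy can be mimicked in software to see how they behave. The sketch below is only an illustration under stated assumptions: `checked_add` applies the mod-k identity above as a residue check on an adder's output, and `majority_vote` plays the role of the voter in a triplicated (or 5-copy) arrangement. The function names, the default modulus k = 3, and the injectable `adder` parameter are hypothetical.

```python
from collections import Counter


def checked_add(a: int, b: int, k: int = 3, adder=lambda x, y: x + y) -> int:
    """Add a and b, then verify the result with a mod-k residue check.

    The check relies on (A + B) mod k == ((A mod k) + (B mod k)) mod k.
    `adder` stands in for the (possibly faulty) main adder.
    """
    result = adder(a, b)
    expected_residue = ((a % k) + (b % k)) % k
    if result % k != expected_residue:
        raise ArithmeticError("residue check failed: adder output is erroneous")
    return result


def majority_vote(outputs):
    """Return the value produced by a strict majority of redundant modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among module outputs")
    return value


if __name__ == "__main__":
    print(checked_add(17, 25))
    # Three copies of a module, one of which returns a wrong value.
    print(majority_vote([7, 7, 9]))
```

The residue check only detects an error, and it misses any error that changes the sum by a multiple of k. The voter masks a minority of wrong outputs, at the cost of replicating the whole module.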

Fundamental Principles (contd.)
• Software Redundancy
  – Use two different programs/algorithms
• Time Redundancy
  – Re-compute or redo the task and compare the results
  – May or may not use the same hardware/software
• Information Redundancy
  – backup information
  – Use of ECC
• Question - What kind of FT is achieved?
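
The three forms of redundancy listed above can each be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: two different summation algorithms stand in for "two different programs", `run_twice` recomputes on the same hardware and software, and an even-parity bit stands in for the simplest information redundancy (a real ECC would carry enough check bits to correct as well as detect). All function names are hypothetical.

```python
def sum_iterative(values):
    """Version 1: straightforward left-to-right accumulation."""
    total = 0
    for v in values:
        total += v
    return total


def sum_pairwise(values):
    """Version 2: divide-and-conquer (pairwise) summation."""
    values = list(values)
    if len(values) <= 1:
        return values[0] if values else 0
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])


def run_diverse(values):
    """Software redundancy: accept the result only if both versions agree."""
    a, b = sum_iterative(values), sum_pairwise(values)
    if a != b:
        raise RuntimeError("versions disagree")
    return a


def run_twice(task, *args):
    """Time redundancy: redo the task and compare the two results."""
    first, second = task(*args), task(*args)
    if first != second:
        raise RuntimeError("results differ between runs")
    return first


def add_even_parity(bits):
    """Information redundancy: append a bit so the count of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True if the word (data bits plus parity bit) has an even count of 1s."""
    return sum(word) % 2 == 0
```

As one possible answer to the question above: each of these comparisons detects a disagreement but cannot by itself say which copy is right, so the sketch achieves error detection rather than masking or correction.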
      Fault-Error-Failure concept
•   Intuitive definitions
•   Origins of faults
•   Methods to break FEF chain
•   Attribute of faults

Fault-Error-Failure concept (contd.)
                 Intuitive definitions
• Fault -
  – An anomalous physical condition caused by a
    manufacturing problem, fatigue, external disturbance
    (intentional or unintentional), design flaw, …
  – Causes
• Error - Effect of activation of a fault
• Failure - overall system effect of an error
             Fault -> Error -> Failure
Fault-Error-Failure concept (contd.)
• Failure occurs when the delivered service
  deviates from the specified service; failures are
  caused by errors
• Error is the manifestation of a fault within a
  program or data structure
• Fault is an incorrect state of hardware or software
  resulting from failures of components, physical
  interferences from the environment, operator error
  or incorrect design
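
The chain Fault -> Error -> Failure can be traced on a toy example. The sketch below assumes a hypothetical 8-bit register with one bit stuck at 0 (the fault) and a hypothetical service that reports whether a reading has reached a limit. It shows that a fault can stay dormant, and that an error in the stored state does not always reach the delivered service.

```python
class StuckAtZeroRegister:
    """8-bit register whose bit `stuck_bit` is permanently stuck at 0 (the fault)."""

    def __init__(self, stuck_bit: int):
        self.stuck_mask = 1 << stuck_bit
        self.value = 0

    def write(self, value: int) -> None:
        # The fault is activated only when the written value has the stuck bit set.
        self.value = (value & 0xFF) & ~self.stuck_mask


def trace_limit_check(reg: StuckAtZeroRegister, reading: int, limit: int = 5):
    """Store a reading, then report (error, failure) for a 'reading >= limit' service.

    error:   the stored state differs from what was written
    failure: the delivered answer differs from the specified answer
    """
    reg.write(reading)
    error = reg.value != reading
    delivered = reg.value >= limit
    specified = reading >= limit
    return error, delivered != specified


if __name__ == "__main__":
    reg = StuckAtZeroRegister(stuck_bit=0)
    for reading in (6, 7, 5):
        print(reading, trace_limit_check(reg, reading))
```

With the least significant bit stuck at 0, writing 6 leaves the fault dormant, writing 7 produces an error that the service output happens to hide, and writing 5 produces an error that becomes a failure.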

Fault-Error-Failure concept (contd.)
                              Causes of faults
•   Specification mistakes
    –   Incorrect algorithms, architectures, or hardware and software design
•   Implementation mistakes
    –   Process of transforming hardware and software specifications into the
        physical hardware and the actual software
    –   Poor design, poor component selection, poor construction, software
        coding mistakes
•   Component defects
    –   Manufacturing imperfections, random device defects, and component
•   External disturbance
    –   Radiation, electromagnetic interference, operator mistakes, battle
        damage, and environmental extremes

Fault-Error-Failure concept (contd.)
                  Causes of faults
 [Figure: causes of faults. Labels recoverable from the diagram:
  Specification Mistakes, External (label incomplete), Component Defects;
  Software Faults, Hardware Faults; Failures.]

Fault-Error-Failure concept (contd.)
                           Characteristics of faults
•    Fault nature
    – Specify the type of fault
        •   Is the fault a hardware or a software fault?
•    Fault duration
    – Specify the length of time that a fault is active
        •   Permanent fault
        •   Transient fault
             – Appear and disappear within a very short period of time
        •   Intermittent fault
             – Appear, disappear, and reappear repeatedly
•   Fault extent
    –   Fault is localized to a given hardware or software module or globally
        affects the hardware, the software, or both.
•   Fault value
    –   Determinate or indeterminate
        • Fault sensitive to either the data or time
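
The four characteristics above form a small taxonomy, which can be written down as a data structure. The sketch below is one possible encoding; the class and member names are hypothetical, and the `retry_may_help` rule is an added assumption (recomputation cannot succeed while a permanent fault is present), not something stated on the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Nature(Enum):
    HARDWARE = "hardware"
    SOFTWARE = "software"


class Duration(Enum):
    PERMANENT = "permanent"
    TRANSIENT = "transient"        # appears and disappears within a short time
    INTERMITTENT = "intermittent"  # appears, disappears, and reappears repeatedly


class Extent(Enum):
    LOCAL = "local"    # confined to one hardware or software module
    GLOBAL = "global"  # affects the hardware, the software, or both


class Value(Enum):
    DETERMINATE = "determinate"
    INDETERMINATE = "indeterminate"


@dataclass(frozen=True)
class FaultDescriptor:
    nature: Nature
    duration: Duration
    extent: Extent
    value: Value

    def retry_may_help(self) -> bool:
        """Assumed heuristic: redoing the task can only help if the fault is not permanent."""
        return self.duration is not Duration.PERMANENT


if __name__ == "__main__":
    upset = FaultDescriptor(Nature.HARDWARE, Duration.TRANSIENT,
                            Extent.LOCAL, Value.INDETERMINATE)
    print(upset, upset.retry_may_help())
```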
