									                         EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH
                              European Laboratory for Particle Physics
Large Hadron Collider Project




                    FAULT-TOLERANT SYSTEMS
                                IN A NUTSHELL




                                          by
                                        Manuel




                    Some real-world failures (SEN)
          -       An autopilot error sent a China Airlines 747 into an uncontrolled
                  dive near San Francisco (Feb 1985).
          -       The launch of Mariner 1 failed because its FORTRAN program had
                  a full stop instead of a comma in a DO statement (1962).
          -       A computer error made a ghost train appear in the display
                  system of the San Francisco train station (May 1983).
          -       The radar of the NORAD defense system mistook the moon
                  for an enemy missile (1979).
          -       A United Airlines 767 iced up because its fuel-saving
                  system was “too” efficient (Aug 1983).
          -       An error in a Soviet missile set its target to Hamburg instead
                  of the Arctic Ocean (Dec 1984).
          -       A computer error caused a nuclear reactor in Florida to
                  overheat (Feb 1980).

              If this has happened in SIL 4 systems, why not in ours?




                                        Outline

             1. Introduction and Necessity
             2. A bit of Mathematics
             3. Reaching the required dependability
             4. Real-world fault-tolerant examples
             5. Conclusions and Future Work
                                1. Introduction and Necessity

                        Some definitions

               Dependability: guarantee that the system works. It covers all or
               a subset of the following attributes:
                   Reliability: the system works without interruption
                   Safety: the system prevents catastrophic failures
                   Availability: the system is ready for use as much of the
                   time as possible
                   Maintainability: the system is easy to repair
                   Security: the system is protected against external
                   intrusions

                    Necessity of high dependability systems

               First engineering systems:
                  Low reliability. Ex: vacuum-tube computers
                  Reliability only in terms of fault avoidance
                  To provide fault tolerance:
                      First steps in redundancy (but very limited)
                      Error-detecting and error-correcting codes
               During the 1960s: real-time, aeronautics and space applications
                  Increased reliability, safety, availability, maintainability
                  First fault-tolerant systems

                    Necessity of high dependability systems

        Nowadays: well-established theory and practice of reliability and
        fault tolerance (R&FT)
                Defect-free engineering systems:
                   Test generation
                   Fault simulation
                   Design for testability
                Error detection and correction techniques for FT and
                self-checking circuits
                Architectures of fault-tolerant systems
                Analytical and simulation techniques for fault probability
                computation
                Reliability in SW (similar to HW techniques)

                 Objectives of FTS: An application approach
           Long-life systems: spacecraft (Voyager), artificial satellites
                    Objective: High Reliability
           Critical application systems: avionics, nuclear power stations
                    Objective: High Safety
           High-availability systems: banking, telephony
                    Objective: High Availability
           Systems that are hard to maintain: remote-controlled systems
                    Objective: High Reliability and/or Availability
           Ours: BIC and PIC
                    Objective: High Reliability, Safety, Availability & Maintainability
                                    2. A bit of Mathematics

                        Fault-Rate distribution

               “Bathtub” curve: depicts the failure characteristic of
               components over time (λ = fault rate, i.e. failures per unit
               of time)

               [Figure: bathtub curve of λ against t, with regions I, II and III]

                                    I: burn-in area: from months to 5 years
                                    II: useful life area: from 5 to 25 years
                                    III: wear-out area: more than 25 years

                  Functions for the evaluation of FTS

               Reliability R(t): probability that the system works during a
               given time t without interruptions
               T: time until a device fault (random variable), $0 \le T \le \infty$
               F(t): distribution function of T

                   $F(t) = P\{T \le t\} = \int_0^t f(\tau)\,d\tau$

                   $R(t) = P\{T > t\} = 1 - P\{T \le t\} = 1 - F(t)$

               For area II, λ is constant, so:

                   $f(t) = \lambda e^{-\lambda t}$
                   $R(t) = e^{-\lambda t}$
                   $F(t) = 1 - e^{-\lambda t}$

               [Figure: F(t) rising towards 1 and R(t) decaying towards 0, plotted against t]
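
The three functions for the constant-λ region can be sketched in a few lines of Python. This is only an illustration of the exponential model, f(t) = λ·e^(−λt), R(t) = e^(−λt), F(t) = 1 − R(t); the function names and the value of `lam` are assumptions made for the example, not figures from the BIC or PIC.

```python
import math

def density(t, lam):
    """f(t) = lam * exp(-lam * t): probability density of the time to fault T."""
    return lam * math.exp(-lam * t)

def unreliability(t, lam):
    """F(t) = P{T <= t} = 1 - exp(-lam * t)."""
    return 1.0 - math.exp(-lam * t)

def reliability(t, lam):
    """R(t) = P{T > t} = 1 - F(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

lam = 1e-4  # hypothetical fault rate, in faults per hour
for t in (0.0, 1_000.0, 10_000.0, 50_000.0):
    print(f"t={t:>8.0f} h  R={reliability(t, lam):.4f}  F={unreliability(t, lam):.4f}")
```

At t = 1/λ the reliability has already dropped to e^(−1), roughly 37%.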

                  Functions for the evaluation of FTS

               Instantaneous fault rate Z(t): probability that a system
               which has not failed until a time t fails between t and t+Δt,
               divided by Δt, as Δt → 0

                   $P\{t < T \le t+\Delta t \mid T > t\} = \frac{P(t < T \le t+\Delta t)}{P(t < T)} = \frac{F(t+\Delta t) - F(t)}{R(t)}$

                   $Z(t) = \lim_{\Delta t \to 0} \frac{F(t+\Delta t) - F(t)}{\Delta t \cdot R(t)} = \frac{\frac{d}{dt}F(t)}{R(t)} = \frac{f(t)}{R(t)}$

                Z(t) is constant for an exponential distribution:

                   $f(t) = \lambda e^{-\lambda t},\quad R(t) = e^{-\lambda t} \;\Rightarrow\; Z(t) = \lambda$
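
The definition of Z(t) can be checked numerically. The sketch below estimates Z(t) with a finite difference, [F(t+Δt) − F(t)] / (Δt·R(t)), and applies it to the exponential distribution, where the result should stay close to λ for every t. The helper name `hazard_rate`, the step `dt` and the value of `lam` are assumptions chosen for illustration.

```python
import math

def hazard_rate(F, t, dt=1e-3):
    """Finite-difference estimate of Z(t) = [F(t+dt) - F(t)] / (dt * R(t))."""
    R = 1.0 - F(t)
    return (F(t + dt) - F(t)) / (dt * R)

lam = 2e-3  # hypothetical constant fault rate (useful-life region)

def F_exp(t):
    """Distribution function of an exponentially distributed time to fault."""
    return 1.0 - math.exp(-lam * t)

for t in (0.0, 100.0, 1_000.0):
    print(f"t={t:>6.0f}  Z(t)~{hazard_rate(F_exp, t):.6f}  lambda={lam}")
```

Passing a different distribution function `F` (for instance one fitted to burn-in or wear-out data) would show a Z(t) that varies with t.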

                  Functions for the evaluation of FTS

               Working time WT(R): time at which the reliability falls
               to a given value R

               Considering an exponential distribution:

                   $R(t) = e^{-\lambda t} \;\Rightarrow\; WT(R) = -\frac{\ln R}{\lambda}$, with $R(WT(R)) = R$ and $WT(R(t)) = t$

               [Figure: R(t) decaying from 1; the level R is reached at t = WT(R)]
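
As a small worked example, the working time under the exponential model is WT(R) = −ln(R)/λ. The sketch below computes it and checks the round trip R(WT(R)) = R; the function names and the fault rate are illustrative assumptions.

```python
import math

def reliability(t, lam):
    """R(t) = exp(-lam * t) for a constant fault rate lam."""
    return math.exp(-lam * t)

def working_time(R, lam):
    """WT(R) = -ln(R) / lam: time at which the reliability has fallen to R."""
    if not 0.0 < R <= 1.0:
        raise ValueError("R must lie in (0, 1]")
    return -math.log(R) / lam

lam = 1e-5  # hypothetical fault rate, in faults per hour
for target in (0.99, 0.9, 0.5):
    wt = working_time(target, lam)
    print(f"R={target}: WT={wt:,.0f} h, check R(WT)={reliability(wt, lam):.4f}")
```

Demanding a higher R shortens the working time sharply: going from R = 0.9 to R = 0.99 cuts WT by roughly a factor of ten.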

                  Functions for the evaluation of FTS

               Availability A(t): probability that a system works at time t.
               It says nothing about what has happened before.
                   Non-repairable system: R(t) = A(t)
                   Repairable system: R(t) ≤ A(t)
               Safety S(t): probability that a system does not fail until a
               time t or, if it has failed, that the failure was not
               catastrophic (safe failure)

                  Functions for the evaluation of FTS

               Mean Time To Fault (MTTF): expected working time until
               the first fault.

                   $MTTF = \int_0^\infty t \cdot f(t)\,dt$

                   $f(t) = \frac{d}{dt}F(t) = \frac{d}{dt}\bigl(1 - R(t)\bigr) = -\frac{d}{dt}R(t)$

                   $MTTF = -\int_0^\infty t \cdot \frac{d}{dt}R(t)\,dt = -\int_0^\infty t\,dR(t) = \bigl[-t \cdot R(t)\bigr]_0^\infty + \int_0^\infty R(t)\,dt = \int_0^\infty R(t)\,dt$

                Considering an exponential distribution: $R(t) = e^{-\lambda t} \;\Rightarrow\; MTTF = \frac{1}{\lambda}$
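
The result MTTF = ∫₀^∞ R(t) dt can be verified numerically. The sketch below integrates R(t) with the trapezoidal rule up to a finite horizon and compares the result with 1/λ for the exponential case. The helper `mttf_numeric`, the horizon of 30/λ and the value of `lam` are assumptions made for this check.

```python
import math

def mttf_numeric(R, t_max, steps=100_000):
    """Trapezoidal approximation of the integral of R(t) from 0 to t_max."""
    h = t_max / steps
    total = 0.5 * (R(0.0) + R(t_max))
    for i in range(1, steps):
        total += R(i * h)
    return total * h

lam = 1e-3  # hypothetical fault rate, in faults per hour

def R_exp(t):
    """Reliability of the exponential model."""
    return math.exp(-lam * t)

# Beyond 30/lam the remaining tail of R(t) is negligible (about e^-30 / lam).
estimate = mttf_numeric(R_exp, t_max=30.0 / lam)
print(f"numeric MTTF = {estimate:.3f} h, analytic 1/lambda = {1.0 / lam:.3f} h")
```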

                  Functions for the evaluation of FTS

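               Mean Time To Repair (MTTR): average expected time to
               repair a system.

                   $MTTR = \frac{1}{\mu}$, where μ is the repair rate

               Mean Time Between Faults (MTBF): $MTBF = MTTF + MTTR$

               [Figure: timeline starting at t = 0: MTTF up to the 1st fault, MTTR for the repair, MTTF up to the 2nd fault, MTTR again; MTBF is the distance between the 1st and the 2nd fault]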
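
A short numerical sketch ties these quantities together. It also computes the steady-state availability MTTF / (MTTF + MTTR), a standard textbook result that is not stated on the slide and is included here as an assumption about how A(t) behaves for large t in a repairable system. The rates `lam` and `mu` are hypothetical.

```python
def mttf(lam):
    """MTTF = 1 / lambda for a constant fault rate."""
    return 1.0 / lam

def mttr(mu):
    """MTTR = 1 / mu for a constant repair rate."""
    return 1.0 / mu

def mtbf(lam, mu):
    """MTBF = MTTF + MTTR."""
    return mttf(lam) + mttr(mu)

def steady_state_availability(lam, mu):
    """Long-run fraction of time the system is working: MTTF / MTBF."""
    return mttf(lam) / mtbf(lam, mu)

lam = 1e-4  # hypothetical fault rate: one fault per 10,000 h
mu = 0.5    # hypothetical repair rate: one repair per 2 h
print(f"MTTF={mttf(lam):.0f} h  MTTR={mttr(mu):.0f} h  MTBF={mtbf(lam, mu):.0f} h")
print(f"steady-state availability = {steady_state_availability(lam, mu):.6f}")
```

Shortening the repair time raises availability without changing reliability, which is why R(t) and A(t) differ for repairable systems.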

                  Computation of the Fault Rate

               Methods:
                       Experimental: difficult, sometimes even impossible
                       Estimation: Ex: standard MIL-HDBK-217 for ICs

                           $\lambda = \pi_L \cdot \pi_Q \cdot (c_1 \cdot \pi_T + c_2 \cdot \pi_E) \cdot \pi_P$

                        where:
                                     λ = faults per 10^6 hours
                                     π_L = Learning Factor (new design or not)
                                     π_Q = Quality Factor (military, commercial)
                                     π_T = Temperature Factor
                                     π_E = Environment Factor (surface, plane, missile)
                                     π_P = Pin Factor (number of pins)
                                     c_1, c_2 = Complexity Factors (number of gates)
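
The estimation method can be illustrated with a small calculation. The sketch assumes the formula has the form λ = π_L · π_Q · (c1 · π_T + c2 · π_E) · π_P, with λ in faults per 10^6 hours. Every factor value below is invented for the example; real values have to be taken from the handbook tables for the actual component.

```python
def ic_fault_rate(pi_L, pi_Q, c1, pi_T, c2, pi_E, pi_P):
    """Estimated fault rate of an IC, in faults per 10^6 hours (assumed formula)."""
    return pi_L * pi_Q * (c1 * pi_T + c2 * pi_E) * pi_P

# Hypothetical factors, for illustration only (NOT taken from MIL-HDBK-217 tables).
factors = dict(
    pi_L=1.0,   # learning factor: mature design
    pi_Q=2.0,   # quality factor
    c1=0.01,    # complexity factor weighted by temperature
    pi_T=1.5,   # temperature factor
    c2=0.005,   # complexity factor weighted by environment
    pi_E=4.0,   # environment factor
    pi_P=1.2,   # pin factor
)

lam_per_million_hours = ic_fault_rate(**factors)
mttf_hours = 1e6 / lam_per_million_hours  # MTTF = 1 / lambda, converted to hours
print(f"lambda = {lam_per_million_hours:.4f} faults per 10^6 h")
print(f"MTTF   = {mttf_hours:,.0f} h")
```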
                                3. Reaching the required dependability




                        Chain: Fault – Error – Failure

               Fault: defect or physical imperfection of HW or SW.
               It can generate errors
               Error: wrong internal state of the system.
               It is a consequence of one or more faults and can generate a failure
               Failure: occurs when the system delivers a service other than
               the specified one. The users are aware of the malfunction.
               It is a consequence of one or more errors

               Fault (physical level) → Error (information level) → [Barrier of FTS] → Failure (external level)




                        HW and SW faults

        HW:
                Caused by (in ICs):
                  Surface defects (Si wafer): impurities, contamination,…
                  Oxide defects: oxidation of gates, terminals,…
                  Diffusion defects in the creation of p & n zones
                  Contact defects: corrosion, metallization,…
                  I/O defects: overloads, open circuits,…
                  Volume defects: fracture of the Si wafer,…
                  Technology: CMOS → 38% are surface faults
                                TTL → 51% are metallization faults




                        HW and SW faults

        HW:
                Manifested as:
                  Logic circuits: Stuck-at 0 or 1 (80%), open circuits,…
                  ROM: bad addressing, bad selection, bad cell content,…
                  RAM: multiple writing, incorrect access time, transient
                  cell content change, pattern sensitivity, no refresh,…
                  PLD: stuck-at, cross-point faults,…
                  μprocessors: invalid program flow (40%), incorrect
                  opcode address, invalid read address, non-existent
                  memory,…




                        HW and SW faults

               SW:
                       Caused by:
                         HW faults
                         Bad specifications of the problem to solve
                         Bad coding
                         Faults in compilers
                         Faults in the operating system




                        HW and SW faults

               SW:
                       Manifested as:
                         The computer is down (bug in the OS or program)
                         Unbounded execution time of a program
                         Unexpected computation results
                         Missed deadlines in real-time applications
                         Real-World examples:
                                 Launch failure of Mariner I (1962)
                                 Destruction of a French meteorological satellite (1968)
                                 Problems in the Apollo mission (1970s)…




                      Barriers to prevent failures: FA & FT


               FAULT ORIGINS:
                   Design & specification
                   Implementation & validation
                   Internal physical causes
                   External physical causes
                   Interaction & operation
                        ↓  Barrier I
               FAULT CONSEQUENCES:
                   Software faults
                   Hardware faults
                        ↓  Barrier II
               ERRORS
                        ↓  Barrier III
               FAILURES




                      FA & FT in HW

               Fault Avoidance (Barrier I): careful design using shielding,
               radiation protection, noise protection (filtering), etc.
               Fault Tolerance (Barriers II & III):
                  Static redundancy (Barrier II): fault masking.
                  Ex: N-modular redundant system
                  Dynamic redundancy (Barrier III): detection, isolation,
                  reconfiguration and recovery. Ex: reconfigurable
                  duplication system
                  Hybrid redundancy: combination of both. Ex: N-modular
                  redundant system with back-up
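
Fault masking in an N-modular redundant system comes down to a majority voter. The sketch below is a software model of that idea, not a hardware design: three hypothetical modules compute the same function, one of them is deliberately faulty, and the voter masks its output. The names `majority_vote`, `healthy_module` and `stuck_module` are illustrative.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: the fault cannot be masked")
    return value

def healthy_module(x):
    """Correct implementation of the replicated function."""
    return x * x

def stuck_module(x):
    """Hypothetical faulty module whose output is stuck at a fixed value."""
    return -1

def tmr(x, modules=(healthy_module, stuck_module, healthy_module)):
    """Run every module on the same input and vote on the results."""
    return majority_vote([module(x) for module in modules])

print(tmr(3))  # the single faulty module is outvoted
```

With two faulty modules that disagree with each other the voter finds no majority and raises an error; this is the point at which static redundancy alone stops helping and detection plus reconfiguration (dynamic redundancy) would be needed.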




                      FA & FT in SW

        Fault Avoidance (Barrier I): careful design and implementation
        Fault Tolerance (Barriers II & III):
           Static redundancy (Barrier II): fault masking.
           Ex: N-version programming
           Dynamic redundancy (Barrier III): detection, isolation,
           reconfiguration and recovery. Ex: recovery blocks
           Hybrid redundancy: combination of the two
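
The recovery-block idea can be sketched as follows: a primary routine runs first, an acceptance test checks its result, and alternates are tried in turn when the test fails or the routine crashes. This is a minimal illustration under simplifying assumptions: the routines are pure functions, so there is no system state to checkpoint and restore, and all names (`recovery_block`, `primary_sqrt`, `backup_sqrt`, `sqrt_acceptable`) are hypothetical.

```python
import math

def recovery_block(alternates, acceptance_test, x):
    """Try each alternate on the same input; return the first accepted result."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # a crash counts as a detected error: try the next alternate
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

def primary_sqrt(x):
    """Hypothetical buggy primary routine."""
    return x / 2.0

def backup_sqrt(x):
    """Simpler, trusted alternate."""
    return math.sqrt(x)

def sqrt_acceptable(x, y):
    """Acceptance test: y is accepted if y * y reproduces x closely enough."""
    return abs(y * y - x) <= 1e-9 * max(1.0, x)

print(recovery_block([primary_sqrt, backup_sqrt], sqrt_acceptable, 9.0))
```

N-version programming would instead run all versions and vote on their outputs, masking the fault rather than detecting and recovering from it.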




                      Validation means
            Fault prediction through:
                    Evaluation of theoretical models (to compare architectures):
                            Markov chains
                            Stochastic Petri nets
                            Combinatorial models and fault trees
                    Experimental injection of faults:
                            Logical: simulation (VHDL)
                            Physical
            Fault elimination through detection and correction: D-algorithm,
            PODEM and FAN algorithms, LSSD, Boundary Scan,…
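
Of the theoretical models listed, the combinatorial one is the easiest to show in code. The sketch below assumes independent components: a series arrangement works only if every component works, a parallel arrangement fails only if every component fails. The architecture and the reliability figures are invented for the example.

```python
from math import prod

def series(reliabilities):
    """All components are needed: R = product of the R_i."""
    return prod(reliabilities)

def parallel(reliabilities):
    """One working component is enough: R = 1 - product of (1 - R_i)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical architecture: one controller in series with two redundant
# interface boards. The numbers are illustrative, not measured values.
r_controller = 0.99
r_board = 0.95

r_simplex = series([r_controller, r_board])
r_redundant = series([r_controller, parallel([r_board, r_board])])
print(f"single board:     R = {r_simplex:.6f}")
print(f"duplicated board: R = {r_redundant:.6f}")
```

Comparing the two figures is the kind of architecture comparison the slide refers to; Markov chains become necessary once repair or time-dependent behaviour has to be modelled.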




                      Improvements through FT in HW & SW

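        Simple example: Triple-Modular Redundancy, TMR (static redundancy)

               [Figure: input I feeds three modules M1, M2 and M3; their outputs Om1, Om2 and Om3 go to a voter V, which produces the output O]

               [Figure: Markov reliability, R(t) against t: the TMR curve Rtmr lies above the single-module curve Rs until the two cross at t = ln 2 / λ]

                 Everyday example of an N-modular redundant system: traffic lights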
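
The crossover at t = ln 2 / λ can be reproduced with the standard 2-out-of-3 expression R_tmr = 3R² − 2R³. That expression is not written on the slide; it is assumed here, together with a perfect voter and three independent, identical modules with R(t) = e^(−λt). The value of `lam` is hypothetical.

```python
import math

def r_single(t, lam):
    """Reliability of one module with constant fault rate lam."""
    return math.exp(-lam * t)

def r_tmr(t, lam):
    """Reliability of TMR with a perfect voter: at least 2 of 3 modules work."""
    r = r_single(t, lam)
    return 3.0 * r**2 - 2.0 * r**3

lam = 1e-4                     # hypothetical fault rate, in faults per hour
t_cross = math.log(2.0) / lam  # time at which both curves meet (R = 0.5)

for t in (0.25 * t_cross, t_cross, 2.0 * t_cross):
    print(f"t={t:>9.0f} h  Rs={r_single(t, lam):.4f}  Rtmr={r_tmr(t, lam):.4f}")
```

Before the crossover TMR is more reliable than a single module; after it TMR is worse, because two surviving modules are needed instead of one. TMR therefore suits missions that are short compared with 1/λ.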
                                4. Real-World FT examples




                        Sample Fault-Tolerant systems

               Long-Life Applications: Space applications
                   STAR: Self-Testing And Repairing computer
                   FTSC: Fault-Tolerant Spaceborne Computer
                   FTBBC: Fault-Tolerant Building Block Computer
               Critical-computation applications:
                   FTMP: Fault-Tolerant Multiprocessor
                   SIFT: SW Implemented Fault-Tolerance system
                   MMFCS: Honeywell Multi-Microprocessor Flight Control
                   System




                        Sample Fault-Tolerant systems

               High-Availability Applications:
                   Tandem NonStop computer system
                   Stratus/32 system
                   ESS: Electronic Switching System
               Airbus example:
                   A310 Roll Control system
                   A320 Pitch Control system
                                5. Conclusions & Future Work




                        Conclusions
         Fault tolerance adds complexity and therefore increases the number of faults
         But faults always appear, even under stringent fault avoidance
         Fault tolerance improves SW and HW systems: reliability,
         safety, availability and security
         This conclusion is supported by:
                 Theory: Markov models for comparing architectures
                 Practice: several real-world examples
         But always remember:
        “The more complex a system is, the more faults can appear. But too much
        simplicity (in the sense of no fault-tolerance support) can reduce
                                      dependability”




                        Future work

        Could be interesting for the BIC system (off-the-shelf design):
          Computation, through modelling, of the λ of our design
          Evaluation of the dependability of our design
          Possible inclusion of techniques to increase dependability
          (but only if justified by the previous studies):
                        Design for testability: Boundary Scan techniques?
                        Redundancy, Availability & Safety: static, dynamic, hybrid?
                        Watchdog timers (stuck-at faults)?
                        Maintainability: modular design
                        Some security measures (beam energy computation)?
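
Of the candidate techniques listed, the watchdog timer is the simplest to illustrate. The sketch below is a software model only, with an explicit clock so that its behaviour is deterministic; the class name, the timeout and the simulated hang are all assumptions. In a real design the watchdog would be a hardware timer that resets or safes the system on expiry.

```python
class Watchdog:
    """Expires when it has not been kicked for longer than `timeout`."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_kick = now

    def kick(self, now):
        """Called periodically by the supervised task to prove it is alive."""
        self.last_kick = now

    def expired(self, now):
        """True when the supervised task has been silent for too long."""
        return (now - self.last_kick) > self.timeout

# Simulated run: the supervised task hangs after tick 3 and stops kicking.
watchdog = Watchdog(timeout=2.0)
alarms = []
for tick in range(8):
    now = float(tick)
    if tick <= 3:
        watchdog.kick(now)
    if watchdog.expired(now):
        alarms.append(now)  # a real system would trigger a reset or a safe state

print("watchdog expired at ticks:", alarms)
```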




                        Future work
        Could be interesting for the PIC system:
                Computation of the λ of PLCs, interface boards & SW
                Evaluation of the dependability of the architecture
                Possible inclusion of techniques to increase dependability
                (only if justified by the previous studies):
                        Redundancy & Availability: F series from SIEMENS &
                        hybrid SW fault-tolerance techniques (RAM degradation faults)
                        Safety: F series from SIEMENS + Special OBs programming
                        Security: Security module from SIEMENS, Firewall in CP,
                        Password protection in PVSS, Removal of PROFIBUS &
                        CPU switches,…
                        Maintainability: Modular design