					  System Dependability


Robert Wierschke
Seminar “Prozesssteuerung und Robotik”
14 January 2009
    Outline


    ■ Motivation
    ■ Dependability
       □ Definition
       □ Dependability attributes
       □ Attribute relevance
    ■ Threats
       □ Fault model
       □ Fault-error-failure
    ■ Attaining dependability
       □ Fault tolerance
       □ Redundancy
    ■ Software Dependability
    ■ Summary
    Motivation



    ■ Deliver correct communication and computation services
    ■ First-generation computers used unreliable components
       □ Hardware concept
    ■ Consequences of system failure
       □ Economically
           ◊ Credit card authorization: $2.6 million / hour of downtime
           ◊ Airline reservation: $89,500 / hour of downtime
       □ Human life
           ◊ What happens if the on-board computer of an airplane
           crashes?
    Definition: Dependability 1|2



            dependable: capable of being depended on : reliable
            reliable: suitable or fit to be relied on : dependable
                                                  [Merriam-Webster Online]


     “the collective term used to describe the availability performance and
    its influencing factors : reliability performance, maintainability
    performance and maintenance support performance”                   [7]
         □ (Zuverlässigkeit)
         □ Focus: availability
             ◊ Strongly influenced by telecommunication industry


     ■ Evolves over time
     ■ Depends on problem domain
    Definition: Dependability 2|2




    “the trustworthiness of a computing system which allows reliance to
    be justifiably placed on the service it delivers”             [1]
         □ Behaves as specified
         □ Avoids hazards


     ■ Dependability Attributes
         □ Reliability (Funktionsfähigkeit)
         □ Availability (Verfügbarkeit)
         □ Safety (Sicherheit)
         □ Confidentiality (Vertraulichkeit)
         □ Integrity (Integrität)
         □ Maintainability (Wartbarkeit)
    Reliability 1|2



    ■ Continuity of service
    ■ Probability R(t) that a system/component operates correctly
    throughout a time period t
    ■ Example:
        □ Space probe needs to operate correctly during the mission
        time.
    ■ Calculating reliability
        □ R(0) = 1, R(∞) = 0
        □ Failure probability Q(t) = 1 - R(t)
        □ Failure rate λ(t): number of failures per unit time during Δt
        □ For constant failure rate λ: R(t) = e^(-λt)                  [3]
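The reliability formulas above can be tried out with a small sketch. The failure rate and mission time below are hypothetical values chosen only for illustration; the function names are not from the slides.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """R(t) = e^(-λt), valid only for a constant failure rate λ."""
    return math.exp(-failure_rate * t)


def failure_probability(failure_rate: float, t: float) -> float:
    """Q(t) = 1 - R(t)."""
    return 1.0 - reliability(failure_rate, t)


# Hypothetical component: λ = 1e-4 failures per hour, 1000 h mission time.
lam = 1e-4
mission_hours = 1000.0
print(f"R = {reliability(lam, mission_hours):.4f}")          # R = 0.9048
print(f"Q = {failure_probability(lam, mission_hours):.4f}")  # Q = 0.0952
```

Note that R(0) = 1 and R(t) approaches 0 for large t, matching the boundary conditions listed above.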
    Reliability 2|2



    ■ Empirical values
    ■ Assuming independent components
       □ Series




       □ Parallel




                                        [3]
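The formulas for the two arrangements did not survive on this slide. As a sketch, the standard textbook results for independent components are assumed here: a series system works only if every component works (product of the R_i), and a parallel system fails only if every component fails (1 minus the product of the 1 - R_i). The component values are hypothetical.

```python
from math import prod


def series_reliability(reliabilities):
    """Series: all components must work, so R = R_1 * R_2 * ... * R_n."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """Parallel: at least one must work, so R = 1 - (1-R_1) * ... * (1-R_n)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Three hypothetical, independent components with R_i = 0.9 each.
components = [0.9, 0.9, 0.9]
print(f"series:   {series_reliability(components):.3f}")    # series:   0.729
print(f"parallel: {parallel_reliability(components):.3f}")  # parallel: 0.999
```

Adding components in series lowers the overall reliability, while adding them in parallel raises it.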
    Availability 1|2



    ■ Readiness for usage
    ■ Probability A that a system/component operates correctly at any
    given point in time
    ■ Example:
        □ Telecommunication
    ■ Availability vs. reliability
       □ Service that crashes often but restarts instantly has high A but
       low R(t).
    ■ Number of 9s
       □ Availability and resulting downtime per year

         Number of 9s   Availability    Downtime per year
         1              90.0 %          36.5 d
         2              99.0 %          3.65 d
         3              99.9 %          8.76 h
         4              99.99 %         52.6 min
         5              99.999 %        5.26 min
         6              99.9999 %       31.5 s
         7              99.99999 %      3.15 s
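The yearly downtime for a given number of 9s follows directly from the unavailability 10^(-n). A minimal sketch, assuming a year of 365 days (leap years ignored); the function names are illustrative:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h, leap years ignored (assumption)


def availability_for_nines(nines: int) -> float:
    """n nines means an availability of 1 - 10^(-n)."""
    return 1.0 - 10.0 ** (-nines)


def downtime_hours_per_year(nines: int) -> float:
    """Expected downtime per year is the unavailability times 8760 h."""
    return HOURS_PER_YEAR * 10.0 ** (-nines)


for n in range(1, 8):
    hours = downtime_hours_per_year(n)
    print(f"{n} nines: {availability_for_nines(n):.5%} available, "
          f"{hours:.6f} h = {hours * 3600:.1f} s downtime per year")
```

Each additional 9 reduces the allowed downtime by a factor of ten.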
    Availability 2|2



    ■ A = MTTF / (MTTF + MTTR)
       □ MTTF mean time to failure
           ◊ MTTF = 1/λ
       □ MTTR mean time to repair
           ◊ Shorter repair time leads to higher availability
       □ MTBF mean time between failures (MTBF = MTTF + MTTR)




                                                                [5]
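A short sketch of the availability formula, showing how a shorter repair time raises A. The failure rate and repair times are hypothetical example values.

```python
def mttf_from_failure_rate(failure_rate: float) -> float:
    """MTTF = 1/λ for a constant failure rate λ."""
    return 1.0 / failure_rate


def availability(mttf: float, mttr: float) -> float:
    """A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


# Hypothetical system: λ = 1e-3 failures per hour, i.e. MTTF of about 1000 h.
mttf = mttf_from_failure_rate(1e-3)
print(f"MTTR 10 h: A = {availability(mttf, 10.0):.4f}")  # A = 0.9901
print(f"MTTR  1 h: A = {availability(mttf, 1.0):.4f}")   # A = 0.9990
```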
     Safety



     ■ Avoidance of catastrophic consequences
     ■ Property of a system/component that it will not imperil equipment
     or human life.
     ■ Example: nuclear power plant
        □ If the reactor reaches temperature X, it must shut down within
        time Y.
     ■ Conflicts with reliability: a non-working system is often safe
        □ Fail-safe: system reaches a safe state
        □ Fail-operational: system provides a degraded service mode
              ◊ Example: spare tire
     Security



     ■ Combines confidentiality and integrity
     ■ Property of a system/component that it will prevent unauthorized
     access or alteration of data
     ■ Example: control board in a train
        □ Displays and switches are behind a glass door, thus values can
        be read by everyone but not modified.
     Maintainability



      ■ System can be repaired and modified   [2]

      ■ Repair rate: μ = 1 / MTTR
      ■ Hard to specify and measure
         □ Low maintainability
             ◊ e.g. satellites
             ◊ Requires high reliability
     Attribute relevance



     ■ Depends on problem domain
        □ Economical: Are financial consequences acceptable?
        □ Ethical: Are risks for life or equipment acceptable?


     ■ Attributes might be conflicting
        □ Fail-safe state (reliability vs. safety)


     ■ Classes
        □ Uncritical Embedded Systems (e.g. mobile phone, Lego NXT)
        □ High-Integrity Embedded Systems (e.g. satellites)
        □ Safety-Critical Systems (e.g. aircraft)
     Threats



     ■ Anything that is capable of decreasing the system dependability
     ■ A meaningful specification must state the threats to the relevant
     dependability attributes
     ■ Fault model




                                                                   [5]
     Fault-Error-Failure 1|3



     ■ Fault
        □ A defect within the system that may eventually lead to an error.
               ◊ Active fault if it causes an error, otherwise
               dormant fault                                         [2]
        □ Fault classes
     ■ Error
        □ Part of the system state that may lead to a failure.
     Fault-Error-Failure 2|3



     ■ Failure
         □ Event that occurs when the delivered service deviates from
         correct service
         □ Service restoration: transition from incorrect to correct service
         □ Partial failure: a failure of a service may leave the system in
         degraded mode (e.g. emergency service)
     ■ Fault/failure chain
         □ Failures are recognized at component boundaries, thus a
         failure can be considered a fault in a dependent component




                                                                        [2]
     Fault-Error-Failure 3|3



     ■ Failure
         □ Typical failure rate
             ◊ Hardware: bathtub curve

                                                 [5]


             ◊ Software




                                           [5]
     Attaining dependability



     ■ Fault prevention
        □ Prevent faults from being introduced into the system
        □ Development techniques
     ■ Fault tolerance
        □ Mechanisms that allow the system to operate correctly in case
        of certain failures
        □ Possibly degraded service mode
     ■ Fault removal
        □ Development or usage phase
        □ Maintenance
     ■ Fault forecasting
        □ Predicting likely faults
     Fault tolerance 1|4



     ■ Definition: Means to maintain service while faults are present.
     ■ Improves reliability and availability
     ■ A fault can only be tolerated if it was expected
     ■ Phases
         □ Error detection
         □ Damage assessment
         □ State restoration
         □ Continue service
             ◊ Degraded mode
     Fault tolerance 2|4



     ■ Error detection techniques
        □ Result comparison
            ◊ Compare results of redundant components
        □ Watchdog timers
            ◊ Assume failure if result is late
        □ Reasonableness
            ◊ Range checks
            ◊ Constraints (e.g. negative value)
        □ Information redundancy
            ◊ Checksums
        □ Functionality test
            ◊ Memory checks
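Several of the detection techniques listed above can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: the helper names, the tolerance, the frame layout (payload followed by a 4-byte CRC-32) and the explicit `now` timestamps of the watchdog are all choices made for this example.

```python
import zlib


def results_agree(a: float, b: float, tolerance: float = 1e-6) -> bool:
    """Result comparison: two redundant components should agree."""
    return abs(a - b) <= tolerance


def is_reasonable(value: float, low: float, high: float) -> bool:
    """Reasonableness (range) check: the value must lie within [low, high]."""
    return low <= value <= high


def with_checksum(payload: bytes) -> bytes:
    """Information redundancy: append a 4-byte CRC-32 checksum."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def checksum_ok(frame: bytes) -> bool:
    """Recompute the CRC-32 of the payload and compare with the stored one."""
    payload, received = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


class Watchdog:
    """Watchdog timer: assume a failure if no result arrives in time."""

    def __init__(self, timeout: float, now: float = 0.0):
        self.timeout = timeout
        self.last_kick = now

    def kick(self, now: float) -> None:
        """Called by the monitored task whenever it delivers a result."""
        self.last_kick = now

    def expired(self, now: float) -> bool:
        return now - self.last_kick > self.timeout


frame = with_checksum(b"speed=42")
corrupted = b"speed=43" + frame[-4:]  # payload altered, checksum unchanged
print(checksum_ok(frame), checksum_ok(corrupted))  # True False
print(is_reasonable(-5.0, 0.0, 300.0))             # False
```

Passing the time explicitly to the watchdog keeps the sketch deterministic; a real system would read a hardware timer or a monotonic clock instead.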
     Fault tolerance 3|4



     ■ Error recovery
        □ Forward
            ◊ Discard computation
            ◊ Resume service from an error-free system state.
            ◊ Typically used for periodic tasks
        □ Backward
            ◊ Roll back to a known-good state (checkpoints)
            ◊ Re-execution of failed task possible
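Backward recovery can be illustrated with a minimal checkpoint/roll-back sketch. The class and function names, the dictionary-based state and the retry limit are assumptions made for this example only.

```python
import copy


class CheckpointedState:
    """Holds a mutable state plus one saved copy of a known-good state."""

    def __init__(self, state: dict):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self) -> None:
        """Save the current state as the known-good state."""
        self._checkpoint = copy.deepcopy(self.state)

    def roll_back(self) -> None:
        """Discard the current state and restore the checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def run_with_recovery(system: CheckpointedState, task, max_attempts: int = 3):
    """Run task(state); on an error, roll back and re-execute the task."""
    system.checkpoint()
    for _ in range(max_attempts):
        try:
            return task(system.state)
        except Exception:
            system.roll_back()
    raise RuntimeError("task failed on every attempt")


attempts = {"count": 0}


def flaky_transfer(state: dict) -> int:
    """Hypothetical task that fails once after a partial state update."""
    attempts["count"] += 1
    state["balance"] -= 10
    if attempts["count"] == 1:
        raise RuntimeError("transient fault")
    state["transferred"] += 10
    return state["balance"]


system = CheckpointedState({"balance": 100, "transferred": 0})
print(run_with_recovery(system, flaky_transfer))  # 90
print(system.state)  # {'balance': 90, 'transferred': 10}
```

Without the roll-back, the partial update from the failed first attempt would remain in the state and the balance would end at 80.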
     Fault tolerance 4|4



      ■ Replication
          □ Using multiple identical instances of a component
          □ Parallel task processing
          □ Voting/quorum
      ■ Diversity
          □ Tolerate systematic failures
          □ Using multiple different implementations of a component
          □ Otherwise used like replicas
     “The most certain and effectual check upon errors which arise in the
     process of computation, is to cause the same computations to be
     made by separate and independent computers; and this check is
     rendered still more decisive if they make their computations by
     different methods”                               [1834; Lardner; 2]
      ■ Redundancy
     Redundancy 1|2



     ■ Using multiple identical instances of a component; if one instance
     fails, switch to another (fail-over)
     ■ Types
        □ Space
               ◊ Use multiple components of the same type
        □ Time
               ◊ Send messages multiple times
               ◊ Execute computation multiple times
        □ Information
               ◊ Checksums, error correcting codes
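As one simple instance of an error correcting code, the sketch below uses a threefold repetition code: every bit is sent three times and decoded by majority. This particular code is chosen here for illustration and is not named on the slide; it corrects any single flipped bit per group of three.

```python
def encode(bits):
    """Repeat every bit three times, e.g. [1, 0] -> [1, 1, 1, 0, 0, 0]."""
    return [copy for bit in bits for copy in (bit, bit, bit)]


def decode(coded):
    """Decode each group of three bits by majority vote."""
    if len(coded) % 3 != 0:
        raise ValueError("coded length must be a multiple of 3")
    return [1 if sum(coded[i:i + 3]) >= 2 else 0
            for i in range(0, len(coded), 3)]


message = [1, 0, 1, 1]
received = encode(message)
received[4] ^= 1  # simulate a single bit flip during transmission
print(decode(received) == message)  # True
```

The price is a threefold increase in transmitted data; practical codes achieve the same protection with far less overhead.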
     Redundancy 2|2



     ■ Active redundancy
        □ Parallel components, voting
        □ Voter is single point of failure
        □ Fail-silent
        □ N-modular redundancy (NMR)
            ◊ N >= 3 (N = 3: TMR)                                [3]
            ◊ Tolerates ⌊(N-1)/2⌋ component failures
     ■ Passive redundancy
        □ Hot standby
            ◊ Operating in background to keep synchronized       [3]
        □ Cold standby
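A majority voter for N-modular redundancy can be sketched as follows. The number of tolerated failures, (N-1)/2, is read here as its integer part; the function names and the error handling are assumptions for this example.

```python
from collections import Counter


def tolerated_failures(n: int) -> int:
    """Number of faulty replicas an N-modular system can outvote."""
    if n < 3:
        raise ValueError("N-modular redundancy needs N >= 3")
    return (n - 1) // 2


def majority_vote(results):
    """Return the value reported by a strict majority of the replicas."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replica results")
    return value


# TMR: three replicas, one of which delivers a wrong result.
print(tolerated_failures(3))        # 1
print(majority_vote([42, 42, 17]))  # 42
```

The voter itself is not replicated in this sketch, so it remains the single point of failure mentioned above.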
     Software Dependability 1|2



     ■ Fault model for software components
        □ Bohrbugs: easily reproducible
        □ Heisenbugs: complex event combination; transient faults; hard
        to reproduce
        □ “Aging”: fault accumulation (e.g. memory leaks)




                                                            [8]
     Software Dependability 2|2



      ■ Rejuvenation
     “proactive fault management technique aimed at cleaning up the
     system internal state to prevent the occurrence of more severe crash
     failures in the future”                                       [8]
          □ Heisenbugs, Aging
      ■ N-version Programming
          □ Different implementations of a software component (diversity)
          □ Bohrbugs (operational phase)
      ■ Retrying operations
          □ Heisenbugs, Aging
      ■ Fault masking/reconfiguring/fail-over
          □ e.g. FT CORBA
          □ NMR
          □ Heisenbugs, Aging
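Retrying an operation helps against transient faults such as Heisenbugs, because the fault is unlikely to recur on the next attempt. A minimal sketch, where the helper name, the attempt limit and the delay are hypothetical:

```python
import time


def retry(operation, attempts: int = 3, delay_seconds: float = 0.0):
    """Call operation() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # wait before the next attempt
    raise last_error


calls = {"n": 0}


def flaky_read():
    """Hypothetical operation that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient fault")
    return "sensor value"


print(retry(flaky_read))  # sensor value
```

Retrying does not help against Bohrbugs: a deterministic fault fails again on every attempt, which is why the slide pairs it with Heisenbugs and aging only.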
     Summary 1|2



     ■ System Dependability is the ability of a system/component to
     operate as specified. Depending on the problem domain it subsumes
     a set of dependability attributes (reliability, availability, safety,
     security, maintainability, ...) with varying importance.


     ■ Threats
        □ Define possible degradation of system dependability
        □ Need to be specified
        □ Fault -> Error -> Failure --> Fault
     Summary 2|2



     ■ Fault tolerance
        □ Operate regardless of certain faults
        □ Tolerates only expected faults
        □ Strategies
            ◊ Replication
            ◊ Diversity
            ◊ Redundancy
                ● Space, time, information
                ● Active or passive
     References



     [1] IFIP WG 10.4 on Dependable Computing and Fault Tolerance,
     http://www.dependability.org/wg10.4/
     [2] Avižienis, Laprie and Randell, Dependability and its Threats:
     A Taxonomy, http://rodin.cs.ncl.ac.uk/Publications/avizienis.pdf
     [3] Giese, Software Engineering for Embedded Systems: II
     Foundations (WS 0708)
     [4] Bowen and Hinchey, High-Integrity System Specification and
     Design, http://www.jpbowen.com/pub/hissd99.pdf
     [5] Löwis, Fault Tolerance (SS 08)
     [6] Geib, Verlässliche Systeme,
     http://www.informatik.fh-wiesbaden.de/~geib/LVweb/VS_Kap_1.pdf
     [7] IEC, IEV 191-02-03,
     http://dom2.iec.ch/iev/iev.nsf/display?openform&ievref=191-02-03
     [8] http://srejuv.ee.duke.edu/