Introduction and Chapters 1, 2, and 3

   Phil Varner and Dave Larochelle
                Failure of the Day
●   Voter News Service - updated computer
    system after 2000 election, now it doesn't
    work
●   “It is brand-new and has never been tested”
    –   implies testing is after completion
●   “This is the equivalent of a NASA space shot
    without doing any test runs.”
    –   hmm, maybe not the best analogy
●   “We are going to learn a few things”
●   “Find a little, learn a lot”
    –   Law and Order Model
●   “Find a lot, learn nothing”
    –   Software Engineering Model
●   Dictionary
    –   “The producer of an effect, result, or consequence”
    –   “The one, such as a person, event, or condition,
        that is responsible for such an action”
●   Factor that contributed to events?
           Cause means Nothing
●   tempted to find things we can do nothing
    about
●   often identify symptoms instead of the problem
●   air of finality
●   implies singularity
●   usually doesn't help us prevent further
    accidents
●   “Fate is just a lazy person's excuse for doing
    nothing”
              Accident Models
●   Spend time fitting data to model
●   “distracts from free-range thinking required to
    uncover the less obvious ways of preventing
    an accident” - huh?
●   ha, ha, your Crouching Lemur Formal
    Methods cannot defeat my Hidden Marmot
    Ad-Hoc Technique
              Introduction (cont.)
●   “chain of events”
    –   tells WHAT but not WHY
●   “can easily miss subtle and complex
    couplings and interactions among failure
    events and omit entirely accidents involving
    no component failure”
●   misses “structural deficiencies in the
    organization, the safety culture in the
    industry, and management deficiencies”
●   Leveson - control systems theoretical model
●   “Mishap occurs when external disturbances
    are not adequately controlled”
    –   external meaning “the thing being controlled”
●   “dysfunctional interaction” among components
    –   normal accidents
●   Missing, inadequate, or unenforced constraints
●   Record all the facts
    –   what facts are important?
    –   shouldn't analysis drive investigation?
    –   “reports should not be too verbose or busy
        people will not read them”
    –   Oh, I'm sorry about the nuclear plant meltdown,
        but I didn't have time to read your report
●   Technical oversight
    –   “Human error”
    –   End “causes” - symptoms
    –   buffer overflow, exception mishandled (C sketch
        after this slide)
●   Hazards that were not seen before
    –   meaning “we didn't recognize it”
    –   Specification, testing, review
    –   Switching software off when not necessary
        (Ariane 5)
    –   Using languages not vulnerable to stack exploits
●   Managerial failings
    –   improving management systems
    –   generally has to do with supervision, oversight
    –   training operators (programmers?)
    –   Forcing adequate specification, implementation,
        testing, verification, validation
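As a concrete example of the layer-1 “end causes” above, a minimal C sketch
of a buffer overflow and its purely technical fix (invented names, not taken
from any incident discussed here):

    #include <stdio.h>
    #include <string.h>

    /* The "end cause" as code: a classic unchecked copy. */
    void record_user(const char *input)
    {
        char buf[16];
        strcpy(buf, input);              /* overflows buf if input is >= 16 bytes */
        printf("user: %s\n", buf);
    }

    /* The layer-1 fix: bound the copy and terminate explicitly. */
    void record_user_safe(const char *input)
    {
        char buf[16];
        strncpy(buf, input, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';      /* strncpy may not NUL-terminate */
        printf("user: %s\n", buf);
    }

    int main(void)
    {
        record_user_safe("alice");       /* safe path; record_user risks overflow */
        return 0;
    }

Note that the fix stays at the technical layer; the hazard and managerial
layers (review, testing, language choice) are where the deeper causes live.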
●   Are these categories adequate/rich enough?
●   Do they constrain thinking?
●   Leveson has similar categories, but more
                  Defen(c)e in Depth
●   Usually too much dependence on last line of
    defense
    –   Why accidents are usually attributed to human
        error
         ●   “It was the pilot's fault; he turned off the engines”
    –   Most effective actions are at start of chain
         ●   “Don't put the power switch next to the fuel control”
    –   However, normal accidents mean we may not
        even know that a chain exists, much less where
        the start is
         Interesting Characteristics
●   Accident 1
    –   efficiency practice (polyethylene sheet)
         ●   Korean Air?
    –   sensor “failure” (gas detector)
         ●   TMI
●   Accident 2
    –   unauthorized modification (lid modified)
         ●   (Flixborough)
    –   poor quality modification (lid repairs)
         ●   (Flix)
●   Poorly designed and implemented modifications
         Problems (from Johnson)
●   Supporting a systematic approach
    –   no approach is not an approach
●   Framing any analysis of software failure
    –   How far back do we chain the events?
    –   Why bolts? Why a pump? Why do we, as a
        society, require large petrochemical resources to
        maintain our generally unhappy lifestyles?
●   Making adequate recommendations
    –   Ariane, London examples
Ariane 5
             Events? Causes?
●   Events?
    –   Self-Destruct
    –   Launcher Disintegrates
    –   High angle of attack
    –   Nozzles hard over
    –   Diag. data sent to OBC
    –   Backup SRI dead
    –   Primary SRI dead
●   Causes?
    –   Backup SRI dies
    –   Exception thrown
    –   64-bit fp to 16-bit int conversion
    –   Stupid French person?
              Ariane 5 Pre-events
●   Operand Error review
    –   7 variables at risk, only 4 protected
●   External reviews held
    –   what did they do, what did they find?
●   Simulations and tests done using A4 test data
●   System test excluded SRI and used
    simulated output
●   SRI computer never tested using actual
    expected flight measurements
●   Official “Cause of the Failure”
    –   “ [the failure] was caused by the complete loss of
        guidance and attitude information... due to
        specification and design errors [in the SRI]”
●   Technical Cause (see the C sketch below)
    –   “An Operand Error when converting the
        horizontal bias (BH), and the lack of protection of
        this conversion which caused the SRI computer
        to stop.”
●   Real Cause
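The official technical cause, as a minimal C analogue (the real SRI code was
Ada, where the unprotected 64-bit float to 16-bit int conversion raised an
Operand Error and shut the computer down; the variable name and value here
are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* BH (horizontal bias) grew larger on the Ariane 5 trajectory
           than it ever could on Ariane 4, exceeding the 16-bit range. */
        double horizontal_bias = 100000.0;

        /* Unprotected conversion: Ada raised an Operand Error here; in C
           the out-of-range conversion is undefined behavior and typically
           yields a garbage value with no error at all. */
        int16_t bh = (int16_t)horizontal_bias;

        printf("converted BH = %d\n", bh);
        return 0;
    }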
●   4 technical, 7 hazard, 3 management(!)
    recommendations
●   One page long (of course, there is another
●   Do not allow any sensor, such as the SRI, to
    stop sending best effort data (sketch after this
    list)
●   Organize a specific sw qualification review
    for each piece of equipment w/ sw
●   Review all flight software (including
    embedded software), and in particular:
    –   identify all implicit assumptions...
    –   verify the range of values taken by any internal
        or communications variables in the software
●   Include trajectory data in specification and
    test requirements
●   A more transparent organization of the
    cooperation among the partners in the A5
    program must be considered
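A hedged sketch of the “best effort data” and range-verification
recommendations in C (hypothetical names; the flight software itself was
Ada): verify the variable's range explicitly, and degrade to a flagged,
saturated value rather than halting.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical guarded conversion: checks the range and keeps sending
       best-effort (saturated) data instead of stopping the computer. */
    bool convert_bh(double bh, int16_t *out)
    {
        if (bh > INT16_MAX) { *out = INT16_MAX; return false; } /* clamp, flag */
        if (bh < INT16_MIN) { *out = INT16_MIN; return false; } /* clamp, flag */
        *out = (int16_t)bh;   /* in range: safe to convert */
        return true;
    }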
                   Chapter 2
●   Drying unit control panel located in a Zone 2
    area
●   Electrical equipment was not flameproof or
    intrinsically safe
●   Mounted in a metal cabinet
●   Cabinet purged with nitrogen
●   A pressure switch isolated the electrical
    supply if the pressure fell too low
            System operation
●   Young, inexperienced graduate operates the
    unit
●   Switches on the system
               What went wrong?
●   Fuel
    –   Solvent leaked into the cabinet with the nitrogen
●   Air
    –   Leaked into the cabinet
    –   Leaked solvent may have weakened joints
●   Ignition source
    –   Electrical
●   Pressure switch basically disarmed
●   First layer
    –   Prevent back flow, etc.
    –   Alterations of trip set points should only be made
        with written authorisation
    –   New equipment should be scheduled for testing
                  Second Layer
●   Difficult to maintain pressure in a metal
    cabinet
●   Trip kept operating
    –   Reduced trip to ¼ inch
    –   Reduced trip to 0
    –   Operators just knew it worked
                     Third Layer
●   Safety equipment for a rare hazard
    produced a greater hazard
●   Did the equipment need to be in a Zone 2
    area?
    –   'He did not ask if the control panel could be
        moved. That was not his job. His was to supply
        equipment suitable for the agreed classification.
        It was no one's job to ask if it would be possible
        to change the classification ...'
●   It is bad to assume a trip will always work
    and rely on it.
                    Other observations
●   Latent failures
●   Applicability to Software
    –   Numerous examples of efforts to gain additional
        safety/security having the opposite effect.
         ●   PHP
              –   Paranoid sites increase logging
               –   The increased logging exposes them to format string
                   attacks (C sketch after this list)
         ●   Bind
              –   Digital signatures added for increased security
               –   New code has a buffer overflow.
    –   Discussion: other examples, etc.
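The PHP incident pattern, sketched in C for concreteness (hypothetical code,
not the actual PHP source): the extra “safety” logging becomes the attack
surface only when user data is used as the format string.

    #include <stdio.h>

    /* Vulnerable: attacker-controlled text used as the format string.
       %x directives can leak memory contents and %n can write to it. */
    void log_event_bad(FILE *log, const char *user_supplied)
    {
        fprintf(log, user_supplied);
    }

    /* Safe: constant format string; user data is an argument, not a format. */
    void log_event_ok(FILE *log, const char *user_supplied)
    {
        fprintf(log, "%s\n", user_supplied);
    }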
     Chapter 3: Things that almost go wrong
●   Tank contains water and some oil
●   Slip-plate has not been removed
●   Foreman proposed to break the joint, remove
    the slip-plate and remake the joint before the
    water drained out
●   Manager agreed to the proposal
●   Foreman unsuccessful, oil starts to leak
●   Attempt to shut down the burners fails
●   Burners were manually isolated.
                   5 Reports
●   Slip-plate is the problem
●   Failure to shut down the furnace
●   Don't rush
●   Management problems
●   Business climate in the facility
                    Don't Rush

●   Very few problems are too urgent to discuss
    for 15 minutes
●   Usually better solutions if you stop to think
         Management problems
●   Manager inexperienced
●   Foreman was respected expert
“The manager could not be blamed.
   Nevertheless sooner or later every manager
   has to learn to stand up to his foremen, not
   disregarding their advice, but weighing it in
   the balance. He should be very reluctant to
   overrule them if they are advocating caution,
   more willing to do so if, as in this case, they
   advocate taking a chance”
    Business Climate in the Facility
●   Manager was influenced by the attitudes of
    others
●   Was the manager given contradictory
    instructions?
    –   Get the plant on-line quickly
    –   Follow safety procedures
●   'no-win' situation
    –   After accidents, blamed for not following safety
        procedures
    –   If the required output/efficiency is not achieved,
        blamed for that too
●   'What we don't say is as important as what
    we do say.'
●   Kletz criticizes those who want to hold
    individual managers and directors
    responsible. But is there another way to
    change the business climate?
●   Note potential relevance to software.
