Issues in Safety Assurance

Martyn Thomas
Visiting Professor in Software Engineering, Oxford University Computing Laboratory
email: firstname.lastname@example.org

Abstract. The greatest problem facing the developer of a software-based safety-related system is the challenge of showing that the system will provide the required service and will not cause or allow an accident to occur. It is very difficult to provide such evidence before the system is put to use, yet that is exactly what is required by society and regulators, and rightly so. Conventional wisdom recommends that systems are classified into safety integrity levels (SILs) based on some combination of the allowable rate or probability of unsafe failure and the probable consequences of such a failure; then, depending on the SIL, development methods are chosen that will (it is hoped) deliver the necessary system quality and the evidence on which to base a confident assessment that the system is, indeed, safe enough. Such conventional wisdom is founded on a number of unstated axioms, but computing is a young discipline and progress has thrown doubt on these assumptions. It is time for a new approach to safety assurance.

1 Safety: the very idea

Computer-based safety-related systems exist in a wide range of sizes, but they are all complex. The hardware will comprise sensors, actuators and one or more processors, perhaps with associated chipsets. The size of the software may range from some kilobytes up to many megabytes of program logic, possibly with a similar volume of data. A “typical” system will perhaps contain a few million transistors in the hardware and one or two hundred kilobytes of program and data, but the observations below apply mutatis mutandis to all software-based safety-related systems (abbreviated to safety systems in what follows).

Safety systems may cause or allow accidents only through the physical systems they are designed to control or protect.
When a boiler overheats and explodes, or a reaction receives too much reagent and vents toxic gases, or an aircraft is misidentified on a radar display and destroyed by “friendly fire”, one of the contributing factors leading to the accident may have been the undesired behaviour of a computer-based safety system. The same behaviour in a different environment might be neutral or beneficial. Thus safety systems have requirements that determine what is considered safe behaviour in the specific environment for which they have been developed (and which are often implicit and only fully recognised with hindsight); these requirements are captured in tangible form as specifications.

Safety systems may fail for various reasons:

1. the specifications may not adequately capture the requirements, at least as these requirements are recognised with hindsight following the failure;
2. the hardware may contain a design error which, under some input conditions, causes the system to fail;
3. the software may contain a design error which, under some input conditions, causes the system to fail;
4. the hardware may contain a component which fails after a period of normal usage (for example: as a result of thermal stress);
5. the hardware may be subject to unintended external physical interference (for example: physical shock, ionising radiation, electromagnetic interference);
6. the hardware may be subject to deliberate external physical interference (for example: vandalism);
7. the fixed data may contain accidental or deliberate errors (for example: a physical constant may be entered using the wrong units);
8. the variable data may contain accidental errors (for example: a pilot error may lead to a rate of descent being entered into a flight management system instead of an angle of descent);
9. the variable data may contain deliberate errors (for example: a radio signal giving a train movement authority may be spoofed);

… and there may be other reasons that I have not listed.

For safety assurance purposes, we are interested in knowing the probability that the safety system will fail to maintain the safety of the system it controls or protects, irrespective of the reason for such failure. Some engineers may argue that one or more of the reasons listed above are outside the scope of their work: it is a critical decision where to draw the line that limits the scope of a safety system and the responsibilities of its engineers. In my opinion, it is important to draw such boundaries widely (and risk multiple engineers worrying about the same issues) rather than narrowly (and risk that there are possible causes of failure that no-one considers their responsibility). I therefore propose that a safety system should be defined as responsible for all possible failures of the physical system it controls or protects, other than those explicitly excluded by the specifications. This definition would increase the attention given to deciding exactly where the boundary should be drawn, rather than leaving some aspects implicit. (For example, what type and degree of physical damage to an aircraft should a fly-by-wire system be designed to accommodate?)

Some engineers may even claim that a safety system has not failed if it behaves according to its specification. Such an argument contradicts established definitions of failure (footnote 1).

Footnote 1: “A system failure occurs when the delivered service deviates from fulfilling the system function, the latter being what the system is aimed at.” [J.C. Laprie, “Dependable Computing: Concepts, Limits, Challenges,” in 25th IEEE International Symposium on Fault-Tolerant Computing - Special Issue, pp. 42-54, Pasadena, California, USA, IEEE, 1995]. ‘The phrase “what the system is aimed at” is a means of avoiding reference to a system “specification” - since it is not unusual for a system’s lack of dependability to be due to inadequacies in its documented specification.’ [B. Randell, Facing up to Faults, Turing Lecture, 2000].

The probability of some of the failures listed above can be estimated with useful accuracy (for example: the in-service failure rates of standard hardware components can be estimated through knowledge of the components and the hardware design); others cannot, and must be designed out as far as possible (reason 9, for example). The only way to estimate the probability of system failure from any cause is to measure the actual failure rate under truly representative operating conditions (including, where appropriate, deliberate attack). Such testing can deliver failure probabilities, but these are rarely useful because of the difficulty of predicting and recreating the operating conditions accurately enough and because of the cost and time of carrying out the tests. In practice, we may be able to show that the system is unusable quite quickly, but it will rarely be practical to gather statistical evidence that it is safe enough unless the target probability of unsafe failure is higher than around 10^-4 per hour, and even this requires heroic amounts of evidence: around a year of realistic testing with no faults found.

Despite the inherent difficulties in safety assurance, customers and regulators continue to require that safety systems achieve probabilities of unsafe failure of 10^-8 per hour and lower, and international standards such as IEC 61508 allow such claims (footnote 2).
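As a rough illustration of the testing burden described above, the following sketch computes how many failure-free test hours would be needed to support a claimed failure rate. It assumes a constant failure rate (an exponential model) and perfectly representative testing; both are simplifying assumptions made here for illustration, and the function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def confidence_after_failure_free_test(target_rate_per_hour, test_hours):
    """Confidence that the failure rate is below the target after
    test_hours of testing with no failures observed.

    Assumes a constant failure rate: a system failing at exactly the
    target rate would survive the test with probability exp(-rate * t).
    """
    return 1.0 - math.exp(-target_rate_per_hour * test_hours)


def failure_free_hours_needed(target_rate_per_hour, confidence):
    """Failure-free test hours needed to reach the given confidence
    (same constant-rate assumption as above)."""
    return -math.log(1.0 - confidence) / target_rate_per_hour


if __name__ == "__main__":
    one_year = confidence_after_failure_free_test(1e-4, HOURS_PER_YEAR)
    print(f"1e-4 per hour, one failure-free year: confidence {one_year:.2f}")
    for rate in (1e-4, 1e-8):
        years = failure_free_hours_needed(rate, 0.99) / HOURS_PER_YEAR
        print(f"{rate:g} per hour at 99% confidence: {years:,.0f} years")
```

Under these assumptions, one failure-free year supports a 10^-4 per hour claim with only roughly 58% confidence; 99% confidence needs a little over five years, and a 10^-8 per hour claim needs tens of thousands of years.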
There is no possibility of useful evidence that such low probabilities have been achieved, at least not until the system has been in actual service for many years without failure or modification, yet safety cases are regularly written to show that such systems are safe enough to be put into service, and the safety cases are regularly accepted as providing adequate evidence.

Footnote 2: IEC 61508-1 1999, Functional safety of electrical/electronic/programmable electronic safety-related systems.

2 Safety Integrity Levels

IEC 61508 gives the following correspondence between target probability of failure and safety integrity levels:

Safety integrity   Low demand mode of operation        High demand or continuous mode of
level              (average probability of failure     operation (probability of a
                   to perform its design function      dangerous failure per hour)
                   on demand)
4                  10^-5 to 10^-4                      10^-9 to 10^-8
3                  10^-4 to 10^-3                      10^-8 to 10^-7
2                  10^-3 to 10^-2                      10^-7 to 10^-6
1                  10^-2 to 10^-1                      10^-6 to 10^-5

The difference between these modes of operation is explained by the standard as follows:

mode of operation
way in which a safety-related system is intended to be used, with respect to the frequency of demands made upon it, which may be either:
low demand mode – where the frequency of demands for operation made on a safety-related system is no greater than one per year and no greater than twice the proof test frequency; or
high demand or continuous mode – where the frequency of demands for operation made on a safety-related system is greater than one per year or greater than twice the proof check frequency.

proof test
periodic test performed to detect failures in a safety-related system so that, if necessary, the system can be restored to an “as new” condition or as close as practical to this condition.

For safety systems containing software, low demand mode should be ignored because proof testing would require exhaustive testing and is infeasible.
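The high demand or continuous mode column of the table can be read as a simple lookup from target failure probability to SIL. The sketch below is only an illustration of that reading: the function name is hypothetical, and treating each band as inclusive at its lower bound and exclusive at its upper bound is an assumption, since the table as quoted here does not say how boundary values are assigned.

```python
# (SIL, lower bound, upper bound) for the probability of a dangerous
# failure per hour in high demand or continuous mode, as tabulated above.
# Assumption: lower bound inclusive, upper bound exclusive.
HIGH_DEMAND_BANDS = [
    (4, 1e-9, 1e-8),
    (3, 1e-8, 1e-7),
    (2, 1e-7, 1e-6),
    (1, 1e-6, 1e-5),
]


def sil_for_target(prob_dangerous_failure_per_hour):
    """Return the SIL whose band contains the target probability,
    or None if the target lies outside every tabulated band."""
    for sil, low, high in HIGH_DEMAND_BANDS:
        if low <= prob_dangerous_failure_per_hour < high:
            return sil
    return None


if __name__ == "__main__":
    for target in (5e-9, 5e-6, 1e-4):
        print(f"{target:g} per hour -> SIL {sil_for_target(target)}")
```

A target of 10^-4 per hour, roughly the limit of what statistical testing can support, falls outside every band and so returns None.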
The lowest safety integrity level, SIL 1, applies to target failure probabilities between 10^-5 per hour and 10^-6 per hour; as we have seen, this is already beyond practical verification.

IEC 61508 explains and illustrates the difficulties of providing adequate evidence for very low probabilities of failure, but it has to work with current approaches taken by industry and regulators. The standard uses SILs as the basis for recommending a very large set of software development methods, with most methods being more strongly recommended at higher SILs than at lower SILs. It is implicit that
(a) using these methods leads to a lower probability of failure in the resulting software;
(b) using these methods costs more than developing the same software without them would cost, so their use cannot be highly recommended at lower SILs.
Let us examine each of these assumptions in turn.

2.1 Do methods highly recommended for higher SILs lead to fewer faults?

There is very little experimental software engineering research, but a recent paper [1] reported the results from static analysis of a wide range of avionics software developed to different SILs (actually Level A and Level B code according to RTCA DO-178B [2]) and in a variety of languages (C, Lucol, Ada and SPARK). Static analysis found significant code anomalies in all the software (ranging from one anomaly in every 6-60 SLOC in C, to one anomaly in 250 SLOC in SPARK), but “no discernible difference” between software developed to DO-178B level A (which requires extensive Modified Condition/Decision Coverage testing) and software developed to level B (which does not). In terms of the residual anomalies found by static analysis (1% of which were assessed as having safety implications) the extra testing had yielded no benefits.
The authors of [1] also conclude that their results mean that the programming language “C and its associated forms should be avoided”, although McDermid [3] reports that his analysis of data in Shooman [4] suggests that “the programming language seems to have little bearing on the failure rate”.

2.2 Does stronger software engineering cost more?

The great pioneer of software engineering, Edsger Dijkstra, observed in 1972 that the greatest cost in software development flowed from the work of removing errors and that the only way to achieve much higher reliability would be to avoid introducing the errors in the first place, which would eliminate much of this cost. This observation has proved to be true, and companies that use strong software engineering methods report far lower error rates combined with reduced development costs. Readers interested in following up these reports are recommended to start with the papers online at http://www.sparkada.com/industrial.

2.3 SIL based software safety: conclusions

SILs are based on the assumption that there is a set of techniques that will substantially reduce the risk that a system will fail unsafely, but that these techniques are so expensive that they can only be justified for the most safety-critical functions. Both halves of this assumption lack evidence, and such evidence as exists suggests that a different approach is required. This is not a new conclusion: McDermid [3] said the same thing in 2001, and his proposals for “evidence-based approaches to demonstrating software safety” are very interesting.

3 Safety: a Pragmatic Approach

It seems to me that the development and assurance of software-based safety systems is an engineering task that merits a pragmatic engineering approach. My own experience suggests that the following changes would be beneficial.

1. Current systems are often required to show evidence for probabilities of failure so low that no scientific evidence is possible.
This is damaging to our whole engineering approach, for several reasons:
   - statistical evidence from testing is devalued, because such evidence will necessarily be insufficient;
   - engineers have to make numerically based claims where no such claims can be justified;
   - many current systems fail far more often than their safety case claims, but these failures rarely lead to accidents because there are mitigating factors; this suggests that the targets are set too low, devaluing the targets themselves.
   I believe that we need to reassess the target failure probabilities used in many industries.
2. There is plenty of evidence that trying to formalise the specifications for a system (for example, using the Z notation) uncovers and resolves very many ambiguities, contradictions and omissions in the stated requirements. Such work pays for itself by reducing the cost of later stages of development. I believe that every safety system should have a formal specification.
3. There is growing evidence that the level of defects in delivered software can be hugely reduced by using a well-defined programming language and static analysis toolset such as SPARK, and that this costs little or nothing extra in development time and effort. I believe that every safety system should use such tools as far as practical, and that investment should be made to increase the power and scope of such tools.
4. Many of the tools and methods recommended by IEC 61508 part 3 are simply good software engineering practice for any system, safety-related or not. I believe that the software industry needs to define a core set of methods and tools that is considered the baseline for professional competence, to force up standards across industry.
5. The notion of SILs should be abandoned as serving no useful purpose.
6. The starting point for every safety case should be that the only acceptable evidence that a system meets a safety requirement is an independently reviewed proof or statistically valid evidence from testing. Any compromise from this position should be explicitly identified and justified as being the best evidence that can be provided reasonably practicably. If an accident occurs, this justification will be subject to challenge in court.
7. If early operational experience shows a level of error that undermines the arguments in the safety case, the system should be withdrawn from service, not patched up, even if no safety-related incidents have occurred.
8. When software is modified (“maintained”), the whole system should have its safety analysis repeated except to the extent that it can be proved that this is unnecessary. (Good architectural partitioning and a formal specification will massively reduce the cost of this re-verification.) I believe software maintenance must be a serious vulnerability with many systems currently in use, where there is no formal specification against which a rigorous analysis can be carried out to show the potential impact of changes and the extent of revalidation that is necessary.
9. COTS components should have to conform to the above principles. Where their use is justified primarily on cost grounds, but they lack the development history or statistical evidence that would justify claims of adequate safety, the organisation that selected the COTS component should be strictly liable for any in-service failures attributed to its use.
10. All safety systems should be warranted free of safety defects by their developers.
11. Any system where the safety of the public is at risk should have its development and operational history kept in escrow, so that accidents and incidents can be investigated independently of the developers and so that academic research can be carried out to improve software and systems engineering.

These proposals may appear extreme to some people but I believe they are simply good engineering practice. In the UK, most of these proposals appear to be necessary to meet the ALARP principle of the Health and Safety at Work Acts.

4 References

[1] Air Vehicle Software Static Code Analysis Lessons Learnt, Andy German and Gavin Mooney, in "Aspects of Safety Management" - Proceedings of the Ninth Safety-Critical Systems Symposium, Bristol, UK, 2001. Edited by Felix Redmill and Tom Anderson. Springer-Verlag. ISBN 1-85233-411-8.
[2] RTCA DO-178B, Software Considerations in Airborne Systems and Equipment Certification, RTCA Inc, 1992.
[3] Software Safety: Where’s the Evidence?, John A McDermid, Proc. 6th Australian Workshop on Industrial Experience with Safety Critical Systems and Software, Brisbane, 2001.
[4] Avionics Software Problem Occurrence Rates, M.L. Shooman, IEEE Computer Society Press, 1996.