

									                  Safety Analysis of Hardware / Software Interactions in Complex Systems

                   John A. McDermid, David J. Pumfrey; University of York; York, UK

                          Keywords: software segregation, operating system safety

                                 Abstract

This paper describes a new analysis technique developed specifically to
study the safety implications of the relationship between software and
the hardware on which it runs.

The technique was developed in response to a request for assistance in
completing the safety argument for a critical avionics application.
Evidence was required that the segregation mechanism, used to partition
functions of different integrity levels running on the same processor,
would adequately protect critical program and data memory from
corruption by the lower integrity software.

The technique is based on an analysis of time and physical resources,
using interpretations of a number of generic failure classes to prompt
consideration of various hypothetical deviations from designed
behaviour.

We consider this research to be of particular significance, as the
ability to provide such evidence is fundamental to the development of
safety cases for future systems which will need to use a generic high
integrity kernel to manage a number of processes with different
integrity levels running on the same hardware.

The paper describes the principles of the technique, and also presents
our experience in applying it to the avionics system project which
prompted its development.

                               Introduction

As the number and diversity of safety critical applications of computer
systems have increased, there has been an attendant requirement for
increased levels of assurance of the safety of such systems, where
safety encompasses both the normal (intended) operation of the system,
and the management of failures. One area which has proved particularly
problematic is assessment of the safety properties of the interaction
between safety critical software and the hardware on which it runs. In
seeking evidence of safety, there is a need to answer questions such as:
• does the software use the hardware safely and appropriately?
• can the software cope acceptably with plausible hardware failures?
• could hardware failures invalidate assumptions of independence of
  software failure modes (or vice versa)?

Safety requirements for a computer system are normally expressed in
terms of requirements on the functions it provides, which are derived
from hazard analyses carried out on the system which is to be
controlled. These requirements are propagated down and refined as the
computer system design is developed, so that, at each level, the
implementers are aware which functions are critical, and what failure
modes must be avoided. In addition to these specific requirements, there
will also be some general requirements over the whole system, such as
“no single failure shall lead directly to a hazard”. At the level of
hardware / software interactions, a safety effect is considered to be
anything which causes one of the functional failure modes identified as
critical, or causes a violation of one of the general safety
requirements.

Historically, safety critical systems have tended to be written as
monolithic software, incorporating bespoke scheduling, interface and
support functions, which ran on a single, custom-designed hardware
platform. This style of system was relatively inflexible, but had the
advantage (from the safety assessment viewpoint) that application
software interfaced very directly with the hardware (figure 1a).

More recently, there has been a move to build safety critical software
using architectures more similar to those found in general computing
systems, with a distinct operating system layer between application
functions and hardware (figure 1b). Since it adds a level of indirection
between application code and hardware, this additional layer makes some
parts of the task of analysing software / hardware interactions more
complex. However, it also offers significant advantages for the safety
analyst, not least by making it possible to provide detailed evidence
about the system which is independent of the application code, and does
not need to be revised if the application is changed.

[Figure: two layered architectures. a) Monolithic software: a
'monolithic' application, with scheduling, interface and support
functions embedded in the software, runs directly on the hardware, which
drives the controlled system. b) With operating system: application
functions run above a separate layer of scheduling, interface and
support functions, which runs on the hardware driving the controlled
system.]
 Figure 1 – Safety Critical System Architectures

In particular, when critical and non-critical application functions are
implemented on the same system, it is necessary to demonstrate not only
that the use of resources by each is appropriate in itself, but also
that functions of a lower integrity cannot in any way interfere with the
operation of those of a higher integrity. More specifically, it must be
demonstrated that under normal operation:
1. Data flow corruption is prevented
    – low integrity level software cannot modify high integrity data.
2. Control flow corruption is prevented
    – critical functions can always execute at the correct time, without
       being affected by the actions of non-critical functions.
    – low integrity software cannot modify high integrity level code.
3. Corruption of the execution environment is prevented
    – corruption of parts of the system used by both high and low
       integrity software (e.g. processor registers, device registers
       and memory access privileges) cannot occur.
It must also be shown that, if any of requirements 1 to 3 is violated,
e.g. due to hardware failure, this will be detected, and the system
caused to take whatever action is necessary to ensure continued safety.
A suitable operating system may make it possible to demonstrate this
segregation without needing to analyse all of the application code.
Although the analysis technique described in this paper is applicable to
any computer system, our focus will be its application to provide
evidence for such an operating system.

For any realistic computer system, whether of the monolithic style or
with an identifiable operating system, it would be impossible to carry
out a complete analysis of the effects of every plausible failure of the
hardware at every step of software execution. What is required,
therefore, is an analysis based on a simplification or abstraction of
the system, supported by an argument of the suitability and
acceptability of the abstraction used. Parts of this argument will,
necessarily, be system specific, but much of it is general, and some of
the general parts will be described as we present our approach.

                            Analysis Approach

Our approach is based on the observation that all interactions between
software and the hardware on which it is running can be considered in
terms of the use of physical resources and time. By identifying a number
of classes of resource criticality, it is possible to describe the type
of arguments which must be made to demonstrate the acceptability of the
use of the resources in a given system. We then show how these arguments
can be based on consideration of the effects of a relatively small set
of hypothetical failure modes.

Identifying Resources: Physical resources consist of the processor
registers, memory locations, I/O and other special registers. This is,
effectively, the programmer’s model of the hardware; hardware features
such as buses, arbitration logic etc. are considered in terms of the
registers which control them.

For any specific combination of software and hardware, we can partition
these resources into classes based upon the criticality of the resource
usage. We identify five classes of criticality:
• intrinsically critical
    those resources which contain safety critical data at any point in
    the execution of the
    software, or the program code for safety critical functions;
    examples include I/O and RAM used by safety critical functions,
    processor registers etc.
• primary control
    resources which directly control the use or function of an
    intrinsically critical resource; examples include memory management
    unit (MMU) registers, I/O control registers etc.
• secondary control
    resources which either provide a backup to primary controls (e.g. a
    secondary MMU giving redundancy in memory protection), or control
    access to primary resources (for example, key registers which must
    be set to particular values before MMU registers can be altered)
• non-critical
    resources which are never used by critical software, and do not
    affect the operation of any part of the hardware which is used by
    critical functions
• unused
    locations in the memory map which do not correspond to a physical
    device. The importance of these locations is that there should be no
    attempts by any part of the software to access them; such an attempt
    indicates a failure, and must be trapped and handled safely.

In considering time, we are not concerned with the passage of real time,
such as execution times of code sections, for which good methods exist.
Our model of time is based on the identification of discrete timing
events which have associated hardware actions. Examples of these include
interrupts, the use of system timers and counters, and synchronisation
actions. Again, this is a model familiar to programmers.

Timing events can be identified as either critical or non-critical,
depending upon whether they affect the execution of critical code. Note
that there will be a set of primary (and possibly some secondary)
control resources associated with each timing event. For example,
primary resources associated with a timer-generated interrupt will
include the control registers for the timer, and CPU registers which
determine its response to the arrival of the interrupt.

As a basis for analysis, this model has several advantages:
1. it is possible to ensure that the model is complete (i.e. includes
   the entire memory map, all non-mapped devices such as processor
   registers, and all interrupts and synchronisation events)
2. the model is familiar to the system’s designers and programmers, so
   it is possible to discuss safety analysis in familiar terms
3. although potentially large, the model is of a fixed and predetermined
   size, so the effort required for analysis can be predicted reasonably
   accurately in advance.

As an example of the identification and classification of resources,
consider the simple vehicle brake-by-wire system shown in figure 2.
Braking inputs from a pair of sensors on the driver’s foot pedal are
converted by the processing electronics into digital values available to
the software in particular registers in the I/O map. The software
includes a routine which reads these registers and calculates a required
braking value which is placed in an output register, from where further
electronics convert it into drive signals to the brake actuator. The
program code is stored in part of the Program ROM and runs on the
processor, using some of the RAM locations as work space. If braking
control is safety critical, then all these parts of the system must be
regarded as intrinsically critical.

The system also includes a memory management unit, which performs the
mapping between logical and physical memory locations. The registers
which define this mapping are regarded as primary controls; they are not
actually involved in the braking calculations, but directly affect
resources which are.

The software is cyclic, and its periodic execution is triggered by the
arrival of an interrupt from the timer. This interrupt is a critical
event, and the device registers which control the timer are also
considered to be primary controls.

The example brake-by-wire system incorporates other functions which are
not critical, including functions to display system status on the
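[Figure: block diagram. The brake pedal feeds the Input 1 and Input 2
registers. The Output register drives the wheel brakes, and the Status
Out register drives the driver displays, each through output
electronics / actuation. Blocks labelled inside the Brake-by-wire
Computer System: MMU, CPU, Critical RAM, Non-critical RAM, Critical ROM,
Non-critical ROM, I/O Registers (Input 1, Input 2, Output, Status Out),
Registers, Timer.]

                              Figure 2 – Simple Vehicle Brake-by-wire System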

dashboard. The parts of the program ROM containing the code for these
functions, the status output register, and RAM locations used only by
the status display functions may be regarded as non-critical resources.
If we assume that the software is implemented as a set of application
functions running on top of an operating system, then the operating
system must be capable of providing suitably segregated access to all
the system resources.

Resource Dependencies: From the descriptions of the resource classes, it
is clear that there are dependencies between resources; that is, the
state of one resource affects the behaviour of another. Indeed, this is
explicit in the definition of primary and secondary control resources.
However, there are other, less direct dependencies which are also vital.
The most significant of these is that, in most systems, there must be an
initialisation phase, in which the software configures the hardware to
the state required for the execution of the main body of the
application. However, this initialisation code is, itself, run on the
very hardware it is configuring, so a circular dependency is created
(figure 3).

For example, when the brake-by-wire system is powered up, the MMU
registers will contain the manufacturer’s defaults, and will therefore
not correctly describe the boundaries between the critical and
non-critical RAM and ROM areas. The initialisation code must set these
boundary values. However, the MMU is used in all memory accesses from
the CPU, and will therefore be influential in the execution of the code
which initialises its own registers.

A complete safety argument for the system must therefore demonstrate
that the system powers up in a safe state, and respects minimum safety
requirements throughout every stage of initialisation. To guarantee the
correct execution of the application, it must also be shown either that
successful completion of the initialisation guarantees that the hardware
is correctly configured, or that it is impossible (or at least extremely
improbable) that the main body of the software could fail to detect, and
safely respond to, any incorrectness in its execution environment.

Safety Arguments for Resources: Having identified the timing events and
resources in the system, and assigned appropriate criticality classes,
we must now demonstrate the acceptability of their implementation and
use. The arguments made must consider both normal (intended) operation
and the effects of failures. There are many fault tolerance strategies,
and our purpose here is not to discuss these, but rather to consider
some general properties which are relevant whatever the system
architecture.
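
To make the resource model concrete, the sketch below records a memory
map for the brake-by-wire example and assigns each region one of the
five criticality classes. It is only an illustration: the paper gives no
memory map, so every address, every region name, and the names
Criticality, Region, MEMORY_MAP, classify and unused_ranges are
assumptions. Any address not covered by a region falls into the unused
class, which is one way of checking that the model is complete.

```python
from dataclasses import dataclass
from enum import Enum


class Criticality(Enum):
    """The five resource criticality classes identified in the text."""
    INTRINSICALLY_CRITICAL = "intrinsically critical"
    PRIMARY_CONTROL = "primary control"
    SECONDARY_CONTROL = "secondary control"
    NON_CRITICAL = "non-critical"
    UNUSED = "unused"


@dataclass(frozen=True)
class Region:
    name: str
    start: int  # first address, inclusive
    end: int    # last address, inclusive
    criticality: Criticality


# Hypothetical memory map: all addresses are invented for illustration.
ADDRESS_SPACE_END = 0xFFFF
MEMORY_MAP = [
    Region("critical ROM", 0x0000, 0x3FFF, Criticality.INTRINSICALLY_CRITICAL),
    Region("non-critical ROM", 0x4000, 0x7FFF, Criticality.NON_CRITICAL),
    Region("critical RAM", 0x8000, 0x8FFF, Criticality.INTRINSICALLY_CRITICAL),
    Region("non-critical RAM", 0x9000, 0x9FFF, Criticality.NON_CRITICAL),
    Region("brake I/O registers", 0xF000, 0xF00F, Criticality.INTRINSICALLY_CRITICAL),
    Region("status output register", 0xF010, 0xF01F, Criticality.NON_CRITICAL),
    Region("MMU registers", 0xF100, 0xF10F, Criticality.PRIMARY_CONTROL),
    Region("timer registers", 0xF110, 0xF11F, Criticality.PRIMARY_CONTROL),
]


def classify(address: int) -> Criticality:
    """Return the class of an address; gaps in the map count as unused."""
    if not 0 <= address <= ADDRESS_SPACE_END:
        raise ValueError(f"address {address:#x} is outside the address space")
    for region in MEMORY_MAP:
        if region.start <= address <= region.end:
            return region.criticality
    return Criticality.UNUSED


def unused_ranges():
    """List the (start, end) address ranges not covered by any region."""
    gaps = []
    next_free = 0
    for region in sorted(MEMORY_MAP, key=lambda r: r.start):
        if region.start > next_free:
            gaps.append((next_free, region.start - 1))
        next_free = max(next_free, region.end + 1)
    if next_free <= ADDRESS_SPACE_END:
        gaps.append((next_free, ADDRESS_SPACE_END))
    return gaps
```

In such a sketch, the ranges returned by unused_ranges() are the
locations for which the safety argument must show that any access
attempt is trapped and handled safely.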
[Figure: resource hierarchy. All resources are divided into EVENTS
(interrupts, output events) and MEMORY (ROM, RAM, CPU regs, I/O regs).
Intrinsically critical resources: master cycle clock, program ROM, stack
RAM, critical variables. Primary control resources: timer registers, MMU
registers, bus arbitration control registers. Initialisation routines
use ROM, RAM and CPU regs. Annotation: initialisation routines for
primary control resources use system resources, and dependencies become
cyclic.]

                        Figure 3 – Illustration of Cyclic Dependencies in System Resources
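
The circular dependency illustrated in figure 3 can be found
mechanically once dependencies between resources are written down as a
graph. The following is a minimal sketch, not part of the technique as
published: the DEPENDS_ON table is an assumed reading of figure 3 (an
entry "A": ["B"] means that the correct behaviour of A depends on B),
and find_cycle is a standard depth-first search introduced here for
illustration.

```python
# Assumed dependencies, loosely based on figure 3 (hypothetical edges).
DEPENDS_ON = {
    "program ROM": ["MMU registers"],
    "stack RAM": ["MMU registers"],
    "master cycle clock": ["timer registers"],
    "MMU registers": ["initialisation routines"],
    "timer registers": ["initialisation routines"],
    # The initialisation code itself runs from ROM and uses RAM and CPU registers.
    "initialisation routines": ["program ROM", "stack RAM", "CPU registers"],
}


def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for dep in graph.get(node, []):
            state = colour.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so the path loops back to it.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

With the assumed table, find_cycle(DEPENDS_ON) reports the loop
program ROM -> MMU registers -> initialisation routines -> program ROM,
which is the dependency that the initialisation argument must break,
for example by showing that the power-up defaults of the MMU are safe
for the code that reconfigures it.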

Because the criticality of data and calculations is application
dependent, it is not possible to make general arguments for the safety
of intrinsically critical resource usage based on knowledge of the
hardware and operating system alone. It is entirely the responsibility
of the application designer to demonstrate that the intended (normal)
operation of the system is safe. However, study of the underlying
hardware is very important in understanding the behaviour of the system
in the presence of failures. There are two categories of failure to
consider: failure of the hardware implementing the resource itself, and
failures in the configuration and protection of the resource arising
from faults in primary control resources.

In both monolithic systems and those with operating systems, the
arguments which can be made for the tolerability of hardware failure in
intrinsically critical resources depend upon two factors: the
improbability of that failure, and the provision of appropriate
protection mechanisms for critical data and operations within the
application. For example, in a simple, single channel system, the
integrity of critical data held in RAM may be checked by storing the
same value in two locations and comparing them before use. For temporary
storage (e.g. intermediate results in critical calculations) this may
not be viable, so the calculation may be repeated, or the results of two
alternative algorithms compared. The effectiveness of these strategies
depends upon the improbability of two hardware failures resulting in
identical but incorrect results.

An alternative strategy may be to argue that intermediate values are
stored for so little time that the effective exposure to random hardware
failures is negligible. This argument may prove fallacious if a single
calculation contains many steps which use previous intermediate results,
or if the calculation is repeated frequently, in which case the
proportion of the time for which values are stored in the same temporary
location may prove too high.

Many of the possible resource protection strategies in application
software depend upon the ability to control, or make use of, features of
the hardware. For example, storing two copies of data gives greater
protection if the two locations are in separate devices (RAM chips),
potentially avoiding sources of common mode failure such as faulty
address decoding or ‘stuck at’ faults on individual devices. This is
relatively easy to achieve in monolithic software, where the programmer
has direct control over the hardware. For an operating system to offer
similar protection, it may be necessary to implement special features
(possibly with complementary compiler directives) to provide the
application programmer with the necessary control.
However, the benefit of this will be that it is also possible to provide
generic 'argument fragments' or patterns, which can be applied each time
the feature is used.

Failures of intrinsically critical resources arising from faults in, or
incorrect management of, primary control resources are of particular
concern because they have the potential to cause common mode failures,
possibly invalidating any of the above arguments. However, this is an
area where generic arguments may be made about the interaction of the
operating system and the hardware; as there will normally be relatively
few of these, it is feasible to devote significant time and effort to
analysis.

So, for intrinsically critical resources, we have identified three
essential strands to a safety argument:
1. Safety of normal usage – argument is responsibility of application
   developer.
2. Toleration of hardware failure – argument is responsibility of
   application developer, but with support from hardware and operating
   system analysis.
3. Correct management via primary controls – argument is primarily
   responsibility of hardware and operating system analysis.

For primary control resources, the safety arguments that can be made
depend on so many factors that it is impossible to give general
guidelines for the safety argument. If there is a secondary control
resource which duplicates the behaviour of the primary for redundant
protection (e.g. an external device which duplicates the memory
protection functions of the MMU), it may be sufficient to argue safety
simply from improbability of coincident failure, provided that
initialisation is not a potential source of common failure. More
probably, it will be necessary to identify means of detecting and
managing the effects of a failure. For example, if the MMU is
incorrectly configured so that a process is denied access to memory
locations it requires, can it be shown that the resultant errors are
always trapped and lead to the system taking appropriate action?

Similarly, for secondary control resources, the argument will depend
entirely on the role of the resource, and no general guidance can be
given. Only one argument is normally required for non-existent
resources. As any attempt to access such a location is necessarily an
error, the argument must show how the system will trap and respond to
such an attempt.

Failure modes to consider in safety arguments: We have now described a
system model based upon identifying and classifying resources, and
briefly discussed the types of safety argument which may be made for
each. Most of these arguments require an assessment of the effects of
failures, and it is now necessary to consider an appropriate model of
failure. Again, it is infeasible to consider every type and cause of
failure of each device individually, so it is necessary to make another
abstraction.

We base our approach on a small number of hypothetical failures, based
on research into the classification of computer system failures (refs.
1-2). We have previously used these same hypothetical failure categories
as guide words for a HAZOP-style analysis of software designs (refs.
3-4), and this experience has shown that these categories are applicable
to many aspects of computer systems. The failure categories are
introduced below, together with the interpretations we have used in our
case studies. Clearly, as this approach is new, we cannot be certain of
the completeness and adequacy of these interpretations, but comparison
with results obtained by other analysis techniques (e.g. more
traditional Functional Failure Analysis at device block diagram level)
has given us reasonable confidence that they provide a sound basis for
analysis.

The failure categories we identify are based on the concept of a
service, that is, the provision of a particular value at a specific
time. They are:
• Omission – the value is never provided
• Commission – a value is provided when it is not required (i.e. a
  perfectly functioning service would have done nothing)
• Early – the value is provided before the time (either real time, or
  relative to some other action) at which it is required
• Late – the value is provided after the time at which it is required
• Value – the timing is correct, but the value delivered is incorrect.

The interpretations of these categories used for timing events are:
• Omission – the failure of an event to occur. In       treated as an omission, e.g. by triggering a
  a multi-processor system where events                 bus timeout.
  affecting more than one processor are               • The value of a resource is its data content.
  possible (e.g. broadcast interrupts or                For control resources, the correct value can
  synchronisation events), it is necessary to           often be determined in advance, and the
  consider symmetric (where no recipient                effects of changes predicted. In the case of
  responds to the event) and asymmetric (one            memory (RAM or ROM) the effect of
  or some recipients respond) omission.                 unwanted changes can only be determined
• Commission – the spurious occurrence of an            with knowledge of the application software.
  event. Again, there may be different cases to
  consider in a multi-processor system. It may        Analysis Steps: The numbered steps below
  also be necessary to consider whether the           summarise our approach to conducting an
  spurious event is a repetition of an expected       analysis based on the principles outlined.
  event, or insertion of something completely         1. Identify resources from system design
  unexpected or out of sequence.                         documentation, e.g. memory maps.
• Early and Late take the obvious                     2. Describe the function and usage of each
  interpretations.                                       resource. Many of the resources can be
• Value has no meaningful interpretation for             grouped together at this stage; for example,
  timing events.                                         blocks of memory with the same function.
                                                      3. Classify resources to help decide argument
Interpretations for physical resources are:              requirements.
• Omission and commission – interpreted as            4. For each resource, consider each applicable
    access permission violations. An omission            failure mode:
    failure occurs if a process which should be          4.1. Describe the exact failure mode(s) being
    able to access a resource is denied                        considered.
    permission. Commission failure occurs where          4.2. Describe the effects of the failure,
    a process is granted access to a resource                  considering hardware and software
    which it should not have.                                  response, existing prevention, detection
• Early has two interpretations in the case of a               and mitigation mechanisms.
    physical device such as memory, both leading         4.3. Decide whether an acceptable argument
    to (unpredictably) corrupt data:                           for safe handling of this failure mode
    • the processor reads from a location in the               can be made.
        device, and attempts to latch the data from      4.4. If argument exists, record it, otherwise
        the bus before it is stable, or                        propose necessary design revisions.
    • the processor writes to a location in the          4.5. If necessary, repeat 4.1 - 4.4 for system
        device, and de-asserts data before the                 initialisation.
        device has latched it correctly.              5. Repeat until acceptable arguments have been
    These may seem unlikely failures, and only           made for all failure modes of all resources.
    possible with poor hardware design.
    However, there are systems in which               The natural format for recording the results of the
    parameters such as the number of wait states      analysis a large table. Note that we find it helpful
    inserted on accessing a particular device are     to add a 'check' column in which to note whether
    programmable; the system can dynamically          the argument obligations for a particular resource
    alter its own timing characteristics. In such     have been fully discharged. Obviously, no
    systems, this type of timing failure is           intrinsically critical resource can be considered
    plausible and extremely important.                to be fully discharged until arguments have been
• Late refers to delay in accessing the resource,     completed for each of the primary controls upon
    arising either from effects such as contention    which it depends.
    for a shared bus, or from the same type of
    configuration fault that could lead to early      In practice, many of the mechanisms in step 4.2,
    failures. In general, lateness will not cause     and the related arguments, will be found to be
    data corruption, and is only of interest in our   generic, and step 4.3 will consist simply of
    analysis if the delay is great enough to be       adding an appropriate reference. Some of these
                                                      generic mechanisms may be identified before the
                                                      main body of the analysis is commenced, as they
have been specifically provided as features of the design; others will emerge
as analysis proceeds. It is important that any mechanisms identified in
advance are thoroughly investigated to ensure that they function as intended.

                   Case Study

The major case study used in the development and testing of this analysis
approach was a large, multi-processor avionics system. The system was
required to provide one safety critical function, and a large number of
functions of lower criticality. The system was developed in accordance with
UK defence standards, requiring the critical function to be developed to the
highest integrity level, with the other functions developed to lower
integrity levels as appropriate.

The basic structure of the computer system hardware is shown in figure 4. The
processors were arranged in pairs, each processor having its own private bus,
giving access to RAM, ROM and timers, and to the arbitration logic for access
to the shared local and system buses. In addition, a secondary MMU on each
private bus provided redundant protection of critical memory areas.

[Block diagram not reproduced. Labels recoverable from the diagram:
Processor, Second MMU, Timers, Private RAM, Private ROM, Private bus,
Arbitration, Local bus, Shared RAM, Shared ROM, I/O, System bus.]

              Figure 4 – Case Study System Hardware Architecture

For reasons of efficient processor utilisation, a segregated operating system
was used, with high and low integrity functions executing on the same
processor. The application software ran in an environment which provided high
integrity system initialisation, scheduling and memory protection. In order
to ensure system consistency and guarantee the critical function access to
the system bus and I/O when required, the system employed synchronised cyclic
schedules, executing a high integrity code segment on all of the processors
at the same time in each cycle.

Our analysis technique was applied to investigate whether there was any way
in which hardware, scheduling or protection systems could fail in such a way
that the three principles of safe operation outlined in the introduction
could be violated. Some specific areas of concern, including bus arbitration,
high integrity data I/O and the management of asynchronous events, were
identified before the study started, and particular attention was paid to
these during the analysis.

Another specific requirement was that the analysis produced should be as
independent of the application software as possible, to permit changes to the
application without the need for extensive repeat analysis. This was achieved
by assuming that all application software except that of the critical
function itself would always behave in the worst conceivable way, i.e. any
failure of resource protection or scheduling would always result in the lower
integrity software causing the maximum possible interference to the operation
of the critical function. If satisfactory protection of the critical function
could be demonstrated under this assumption, no future change to the lower
integrity software could invalidate the safety arguments.

Since the analysis approach was novel, the certifying authority for the
system was involved at an early stage, and their agreement sought for the
principles and method to be used. Figure 5 shows a fragment of the output
produced.

Case Study Results: The analysis of the study system took approximately nine
man months to complete, distributed over a considerably longer elapsed
period. About one third of this time was spent in understanding the design of
the system, and the rest in analysis and report writing. During the course of
the analysis, a number of issues were identified, and these resulted in a
number of revisions to the system design.

The final report was around 60 pages in length, most of which was occupied by
tables summarising the analysis results. This report was submitted to the
certifying authority as part of the review package. The comments received in
response were detailed and technical in nature, and concentrated on the
system rather than the analysis approach, which was seen as generally sound
and useful. The only specifically analysis-related request made was that the
response of the software to certain hardware failures should be described
more completely.

The analysis identified a number of features of the system which facilitated
the safety argument. These included:
• An easy to achieve safe state in which the system could stop. This meant
  that proving that detected failures were handled correctly was relatively
  easy. The task would have been far harder had there been a requirement for
  continued operation.


 Location:               EFC8 - Timer T1 Counter Load Register
 Use / Criticality:      Timer T1 is used as the master frame timer on Processor 1 / Primary Control
 Guide Word         Deviation                    Causes                     Summary of Acceptance Arguments
 Omission           Denial of read access to     Failure of processor       • Only write access needs to be protected; there are no
                    any software; denial of      MMU or protection built       safety implications of allowing any software read
                    write access to Class 1      in to Timer hardware.         access to the timer.
                    software.                                               • Failure of one device may lead to refused write
                                                                               access; this will lead to bus exception being raised,
                                                                               followed by orderly shutdown of affected processor.
 Commission         Granting of write            Simultaneous failure of    • Requires simultaneous failure of both access control
                    permission to software       processor MMU and             mechanisms. Note that these are separately
                    other than Class 1.          protection built in to        initialised by different code from two separate
                                                 Timer hardware.               configuration tables.
 Early              Processor attempts to        Timer hardware failure     • Timers are read for two reasons:
                    latch data off bus before    (waitstates for Timer      1. to check how much time is left in a frame -
                    it has stabilised.           access are controlled by        incorrect reading will lead to loss of
                                                 the timer hardware).            synchronisation between processors at end of
                                                                                 frame. Processors remaining in sync. will force
                                                                                 orderly shutdown of affected processor
                                                                            2. in continuous built-in test to check timer status -
                                                                                 incorrect reading will lead immediately to orderly
 Late               Excessive latency            Timer hardware failure     • Local device - no arbitration delays
                    between device access        (waitstates for Timer      • Worst case will lead to bus exception being raised,
                    and data read.               access are controlled by      followed by orderly shutdown of affected processor.
                                                 the timer hardware).
 Value              Incorrect timer setting      Timer hardware failure,    • May be detected by CBIT, or result in loss of
                                                 or corruption of              synchronisation between processors at end of frame.
                                                 initialisation data.          Arguments for Early apply.

                                           Figure 5 – Example Analysis Output
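
The record layout of Figure 5, together with the 'check' column described under Analysis Steps, can be sketched as a small data structure. This is only an illustrative sketch, not the tooling used in the case study: the class and field names (`GuideWord`, `Entry`, `ResourceRecord`, `depends_on`) are assumptions introduced here, and the `critical_block` resource is hypothetical. Only the Timer T1 text is taken from Figure 5.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class GuideWord(Enum):
    """Failure-mode guide words, one per row of the Figure 5 table."""
    OMISSION = "Omission"
    COMMISSION = "Commission"
    EARLY = "Early"
    LATE = "Late"
    VALUE = "Value"


@dataclass
class Entry:
    """One table row: a deviation, its causes and the acceptance arguments."""
    deviation: str
    causes: str
    arguments: List[str] = field(default_factory=list)


@dataclass
class ResourceRecord:
    """All rows for one resource, plus the resources it depends on."""
    location: str
    use: str
    criticality: str
    applicable: List[GuideWord] = field(default_factory=lambda: list(GuideWord))
    entries: Dict[GuideWord, Entry] = field(default_factory=dict)
    depends_on: List["ResourceRecord"] = field(default_factory=list)

    def outstanding(self) -> List[GuideWord]:
        """Applicable guide words with no recorded acceptance argument."""
        return [g for g in self.applicable
                if g not in self.entries or not self.entries[g].arguments]

    def check(self) -> bool:
        """'Check' column: own rows argued and every dependency checked."""
        return not self.outstanding() and all(d.check() for d in self.depends_on)


# Timer T1 from Figure 5, with only the Commission row filled in so far.
timer = ResourceRecord(
    location="EFC8 - Timer T1 Counter Load Register",
    use="Master frame timer on Processor 1",
    criticality="Primary Control",
)
timer.entries[GuideWord.COMMISSION] = Entry(
    deviation="Granting of write permission to software other than Class 1",
    causes="Simultaneous failure of processor MMU and Timer protection",
    arguments=["Requires simultaneous failure of both access control mechanisms"],
)

# Hypothetical intrinsically critical resource, invented for illustration.
critical_block = ResourceRecord(
    location="(hypothetical) critical data block",
    use="Data of the critical function",
    criticality="Intrinsically Critical",
    applicable=[GuideWord.COMMISSION],
    entries={GuideWord.COMMISSION: Entry("Unwanted write access", "MMU failure",
                                         ["Generic redundant-MMU argument"])},
    depends_on=[timer],
)

print([g.value for g in timer.outstanding()])
print(critical_block.check())
```

In this sketch the hypothetical critical block has all of its own rows argued, but its check still fails because the timer it depends on has rows outstanding. This mirrors the rule that obligations for an intrinsically critical resource are not discharged until the primary controls it depends on have been fully argued.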
• Redundant MMUs, independently initialised, which permitted a simple and
  sound generic argument for memory protection safety.
• The physical and logical memory maps of the system were the same (i.e. the
  MMUs performed no address translation). It was therefore a simple task to
  identify all the areas of memory that could ever be involved in a critical
  process.
The tight synchronisation between processors meant that many failures could
be detected and mitigated by related events on other processors.

                   Conclusions

We believe that the work described here presents a useful advance in the
range of low-level safety analysis techniques available for computer systems.
The combination of a HAZOP-like approach with a suitable model of the
hardware allows for a very thorough analysis, whilst still remaining
relatively tractable.

However, there are many open issues, both for the improvement of safety
analysis techniques, and for the design of systems such that they are
amenable to analysis. This is of particular relevance to the avionics
industry at present, in view of the industry-wide move towards Integrated
Modular Avionics (IMA). As noted, the analysis of the case study system was
greatly facilitated by certain features of the design, some of which (such as
a safe fail stop state) are not generally possible. Others, such as the
provision of redundant memory protection, could relatively easily be
incorporated into the design of future systems. More work is needed to
identify features like this, and to focus the attention of designers on the
need to incorporate them into their systems.

Particular research challenges for the future development of this work are
how to model and analyse systems with address translation and dynamic memory
allocation. We also need to improve our understanding of arguments which can
be made for the safety of all types of resource usage, so that improved
guidance can be given to engineers attempting this type of analysis in
future.

                   References

1. A. Bondavalli and L. Simoncini, Failure Classification with respect to
   Detection, in First Year Report, Task B: Specification and Design for
   Dependability, volume 2. ESPRIT BRA project 3092: Predictably Dependable
   Computing Systems, May 1990
2. P. D. Ezhilchelvan and S. K. Shrivastava, A Classification of Faults in
   Systems, Technical Report, University of Newcastle upon Tyne, 1989
3. J. A. McDermid and D. J. Pumfrey, A Development of Hazard Analysis to aid
   Software Design, COMPASS '94, Gaithersburg, MD, IEEE Computer Society
   Press, 1994
4. J. A. McDermid, M. Nicholson, D. J. Pumfrey and P. Fenelon, Experience
   with the application of HAZOP to computer-based systems, COMPASS '95,
   Gaithersburg, MD, IEEE Computer Society Press, 1995

                   Biographies

Professor John A. McDermid, High Integrity Systems Engineering Group,
Department of Computer Science, University of York, Heslington, York
YO10 5DD, UK, telephone +44 (0) 1904 432726, fax +44 (0) 1904 432708, e-mail

John McDermid is Professor of Software Engineering at the University of York,
where he runs the High Integrity Systems Engineering group. His primary
interest is safety critical systems in aerospace, and he directs the BAe
Dependable Computing Systems Centre, and the Rolls-Royce University
Technology Centre in Systems and Software Engineering.

David Pumfrey, DCSC, Department of Computer Science, University of York,
Heslington, York YO10 5DD, UK, telephone +44 (0) 1904 433385, fax
+44 (0) 1904 432708, e-mail

David Pumfrey is a Research Associate in the BAe Dependable Computing Systems
Centre, where he is currently investigating the use of HAZOP and related
techniques for software hazard analysis. He is also one of the presenters of
a highly successful series of short courses on system safety and safety
cases.

