How FMEA Improves Hardware and Software Safety & Design Reuse

Nematollah Bidokhti
Cisco Systems
170 West Tasman Drive
San Jose, CA 95134 USA

1.0 Introduction

Today’s market needs and applications demand products with robust
hardware and, more importantly, reliable and deterministic software.

Reliable software is defined as a piece of code, program or application
that produces the same output and operates consistently every time it is
executed. Deterministic software is defined as a software program that,
once it fails, causes the same impact and is detected.

Once the software is determined to be reusable for the desired
application, it is important to perform the appropriate analysis to
identify all possible failure modes associated with the design with
respect to the new environment and its association with the components of
the current architecture. One technique proven useful is the Failure
Modes and Effects Analysis (FMEA).

In this paper we will discuss the following:

•    Hardware and software relationships
•    Design reuse challenges
•    Use of Failure Modes and Effects Analysis (FMEA) as a remedy
•    FMEA requirements and process
•    When and who should perform FMEA
•    What software component should go through FMEA

2.0 Hardware and Software Relationships

There can be situations where software faults are manifested as if they
were hardware failures and cause the loss of output. Therefore, we can no
longer decouple hardware from software in our approach to reuse of system
architecture design and development. As new technologies are developed,
hardware and software are getting closer and software assumes a much more
important role in the system operation.

The relationship between HW and SW FMEA is evident where both HW and SW
FMEAs can be performed at functional, interface and detailed level.

HW and SW FMEA Correlation

Types        Hardware                      Software
Functional   Concept/Preliminary design    During top level software
             phase                         design
Interface    Physical interfaces between   Dissimilar software and
             major system elements         hardware elements
Detailed     Part level                    Code level

Therefore, hardware failure impacts software operation and vice versa. It
must be noted that hardware failures are better understood than software
failures and less dependent on other factors. There are many categories
of software failures such as:

•    Requirements
•    Interface
•    Fault Tolerance
•    Resource Usage
•    Data

These categories are further described in the section called “Software
Failure Modes”.

3.0 Design Reuse Challenges

For the past decade the market has pressured product manufacturers to
reduce their product development life cycle significantly in order to
decrease the time to market, releasing products with significantly new
features.

This time to market requirement has raised the need for design reuse at a
higher level. Therefore, hardware and software designers usually borrow
or build from designs from past products.

In theory this is the right thing to do. It improves time to market and
decreases product development cost. But in fact, if poorly or marginally
designed hardware or software is selected for use in the new product, we
carry the same issues into the new product and increase the probability
of failures in addition to newly created issues.

No designer would like to use a design that is less than solid. But the
time pressure to deliver the product within the allocated time forces
them to select portions of

2006 - ICSR9                                                                              Bidokhti - 1
previous design without performing any risk management activity or
researching the history such as test results or field/operation issues.

The risk increases or decreases based on the company’s development
standard and designers’ skill level. In other words, if there is a clear
development process where designers are required to perform detailed
design reviews, participate in peer code reviews and perform modeling
such as Unified Modeling Language (UML), there is a higher probability
that many design issues and corner cases have been identified. We will
refer to this subject later in this paper.

Following are some of the challenges of design reuse:

Lack of documentation: There are many situations where the software
designer is focused on coding, testing and delivering the software, and
documentation is overlooked. This issue creates a possible
misunderstanding of the original intent of the software, and it increases
the integration time.

Original designer no longer available: It is normal to see designers move
to a new project or leave the company for better opportunities. This will
not be as significant if there is detailed code documentation. Lack of
adequate documentation along with loss of the original developer will
make the design reuse more challenging.

Change in requirements: The original code was developed to meet a set of
requirements. Once any part of the requirements is modified, this will
impact the code to some level. The extent depends on the complexity of
the requirements and implementation method. FMEA could be used to
highlight some of these issues.

Reliability goals: Generally set at the beginning of the project with
respect to the architecture at hand. This could impact the code
implementation methodology.

Performance objectives: Often there are clear performance objectives from
customers, documented in the product requirements document. Design reuse
could impact this metric based on the number of calls to different
software entities and the order in which they are executed.

4.0 What is FMEA?

FMEA is a risk management activity that addresses product safety and
reliability; its purpose is to identify and document all possible failure
modes in hardware and/or software. FMEA should not just be performed on
mission and safety critical systems, as it is a good risk analysis
process that can be applied to various types of systems such as
telecommunication systems. The purpose of the FMEA is not to measure the
reliability of the product. Rather, it is a methodology that helps build
a reliable design and fault management into the product.

How does FMEA impact design reuse? The main benefit is the ability to
identify various failure modes and their impact on the system and provide
a mechanism to quantify the risk of the selected software. Once this
information is prepared and stored in the FMEA database, usage of this
software by any designer at any point in time can be traced, and the
risks that the user of the software needs to consider in their own
application of the software can be understood.

FMEA is a powerful design risk analysis method which can be done at
different levels of hardware such as ASIC, Board, System and Network
level (as applicable), where each level provides specific benefits. But
the FMEA process levels remain the same for hardware and software. As
mentioned earlier, they are functional, interface and detailed.

Software functional FMEA is used to highlight software architectural
changes and identify incorrect software behavior.

Software interface FMEA analyzes the interface/connection between
separate software or hardware elements.

Software detailed FMEA determines the impact of a single variable or
command failure. Detailed FMEA generally applies to products that do not
take advantage of memory protection in the hardware.

The following table shows some of the standard failure modes that can be
applied to any software functional and interface FMEAs.

FMEA Type     Failure Modes
Functional    Failure to execute
              Incomplete execution
              Execution at an incorrect time
              Errors in the software element’s assigned functioning
Interface     Failure to update a value
              Incomplete update of the value
              Value updates occur at an incorrect time
              Errors in the values or messages
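
The standard failure modes in the table above lend themselves to a
checklist that seeds an FMEA worksheet. The following is a minimal Python
sketch under stated assumptions, not part of any FMEA standard: the names
STANDARD_FAILURE_MODES, WorksheetRow and seed_worksheet and the element
names route_manager and alarm_logger are hypothetical, and grouping the
last four modes under "interface" is assumed from the sentence that
introduces the table.

```python
from dataclasses import dataclass
from itertools import product

# Standard failure modes from the table above, keyed by FMEA type.
# Assumption: the second group of four belongs to the interface FMEA.
STANDARD_FAILURE_MODES = {
    "functional": (
        "Failure to execute",
        "Incomplete execution",
        "Execution at an incorrect time",
        "Errors in the software element's assigned functioning",
    ),
    "interface": (
        "Failure to update a value",
        "Incomplete update of the value",
        "Value updates occur at an incorrect time",
        "Errors in the values or messages",
    ),
}


@dataclass
class WorksheetRow:
    element: str       # software element under analysis (hypothetical name)
    fmea_type: str     # "functional" or "interface"
    failure_mode: str  # one of the standard failure modes
    effect: str = ""   # left empty here; filled in by the analysis team


def seed_worksheet(elements, fmea_type):
    """Create one empty worksheet row per (element, standard failure mode)."""
    modes = STANDARD_FAILURE_MODES[fmea_type]
    return [WorksheetRow(e, fmea_type, m) for e, m in product(elements, modes)]


# Hypothetical software elements, used only for illustration.
rows = seed_worksheet(["route_manager", "alarm_logger"], "functional")
for row in rows:
    print(f"{row.element:15} {row.failure_mode}")
```

Each generated row is only a prompt: the team still has to decide whether
the mode applies to the element and record its effect.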

5.0 FMEA Standards and Requirements

Frequently FMEA is a contractual requirement for Military and Defense
products. But in general it is not required in the commercial market.
However, most companies have realized the value of the FMEA and have
adopted it as part of their development process.

Even when FMEA is required contractually, there are no standards for
software FMEA. In general software FMEA is vague. Therefore, companies
have taken the traditional methods used for hardware and modified them
for the software.

The current FMEA standards which are used for hardware are as follows:

•   AIAG
•   SAE J1739
•   SAE ARP5580
•   MIL-STD-1629A

These are used for the following types of FMEA:

•   Design FMEA (DFMEA)
•   Process FMEA (PFMEA)
•   Machinery FMEA (MFMEA)
•   Functional FMEA
•   Software FMEA
•   Criticality Analysis

Since there are no formal standards for software reliability, it is
acceptable to take the format that is suggested in MIL-STD-1629A and
modify it for your own use.

The following is some of the important information that should be
captured in your analysis:

•   Failure identification
•   Failure modes
•   Possible failure causes
•   State of software
•   Detection information
•   Who reports the alarm
•   Who clears the alarm
•   Criticality of the failure
•   Failure rate
•   How will the failure be reported

6.0 Hardware and Software FMEA Similarities

The approaches to hardware and software FMEA are very similar, where the
objective is to highlight possible risks in the design. The differences
between the two methods are:

Metrics        Hardware                       Software

Failure        Failure modes are created      Software failure modes are
Modes          based on a component or a      derived from a line or lines
               group of components that       of software
               make up a function

Failure        An open, short or things       Errors, wrong or incorrect
Cause          like alpha particles and       logic or algorithms, bad data
               gamma rays (these failures     and overflow
               could cause software
               failures in memory)

Failure        Calculation method is based    Based on lines of code,
Rates          on standards such as           complexity, CMM level and
               MIL-HDBK-217 or SR-332         other parameters. Also, there
               which is based on failure      are models that calculate
               rates or physics of failure    software failure rates based
                                              on development

Detection      Using register information     Applying a monitoring
Methodology                                   software component that uses
                                              threads to other software
                                              components to check their
                                              health

Recovery       Retry or replacing the         Re-start / reboot of the
Method         hardware element               software

Fault          components or high speed       Corrupting the SW
Simulation     probes

In hardware design FMEA, it is assumed that all other processes are
performing to the desired expectation. In other words, we do not look at
the failure mode of a digital IC and at the same time bring in the
possibility of a bad assembly process.

If an engineer is performing the hardware design FMEA, he or she expects
to find design or logic related issues and not a bad solder joint.

Similarly, in software FMEA, it should be a requirement that the code be
reviewed before the FMEA starts. The
purpose of this process is not to find syntax errors but rather, for
example, operational and requirement errors.

7.0 Software Failure Modes

As stated in section 2, there are several categories of software
failures. In this section, we will discuss them in more detail.

For instance, requirements failures can be divided into several failure
modes:

    •    Incorrect requirements
    •    Ambiguous requirements
    •    Conflicting requirements
    •    Exceptional condition not specified
    •    Test points or monitors not specified

Let’s take another example such as interface failure modes. In this case
failure modes are divided into three sub-categories:

    1.   Functional interfaces
    2.   Interfaces to hardware
    3.   Message based interfaces

Each one of these sub-categories is based on a series of distinct and
clear failure modes. If we focus on the message based interfaces, we will
find the following:

    •    No message received
    •    Invalid message received
    •    Message received out of sequence
    •    Duplicate message received
    •    Message not acknowledged
    •    Message acknowledged out of sequence
    •    Duplicate acknowledge received

It is essential that, before any FMEA is started, the definition of
failure, the categories of failures and the failure mode types are well
understood and accepted by all team members. These failure modes will
define the roadmap that designers need to follow.

If this step is not addressed, designers will use their own judgment as
to what is considered a failure and perform the analysis accordingly.
This will create a major obstacle to completing the FMEA, mainly where a
number of software components are at stake and, more importantly, their
operation is inter-related.

One of the keys to a successful FMEA is setting the scope and boundary of
the analysis. Identifying and defining failure modes are the major part
of FMEA scope.

8.0 Timing of FMEA and Who should Perform the FMEA

We can perform FMEA at different stages of the product development cycle.
It should be noted that FMEA details and effectiveness are directly
related to when it is performed.

Each phase where the FMEA is performed has its own benefits. Generally,
the earliest point at which software FMEA can be performed is the
architectural level, where different planes (i.e. control, data or
timing) and software components are identified.

The advantage of early FMEA is the ability to define and highlight the
system level failure mode categories, expected behavior and critical
interfaces. It is natural to expect the accuracy of the FMEA in the
concept phase to be limited since most of the detailed design has not
been thought out.

To prepare an accurate FMEA that reflects the actual design, it is best
to have the software designer create and fill in the failure mode
information.

9.0 Role of UML Modeling in Software FMEA

UML modeling plays an important role in performing software FMEA. The
intent of this section is not to provide a tutorial on UML but to
introduce the reader to this tool, which can show the relationships among
different software elements and ease understanding of the design.

The Unified Modeling Language (UML) is a standard language for
representing, specifying and documenting software designs of a system. It
is also used for business modeling. In general, UML has been proven
effective for large and complex systems.

The UML uses graphical notations to represent and express the design of
software projects. Also, it is an excellent communication tool among
engineers to discuss potential designs and possible corner cases and to
validate the architecture.

There are different UML diagrams that have their own application. The
following table summarizes the definitions and applications of each
diagram.

Diagram        Descriptions                   Applications
Types

Use Case       Scenarios describing an        To expose requirements and
               interaction between a user     plan the project
               and a system

Class          The types of objects in a      The classes of the system
               system and their               and their relationships to
               relationships                  each other

Interaction    The behavior of use cases,     Used when you want to model
               described by the way groups    the behavior of several
               of objects interact to         objects in a use case
               complete the task

State          To describe the behavior of    Where it is necessary to
               a system                       understand the behavior of
                                              the object through the
                                              entire system

Activity       Describe the workflow          Analyzing a use case by
               behavior of a system           describing what actions need
                                              to take place and when they
                                              should occur

Physical       Relationship between           when development of the
               hardware and software in a     system is
               system plus software
               components of a system and
               how they are related to
               each other

10.0        FMEA Process

It is good practice to follow a set of housekeeping steps before the
start of any FMEA:

•     Define the scope of FMEA
•     Collect information such as design specifications
•     Create an FMEA guidelines document which, as an example, describes:

            o   Definitions
            o   Goals
            o   Component definition
            o   Interaction definition
            o   How to select failure modes
            o   Sample failure categories and modes (such as
                requirements, interfaces, logic and calculation failures)

•     Prepare a list of failure modes based on product history
•     Create the template or format based on a standard and, if one is
      not available, on an agreed upon format to be used by the team
•     Become familiar with the system and software
•     Develop the Reliability Block Diagram (RBD), which is a breakdown
      of the system to highlight the HW & SW relationships. It is
      generally applied more to hardware.
•     Perform UML modeling of the software to show the relationships
      between SW elements
•     Select the component(s) or subsystem(s) for FMEA
•     Ensure that the selected SW elements have gone through code
      inspection
•     Set up the team and subject matter experts. This group of engineers
      is responsible for developing the failure modes and effects of the
      failures at all levels.
•     Select the facilitator (he or she plays a big role in the process).
      There is a set of distinct qualifications for a person to be the
      facilitator.
•     Develop a database to store the FMEA information (off the shelf
      tools)
•     Perform the FMEA

            o   Highlight design weaknesses
            o   Identify corrective actions as applicable

•     Prepare the FMEA report
•     Implement the corrective actions

The template below (standard 1629) is tailored for hardware FMEA. As
mentioned earlier in the paper, there are no specific standards for
software FMEA. Therefore, it must be modified to the situation at hand.

FMEA Per 1629 Format

Column                        Description

SEQUENCE NUMBER               Reference designation or identification
                              number for each failure

ITEM NAME AND FUNCTION        Type of hardware or function name

FAILURE MODES                 Probable failure modes for each
                              item/function

FAILURE EFFECTS               The consequences of each failure mode on
                              item operation or function

  LOCAL EFFECTS               The immediate impact

  NEXT HIGHER LEVEL EFFECTS   The effects of the failure as it would be
                              seen at the next higher level

  END EFFECTS                 The effect the failure has on the
                              operation, function, availability or status
                              of the system

SEVERITY CLASSIFICATION       A severity classification category assigned
                              to each failure mode depending upon its
                              effects on system operation

FAILURE DETECTION             Description of failure detection mechanism

FAILURE ISOLATION             Description of how the failure, once
                              detected, can be isolated to a single root
                              cause

The following is an example of a SW FMEA based on the 1629 standard, with modifications to its format and list of attributes.

FM ID                         Fabric-1                          Fabric-2
Category                      Protocol                          Interface
Class                         3                                 1
Failure Mode                  Protocol A Switch Failure         Line Card
Description                   Protocol A sends special          LC # parameter in message
                              commands to Logical               received is not
Failure Case                  Path not configured for           Incorrect parameter, LC Number
                              protocol A                        incorrect.
Affected Components           Logical fabric                    Program Controller Module
Chance of Failure             Low                               Low
Failure Rate (FITs)           0.01                              0.01
Bandwidth Loss                2304                              384
Exposure Duration             12                                24
Traffic Impact                of                                Loss of Traffic
Traffic Weight                1                                 1
Total Bandwidth Hours Lost    276.48                            92.16
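The totals in the last attribute of the example are consistent with multiplying the failure rate, bandwidth loss, exposure duration and traffic weight of each row (0.01 x 2304 x 12 x 1 = 276.48 and 0.01 x 384 x 24 x 1 = 92.16). The paper does not state this formula, so the short Python sketch below treats it as an assumption inferred from the two rows. The class name `FailureMode`, its field names and the hours unit for exposure duration are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One row of the example SW FMEA (field names are illustrative)."""
    fm_id: str
    failure_rate: float     # "Failure Rate (FITs)" value, used as given in the table
    bandwidth_loss: float   # "Bandwidth loss" value (unit not stated in the table)
    exposure_hours: float   # "Exposure Duration" value, assumed to be hours
    traffic_weight: float   # "Traffic Weight" value

    def bandwidth_hours_lost(self) -> float:
        # Assumed formula, inferred from the two example rows.
        return (self.failure_rate * self.bandwidth_loss
                * self.exposure_hours * self.traffic_weight)


rows = [
    FailureMode("Fabric-2", 0.01, 384, 24, 1),
    FailureMode("Fabric-1", 0.01, 2304, 12, 1),
]

# Rank failure modes so the largest expected loss is addressed first.
ranked = sorted(rows, key=FailureMode.bandwidth_hours_lost, reverse=True)

for fm in ranked:
    print(f"{fm.fm_id}: {fm.bandwidth_hours_lost():.2f}")
# Fabric-1: 276.48
# Fabric-2: 92.16
```

Sorting on the computed total is one possible way to prioritize failure modes that cannot all be corrected in the same release.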

FM ID                           Fabric-1                            Fabric-2
Local                           None                                Program Controller will activate the
                                                                    wrong Line Card. The correct Line Card
                                                                    will not be activated and will not be
                                                                    included in the program.
System                          Loss of protocol A restoration      Insert signal type A on the Line Card
                                for multiple connections            that should have been
Network                         No recovery on multiple             Insert signal type B or cause a
                                                                    mismatch
Detection Point & Method        Parameter Validation in Logical     Via Interrupt
                                fabric
(no header in source)           None                                Fabric not capable of activating
                                                                    line card
Recovery Method & Process       Retry 3 times                       Reset fabric card
Alarm Raised By                 Protocol module                     Fabric driver
Process of Clearing the Alarm   Reset protocol module               Replace fabric card
Design Recommendation Summary   Design a reporting message          None
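The Fabric-1 entry above pairs a bounded recovery action ("retry 3 times") with an alarm raised by the protocol module. The Python sketch below shows one way such a recovery policy could look in code. It is not the implementation analyzed in the paper: the names `recover_with_retry`, `operation` and `raise_alarm` are hypothetical, and "retry 3 times" is assumed to mean three attempts in total.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # assumed reading of "retry 3 times": three attempts in total


def recover_with_retry(operation: Callable[[], bool],
                       raise_alarm: Callable[[str], None],
                       max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Run `operation` until it succeeds or the attempts are exhausted.

    Returns True on success. If every attempt fails, the alarm callback
    is invoked once and False is returned.
    """
    for _ in range(max_attempts):
        if operation():
            return True
    raise_alarm(f"operation failed after {max_attempts} attempts")
    return False


if __name__ == "__main__":
    alarms = []
    # An operation that never succeeds exhausts the retries and raises one alarm.
    ok = recover_with_retry(lambda: False, alarms.append)
    print(ok, alarms)
    # False ['operation failed after 3 attempts']
```

Passing the alarm handler as a callback keeps the recovery logic separate from whichever module owns the alarm, which matches the table's separate "Recovery Method & Process" and "Alarm Raised By" attributes.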

The following table shows the historical estimated time
to complete an FMEA based on product complexity.

  Estimated FMEA Completion Time based on Complexity (Hours)

  Type          Low    Medium     High     Accuracy
  Functional     40        60      100     Low
  Interface      80       120      160     Medium
  Detailed      120       160    > 200     High

Nematollah Bidokhti

He is a technical leader at Cisco Systems. His
background includes hardware, software and system
reliability engineering, fault management, and system
and network modeling. He has contributed to and managed
reliability activities for military-grade, biomedical,
telephony, optical and data products. He received a
BSEE from Florida Atlantic University.


This paper described the role of FMEA and how it can
improve design safety and reuse. The technique becomes
even more relevant and necessary when companies are
challenged by global competition and must shorten the
product development cycle while reducing cost. The
information presented in this paper can be used to build
reliability into software while still reducing time to
market, with a high degree of confidence that critical,
major and even minor failure modes have been identified
and addressed (corrected or prioritized).

Following are some of the points that readers can take
away from this paper:

•   FMEA is a proactive design analysis that points out
    the key design weaknesses and helps develop the appropriate
•   FMEA can be performed at different levels based on
    project needs
•   FMEA provides a knowledge database of all
    possible failure modes associated with the software
•   Should be performed early in the design process to
    be most effective
•   The main source of possible failure modes is the
    software designer
•   It is good practice to apply code review before FMEA
•   Apply a UML approach to identify relationships and
    interfaces between software modules or components
•   FMEA accuracy is directly related to the phase of
    development and type of analysis
•   Commitment of management is essential to the
    success of FMEA
•   Project schedule and budget need to accommodate
    the execution of FMEA in the development process
•   FMEA has to be taken seriously, and subject matter
    experts' time needs to be spent effectively

