An Overview of a Root Cause Failure Analysis (RCFA) Process
• This presentation covers the fundamentals of a Root Cause Failure Analysis (RCFA) process:
• What RCFA is, and why it is done,
• Types of root causes,
• 7 generic steps in an RCFA investigation,
• Challenges in setting up an RCFA process,
• Setting up and sustaining an RCFA process.
2008 IPEIA Conference, Banff, Canada 1
What and Why RCFA?
A class of problem-solving methods to
eliminate recurrence of failure, or
manage the consequences of failure
• Root Cause Failure Analysis is a class of problem solving methods using a step-by-step
method to discover the basic causes of failure.
• Many commercial solutions exist, with associated training, consulting, and software costs,
but all of these methods share the fundamentals covered in this presentation.
• RCFA is implemented for two fundamental purposes:
• to eliminate the recurrence of a failure, or
• to manage the consequences of a failure should it occur again.
• In many cases, it is not possible to completely eliminate the probability of a failure, and
that is why we consider failure management policies to manage the consequences to a
tolerable risk level.
• Note that RCFA requires a failure to occur before it can be investigated and analyzed. Thus, it
is a reactive means of developing failure management policies. If the consequences of failure
are intolerable, then a proactive method is required.
• These failures typically are classified as sporadic or chronic.
• Sporadic failure events often are one-time events that usually gain significant
attention because they usually involve significant, unexpected, and severe
• Chronic events, unfortunately, are those that are accepted but may have significant
cumulative losses over a long period. Most failure events that occur more than once
should be considered chronic.
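The chronic / sporadic distinction can be sketched against a failure log. The following is a minimal illustration only: the equipment tags, failure modes, and dollar values are invented, and the "more than one occurrence" rule is taken directly from the note above.

```python
from collections import defaultdict

# Hypothetical failure log: (equipment tag, failure mode, loss in dollars).
# Tags, modes, and dollar values are invented for illustration only.
failure_log = [
    ("P-101", "seal leak", 4_000),
    ("P-101", "seal leak", 3_500),
    ("P-101", "seal leak", 5_000),
    ("V-200", "elbow erosion", 250_000),
    ("K-300", "valve sticking", 1_200),
    ("K-300", "valve sticking", 900),
]


def classify_failures(log):
    """Group events by (equipment, mode) and label each group.

    A group with more than one event is labelled chronic, following the
    rule of thumb in the notes; a single event is labelled sporadic.
    """
    groups = defaultdict(list)
    for tag, mode, loss in log:
        groups[(tag, mode)].append(loss)

    summary = {}
    for key, losses in groups.items():
        summary[key] = {
            "class": "chronic" if len(losses) > 1 else "sporadic",
            "events": len(losses),
            "cumulative_loss": sum(losses),
        }
    return summary


summary = classify_failures(failure_log)
for (tag, mode), info in sorted(summary.items()):
    print(tag, mode, info["class"], info["events"], info["cumulative_loss"])
```

Summing the losses per group makes the cumulative cost of an accepted chronic failure visible next to the single large sporadic event.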
What are “Root Causes”?
Physical component or material that failed
Human actions or decisions leading to failure
Reasons why actions or decisions were made
• What are Root Causes? The root cause is the most basic reason for the problem
occurring that, if eliminated, will eliminate the failure event from recurring. We define three
types of root causes: physical, human, and latent. Often, a failure results from one or more
• Physical root causes are the physical components or materials that failed and caused the event. These
are almost always present and are typically the overall physical reason the event occurred.
Traditional “Failure Analysis” is key to determining the physical root cause. Unfortunately, if
we only rely upon it, we will stop too early and implement a physical redesign because all we
know is what physically failed. This is a common error in performing RCFA.
• Human root causes are the last human actions that led to the failure event. These are usually,
but not always, present. Too often, organizations seek out the individual who took the
wrong action and stop there. This is counterproductive and will make people unsupportive
of RCFA because it becomes a “witch-hunt.”
• Latent root causes are the reasons why decisions were made that resulted in the error.
There will usually be more than one latent root, and typically, if these did not exist, then the
human root likely would have been avoided. Examples are organizational systems and
processes that made the human think a certain way and make the improper action.
Eliminating latent root causes will eliminate the failure event, and should be the focus of the
• Generally, analysis teams are hesitant to address the latent root causes because these
are weaknesses in the existing organization. It is important that they are supported in this
approach, and the method is fully understood among all individuals.
7 Steps of RCFA?
Scoping
Preserving evidence and collecting data
ORganizing the analysis
Analyzing
Documenting
Implementing
Confirming
• These seven steps describe the method used in almost any RCFA process. Each is
described in the following slides.
• Note the steps can be remembered by the acronym “SPORADIC” – a classification of
failure discussed earlier.
• Not all of these steps are performed by the same individuals. It is important that RCFA is
viewed as a problem-solving method that spans the organization and beyond!
Scoping
Consequences and risk
Who needs to be involved?
Internal – non-technical functions?
External – jurisdictional bodies?
Formal vs. informal methods
Local vs. external “principal analysts”
• Our method starts with a scoping of the failure. Scoping first evaluates the consequences
of the failure and its risk. Evaluating the risk identifies the consequences for what could
have happened if the failure recurs, and the associated frequency or probability of
recurrence. Doing so allows us to understand the reasonable, worst-case consequences
and either eliminate or manage them.
• Scoping also considers the nature of the failure event and directs the efforts to include the
internal and external functions as required. Internal functions, such as Environment / Health
/ Safety departments, Insurance, Communications, etc., or external functions such as
jurisdictional bodies, may be required to participate in the investigation. In some cases,
clearance to proceed with an investigation is necessary, and good policies defining scoping
ensure appropriate steps are taken.
• The last purpose of scoping determines the level of formality. Small, relatively simple
failure events often follow a straightforward investigative method using local facilitators or
• It is not uncommon that complex failures, or those involving multiple internal and external
parties, use an external and experienced facilitator to provide an unbiased approach. In
some cases, these principal analysts come from an external source such as a jurisdictional
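The scoping decision described above, evaluating consequence and likelihood of recurrence and then choosing a level of formality, can be sketched as a simple scoring function. This is only an illustration: the 1-5 scales, the multiplication of the two scores, and the thresholds are assumptions and are not defined in the presentation.

```python
# Hypothetical 1-5 scales; the presentation does not define a scoring scheme.
CONSEQUENCE = {"minor": 1, "moderate": 2, "serious": 3, "major": 4, "catastrophic": 5}
LIKELIHOOD = {"rare": 1, "unlikely": 2, "possible": 3, "likely": 4, "frequent": 5}


def risk_score(consequence, likelihood):
    """Risk as consequence multiplied by likelihood of recurrence (assumed)."""
    return CONSEQUENCE[consequence] * LIKELIHOOD[likelihood]


def scope(consequence, likelihood, external_parties=False):
    """Return (risk score, suggested level of formality).

    The thresholds 15 and 6 are placeholders. The mapping follows the notes:
    simple failures use local facilitators, while complex failures or those
    involving external parties use an external principal analyst.
    """
    score = risk_score(consequence, likelihood)
    if external_parties or score >= 15:
        formality = "formal, external principal analyst"
    elif score >= 6:
        formality = "formal, local facilitator"
    else:
        formality = "informal, local facilitator"
    return score, formality


print(scope("major", "likely"))
print(scope("minor", "possible"))
print(scope("moderate", "possible", external_parties=True))
```

An organization would substitute its own risk matrix and tolerability criteria for these placeholder values.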
Preserving Evidence and Collecting Data
Most important step!
Basic skills using best practices
Coordinating activities among all parties
Interviewing and taking notes
Collecting logs, databooks, alarm data, etc.
• Preserving evidence and collecting data is the most important step in RCFA. Too often
we work in environments focused on repairing and returning equipment to service as soon
as possible. Within minutes, key evidence is lost or altered.
• Without effective evidence preservation and data collection, an RCFA can become lengthy and
drawn out, identify the wrong root causes, waste resources, and allow failures to recur.
• Developing basic skills using best practices is viewed as a suitable responsibility for
nearly all field operations, mechanical staff, and contractors. These individuals typically
• have the most experience with the equipment,
• are present when the failure occurs, or are first-responders, and
• participate in or coordinate the repairs and/or clean-up.
• Common tasks during this step include
• Coordinating activities to preserve evidence and collect data - at the site and off-site
such as repair shops, labs, etc.
• Interviewing parties, taking notes, and collecting witness statements
• Photographing the overall site, the unaltered scene, damage, and all stages of
• Handling parts, including disassembly methods, cutting or torching to avoid altering
the evidence, and preservation / packaging
• Collecting logs, databooks, alarm data, drawings, manuals, etc.
• Because so many stages are involved with the disassembly, repair, and lab analysis, it is
reasonable to expect this step to span weeks.
ORganizing the Analysis
Facilitator (Principal Analyst)
• Organizing the analysis team usually occurs during preserving evidence but sometimes is
completed afterwards when the failure is better understood.
• Typically, the analysis team consists of a Facilitator or “Principal Analyst” responsible for
managing the analysis team through the analysis stage and documenting the findings /
recommendations. It is usually undesirable for the facilitator to be a technical expert,
since bias may be introduced. Ideally, the Facilitator thoroughly understands the RCFA
method, plans the project, and manages the dynamics among the participants during the
• Participants are individuals with expertise in the equipment (its manufacture, fabrication,
application, operation, servicing, and maintenance). The RCFA flows better and much more
efficiently when Participants have basic training in the RCFA process. Participants are not
expected to be Facilitators.
• Reviewers often are not on the analysis team, but later validate the technical conclusions
leading to the root causes and the technical feasibility of the recommendations.
• Often the implementation of tasks is the responsibility of individuals other than the analysis
team or reviewers. Communication is essential!
Analyzing – Sequence of Events
Good for simple, linear events with few root causes
Evidence-based
Identify means to break the chain of events
[Figure: example sequence-of-events chart. Event boxes: sand carried in gas / liquid stream entering separator (during regular operation); sand building up in separator (since cleaning on Apr 3/04); level controller fouled by sand (around June 1/06); elbow erodes (between June 1/06 through Sept 3/06); 20% LEL alarm; site ESDs close. Evidence notes: liquid levels rising since April 30/06; site logs do not have regular entries and show inconsistent levels; entrained sand; 4” sand found in …; … jammed by sand; eroded wall at …]
• The next step is the analysis. One method is the Sequence of Events method.
• Sequence of events analysis is very useful for
• straightforward problems that have a known sequence of events leading to the failure,
• complex problems where combinations of root causes exist and the approach is to
determine which cause(s) must be eliminated to break the chain,
• establishing timelines and identifying which events require some other analysis tool
such as a logic tree.
• It requires an understanding of what is controllable, and the resulting outcome of the
control, action, or response:
1. Map the sequence of events that lead to the failure,
2. For each event, determine if it is controllable, and if so, what alternatives exist to
change what happens,
3. Compare the alternatives and identify which can be implemented to break the chain,
4. Create recommendations for physical, human, and latent roots contributing to the
sequence of events.
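The four steps above can be sketched in code. This is a minimal illustration, not a definitive tool: the event descriptions are taken from the slide's example chart, but the `controllable` flags, the alternatives, and the 1-5 effort scores are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    description: str
    controllable: bool
    # Each alternative is (description, relative effort 1-5); values are hypothetical.
    alternatives: list = field(default_factory=list)


# Step 1: map the sequence of events leading to the failure.
chain = [
    Event("Sand carried in gas / liquid stream entering separator", False),
    Event("Sand building up in separator", True,
          [("Clean separator on a fixed interval", 2)]),
    Event("Level controller fouled by sand", True,
          [("Inspect level controller during site rounds", 1)]),
    Event("Elbow erodes", True,
          [("Replace elbow with erosion-resistant design", 4)]),
    Event("20% LEL alarm", False),
    Event("Site ESDs close", False),
]


def break_points(chain):
    """Steps 2 and 3: list alternatives for controllable events.

    Returns (effort, position in chain, event, alternative) tuples sorted by
    lowest effort first, with ties broken by the earliest position.
    """
    options = []
    for position, event in enumerate(chain, start=1):
        if not event.controllable:
            continue
        for alternative, effort in event.alternatives:
            options.append((effort, position, event.description, alternative))
    return sorted(options)


# Step 4 starts from this ranked list when drafting recommendations.
for effort, position, event, alternative in break_points(chain):
    print(f"effort {effort}, event {position}: {event} -> {alternative}")
```

Any one of the listed alternatives breaks the chain, so the team compares them rather than implementing all of them.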
Analyzing – Logic Tree
Good for complex / ambiguous problems with
many root causes
• Another analysis method uses a “Logic Tree.” It is very well-suited for complex or
ambiguous problems with many root causes.
• The analysis is managed by constructing a logic tree using structured and simple
questions. These questions are used to
• first define a failure event at the top of the tree,
• identify possible hypotheses for the preceding cause of the failure,
• test the hypotheses using evidence and data collected earlier.
• This hypothesis/verification continues until the trail can be traced back to a latent root for
which a suitable failure management policy can be defined. The next level of hypotheses
must clearly flow from its predecessor (the one before it). If it is clear that a step is missing
between causes it is added in and evidence sought to support its presence.
• Once the logic tree is completed and checked for logical flow, the team then determines
recommendations to prevent the sequence of causes and effects from recurring.
• This method also accommodates a confidence rating based on the accuracy or quality of
Analyzing – Logic Tree
[Figure: logic tree levels and the question asked at each level]
Event: What is the abnormal state of failure?
Modes (Mode 1 to Mode 4): How has this event occurred in the past? What evidence do we have at hand describing what caused the failure?
Hypothesis and Verification (Hyp. 1 to Hyp. 4): How could the preceding event have occurred?
Human root (H, Hyp. 5): What was the action or decision that allowed this physical root to occur?
Latent root (L, Hyp. 6): How do business practices and systems contribute to this thinking?
• This slide demonstrates the simple questions used throughout the logic tree. We will not
discuss the questions in detail considering the time constraints around this presentation.
• In short, the failure event is a straightforward description of the loss of function, not of the failure
• It is followed by asking “How has this event occurred in the past?” (for chronic failures), or “What
evidence do we have at hand describing what caused the failure?” (for sporadic failures). In both
cases, only the facts are listed without any guesses on the causes.
• Hypotheses for physical roots follow using the question “How could the preceding event have
occurred?” so an educated guess can be made. Evidence is used to prove or disprove the
hypothesis. If a hypothesis is disproved, or has a low confidence associated with it, then it is no
longer pursued. Only the developing roots that are proven with high confidence are pursued. This
prevents wasting resources chasing “red herrings.”
• At some point, the question for physical roots no longer makes sense. Usually this is when we
transition into discovering human root causes, and the question becomes “What was the action or decision
that allowed this physical root to occur?” As stated earlier, the analysis does not stop here. This
question only allows us to understand the human root cause.
• Once the human root cause is identified, it becomes apparent that the more suitable question is
“How do business practices and systems contribute to this thinking?” Both internal and external
business practices and systems are within the scope of this question. Simply put, include your
manufacturers, suppliers, vendors, engineers, packagers, distributors, shippers, constructors,
commissioners, operators, and maintainers.
• Typical latent root causes include training, skills verification, operating procedures, standards of
workmanship, time pressures, methods, drawing updates, communications, role and responsibility
definition, work scope definition, work conditions, management of change, holdpoints, inspections,
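The hypothesis-and-verification loop of the logic tree can be sketched as a small tree structure in which each hypothesis carries a confidence rating and low-confidence branches are no longer pursued. This is an illustrative sketch only: the node labels, the 0.0-1.0 confidence scale, and the 0.7 threshold are assumptions and are not part of the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    kind: str  # "event", "mode", "physical", "human", or "latent"
    confidence: float = 1.0  # assumed 0.0-1.0 rating based on the evidence
    children: list = field(default_factory=list)


def verified_paths(node, threshold=0.7, path=()):
    """Yield each path from the top event down to a latent root.

    Hypotheses rated below the threshold are dropped together with
    everything beneath them, so no effort goes into "red herrings".
    """
    if node.confidence < threshold:
        return
    path = path + (node.label,)
    if node.kind == "latent":
        yield path
    for child in node.children:
        yield from verified_paths(child, threshold, path)


# Hypothetical tree; every label and rating below is invented.
tree = Node("Separator fails to contain process fluid", "event", 1.0, [
    Node("Eroded elbow wall", "mode", 1.0, [
        Node("Sand entrained in outlet stream", "physical", 0.9, [
            Node("Separator not cleaned as sand built up", "human", 0.8, [
                Node("No procedure defining a cleaning interval", "latent", 0.8),
                Node("Site logs not reviewed", "latent", 0.75),
            ]),
        ]),
        Node("Corrosion from produced water", "physical", 0.2, [
            Node("Inhibitor injection stopped", "human", 0.9, [
                Node("No management of change review", "latent", 0.9),
            ]),
        ]),
    ]),
])

paths = list(verified_paths(tree))
for p in paths:
    print(" -> ".join(p))
```

Only the two paths through the sand hypothesis survive; the corrosion branch is dropped at the physical level even though the hypotheses beneath it carry high ratings.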
Documenting, Implementing, Confirming
3 stages of communication
Effort vs. likelihood to prevent recurrence
Not all causes need a corrective action
High payback is not uncommon, but surprising!
Long time periods to confirm results
Investigate similar failures
Originated before the RCFA?
• Communicating the analysis involves three stages.
• 1st: a summary of the failure event, the root causes, and the associated
recommendations coming out of the analysis;
• 2nd: which recommendations were selected during the evaluation, how they will be
implemented, when, and by whom;
• 3rd: whether or not the implemented recommendations were successful.
• It is important to understand that not all causes need a corrective action applied to them
to prevent recurrence or to adequately manage the consequences of failure. For example,
a Sequence of Events analysis requires only that the sequence be broken, and often only a few
recommendations with a high impact require implementation. A Logic Tree analysis could
identify a number of root causes, but only a few have technically feasible recommendations
or have such a high impact that the remaining risk of recurrence is tolerable.
• During the selection of recommendations, it is not uncommon to find payback in the range
of 30:1 or higher! Because latent roots deal with organizational systems, policies, and
procedures, the effort to change and manage those is significantly less than complex
• Lastly, note that it may take months, years, or decades to confirm whether the
implementation was successful. Too often, organizations pursuing an RCFA program
expect immediate results within a financial quarter or two. Many failure mechanisms
commence thousands of hours before the failure is recognized. It is foolish to immediately
conclude the implementation was unsuccessful without understanding the root cause of the
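Selecting recommendations by payback can be sketched as a simple ranking. The presentation does not define how payback is calculated, so the ratio used here (expected avoided loss divided by implementation cost) is an assumption, and every recommendation and dollar figure is invented for illustration.

```python
# Hypothetical recommendations:
# (description, root type, implementation cost, expected avoided loss)
recommendations = [
    ("Redesign piping elbow", "physical", 120_000, 150_000),
    ("Add cleaning interval to operating procedure", "latent", 5_000, 150_000),
    ("Train operators on site log entries", "latent", 8_000, 40_000),
]


def rank_by_payback(recs):
    """Return (payback ratio, description, root type), highest ratio first.

    Payback ratio is assumed to be avoided loss divided by cost.
    """
    ranked = [(avoided / cost, desc, root) for desc, root, cost, avoided in recs]
    return sorted(ranked, reverse=True)


ranked = rank_by_payback(recommendations)
for ratio, desc, root in ranked:
    print(f"{ratio:5.1f}:1  {root:8s}  {desc}")
```

In this invented data set the procedural change addressing a latent root ranks first and the physical redesign ranks last, which is the pattern the note describes.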
Attitude – The failure is preventable or manageable!
Learning – History and experience have value
Capacity – Busy repairing vs. busy eliminating
Capability – Follow a simple, well-defined process
with technical support
Expectations – Results may take a long time
Change – Already doing it to a degree
• Now we consider some challenges in setting up an RCFA process.
• First, the culture or status quo likely accepts failures because “stuff happens.” Starting on
small, chronic failures allows for quick demonstration that failures are preventable or
manageable and provides quick, real results.
• Engaging experienced employees through the process fosters a culture of learning from
their experience, and gains “buy-in” by recognizing the importance of their experience.
• Often, being “too busy repairing” is a challenge. The question to be asked of employees,
and demonstrated by their supervisors, is whether they intend to remain busy repairing, or
get busy eliminating. Eliminating chronic failures tends to hit a critical mass where reduced
repair time easily accommodates additional RCFA activities.
• Starting with a straightforward, simple RCFA process that everyone can comprehend, and in which
everyone can identify their responsibilities, is key. Ensure experienced RCFA individuals are available to
train, coach, and do analyses.
• Setting the right expectations is important. As stated earlier, it may take years to confirm
the prevention of sporadic failures, or perhaps months for chronic failures. It is important
that sponsors understand this duration. After implementation it is necessary to ensure the
organization does not slip back to its bad practices.
• Lastly, fear of change is common. Most technical people already do ad-hoc RCFA,
although not to the level of identifying latent roots (typically just physical roots, leading to
expensive redesigns). Building upon existing thought processes is a good start to fine-tune
their skills to this more thorough analysis method.
Setting-Up & Sustaining RCFA
Dedicated Trainer / Coach resource
Training based on roles & responsibilities
Preserving Evidence & Data Collection
Participant / Facilitators / Reviewers
Leadership & Active Sponsorship
Assign resources / select / implement / track
Chronic vs. Sporadic problems
Sensible and achievable method
Learn the method before a software tool
Focused efforts in a “friendly sandbox”
• Here are some considerations to set up and sustain an RCFA process. First, dedicate at
least one individual as a trainer, coach, and “doer” until a larger network of facilitators is
• Establish competencies among your field staff by setting up training for Preserving
Evidence & Data Collection. Within EnCana, we have an online e-learning module
supported with a Quick Reference Card to make training accessible to nearly everyone.
• Train your principal analysts (facilitators), reviewers, and participants (generally your
technical specialists) in the RCFA method. Preparing them for the analysis ensures their
fluency in RCFA terminology and working with a common methodology.
• Ensure you have leadership and active sponsorship for the RCFAs. Success during the
first few analyses is essential to demonstrate the efficiency, simplicity, and effectiveness of
• Focus on chronic problems since these have a faster confirmation of results.
• Pick a sensible and achievable method of doing your RCFA work. Many commercialized
methods exist, but you must consider scalability costs and suitability of training for all roles
• Learn the method before attempting to use software as a tool. Developing the thought
processes is more important!
• If possible, start in a “friendly sandbox” surrounded with sponsors and peers who
understand there will be glitches, but will accept these as the process is tuned.
Acknowledgements & Further Reading
“Root Cause Analysis: Improving Performance for Bottom-Line Results, 2nd ed.”
Robert Latino, Kenneth Latino
“Root Cause Failure Analysis”
R. Keith Mobley
• The two books above are recommended if you are seeking additional information on
RCFA. The first book (Latino) presents the logic tree analysis in the PROACT methodology.
The second book (Mobley) presents the sequence of events analysis tool.
• In conclusion, I encourage you to implement an RCFA process if you have not done so
yet. Strive to discover the latent root causes within your organization and externally. By doing
so, you will avoid pursuing many expensive physical redesigns and realize significant
reductions in your environmental / safety incidents and production costs.