Risk & Reliability: An Application of High Reliability Theory
TIAS Lecture, 27 April 2006 – afternoon session: Risk Management
Gerd Van Den Eede, Department of Business Administration, VLEKHO Business School – Brussels
firstname.lastname@example.org
PhD candidate at UvT, supervised by Prof. Dr. P. Ribbers & Dr. B. Van de Walle

Overview
• An Era of Complexity
• On Reliability
• On Error
• On Operational Risk
• The Rationale of Reliability
• High Reliability Theory

An Era of Complexity

Probability of performing perfectly in complex systems

  # of steps |  probability of success of each element
             |  0.95     0.99     0.999    0.9999
       1     |  0.95     0.99     0.999    0.9999
      25     |  0.28     0.78     0.975    0.997
      40     |  0.12     0.66     0.96     0.995
     100     |  0.006    0.37     0.90     0.99

The Systems Space
[Figure: "The Systems Space" (Bennet & Bennet, 2004) – the knowledge required to understand a system grows as it moves from Simple through Complicated, Complex, and Complex Adaptive to Chaotic.]

Cynefin (kun-ev'in) Framework

Ordered domains:
• Known: cause-and-effect links are strong: they are real, discoverable, and repeat in predictable ways. We know what these are.
• Knowable: cause-and-effect relationships exist, but with delays in space and time. We can still know what they are if we invest energy and resources.

Un-ordered domains:
• Complex: cause-and-effect relationships exist, but change so frequently that there are no predictive models. Retrospective coherence; the domain of patterns.
• Chaos: no obvious, perceivable cause-and-effect links. There is nothing to analyse, and reactions to interventions are unpredictable. Order is needed.

Interaction/Coupling Matrix (Ch. Perrow, Normal Accidents, 1984, p. 327)

                 INTERACTIONS
                 Linear    Complex
  COUPLING
  Tight            1          2
  Loose            3          4

Reflection 1
• Can you think of a process within your organization that is characterized by complexity and/or tight coupling?
• Why is this so? How does the organization deal with it?

On Reliability

Defining "Reliability"
1. The measurable capability of an object to perform its intended function in the required time under specified conditions. (Handbook of Reliability Engineering, Igor Ushakov, editor)
2. The probability of a product performing a specified function without failure, under given conditions, for a specified period of time. (Quality Control Handbook, Joseph Juran, editor)
3. The extent of failure-free operation over time. (David Garvin)

Quantifying "Reliability"
• "Reliability" = number of actions that achieve the intended result ÷ total number of actions taken
• "Unreliability" = 1 − "Reliability"
• It is convenient to use "Unreliability" as an index, expressed as an order of magnitude (e.g. 10^-2 means that 1 time in 100, the action fails to achieve its intended result).
• Related measure: time or counts between failures, for example transplant cases between organ rejections, or employee work-hours between lost-time injuries.

Different Views on Reliability

  Reliability   Unreliability   "Sigmas" (approximate)
  0.9           10^-1           1
  0.99          10^-2           2
  0.999         10^-3           3
  0.9999        10^-4           4
  0.99999       10^-5
  0.999999      10^-6
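The two tables above can be recomputed directly: the probability of a perfect run through n independent steps is the per-step success probability raised to the power n, and unreliability can be read as an order of magnitude. The following Python sketch is an illustration added to this handout, not part of the original slides; the function names are my own.

```python
# Illustrative sketch: serial reliability and the unreliability index.
import math

def p_perfect(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps independent steps succeed."""
    return p_step ** n_steps

def unreliability_order(reliability: float) -> int:
    """Order of magnitude of unreliability, e.g. 0.99 -> -2 (1 failure in 100)."""
    return round(math.log10(1.0 - reliability))

step_probs = [0.95, 0.99, 0.999, 0.9999]
print("steps" + "".join(f"{p:>9}" for p in step_probs))
for n in (1, 25, 40, 100):
    print(f"{n:>5}" + "".join(f"{p_perfect(p, n):>9.3f}" for p in step_probs))
# e.g. 0.95 per step over 100 steps -> ~0.006, as in the table above.

for r in (0.9, 0.99, 0.999, 0.9999):
    print(f"reliability {r}: unreliability ~ 10^{unreliability_order(r)}")
```

The sketch makes the lecture's point visible: at 100 steps, even 99% reliable elements leave only a 37% chance of a perfect run.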
Combining Flexibility & Reliability
[Figure: reliability and flexibility represented as vectors; see www.physicsclassroom.com/mmedia/vectors/va.html]

Reflection 2
• What is meant by "reliability" in your organization? Is this always the case? On what does it depend?
• What measures are in place to guarantee this reliability?
• How does reliability relate to flexibility?

On Operational Risk

Criticality of resources depends on the business processes running on these systems! (Marc Geerts, KBC, 2004)

Definition of Operational Risk (Basel II)
"Operational Risk ('OpR') is the risk of loss resulting from inadequate or failed internal processes, people or systems, and from external events." (BIS – Basel II)
The BIS/EU definition includes legal and tax risk; it excludes strategic, reputational, and systemic risks. (KBC, Ph. Theus, July 2004)

Operational risk & other risk categories
[Figure: operational risk in relation to the other main risk categories – business risk, credit risk, liquidity risk, and market risk (Raft International, 2003).]

Examples of operational risks (KBC, Ph. Theus, July 2004)
– Wrong pricing model (formula) used by dealers
– Double or non-(timely) execution of payments
– Collateral not properly executed
– Losses due to internal or external fraud
– Selling the wrong product to the wrong type of customer
– Selling products without proper authorisation or outside the scope of a given license
– Fire, flooding, terrorism
– Etc.

Recent losses in the financial industry (erisk.com)

  > $1.8 billion    Sumitomo (copper trading)
  $1,450 million    Kashima Oil (currency derivatives)
  $1,390 million    Barings (Nikkei futures & options)
  $1,340 million    Metallgesellschaft (energy derivatives)
  $1,100 million    Daiwa Bank (fraud in bond trading)
  $750 million      AIB (rogue trading)
  $275 million      Allied Lyons (currency options)
  ... Who is next?

OpRisk management is not (only) about trying to avoid the one big loss that could bring down the bank. (KBC, Ph. Theus, July 2004)

Operational risk management includes:
1. Identification of risks
2. Assessment of exposure to risks
3. Mitigation of risks
4. Monitoring and reporting
The degree of formality and sophistication of the bank's operational risk management framework should be commensurate with the bank's risk profile. (KBC, Ph. Theus, July 2004)

Reflection 3
• What is the worst thing that can happen to your organization?
• Is it sufficiently covered? Is there a shared opinion about how the organization should deal with this risk?
• With risk in general?

On Error

Nominal human error rates for selected activities
(assume no undue time pressure or stresses)

  Activity                                                            Rate
  Error of commission, e.g. misreading a label                        .003
  Error of omission without reminders                                 .01
  Error of omission when the item is embedded in a procedure          .003
  Simple arithmetic errors with self-checking                         .03
  Monitor or inspector fails to recognize an error                    .1
  Personnel on different shifts fail to check the condition of
  hardware unless directed by a checklist                             .1
  Error rate under very high stress, when dangerous activities
  are occurring rapidly                                               .25

Source: adapted from Park K., "Human error", in: Salvendy G. (ed.), Handbook of Human Factors and Ergonomics. New York: John Wiley & Sons, 1997, p. 163.
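These nominal rates compound across a whole procedure: even modest per-action error rates leave a substantial chance that at least one action fails. A minimal sketch, assuming independent actions and a purely hypothetical four-action procedure assembled from the rates in the table above:

```python
# Illustrative sketch: chance of an error-free pass through a procedure.
from typing import Sequence

def p_error_free(error_rates: Sequence[float]) -> float:
    """Probability that none of the independent actions fails."""
    p = 1.0
    for e in error_rates:
        p *= 1.0 - e
    return p

# Hypothetical procedure: a misread-prone label check (.003), an omission
# opportunity embedded in the procedure (.003), a self-checked arithmetic
# step (.03), and an inspection that must catch any slip (.1).
rates = [0.003, 0.003, 0.03, 0.1]
p = p_error_free(rates)
print(f"P(error-free run) ~ {p:.3f}")          # ~0.868
print(f"P(at least one error) ~ {1 - p:.3f}")  # ~0.132
```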
Different approaches
• The problem of human error can be viewed in two ways:
  1. The person approach
  2. The system approach
• Each has its own model of error causation, and each model gives rise to a different philosophy of error management.

Person approach: basis
• The long-standing and widespread tradition of the person approach focuses on the unsafe acts – errors and procedural violations – of people on the front line.
• This approach views these unsafe acts as arising primarily from aberrant mental processes such as forgetfulness, inattention, poor motivation, carelessness, negligence, and recklessness.
• People are viewed as free agents capable of choosing between safe and unsafe modes of behavior.
• If something goes wrong, a person or group must be responsible.

Person approach: why?
• Blaming individuals is emotionally more satisfying than targeting institutions.
• Uncoupling a person's unsafe acts from any institutional responsibility is in the interests of managers.
• The person approach is also legally more convenient.

Person approach: shortcomings
Three important features of human error tend to be overlooked:
– It is often the best people who make the worst mistakes – error is not the monopoly of an unfortunate few.
– Far from being random, mishaps tend to fall into recurrent patterns. The same set of circumstances can provoke similar errors, regardless of the people involved.
– The pursuit of greater reliability is seriously impeded by an approach that does not seek out and remove the error-provoking properties within the system.

System approach
• Humans are fallible, and errors are to be expected, even in the best organizations.
• Errors are seen as consequences rather than causes, having their origins not so much in the perversity of human nature as in "upstream" systemic factors.
• Although we cannot change the human condition, we can change the conditions under which humans work.
• A central idea is that of system defenses. All hazardous technologies possess barriers and safeguards. When an adverse event occurs, the important issue is not who blundered, but how and why the defenses failed.

Swiss Cheese Model (modified from Reason, 1997)
[Figure: latent failures – incomplete procedures, mixed messages, inadequate training, attention distractions, deferred maintenance, production pressures, clumsy technology, regulatory narrowness, shifting responsibility – pass through successive defenses and, via triggers in the world, produce an accident.]

• Defenses, barriers, and safeguards occupy a key position in the system approach. High-technology systems have many defensive layers:
  • some are engineered (alarms, physical barriers, automatic shutdowns),
  • others rely on people (surgeons, anesthetists, pilots, control-room operators),
  • and others depend on procedures and administrative controls.
• In an ideal world, each defensive layer would be intact. In reality, they are more like slices of Swiss cheese, having many holes – although, unlike in the cheese, these holes are continually opening, shutting, and shifting their location.
• The presence of holes in any one "slice" does not normally cause a bad outcome. Usually, this can happen only when the holes in many layers momentarily line up to permit a trajectory of accident opportunity – bringing hazards into damaging contact with victims.

• The holes in the defenses arise for two reasons:
  1. Active failures
  2. Latent conditions
• Latent conditions can translate into error-provoking conditions within the workplace (time pressure, understaffing, inadequate equipment, fatigue, and inexperience).
• They can create long-lasting holes and weaknesses in the defenses (untrustworthy alarms and indicators, unworkable procedures, design and construction deficiencies).
• Latent conditions may lie dormant within the system for many years before they combine with active failures and local triggers to create an accident opportunity.
• Active failures are often hard to foresee, but latent conditions can be identified and remedied before an adverse event occurs.
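The dynamics of the model – holes that open, shut, and shift, with an accident only when they momentarily line up – lend themselves to a toy simulation. The sketch below is an illustration only; the layer count and hole probabilities are invented for the example and are not from Reason:

```python
# Illustrative sketch: a toy Monte Carlo of the Swiss cheese model.
import random

def p_accident(hole_probs, trials=1_000_000, seed=42):
    """Estimate the chance that every layer's hole is open at the same moment."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(trials)
        if all(rng.random() < p for p in hole_probs)
    )
    return hits / trials

# Four independent defensive layers, each with a 5% chance of an open hole.
print(p_accident([0.05, 0.05, 0.05, 0.05]))  # analytically 0.05**4 = 6.25e-06
# A latent condition that widens one layer's hole (0.05 -> 0.5) multiplies
# the accident probability tenfold without any new active failure.
print(p_accident([0.50, 0.05, 0.05, 0.05]))  # analytically 6.25e-05
```

The toy model matches the text: no single hole causes the accident; what matters is the alignment across layers.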
Reflection 4
• Where are the "holes" in your "cheese"? Are they dynamic?
• How do you deal with them? Do you reinforce the layers? Do you implement new layers?

The Rationale of Reliability

The Edge (Patrick Hudson, Leiden University)
[Figure: as return on capital invested rises (6%, 9%, 12%), an organization moves from "inherently safe" (no need for safety management) through "normally safe" towards "the edge"; safety management systems and a safety culture keep it from going over.]

The interaction between Reliability and Flexibility
[Figure: reliability and flexibility interact – flexibility bears on short-term financial performance, R&D, innovation, and adaptability to change; reliability bears on mindfulness, stakeholders' confidence, and response over the long term.]

Relationship between production and protection (Reason, 1997)
[Figure: organizations must operate within a parity zone between bankruptcy (too much protection) and catastrophe (too little); high-hazard and low-hazard ventures sit at different points in the zone.]

The lifespan of a hypothetical organization (Reason, 1997)
[Figure: a hypothetical organization's trajectory oscillating between the bankruptcy and catastrophe boundaries over its lifespan.]

Achieving a small α ...

  Decision \ State of Nature   H0                           H1
  H1                           Type I error (α):            Correct decision
                               incident with severe loss
  H0                           Correct decision             Type II error (β):
                                                            missed opportunity,
                                                            wasted resources

... or achieving a small β?
[The same decision matrix, with the emphasis now on keeping the Type II error cell small.]
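One way to read the two matrices: the null hypothesis H0 is the conservative state ("conditions are not safe"), so a Type I error means proceeding into an incident with severe loss, while a Type II error means holding back and wasting resources. The sketch below makes the α/β trade-off concrete with an invented Gaussian indicator; the distributions and thresholds are assumptions for illustration only, not part of the lecture:

```python
# Illustrative sketch: moving a decision threshold trades alpha for beta.
from statistics import NormalDist

unsafe = NormalDist(0.0, 1.0)  # indicator under H0: proceeding is dangerous
safe = NormalDist(2.0, 1.0)    # indicator under H1: proceeding is fine

for t in (0.5, 1.0, 1.5, 2.0, 2.5):
    alpha = 1.0 - unsafe.cdf(t)  # P(decide "proceed" | unsafe): severe loss
    beta = safe.cdf(t)           # P(decide "hold" | safe): wasted resources
    print(f"threshold {t:.1f}: alpha = {alpha:.4f}, beta = {beta:.4f}")
# Raising the threshold shrinks alpha at the cost of beta - the balance
# that, per Little (2005) later in this lecture, HROs manage deliberately.
```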
Reduced scope of action (adapted from Reason, 1997)
[Figure: continuous updating of procedures to avoid the recurrence of past accidents and incidents steadily narrows the scope of regulated action over the history of the system, eventually excluding actions sometimes needed to get the job done.]

Practical drift (Friendly Fire, Snook, 2000, p. 186)

                             Logics of Action
                             Rules                 Task
  Situational   Loose        2. Engineered         3. Applied
  Coupling      Tight        1. Designed (stable)  4. Failed (unstable)

Diabolo
[Figure: "Diabolo" model.]

Reflection 5
• What is the null hypothesis in your organization?
• Can your organization be called "working harder" or "working smarter"?
• Are there traces of some kind of practical drift?

High Reliability Theory

Naval aviation mishap rate (Naval Aviation Safety Center)
[Chart: Class A mishaps per 100,000 flight hours, FY 1950–96, falling from roughly 50 to 2.39; 776 aircraft destroyed in 1954 versus 39 in 1996. Milestones along the curve: angled carrier decks (1954), NAMP (est. 1959), RAG concept and NATOPS initiated (1961), squadron safety programs, system safety, designated aircraft, ACT, HFCs.]

Defining High Reliability Theory (HRT)
"How often could this organization have failed with dramatic consequences?" If the answer is many thousands of times, the organization is highly reliable.
Examples: nuclear power plants, aircraft carriers, air traffic control, emergency services, the army, SWIFT, Nissan, railways.

Defining High Reliability Organizations (HROs)
• HROs face complexity and tight coupling in the majority of the processes they run.
• HROs are not error-free, but errors don't disable them.
• HROs are forced to learn from even the smallest errors.
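The "many thousands of times" in the HRT definition can be made quantitative with the unreliability index from earlier: over N independent demands, the chance of never failing is (1 − u)^N. A short sketch; the demand counts and unreliability levels are illustrative choices, not figures from the lecture:

```python
# Illustrative sketch: surviving many demands requires very low unreliability.
def p_survive(u: float, n_ops: int) -> float:
    """Chance of n_ops consecutive demands with no dramatic failure,
    assuming independent demands with per-demand unreliability u."""
    return (1.0 - u) ** n_ops

for u in (1e-2, 1e-3, 1e-5, 1e-6):
    print(f"u = {u:.0e}: "
          f"P(no failure in 10,000 demands) = {p_survive(u, 10_000):.3g}")
# A 10^-3 process almost surely fails somewhere in 10,000 demands;
# a 10^-5 process survives about 90% of the time.
```

This is why the HRO examples above must operate several orders of magnitude below everyday unreliability levels.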
HRT
[Figure: the components of HRT – mindfulness (preoccupation with failure, reluctance to accept simplifications, sensitivity to operations, commitment to resilience, deference to expertise), sensemaking, and process design (simple system design, sufficient system design, decoupling, avoiding excessive redundancy).]

1. Preoccupation with failure: Systems with higher reliability worry chronically that analytic errors are embedded in ongoing activities and that unexpected failure modes and limitations of foresight may amplify those analytic errors. The people who operate and manage high reliability organizations "assume that each day will be a bad day and act accordingly. But this is not an easy state to sustain, particularly when the thing about which one is uneasy has either not happened, or has happened a long time ago, and perhaps to another organization" (Reason, 1997, p. 37). These systems have been characterized as consisting of "collective bonds among suspicious individuals" and as systems that institutionalize disappointment. To institutionalize disappointment means, in the words of the head of Pediatric Critical Care at Loma Linda Children's Hospital, "to constantly entertain the thought that we have missed something."

2. Reluctance to simplify interpretations: All organizations have to ignore most of what they see in order to get work done. The crucial issue is whether their simplified diagnoses force them to ignore key sources of unexpected difficulties. Mindful of the importance of this trade-off, systems with higher reliability restrain their temptations to simplify. They do so through such means as diverse checks and balances, adversarial reviews, and the cultivation of multiple perspectives. At the Diablo Canyon nuclear power plant, people preserve complexity in their interpretations by reminding themselves of two things: (1) we have not yet experienced all potential failure modes that could occur here; (2) we have not yet deduced all potential failure modes that could occur here.

3. Sensitivity to operations: People in systems with higher reliability tend to pay close attention to operations. Everyone, no matter what his or her level, values organizing to maintain situational awareness. Resources are deployed so that people can see what is happening, can comprehend what it means, and can project into the near future what these understandings predict will happen. In medical care settings, sensitivity to operations often means that the system is organized to support the bedside caregiver.

4. Cultivation of resilience: Most systems try to anticipate trouble spots, but the higher-reliability systems also pay close attention to their capability to improvise and act without knowing in advance what will happen. Reliable systems spend time improving their capacity to do a quick study, to develop swift trust, to engage in just-in-time learning, to simulate mentally, and to work with fragments of potentially relevant past experience.

5. Willingness to organize around expertise: Reliable systems let decisions "migrate" to those with the expertise to make them. Adherence to rigid hierarchies is loosened, especially during high-tempo periods, so that there is a better matching of experience with problems.

– adapted from Karl E. Weick & Kathleen M. Sutcliffe, Managing the Unexpected, Jossey-Bass, 2001

Sensemaking vs. decision-making
"If I make a decision it is a possession; I take pride in it; I tend to defend it and not to listen to those who question it. If I make sense, then this is more dynamic and I listen and I can change it. A decision is something you polish. Sensemaking is a direction for the next period." – Paul Gleason

[Figure: "Before the Accident / After the Accident", modified from Richard I. Cook, MD (1997).]

The Culture Premise
[Figure: top management's beliefs, values, and actions flow through communication (credible, consistent, salient) and rewards (money, promotion, approval) into "perceived" values and philosophy (consistency, intensity, consensus), which shape employees' beliefs, attitudes, and behaviors expressed as norms. Adapted from O'Reilly, C. (1989), "Corporations, Culture, and Commitment: Motivation and Social Control in Organizations", California Management Review, Vol. 31, No. 4, Summer 1989.]

HROs simultaneously minimize Type I & Type II errors
HROs are able to strike a balance that:
• minimizes Type I errors (catastrophic failure), while at the same time
• keeping Type II errors (excessive and costly conservatism) at acceptable levels. (Little, 2005)

Reflection 6
• Recall an experience – in any setting – in which the request that you "try harder," "be careful," or "stay alert" improved your performance. Why did that work?
• Identify a process in your organization that relies on vigilance. What would you estimate its reliability to be?
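For the last reflection question, the earlier "counts between failures" measure offers a quick way to turn an audit into a reliability estimate. A minimal sketch with invented numbers; the audit scenario and counts are hypothetical, not from the lecture:

```python
# Illustrative sketch: estimating reliability from observed counts.
import math

def reliability_from_counts(failures: int, opportunities: int) -> float:
    """Share of actions that achieved the intended result."""
    return 1.0 - failures / opportunities

# Hypothetical vigilance audit: 3 lapses in 2,500 monitored hand-offs.
r = reliability_from_counts(3, 2500)
u = 1.0 - r
print(f"reliability ~ {r:.4f}, unreliability ~ 10^{round(math.log10(u))}")
# -> reliability ~ 0.9988, unreliability ~ 10^-3
```

A result around 10^-3 would place a vigilance-dependent process in the middle of the "Different Views on Reliability" table – far from the levels the HRO examples sustain.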