Decisions_ Decisions

					                 Risk & Reliability
     An Application of High Reliability Theory

                          TIAS Lecture
                          2006 April 27 – afternoon session
                          Risk management

Gerd Van Den Eede
Department of Business Administration
VLEKHO Business School – Brussels
PhD Candidate at UvT, supervised by
Prof. Dr. P. Ribbers & Dr. B. Van de Walle
•   An Era of Complexity
•   On Reliability
•   On Error
•   On Operational Risk
•   The Rationale of Reliability
•   High Reliability Theory
An Era of Complexity
             Probability of performing perfectly in complex systems

                           Probability of success, each element:

# of steps                  0.95        0.99       0.999       0.9999

     1                      0.95        0.99       0.999       0.9999
    25                      0.28        0.78       0.975        0.997
    40                      0.12        0.66        0.96        0.995
   100                     0.006        0.37        0.90         0.99
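
The table above follows from multiplying the per-step success probabilities. A minimal sketch, assuming every step succeeds independently and with the same probability (the function name `chain_success` is illustrative, not from the lecture):

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent,
    equally reliable steps (an assumption, not stated on the slide)."""
    return p_step ** n_steps


step_probs = [0.95, 0.99, 0.999, 0.9999]
step_counts = [1, 25, 40, 100]

# Print a grid with the same rows and columns as the slide.
print("steps " + "".join(f"{p:>10}" for p in step_probs))
for n in step_counts:
    row = "".join(f"{chain_success(p, n):>10.3f}" for p in step_probs)
    print(f"{n:>5} " + row)
```

The computed values agree with the slide up to small rounding differences in the last digit.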
                         The Systems Space

[Figure: the systems space, running from Simple through Complicated, Complex and Complex Adaptive to Chaotic. Axis label only partly legible: "Required to ... the System".]

                                                          Bennet & Bennet, 2004
Cynefin (kun-ev’in) Framework
(C = cause, E = effect)

Un-ordered Domains
  Complex:  C → E relationships exist but change so frequently that there
            are no predictive models.
            Retrospective coherence, the domain of patterns.
  Chaos:    No obvious perceivable C → E.
            Order is needed. Nothing to analyse. Unpredictable
            reactions to interventions.

Ordered Domains
  Knowable: C → delays → E.
            Space & time delays exist.
            We still know what they are if we invest energy and …
  Known:    C → E links are strong.
            They are real, discoverable and repeat in predictable
            ways. We know what these are.
Interaction/Coupling Matrix

[Figure: 2 x 2 matrix of interactions (Linear, Complex) against coupling, with quadrants numbered 1 to 4.]

                                    Ch. Perrow. Normal Accidents, 1984, p. 327.
                         Reflection 1
• Can you think of a process within your organization that is
  characterized by complexity and/or tight-coupling?
• Why is this so? How does the organization deal with it?
On Reliability
                         Defining “Reliability”

1. The measurable capability of an object to perform its
   intended function in the required time under
   specified conditions. (Handbook of Reliability Engineering,
   Igor Ushakov editor)
2. The probability of a product performing without
   failure a specified function under given conditions
   for a specified period of time. (Quality Control Handbook,
   Joseph Juran editor)
3. The extent of failure-free operation over time. (David

                    Quantifying “Reliability”
• “Reliability” = Number of actions that achieve the
  intended result ÷ Total number of actions taken

• “Unreliability” = 1 minus “Reliability”

• It is convenient to use “Unreliability” as an index,
  expressed as an order of magnitude (e.g. 10^-2 means
  that 1 time in 100, the action fails to achieve its
  intended result)

• Related measure: Time or counts between failures, for
  example transplant cases between organ rejection,
  employee work hours between lost time injuries.
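
The measures above can be sketched in a few lines. This is an illustration only: the function names and the sample figures (990 successful actions out of 1,000) are hypothetical and not taken from the lecture.

```python
import math


def reliability(successes: int, total: int) -> float:
    """Share of actions that achieved the intended result."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successes / total


def unreliability(successes: int, total: int) -> float:
    """1 minus reliability, computed from the failure count."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - successes) / total


def order_of_magnitude(unrel: float) -> int:
    """Nearest power of ten for a positive unreliability, e.g. 0.01 -> -2."""
    return round(math.log10(unrel))


def actions_between_failures(successes: int, total: int) -> float:
    """Related measure: average number of actions per failure."""
    failures = total - successes
    return math.inf if failures == 0 else total / failures


# Hypothetical sample: 990 of 1,000 actions achieved the intended result.
print(reliability(990, 1000))
print(unreliability(990, 1000))
print(order_of_magnitude(unreliability(990, 1000)))
print(actions_between_failures(990, 1000))
```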
          Different Views on Reliability

Reliability         Unreliability      “Sigmas”
   0.9                  10^-1              1
   0.99                 10^-2              2
   0.999                10^-3              3
   0.9999               10^-4              4
   0.99999              10^-5
   0.999999             10^-6

Combining Flexibility & Reliability
      Reliability & Flexibility as vectors mmedia/vectors/va.html
                         Reflection 2
• What is meant by “Reliability” in your organization? Is this always
  the case? On what does it depend?
• What measures are in place to guarantee this reliability?
• How does reliability relate to flexibility?
On Operational Risk
Criticality of resources depends on
business processes running on these systems!
                                       Marc Geerts, KBC 2004
          Definition Operational Risk (Basel II)
 “Operational Risk (“OpR”) is the risk of loss resulting from
inadequate or failed internal processes, people or systems and
from external events” (BIS - Basel II)

BIS / EU definition
includes legal & tax risk
excludes strategic, reputational and systemic risks

                                                       KBC, Ph. Theus, July 2004
Operational risk & other risk categories


  Credit      Operational    Liquidity
   Risk          Risk          Risk


                                         Raft International, 2003
Examples of operational risks
  – Wrong pricing model (formula) used by dealers
  – Double / non (timely) execution of payments
  – Collateral not properly executed
  – Losses due to internal / external fraud
  – Selling wrong product to wrong type of customer
  – Selling products without proper authorisation or outside the scope of a
    given license
  – Fire, flooding, terrorism
  – Etc.

                                                          KBC, Ph. Theus, July 2004
 Recent losses in the financial industry
$ > 1.8 billion        Sumitomo (copper trading)
$ 1,450 million        Kashima Oil (currency derivatives)

$ 1,390 million        Barings (Nikkei futures & options)

$ 1,340 million        Metallgesellschaft (energy derivatives)

$ 1,100 million        Daiwa Bank (fraud in bond trading)

$ 750 million          AIB (rogue trading)

$ 275 million          Allied Lyons (currency options)

                                      Who is next?
 OpRisk Management is not (only) about trying to avoid the little big
              one that could bring down the bank

                                                             KBC, Ph. Theus, July 2004
   1.   Identification of risks
   2.   Assessment of exposure to risks
   3.   Mitigation of risks
   4.   Monitoring and reporting

   Degree of formality and sophistication of the bank’s operational risk
   management framework should be commensurate with the bank’s risk

                                                           KBC, Ph. Theus, July 2004
                         Reflection 3
• What is the worst thing that can happen to your organization?
• Is it sufficiently covered? Is there a shared opinion about how
  the organization should deal with this risk?
• With risk in general?
On Error
  Nominal human error rates for selected activities
Activity (assume no undue time pressure or stresses)                              Rate

Error of commission, e.g. misreading a label                                      0.003
Error of omission without reminders                                               0.01
Error of omission when item is embedded in a procedure                            0.003
Simple arithmetic errors with self checking                                       0.03
Monitor or inspector fails to recognize an error                                  0.1
Personnel on different shifts fail to check the condition of hardware
unless directed by a checklist                                                    0.1
Error rate under very high stress when dangerous activities are
occurring rapidly                                                                 0.25

Source: Adapted from Park K. Human error. In: Salvendy G, ed. Handbook of human factors and ergonomics. New York: John Wiley & Sons, Inc. 1997: 163
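
To get a feel for what these nominal rates imply, the sketch below combines two of them. It assumes that errors are independent of one another, which is a simplification and not a claim made in the source table; the function names are illustrative.

```python
# Nominal rates copied from the table above.
P_MISREAD_LABEL = 0.003      # error of commission
P_INSPECTOR_MISSES = 0.1     # monitor or inspector fails to recognize an error


def p_undetected(p_error: float, p_miss: float) -> float:
    """Chance that an error is made AND the inspector misses it,
    assuming the two events are independent."""
    return p_error * p_miss


def p_at_least_one(p_error: float, n_actions: int) -> float:
    """Chance of at least one error in n_actions independent attempts."""
    return 1.0 - (1.0 - p_error) ** n_actions


print(f"{p_undetected(P_MISREAD_LABEL, P_INSPECTOR_MISSES):.4f}")
print(f"{p_at_least_one(P_MISREAD_LABEL, 100):.3f}")
```

Even a low per-action rate adds up: under these assumptions, roughly one run in four of 100 label readings contains at least one misreading.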

                 Different approaches
•   The problem of human error can be viewed in 2 ways:
      1. The person approach
      2. The system approach
•   Each has its model of error causation, and each model gives
    rise to different philosophies of error management

                Person approach, basis
• The long-standing and widespread tradition of the person approach
  focuses on the unsafe acts (errors and procedural violations) of
  people on the front line.
• This approach views these unsafe acts as arising primarily from
  aberrant mental processes such as forgetfulness, inattention,
  poor motivation, carelessness, negligence, and recklessness.
• People are viewed as free agents capable of choosing between
  safe and unsafe modes of behavior.
• If something goes wrong, a person or group must be

                 Person approach, why?

• Blaming individuals is emotionally more satisfying than targeting
• Uncoupling a person’s unsafe acts from any institutional
  responsibility is in the interests of managers
• The person approach is also legally more convenient.

           Person approach: shortcomings

Three important features of human error tend to be overlooked:
    – It is often the best people who make the worst mistakes; error
       is not the monopoly of an unfortunate few
    – Far from being random, mishaps tend to fall into recurrent
       patterns. The same set of circumstances can provoke similar
       errors, regardless of the people involved.
    – The pursuit of greater reliability is seriously impeded by an
       approach that does not seek out and remove the error-
       provoking properties within the system

                     System Approach
• Humans are fallible and errors are to be expected, even in the
  best organizations
• Errors are seen as consequences rather than causes, having
  their origins not so much in the perversity of human nature as in
  “upstream” systemic factors.
• Although we cannot change the human condition, we can
  change the conditions under which humans work.
• A central idea is that of system defenses. All hazardous
  technologies possess barriers and safeguards. When an adverse
  event occurs, the important issue is not who blundered, but how
  and why the defenses failed.

[Figure: Swiss Cheese Model. Holes in the defensive layers are labelled: Incomplete Procedures; Messages; Inadequate Training; Attention Distractions; Deferred Maintenance; Production Pressures; Clumsy Technology; Regulatory Narrowness; Responsibility Shifting. Further label: The World.]

                                                         Modified from Reason, 1997
                  Swiss Cheese Model
• Defenses, barriers, and safeguards occupy a key position in the
  system approach. High technology systems have many defensive
• some are engineered (alarms, physical barriers, automatic
• others rely on people (surgeons, anesthetists, pilots, control room
• and others depend on procedures and administrative controls.
• In an ideal world, each defensive layer would be intact. In reality,
  they are more like slices of Swiss cheese, having many holes,
  although unlike in the cheese, these holes are continually opening,
  shutting, and shifting their location.
• The presence of holes in any one “slice” does not normally cause a
  bad outcome. Usually this can happen only when the holes in many
  layers momentarily line up to permit a trajectory of accident
  opportunity, bringing hazards into damaging contact with victims.
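
The "holes lining up" idea can be illustrated numerically. The sketch below is a deliberate simplification: it assumes each layer has a fixed, independent chance of having a hole in the path of a hazard, whereas the model above stresses that holes shift over time and may share causes. The layer probabilities are hypothetical.

```python
import math
import random


def p_all_layers_fail(hole_probs):
    """Analytic chance that a hazard passes every layer,
    assuming the layers fail independently."""
    return math.prod(hole_probs)


def simulate(hole_probs, trials=100_000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A hazard gets through only if it meets a hole in every layer.
        if all(rng.random() < p for p in hole_probs):
            hits += 1
    return hits / trials


layers = [0.1, 0.1, 0.1]  # hypothetical per-layer hole probabilities
print(p_all_layers_fail(layers))
print(simulate(layers))
```

Three fairly leaky layers still give a low combined failure chance under independence; a shared cause that opens holes in several layers at once would undo that benefit.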

•    The holes in the defenses arise for 2 reasons:
    1. Active failures
    2. Latent conditions
•    Latent conditions can translate into error-provoking
     conditions within the workplace (time pressure, understaffing,
     inadequate equipment, fatigue, and inexperience)
•    They can create long-lasting holes and weaknesses in the
     defenses (untrustworthy alarms and indicators, unworkable
     procedures, design and construction deficiencies).
•    Latent conditions may lie dormant within the system for many
     years before they combine with active failures and local
     triggers to create an accident opportunity.
•    Active failures are often hard to foresee but latent conditions
     can be identified and remedied before an adverse event
                         Reflection 4

• Where are the “holes” in your “cheese”? Are they dynamic?
• How do you deal with them? Do you reinforce the layers? Do
  you implement new layers? …
The Rationale of Reliability
[Figure: The Edge. Zones labelled No need, Inherently Safe, Normally Safe and The Edge; Return on Capital Invested of 6% and 12%; further labels: Safety Management Systems, Safety Culture.]

                                             Patrick Hudson - Leiden University
Reliability   Flexibility
       The interaction between Reliability and Flexibility

[Figure: causal diagram with + and - links between Short Term Reliability, Long Term Reliability, Financial performance, Mindfulness, R&D → innovation and Adaptability to changes; two labels are truncated (“Stakeholder’s”, “Respons”).]
1830, p.74
Relationship between production and protection (Reason, 1997)

[Figure labels: Parity zone; High hazard; Low hazard]

The lifespan of a hypothetical organization (Reason, 1997)


            Achieving a small α?

                                  State of Nature

Decision        H0                             H1

   H1           Type I error:                  Correct decision
                incident with severe loss
   H0           Correct decision               Type II error:
                                               missed opportunity,
                                               wasted resources

           … or achieving a small β?

                                  State of Nature

Decision        H0                             H1

   H1           Type I error:                  Correct decision
                incident with severe loss
   H0           Correct decision               Type II error:
                                               missed opportunity,
                                               wasted resources
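
The tension between a small α and a small β can be made concrete with a textbook threshold test. Everything in this sketch is an assumption for illustration: a single indicator that is normally distributed with mean 0 under H0 and mean 2 under H1 (standard deviation 1 in both cases), and a rule that rejects H0 when the indicator exceeds a threshold.

```python
from statistics import NormalDist

# Hypothetical distributions of the indicator under each state of nature.
H0 = NormalDist(mu=0.0, sigma=1.0)
H1 = NormalDist(mu=2.0, sigma=1.0)


def alpha(threshold: float) -> float:
    """Type I error rate: H0 is true but the reading exceeds the threshold."""
    return 1.0 - H0.cdf(threshold)


def beta(threshold: float) -> float:
    """Type II error rate: H1 is true but the reading stays below the threshold."""
    return H1.cdf(threshold)


for t in (0.5, 1.0, 1.5, 2.0, 2.5):
    print(f"threshold={t:.1f}  alpha={alpha(t):.3f}  beta={beta(t):.3f}")
```

Raising the threshold lowers α and raises β. With a fixed ability to tell the two states apart, one error type can only be reduced at the expense of the other.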
            Reduced scope of action

[Figure labels: Scope of action; updating of procedures to avoid recurrence of past incidents; Actions sometimes needed to get the job done; “of system” (label truncated)]

                                      Adapted from Reason, 1997
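                         Practical drift

[Figure: 2 x 2 matrix. Logics of Action: Rules, Task. Situational Coupling: Tight (the other label is missing). Quadrants 1 to 4 carry the labels Designed, Engineered, Applied and Failed; a further axis reads Stable, Unstable.]

                                                  Friendly Fire. Snook (2000), p. 186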
                          Reflection 5

• What is the null hypothesis in your organization?
• Can your organization be called “working harder” or “working
• Are there traces of some kind of practical drift?
High Reliability Theory
[Figure: Naval Aviation Mishap Rate, FY 50-96. Vertical axis: Class A Mishaps/100,000 Flight Hours. Horizontal axis: Fiscal Year (50, 65, 80, 96). Annotations along the curve: Angled Carrier Decks; Naval Aviation Safety Center; NAMP est. 1959; RAG concept initiated; NATOPS initiated 1961; Squadron Safety program; System Safety Designated Aircraft; HFC’s. Callouts: “776 aircraft destroyed in …”; “39 aircraft destroyed in 1996”; 2.39.]
              Defining High Reliability Theory (HRT)

“How often could this organization have failed with dramatic
consequences?” If the answer to the question is “many thousands
of times”, the organization is highly reliable.

nuclear power plants, aircraft carriers, air traffic control, emergency
services, army, SWIFT, Nissan, Railways.
               Defining High Reliability Organizations (HROs)

• HROs face complexity and tight-coupling in the majority of
  processes they run.

• HROs are not error-free, but errors don’t disable them

• HROs are forced to learn from even the smallest errors
                         Reluctance to accept simplifications
                         Preoccupation with failure
                         Commitment to resilience
                         Sensitivity to operations
                         Deference to expertise

                                         Simple system design
                                         Sufficient system design
      Decoupling Process Design          Avoid excessive redundancy
1. Preoccupation with failure: Systems with higher reliability worry chronically
that analytic errors are embedded in ongoing activities and that unexpected
failure modes and limitations of foresight may amplify those analytic errors. The
people who operate and manage high reliability organizations “assume that
each day will be a bad day and act accordingly. But this is not an easy state to
sustain, particularly when the thing about which one is uneasy has either not
happened, or has happened a long time ago, and perhaps to another
organization” (Reason, 1997, p. 37). These systems have been characterized
as consisting of “collective bonds among suspicious individuals” and as systems
that institutionalize disappointment. To institutionalize disappointment means, in
the words of the head of Pediatric Critical Care at Loma Linda Children’s
Hospital, “to constantly entertain the thought that we have missed something.”

2. Reluctance to simplify interpretations: All organizations have to ignore
most of what they see in order to get work done. The crucial issue is whether
their simplified diagnoses force them to ignore key sources of unexpected
difficulties. Mindful of the importance of this tradeoff, systems with higher
reliability restrain their temptations to simplify. They do so through such means
as diverse checks and balances, adversarial reviews, and cultivation of multiple
perspectives. At the Diablo Canyon nuclear power plant people preserve
complexity in their interpretations by reminding themselves of two things: (1) we
have not yet experienced all potential failure modes that could occur here; (2)
we have not yet deduced all potential failure modes that could occur here.
3. Sensitivity to operations: People in systems with higher reliability tend to
    pay close attention to operations. Everyone, no matter what his or her
    level, values organizing to maintain situational awareness. Resources are
    deployed so that people can see what is happening, can comprehend
    what it means, and can project into the near future what these
    understandings predict will happen. In medical care settings sensitivity to
    operations often means that the system is organized to support the
    bedside caregiver.

4. Cultivation of resilience: Most systems try to anticipate trouble spots, but
    the higher reliability systems also pay close attention to their capability to
    improvise and act without knowing in advance what will happen. Reliable
    systems spend time improving their capacity to do a quick study, to
    develop swift trust, to engage in just-in-time learning, to simulate
    mentally, and to work with fragments of potentially relevant past

5. Willingness to organize around expertise: Reliable systems let
   decisions “migrate” to those with the expertise to make them. Adherence
   to rigid hierarchies is loosened, especially during high tempo periods, so
   that there is a better matching of experience with problems.

Adapted from Karl E. Weick & Kathleen M. Sutcliffe, “Managing the Unexpected,” Jossey-Bass, 2001
                 Sensemaking vs. decision-making

“If I make a decision it is a possession; I take pride in it; I tend to
     defend it and not to listen to those who question it.

If I make sense, then this is more dynamic and I listen and I can
    change it. A decision is something you polish. Sensemaking is a
    direction for the next period.”

                                                             --Paul Gleason

[Figure: two panels, “Before the Accident” and “After the Accident”]

Modified from Richard I. Cook, MD (1997)
                   The Culture Premise

Top management’s “perceived” values, philosophy
Employees’ beliefs, attitudes and behaviors, expressed as norms

Adapted from O’Reilly (1989), “Corporations, Culture, and Commitment:
Motivation and Social Control in Organizations”. In: California
Management Review, Summer 1989, Vol. 31, No. 4.
                     HROs simultaneously minimize
                        Type I & Type II errors

HROs are able to strike a balance that:
• minimizes Type I errors
  (catastrophic failure)
  while at the same time

•   keeps Type II errors
    (excessive and costly conservatism)
    at acceptable levels.
                                                    (Little, 2005)
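
One way to picture "striking a balance" is to pick the decision threshold that minimizes the expected cost of both error types. The sketch below is a hypothetical illustration, not a method from the lecture or from Little (2005): it reuses a simple two-distribution threshold test, assumes both states of nature are equally likely, and uses made-up cost figures.

```python
from statistics import NormalDist

# Hypothetical indicator distributions under each state of nature.
H0 = NormalDist(mu=0.0, sigma=1.0)
H1 = NormalDist(mu=2.0, sigma=1.0)
P_H0 = 0.5  # assumed prior probability that H0 holds


def expected_cost(threshold: float, cost_type_i: float, cost_type_ii: float) -> float:
    """Expected cost of a rule that rejects H0 above the threshold."""
    alpha = 1.0 - H0.cdf(threshold)   # Type I error rate
    beta = H1.cdf(threshold)          # Type II error rate
    return P_H0 * alpha * cost_type_i + (1.0 - P_H0) * beta * cost_type_ii


def best_threshold(cost_type_i: float, cost_type_ii: float) -> float:
    """Grid search over thresholds from -2.0 to 6.0 in steps of 0.1."""
    grid = [i / 10 for i in range(-20, 61)]
    return min(grid, key=lambda t: expected_cost(t, cost_type_i, cost_type_ii))


print(best_threshold(1.0, 1.0))     # equal costs
print(best_threshold(100.0, 1.0))   # Type I error assumed 100 times as costly
```

When a Type I error is far more costly, the cost-minimizing threshold moves so that α shrinks and β grows. Lowering both at once requires better discrimination between the two states, that is, distributions that overlap less.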
                            Reflection 6

• Recall an experience – in any setting – in which the request that
  you “try harder,” “be careful,” or “stay alert” improved your
  performance. Why did that work?

• Identify a process in your organization that relies on vigilance.
  What would you estimate its reliability to be?