HDCC Workshop #1: Human Issues & Operations (01/31/11)




Technical Session: Human Issues and Operations
This technical group discussed human factors and operations issues that contribute to system dependability. The
attendees were:
         Anna Wichansky, Oracle (chair)
         Pat Mantey, UC Santa Cruz
         Jack Hansen, U. Central Florida
         Mike Conroy, NASA/KSC
         Dan Siewiorek, CMU
         Roy Maxion, CMU
         Lynn Carter, CMU
         Jun Park, NASA/ARC
         Mike Feary, NASA/ARC
         Bernard Antonuk, Five Nine Solutions, Inc.

We identified four basic areas that could benefit from research by the HDCC:


Automation & Human Control: Measurement, Modeling, & Optimization
The capabilities of the human operator must be balanced with those of automation in order to achieve optimal
system dependability. There are some tasks in which human control is mandatory, and others that can be better
performed by a machine. The classic human-machine allocation issue must be addressed for all new systems, and
resolved when existing systems are upgraded or otherwise changed.

This issue exists in space missions, air traffic control and management, enterprise databases, and the Internet. We
focused on the air traffic control issues. At any given second, roughly 5,500 flights over the U.S. are being
tracked by the air traffic management system. The NASA-developed Center-TRACON (Terminal Radar Approach
Control) Automation System (CTAS) tool suite, being deployed by the FAA, has been shown to increase
throughput by 15% at DFW airport. Basic procedures in terms of human-machine task allocation have not been
changed. In the next 10 years, these systems will need to add 60% more capacity in order to manage ever-
increasing numbers of flights. This cannot be achieved by making incremental changes to the current system; it
will require a whole new approach. In this automated airspace, computers will maintain reduced separations
between flights. The human's role will change in this context to a higher-level, strategic function.

In order to assess the value and potential change in dependability of such systems, comparisons would need to be
made against a baseline of system-wide air traffic management performance. To our knowledge, no such baseline
is currently available. Although portions of the air traffic control process have been modeled and measured, no
model of the system as a whole exists.

As Bill Hewlett often said, "If you can't measure it, you can't manage it." We propose a research objective to
measure, model, and optimize the performance of mixed-initiative systems. The project is long-term, but sub-goals
can be achieved along the way: measurement methods in 2 years, modeling methodologies in 3 years, and
optimization methodologies in 5 years. A baseline level of performance of the current system is a goal in the
2-year timeframe. Dynamic models of the air traffic control system could be generated in the 3-year timeframe,
enabling us to predict levels of dependability and failure under various manipulations. Experience with the models
would enable optimization of performance and dependability in the 5-year timeframe.

Participants in such a project would include academia, NASA, and industry. Facilities exist at NASA to perform
this research. The impact on dependability would be high: we could simulate successes and failures of current and
future air traffic systems using models, and optimize our real-world strategies without disrupting the current
system's operations.






Increasing Human Expertise in Computer System Administration
Many examples of operations failures due to lack of human expertise in computer system installation,
configuration, optimization, and maintenance were given at the conference. Database and system vendors have
known for some time that administration of complex systems is a "black art", and that talented and experienced
administrators are a rare and highly compensated specialty among computer scientists. The quality of expertise
an organization possesses in this domain can make or break operations, missions, and whole companies.
Although hardware and software vendors are trying to simplify administrative user interfaces, the systems
themselves continue to grow more complex in implementation. Customers demand that systems do more, and the
more features are added, the more complex the system becomes. Large-scale enterprise systems may take months
or years to install and upgrade, leading to schedule slips, system outages, and related costs.

There is a critical shortage of individuals with the skills to install and maintain large, multi-vendor, multi-version
computer systems. These skills are not easily acquired through university training, and even if they were, there is a
large gap between theory and practice in the real world. In addition, it may be too costly or disaster-prone to train
people on the job. As the cost of systems declines, many operators come from eclectic backgrounds, often with
little or no training or formal education in computer science or other technical fields.

The proposed project intends to increase the size of the available pool of computer system administration experts
by various means:

Experience Factory
In the 0-2 year timeframe, we proposed to develop an "experience factory", where system administrators could
observe and perform best practices as modeled by administration experts. In this model, which shares some
features with a medical internship, administration students could meet mentors and be exposed to difficult
administration scenarios under the guidance of experts in a low-impact setting. Mentors and task scenarios could
come from industry; academic partners could help develop curricula to structure the experiences for optimal
learning. This would create hands-on skill as well as testable knowledge, and form the basis for a professional
development program in computer system administration.

Compendium of Best Practices
In the 3-5 year timeframe, a body of knowledge and a process to keep it current and relevant would need to be
developed and documented. This could take the form of databases, curricula, videos, websites, or books. The best
practices would be based on real world experiences of consortium partners as well as academic principles.
Resources to build on initial scenarios and keep the best practices up to date would be critical for long term
viability.

Certification/Licensing
In the 5+ year timeframe, it was proposed that some type of certification or licensing program be explored, given a
professional development program and a vehicle for learning and performing best practices under the guidance of
experts. Employers could then be assured of a minimal level of qualification for system administrators, which
would greatly improve the probability of highly dependable systems and operations in their organizations.

This project could involve NASA, industry, and academic partners. Facilities for the experience factory could be
publicly available development labs; it might also be possible to accomplish these goals at industry partners’ sites
using an internship model. This project was considered to have potential for high immediate impact, especially in
benefiting industry partners and their customers.




Preventing Human Errors
Lack of dependability is often blamed on human error in the media, and this was reinforced in several examples
among the conference case studies. The loss of the Mars Climate Orbiter was attributed to a metric conversion
error. A major commercial website outage was due to an incorrect operator response to a trouble ticket, which was



                                                    Page 2
HDCC Workshop #1: Human Issues & Operations                                                            01/31/11


ignored because of its ticket number beginning with the letters FUC. Numerous plane crashes, nuclear power
disasters, and other failures are attributed to human error on a regular basis.

Yet human error, as studied by human factors experts, is considered a rare event. It is difficult to observe,
occurring in many industrial and military operations according to a Poisson distribution. Special methodologies,
including critical incident techniques, had to be developed in order to study it. Rare events are notoriously
difficult to predict, let alone prevent. Yet they continue to sabotage practically all large systems and operations,
particularly in 24x7 operations where fatigue, hunger, boredom, mental workload, and other human factors can
greatly impact dependability. Although these factors are often well known within an industry (e.g., nuclear power
plants, aviation), they have not been adequately shared and compared across content domains in any
meaningful way.
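The tension between rarity and inevitability can be illustrated with a Poisson sketch. The error rate below is an invented figure for illustration only, not a measured value; the point is that an event rare in any one shift becomes near-certain over continuous operation:

```python
import math

def prob_at_least_one(rate_per_hour: float, hours: float) -> float:
    """P(at least one error) under a Poisson model with the given mean rate."""
    expected = rate_per_hour * hours        # Poisson mean over the interval
    return 1.0 - math.exp(-expected)        # 1 - P(zero events)

# Hypothetical rate: one operator error per 10,000 hours of operation.
rate = 1.0 / 10_000

# A single 8-hour shift: the error is almost never observed (~0.08%).
print(f"8-hour shift:  {prob_at_least_one(rate, 8):.4%}")

# A year of 24x7 operation (8,760 hours): the same "rare" error
# occurs at least once with probability near 58%.
print(f"One year 24x7: {prob_at_least_one(rate, 8760):.2%}")
```

This is why studying such errors requires pooled data across organizations: no single site observes enough incidents to characterize them.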

The objective of this project is to decrease the human contribution to lack of dependability, by sharing best
practices, critical incidents, and methods of implementation (e.g. test procedures, operations policies, checklists) to
prevent human error. Industry and government agencies can contribute real world experience to this effort. In
particular, it is envisioned that a “blind” database of errors, near misses, and other critical incidents be contributed
for all partners to access. A similar approach is taken today in the Ease of Use Roundtable, a computer hardware
consortium led by IBM and Intel, in which members contribute 600 customer support calls every month. One of
the partners is responsible for categorizing and analyzing the calls, and white papers have been produced based on
the full data pool for desktop systems, laptops, and Internet software. The Aviation Safety Reporting System
(ASRS) and the Medical Safety Reporting System (MSRS, currently under test at the VA hospitals) are also
similar in nature. Academic institutions can add experimental data, and help determine useful categories and
criteria for combining data. Eventually, if there is a sufficient body of comparable data, modeling and prediction
of human error might be possible. In addition, remedies that worked or failed, and consequences of reacting to
failures vs. not reacting would be shared.
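One way a "blind" database entry might be structured is sketched below. All field names, categories, and the hashing scheme are assumptions for illustration, not a design the workshop agreed on; the key idea is that the contributor's identity is replaced by an opaque token before records are pooled:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional
import hashlib

@dataclass
class IncidentRecord:
    """One anonymized ('blind') entry in a shared critical-incident database."""
    contributor_hash: str            # opaque ID; contributing partner not identifiable
    occurred: date
    domain: str                      # e.g. "aviation", "nuclear", "enterprise IT"
    category: str                    # e.g. "configuration error", "alarm ignored"
    severity: int                    # 1 (near miss) .. 5 (mission loss)
    description: str                 # free text, scrubbed of identifying details
    remedy_worked: Optional[bool] = None  # outcome of corrective action, if known

def blind_contributor(name: str, salt: str) -> str:
    """Replace a partner's name with a salted hash so records pool blindly."""
    return hashlib.sha256((salt + name).encode()).hexdigest()[:12]

# Hypothetical usage: a partner submits a record without exposing its identity.
rec = IncidentRecord(
    contributor_hash=blind_contributor("PartnerA", salt="consortium-secret"),
    occurred=date(2000, 11, 3),
    domain="enterprise IT",
    category="trouble ticket ignored",
    severity=3,
    description="Operator dismissed a valid ticket based on its identifier.",
)
```

Shared category and severity vocabularies like these are what would make incidents comparable across aviation, nuclear, and IT domains.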

The participants who might contribute most are the level 4 and 5 (high maturity) organizations (see
http://www.sei.cmu.edu/publications/documents/00.reports/00sr002/00sr002app-a.html), which have been tracking
historical operations data for several years. Successful consortia of this type require roughly 20-50
representatives, including many industry leaders, to establish critical mass. The timeframe is considered to be 1-3
years, and the project is not expected to require massive investments in facilities or capital. The impact of this
sharing of best practices in the prevention of human error could be immediate and high.

Organizational Design and Culture for Dependability
One aspect of human factors that has become of greater interest over the last 20 years is macroergonomics.
Human factors experts recognize that no matter how well a particular tool or process is designed, and no matter
how well an individual is trained, all human factors exist in an organizational or environmental context that can
facilitate or degrade overall system performance.

Many organizations represented at the conference have cultures that facilitate the development of highly
dependable products. Microsoft has a high ratio of "test buddies" to development engineers, and code must be
tested independently before it can be checked in. HP has a strong process orientation to product development and
places a high value on quality measurement. In the semiconductor industry, zero-defect goals, clean rooms, and
other industry norms of low tolerance for poor quality have led to significant improvements. Quality circles and
QFD processes have been used successfully in the auto industry. NASA, the FAA, and other government agencies
espouse major safety and dependability values in their organizational missions.

In other cases, particularly Internet start-ups, time to market is everything. These pressures produce a culture in
which many processes for high dependability are shortchanged or omitted completely, so that the product ships
quickly and with little up-front investment in dependability. This latter approach is rewarded by venture
capitalists, who are looking for a faster return on their investments than a more process-driven approach that
would ensure high dependability can deliver.

The purpose of this project is to develop and evaluate organizational cultures that produce dependable software.
An objective is to create case studies which would allow cost/benefit evaluations of approaches resulting in highly
dependable software. The research would determine where these techniques are effective, the size and scope of
their impact, and how they were communicated and became imbued in the corporate culture. In particular,
favorable cost/benefit ratios would be helpful in garnering future support among venture capitalists and start up
companies, which often forgo high dependability techniques in favor of quick time to market. In fact, there is
evidence that Baldrige Award winners are several times more profitable over the long term than companies that
do not place a high value on dependability (http://nist.gov/public-affairs/releases/n99-02.htm).

In addition, organizational experiments could be conducted with participating companies, including awards for
adoption of effective techniques and government incentives. Cross-cultural studies could also be conducted, to
determine international variation in tolerances to the discipline of dependability engineering.

The methods of organizational design and analysis would need to be determined, possibly with the involvement of
a business school. Possible content domains include Internet and space software. Participants would include all
current HDCC categories plus the addition of venture capital companies. The deliverables would be case studies,
cost/benefit data, best practices, business cases, educated venture capitalists, and development organizations which
would espouse a high dependability software ethic.



