AUTOMATION RELIABILITY IN UNMANNED AERIAL VEHICLE FLIGHT CONTROL
Stephen R. Dixon & Christopher D. Wickens
University of Illinois at Urbana-Champaign
ABSTRACT
Thirty-two students flew a simulated unmanned aerial vehicle (UAV) through ten mission legs
while searching for targets of opportunity and monitoring system parameters. Participants were
assisted by automation which provided auditory alerts in response to system failures (SF). The
auto-alerts were either 80% reliable or 60% reliable; the latter condition resulted in either a 3:1
ratio of false alarms to misses, or vice versa. Results indicated that the 80% reliable automation
exceeded baseline (no automation) performance in the target search task. The two 60% reliable
conditions provided no benefits to performance; both false alarms and misses hurt performance
in the automated task and concurrent tasks, but did so in qualitatively different ways. The
results suggest that automated aids must be fairly reliable to provide global benefits; data
regarding the relative costs of misses versus false alarms on performance were equivocal.
Keywords: unmanned aerial vehicle, automation, false alarm, miss
INTRODUCTION
Flying a single unmanned aerial vehicle (UAV) includes navigating the UAV, monitoring craft
parameters, and searching for possible targets (Dixon & Wickens, 2003). The military currently
employs different forms of automation to aid pilots in these tasks; however, very few automated
aids are perfectly reliable, and can create different states of overtrust, undertrust, or calibrated
trust (Parasuraman & Riley, 1997). It is unclear how unreliable the automation needs to be to
cause performance to drop below that of baseline (no automation), and while a 70% “threshold”
has been offered (Dixon & Wickens, 2003; Lee & See, in press), there are noted exceptions both
above and below that level (e.g. Dzindolet et al., 1999; Rovira, Zinni, & Parasuraman, 2002).
Dixon & Wickens (2003) found benefits for an auto-pilot with 67% reliability, but costs for an
auto-alerting system at the same reliability level, and reasoned that under conditions of high
workload, an operator may rely upon imperfect automation even if the automation is not fully
trusted. Such reliance will degrade performance of the automated task itself even as it helps
concurrent tasks (e.g. Rovira et al., 2002).
Within the class of automation that guides attention to notice or diagnose a failure
(Parasuraman et al., 2000), unreliable aids will create false alarms (alarm with no event) and/or
misses (no alarm with an event). False alarms tend to cause distrust in the aid (Meyer & Ballas,
1997), while misses lead to reallocation of visual resources to the raw data in order to “catch” the
automation miss (Cotté, Meyer & Coughlin, 2001). Using target recognition automation, Maltz
& Shinar (2003) found that increasing false alarm rates caused greater disruption to performance
than did increasing miss rates. Dixon & Wickens (2003) also made such a contrast by having
pilots perform a high-fidelity UAV simulation under conditions with either no automation,
perfectly reliable auto-alerts, or 67% reliable auto-alerts with either false alarms or misses.
Results revealed that while the perfectly reliable auto-alerts benefited the automated task, the two
imperfect auto-alert conditions equally hurt performance in both the automated task and
concurrent tasks.
While Dixon & Wickens (2003) used conditions with only false alarms or only misses,
the current study included an 80% reliable condition with an equal number of false alarms and
misses, as well as two 60% reliable conditions with a 3:1 ratio of false alarms to misses and vice
versa. We hypothesized that (1) 80% reliability would consistently improve performance above
baseline; (2) both 60% reliability conditions would degrade performance below baseline; (3)
decrements due to unreliability would be more pronounced on the automated task than on
concurrent tasks; and (4) miss-prone automation would disrupt concurrent tasks more than false-
alarm prone automation, because of the former’s requirement for more continuous visual
monitoring of SF status. Please refer to Dixon & Wickens (2004) for a more thorough
presentation of the experimental methods.
METHOD
Participants and Equipment. Thirty-two students at the University of Illinois received $8 per
hour, plus bonuses of $20, $10, and $5, for 1st, 2nd, and 3rd place finishes, respectively, in their
group of eight pilots. Figure 1 presents a sample display for a UAV simulation, with verbal
explanations for each display window and task.
Figure 1. A UAV display with explanations for different visual areas.
Procedure. Each pilot flew one UAV through ten different mission legs, in one of the four
experimental conditions, while searching for targets of opportunity and monitoring system
parameters. Pilots obtained flight instructions via the Message Box, including fly-to coordinates
and a report question pertaining to the command target (CT). These instructions were present for
15 seconds, and pressing a repeat key automatically refreshed the flight instructions for an
additional 15 seconds.
CT reports required that pilots loiter around the target, manipulate a camera for closer
target inspection, and report back relevant information to mission command. Along each
mission leg, pilots were also responsible for detecting and reporting targets of opportunity
(TOO), a task similar to the CT report, except that the TOOs were much smaller (1-2 degrees of
visual angle) and camouflaged. TOOs could occur during simple tracking (low workload) or
during a pilot response to a system failure (high workload).
Concurrently, pilots were also required to monitor system gauges for possible system
failures (SF), which were indicated by the white needle moving into a red zone (at the top or
bottom of the gauges). SFs were designed to fail either during simple tracking (i.e. low
workload) or during TOO and CT inspection (i.e. high workload). The SFs lasted only 30
seconds, after which the screen flashed bright red and a salient auditory alarm announced that the
pilot had failed to detect the SF.
Automation aids, in the form of auditory auto-alerts during SFs, were provided for three
out of the four conditions. The A80 condition (A = automation; 80% reliable) failed by giving
one false alarm (i.e. alarm with no actual SF), and one miss (i.e. a SF with no alarm) during each
mission. The A60f condition (f = false alarm; 60% reliable) resulted in more false alarms (3)
than misses (1), while the A60m condition (m = miss; 60% reliable) resulted in more misses (3)
than false alarms (1). Pilots were told that the automation was either “fairly reliable” or “not very
reliable”, as well as the bias setting (i.e. more false alarms or more misses). Ratings of
subjective trust were given by each pilot at the end of the mission.
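The reliability labels follow from simple counting of automation failures. The short sketch below is illustrative only: it assumes ten alert-relevant events per mission, a total that is not stated in this paper but that reproduces the 80% and 60% labels from the failure counts given above. The function and constant names are hypothetical.

```python
from fractions import Fraction

# Assumption: ten alert-relevant events per mission (not stated in the text;
# chosen because it reproduces the 80% and 60% reliability labels).
ASSUMED_EVENTS_PER_MISSION = 10

# Automation failures per mission as described above: (false alarms, misses).
CONDITIONS = {
    "A80": (1, 1),
    "A60f": (3, 1),
    "A60m": (1, 3),
}


def reliability(false_alarms, misses, events=ASSUMED_EVENTS_PER_MISSION):
    """Fraction of events on which the auto-alert behaved correctly."""
    return 1 - Fraction(false_alarms + misses, events)


for name, (fa, miss) in CONDITIONS.items():
    pct = float(reliability(fa, miss))
    print(f"{name}: reliability = {pct:.0%}, false alarms : misses = {fa}:{miss}")
```

Under this assumption, the two 60% conditions share the same overall reliability and differ only in the direction of the bias (3:1 versus 1:3).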
RESULTS
3.1 Mission Completion. Tracking error was not affected by condition [F(3, 27) = 1.24, p >
.10]. The number of repeats was affected by condition [F(3, 25) = 3.56, p = .029]; however, only
the A60m condition (mean = 8.5) suffered relative to the baseline condition (mean = 3) [p < .01].
3.2 Targets of Opportunity (TOO) and Command Targets (CT). For TOO detection rates,
only the A80 condition (mean = 93%) improved performance relative to baseline (mean = 76%)
[p < .05]. For TOO detection times, as shown in Figure 2, an interaction between condition and
load [F(3, 23) = 4.82, p = .01] indicates that the condition effect was only present at high load.
[Plot: TOO Detection Times (High Load vs. Low Load). Y-axis: Detection Times (secs), 0 to 20. X-axis: Condition (Man, A80, A60f, A60m). Series: High Load, Low Load.]
Figure 2. TOO detection times across condition and workload. SE bars are included.
Figure 2 reveals that the penalty for increased load was higher for both the A60f (mean =
14.73) and the A60m (mean = 11.87) conditions relative to baseline (mean = 6.04) [all p < .05].
Only the A60f condition differed from the A80 condition (mean = 8.58) [p < .01]. For CT
detection times, there was a main effect of condition [F(3, 27) = 6.16, p < .01], and both the A60f
(mean = 4.17) and the A60m (mean = 4.11) conditions suffered relative to baseline (mean =
2.45) [all p < .05].
3.3 System Failures (SF). For SF detection rates, higher load reduced detection rates [F(1, 27)
= 21.46]; however, there was no main effect of condition [F(3, 27) < 1.0], or interaction [F(3, 27)
< 1.0]. For SF detection times, as shown in Figure 3, higher load increased detection times [F(1,
27) = 93.3, p < .001]. The main effect of condition [F(3, 27) = 3.62, p = .026] can only be
interpreted in the context of the interaction [F(3, 27) = 3.06, p = .045], which reveals that the
A60f condition (mean = 19.99) suffered more due to high load than the other conditions.
[Plot: SF Detection Times (High Load vs. Low Load). Y-axis: Detection Times (secs), 0 to 25. X-axis: Condition (Man, A80, A60f, A60m). Series: Low Load, High Load.]
Figure 3. SF detection times across condition and workload. SE bars are included.
Figure 3 reveals that the penalty due to high load was approximately 6-9 seconds more
for the A60f condition than the other three conditions [all p < .03]. We note that each of the 60%
condition means is actually composed of two different components: responses when an alert
correctly sounded, and those when the alert failed to sound. Table 1 shows the resulting four
means, within the high workload condition.
Table 1. Component means in the A60f and A60m conditions. SE is in parentheses.
                      CONDITION
EVENT                 A60f                 A60m
Miss (failure)        26.05 sec (1.83)     23.29 sec (2.77)
Alarm (correct)       13.93 sec (4.85)     3.96 sec (1.17)
The data reveal a clear slowing of RT when the alarm “missed” the SF event,
indicating that in both conditions, pilots had relied heavily upon the automation, and their
detection suffered when it failed. Correct alerts were responded to more rapidly with the
miss-prone automation (mean = 3.96) than with the false-alarm-prone automation (mean = 13.93) [p < .05],
reflecting the pilots’ immediate compliance with the auditory alert (Meyer, 2001) in the former
condition, in contrast to the false-alarm prone condition, where pilots were less likely to interrupt
target inspection to deal with the alarms. We also infer that greater compliance in the miss
condition is coupled with an ongoing greater awareness of the SF gauges, fostered by a reduced
reliance on that automation, and causing greater disruption to memory recall.
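The component means in Table 1 can be recombined with a few lines of arithmetic. The sketch below is illustrative: the helper names are hypothetical, and equal weighting of the two components is an assumption rather than a documented analysis step. That assumption does reproduce the A60f high-load mean of 19.99 seconds reported above.

```python
# Table 1 component means (seconds), high workload only.
TABLE_1 = {
    "A60f": {"miss": 26.05, "alarm": 13.93},
    "A60m": {"miss": 23.29, "alarm": 3.96},
}


def unweighted_mean(components):
    """Equal-weight average of the miss and alarm components (an assumption)."""
    return sum(components.values()) / len(components)


def miss_cost(components):
    """Extra detection time when the alert failed to sound, relative to a correct alert."""
    return components["miss"] - components["alarm"]


for condition, components in TABLE_1.items():
    print(
        f"{condition}: unweighted mean = {unweighted_mean(components):.2f} s, "
        f"cost of an automation miss = {miss_cost(components):.2f} s"
    )
```

Read this way, the slowing on missed events is large in both conditions, while the two conditions differ mainly in how quickly pilots responded to correct alerts.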
3.4 Subjective ratings of trust. Pilots were surprisingly accurate in their overall assessment of
the automation reliability [A80 = 82%; A60f = 54%; A60m = 56%], in contrast to Dixon &
Wickens (2003), who concluded that pilot trust in the automation was poorly calibrated when
they did not receive any prior information as to reliability levels or bias setting.
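One simple way to express this calibration is the signed difference between the pilots' estimate and the programmed reliability. The sketch below uses only the percentages reported above; the function name and the choice of a signed difference as the calibration measure are illustrative assumptions.

```python
# Programmed reliability and mean subjective estimate (percent), from the text.
ACTUAL = {"A80": 80, "A60f": 60, "A60m": 60}
ESTIMATED = {"A80": 82, "A60f": 54, "A60m": 56}


def calibration_error(condition):
    """Signed error in percentage points; positive means reliability was overestimated."""
    return ESTIMATED[condition] - ACTUAL[condition]


for condition in ACTUAL:
    print(f"{condition}: {calibration_error(condition):+d} percentage points")
```

On this measure, pilots slightly overestimated the A80 automation and slightly underestimated both 60% conditions.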
DISCUSSION
The A80 condition (80% reliability) supported a significant increase in concurrent task
performance, confirming our first hypothesis. This indicates that the automation, while
imperfect, still allowed pilots to save visual and cognitive resources, which they could reallocate
to the concurrent target search task (Rovira et al., 2002).
At 60% reliability, neither the false alarm nor miss conditions (A60f and A60m) provided
any benefits, and in some instances performance was well below baseline during high workload
conditions, thereby confirming hypothesis 2. In general, however, the costs of imperfection were
as heavily borne by the concurrent tasks as by the SF task itself, a pattern inconsistent with
hypothesis 3.
Finally, regarding hypothesis 4, the false alarm condition (on average, across
performance measures) resulted in slightly poorer performance in the SF detection task than did
the miss condition. On the one hand, the miss condition degraded CT memory (requiring more
repeats) to a greater extent than did the false alarm condition, supporting hypothesis 4. That is,
more continuous monitoring of the raw system data was required in the miss condition. On the
other hand, the false-alarm condition (in high workload) appeared to delay detection of a TOO
that became visible while the failure was present, more than the miss condition. This difference
we attribute to pilots’ need, when an alarm sounds in the A60f condition, to double check the
raw data (visual system gauges) to assess its consistency with the auditory alert (a distrust, or
reduced compliance). Thus the two types of automation imperfection had opposing effects on the
concurrent tasks, both replicating prior findings of Dixon & Wickens (2003).
With regard to SF performance itself, Figure 3 and Table 1 clearly indicate lower costs
for the miss condition than for the false alarm condition at high workload, a pattern at odds with
that reported by Dixon & Wickens (2003). We can account for the current pattern in terms of the
greater compliance with, and lesser reliance on, the imperfect automation in the miss than in the
false alarm condition (Meyer, 2001). Compliance is increased because of the belief that if an
alarm sounds, it is quite likely to be true. Reliance on the alert is reduced because of the subjects’
knowledge that it may frequently fail to signal a true system failure. The reason for the
discrepancy of the current pattern of results with those of Dixon and Wickens requires further
research.
The implications of this study are that higher reliability automation is necessary to
facilitate improvements in overall performance relative to baseline, and that false alarms may be
more detrimental to overall alerted task performance than misses.
ACKNOWLEDGMENTS
This research was sponsored by a subcontract # ARMY MAAD 6021.000-01 from
Microanalysis and Design, as part of the Army Human Engineering Laboratory Robotics CTA,
contracted to General Dynamics. David Dahn and Marc Gacy were the scientific/technical
monitors. Any opinions, findings, and conclusions or recommendations expressed in this paper
are those of the authors and do not necessarily reflect the views of the Army CTA. The authors
also wish to acknowledge the support of Ron Carbonari and Jonathan Sivier (in developing the
UAV simulation), and of Dervon Chang for assisting with data collection.
REFERENCES
Cotté, N., Meyer, J., & Coughlin, J. F. (2001). Older and younger drivers’ reliance on collision
warning systems. Proceedings of the 45th Annual Meeting of the Human Factors and
Ergonomics Society (pp. 277-280). Santa Monica, CA: Human Factors and Ergonomics Society.
Dixon, S. & Wickens, C.D. (2003). Imperfect Automation in Unmanned Aerial Vehicle Flight
Control. (AHFD-03-17/MAAD-03-1). Savoy, IL: University of Illinois, Aviation Research
Lab.
Dzindolet, M. T., Pierce, L. G., Beck, H. P., & Dawe, L. A. (1999). Misuse and disuse of
automated aids. Proceedings of the 43rd Annual Meeting of the Human Factors and
Ergonomics Society (pp. 339-343). Santa Monica, CA: Human Factors and Ergonomics
Society.
Lee, J. D., & See, K. A. (in press). Trust in automation: Designing for appropriate reliance.
Human Factors.
Maltz, M., & Shinar, D. (2003). New alternative methods in analyzing human behavior in cued
target acquisition. Human Factors, 45, 281-295.
Meyer, J. (2001). Effects of warning validity and proximity on responses to warnings. Human
Factors, 43, 563-572.
Meyer, J., & Ballas, E. (1997). A two-detector signal detection analysis of learning to use alarms.
Proceedings of the 41st Annual Meeting of the Human Factors and Ergonomics Society
(pp. 186-189). Santa Monica, CA: Human Factors and Ergonomics Society.
Parasuraman, R., & Riley, V. (1997). Humans and automation: Use, misuse, disuse, abuse.
Human Factors, 39, 230-253.
Parasuraman, R., Sheridan, T. B., & Wickens, C. D. (2000). A model for types and levels of
human interaction with automation. IEEE Transactions on Systems, Man, & Cybernetics,
30(3), 286-297.
Rovira, E., Zinni, M., & Parasuraman, R. (2002). Effects of information and decision automation
on multi-task performance. Proceedings of the 46th Annual Meeting of the Human Factors
and Ergonomics Society (pp. 327-331). Santa Monica, CA: Human Factors and Ergonomics
Society.