					"Do That Again": Evaluating Spoken Dialogue Interfaces

                        Frankie James
                        Manny Rayner
                       Beth Ann Hockey

                 RIACS Technical Report 00.06

                        February 2000
  “Do That Again”: Evaluating Spoken Dialogue Interfaces

                            Frankie James, Manny Rayner, and Beth Ann Hockey
                           Research Institute for Advanced Computer Science (RIACS)
                                                  Mail Stop 19-39
                                          NASA Ames Research Center
                                      Moffett Field, CA 94035-1000 USA
                                                  (650) 604-0197
                                     {fjames, manny, bahockey}
ABSTRACT
We present a new technique for evaluating spoken dialogue interfaces that allows us to separate the dialogue behavior from the rest of the speech system. By using a dialogue simulator that we have developed, we can gather usability data on the system’s dialogue interaction and behaviors that can guide improvements to the speech interface. Preliminary testing has shown promising results, suggesting that it is possible to test properties of dialogue separately from other factors such as recognition quality.

Keywords
Speech and voice, usability testing, scenarios, prototyping

INTRODUCTION
Speech recognition is an important interface modality for situations where canonical input devices are unusable. In addition to the voice dictation systems that have become popular with office workers, there is interest in using speech interfaces to command and control systems that allow users to issue commands and enter data in hands- and eyes-busy situations. For example, many NASA applications require astronauts to use computers in micro-gravity or while wearing space suits. Under these conditions, using a keyboard or mouse is difficult, if not impossible. Aircraft pilots may sometimes be able to use a keyboard or mouse, but will usually want to keep their eyes free rather than having to look at a display screen. Spoken dialogue interfaces can alleviate both of these problems, since all of the information may be communicated between the human and the computer using the audio channel.

As with any other modality, speech interfaces must be subjected to usability testing to ensure that the dialogue between the user and the machine is understandable and appropriate for the task. [10] However, evaluating speech interfaces is problematic due to the many layers of processing that are required to turn speech into system commands.

Our approach to evaluating speech interfaces is to extract the dialogue component from other parts of the system. By doing so, we can directly analyze the relationship between the system’s actions and the way in which the system’s intentions are made known to the user. Using prototyping techniques as described in [10], we have created a tool that presents users with pre-recorded interactions (scenarios [9]) for evaluating a system’s dialogue behavior. We describe here preliminary results from the design of this tool.

TESTBED: THE PERSONAL SATELLITE ASSISTANT
The testbed for our simulator is the Personal Satellite Assistant (PSA) [12], currently under development at NASA Ames Research Center. The PSA is envisioned as a small round robot (about the size of a softball) that contains a variety of sensors and can move autonomously in a micro-gravity environment, such as the Space Shuttle.

A speech interface is being developed at RIACS to support the use of the PSA in micro-gravity. [13] The interface accepts high level commands that could typically be asked of a human (e.g., “measure temperature at payload rack three”), as opposed to “verbal joystick” commands (e.g., “move ahead two feet, turn right three degrees, and measure temperature”). In turn, like a human, the PSA can ask clarification questions and confirm what it is planning to do.

EVALUATION OF SPOKEN DIALOGUES
Previous evaluations of speech interfaces have not focused directly on dialogue components of the system. The standard evaluation methods for spoken dialogue systems have usually focused on measures that can be easily quantified, thereby allowing easier comparison between competing systems. For example, spoken dialogue systems are often evaluated based on their speech recognition accuracy, by calculating word error rates using a benchmark corpus. Such measurements have been used to guide dialogue strategies that minimize errors (e.g., using system-initiative techniques where the user input is constrained by the system’s utterances [11]), but do not address expected system actions in response to spoken commands.

To evaluate system actions, it is necessary to look beyond simple recognition accuracy. Syntactic evaluation, where the system is judged by counting the proportion of correct parses, can help us to analyze the underlying language model, but not the system’s interpretation of that language. Semantic evaluations, which judge the meaning that the system derives from user utterances, can take the form of canonical answer specification [4], where the system rating is based on the SQL translation (and subsequent database lookup) of the utterances. [3] The drawback of such methods is that they can be effective for commands that do not need clarification to be evaluated, but not for exchanges involving negotiation between the user and the system. As we argue in [13], these exchanges are more common than stand-alone commands in normal speech.

Another alternative for evaluating speech interfaces is to look at the dialogue processing actions of the system,
which allows us to focus on the negotiations that take place between the user and the system. The current best way to evaluate dialogue is to use the PARADISE framework. [14] PARADISE rates overall dialogue performance by attempting to correlate user satisfaction measures with more readily quantifiable measures, focusing mainly on whether the information requirements of the user are met. The limitation of PARADISE is that, while it is an excellent generic framework for performing the correlations between user data and quantitative data (such as recognition accuracy or speed), it does not suggest how exactly to measure the user’s satisfaction with a particular dialogue strategy or system behavior. [5]

The PARADISE framework also brings to light an important point about dialogue evaluation: when testing a spoken dialogue system, user satisfaction may correlate with many factors, some of which have little to do with dialogue strategy. [6] Whether or not the system allows barge-in, for example, will greatly affect user satisfaction. This type of system behavior is not a dialogue strategy per se, but does indeed play an important role in usability. The point is that if a generic framework like PARADISE is used to gather usability data, the results will be affected by a myriad of system factors that are important in terms of interface design, but that can confound the results we are interested in regarding the system’s interpretations of the dialogue.

In dialogue evaluation, as with other usability tests, there is no clear right or wrong answer. For example, one issue that arose very soon after we began developing the PSA interface was the meaning of the phrase “do that again.” This utterance was originally intended as short-hand to issue the same command a second time. However, it was unclear what behavior to expect if “do that again” was uttered after the completion of a complex command such as “measure radiation at all three decks” or, even worse, after only part of the command had been executed (e.g., the radiation level had been reported only for the lower deck). (See footnote 1.) There are many circumstances under which the “correct” behavior for a spoken dialogue system can only be determined through user evaluation, including:
•   Ambiguous statements (e.g., “do that again”). Behavior for such statements will usually depend on context.
•   Whether or not the system should confirm before performing a task. This is important both for time-consuming and safety-critical activities.
•   Default settings (e.g., if no location is specified in a command, should the location default to where the PSA is, where the astronaut is, or where the PSA went last).

Footnote 1. There is already an extensive literature on the problem of scoping in the case of undo or redo that addresses problems similar to that of “do that again” in our interface (e.g., [6] [15]), although the specifics of choosing a granularity for undo and redo are still subject to user evaluation, as well as undo/redo in domains with continuous (non-discrete) actions. [8] However, “do that again” can also be taken as a more general illustration of the kinds of circumstances where direct user analysis of system behaviors is clearly warranted.

Simulator Design
As discussed above, there are many aspects of spoken dialogue interfaces that can and should be tested, including the accuracy of the speech recognition, the quality of the synthesized speech, and the system’s dialogue behavior. We are interested in testing the dialogue behavior, which is one of the main parts of the interface between the human and the system. To do this, it is necessary to have a way to separate the dialogue components of the system from the other components, such as speech recognition, so that user errors and reactions to the other parts of the system do not affect the evaluation of the dialogue. Performing a full-system evaluation to gather usability data (using a framework such as PARADISE) does not allow us to separate dialogue-level issues from other aspects of usability.

The most obvious way to separate out the dialogue component of a speech interface is by performing a “Wizard of Oz” (WoZ) experiment using a mock-up. [9] In this type of evaluation, a human “wizard” takes on the role of the computer so that the parts of the system that are not relevant to the test will always have ideal behavior. However, the problems with WoZ experiments, in terms of evaluating dialogue, are two-fold: (1) experimentation becomes expensive in terms of human time, since the wizard must be available for all trials, and (2) given that we are interested in gaining feedback on a given dialogue fragment, it is necessary under the WoZ paradigm for the user and the wizard to exactly replicate the dialogue fragment in question. The reproduction of this kind of canned interaction script is very unnatural for the user, and is much more efficiently done by the use of prerecorded dialogue fragments.

For this reason, we designed a dialogue simulator to replay the dialogue fragment we wish to study within the context of the full system’s visual display. (See footnote 2.) Our current version of the simulator is based on a spoken language interface to the PSA system. In the simulation, we use the same visual environment as the demonstration system, which shows a diagram of the Shuttle interior with a red dot representing the PSA that can move around in response to (simulated) user commands. The recorded dialogue fragment is played back via the audio component of the simulator, and is synchronized with the actions on the visual display.

Footnote 2. Although the current version of the simulator uses a visual display, it is not necessary to do so if the intended domain does not require the visual presentation of information. For example, if the system to be simulated is an audio-only interface for, say, the presentation of warnings and alerts to fighter pilots, we can substitute a realistic simulation of the pilot’s eye view of the cockpit to enable test users to more fully appreciate the actual usage context.

The dialogue simulator is controlled using a set of VCR-style buttons that allow the user to play, stop, rewind, or fast-forward the recorded scenarios. At the end of each scenario, the user fills out a questionnaire that is displayed in a pop-up window on the visual display. The questionnaire is automatically generated from a file containing question text (multiple choice, Likert scale, and free response) associated with the scenario. The questions thus far have been mainly of the type “what should the PSA do now,” in order to evaluate the system’s semantic analysis of the given dialogue fragments. However, other question types, such as adjective pair ratings, can also be used in scenarios where they are more appropriate. In any case, the simulator automatically records and saves the user’s responses for later analysis.

In addition to creating the simulator, we have also been working on the automatic generation of recorded scenarios using our larger demonstration system. In the case of the PSA, we use the log files that are generated through normal system use. The current demonstration system logs the text and audio of user utterances, system utterance text, PSA movement commands, and other commands that affect the visual environment. These log files can be edited into scenario syntax and fed directly to the simulator. The only additional work required is the creation of the questionnaire text file. Our goal is to log user interactions with the full PSA system, then use automatic log editing to create scenarios for dialogue interactions that require further study.

Our technique for automatic scenario generation eliminates the time-consuming task of generating scenarios by hand, allowing us to cycle more rapidly between testing dialogue behaviors on a small scale using the simulator, and testing an overall dialogue strategy within the context of the full system, using a PARADISE-like framework.

Usability Testing and Preliminary Results
Our preliminary test was designed to explore the feasibility of using the dialogue simulator to gather usability data. Our test users were six RIACS employees, most of whom had basic knowledge of the PSA system. Each user watched between two and seven scenarios from each of the three categories described earlier (ambiguous statements, confirmations, and default behaviors). At the end of each scenario, users answered a short questionnaire. To judge the simulator itself, users were given an end questionnaire containing adjective-pair questions about the test. [1]

Initial testing shows that the simulator interface is useful for evaluating dialogue-level issues in the PSA system. In general, users found the simulator both appropriate and pleasant to use (mean = 4.3/5.0, std. dev. = 0.82 for both), although there are still some usability issues to be addressed regarding the interface for loading and playing scenarios. In particular, after the user plays a scenario, the action for loading the next scenario requires returning to the main window to select the next scenario, which pops up a new controller window. A possible solution to this would be to automatically load subsequent scenarios, or to move the scenario selection widget from the main window to a permanently-displayed controller window.

The preliminary results from the scenario questionnaires show that dialogue simulator tests can produce results that are immediately useful for improving system design. For example, in the “do that again” example described above where the PSA is stopped in the middle of a complex command (e.g., “measure radiation at all three decks”), users were split on whether the PSA should reexecute the last step in the plan or reexecute the whole plan from the beginning. In this situation, it seems likely that the system should ask for further clarification. Similarly, in cases where users were asked whether the system should give a confirmation message before moving to take a measurement, the vote was split evenly. However, when the question was whether or not the PSA should confirm before closing the crew hatch, users voted unanimously for the confirmation, presumably because of the safety-critical nature of the task. Although our current user sample is too small to make judgements about how to alter the dialogue, the simulator will allow us to rapidly test enough users to get some general consensus on these dialogue-level issues.

CONCLUSION AND FUTURE WORK
When adding new functionality to any interface, it is often necessary to determine exactly what the new commands will mean in the context of an interaction with a user. In many cases, the system developer simply decides what the commands will mean; however, it is often the case that after users try out the commands, the actions that are performed are somewhat different from what they expected. When this happens, additional testing must be done to determine whether the command’s behavior is reasonable. The most efficient way to do this testing is to ask users, “given these circumstances, what should happen when this command is issued?,” using a technique known as forward scenario simulation. [2] The dialogue simulator we have presented in this paper is an attempt to do just this sort of test for spoken dialogue systems.

Our dialogue simulator allows us to gather user feedback by taking advantage of prototyping techniques that focus users on relevant parts of the interaction. Users only have to observe and evaluate interesting dialogues, and do not have to produce utterances themselves. This eliminates the possibility of confounding dialogue evaluation with the evaluation of other parts of the system, especially speech recognition. Additionally, since it is unlikely that two users will produce the exact same utterances (even when performing the same task), sample dialogues can be tested with multiple users to get results that can be evaluated statistically. Finally, new dialogue functionality can be tested (as in other prototype evaluations) by simply recording new types of user utterances and potential system responses.

We feel that this simulation technique is an important step forward for the evaluation of spoken dialogue systems. Our plan is to apply it to other speech applications by porting the dialogue simulator. The basic mechanism for running scenarios and their associated questionnaires will remain the same; only the visual display (and associated commands) will need to be domain-specific.

REFERENCES
1. Coleman, W.D., et al. 1985. Collecting detailed user evaluations of software interfaces. In Proc. Human Factors Society 29th Annl. Mtg.
2. Cordingley, E. 1989. Knowledge elicitation techniques for knowledge based systems. In D. Diaper, ed., Knowledge Elicitation: Principles, Techniques, and Applications. Chichester, U.K.: Ellis Horwood. pp. 89–172.
3. Hemphill, C.T., et al. 1990. The ATIS spoken language systems pilot corpus. In Proc. DARPA Speech and Natural Language Workshop.
4. Hirschman, L. Evaluating Spoken Language Interaction: Experiences from the DARPA Spoken Language Program 1988–1995. To appear. See http://
5. James, F., M. Rayner, and B.A. Hockey. 2000. Accuracy, coverage, and speed: What do they mean to users? To appear in CHI 2000 Workshop on Natural-Language
6. Lenman, S., et al. 1991. An experimental study on the granularity and range of the undo function in user interfaces. In Human Aspects in Computing: Proceedings of the Fourth International Conference on Human–Computer Interaction.
7. Litman, D., et al. 1998. Evaluating response strategies in a web-based spoken dialogue agent. In COLING-ACL ’98.
8. Mashayekhi, V., et al. 1995. User recovery of audio operations. Proceedings of the International Conference on Multimedia Computing and Systems.
9. Nielsen, J. 1990. Paper versus computer implementations as mockup scenarios for heuristic evaluation. INTERACT ’90, pp. 315–320.
10. Nielsen, J. 1993. Usability Engineering. San Francisco, CA: Morgan Kaufmann, Inc.
11. Oviatt, S. 1994. Interface techniques for minimizing disfluent input to spoken language systems. In Proc. CHI ’94.
12. Personal Satellite Assistant. psa/ As of 15 February 2000.
13. Rayner, M., B.A. Hockey, and F. James. 2000. Turning speech into scripts. To appear in AAAI Spring Symposium on Natural Dialogues with Practical Robotic Devices.
14. Walker, M., et al. 1997. PARADISE: A Framework for evaluating spoken dialogue agents. In Proc. ACL 35th Annl. Mtg.
15. Yang, Y., et al. 1992. Motivation, practice and guidelines for ‘undoing’. Interacting with Computers, 4(1):23–40.
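
Supplementary sketch. The Simulator Design section says that each scenario's questionnaire is generated from a file of question text (multiple choice, Likert scale, and free response) and that responses are saved for later analysis. The report does not give the file syntax, so the following is only a minimal illustration under stated assumptions: the `KIND | text | options` line format, the names `Question`, `parse_questionnaire`, `tally`, and `is_split`, and the sample votes are all hypothetical and are not the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical file format (an assumption, not the simulator's real syntax):
#   KIND | question text | options
# MC options are comma-separated; LIKERT gives the number of scale points.
SAMPLE = """\
# questionnaire for one scenario
MC | What should the PSA do now? | Repeat the last step, Repeat the whole command, Ask for clarification
LIKERT | The PSA's response was appropriate. | 5
FREE | Any other comments? |
"""


@dataclass
class Question:
    kind: str                      # "MC", "LIKERT", or "FREE"
    text: str
    options: list = field(default_factory=list)


def parse_questionnaire(source):
    """Turn questionnaire text into Question objects, skipping blanks and comments."""
    questions = []
    for raw in source.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3:
            raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
        kind, text, opts = parts[0].upper(), parts[1], parts[2]
        if kind == "MC":
            options = [o.strip() for o in opts.split(",") if o.strip()]
        elif kind == "LIKERT":
            options = [str(n) for n in range(1, int(opts) + 1)]
        elif kind == "FREE":
            options = []           # free response has no fixed choices
        else:
            raise ValueError(f"unknown question kind: {kind!r}")
        questions.append(Question(kind, text, options))
    return questions


def tally(question, answers):
    """Count answers per option, keeping options that received no votes."""
    counts = Counter({option: 0 for option in question.options})
    for answer in answers:
        if answer not in counts:
            raise ValueError(f"unexpected answer: {answer!r}")
        counts[answer] += 1
    return counts


def is_split(counts):
    """True when the two most-voted options are tied with at least one vote each."""
    ranked = counts.most_common(2)
    return len(ranked) == 2 and ranked[0][1] == ranked[1][1] and ranked[0][1] > 0


if __name__ == "__main__":
    for q in parse_questionnaire(SAMPLE):
        print(q.kind, "-", q.text, q.options)
    # Hypothetical votes from six users on a confirmation question.
    confirm = Question("MC", "Should the PSA confirm first?", ["Yes", "No"])
    print(is_split(tally(confirm, ["Yes"] * 3 + ["No"] * 3)))
```

A tie check of this kind is one simple way to flag dialogue behaviors where, as in the split votes reported above, the system should probably ask the user for clarification instead of picking a default.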
