									Cover letter for submission to CHI 2006 Interactivity

Submission title: Feedback management in the pronunciation training system ARTUR

Authors: Olov Engwall, Olle Bälter, Anne-Marie Öster, Hedvig Kjellström; KTH,
Stockholm, Sweden.

Contact information:

Olov Engwall [olov@speech.kth.se]
Centre for Speech Technology
Lindstedtsv. 24
SE-100 44 Stockholm

tel. +468 790 75 65
fax. +468 790 78 54

The submission is relevant to the following communities (in order of relevance):

Research (the submission presents applied research in the area of human-computer
interaction in computer-assisted language learning),

Usability (the work is focused on usability tests of different feedback strategies in a
language tutoring system),

Design (of the multimodal – graphical and spoken – user interface for instructions and
feedback), and

Engineering (of the organization and implementation of different speech technology and
user interface components in a computerized interactive pronunciation training system).

Please note:

The video has been compressed to meet the 200 MB limit. The image quality of the video
that will be presented at the conference will be significantly higher.

Feedback management in the pronunciation training system ARTUR

Olov Engwall
Centre for Speech Technology
KTH
SE-100 44 Stockholm, Sweden
olov@speech.kth.se

Anne-Marie Öster
Centre for Speech Technology
KTH
SE-100 44 Stockholm, Sweden
annemarie@speech.kth.se

Olle Bälter
Interaction and Presentation Laboratory
KTH
SE-100 44 Stockholm, Sweden
balter@kth.se

Hedvig Kjellström
Interaction and Presentation Laboratory
KTH
SE-100 44 Stockholm, Sweden
hedvig@kth.se

Copyright is held by the author/owner(s).
CHI 2006, April 22–27, 2006, Montreal, Canada.
ACM 1-xxxxxxxxxxxxxxxxxx.

Abstract
This extended abstract discusses the feedback given to the user of a computer-assisted
pronunciation training system, depending on the level of feedback management that has
been implemented.

Keywords
Computer-assisted pronunciation training, virtual tutor, feedback, Wizard of Oz,
usability.

ACM Classification Keywords
H5.2. Information interfaces and presentation (e.g., HCI): User interfaces.

Learning a language may be very rewarding, but also frustrating, if a hearing impairment
makes it difficult to discern important distinctions, or when it is a foreign language
with speech sounds (i.e. phonemes) that are unfamiliar from the mother tongue. In order
to master the new phonemes, the learner must first become perceptually aware of the
distinction between the target phoneme and familiar sounds. The learner must then
understand how to realize the distinction in his or her own production. The final step
is to achieve automaticity, i.e. to be able to produce the sound without conscious
planning.

The task of speech therapists and language teachers is to support this process by
detecting pronunciation errors, diagnosing the cause, giving feedback on how to improve
the pronunciation and stimulating the student to reach automaticity by repeated
training.

Human teachers are very apt at detecting errors and can often give pedagogical
explanations on how to improve the pronunciation. Classroom teaching does not, however,
permit the large amounts of repeated training needed to achieve automaticity.

Computer-assisted pronunciation training (CAPT) has the benefit that the student may get
unlimited amounts of practice, at any time. All the existing commercial or research
systems are however still vastly inferior to human teachers. One reason is that their
detection and diagnosis of pronunciation errors is not good – and especially not robust –
enough. The major problem is however that the pedagogy for giving feedback to the
student has fallen behind [4]. Instead of basing the feedback on the pedagogical needs
of the student, it has been defined by what is technologically easy to present, such as
a pronunciation score or a graphical representation of the acoustic difference between
the user’s pronunciation and a correct model. Such feedback is profoundly non-intuitive
and difficult to interpret for untrained users. As a consequence, the major breakthrough
for CAPT has yet to come.

ARTUR - the ARticulation TUtoR
In order to address the issue of non-intuitive feedback in existing CAPT systems, we are
currently developing ARTUR, the ARticulation TUtoR [2], a virtual tutor who uses
three-dimensional animations of the face and internal parts of the mouth to give his
students feedback in pronunciation training. The structure and components of this
virtual tutor are outlined in Figure 1.

[Figure 1: block diagram of the ARTUR system. Components shown: video image, computer
vision, relation between facial and vocal tract movements, mispronunciation detection,
articulatory inversion, speaker adaptation (model scaling), VT model and feedback
display. The speech example shown is “Hally Pottel”. Annotations in the figure:
•    Next to computer vision: No visual tracking and recognition are performed in the
     Wizard of Oz tests. Instead, video images are stored in an audiovisual database for
     training of the system.
•    Next to mispronunciation detection and articulatory inversion: These tasks are
     performed by the human judge in the Wizard of Oz tests.
•    Next to the VT model: The vocal tract (VT) model of the tongue, teeth, palate etc.
     is generated through a statistical analysis of a Magnetic Resonance Imaging
     database of a subject producing Swedish vowels and consonants.
•    Next to the feedback display: The user interface consists of a feedback display
     showing audiovisual animations, a window with the word to practice and a set of
     interaction buttons.]

Figure 1. Overview of the ARTUR system in automatic and Wizard of Oz set-ups.

An overview of ARTUR is also given in the introduction video [1] recorded for CHI 2006.

To test the usability of the system and involve end users at an early stage of the
development process, we are conducting Wizard of Oz studies [2], in which a human,
phonetically trained judge replaces the automatic detection of mispronunciations and the
diagnosis of the cause. The human wizard chooses the feedback given to the student from
a set of pre-generated audiovisual instructions on how to improve the articulation. We
are now investigating feedback management in the virtual tutor, as one conclusion in [2]
was that the set of pre-generated feedback was not optimal for all the mispronunciations
that occurred during the training.

Feedback management in CAPT
In the most basic form of pronunciation feedback in CAPT, the user only gets information
on whether the pronunciation was correct enough, or on which part of the utterance was
most incorrect, without any cues about the problem. In ARTUR the feedback has been
extended to include both detailed instructions on how a pronunciation should be
corrected and general encouragement. In a previous user study [2], the wizard found that
the detailed instructions were inadequate when

•    The student repeated the same error several times. It would then be pedagogically
     unsound to repeat exactly the same feedback.
•    The error fell between the defined categories – the pronunciation was not correct,
     but it was better than in the predefined prototypic mispronunciations.
•    The wizard was unable to clearly diagnose which articulation mistake had caused the
     error.
•    The student started to lose motivation, because the virtual tutor’s feedback was
     too long and detailed.

To solve these issues, we have begun investigating feedback strategies used by human
language teachers (e.g. [3]) when they are faced with repeated pronunciation errors or
cannot pinpoint what the error was, in order to evaluate which of the strategies could
be automated in a CAPT system.

In parallel, we are implementing a multi-level feedback strategy in ARTUR, to be able to
give better feedback for more varied pronunciation errors. At CHI 2006 Interactivity we
will illustrate the different levels of feedback given by ARTUR in a training task
focused on two of the Swedish fricatives: “s” and “sj” (for the latter the constriction
is made with the tongue body at the velum, which is uncommon in other languages).
Figure 2 shows the feedback loop of a training word. If the wizard deems that
corrections are needed, the amount and detail of feedback is adapted to the user’s
previous performance, progress and mood, in order to maximize the efficiency of the
feedback instructions and avoid demoralizing the student. Examples of such feedback are
given in Table 1.

Successful management of feedback is even more important in a fully automatic system, as
adequate fallback solutions are needed in cases where the mispronunciation detection or
the articulatory inversion fails, which will happen more often with current
state-of-the-art speech technology components than with a human judge.

Table 1. Examples of feedback responses given in the different categories, for the
training word “sjal” (scarf).

Type of feedback          Example
Positive                  “Yes, that was really good!”
Detailed, first time      “That sounded more like ‘shal’; try to retract the tongue to
                          get the narrow passage further back.”
Augmented, second time    “The constriction is still too forward. Remember to let the
                          back of the tongue touch the palate.”
Vague 1                   “Not quite. Think about where you place the tongue tip.”
Vague 2                   “Yes, almost. Say it once more: ‘sjal’.”
Encourage 1               “Not bad at all. Let’s try the next word.”
Encourage 2               “Good try! Could you say it again?”
Encourage 3               “It sounds much better now!”

[Figure 2: flowchart. Nodes and labels recoverable from the figure: training word, user
pronunciation, “Correct?”, positive feedback, next word, same word, “Known”, “Some
idea”, “Important?”, “First time?”, “Is more feedback pedagogically ...”, encouragement
1 to 3, vague feedback 1 and 2, detailed correction feedback, correction feedback, and
the note “The user probably knows what’s wrong, but needs more practice”.]

Figure 2. Flowchart of the feedback management in the ARTUR system. The grey and black
boxes show feedback at higher levels. The dashed arrows indicate feedback solutions that
can be avoided with the multi-level feedback system.

The ARTUR project is funded by the Swedish Research Council and the Centre for Speech
Technology is supported by VINNOVA (The Swedish Agency for Innovation Systems), KTH and
participating Swedish companies and organizations. The ARTUR information video was
sponsored by the Christian Benoît award received by the first author.

[1] ARTUR information video. Available at http://www.speech.kth.se/multimodal/ARTUR
[2] Bälter, O., Engwall, O., Kjellström, H., Öster, A-M., Wizard-of-Oz Test of ARTUR - a
    Computer-Based Speech Training System with Articulation Correction. Proc ASSETS 2005.
[3] Morley, J., The Pronunciation Component in Teaching English to Speakers of Other
    Languages. TESOL Quarterly, 25 (1991), 481-520.
[4] Neri, A., Cucchiarini, C., Strik, H. and Boves, L., The pedagogy-technology
    interface in Computer Assisted Pronunciation Training. Computer Assisted Language
    Learning 15 (2002), 441-467.

  Proposal for Interactivity presentation of the ARTUR system at the
                        Chamber at CHI 2006.

Presentation format:
The presentation of ARTUR at the Chamber would consist of the three parts shown in
Figure 1:

[Figure 1: layout of the ARTUR station, with the following parts:
 1. Computer (or TV+DVD) showing the information video
 2. Hands-on experience: computer screen + cable, with the wizard controlling the session
 3. Discussions about the interface and feedback]

Figure 1. Overview of the presentation of ARTUR at the Chamber. Boxed text indicates
equipment that the conference organizers are kindly asked to provide.

Visitors to the ARTUR station in the Chamber will

   1) First be shown the introduction video (5 minutes) that explains the aim,
      components and research issues of the computer-based speech training system
   2) Then experience hands-on training with the system (5-10 minutes), practicing on
      the pronunciation of minimal pairs of Swedish words starting with either ‘s’ (i.e. a
      sound that is most probably known from their mother tongue) or the distinctively
      Swedish rounded velar fricative ‘sj’ (i.e. a sound that is probably unknown and
      absent in their mother tongue). The system will be run in a Wizard of Oz mode
      (i.e. controlled by a human judge) for reasons outlined in the “Justification for the
      choice of presentation format” section below. Working with a Wizard of Oz
      version of ARTUR will let the visitors
           a. Experience the multimodal instructions and feedback that are unique to the
              ARTUR system.
           b. Experience different types of feedback management that a computer-based
              speech training system could employ to react to errors in the student’s
              pronunciation. The different levels of feedback management are described
              in the Extended Abstract, but could be summarized as: fixed one-level
              feedback (i.e., the same feedback is always given for the same error),
              varied one-level feedback for repeated errors (i.e., the level of detail in the
              feedback remains fixed, but the instructions are rephrased), fixed
              sequential level for repeated errors (i.e., if an error is repeated, the
              feedback instructions will be on another level, with more or less details)
              and multi-level feedback for judgment insecurities, importance of the error
              and user mood (i.e., the feedback might be vaguer or even suppressed).
           c. Reflect on differences between human tutor feedback, Wizard of Oz
              system feedback and fully automatic feedback.

3) Finally be given the opportunity to discuss (approximately 5-10 minutes) the ARTUR
system in general and the feedback given by the human-computer interface in particular
with the presenter(s). We foresee that the discussions will be centered on the different
types of feedback strategies that the user experienced during the hands-on session and the
extent to which this could be fully automated in an unsupported computer-based speech
training system.
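
The first three feedback management levels summarized in item 2b can be sketched roughly
as follows. This is only an illustration under stated assumptions, not the ARTUR
implementation: the names Strategy, MESSAGES and respond are hypothetical, and the
message strings are placeholders rather than actual ARTUR instructions. The multi-level
case is left out because it depends on further judgments (certainty, importance and user
mood).

```python
from enum import Enum


class Strategy(Enum):
    """Feedback management levels (hypothetical names for illustration)."""
    FIXED_ONE_LEVEL = 1    # same feedback for the same error, every time
    VARIED_ONE_LEVEL = 2   # same level of detail, rephrased on repetition
    FIXED_SEQUENTIAL = 3   # another level of detail when the error is repeated


# Placeholder messages (assumed), grouped by level of detail.
MESSAGES = {
    "detailed": ["detailed instruction, phrasing A",
                 "detailed instruction, phrasing B"],
    "augmented": ["augmented instruction"],
}


def respond(strategy: Strategy, repetition: int) -> str:
    """Return the message for the n-th occurrence (0-based) of the same error."""
    detailed = MESSAGES["detailed"]
    if strategy is Strategy.FIXED_ONE_LEVEL:
        # Always the same message, regardless of how often the error recurs.
        return detailed[0]
    if strategy is Strategy.VARIED_ONE_LEVEL:
        # Cycle through rephrasings at the same level of detail.
        return detailed[repetition % len(detailed)]
    # FIXED_SEQUENTIAL: switch to another level once the error is repeated.
    return detailed[0] if repetition == 0 else MESSAGES["augmented"][0]
```

With these placeholders, a repeated error gets an identical message under the fixed
strategy, a rephrased one under the varied strategy, and a message on a different level
under the sequential strategy.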

Justification for choice of presentation format
We consider the above to be the appropriate presentation format at CHI, because
•    The video (not previously published) gives a better introduction to the ARTUR
     project, its goals and components than any other presentation medium could. In
     addition, as the video can be screened continuously, new visitors will be
     able to get a good introduction while the system, and hence the presenters, are
     occupied in the hands-on session. These new visitors could then either wait for their
     turn to experience the system hands-on or join the discussions directly.
•    The hands-on session lets the user experience practicing the pronunciation of words
     in a foreign language (Swedish) with a virtual tutor.
•    We choose to demonstrate a Wizard of Oz version for both technical and academic
     reasons. The technical reason is mundane: state-of-the-art speech recognition is not
     yet able to automatically classify pronunciations of foreign speakers, especially not
     under noisy conditions, as can be expected in the Chamber. Opting for a Wizard of
     Oz version of the system thus ensures a more robust behavior. The academic reason
     is that we would like to demonstrate the task carried out by the human Wizard in
     current system tests and relate this to the issues in creating a fully automatic system.
•    The discussions with the presenter following the hands-on session will make it
     possible to focus on important pedagogical aspects and differences when feedback is
     given by a human, a semi-automatic (as displayed at CHI 2006), or a fully automatic
     tutor.

Description of the ARTUR system
The computer-based speech training system ARTUR (the ARticulation TUtoR) that we
will demonstrate is presently being developed at KTH (Royal Institute of Technology),
Sweden. The goal of ARTUR is to provide hearing- or speech-impaired children or
second language learners with a virtual speech tutor who uses three-dimensional
animations of the face and internal parts of the mouth (tongue, palate, jaw, etc.) to give
instructions and feedback on how to achieve a correct pronunciation.

The users of such a speech training system may have difficulties hearing the differences
between a correct pronunciation and their own, due to a hearing-impairment or because
the distinction does not exist in the mother tongue. The rationale behind the ARTUR
system is to make the user aware of the differences and of how to achieve a better
pronunciation by supplementing the auditory channel with visual information.

The components of the fully functional ARTUR will include adapted speech recognition
to automatically detect mispronunciations, articulatory inversion to recreate the student’s
articulation from the acoustic input to the system, computer vision analysis of the
speaker’s face to assist in the mispronunciation detection and the articulatory inversion,
and a human-computer interface to handle and give feedback instructions.

The system that we propose to demonstrate at CHI 2006 consists of a Wizard of Oz
implementation of ARTUR, where a human judge will replace the automatic handling of
detecting mispronunciations and managing feedback. As stated above, we believe that
this approach has the benefits of being both more robust and more fertile for interesting
discussions on feedback management.

Relevance of the work
Computer-based speech therapy for hearing- or speech-impaired children and
computer-assisted pronunciation training for second language learners have vast
potential. The need for functional, autonomous and automatic speech training with
diagnostic feedback is enormous. A large number of commercial programs and research
systems aim to address this need. However, as pointed out in the Extended Abstract, the
feedback given in all these systems is not as efficient as one would wish. Their main
weaknesses are, firstly, that the feedback is too abstract to be readily interpreted by
a naïve user and, secondly, that the feedback is not adapted to pedagogically suit the
user. Human teachers, on the other hand, use both varied techniques to explain how a
correct articulation should be achieved and adapted feedback in order to promote the
student’s motivation.

The ARTUR system addresses the first point by giving instructions on how the student
should alter the articulation and illustrating important articulatory differences with
computer animations, which means that the feedback that the student gets is readily
interpretable.

In the version of ARTUR that will be shown at CHI 2006 we have addressed the second
point by providing a multi-level type of feedback. The feedback given to the user
depends on his/her previous performance (is it a repetition of an error already made?), the
graveness of the error (is it important enough to be highlighted at this point?), the
confidence of the judge (can the cause of the pronunciation error be diagnosed with
certainty?) and pedagogical issues (would the student benefit from getting feedback
instructions at this point or would it only be demoralizing?), as outlined in Figure 2 in the
Extended Abstract. This kind of fuzzy feedback management is quite novel and unique in
computer-based speech training systems.
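
The four questions above can be read as a small decision procedure. The sketch below is
one possible reading, not the flowchart of Figure 2 in the Extended Abstract: the order
of the checks is an assumption, and the names Feedback, Judgement and choose_feedback
are hypothetical. The feedback categories follow the types listed in Table 1 of the
Extended Abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Feedback(Enum):
    POSITIVE = "positive"
    DETAILED = "detailed correction"
    AUGMENTED = "augmented correction"
    VAGUE = "vague"
    ENCOURAGEMENT = "encouragement"


@dataclass
class Judgement:
    """The judge's assessment of one attempt (field names are assumed)."""
    correct: bool         # was the pronunciation acceptable?
    repeated_error: bool  # is it a repetition of an error already made?
    important: bool       # is the error grave enough to highlight now?
    cause_known: bool     # can the cause be diagnosed with certainty?
    feedback_helps: bool  # would instructions help rather than demoralize?


def choose_feedback(j: Judgement) -> Feedback:
    """Pick a feedback category; the order of the checks is an assumption."""
    if j.correct:
        return Feedback.POSITIVE
    if not j.important or not j.feedback_helps:
        # Suppress the correction and keep the student motivated instead.
        return Feedback.ENCOURAGEMENT
    if not j.cause_known:
        # The judge cannot pinpoint the error, so stay vague.
        return Feedback.VAGUE
    if j.repeated_error:
        # Do not repeat exactly the same instruction for a repeated error.
        return Feedback.AUGMENTED
    return Feedback.DETAILED
```

In this reading, corrections are only given when the error matters, the student would
benefit, and the cause is known; all other cases fall back to vaguer feedback or plain
encouragement.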

Commercial status of the project
The ARTUR project is a research project carried out at the School of Computer Science
and Communication at KTH, Stockholm. It is funded by a grant from the Swedish
Research Council and is purely academic, without any commercial ties at present.

Equipment needed to present the work in the Chamber
We wish the conference organizers to supply our presentation with:
    • An extra computer screen including cable that we could connect to our laptop for
        the hands-on session.
    • A personal computer with loudspeakers and a DVD player or a standard DVD
        player and television monitor to screen the information video (this latter
        equipment point is not essential – if unavailable, the video could be screened
        using our laptop, but this would constitute a bottle-neck in the interactive
We will bring one laptop, a microphone and two headphones (for the user and the wizard)
for the hands-on session.

Description of the presenters
Dr. Olle Bälter, Assistant Professor in Computer Science, KTH
Olle Bälter specializes in human-computer interaction. He has an M.Sc. in Engineering
Physics (1986) and a Ph.D. in Computer Science (Ph.D. thesis “Electronic Mail in a
Working Context”, 1998), both from KTH, Stockholm, Sweden. He has been Assistant
Professor at the School of Computer Science and Communication, KTH, since 2000.

Dr. Olov Engwall, research fellow at the Centre for Speech Technology, KTH
Olov Engwall received his M.Sc. degree in Engineering Physics from KTH in 1998. His
Ph.D. thesis “Tongue Talking - Studies in Intraoral Speech Synthesis” from 2002 focused
on articulatory modeling of the vocal tract. His research at the moment mainly deals with
the ARTUR system and in 2004 he received the Christian Benoît Award for the project.

The ARTUR system will be presented by both the above researchers, allowing one
presenter to instruct and discuss with the participants while the other is performing the
Wizard of Oz task.

Equipment and support needed to present the work at a conference session.
We are content with the Standard Technical Support that the conference offers and have
no further needs (we will give the presentation using our own laptop).
Presentation outline at the technical session at CHI 2006

The presentation at the technical session will be organized as follows:
   • A short presentation of the speaker and the context of computer-based speech
       training – 2 minutes.
   • Screening of the short version of the introductory video explaining the goals and
       relevance of the ARTUR project – 2 minutes.
   • Overview of different feedback strategies that could be employed in a computer-
       based speech training system – 4 minutes.
   • Presentation of the Wizard of Oz set-up of the ARTUR system, including a short
       live demonstration of how different feedback strategies lead to different responses
       from the system to the user’s mispronunciations – 7 minutes.
   • Time for questions, comments and suggestions from the audience – 5 minutes.

The aim of the above outline is first to briefly introduce the research field of
human-computer interaction in computer-assisted pronunciation training and the speaker,
who is active in this field.

The second aim is to give an overview of the system that the rest of the talk and the
Interactivity presentation will focus on. This is best achieved with the information
video, which is
freshly recorded and has not been published previously. The long version of the video (5
minutes, attached) will be shown during the Interactivity presentation in the Chamber,
whereas an edited, two-minute version (with the same overall content, but with the
technical explanations of component functionality removed) will be shown at the
technical presentation.

The overview on feedback strategies will briefly contrast feedback given by human
language teachers with that given in commercial speech training systems and raise the
question to what extent it is possible to incorporate the human feedback strategies in an
automatic computer-based speech training system.

The short demonstration will illustrate how the Wizard of Oz version of the ARTUR
system functions and give a teaser of the different levels and types of feedback that the
system provides, depending on the different feedback strategies that have been
implemented.

Finally, we believe that five minutes for questions is adequate in this context: on the
one hand we are very eager to discuss the human-computer interaction issues involved,
but on the other hand there will be ample opportunity for discussions in the Chamber.
