Cover letter for submission to CHI 2006 Interactivity

Submission title: Feedback management in the pronunciation training system ARTUR
Authors: Olov Engwall, Olle Bälter, Anne-Marie Öster, Hedvig Kjellström; KTH, Stockholm, Sweden.
Contact information:
Olov Engwall [firstname.lastname@example.org]
Centre for Speech Technology
Lindstedtsv. 24
KTH
SE-100 44 Stockholm
SWEDEN
tel. +468 790 75 65
fax. +468 790 78 54

The submission is relevant to the following communities (in order of relevance): Research (the submission presents applied research in the area of human-computer interaction in computer-assisted language learning), Usability (the work is focused on usability tests of different feedback strategies in a language tutoring system), Design (of the multimodal – graphical and spoken – user interface for instructions and feedback), and Engineering (of the organization and implementation of different speech technology and user interface components in a computerized interactive pronunciation training system).

Please note: The video has been compressed in order to reach the 200 Mbytes limit. The image quality of the video that will be presented at the conference will be significantly higher.

Feedback management in the pronunciation training system ARTUR

Olov Engwall
Centre for Speech Technology, KTH
SE-100 44 Stockholm, Sweden
email@example.com

Anne-Marie Öster
Centre for Speech Technology, KTH
SE-100 44 Stockholm, Sweden
firstname.lastname@example.org

Olle Bälter
Interaction and Presentation Laboratory, KTH
SE-100 44 Stockholm, Sweden
email@example.com

Hedvig Kjellström
Interaction and Presentation Laboratory, KTH
SE-100 44 Stockholm, Sweden
firstname.lastname@example.org

Copyright is held by the author/owner(s).
CHI 2006, April 22–27, 2006, Montreal, Canada.
ACM 1-xxxxxxxxxxxxxxxxxx.

Abstract
This extended abstract discusses the feedback given to the user of a computer-assisted pronunciation training system, depending on the level of feedback management that has been implemented.

Keywords
Computer-assisted pronunciation training, virtual tutor, feedback, Wizard of Oz, usability.

ACM Classification Keywords
H5.2. Information interfaces and presentation (e.g., HCI): User interfaces.

Introduction
Learning a language may be very rewarding, but also frustrating, if a hearing impairment makes it difficult to discern important distinctions, or when it is a foreign language with speech sounds (i.e. phonemes) that are unfamiliar from the mother tongue. In order to master the new phonemes, the learner is required to first become perceptually aware of the distinction between the target phoneme and familiar sounds. The learner must then understand how to realize the distinction in their own production. The final step is to achieve automaticity, i.e. to be able to produce the sound without conscious planning.

The task of speech therapists and language teachers is to support this process by detecting pronunciation errors, diagnosing the cause, giving feedback on how to improve the pronunciation and stimulating the student to reach automaticity through repeated training.

Human teachers are very good at detecting errors and can often give pedagogical explanations on how to improve the pronunciation. Classroom teaching does however not permit the large amounts of repeated training needed to achieve automaticity.

Computer-assisted pronunciation training (CAPT) has the benefit that the student may get unlimited amounts of practice, at any time. All the existing commercial or research systems are however still vastly inferior to human teachers. One reason is that their detection and diagnosis of pronunciation errors is not good – and especially not robust – enough. The major problem is however that the pedagogy for giving feedback to the student has fallen behind [4]. Instead of basing the feedback on the pedagogical needs of the student, it has been defined by what is technologically easy to present, such as a pronunciation score or a graphical representation of the acoustic difference between the user’s pronunciation and a correct model. Such feedback is profoundly non-intuitive and difficult to interpret for untrained users. As a consequence, the major breakthrough for CAPT has yet to come.

ARTUR - the ARticulation TUtoR
In order to address the issue of non-intuitive feedback in existing CAPT systems, we are currently developing ARTUR, the ARticulation TUtoR, a virtual tutor who uses three-dimensional animations of the face and internal parts of the mouth to give his students feedback in pronunciation training. The structure and components of this virtual tutor are outlined in Figure 1.

[Figure 1 (diagram not reproduced). Components shown: video image; computer vision; audio input (“Hally Pottel”); relation between facial and vocal tract movements; mispronunciation detection; articulatory inversion; speaker adaptation (model scaling); VT model; feedback display. Annotations in the figure:
• No visual tracking and recognition are performed in the Wizard of Oz tests. Instead video images are stored in an audiovisual database for training of the system.
• These tasks are performed by the human judge in the Wizard of Oz tests.
• The vocal tract (VT) model of the tongue, teeth, palate etc. is generated through a statistical analysis of a Magnetic Resonance Imaging database of a subject producing Swedish vowels and consonants.
• The user interface consists of a feedback display showing audio-visual animations, a window with the word to practice and a set of interaction buttons.]
Figure 1. Overview of the ARTUR system in automatic and Wizard of Oz set-ups.

An overview of ARTUR is also given in the introduction video [1] recorded for CHI 2006.

To test the usability of the system and involve end users at an early stage of the development process, we are conducting Wizard of Oz studies [2], in which a human, phonetically trained judge replaces the automatic detection of mispronunciations and the diagnosis of the cause. The human wizard chooses the feedback given to the student from a set of pre-generated audiovisual instructions on how to improve the articulation. We are now investigating feedback management in the virtual tutor, as one conclusion in [2] was that the set of pre-generated feedback was not optimal for all the mispronunciations that occurred during the training.

Feedback management in CAPT
In the most basic form of pronunciation feedback in CAPT, the user will only get information on whether the pronunciation was correct enough, or on which part of the utterance was most incorrect, without any cues about the problem. In ARTUR the feedback has been extended to include both detailed instructions on how a pronunciation should be corrected and general encouragement. In a previous user study [2], the wizard found that the detailed instructions were inadequate when
• The student repeated the same error several times. It would then be pedagogically unsound to repeat exactly the same feedback.
• The error fell between the defined categories – the pronunciation was not correct, but it was better than in the predefined prototypic mispronunciations.
• The wizard was unable to clearly diagnose which articulation mistake had caused the error.
• The student started to lose motivation, because the virtual tutor’s feedback was too long and detailed.

To solve these issues, we have begun investigating feedback strategies used by human language teachers (e.g. [3]), when they are faced with repeated pronunciation errors or cannot pinpoint what the error was, in order to evaluate which of the strategies could be automated in a CAPT system.

In parallel, we are implementing a multi-level feedback strategy in ARTUR, to be able to give better feedback for more varied pronunciation errors. At CHI 2006 Interactivity we will illustrate the different levels of feedback given by ARTUR in a training task focused on two of the Swedish fricatives: “s” and “sj” (for the latter the constriction is made with the tongue body at the velum, which is uncommon in other languages). Figure 2 shows the feedback loop of a training word. If the wizard deems that corrections are needed, the amount and detail of feedback are adapted to the user’s previous performance, progress and mood, in order to maximize the efficiency of the feedback instructions and avoid demoralizing the student. Examples of such feedback are given in Table 1.

Successful management of feedback is even more important in a fully automatic system, as adequate fallback solutions are needed in cases where the mispronunciation detection or the articulatory inversion fails, which will happen more often with current state-of-the-art speech technology components than with a human judge.

[Figure 2 (flowchart not reproduced). Decision nodes: Correct? Known error? (Yes / Some idea / No) Important? First time? Is more feedback pedagogically sound? Outcomes: Positive feedback; Encouragement 1, 2 and 3; Vague feedback 1 and 2; Detailed correction feedback; Augmented correction feedback; Same word; Next word. Note in the figure: The user probably knows what’s wrong, but needs more practice.]
Figure 2. Flowchart of the feedback management in the ARTUR system. The grey and black boxes show feedback at higher levels. The dashed arrows indicate feedback solutions that can be avoided with the multi-level feedback system.

Table 1. Examples of feedback responses given in the different categories, for the training word “sjal” (scarf).
Type of feedback          Example
Positive                  “Yes, that was really good!”
Detailed, first time      “That sounded more like ‘shal’; try to retract the tongue to get the narrow passage further back.”
Augmented, second time    “The constriction is still too forward. Remember to let the back of the tongue touch the palate.”
Vague 1                   “Not quite. Think about where you place the tongue tip.”
Vague 2                   “Yes, almost. Say it once more: ‘sjal’.”
Encourage 1               “Not bad at all. Let’s try the next word.”
Encourage 2               “Good try! Could you say it again?”
Encourage 3               “It sounds much better now!”

Acknowledgements
The ARTUR project is funded by the Swedish Research Council and the Centre for Speech Technology is supported by VINNOVA (The Swedish Agency for Innovation Systems), KTH and participating Swedish companies and organizations. The ARTUR information video was sponsored by the Christian Benoît award received by the first author.

References
[1] ARTUR information video. Available at http://www.speech.kth.se/multimodal/ARTUR
[2] Bälter, O., Engwall, O., Kjellström, H., Öster, A-M., Wizard-of-Oz Test of ARTUR - a Computer-Based Speech Training System with Articulation Correction. Proc ASSETS 2005.
[3] Morley, J., The Pronunciation Component in Teaching English to Speakers of Other Languages. TESOL Quarterly 25 (1991), 481-520.
[4] Neri, A., Cucchiarini, C., Strik, H. and Boves, L., The pedagogy-technology interface in Computer Assisted Pronunciation Training. Computer Assisted Language Learning 15 (2002), 441-467.

Proposal for Interactivity presentation of the ARTUR system at the Chamber at CHI 2006.

Presentation format:
The presentation of ARTUR at the Chamber would consist of the three parts shown in Figure 1:

[Figure 1 (diagram not reproduced). 1. Computer (or TV+DVD) showing the information video. 2. Hands-on experience: computer screen + cable; Wizard controlling the session. 3. Discussions about the human-computer interface and feedback strategies.]
Figure 1. Overview of the presentation of ARTUR at the Chamber. Boxed text indicates equipment that the conference organizers are kindly asked to provide.

Visitors to the ARTUR station in the Chamber will
1) First be shown the introduction video (5 minutes) that explains the aim, components and research issues of the computer-based speech training system ARTUR.
2) Then experience hands-on training with the system (5-10 minutes), practicing the pronunciation of minimal pairs of Swedish words starting with either ‘s’ (i.e. a sound that is most probably known from their mother tongue) or the distinctively Swedish rounded velar fricative ‘sj’ (i.e. a sound that is probably unknown and absent in their mother tongue). The system will be run in a Wizard of Oz mode (i.e. controlled by a human judge) for reasons outlined in the “Justification for choice of presentation format” section below. Working with a Wizard of Oz version of ARTUR will let the visitors
a. Experience the multimodal instructions and feedback that are unique to the ARTUR system.
b. Experience different types of feedback management that a computer-based speech training system could employ to react to errors in the student’s pronunciation. The different levels of feedback management are described in the Extended Abstract, but could be summarized as:
   • fixed one-level feedback (i.e., the same feedback is always given for the same error),
   • varied one-level feedback for repeated errors (i.e., the level of detail in the feedback remains fixed, but the instructions are rephrased),
   • fixed sequential level for repeated errors (i.e., if an error is repeated, the feedback instructions will be on another level, with more or less detail), and
   • multi-level feedback for judgment insecurities, importance of the error and user mood (i.e., the feedback might be vaguer or even suppressed).
c. Reflect on differences between human tutor feedback, Wizard of Oz system feedback and fully automatic feedback.
3) Finally be given the opportunity to discuss (approximately 5-10 minutes) the ARTUR system in general and the feedback given by the human-computer interface in particular with the presenter(s).
We foresee that the discussions will be centered on the different types of feedback strategies that the user experienced during the hands-on session and the extent to which these could be fully automated in an unsupported computer-based speech training system.

Justification for choice of presentation format
We consider the above to be the appropriate presentation format at CHI, because
• The video (not previously published) gives a better introduction to the ARTUR project, its goals and components than could be made with any other presentation medium. In addition, as the video can be screened continuously, new visitors will be able to get a good introduction while the system, and hence the presenters, are occupied in the hands-on session. These new visitors could then either wait for their turn to experience the system hands-on or join the discussions directly.
• The hands-on session lets the user experience practicing the pronunciation of words in a foreign language (Swedish) with a virtual tutor.
• We choose to demonstrate a Wizard of Oz version for both technical and academic reasons. The technical reason is mundane: state-of-the-art speech recognition is not yet able to automatically classify pronunciations of foreign speakers, especially not under noisy conditions, as can be expected in the Chamber. Opting for a Wizard of Oz version of the system thus ensures a more robust behavior. The academic reason is that we would like to demonstrate the task carried out by the human Wizard in current system tests and relate this to the issues in creating a fully automatic system.
• The discussions with the presenter following the hands-on session will make it possible to focus on important pedagogical aspects and differences when feedback is given by a human, a semi-automatic (as displayed at CHI 2006), or a fully automatic tutor.
Description of the ARTUR system
The computer-based speech training system ARTUR (the ARticulation TUtoR) that we will demonstrate is presently being developed at KTH (Royal Institute of Technology), Sweden. The goal of ARTUR is to provide hearing- or speech-impaired children or second language learners with a virtual speech tutor who uses three-dimensional animations of the face and internal parts of the mouth (tongue, palate, jaw, etc.) to give instructions and feedback on how to achieve a correct pronunciation. The users of such a speech training system may have difficulties hearing the differences between a correct pronunciation and their own, due to a hearing impairment or because the distinction does not exist in the mother tongue. The rationale behind the ARTUR system is to make the user aware of the differences and of how to achieve a better pronunciation by supplementing the auditory channel with visual information.

The components of the fully functional ARTUR will include adapted speech recognition to automatically detect mispronunciations, articulatory inversion to recreate the student’s articulation from the acoustic input to the system, computer vision analysis of the speaker’s face to assist in the mispronunciation detection and the articulatory inversion, and a human-computer interface to handle and give feedback instructions.

The system that we propose to demonstrate at CHI 2006 consists of a Wizard of Oz implementation of ARTUR, in which a human judge replaces the automatic detection of mispronunciations and management of feedback. As stated above, we believe that this approach has the benefits of being both more robust and more conducive to interesting discussions on feedback management.

Relevance of the work
Computer-based speech therapy for hearing- or speech-impaired children and computer-assisted pronunciation training for second language learners have vast potential.
The need for functional, autonomous and automatic speech training with diagnostic feedback is enormous. A large number of commercial programs and research systems aim to address this need. However, as pointed out in the Extended Abstract, the feedback given in all these systems is not as efficient as one would wish. Their main weaknesses are, firstly, that the feedback is too abstract to be readily interpreted by a naïve user and, secondly, that the feedback is not adapted to pedagogically suit the user. Human teachers on the other hand use both varied techniques to explain how a correct articulation should be achieved and adapted feedback in order to promote the student’s motivation.

The ARTUR system addresses the first point by giving instructions on how the student should alter the articulation and illustrating important articulatory differences with computer animations, which means that the feedback that the student gets is readily interpretable. In the version of ARTUR that will be shown at CHI 2006 we have addressed the second point by providing a multi-level type of feedback. The feedback given to the user depends on
• his/her previous performance (is it a repetition of an error already made?),
• the severity of the error (is it important enough to be highlighted at this point?),
• the confidence of the judge (can the cause of the pronunciation error be diagnosed with certainty?), and
• pedagogical issues (would the student benefit from getting feedback instructions at this point or would it only be demoralizing?),
as outlined in Figure 2 in the Extended Abstract. This kind of fuzzy feedback management is quite novel and unique in computer-based speech training systems.

Commercial status of the project
The ARTUR project is a research project carried out at the School of Computer Science and Communication at KTH, Stockholm. It is funded by a grant from the Swedish Research Council and is purely academic, without any commercial ties at present.
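The four factors listed under Relevance of the work (repetition, importance, judge confidence and pedagogical suitability) amount to a small decision procedure. The following Python sketch is only an illustration of that idea and not the ARTUR implementation: the names Judgement, Feedback and select_feedback are hypothetical, and the order in which the questions are asked is an assumption loosely based on the decision nodes of Figure 2 in the Extended Abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Feedback(Enum):
    """Feedback categories, named after the rows of Table 1."""
    POSITIVE = "positive"
    DETAILED = "detailed correction"
    AUGMENTED = "augmented correction"
    VAGUE = "vague feedback"
    ENCOURAGEMENT = "encouragement"


@dataclass
class Judgement:
    """What the wizard (or an automatic judge) decides about one attempt.

    All fields are hypothetical stand-ins for the questions in Figure 2.
    """
    correct: bool           # was the pronunciation acceptable?
    important: bool         # is the error worth highlighting now?
    cause_known: bool       # can the articulation error be diagnosed?
    repeated: bool          # has this error already been corrected before?
    more_feedback_ok: bool  # would further instructions help, not demoralize?


def select_feedback(j: Judgement) -> Feedback:
    """Pick a feedback category; the ordering of checks is an assumption."""
    if j.correct:
        return Feedback.POSITIVE
    if not j.important:
        # Minor error: encourage and move on rather than correct.
        return Feedback.ENCOURAGEMENT
    if not j.cause_known:
        # Uncertain diagnosis: give a vague hint instead of a wrong instruction.
        return Feedback.VAGUE
    if not j.repeated:
        return Feedback.DETAILED
    if j.more_feedback_ok:
        # Repeated error: rephrase and extend instead of repeating verbatim.
        return Feedback.AUGMENTED
    # The user probably knows what is wrong but needs more practice.
    return Feedback.VAGUE
```

For example, an important, diagnosable error that the student repeats would map to the augmented correction as long as further instruction is judged helpful, and otherwise to a vague prompt to try again. A fully automatic system would have to estimate these five judgements itself, which is where the fallback to vague feedback becomes important.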
Equipment needed to present the work in the Chamber
We ask the conference organizers to supply our presentation with:
• An extra computer screen including cable that we could connect to our laptop for the hands-on session.
• A personal computer with loudspeakers and a DVD player or a standard DVD player and television monitor to screen the information video (this latter equipment point is not essential – if unavailable, the video could be screened using our laptop, but this would constitute a bottleneck in the interactive demonstration).
We will bring: one laptop, one microphone and two sets of headphones (for the user and the wizard) for the hands-on session.

Description of the presenters
Dr. Olle Bälter, Assistant Professor in Computer Science, KTH
Olle Bälter specializes in human-computer interaction. He has an M.Sc. in Engineering Physics (1986) and a Ph.D. in Computer Science (Ph.D. thesis “Electronic Mail in a Working Context”, 1998), both from KTH, Stockholm, Sweden. He has been Assistant Professor at the School of Computer Science and Communication, KTH, since 2000.

Dr. Olov Engwall, research fellow at the Centre for Speech Technology, KTH
Olov Engwall received his M.Sc. degree in Engineering Physics from KTH in 1998. His Ph.D. thesis “Tongue Talking - Studies in Intraoral Speech Synthesis” from 2002 focused on articulatory modeling of the vocal tract. His research at the moment mainly deals with the ARTUR system, and in 2004 he received the Christian Benoît Award for the project.

The ARTUR system will be presented by both of the above researchers, allowing one presenter to instruct and discuss with the participants while the other performs the Wizard of Oz task.

Equipment and support needed to present the work at a conference session
We are content with the Standard Technical Support that the conference offers and have no further needs (we will give the presentation using our own laptop).
Presentation outline at the technical session at CHI 2006
The presentation at the technical session will be organized as follows:
• A short presentation of the speaker and the context of computer-based speech training – 2 minutes.
• Screening of the short version of the introductory video explaining the goals and relevance of the ARTUR project – 2 minutes.
• Overview of different feedback strategies that could be employed in a computer-based speech training system – 4 minutes.
• Presentation of the Wizard of Oz set-up of the ARTUR system, including a short live demonstration of how different feedback strategies lead to different responses from the system to the user’s mispronunciations – 7 minutes.
• Time for questions, comments and suggestions from the audience – 5 minutes.

The aim of the above outline is first to briefly introduce the research field of human-computer interaction in computer-assisted pronunciation training and the speaker, who is active in this field. The next step is to give an overview of the system that the rest of the talk and the Interactivity presentation will focus on. This is best achieved with the information video, which is freshly recorded and has not been published previously. The long version of the video (5 minutes, attached) will be shown during the Interactivity presentation in the Chamber, whereas an edited, two-minute version (with the same overall content, but with the technical explanations of component functionality removed) will be shown at the technical presentation.

The overview of feedback strategies will briefly contrast feedback given by human language teachers with that given in commercial speech training systems and raise the question of to what extent it is possible to incorporate the human feedback strategies in an automatic computer-based speech training system.
The short demonstration will illustrate how the Wizard of Oz version of the ARTUR system functions and give a teaser of the different levels and types of feedback that the system provides, depending on the feedback strategies that have been implemented.

Finally, we believe that five minutes for questions is adequate in this context: on the one hand we are very eager to discuss the human-computer interaction issues involved, but on the other hand we will provide ample opportunity for discussions in the Chamber.