Microsoft PowerPoint - ALTW tutorial

posted:
10/29/2008
Dialogue, Dialogue Modeling, and Dialogue Systems
Lawrence Cavedon
National ICT Australia Victoria Research Lab and RMIT University School of Computer Science and IT

Dialogue systems
• Spoken language and multimodal access to information and applications
  – Via the phone or mobile interface
  – On a desktop or computational device
  – In an environment
• Collaboration and collaborative interaction
  – With an intelligent agent or robot
• Why might you be interested in a tutorial on dialogue systems?
  – What challenges does dialogue raise for LT
  – What are the approaches to dialogue modeling
  – How do dialogue systems work
  – How do I build a basic dialogue system
  – What are interesting current research directions in dialogue systems

This tutorial
• Broad overview of issues
  – Will skip most of the LT slides
  – Core issues for robust dialogue-based interaction
  – Computational dialogue models
    • From simple models behind commercial systems
    • To models of dialogue as collaboration on complex tasks
  – Starting point for assembling a dialogue system
  – List of some of the interesting research directions
• A few words about me:
  – Building (practical) dialogue systems in the Center for the Study of Language and Information at Stanford University
  – Generic dialogue manager based on the information-state update approach
  – Various applications, around dialogue with intelligent agents in complex environments
  – Moved to NICTA Victoria and RMIT University in July

Overview
• What is dialogue
  – Why build dialogue systems
  – Applications
  – What’s important in dialogue
• What’s in a dialogue system
  – Architecture
  – Outline of LT components
    • ASR, NLU, NLG, TTS
• Core dialogue modelling issues
  – Adjacency pairs
  – Non sentential units
  – Dialogue strategies

Overview (cont.)
• Dialogue modeling
  – State and frame-based models (incl.
VoiceXML)
  – Information state update model
• Dialogue systems
  – Implementing a dialogue system application
  – Evaluation techniques
  – Multimodal integration
• Advanced topics and current research
  – Adaptation and learning
  – Probabilistic modeling
  – Multi-party dialogue
• References and readings

What is dialogue
• Collaborative interaction via
  – Spoken language: communicating via speech introduces many important issues and challenges of its own, over and above “natural language”
  – Gesture: plays a crucial role in communication, and technology is increasingly available to make use of it
    • Pointing, waving, …
    • Drawing, writing, clicking, …
  – Gaze, grunts, nods, ahms and uhs: backchannels and low-level communicative acts play important roles in effective communication
• The purpose of dialogue is to communicate:
  – Dialogue acts (cf. speech acts): communicative acts affect dialogue partners and set up obligations
• Dialogue is a social phenomenon:
  – Interaction/turn management: who talks next
  – Grounding: shared understanding

Why build dialogue systems
• Understand the nature of dialogue and test theories:
  – What information does a dialogue system need to model?
  – What processes are important?
  – Understanding the collaborative nature of interaction?
  – Can we build systems that people find natural to use?
• Building usable effective interfaces:
  – Technologies to interact with complex systems
  – More efficient access to information and services
  – Enable communication when difficult or not otherwise possible

Applications
• Simple services
  – E.g.
commercial phone-based applications: timetable info or flight-booking
  – Robust dialogue flow; requests with simple responses; very specific task; specific information required for servicing request
  – Examples collected by Robert Dale:
    • Nuance Travel Planner (02)
    • CMU Communicator (01)
• Complex tasks in dynamic environments
  – Interacting with agent
  – Awareness of tasks and activities being performed; agent reports observations; mixed initiative: must understand turn management
• Personal assistant
  – Understand user needs and tasks; personalized to specific user: learn habits, including dialogue style

Applications
• Collaborating with a robot
  – Real-time dynamic uncertain environment; situated language
• Embodied conversational agents
  – Engagement via realistic and affective physical and facial gestures
• Intelligent tutoring
  – Student engagement: does it help learning; student modeling; complex conversational plans; recognizing misconceptions
• Intelligent environments: home or car
  – Understanding user situation and activity; personalization; speech directed at system vs. other humans
• Physical meetings
  – Multi-party dialogue; gesture: pointing, writing and drawing.

Varieties of dialogue
• Initiative: who leads the dialogue?
  – System led:
    • Common for commercial or more simple applications
    • Constrains expected responses; helps reduce mis-recognition
  – User led:
    • Can be more efficient: user is less constrained as to contribution each turn
    • But greater flexibility means potentially more errors
  – Mixed initiative:
    • Attempt to balance natural conversation flow with some constrained

Sample dialogue
• CMU Communicator
  S: Hello. You’ve reached the [Communicator]. Tell me your full name.
  U: hi i’d like to fly to seattle on Tuesday morning
  S: Travelling to Seattle on Tuesday, August eleventh in the morning. Your full name?
  U: john doe
  S: Hello John, I’m glad to hear from you again. You can now ask me for help.
  S: OK. Let’s see.
I have a United flight departing Pittsburgh at ten forty five a.m., arriving into Seattle at one fifty seven p.m. Is that OK?
  U: yes
  S: Will you return to Pittsburgh from Seattle?
  U: no, i will go on to los angeles
  S: Travelling to Los Angeles. On what date?
  U: sorry, can we go back to talk about the previous leg again
  S: For the first leg, you are on a United flight 1446 departing Pittsburgh at ten forty five a.m. tomorrow, arriving in Seattle at one fifty seven p.m. Do you want to take that?
  U: actually i want to leave on Wednesday
  S: a flight on Wednesday
  S: I have a flight departing Pittsburgh at ten forty five a.m., arrives Seattle at one fifty seven p.m. Is that OK?
  U: yes

Sample dialogue
• CSLI Dialogue System for in-car devices
  U: What music do you have?
  S: I have rock, soul … // aggregation
  U: What rock songs?
  S: Just a second: you have to turn left at the next intersection // interruption
  U: OK.
  S: I have songs by U2, Cake, Green Day, Jet, … // return to previous thread
  U: Play something by them. // pronominal reference
  S: OK. Playing xxx. < plays random song by Jet >
  U: Play American Idiot instead.
  S: Do you mean the song or the album? // disambiguation
  U: The album please. // fragment

  U: I need somewhere for lunch.
  S: There is a highly rated vegetarian restaurant near here. // knows preferences
  U: What other cuisines?
  S: There are Italian, Japanese, Malaysian, Nepali, and Thai restaurants close by.

What’s important to dialogue?
• Challenges of speech recognition
  – Disfluencies, restarts, etc: real phenomena that can happen in spoken dialogue, especially when a speaker is focused on some task
  – Misrecognition: speech recognition is a high-error task
    • How can we use dialogue to reduce error and to repair error
  – Real-time processing: language-processing is “left-to-right”
    • How to interpret half-finished utterances
• Context-dependent interpretation
  – Co-constructed meaning: a semantic proposition may be constructed across an adjacency pair
    • “When do you want to fly?” “Monday.”
  – Dialogue-act recognition: recognising the intent of a dialogue contribution
  – Non-sentential fragments: how are one-word utterances interpreted
  – Revisions, corrections, repairs:
    • “When do you want to fly?” “Monday ... I mean Tuesday.”

What’s important to dialogue?
• Dialogue strategies
  – Turn management: determining when the turn is over and who talks next
    • Important in human-system dialogue: we don’t want the system interrupting the human!
    • A real challenge in multi-party dialogue: relies on gaze, gesture, body position
  – Grounding (ensuring we understand each other): the whole process of dialogue is about coming to a shared understanding of each other
    • Grounding is the process by which we indicate that we’ve understood
    • E.g. acknowledgement, repetition, continued conversation
  – We’ll discuss these later
  – Clarification: question to resolve some lack of understanding
    • Repeat some of the information
    • Ask to disambiguate

Overview
• What is dialogue
• What’s in a dialogue system
• Core dialogue modelling issues
• Dialogue modeling
• Dialogue systems
• Advanced topics and current research
• References and readings

What’s a dialogue system
• A system that models, monitors or engages in dialogue
• Dialogue Systems research/engineering as a sub-field of:
  – Human Language Technology
  – Human-Computer Interaction
  – Artificial Intelligence
  – Cognitive Science

Dialogue system architecture
[Figure: architecture diagram. Components: SPEECH RECOGNITION, LANGUAGE UNDERSTANDING, DIALOGUE MANAGEMENT, DIALOGUE CONTEXT, AGENT OR KB, LANGUAGE GENERATION, SPEECH SYNTHESIS. Edge labels: Speech, Words, Interpretation, Dialogue Act, Sentence, Graphics.]

Core issues for dialogue
• Language technology
  – Automatic speech recognition (ASR)
    • Dealing with the error inherent in ASR
  – Language parsing and understanding
    • Recognising dialogue acts
  – Language generation and synthesis
    • Using system output to find out what we want, in a way we understand
• Dialogue as a collaborative process
  –
Strategies for collaboration and shared understanding
• Modeling and using dialogue context
  – Interpreting non-sentential utterances (e.g. fragments)
  – Adjacency pairs and co-constructed meaning

Auto. speech recognition (ASR)
• We can assume the output is an (n-best list of) strings of words
• Speaker dependent (SD) vs. speaker independent (SI)
  – Choice depends on type of application
• Hand-crafted grammar vs. statistical language models
  – Existing corpora
  – Wizard-of-Oz data collection
  – Directly compile unification-grammars

Auto. speech recognition (ASR)
• Input: speech signal
• Output: hypothesis of string of words
• Various different output representations possible, depending on ASR engine, but we’ll assume a string of words with associated (confidence) score for each:
  – What time does the flight leave?
  – Hypotheses:
      1. “What time does the white leaf”    1245.6
      2. “What time does the flight leave”  1250.1
      3. “What time does a flight leave”    1252.3
      4. “What time did the flight leave”   1270.1
      5. “What time did a flight leave”     1272.3
  – Depending on application, distinction between 2-5 may or may not matter; the first one may be discarded as unlikely
    • Later we discuss techniques that use context to bias against this hypothesis

ASR
• Speaker Dependent (SD) vs Speaker Independent (SI)
  – Choice depends on application type

Speaker Dependent
• Trained on a specific speaker
• Lower Word Error Rate (WER)
• Suited to:
  - Personal Assistant
  - E.g.: Dragon

Speaker Independent
• Handles a variety of speakers (within a language/dialect region)
• Suited to:
  - Telephony applications
  - Information kiosk
  - Museum guide
  - E.g.: Sphinx (sourceforge), Nuance

What about:
• Smart home, in-car devices, office-wide assistant, …
• (Relatively) small community of users
• → speaker identification + speaker-dependent

Other ASR parameters
  Speaking modes:          Isolated words to continuous speech
  Speaking styles:         Read speech to spontaneous speech
  Vocabulary size:         <100 words to >20,000 words
  Language model:          State-based to context-sensitive (n-grams)
  Perplexity:              …
  Signal-to-noise ratio:   High (>30 dB) to low (<10 dB)
  Transducer:              Noise-canceling microphone to telephone
(Cole and Zue 98)

Language models
• Language models define the allowable recognisable language, thereby constraining ASR expectations and improving performance
• Grammar-based models
  – Typically hand-crafted
  – Regulus: compiles Nuance Toolkit grammar-based language models from NLU unification-grammar (Rayner et al 2003; see Regulus project in Sourceforge)
• N-gram models
  – Trainable
• Either way, we require data relevant to the task domain
  – Existing corpora (depends on domain)
  – Data-collection
    • E.g. using Wizard-of-Oz

Grammar definition
• Nuance Grammar Specification Language (GSL)
  – Fill slot-based data-structure from recognised words (we’ll use this later)

    Airports [
      [ (san francisco) (s f o) (san francisco california) ] { return("sfo") }
      [ (sydney) (sydney australia) (s y d) … ] { return("syd") }
      …
    ]

    ( ?( i ([ want need ] to) [ go fly ] )
      [ ( from Airports:x ) { }
        ( to Airports:y ) { }
        ( from Airports:x to Airports:y ) { }
        ( to Airports:y from Airports:x ) { }
      ])

ASR issues for dialogue
• Dealing with error:
  – Restricting and predicting user contribution
  – Dialogue strategies for grounding, clarification and repair
• Open-mic and barge-in: natural spoken conversation doesn’t require pushing a button, or waiting for the other party to finish, before talking
  – Introduces technical issues, such as background noise being interpreted as barge-in
• Speaker-identification, speaker-segmentation
• ASR: source of most problems!
  – Dealing with them is a rich source of research

Language understanding
• Simple keyword-based rules
• Unification grammars
  – Rich semantic representations
  – Some dialogue-specific robustness features
• Statistical parsers
  – Can be trained on dialogue corpus
  – Weaker semantic representation

Language understanding
• Need to extract some semantic-level meaning from ASR output
  – Simple techniques include frame-filling from language-model grammars (example later)
  – Domain-dependent CFG-based semantic analyzers for constructing frame-based representation
    • Pros: easy to construct; well-known how to process; fairly robust
    • Cons: hand-coded means labour-intensive; domain-specific
  – Robust unification-grammar-based NLU
    • E.g. SRI’s GEMINI has been enhanced for noisy input
      – Robust parsing: maximal fragments, word skipping, probabilistic fragment combining, detect/repair errors
      – Can also be used to generate surface-level output
      – Grammar can be compiled directly to Nuance GSL
    • Cons: labour-intensive to produce grammar
  – Broad-coverage statistical parsers
    • Pros: more robust; trainable from data
    • Cons: typically not trained for dialogue; weak semantic rep (cf. Bos et al.
2004)

Simple grammar example
• A simple option is a keyword-based semantic grammar (Jurafsky and Martin):
  – Effectively equivalent to formalisms used in VoiceXML

    SHOW         →  show me | i want | can i see
    FLIGHTS      →  flights | (a) flight | (to) fly
    DEPART_DATE  →  today | tomorrow | on DAY
    DEPART_TIME  →  (after|around|before) HOUR | morning | afternoon | evening
    DAY          →  monday | … | DATE
    HOUR         →  one | two | three | … | twelve (AMPM)
    AMPM         →  am | pm
    ORIGIN       →  from CITY
    DESTINATION  →  to CITY
    CITY         →  Melbourne | Sydney | … | San Francisco | …

    SHOW     FLIGHTS   ORIGIN           DESTINATION   DEPART_DATE   DEPART_TIME
    I want   to fly    from melbourne   to sydney     on tuesday    morning

NLU issues for dialogue
• Choice of NLU component depends on focus:
  – Narrow application-specific coverage may be OK, and don’t want to invest major resources into complex NLU
  – May not be interested in nuanced semantics
  – Trained systems require large corpus, which may have to be collected
• Issues in NLU for dialogue
  – Disambiguation: using features from dialogue context to choose between multiple possible interpretations or proposed parses
  – Robustness in the face of disfluencies, restarts, repeats, etc
  – Partial understanding and clarification: recognising specific parts of an utterance that need to be clarified
  – Incremental interpretation: speech is processed left-to-right; a half-formed utterance still carries information (Core 1997)
  – Dialogue-act recognition

Dialogue acts
• Dialogue act type:
  – COMMAND, QUESTION, ANSWER, etc
• Purpose:
  – Recognising intent behind an utterance:
    • “Can you tell me what artists you have?”
    • “I want to hear something by Coldplay.”
  – Recognising conversational purpose of utterance:
    • “Hmm …”; “Uh huh”; “Okayyy …”
  – Matching non-adjacent adjacency pairs; used by dialogue manager to update state (see later)

Classifying dialogue acts
• Dialogue act tag-set typically designed for domain, to classify the contributions that occur in that task domain
  – DAMSL (Dialogue Act Markup in Several Layers) (Allen and Core 97)
    •
Domain-independent tag-set for task-oriented dialogue
    • Considers both forward- and backward-looking aspects of utterance (utterance may be labeled by one or more of each type of tag: see over)
  – ICSI meeting corpus tag-set (Stolcke et al. 2001)
  – Verbmobil tags for meeting-scheduling dialogues
  – Conversation act types (Traum and Hinkelman 92): hierarchical classification of dialogue act types

    turn-taking:        take-turn, keep-turn, release-turn, assign-turn
    grounding:          acknowledge, repair, continue
    core speech acts:   inform, wh-question, accept, request, offer, …
    argumentation:      elaborate, summarize, question-answer, clarify

DAMSL dialogue act tag-set
• Forward-looking
  – STATEMENT
  – INFO-REQUEST
    • CHECK
  – INFLUENCE-ON-ADDRESSEE
    • OPEN-OPTION
    • ACTION-DIRECTIVE
  – INFLUENCE-ON-SPEAKER
    • OFFER
    • COMMIT
  – CONVENTIONAL
    • OPENING
    • CLOSING
    • THANKING
• Backward-looking (relationship to previous utterance)
  – AGREEMENT
    • ACCEPT
    • ACCEPT-PART
    • NEUTRAL
    • REJECT-PART
    • REJECT
    • HOLD
  – ANSWER
  – UNDERSTANDING
    • SIGNAL-NON-UNDERSTANDING
    • SIGNAL-UNDERSTANDING
      – ACK
      – REPEAT-REPHRASE
      – COMPLETION

Dialogue act recognition
• Dialogue acts are often easily recognised using purely lexical information
  – NLU / parser can often classify dialogue act directly
• Other informative features include:
  – Collocations: e.g. please indicates a request
  – Prosody: rising intonation indicates a question
  – Conversational structure:
    • “OK” following a PROPOSAL indicates AGREEMENT
    • “OK” following a STATEMENT indicates ACKNOWLEDGE
    • “OK” following a REQUEST indicates ACCEPT
• (Stolcke et al.
2001) describe classifiers for the ICSI meeting corpus based on features such as those above
  – Also results on priming ASR models

Priming ASR with dialogue acts
• (Stolcke et al 2001)
  – ASR language models conditioned by n-gram of preceding dialogue acts
  – Slight reduction (2.2%) in WER
  – Better results on task-oriented dialogue
• (Lemon and Gruenstein 2004):
  – ASR language models partitioned by
    • (context-specific)-model defined for each dialogue act-type
    • E.g. language model associated with QUESTION is set of possible ANSWERs
  – Try preceding dialogue act’s language model; back-off if unsuccessful
  – 87.9% of recognized utterances recognized by context-specific language models (= faster recognition)
  – 11.5% overall reduction in WER

Other dialogue acts
• Corrections: user following up to correct a misrecognition
  – User: “I want to fly from San Fran to Baltimore”
    System: “When do you want to fly from San Francisco to Bolton?”
    User: “Baltimore”
  – Can be harder to recognize, detect, and interpret than usual dialogue acts
• Revisions: user follows up their own utterance with a correction
  – User: “I want to fly on the 12th. No, the 13th.”
  – Also difficult to recognize. Some lexical cues can help:
    • “I mean …”
    • “Make that …”
• Modifications to previous contribution
  – User: “What vegetarian Indonesian restaurants are there around here?”
  – System: “I’m sorry, I couldn’t find any …”
  – User: “What about other Asian” (vegetarian, around here)
  – Difficult to distinguish between continuation versus new query
• Research question: can machine learning be effective on these classification tasks

Semantic representations
• Many possibilities: see (Bos and Blackburn 2005). Choice depends on complexity of application, dialogue model, etc.
  – Simple frame-based (cf. OO objects)
  – Logical representations
  – Feature structures
  – Discourse Representation Structures (Bos et al.
2004)
• Also need to represent dialogue-specific fields
• “I need to book a flight from Melbourne to San Francisco ….”

    ACT_TYPE:  request
    SPEAKER:   user
    HEARER:    sys
    CONTENT:   SYN: (parse of “I need to book a flight …”)
               SEM: (semantic representation of SYN)

Language generation / synthesis
[Figure: pipeline of Content planner, Sentence planner, Surface realizer, Prosody assignment, Speech synthesizer, annotated “What to say” and “How to say it”.]
• Also When to say it
  – User’s situation
  – Currency of information
• (Cohen et al. 2004): NLG/prompt design for usability

Language generation
• What to say
  – What is the information to communicate
  – Dialogue manager decides dialogue act to generate
• How to say it
  – Planning multiple utterances and rhetorical relations
  – What words to use
  – Which words to emphasize
[Figure: pipeline of Content planner, Sentence planner, Surface realizer, Prosody assignment, Speech synthesizer, annotated “What to say” and “How to say it”.]

Dialogue considerations for NLG
• Dialogue occurs in real-time: is my output still relevant?
  – User may have barged in and changed the subject
  – World may have changed: report no longer relevant
  – New report arrives: can it be aggregated with previous utterance
    • “I see a blue car” “… and a red one as well.”
• Other consideration: When to say it?
  – Is now a bad time to interrupt human?
• Personalization of output:
  – Based on user history
    • E.g. personalized navigation directions
  – Based on dialogue history
    • E.g. try not to be repetitive
  – Based on user type
    • E.g. novice vs expert user

Extended NLG architecture
[Figure: extended NLG architecture. Components: User Model, Dialogue Manager, Dialogue Context, Content Planner, Sentence Planner, Surface Realizer, Situation Model, Agent (status of activities, status of environment). Labels: Input representation, Selected Input, Sentence Plan, Modalities, Behaviour.]

NLG design and usability issues
• Many issues in output design: (Cohen et al 2004) discuss prompt design for VoiceXML systems.
• System-generated output is an important opportunity to reduce ASR error:
  – Narrowing possible user contribution:
    • “Did you say Baltimore or Boston?” – reduces to previous ASR problem
    • “I think you said Baltimore. Is that right?” – much easier: only yes or no
  – Priming: dialogue participants re-use each others’ words (Bock 86)
    • Use in-ASR-vocabulary words whenever possible
    • E.g. “When would you like to leave?” vs “When would you like to depart?”: don’t use the latter if “depart” won’t be recognized …
• Giving user feedback on progress helps keep them engaged:
  – “I’m going to need the origin and destination cities and travel dates. First, what city are you flying from?” … “And where are you flying to?” … “Finally, …”

NLG design and usability issues
• Use variations if repeating outputs. E.g., tapered prompts (i.e. incrementally shorter) on asking for multiple items (Cohen et al. 04):
  – “What’s the name of the first company to add to your watch list?” “Cisco” “What’s the next company name?” … “Next one?”
  – Similarly on repeated mis-recognition: “I’m sorry, I didn’t understand that, could you please tell me the destination again” … “Sorry, I still didn’t get that. Please try again.”
• Human memory issues:
  – Last mentioned words are most salient: use this for suggesting responses
    • “To hear the list again say ‘Repeat list’” vs “Say ‘Repeat list’ to hear the list again”
  – Long lists should be aggregated:
    • “What songs are there in the collection?” “I have rock music, soul, and classical” vs “I have the following songs: …”

Speech synthesis
• Text-to-speech (TTS) systems widely available, and include powerful features such as prosody-markup for stress:
  – Stress is useful for contrastive purposes:
    • “The CHEAPEST flights are the following: …” “The flights with the FEWEST CONNECTIONS are the following: …”
  – E.g.
Festival and related TTS systems
• For more natural output, use recorded voices
  – Systems such as Festvox allow you to record speech segments and templates, which are then automatically aggregated, with template holes being filled as appropriate

Overview
• What is dialogue
• What’s in a dialogue system
• Core dialogue modelling issues
• Dialogue modeling
• Dialogue systems
• Advanced topics and current research
• References and readings

Dialogue strategies
• Dialogue is not only a process that supports collaboration but also a collaborative activity itself (Clark 96)
  – Dialogue participants work to achieve shared understanding
• Strategies for collaborative understanding:
  – Meaning co-construction and adjacency pairs
  – Grounding
    • Acknowledge achievement of common ground
  – Clarification questions
    • Check understanding
    • Fill in missing information
    • Disambiguation
  – Turn management
    • Is it my turn to speak? Who speaks next?
  – Grounding and clarification are also important for combating ASR error

Turn management
• Dialogue tends to be characterised by turn-taking. But …
• Not all dialogue is turn-by-turn:
  – Participants co-construct contributions, complete each others’ sentences
  – Participants interrupt/barge in to another’s contribution
• Understanding turn management ranges from the simple …
  – System knows it’s not its turn to talk …
  – Prodding the user after some time-out period
• … to the complex
  – Knowing if a user is still “holding the floor”
    • Filler (“err …”), gesture, gaze …
  – Multi-party
    • Am I being spoken to?
    • Who exactly is being addressed? Me? That guy? Everyone?
    • Addressivity clues include: names, gaze, gesture, body position, context

Adjacency pairs
• A round of turns tends to consist of an adjacency pair:
  – A two-part structure of matching dialogue acts:
    • QUESTION-ANSWER; STATEMENT-ACKNOWLEDGEMENT; REQUEST-ACCEPTANCE; PROPOSAL-COUNTERPROPOSAL, etc.
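The two-part structure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `match_adjacency_pairs` is hypothetical, turns are assumed to arrive already tagged with a dialogue act, and pairing each response with the most recent open first part from the other speaker is one simple strategy, not a method the tutorial prescribes.

```python
# Expected second part for each first-part dialogue act,
# taken from the pairs listed on the slide.
EXPECTED_RESPONSE = {
    "QUESTION": "ANSWER",
    "STATEMENT": "ACKNOWLEDGEMENT",
    "REQUEST": "ACCEPTANCE",
    "PROPOSAL": "COUNTERPROPOSAL",
}


def match_adjacency_pairs(turns):
    """Pair up dialogue acts.

    turns: list of (speaker, act) tuples in dialogue order.
    Returns a list of (first_index, second_index) tuples.
    """
    open_firsts = []  # indices of first parts still awaiting a second part
    pairs = []
    for i, (speaker, act) in enumerate(turns):
        # Try to close the most recent open first part from another speaker.
        for j in reversed(open_firsts):
            first_speaker, first_act = turns[j]
            if first_speaker != speaker and EXPECTED_RESPONSE[first_act] == act:
                pairs.append((j, i))
                open_firsts.remove(j)
                break
        else:
            # Not a response to anything open: it may open a new pair.
            if act in EXPECTED_RESPONSE:
                open_firsts.append(i)
    return pairs
```

Because open first parts are searched from most recent to oldest, a question answered by a counter-question is handled: the inner question is closed first, and the outer one stays open until its own answer arrives.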
  – We use dialogue act classification to classify each part of an adjacency pair
  – An adjacency pair may not be adjacent:
    • A: “Hi.”                                          GREETING
      A: “How are you?”                                 QUESTION
      B: “How’s it going.”                              GREETING
      B: “I’m pretty good.”                             ANSWER
    • A: “Is John going to be at the party?”            QUESTION
      B: “You mean the one at Jenny’s place?”           QUESTION
      A: “Yeah.”                                        ANSWER
      B: “Yeah, I think so.”                            ANSWER
• In these cases, dialogue-act classification becomes important in identifying matching adjacency pairs

Grounding
• Grounding: process of establishing (and demonstrating) shared understanding between dialogue participants
  – Common ground: the shared/mutual understanding/beliefs (see Clark 96)
• Different strategies for grounding, some more intrusive than others:
  – Continued attention
  – Next contribution is relevant
  – Acknowledgement (includes backchannels, such as uh huh or hmmm)
  – Demonstration, e.g. by reformulating speaker’s utterance
  – Display: displaying verbatim all or part of utterance
• Critical for verification of noisy input in dialogue systems

Grounding illustrated
Conversation between human client and human travel agent:
  C: I need to travel in May.
  A: And what day in May did you want to travel?
  C: OK, uh I need to be there for a meeting on the 12th
  A: And you’re flying into what city? (relevant next contribution)
  C: Seattle.
  A: And what time would you like to leave Pittsburgh? (relevant next contribution)
  C: Uh hmm, I don’t think there’s many options for non-stop
  A: Right. There’s three non-stops today.
  C: What are they?
  A: The first one departs at … [etc]
  C: OK, I’ll take the 5ish flight on the night before on the 11th.
  A: On the 11th? OK. Departing at 5.55pm arrives at Seattle at 8pm, US Air 115.
  C: OK

Grounding in dialogue systems
• Grounding is particularly important in dialogue systems, due to unreliability of ASR:
  – Gives the hearer the opportunity to notice that she has been misunderstood
  – Evidence shows that lack of grounding can be confusing and lead to errors
• But: too much grounding can become tedious:
  – Vary level of grounding strategy with ASR confidence
    • Low confidence: use Display or Demonstration, or ask for confirmation
    • Higher confidence: use short Acknowledgement or Relevant Contribution
    • if (ASR-confidence < 0.5) then
          generate(“I think you said Sydney. Is that correct?”)
      elseif (ASR-confidence < 0.8) then
          generate(“Sydney. And when did you want to leave?”)
      else
          generate(“And when did you want to leave?”)
  – Current research: reinforcement learning for adaptive strategies

Co-constructed meaning
• Many of the propositions in the common ground or set of shared beliefs are co-constructed across an adjacency pair:
  – E.g. QUESTION-ANSWER pair constructs a proposition
      A: “You’re flying into what city?”
      C: “Seattle”
      Destination(Seattle)
• Fragments such as the short answer by C are common in dialogue and a central issue for dialogue modeling:
  – One of the specific tasks of modeling dialogue context is to interpret fragments
  – Fragments come in many forms, but we’ll focus on short answers
    • See (Schlangen 2003) for a thorough classification and treatment

Clarification requests
• Clarification questions: important tool in the arsenal of strategies for obtaining shared understanding
• Come in a wide variety of forms (see taxonomy in (Purver 2004))
• Have many possible purposes:
  – User: “I want to fly from San Francisco to Melbourne”
  – Unrecognised utterance: “What did you say?”
  – Misrecognised or unknown word: “To where?”
  – Reference disambiguation: “Melbourne Australia or Melbourne Florida?”
  – Semantic misrecognition: “You want to do what?”

Clarification requests
• Issues for clarification in dialogue systems (Purver 2004):
  – Designing
clarification requests so they are suitably interpreted and responded to by user
  – How (and when) should we expect users to respond to these
  – How can clarification requests fit smoothly into general dialogue
• Design to increase likelihood of grounding on follow-up answer:
  – “Where did you say you want to fly to?” “Did you say you want to fly to Melbourne?”
  – “Which Melbourne?” “Melbourne Australia or Melbourne Florida?” “In Australia or in Florida?”

Clarifications from users
• A converse (and more difficult) challenge is responding appropriately to clarification requests from users (Purver 2004):
  – System: “Would you like to travel via Narita or LA?”
    User: “Narita?”
    System: (a) “Yes, Narita.”
            (b) “Narita, Japan.”
            (c) “JAL flies via Narita; it’s cheaper but longer.”
            (d) “OK. I’ve booked you to fly via Narita …”

Overview
• What is dialogue
• What’s in a dialogue system
• Core dialogue modelling issues
• Dialogue modeling
• Dialogue systems
• Advanced topics and current research
• References and readings

Dialogue modelling and management
• Purpose of dialogue modelling and management:
  – How does an utterance change the state of the dialogue?
  – Given the current state of the dialogue, what should the system do next?
• We will consider the following modelling frameworks:
  – State-based
  – Frame-based
  – Plan-based
  – Information-state update
    • Combines state/frame-based and plan-based models

Dialogue modelling frameworks
(rows ordered by task complexity, from least complex to most complex)
  technique used | example task | dialogue phenomena handled
  finite-state script | long-distance dialing | user answers questions
  frame-based | getting train timetable info | user asks questions, simple clarifications by system
  sets of contexts | travel booking agent | shifts between predetermined topics
  plan-based models | kitchen design consultant | dynamically generated topic structures, collaborative negotiation subdialogues
  agent-based models | disaster relief management | different modalities (e.g. planned world and actual world)
(Allen et al 01)

Dialogue frameworks properties
State-based
  Input: Single words or phrases
  Verification: Explicit confirmation, either of each input or at end of transaction
  Dialogue model: Information state represented implicitly in dialogue state. Dialogue control represented explicitly with state diagram
  User model: Simple model of user characteristics and preferences
Frame-based
  Input: Natural language with concept spotting
  Verification: Explicit and implicit confirmation
  Dialogue model: Explicit representation of information states. Dialogue control represented with control algorithm
  User model: Simple model of user characteristics and preferences
Agent/plan-based
  Input: Unrestricted natural language
  Verification: Grounding strategies
  Dialogue model: Model of system’s intentions/obligations, goals, and beliefs. Dialogue history, context
  User model: Model of user’s intentions, goals, and beliefs
(McTear 2002)

State-based dialogue models
[Figure: state diagram for a flight-booking dialogue. States: “What city would you like to leave from?”; “What city are you flying to?”; “On what date are you travelling?”; …; “Is this a one-way trip?”; “What date do you want to return?”; “So I have you going from ___ to ___ on ___. Correct?” (two confirmation states); “Book flight”. Transitions are labelled Yes / No.]

State-based dialogue models
• Highly rigid dialogue flow
  – System prompts followed by simple user response
• Dialogue context is (implicitly) defined by position in state-diagram
  – Incoming utterance interpreted wrt state
    • E.g. ANSWER to question fills variable associated with that state
  – No diversion from predefined flow
• Can support a variety of telephony-based applications but simple system-led initiative makes for very tedious conversations
  – Generally only one piece of information at a time …
  – See example on next slide: note the Clarification Questions would need to be explicitly added to the state diagram!

Example: flight booking
  Sys: What city would you like to leave from?
  User: From San Francisco.
  Sys: What city are you flying to?
  User: To San Diego.
Sys: Please say the destination city; for example, Atlanta Georgia.
User: San Diego.
Sys: On what date are you travelling?
User: Uh, on May the 28th.
Sys: Around what time would you like to leave?
User: Uh, 8am. [only handles direct response]
Sys: Is this a one-way trip?
User: Yes
Sys: So I have you going from San Francisco to San Diego on May 28 at 8am. Correct?
User: Yes, that’s fine.

Frame-based dialogue
• Frame-based dialogue addresses the one-issue-at-a-time problem of the state-based approach
  – Allows (limited) mixed initiative: system or user can drive dialogue
  – User can supply multiple pieces of information at once, in any order
    “I want to fly from San Francisco to Sydney on May 28”
• Frame: slot-based knowledge representation scheme
  – Dialogue purpose is to fill slots of the frame
    • Frame used to query database or sent for execution when all slots filled
  – Sub-dialogue is associated with each slot to be filled
    • Sub-dialogue does not need to be executed if slot is already filled
  – Top-level dialogue model ensures all slots are filled

Form-filling dialogue
• “What city are you flying from?”
  “I want to fly from San Fran to Sydney”
  “What date do you want to leave?” …
• “I want to fly to Sydney on July 10 at 10.30pm”
  “Is this a one-way flight?” …

Input utterance                           From   To       Date      Time
I want to fly from SFO to Sydney          SFO    Sydney
… fly to Sydney on July 10 at 10.30pm            Sydney   July 10   10.30pm

VoiceXML
• VoiceXML systems: archetypal frame/form-based dialogue systems
  – W3C standard for building simple spoken language dialogue systems
  – Has basic built-in programming constructs; interfaces with JavaScript
  – Range of commercial platforms that support VoiceXML standard
  – Basis of most commercial telephone-based dialogue systems
• Properties of VoiceXML scripts:

VoiceXML form-filling dialogues
• Each field to be filled has a sub-dialogue associated with it
    Where do you want to fly from?
    Please say the name of the city you want to fly from.
    … some logic to check field …

Grammar definition
Airports [
  [ (san francisco) (s f o) (san francisco california) ] { return(“sfo”) }
  [ (sydney) (sydney australia) (s y d) … ] { return (“syd”) }
  …
]
( ?( i ([ want need ] to) [ go fly ] )
  [ ( from Airports:x ) { }
    ( to Airports:y) { }
    ( from Airports:x to Airports:y ) { }
    ( to Airports:y from Airports:x ) { }
  ])

Other VoiceXML fields
• VXML has a (fairly) rich collection of programming constructs for defining dialogue systems
  – Prompt definition, including retry prompts
  – State-based dialogue control flow
  – Submit commands to server
  – Exception handling
  – Some logic

Implementing dialogue strategies
• Turn-taking: mainly determined by state-transitions
  – Mixed-initiative allows for some barge-in contributions that fit flow
• Grounding: strategies for grounding must be hard-coded
• Co-constructed meaning and handling fragments:
  – State-specific expectations and associated grammar
  – Linguistic template in state-specific grammar
• Clarification questions: hard-coded question
  – For misrecognition, typically re-prompts as per previous (may be restated)

Limitations of state/VXML systems
• Rigid models of dialogue-flow
  – Designed for specific task / goal
• Simple language processing
  – Grammars hardly reusable across domains
• Very simplistic semantic structures
• Template-based output
• Most dialogue issues left to programmer

Plan-based dialogue models
• Based on the “language as action” metaphor
  – Need to recognize goal/intent of user utterance action
  – Task to be performed has subgoals or preconditions
    • E.g.
      knowing the destination city or date of travel
  – Actions for satisfying those subgoals/preconditions include dialogue acts
  – Effects of dialogue acts include changing state of hearer:
    • New beliefs
    • Adopting new goals
• Advantages:
  – Tight integration between task performance and dialogue interaction
  – Complex dialogue strategies can be implemented as generic operations
  – Task-dependent dialogue strategies can be added
  – Powerful conceptual model (Cohen and Perrault 79; Allen and Perrault 80)
• TRIPS (Allen et al. 01): archetypal plan-based dialogue system

Simplified example
• Sample plans for flight-booking and basic dialogue actions

request-info(Info):
  Goal:    Bel(Spkr, Info)
  Precond: ~Bel(Spkr, Info)
  Body:    Say(request(Info))
  Effect:  Bel(Hearer, (Desire(Spkr, Info)))

inform(Hearer, Info):
  Goal:    Believe(Hearer, Info)
  Body:    Say(assertion(Info))
  Effect:  Bel(Hearer, Info)

book-flight(Client,F):
  Goal:    Booked(Client, F)
  Precond: Bel(origin(F)=O), Bel(dest(F)=D), Bel(depart-date(F)=P), Bel(return-date(F)=R), …
  Body:    assert(Booked(Client, F)), inform(Client, Booked(…))
  Effects: Booked(Client, F)

Dialogue plans: simple example
• Client: I want to book a flight from Melbourne to San Francisco
  – assert: Goal(Booked(Client, F)), Bel(origin(F)=Mel), Bel(dest(F)=SF)
• Execute plan book-flight
  – need to satisfy all Bel() preconditions before executing plan
  – asserts Goal(Bel(depart-date(F)=P)) and Goal(Bel(return-date(F)=R))
  – each of these triggers execution of a request-info plan
  – first perform action: Say(request(depart-date(F)=?))
• System: What date do you want to leave Melbourne?
• Client: I want to leave on June 13 and return on July 6
  – asserts Bel(depart-date(F)=june13) and Bel(return-date(F)=july6)
• Preconditions of book-flight now satisfied: execute body of plan
  – perform flight booking
  – perform inform action
• System: I have booked a flight from Melbourne to San Francisco, leaving on June 13 and returning on July 6

Obligations
• Successful cooperative action involves obligations; obligation leads to action (on other’s behalf):
  – Utterance from user → system obligation to respond (and ground)
  – REQUEST from user → system obligation to perform task
  – QUESTION from user → system obligation to provide ANSWER
  – Misunderstood user utterance → system obligation to clarify
  – Observation → system obligation to notify user
• Obligations are prioritized and processed
  – Leading often to communicative action by system

Information-state update models
• Combines features of state/frame-based and plan-based models
• Logic-based formal model of dialogue management in the computational semantics tradition
• Actually an abstract description of a family of dialogue models containing:
  – Information State: roughly akin to dialogue context or “mental state”
  – Interpretation Engine: interprets dialogue acts as dialogue moves
  – Dialogue Move Engine: interprets dialogue move and executes operations
  – Update Rules: update Information State as dialogue acts are interpreted

Information state
• Information state (Larsson and Traum 2000):
  – Information that distinguishes this dialogue
  – Represents cumulative additions from previous dialogue activity
  – Motivates future dialogue actions
• Dialogue moves:
  – Triggered by dialogue contribution, performs specified operations
    • May be one-to-one mapping to Dialogue Acts but not necessarily
  – Operations may include:
    • Trigger information state update (via Update Rule)
    • Send command to Behavioural Agent (example Input Rule)
    • Send dialogue act for output (example Output Rule)
• Information state updates:
  – Update information
    state
    • Declarative assertions may add a proposition
    • Answer may resolve question on Question stack, and create proposition
    • Question is added to Question stack (and creates obligation to answer)

Information state
• What goes into the information state?
  – Depends on instantiation of abstract model
  – E.g. GoDiS partitions Information State into Private and Public (Shared) portions
    • Private beliefs, dialogue plan of action, system agenda of obligations
    • Shared beliefs, Questions Under Discussion (Ginzburg 1996), last move
• Questions Under Discussion (QUD) is a specific formal framework for modeling aspects of dialogue context, used in resolving fragments and mixed initiative:
  – The system attempts to resolve incoming fragments against the issue on top of the QUD stack

Information state representation
PRIVATE:
  AGENDA:  system obligations (queue)
  PLAN:    system-initiated dialogue moves (stack)
  BELIEFS: info known to system (set)
SHARED:
  MBEL:    info known to user and system (set)
  QUD:     questions under discussion (stack)
  LAST MOVE:
    SPEAKER: User or System
    MOVE:    move

Information-state architecture
[Figure: incoming Dialogue Acts are mapped to Dialogue Moves, which drive Update Rules over the INFO STATE (PRIVATE: AGENDA, PLAN, BELIEFS; SHARED: MBEL, QUD, LAST MOVE). Generation Rules map the INFO STATE to outgoing Dialogue Moves and Dialogue Acts. PLANS are a further input.]

Update rules
• Update rules act on the Information State
  – Generate output given certain IS conditions
  – Update IS based on user contribution

IntegrateUserQuestion:
  IF   SHARED.LASTMOVE.SPEAKER = user,
       SHARED.LASTMOVE.MOVE = ask(Q)
  THEN push(SHARED.QUD, Q),
       push(PRIVATE.AGENDA, respond(Q))

IntegrateUserAnswer:
  IF   SHARED.LASTMOVE.SPEAKER = user,
       SHARED.QUD.first = Q,
       relevant_answer(Q, R),
       not( MBEL |= R)
  THEN NLGenerate(respond(Q, R)),
       pop(SHARED.QUD, Q),
       add(SHARED.MBEL, R)

Update rules
• The AccomodateQuestion update rule allows a question that has not yet been asked (it is still in PRIVATE.PLAN) to be resolved by a user input (cf.
  frame-based mixed initiative)

AccomodateQuestion:
  IF   SHARED.LASTMOVE.SPEAKER = user,
       SHARED.LASTMOVE.MOVE = answer(A),
       PRIVATE.PLAN contains ask(Q),
       relevant_answer(Q, A)
  THEN delete(PRIVATE.PLAN, ask(Q)),
       push(SHARED.QUD, Q)

• Once Q is on the QUD the IntegrateUserAnswer rule can be applied

Simplified GoDiS process
• System-initiated dialogue:
  – Get dialogue plan (list of dialogue moves)
  – If next dialogue act not resolved
    • Send it to NL Generator
    • Question is resolved if answer already in Beliefs
• User-initiated dialogue:
  – Accept dialogue move
  – If move resolves a Question in Plan
    • Push Question onto QUD
  – If move resolves current top(QUD)
    • Assert Belief (constructed from Question + Answer)
    • Pop QUD
  – If a Question
    • Look for info to resolve it

Example
• User: “I want to book a flight to Sydney please”
  – Add plan for book-flight to PLAN
  – Find plan to handle request (later slide)
  – Add dest-city(sydney) to MBEL

PRIVATE:
  AGENDA:  []
  PLAN:    RAISE(?O.origin-city(O))
           RAISE(?T.dest-city(T))
           RAISE(?D.date(D))
           RESPOND(?P.price(P))
  BELIEFS: []
SHARED:
  MBEL:    [ dest-city(sydney) ]
  QUD:     []
  LAST MOVE:
    SPEAKER: User
    MOVE:    REQUEST(book-flight(sydney))

Example (cont.)
• Sys: “Where do you want to fly from?”
  – RAISE(?O.origin-city(O)) popped from PLAN and added to QUD
  – Future utterances interpreted in context of QUD

PRIVATE:
  AGENDA:  []
  PLAN:    RAISE(?T.dest-city(T))
           RAISE(?D.date(D))
           RESPOND(?P.price(P))
  BELIEFS: []
SHARED:
  MBEL:    [ dest-city(sydney) ]
  QUD:     [ ?O.origin-city(O) ]
  LAST MOVE:
    SPEAKER: System
    MOVE:    RAISE(?O.origin-city(O))

Example (cont.)
• User: “I want to book a flight to Sydney please”
• Sys: “Where do you want to fly from?”
• User: “San Francisco”
  – Input A resolves current Q on QUD
    • Add Q(A) = origin-city(san-francisco) to SHARED.MBEL
    • Pop Q from QUD
  – No need to act on RAISE(?T.dest-city(T)) since this is already in MBEL
• Sys: “What date would you like to leave”
  – …

Info-state update implementations
• The information-state update approach to dialogue modelling and management is an abstract approach to the dialogue process, grounded in computational semantics, but it does lend itself naturally to implementation:
  – GoDiS (Larsson et al), Dipper (Bos et al)
    • True-to-model implementations
    • Both available as open-source
  – Midiki (MITRE), CSLI DM (Stanford)
    • “Practical” Java-based implementations based on info-state update model
    • Midiki available as open-source (sourceforge)
    • CSLI happy to talk collaboration!

Implementing dialogue strategies
• Turn-taking: flexible contributions from user possible, so long as they can be classified appropriately as dialogue acts
• Grounding: plan-based approaches (Traum 1994)
• Co-constructed meaning and fragments: interpreted within context of information state, particularly against the QUD stack (Ginzburg 96, Schlangen 2003)
• Clarification requests: plan-based approaches (Purver 2004)

Building a dialogue system
[Figure: dialogue system architecture. Components: SPEECH RECOGNITION, LANGUAGE UNDERSTANDING, DIALOGUE MANAGEMENT, DIALOGUE CONTEXT, AGENT OR KB, LANGUAGE GENERATION, SPEECH SYNTHESIS. Data passed between them: Speech, Words, Interpretation, Dialogue Act, Sentence, Graphics.]

Building a dialogue system
• So far we’ve discussed:
  – Issues for each component
  – Dialogue phenomena, considerations, and desiderata
• To build a fully-functional dialogue system we need to address a host of other discourse phenomena:
  – Reference and anaphora resolution
  – Discourse
    structure
  – Conversational implicature, inference, etc.
  – General solutions to these are complex and major research issues of their own. Some off-the-shelf solutions exist. Simple solutions may be appropriate ...

Building a dialogue system
[Figure (credit to Larsson): layered view of a dialogue system, from bottom to top:
  framework: basic data structures and dataflow
  basic system: basic dialogue theory, e.g. QUD, NP-resolution
  genre-specific system: genre-specific additions, e.g. Dialogue Moves
  application: databases, agents, language models]

Building a dialogue system I
• Framework
  – Takes care of low-level programming: dataflow, data structures etc.
  – Define generic interfaces for dialogue-system components
• Examples
  – Open Agent Architecture (available from SRI)
  – Galaxy Communicator hub-and-spoke infrastructure (sourceforge)
[Figure: framework layer only: basic data structures and dataflow]

Open Agent Architecture (OAA)
• Event-based framework for building distributed applications from multiple components (agents)
• Multi-language support (Prolog, Java, C)
• Rich unification-based communication language
• Easy prototyping
• Lots of dialogue-research community support for OAA-enabled tools:
  – Wrappers for ASR/TTS, parsers, application interfaces
  – OAA-based dialogue managers
• In practice, components (ASR, NLU, etc) are not generally added “dynamically”
  – A more robust framework layer may be more appropriate …
[Figure: OAA Facilitator connected to ASR, NLU, Dialogue Mgr, NLG, TTS, and Behavioural Agent]

OAA
• Providers: register with Facilitator; called via callbacks
  – oaa_Register(asr, [ setGrammar(ASRGrammar) ])
  – oaa_Register(nlu, [ handleASROutput(InputString, Result) ])
  – oaa_Register(dm, [ updateIS(DialogAct) ])
  – oaa_Register(nlg, [ generate(DialogAct, OutputString) ])
  – oaa_Register(tts, [ say(OutString), beQuiet(), openChannel() ])

OAA
• Requesters: requests forwarded by Facilitator to appropriate agent
  – ASR agent:
    user_started_speaking :-
        oaa_Solve( bargeIn(X) ).
    user_ended_speaking(InputString) :-
        oaa_Solve( handleASROutput(InputString, X) ),
        oaa_Solve( openChannel() ),
        start_listening.
  – NLU agent:
    oaa_AppDoEvent( handleASROutput(InputString, Result), …) :-
        parse(InputString, SemanticForm),
        recogniseDialogAct(SemanticForm, DialogAct),
        oaa_Solve( updateIS(DialogAct) ).
  – Behavioural agent:
    report_observation(Content) :-
        constructDialogAct(“report”, Content, DialogAct),
        oaa_Solve( updateIS(DialogAct) ).

Building a dialogue system II
• Formulate an application-independent dialogue theory to instantiate the framework
  – GoDiS, VoiceXML, TRIPS, ...
• Add LT components
  – These may change later, depending on application
• Add other application-independent processes
  – Reference/anaphora-resolution, discourse structure, …
[Figure: framework layer (basic data structures and dataflow) plus basic system layer (basic dialogue theory, e.g. QUD, NP-resolution)]

Building a dialogue system
• Choice of components: we have previously discussed properties and issues of components
  – Choice of specific components (i.e. for ASR, NLU, DM, NLG, TTS) will depend on the domain and application, the embedding environment, the focus of the research project, etc.
• Other phenomena …
  – Reference and anaphora resolution
  – Discourse structure
  – Conversational implicature, inference, etc.
  – General solutions to these are complex and major research issues of their own. Some off-the-shelf solutions exist. Simple solutions may be appropriate ...

Other components
• Reference resolution: referents obtained from different sources
  – “We want to fly to LAX and book a hotel there”
      “We”: resolved using general dialogue setting
      “there”: resolved using dialogue history (cf. multimodal interface)
      “LAX”: resolved using general world knowledge
  – Anaphora resolution out-of-scope: see Discourse literature ...
  – Resolving referring expression using world knowledge: construct query from NP+context, run against domain-dependent database
    • “Play American Idiot by Green Day”
        Type: playable_object
        Name: American_Idiot
        Artist:
          Type: object
          Name: Green_Day
  – Inappropriate number of results may generate disambiguation question
    • “Do you want me to play the song or the album?”

Building a dialogue system III
• Add genre-dependent components
  – Dialogue moves specific to a dialogue genre
[Figure: framework, basic system, and genre-specific system layers (genre-specific additions, e.g. Dialogue Moves)]

Building a dialogue system
• Different classes of applications may involve different types of dialogue moves, or other data structures or components
  – Activity-oriented dialogue may involve representations of the activities to be performed (see CSLI DM later)
  – Different sets of Dialogue Moves for, e.g.,
    • Meetings/negotiations: e.g. argumentation
    • Intelligent tutoring

Building a dialogue system IV
• Add application-specific resources
  – Grammars / trained statistical models
  – Databases, behavioural agent/s, inference mechanisms, GUIs
[Figure: all four layers: framework, basic system, genre-specific system, and application (databases, agents, language models)]

Building a dialogue application
• Grammars / language models
  – These are likely to be developed for a specific application, or trained from an application-specific corpus: this was discussed earlier
• Databases and Behavioural Agents
  – These comprise the implementation of the application itself
  – Components may need to be “dialogue enabled”
    • Wrapper-APIs built to match Dialogue Manager query language
    • Some Dialogue Management toolkits provide domain-independent scripting languages

Other components
• Inference engine:
  – How much inference?
    Generally, this is left to the Behavioural Agent
  – An extra issue to consider: responses must generally be quick, since slow responses reduce user satisfaction.
    • One possibility is to use fillers and other feedback to retain user’s engagement (“Uhmm … let me check … just a minute”), but these should be used sparingly
  – (Bos 2003) advocates the use of model building as an inference technique well suited for dialogue systems
  – Some indirect answers can be seen as constructing constraint problems:
    • “What date do you want to fly (to San Francisco)?”
    • “I need to be in San Francisco for a conference starting on July 5.”
    • Solving such constraint systems still requires domain-specific information, e.g. the implication that the flight should arrive close to the specified date

CSLI dialogue manager
• Multi-domain dialogue management system based on information-state update model:
  – (Lemon and Gruenstein 2004, Mirkovic and Cavedon 2005)
  – Supports multi-device, multi-threaded conversations
  – Designed for activity-oriented dialogue:
    • Tasks specified collaboratively
    • “What are you doing now/next?”; “Why are you doing that?”
  – Library of dialogue moves written in Java: customizable to new domain/applications using scripting language
• Applications
  – Human-robot interaction
  – Multimodal intelligent tutoring
  – Control of multiple in-car devices
    • MP3 / entertainment, restaurant booking, point-of-interest, navigation
  – Human-human interaction in physical meeting spaces

CSLI dialogue manager
[Figure: CSLI dialogue manager architecture.
  Knowledge sources: Knowledge manager, Device manager, Device API, Activity model, Device Move script, NP resolution rules
  DM processes: Input processor, NP resolver, Output processor, Dialogue Move Tree, Activity Tree
  External components: Devices, ASR and Parser, NLG and TTS]

CSLI dialogue manager
• Dialogue moves customized to specific application using scripts

User Command: play {                  // inherits from generic “Command” DM
  Inputs { }                          // Templates to unify with parser output
  SecondaryInputs { }                 // Other sources of info for creating correct DM
  Producing {                         // possible adjoining moves: limits search space
    System WHQuestion:disambiguate    // move defined elsewhere
    System Report:play:playing {      // move defined “in place”
      Output [ avs “(e1 / play        // DM-specific generation
                 :patient (p1 / [song])
                 :aspect continuous)” ]
}}}

CSLI dialogue manager
• Activity Model
  – Domain-independent scripting language acts as API to Behaviour Agent
  – Specifies capabilities and their properties

Activity:play {
  Types: Playable;                    // must match a type in ontology
  Slots: Playable playable-object;    // matches DM script variable
  Taskdef : Slots {
    required playable-object;         // asks filler question if missing
    optional volume; } }

Evaluating dialogue systems
• Evaluating dialogue systems is not easy
  – Direct measures such as WER may not give insight into overall effectiveness of system and usability
  – Efficient performance on nominal task may not reflect efficient interaction
  – Subjective measures may be unreliable and domain-specific
• Evaluation methodologies for dialogue systems are a current and ongoing area of research …

Evaluating dialogue systems
• Some metrics (Walker et al.
  2000)
  – Task Completion Success
    • User’s perception of success tends to be better indicator of satisfaction
  – Dialogue Efficiency
    • Elapsed time, system turns vs user turns
  – Dialogue Quality
    • Mean recognition score, timeouts, rejections, helps, cancels, barge-ins (raw)
    • Timeout%, rejection%, help%, cancel%, barge-in% (normalised)
  – User Satisfaction
    • Via post-exercise questionnaire, rating 1-5 on following factors:
    • TTS performance, ASR performance, Task ease, Interaction pace, User expertise, System response, Expected behaviour, Comparable interface, Future use

Evaluation with PARADISE
• PARADISE (PARAdigm for Dialogue System Evaluation)
  – Relates previous metrics to User Satisfaction via multivariate linear regression
    • Maximize task success; minimize cost
  – Result: most significant predictors of user satisfaction:
    • Mean Recognition Score
    • User’s perception of Task Completion Success
    • Reject Percentage (i.e. system asking user to repeat/paraphrase)

Evaluation with PARADISE
• PARADISE (PARAdigm for Dialogue System Evaluation)
  – Addresses the following goals (Walker et al. 97, 00):
    • Support comparison of multiple systems on same domain tasks;
    • Provide method for developing predictive models of usability of a system as function of a range of system properties;
    • Provide technique for making generalisations across systems about which properties of system impact usability (what matters to user)
  – PARADISE algorithm:
    • Using corpus of sample dialogues of users using system
    • Use multivariate linear regression to construct models of how much weight each objective metric contributes to user satisfaction
    • Evaluated across systems and across user populations
  – Most significant predictors of user satisfaction:
    • Mean Recognition Score
    • User’s perception of Task Completion Success
    • Reject Percentage (i.e.
      system asking user to repeat/paraphrase)

Empirical evaluation methods
• Paek (2001) argues for experimental approach to evaluation
  – Develop “gold standard” for a dialogue task using Wizard of Oz
    • Select dialogue metric to serve as objective function for evaluation
    • Vary component or feature best matching metric
    • Hold all other input/output through interface constant
    • Repeat using different wizards
  – Compare competing systems against gold standard

Dialogue interaction design
• EAGLES proposal for designing dialogue system (McTear 02):
  1. Data collection
     1. Study of existing recordings in relevant domain
     2. Wizard-of-Oz simulation
     3. Transcription of dialogues
  2. Specification, design, implementation of first version
  3. Tests:
     1. Lab tests with WoZ data, and pilot study with lab workers as users
     2. Field tests with real users
  4. Tune system via iterative modify+test process
  5. Design and implement new version
  6. Perform tests as in 3.
  7. Return to 4., unless system is deemed complete

Dialogue corpora
• Can be used for research into general interaction patterns or dialogue phenomena:
  – ICSI meeting, TRAINS, Monroe, HCRC MapTask, WITAS, Switchboard
  – British National Corpus, Santa Barbara Corpus of Spoken American English, London-Lund corpus of Spoken British English, Wellington Corpus of Spoken NZ English, Australian Corpus of English

Multimodal dialogue interfaces
• Input modalities other than speech:
  – Computer device inputs: mouse, touch-screen, digital pen, PDA stylus
  – Semantically-rich: drawing/writing-recognition
  – Haptic devices: digitized glove
  – Physical gesture: pointing, eye gaze, body position, facial expression
• These modalities play important roles:
  – Natural way of resolving deictic referents
    • “Put that there”: Bolt’s (1980) voice + pointing-gesture system
  – Flexibility: different modes for different information
    • Higher user-satisfaction for multi-modal interfaces
  – Efficiency: multimodal often means shorter utterances
    • Consequence: reduces WER
  – Redundancy can increase recognition reliability: combine confidence scores from different modalities
  – Physical gesture / eye-gaze plays a central role in multi-party turn-management
  – Grounding: displays recognised referent
  – Visualisation via rich output modalities

Error reduction in MM interfaces
• Oviatt notes that multimodality can reduce input error, for a number of reasons:
  – Users select input modality that is less error prone for given lexical entry
  – User language is simpler when interacting multimodally
  – Users switch modes after error, facilitating error recovery
  – Users report less frustration with errors when interacting multimodally (even when errors are as frequent!)
  – Mutual disambiguation: combining evidence from both input streams can be less error-prone than the individual streams
• Study by Oviatt found 41% error reduction over speech-only interface

Oviatt’s myths of MM interaction
• Myth #1: if you build a multimodal system, users will interact multimodally
• Myth #2: speech and pointing is dominant multimodal integration pattern
• Myth #3: multimodal input involves simultaneous signals
• Myth #4: speech is the primary input mode in any multimodal system that includes it
• Myth #5: multimodal language does not differ linguistically from unimodal language
• Myth #6: multimodal integration involves redundancy of content between modes
• Myth #7: individual error-prone recognition technologies combine multimodally to produce even greater unreliability
• Myth #8: all users’ multimodal commands are integrated in a uniform way
• Myth #9: different input modes are capable of transmitting comparable content
• Myth #10: enhanced efficiency is the main advantage of multimodal systems

Multimodal integration
• Standard architecture:
  – Separately process modalities
  – Combine as needed (e.g.
    to resolve deictic reference)
[Figure: speech path ASR → NLU and gesture path Gesture recog → Gesture interp, both feeding Multimodal integration, which outputs a Semantic representation]

Typed feature structure unification
• Semantic representation of both speech and gesture input
  – Multiple potential gestures may be recognised (point, line, bounded area)
  – Gesture Processor classifies gestural input (using task/domain knowledge)
  – Multimodal Integrator uses type-hierarchy to relate gesture-classifications to missing component from speech
• Example: map-based application with simple semantic representations
  – “Show me the hotels on this street.”
  – NLU detects need for further constraints before referent can be determined

Typed feature structure unification
Speech input:
  cat: NP
  content:
    object:
      type: street
      echelon: road
    location:
      type: line
  modality: speech
  time: interval(…)
  prob: 0.72

Gesture input:
  cat: spatial_gesture
  content:
    location:
      type: line
      coord: list(…)
  modality: gesture
  time: interval(…)
  prob: 0.85

Unified result:
  cat: NP
  content:
    object:
      type: street
      echelon: road
    location:
      type: line
      coord: list(…)
  modality: [1]
  time: [2]
  prob: [3]
where:
  [1]: assigned modality
  [2]: assigned time; constraints on overlap and sequence part of unification process
  [3]: combined probability: e.g. multiplied

Unification-based integration
• Location type is associated with NP’s semantic representation
  – Used to select amongst possibly multiple matching spatial_gesture events returned by Gesture Interpreter
• This still leaves the issue of finding the precise referent
  – Typically use domain/interface-specific techniques to determine object matching the final description
  – May need decision for finding most likely candidate associated with location (e.g.
    nearest match)

Statistical fusion
• The feature-structure unification can combine probabilities from the different modalities to obtain a ranked list of multimodal candidates:
  – n-best list of ASR results combined with m-best list of possible gestures
  – Benefit: two modalities mutually disambiguate each other
• Other factors to consider (besides individual confidence scores):
  – Likely meaningfulness of each speech-gesture combination
  – Likely reliability of each modality’s recognition (from empirical measure)
  – E.g.: (Wu et al. 1999)

Topical research initiatives
• Adaptation:
  – Adaptive dialogue strategies
    • Appropriate confidence thresholds
      – Clarification / grounding
        » Maximally robust; minimally invasive
      – Customized to application and situation
        » Booking flight vs driving a car
  – Adapt NL generation to user
    • User modeling for generation (restaurant, navigation)
    • Salient descriptions of objects
• Other learning tasks:
  – All dialogue papers at ACL’05 were learning papers

Topical research initiatives
• Uncertainty:
  – Combining evidence from different sources
    • Multi-LT technologies, prosodic, topic-recognition, context, other modalities
    • Improve SR using context
      – Expectations from dialogue acts
      – Pragmatic features impacting n-best lists
  – Probabilistic dialogue management
    • Horvitz and Paek, Williams and Young

Topical research initiatives
• Incorporating new modalities:
  – Physical gesture: pointing, other hand movements; eye-gaze
• Multi-party dialogue:
  – Tracking human-human dialogue
    • Turn-management
    • Robust techniques for high WER scenarios
    • Grounding strategies
  – Human-system dialogue in multi-human domain
  – Multi-human meetings with speech and physical gesture
• Embodied conversational agents
  – Integrating speech, gesture, facial expression, etc
  – Effective and affective
    interaction

Topical research initiatives
• Error and repair:
  – How to deal with disfluencies, restarts, fillers, edits, multi-sentence utterances, etc.
    • Use of prosody for detection
    • NLU strategies to repair input (following Core 1997)
  – Better detection of revision and continuation dialogue acts
    • “I mean …”; “No, I said …”; “How about …”
• Architectures for robust dialogue
  – Multi-domain, multi-application
    • Use of large ontologies and language resources such as WordNet, FrameNet, etc to promote re-use
  – Combining multiple sources of information
    • NLU, shallow processing, multimodalities, prosody, etc

References and Readings
• * Asterisked entries are overview/general readings
• * Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., Stent, A. (2001), Towards conversational human-computer interaction, AI Magazine 22(4).
• Allen, J., Core, M. (1997), Draft of DAMSL: Dialog annotation markup in several layers.
• Allen, J., Ferguson, G., Stent, A. (2001), An architecture for more realistic conversational systems, Proc. Intelligent User Interfaces.
• Allen, J., Perrault, R. (1980), Analyzing intention in utterances, Artif. Intelligence, 15.
• Bock, J.K. (1986), Syntactic persistence in language production, Cogn. Psychology, 18.
• Bolt, R.A. (1980), Put-that-there: voice and gesture at the graphics interface, Computer Graphics, 14.
• Bos, J. (2003), Exploring model building for natural language understanding, ICoS-4.
• * Bos, J., Blackburn, P. (2005), Representation and Inference for Natural Language: A First Course in Computational Semantics, CSLI Publications.
• Bos, J., Clark, S., Steedman, M., Curran, J.R., Hockenmaier, J. (2004), Wide-coverage semantic representations from a CCG parser, COLING’04.
• * Clark, H.H.
(1996), Using Language, Cambridge University Press • * Cohen, M.H., Giangola, J.P., Balogh, J. (2004), Voice User Interface Design, Addison-Wesley. • Cohen, P., Perrault, R. (1979), Elements of a plan-based theory of speech acts, Cognitive Science, 3. • * Cole, R., Zue, V. (eds.) (1998), Spoken language input; in, Cole, R. et al. (eds), Survey of the State of the Art in Human Language Technology, Cambridge Uni Press. • Core, M. (1997), Dialog Parsing: From Speech Repairs to Speech Acts, PhD thesis, University of Rochester. References and Readings • Ginzburg, J. (1996), Dynamics and the semantics of dialogue, in Seligman, J., Westerstahl, D. (eds.), Logic, Language, and Computation, Vol 1, CSLI Publications. • * Jurafsky, D., Martin, J.H. (2001), An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall. Revised Chap 19, Dialogue and Conversational Agents, available at www.cs.colorado.edu/~martin/slp.html. • Larsson, S., Traum, D. (2000), Information state and dialogue management in the TRINDI dialogue move engine toolkit, in (Van Kuppevelt et al. 2000). • Lemon, O., Gruenstein, A. (2004), Multi-threaded context for robust conversational interfaces: context-sensitive speech-recognition and interpretation of corrective fragments, ACM Trans. on CHI, 11. • * McTear, M.F. (2002), Spoken dialogue technology: enabling the conversational user interface, ACM Computing Surveys, 34. • Mirkovic, D., Cavedon, L. (2005), Practical multi-domain, multi-device dialogue management, PACLING’05. References and Readings • * Oviatt, S.L. (2003), Multimodal interfaces, in Jacko, J., Sears, A. (eds.), The Human-Computer Interaction Handbook : Fundamentals, Evolving Technologies and Emerging Applications, chap. 14., Lawrence Earlbaum. • Paek, T. (2001), Empirical methods for evaluating dialog systems, ACL 2001 Workshop on Evaluation Methodologies for Language and Dialogue Systems. • Purver, M. 
(2004), The Theory and Use of Clarification Requests in Dialogue, PhD thesis, King’s College University of London. • Rayner, M., Hockey, B., Dowding, J. (2003), An open-source environment for compiling typed unification grammars into speech recognisers, EACL (demo). • Schlangen, D. (2003), A Coherence-Based Approach to the Interpretation of Non-Sentential Utterances in Dialogue, PhD thesis, University of Edinburgh. • Stolcke, A., Ries, K., Coccaro, N., Shriberg, E., Bates, R., Jurafsky, D., Taylor, P., Martin, R., Meteer, M., Ess-Dykema, C.V. (2000), Dialog act modeling for automatic tagging and recognition of conversational speech, Computational Linguistics 26(3). References and Reading • Traum, D., Allen, J. (1994), Discourse obligations in dialogue processing, ACL’94. • Traum, D., Hinkelman, E.A. (1992), Conversation acts in task-oriented spoken dialogue, Computational Linguistics 8(3). • * Van Kuppevelt, J., Heid, U., Kamp, H. (2000), Natural Language Engineering, 6, Special Issue on Best Practice in Spoken Language Dialogue System Engineering. • Walker, M., Litman, D., Kamm, C., Abella, A. (1997), PARADISE: A general framework for evaluating spoken dialogue agents, 35th ACL. • Walker, M., Kamm, C., Litman, D. (2000), Towards developing general models of usability with PARADISE, in (Van Kuppevelt et al. 2000) • Wu, L., Oviatt, S., Cohen, P. (1999), Multimodal integration – a statistical view, IEEE Transactions on Multimedia, 1. Concluding remarks • Dialogue is a rich source of very interesting problems – Presents some novel challenges for LT – Truly the intersection of LT, AI, HCI, Cognitive Science • Spoken dialogue in particular raises a host of challenges – A pain for the dialogue system application builder – But a joy for the dialogue researcher! 
• Research into dialogue modeling and dialogue systems – Models of dialogue as collaborative action (actually source of much of original research into BDI agent architectures and models of collaboration) – Some researchers focus on accounting for phenomena that arise in human-human dialogue interaction – Another possibility is to focus on robust and effective human-system dialogue • Dialogue systems and applications – Different domains/applications throw up their own challenges A date with a dialogue system
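
Worked example: statistical fusion
The statistical fusion slide describes combining an n-best list of ASR results with an m-best list of gestures so that the two modalities disambiguate each other. The sketch below is one simple way this could be done, assuming a weighted log-linear combination; it is not the method of (Wu et al. 1999). The function name `fuse`, the `compatibility` callback (standing in for the "likely meaningfulness" of a speech-gesture pair) and the weights `w_speech` / `w_gesture` (standing in for empirically measured modality reliability) are all hypothetical names chosen for illustration.

```python
import math
from itertools import product


def fuse(speech_nbest, gesture_mbest, compatibility, w_speech=1.0, w_gesture=1.0):
    """Rank speech-gesture pairs by a weighted log-linear score.

    speech_nbest / gesture_mbest: lists of (hypothesis, probability).
    compatibility: function (speech, gesture) -> value in [0, 1]; 0 means the
        pair cannot be combined into a meaningful multimodal command.
    w_speech / w_gesture: assumed per-modality reliability weights.
    """
    ranked = []
    for (s, p_s), (g, p_g) in product(speech_nbest, gesture_mbest):
        c = compatibility(s, g)
        if c <= 0.0 or p_s <= 0.0 or p_g <= 0.0:
            continue  # incompatible or impossible pair: drop it
        score = w_speech * math.log(p_s) + w_gesture * math.log(p_g) + math.log(c)
        ranked.append((score, s, g))
    # Highest combined score first
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


# Hypothetical hypotheses: (action, object type) with recogniser confidence.
speech_nbest = [(("select", "road"), 0.55), (("select", "building"), 0.45)]
gesture_mbest = [(("point", "building"), 0.7), (("point", "road"), 0.3)]


def same_object_type(speech, gesture):
    """Toy meaningfulness check: both modalities must refer to the same object type."""
    return 1.0 if speech[1] == gesture[1] else 0.0


ranked = fuse(speech_nbest, gesture_mbest, same_object_type)
best_score, best_speech, best_gesture = ranked[0]
# The gesture evidence promotes the second-ranked speech hypothesis ("building").
print(best_speech, best_gesture)
```

With the default weights the score is simply the log of the product of the two confidences, so mismatched pairs are filtered out and the remaining pairs are ranked jointly. Raising `w_gesture` above `w_speech` would express greater trust in the gesture recogniser.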
