Multimodal Dialogue Interaction Systems
Prof. Alexandros Potamianos
Technical Univ. of Crete
Spring 2007-2008
Part A: Introduction
Part A: Outline
1. Introduction to Human-Computer
Interfaces
2. Introduction to Natural Language
3. Introduction to Spoken Dialogue
4. Architectures and Standards
5. The Speech Business
Bibliography
HCI Books
Alan Dix - Janet Finlay - Gregory Abowd - Russell
Beale, Human Computer Interaction, 3E, Prentice-Hall,
2004.
Ben Shneiderman, Catherine Plaisant, Designing the
User Interface: Strategies for Effective Human-
Computer Interaction, 4/E, Pearson, 2004
Introduction to Human-
Computer Interfaces
Part A.1
Outline
1. The human
2. The computer
3. The interface
[material mostly from Dix et al. HCI book, ch 1-3]
The human
Information i/o …
visual, auditory, haptic, movement
Information stored in memory
sensory, short-term, long-term
Information processed and applied
reasoning, problem solving, skill, error
Emotion influences human capabilities
Each person is different
Reading
Several stages:
visual pattern perceived
decoded using internal representation of
language
interpreted using knowledge of syntax,
semantics, pragmatics
Reading involves saccades and fixations
Perception occurs during fixations
Word shape is important to recognition
Negative contrast improves reading from
computer screen
Hearing
Provides information about environment:
distances, directions, objects etc.
Physical apparatus:
outer ear – protects inner and amplifies sound
middle ear – transmits sound waves as
vibrations to inner ear
inner ear – chemical transmitters are released
and cause impulses in auditory nerve
Sound
pitch – sound frequency
loudness – amplitude
timbre – type or quality
Hearing (cont)
Humans can hear frequencies from 20Hz
to 15kHz
less accurate distinguishing high frequencies
than low.
Auditory system filters sounds
can attend to sounds over background noise.
for example, the cocktail party phenomenon.
Touch
Provides important feedback about environment.
May be key sense for someone who is visually
impaired.
Stimulus received via receptors in the skin:
thermoreceptors – heat and cold
nociceptors – pain
mechanoreceptors – pressure
(some instant, some continuous)
Some areas more sensitive than others e.g.
fingers.
Kinesthesis - awareness of body position
affects comfort and performance.
Movement
Time taken to respond to stimulus:
reaction time + movement time
Movement time dependent on age, fitness etc.
Reaction time - dependent on stimulus type:
visual ~ 200ms
auditory~ 150 ms
pain ~ 700ms
Speeding up the response (a shorter reaction time) decreases accuracy in
the unskilled operator but not in the skilled
operator.
Movement (cont)
Fitts' Law describes the time taken to hit
a screen target:
Mt = a + b log2(D/S + 1)
where: a and b are empirically determined
constants
Mt is movement time
D is Distance
S is Size of target
⇒ targets as large as possible
distances as small as possible
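A minimal sketch of the formula above, to make the design rule concrete. The constants a and b here are made-up illustrative values (in seconds); real values have to be measured for a given device and user population.

```python
import math

def fitts_movement_time(distance, size, a=0.1, b=0.15):
    """Return Mt = a + b * log2(D/S + 1).

    a and b are illustrative placeholders, not measured constants.
    distance and size must be in the same units (e.g. pixels).
    """
    if distance < 0 or size <= 0:
        raise ValueError("distance must be >= 0 and size must be > 0")
    return a + b * math.log2(distance / size + 1)

# A large, nearby target versus a small, distant one.
near_large = fitts_movement_time(distance=100, size=50)
far_small = fitts_movement_time(distance=700, size=10)
```

With any positive b, the small distant target takes longer to hit, which is the reason for the "large targets, small distances" guideline.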
Memory
There are three types of memory function:
Sensory memories
Short-term memory or working memory
Long-term memory
Selection of stimuli governed by level of arousal.
sensory memory
Buffers for stimuli received through
senses
iconic memory: visual stimuli
echoic memory: aural stimuli
haptic memory: tactile stimuli
Examples
“sparkler” trail
stereo sound
Continuously overwritten
Short-term memory (STM)
Scratch-pad for temporary recall
rapid access ~ 70ms
rapid decay ~ 200ms
limited capacity - 7± 2 chunks
Long-term memory (LTM)
Repository for all our knowledge
slow access ~ 1/10 second
slow decay, if any
huge or unlimited capacity
Two types
episodic – serial memory of events
semantic – structured memory of facts, concepts,
skills
semantic LTM derived from episodic LTM
Long-term memory (cont.)
Semantic memory structure
provides access to information
represents relationships between bits of
information
supports inference
Model: semantic network
inheritance – child nodes inherit properties of
parent nodes
relationships between bits of information
explicit
supports inference through inheritance
LTM - semantic network
Models of LTM - Frames
Information organized in data structures
Slots in structure instantiated with values for
instance of data
Type–subtype relationships
DOG
Fixed
legs: 4
Default
diet: carnivorous
sound: bark
Variable
size:
colour:
COLLIE
Fixed
breed of: DOG
type: sheepdog
Default
size: 65 cm
Variable
colour:
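The DOG/COLLIE frames can be sketched as a small data structure in which a subtype looks up missing slots in its parent. This is only an illustration of the type-subtype idea: the Frame class, the slot names and the "lassie" instance are hypothetical, and the fixed/default/variable distinction is not modelled.

```python
class Frame:
    """A frame with named slots and an optional parent (type-subtype link)."""

    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.slots = dict(slots)

    def get(self, slot):
        # Look in this frame first, then walk up the parent chain.
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(slot)

dog = Frame("DOG", legs=4, diet="carnivorous", sound="bark")
collie = Frame("COLLIE", parent=dog, type="sheepdog", size_cm=65)
# Hypothetical instance: fills the variable slot "colour".
lassie = Frame("lassie", parent=collie, colour="brown and white")
```

Asking lassie for "legs" finds nothing in the instance or in COLLIE and falls back to DOG, which is the same inheritance mechanism the semantic network model relies on.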
Thinking
Reasoning
deduction, induction,
abduction
Problem solving
Errors and mental models
Types of error
slips
right intention, but failed to do it right
causes: poor physical skill, inattention etc.
change to aspect of skilled behaviour can
cause slip
mistakes
wrong intention
cause: incorrect understanding
humans create mental models to explain behaviour.
if wrong (different from actual system) errors can occur
Emotion
Various theories of how emotion works
James-Lange, Cannon, Schachter-Singer
Emotion clearly involves both cognitive and
physical responses to stimuli
The biological response to physical stimuli is
called affect
Affect influences how we respond to situations
positive → creative problem solving
negative → narrow thinking
“Negative affect can make it harder to do even easy
tasks; positive affect can make it easier to do
difficult tasks”
(Donald Norman)
Emotion (cont.)
Implications for interface design
stress will increase the difficulty of
problem solving
relaxed users will be more forgiving of
shortcomings in design
aesthetically pleasing and rewarding
interfaces will increase positive affect
The Computer
a computer system is made up of various
elements
each of these elements affects the interaction
input devices – text entry and pointing
output devices – screen (small&large), digital paper
virtual reality – special interaction and display devices
physical interaction – e.g. sound, haptic, bio-sensing
paper – as output (print) and input (scan)
text entry devices
keyboards (QWERTY
et al.)
chord keyboards,
phone pads, touch,
handwriting, speech
Chord keyboards
only a few keys - four or five
letters typed as combination of keypresses
compact size
– ideal for portable applications
short learning time
– keypresses reflect letter shape
fast
– once you have trained
BUT - social resistance, plus fatigue after extended use
NEW – niche market for some wearables
phone pad and T9 entry
use numeric keys with
multiple presses
2 - abc     6 - mno
3 - def     7 - pqrs
4 - ghi     8 - tuv
5 - jkl     9 - wxyz
hello = 4433555[pause]555666
surprisingly fast!
T9 predictive entry
type as if single key for each letter
use dictionary to ‘guess’ the right word
hello = 43556 …
but 26 -> menu ‘am’ or ‘an’
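Both entry methods can be sketched in a few lines. The function names and the tiny word list are assumptions for illustration; a real T9 implementation uses a large, frequency-ranked dictionary.

```python
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}

def multitap(word):
    """Multi-press entry: repeat the key once per position of the letter."""
    out = []
    prev_key = None
    for ch in word.lower():
        key = LETTER_TO_KEY[ch]
        if key == prev_key:
            out.append("[pause]")  # same key twice in a row needs a pause
        out.append(key * (KEYPAD[key].index(ch) + 1))
        prev_key = key
    return "".join(out)

def t9_digits(word):
    """Predictive entry: one key press per letter."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())

def t9_candidates(digits, dictionary):
    """All dictionary words that share the same key sequence."""
    return [w for w in dictionary if t9_digits(w) == digits]

words = ["am", "an", "at", "hello"]  # hypothetical mini-dictionary
ambiguous = t9_candidates("26", words)
```

The key sequence 26 matches both "am" and "an", so the interface has to offer a menu, as noted on the slide.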
Handwriting recognition
Text can be input into the computer, using a pen
and a digitizing tablet
natural interaction
Technical problems:
capturing all useful information - stroke path, pressure,
etc. in a natural manner
segmenting joined up writing into individual letters
interpreting individual letters
coping with different styles of handwriting
Used in PDAs, and tablet computers …
… leave the keyboard on the desk!
Speech recognition
Improving rapidly
Most successful when:
single user – initial training and learns
peculiarities
limited vocabulary systems
Problems with
external noise interfering
imprecision of pronunciation
large vocabularies
different speakers
positioning, pointing and
drawing
mouse, touchpad
trackballs, joysticks
etc.
touch screens,
tablets
eyegaze, cursors
Eyegaze
control interface by eye gaze direction
e.g. look at a menu item to select it
uses laser beam reflected off retina
… a very low power laser!
mainly used for evaluation (ch x)
potential for hands-free control
high accuracy requires headset
cheaper and lower accuracy devices available
sit under the screen like a small webcam
display devices
bitmap screens (CRT
& LCD)
large & situated
displays
digital paper
virtual reality and 3D
interaction
positioning in 3D
space
moving and
grasping
seeing 3D (helmets
and caves)
physical controls, sensors
etc.
special displays and
gauges
sound, touch, feel,
smell
physical controls
environmental and
bio-sensing
paper: printing and
scanning
print technology
fonts, page
description,
WYSIWYG
scanning, OCR
The Interaction
interaction models
translations between user and system
interaction styles
the nature of user/system dialog
context
social, organizational, motivational
Some terms of interaction
domain – the area of work under study
e.g. graphic design
goal – what you want to achieve
e.g. create a solid red triangle
task – how you go about doing it
– ultimately in terms of operations or
actions
e.g. … select fill tool, click over triangle
execution/evaluation loop
goal
execution evaluation
system
• user establishes the goal
• formulates intention
• specifies actions at interface
• executes action
• perceives system state
• interprets system state
• evaluates system state with respect to goal
Human error - slips and
mistakes
slip
understand system and goal
correct formulation of action
incorrect action
mistake
may not even have right goal!
Fixing things?
slip – better interface design
mistake – better understanding of system
Abowd and Beale framework
extension of Norman…
their interaction framework has 4 parts
user (U, task)
input (I)
system (S, core)
output (O)
each has its own unique language
interaction ⇒ translation between languages
problems in interaction = problems in translation
Using Abowd & Beale’s model
user intentions
→ translated into actions at the interface
→ translated into alterations of system state
→ reflected in the output display
→ interpreted by the user
general framework for understanding
interaction
not restricted to electronic computer systems
identifies all major components involved in
interaction
allows comparative assessment of systems
an abstraction
Indirect manipulation
office – direct manipulation
user interacts with artificial world
[diagram: user – system]
industrial – indirect manipulation
user interacts with real world through interface
[diagram: user – interface – plant, with immediate feedback and instruments]
issues ..
feedback
delays
Common interaction styles
command line interface
menus
natural language
question/answer and query dialogue
form-fills and spreadsheets
WIMP
point and click
three–dimensional interfaces
Command line interface
Way of expressing instructions to the computer
directly
function keys, single characters, short abbreviations,
whole words, or a combination
suitable for repetitive tasks
better for expert users than novices
offers direct access to system functionality
command names/abbreviations should be
meaningful!
Typical example: the Unix system
Menus
Set of options displayed on the screen
Options visible
less recall - easier to use
rely on recognition so names should be meaningful
Selection by:
numbers, letters, arrow keys, mouse
combination (e.g. mouse plus accelerators)
Often options hierarchically grouped
sensible grouping is needed
Restricted form of full WIMP system
Natural language
Familiar to user
speech recognition or typed natural
language
Problems
vague
ambiguous
hard to do well!
Solutions
try to understand a subset
pick on key words
Query interfaces
Question/answer interfaces
user led through interaction via series of
questions
suitable for novice users but restricted
functionality
often used in information systems
Query languages (e.g. SQL)
used to retrieve information from database
requires understanding of database structure
and language syntax, hence requires some
expertise
Form-fills
Primarily for data entry or data retrieval
Screen like paper form.
Data put in relevant place
Requires
good design
obvious correction
facilities
Spreadsheets
first spreadsheet VISICALC, followed by
Lotus 1-2-3
MS Excel most common today
sophisticated variation of form-filling.
each cell in the grid contains a value or a formula
formula can involve values of other cells
e.g. sum of all cells in this column
user can enter and alter data;
spreadsheet maintains consistency
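A toy sketch of "spreadsheet maintains consistency": formula cells are re-evaluated from the current values of the cells they refer to. The Sheet class is hypothetical and recomputes on every read; real spreadsheets track dependencies and cache results.

```python
class Sheet:
    """Cells hold either a plain value or a formula (a function of the sheet)."""

    def __init__(self):
        self.cells = {}

    def set(self, name, content):
        self.cells[name] = content

    def value(self, name):
        content = self.cells[name]
        # A formula is evaluated against the current state of the sheet.
        return content(self) if callable(content) else content

sheet = Sheet()
sheet.set("A1", 10)
sheet.set("A2", 32)
sheet.set("A3", lambda s: s.value("A1") + s.value("A2"))  # sum of the column

total_before = sheet.value("A3")
sheet.set("A1", 20)  # user alters data
total_after = sheet.value("A3")
```

After A1 changes, reading A3 gives the updated sum without the user touching the formula.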
WIMP Interface
Windows
Icons
Menus
Pointers
… or windows, icons, mice, and pull-down
menus!
default style for majority of interactive
computer systems, especially PCs and
desktop machines
Point and click interfaces
used in ..
multimedia
web browsers
hypertext
just click something!
icons, text links or location on map
minimal typing
Three dimensional
interfaces
virtual reality
‘ordinary’ window systems
highlighting
visual affordance
indiscriminate use just confusing!
[figure: flat buttons … or sculptured, labelled "click me!"]
3D workspaces
use for extra virtual space
light and occlusion give depth
distance effects
Speech–driven interfaces
rapidly improving …
… but still inaccurate
how to have robust dialogue?
… interaction of course!
e.g. airline reservation:
reliable “yes” and “no”
+ system reflects back its understanding
“you want a ticket from New York to Boston?”
Look and … feel
WIMP systems have the same elements:
windows, icons, menus, pointers, buttons, etc.
but different window systems
… behave differently
e.g. MacOS vs Windows menus
appearance + behaviour = look and feel
Initiative
who has the initiative?
old question–answer – computer
WIMP interface – user
WIMP exceptions …
pre-emptive parts of the interface
modal dialog boxes
come and won’t go away!
good for errors, essential steps
but use with care
Error and repair
can’t always avoid errors …
… but we can put them right
make it easy to detect errors
… then the user can repair them
hello, this is the Go Faster booking system
what would you like?
(user) I want to fly from New York to London
you want a ticket from New York to Boston
(user) no
sorry, please confirm one at a time
do you want to fly from New York
(user) yes
………
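The repair strategy in this dialogue (fall back to confirming one item at a time, using only reliable yes/no answers) can be sketched as follows. The function names, slot names and the simulated user are assumptions for illustration only.

```python
def confirm_slots(hypothesis, answer_yes_no):
    """Confirm each recognised slot with its own yes/no question.

    hypothesis: slot -> recognised value, e.g. {"from": "New York"}.
    answer_yes_no: callable that takes the question and returns True/False.
    Returns the confirmed slots and the list of slots that must be re-asked.
    """
    confirmed, rejected = {}, []
    for slot, value in hypothesis.items():
        question = f"do you want to fly {slot} {value}"
        if answer_yes_no(question):
            confirmed[slot] = value
        else:
            rejected.append(slot)
    return confirmed, rejected

# Simulated user whose real goal is New York to London.
true_goal = {"from": "New York", "to": "London"}

def simulated_user(question):
    return any(question.endswith(f"{s} {v}") for s, v in true_goal.items())

# The recogniser misheard the destination.
recognised = {"from": "New York", "to": "Boston"}
confirmed, rejected = confirm_slots(recognised, simulated_user)
```

Only the misrecognised destination is rejected, so the system can re-prompt for that slot alone instead of restarting the whole request.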
Context
Interaction affected by social and
organizational context
other people
desire to impress, competition, fear of failure
motivation
fear, allegiance, ambition, self-satisfaction
inadequate systems
cause frustration and lack of motivation
Experience, engagement
and fun
designing
experience
physical
engagement
managing value
Other HCI concepts
App. Development: Waterfall model
Cognitive Models
Design Principles
HCI aspects of speech
The waterfall model
Requirements specification
Architectural design
Detailed design
Coding and unit testing
Integration and testing
Operation and maintenance
Activities in the life cycle
Requirements specification
designer and customer try to capture what the system is
expected to provide
can be expressed in natural language or in more precise
languages, such as a task analysis would provide
Architectural design
high-level description of how the system will provide the
services required
factor system into major components and how they are
interrelated
needs to satisfy both functional and nonfunctional requirements
Detailed design
refinement of architectural components and interrelations
to identify modules to be implemented separately
the refinement is governed by the nonfunctional requirements
The life cycle for interactive
systems
cannot assume a linear sequence of activities
as in the waterfall model
lots of feedback!
[diagram: the waterfall stages (Requirements specification, Architectural design, Detailed design, Coding and unit testing, Integration and testing, Operation and maintenance) with feedback arrows between them]
GOMS
Goals
what the user wants to achieve
Operators
basic actions user performs
Methods
decomposition of a goal into
subgoals/operators
Selection
means of choosing between competing
methods
Keystroke Level Model
(KLM)
lowest level of (original) GOMS
six execution phase operators
Physical motor: K - keystroking
P - pointing
H - homing
D - drawing
Mental M - mental preparation
System R - response
times are empirically determined.
Texecute = TK + TP + TH + TD + TM + TR
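A minimal sketch of the sum above. The operator times are illustrative figures in the range usually quoted for the model, not values given in these slides; D and R depend on the task and the system, so they default to 0 here.

```python
from collections import Counter

# Illustrative per-operator times in seconds (assumed, not from the slides).
DEFAULT_TIMES = {"K": 0.2, "P": 1.1, "H": 0.4, "D": 0.0, "M": 1.35, "R": 0.0}

def klm_execute_time(operators, times=DEFAULT_TIMES):
    """Texecute = TK + TP + TH + TD + TM + TR for a string of operators."""
    counts = Counter(operators)
    unknown = set(counts) - set(times)
    if unknown:
        raise ValueError(f"unknown operators: {sorted(unknown)}")
    return sum(counts[op] * times[op] for op in times)

# Home hand on mouse, prepare mentally, point at a button, click it.
click_button = klm_execute_time("HMPK")
```

Comparing such totals for two candidate designs gives a rough estimate of which one an expert user will complete faster.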
Principles to support
usability
Learnability
the ease with which new users can begin effective
interaction and achieve maximal performance
Flexibility
the multiplicity of ways the user and system exchange
information
Robustness
the level of support provided the user in determining
successful achievement and assessment of goal-directed
behaviour
Principles of learnability
Predictability
determining effect of future actions
based on past interaction history
operation visibility
Synthesizability
assessing the effect of past actions
immediate vs. eventual honesty
Principles of learnability (ctd)
Familiarity
how prior knowledge applies to new system
guessability; affordance
Generalizability
extending specific interaction knowledge to
new situations
Consistency
likeness in input/output behaviour arising from
similar situations or task objectives
Principles of flexibility
Dialogue initiative
freedom from system imposed constraints on
input dialogue
system vs. user pre-emptiveness
Multithreading
ability of system to support user interaction for
more than one task at a time
concurrent vs. interleaving; multimodality
Task migratability
passing responsibility for task execution
between user and system
Principles of flexibility (ctd)
Substitutivity
allowing equivalent values of input and output
to be substituted for each other
representation multiplicity; equal opportunity
Customizability
modifiability of the user interface by user
(adaptability) or system (adaptivity)
Principles of robustness
Observability
ability of user to evaluate the internal state of
the system from its perceivable representation
browsability; defaults; reachability;
persistence; operation visibility
Recoverability
ability of user to take corrective action once an
error has been recognized
reachability; forward/backward recovery;
commensurate effort
Principles of robustness
(ctd)
Responsiveness
how the user perceives the rate of
communication with the system
Stability
Task conformance
degree to which system services support all of
the user's tasks
task completeness; task adequacy
Shneiderman’s 8 Golden
Rules
1. Strive for consistency
2. Enable frequent users to use shortcuts
3. Offer informative feedback
4. Design dialogs to yield closure
5. Offer error prevention and simple error
handling
6. Permit easy reversal of actions
7. Support internal locus of control
8. Reduce short-term memory load
Norman’s 7 Principles
1. Use both knowledge in the world and
knowledge in the head.
2. Simplify the structure of tasks.
3. Make things visible: bridge the gulfs of
Execution and Evaluation.
4. Get the mappings right.
5. Exploit the power of constraints, both
natural and artificial.
6. Design for error.
7. When all else fails, standardize.
HCI aspects of speech
Speech modality does not “respect”
fundamental human-computer
interface design principles(!)
Control
Efficiency
Consistency
Familiarity and Transparency
Forgiveness and Recovery
Introduction to Spoken
Dialogue Systems
Part A.3
Outline
Discourse
Definition
Speech Acts
Cognitive Aspects
Spoken Dialogue Systems
Multimodal Systems
Examples
Definitions and Concepts
Discourse
Monologue
Dialogue
Human-human vs Human-computer
discourse
Turn-taking
Dialogue Segmentation
Definitions and Concepts
Grounding
Backchannel, e.g., ‘Mm Hmm’
Acknowledgment
Explicit/implicit confirmation
Implicature
“What time are you flying”
“Well, I have a meeting at three”
Initiative
“What time are you flying?”
“Don’t feel like booking a flight. Let’s look at hotels”
Speech Acts
Speech Acts (Austin 1962, Searle 1975)
Assertive (conclude), Directive (ask, order), Commissive
(promise), Expressive(apologize, thank), Declarations
Dialogue Acts
Statement, Info-Request, Wh-Question, Yes-No Question,
Opening, Closing, Open-Option, Action-Directive, Offer,
Commit, Agree etc.
Application Acts
Domain specific but general, e.g., Info-Request into
system’s semantic state, Info-Request into database,
Info-Request into database results
An example agent-client
interaction (Zue & Glass, 2000)
Human-Human statistics
(Zue & Glass 2000)
Words per turn
Discourse: Research Issues
Reference resolution, e.g., “That was a
lie”
Anaphora, e.g., “John left …. He was bored.”
Co-reference, e.g., “John” and “He” refer to
the same entity
Text coherence, e.g.,
Coherence: “John left early. He was tired”
Incoherence: “John left early. He likes
spinach”
Cognitive Aspects
Speech is a strong correlate for
Gender, Emotion, Personality, Speaker’s face
In human-human communication people
expect
Reciprocity, Symmetry, Collaboration
Speech communication is a social act that
implies presence
Spoken Dialogue System
[block diagram of modules: Speech Recognition, Semantic Parsing, NL Understanding, Pragmatic Analysis, Dialogue Manager, Language Generation, Text to Speech Synthesis]
[data passed between modules: speech, text, semantics]
[module groups: Speech Processing, Natural Language Processing]
SDS module interaction
(Zue & Glass 2000)
SDS Components
Speech: ASR, TTS, audio
Semantics
Semantic Parser
Semantic Interpreter
Pragmatics & Inference
Context Tracking
Pragmatic Interpreter
Application Control
Speech Interface
Dialogue
Generation
Component Portability
[diagram: components marked as application independent or application dependent]
Controller
Semantics: Semantic Parser, Semantic Interpreter
Pragmatics: Context Tracker, Pragmatic Interpreter
Dialogue Manager: Initiative Tracking, Expert Domain Knowledge
Generation: Utterance Planner, Surface Realizer
Examples SDS Applications
(Zue & Glass 2000)
Application Turn Statistics
(Zue & Glass 2000)
Data, data, data!
Data
Collection
Multi-stage data
collection.
Wizard of Oz data
collection scenario
Advanced Dialogue Systems
Mixed Initiative:
Allow user to say anything (global grammar
active at all states), e.g., “What date are you
flying”
“I am flying next Tuesday in the morning”
Allow user to navigate the system’s state
machine, e.g.,
“I would like to look at hotels first”
Open prompts, give user the initiative, e.g.,
“What next?”
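A toy sketch of the first two mixed-initiative behaviours: the same "global" patterns are active in every state, so the user can give more than was asked for, or jump to another part of the state machine. The regular expressions, slot names and state names are assumptions for illustration; a real system would use a recognition grammar or a statistical understanding module.

```python
import re

# Patterns active in every dialogue state (a stand-in for a global grammar).
SLOT_PATTERNS = {
    "date": re.compile(r"\b(next \w+|tomorrow|today)\b"),
    "time_of_day": re.compile(r"\b(morning|afternoon|evening)\b"),
    "task": re.compile(r"\b(hotel|flight)s?\b"),
}

def understand(utterance):
    """Extract every slot mentioned, regardless of what the system asked."""
    text = utterance.lower()
    found = {}
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[slot] = match.group(1)
    return found

def next_state(current_state, utterance, frame):
    """Fill the frame and let the user move to another task if they ask to."""
    slots = understand(utterance)
    frame.update({k: v for k, v in slots.items() if k != "task"})
    return slots.get("task", current_state)

frame = {}
state = next_state("flight", "I am flying next Tuesday in the morning", frame)
state_after_jump = next_state(state, "I would like to look at hotels first", frame)
```

The first answer fills both the date and the time of day although only the date was asked for; the second utterance moves the dialogue to the hotel task while keeping what was already collected.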
Advanced Dialogue Systems
Advanced dialogue features
Corrections, e.g., “No not Boston, Atlanta”
Negation, e.g., “Anything but Olympic”
Complex semantic expressions, e.g.,
“tomorrow evening or Sunday morning”
Ambiguity resolution and representation, e.g.,
“next Tuesday”
Persistent Semantics, e.g., “Info about his
organization”
Emotion/Cognitive state recognition
Statistical Dialogue Modeling
Multimodal Systems
Definitions
Input Modalities/Output Media
Research Issues
Examples
Multimodal Input &
Multimedia Output
More than one input modality and/or
output modality
Fusion of Inputs
Fission of Outputs
Advantages:
Increased robustness, naturalness, freedom of
choice
Disadvantages:
Complexity, design issues.
Input Modalities/Output Media
[diagram: inputs S, D, P, S+D, S+P; outputs S, G, S+G (S = speech, D = DTMF, P = pen, G = GUI)]
Unimodal:
Speech input/Speech output.
Multimodal:
Speech+DTMF input/Speech output.
Speech input/Speech and GUI output.
Speech and pen/touch input w. Speech and GUI output.
Definitions:
Pen input: buttons, pull-down menus, graffiti, pen
gestures.
GUI output: text and graphics
Multimodal Issues
Semantic/Pragmatic Module:
Merging semantic information from different modalities,
e.g., “Draw a line from here to there”
Ambiguity representation and resolution
User Interface:
Synergies between input modalities
Turn-taking and appropriate mix of modalities
Maintain interface consistency
Focus/context visualization
System issues:
Synchronization and latency
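Merging “Draw a line from here to there” with two pointing gestures can be sketched as time-based fusion: each deictic word is bound to the gesture closest to it in time. The data classes, the max_gap threshold and the timestamps are assumptions for illustration; they also show why synchronization and latency matter, since shifted timestamps would break the binding.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    time: float  # seconds from start of utterance

@dataclass
class Gesture:
    x: int
    y: int
    time: float  # same clock as the speech timestamps

DEICTIC = {"here", "there"}

def fuse(words, gestures, max_gap=1.0):
    """Bind each deictic word to the nearest unused gesture in time."""
    remaining = list(gestures)
    bindings = []
    for word in words:
        if word.text not in DEICTIC:
            continue
        if not remaining:
            raise ValueError(f"no gesture left for '{word.text}'")
        nearest = min(remaining, key=lambda g: abs(g.time - word.time))
        if abs(nearest.time - word.time) > max_gap:
            raise ValueError(f"no gesture close enough to '{word.text}'")
        remaining.remove(nearest)
        bindings.append((word.text, (nearest.x, nearest.y)))
    return bindings

texts = ["draw", "a", "line", "from", "here", "to", "there"]
times = [0.0, 0.2, 0.4, 0.7, 0.9, 1.3, 1.6]
words = [Word(t, s) for t, s in zip(texts, times)]
gestures = [Gesture(10, 20, 1.0), Gesture(200, 80, 1.7)]
line_endpoints = fuse(words, gestures)
```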
Example: Flight Reservation
ASR: I want to fly from
Boston to New York on
September 6th.
field disabled
new focus
navigation buttons
Example: Ambiguity Resolution
Architectures and
Standards
Part A.4
monolithic vs. components
Seeheim has big components
often easier to use smaller ones
esp. if using object-oriented toolkits
Smalltalk used MVC – model–view–
controller
model – internal logical state of component
view – how it is rendered on screen
controller – processes user input
MVC
model - view - controller
view
model
controller
MVC issues
MVC is largely pipeline model:
input → control → model → view → output
but in graphical interface
input only has meaning in relation to output
e.g. mouse click
need to know what was clicked
controller has to decide what to do with click
but view knows what is shown where!
in practice controller ‘talks’ to view
separation not complete
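The mouse-click issue can be sketched in a few lines: the controller receives raw coordinates, but only the view knows what is drawn where, so the controller has to ask it. All class, method and region names here are hypothetical.

```python
class Model:
    """Internal logical state: which item is selected."""
    def __init__(self):
        self.selected = None

class View:
    """Knows where each item is rendered on screen."""
    def __init__(self, model):
        self.model = model
        # name -> (x0, y0, x1, y1); hypothetical layout
        self.regions = {"open": (0, 0, 50, 20), "save": (60, 0, 110, 20)}

    def hit_test(self, x, y):
        for name, (x0, y0, x1, y1) in self.regions.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                return name
        return None

    def render(self):
        return " ".join(
            f"[{name}]" if name == self.model.selected else name
            for name in self.regions
        )

class Controller:
    """Processes user input, but must 'talk' to the view to interpret it."""
    def __init__(self, model, view):
        self.model = model
        self.view = view

    def on_click(self, x, y):
        target = self.view.hit_test(x, y)  # controller depends on the view
        if target is not None:
            self.model.selected = target
        return self.view.render()

model = Model()
view = View(model)
controller = Controller(model, view)
after_click = controller.on_click(70, 10)
after_miss = controller.on_click(500, 500)
```

The call to hit_test is the point where the pipeline picture breaks down: input cannot be interpreted without consulting the output side.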
PAC model
PAC model closer to Seeheim
abstraction – logical state of component
presentation – manages input and output
control – mediates between them
manages hierarchy and multiple views
control part of PAC objects communicate
PAC cleaner in many ways …
but MVC used more in practice
(e.g. Java Swing)
PAC
presentation - abstraction - control
[diagram: hierarchy of PAC agents, each with abstraction (A), presentation (P) and control (C) parts]
Galaxy Hub Architecture
[diagram: Controller at the centre, connected to ASR, TTS, Telephony, Parser, Dialog Manager, Generation, AI, Interpreter/Context Tr. and Database]
SDS Research Architecture
[diagram: Platform Controller and App. Controller connected to ASR, TTS, Telephony, Parser, DM/Initiative, Generation, AI, Interpreter/Context Tr., Database, …]
Other SDS architectures
Agent architectures
Components are agents
Read/Write from a common white-
board
The Voice Web
[R. Pieraccini, SpeechCycle]
[diagram components: Telephone, Telephony Platform, Voice Browser, ASR, TTS, Internet, Web Server]
[standards shown: CCXML, VoiceXML/SALT, SSML, SRGS, MRCP; SCXML? EMMA?]
W3C Standards
SDS standard: VoiceXML 1.0, 2.0
Multimodal Standards
EMMA, SALT, HTML+Voice
Grammar Standards
Controller Standards
….
The Speech Business
Part A.5
Voice User Interface (VUI)
Design—the Quantum Leap
[R. Pieraccini, SpeechCycle]
1995 -- The WildFire Effect
Change of perspective: From technology driven to user
centered
RESEARCH: Natural Language free form
COMMERCIAL: Task completion and usability.
Persona: the personality of the application (TTS vs.
Recording)
Speech recognition accuracy is important, but success
is determined by the VUI.
The importance of a repeatable, streamlined,
teachable, development process
The Speech Application Lifecycle
[R. Pieraccini, SpeechCycle]
[diagram: cycle of numbered steps 1 to 10]
Stages: requirements, high level system design, VUI design, VUI development, system integration engineering, usability, speech science, partial deployment, full deployment
Roles: Analyst, VUI Designer, Architect, App Developer, Engineer, Speech Scientist, Project Manager
Voice User Interface Design
[R. Pieraccini, SpeechCycle]
[call flow diagram nodes: Enter Transfer; Get Origin Account; Get Destination Account; Get Amount; amount > origin account? (YES/NO); Play Wrong Amount Message; Play Confirmation; confirmed? (YES/NO); What is wrong? (origin account, destination account, amount); Go to Main Menu]
PROMPTS
Type | Wording | Source
Initial | Please say the amount you would like to transfer from your | get_amount_I_1.wav
 | <origin-account> | TTS
 | to your | get_amount_I_2.wav
 | <destination-account> | TTS
 | in dollars and cents. | get_amount_I_3.wav
Retry 1 | Please say the amount you would like to transfer from your | get_amount_I_1.wav
 | <origin-account> | TTS
 | to your | get_amount_I_2.wav
 | <destination-account> | TTS
 | in dollars and cents. | get_amount_I_3.wav
Retry 2 | Please say the amount you would like to have transferred, like one hundred dollars and fifty cents. | get_amount_R_2_1.wav
Timeout 1 | I'm sorry, I didn't hear you. | get_amount_T_1_1.wav
 | Please say the amount you would like to transfer from your | get_amount_I_1.wav
 | <origin-account> | TTS
 | to your | get_amount_I_2.wav
 | <destination-account> | TTS
Timeout 2 | I didn't hear you this time either. Please say the amount you would like to have transferred, like one hundred dollars and fifty cents. | get_amount_T_2_1.wav
Help | Please say how much do you wish to transfer. You can say the amount in dollars and cents, like, for instance, one hundred dollars and fifty cents. | get_amount_H.wav
ACTIONS
CONDITION | ACTION
if amount greater than amount in <origin-account> | Go to "Play Wrong Amount Message"
else | Go to "Play Confirmation"
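The prompt table and the condition/action rule of the "Get Amount" state can be sketched as two small functions. The function names and the "TTS:" prefix are assumptions; the file names and node names are taken from the slide. Replaying the complete initial prompt after the first timeout is also an assumption, since the extracted table appears to be cut off at that row.

```python
INITIAL = [
    "get_amount_I_1.wav", "TTS:<origin-account>",
    "get_amount_I_2.wav", "TTS:<destination-account>",
    "get_amount_I_3.wav",
]

def prompt_sequence(event, attempt=1):
    """Return the audio items to play for an event in the Get Amount state."""
    if event == "initial" or (event == "retry" and attempt == 1):
        return list(INITIAL)
    if event == "retry":
        return ["get_amount_R_2_1.wav"]
    if event == "timeout" and attempt == 1:
        # Assumption: apologise, then replay the full initial prompt.
        return ["get_amount_T_1_1.wav"] + INITIAL
    if event == "timeout":
        return ["get_amount_T_2_1.wav"]
    if event == "help":
        return ["get_amount_H.wav"]
    raise ValueError(f"unknown event: {event}")

def next_node(amount, origin_balance):
    """Condition/action rule applied once an amount has been recognised."""
    if amount > origin_balance:
        return "Play Wrong Amount Message"
    return "Play Confirmation"

too_much = next_node(amount=150.50, origin_balance=100.00)
ok = next_node(amount=50.00, origin_balance=100.00)
```

Keeping prompts and transitions in tables like this is what makes the VUI design repeatable and reviewable before any code is written.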
The Architectural Evolution
of Spoken Dialog [R. Pieraccini]
1994: Native Code
1998: Proprietary IVR Systems
2000: Standard Clients (VoiceXML)
2005: Standard Application servers
The Evolution of the Interface
and the Semantic Gap [R. Pieraccini]
[chart: time axis 1994 to 2006; vertical axis from Directed Dialog to Natural Language]
Research Systems a-la DARPA Communicator
Spoken dialog as an anthropomorphic system
Spoken dialog as a tool
Small Vocabulary Menu Based
Large Vocabulary, Dialog Modules
SLU: Statistical Language Understanding
The evolution of the industry
[R. Pieraccini, SpeechCycle]
HOSTING
APPLICATION DEVELOPERS
PROFESSIONAL SERVICES
TOOLS – AUTHORING, TUNING, PREPACKAGED APPLICATIONS
PLATFORM INTEGRATORS
IVR, VoiceXML, CTI,…
TECHNOLOGY VENDORS
SPEECH RECOGNITION, TTS
600 to 1,000M$ revenue
> 8000 apps worldwide
New evolving standards guarantee interoperability of engines and platforms.
Some Players
Nuance: all
Loquendo: all
Tell-me: app-dev, hosting
IBM, AT&T: core tech. ++
…
3rd generation dialog systems
[R. Pieraccini, SpeechCycle]
1st Generation: INFORMATIONAL (LOW complexity)
PACKAGE TRACKING, FLIGHT STATUS
2nd Generation: TRANSACTIONAL (MEDIUM complexity)
BANKING, STOCK TRADING, FLIGHT/TRAIN RESERVATION
3rd Generation: PROBLEM SOLVING (HIGH complexity)
CUSTOMER CARE, TECHNICAL SUPPORT
SDS telephone interface ☺
[SNL 2005]
SpeechRecoDate.wmv
Part A: Conclusions
1. Introduction to Human-Computer
Interfaces
2. Introduction to Natural Language
3. Introduction to Spoken Dialogue
4. Architectures and Standards
5. The Speech Business