Multimodal Dialogue Interaction Systems



Prof. Alexandros Potamianos
Technical Univ. of Crete
Spring 2007-2008
Part A: Introduction
Part A: Outline
1.   Introduction to Human-Computer
     Interfaces
2.   Introduction to Natural Language
3.   Introduction to Spoken Dialogue
4.   Architectures and Standards
5.   The Speech Business
Bibliography

 HCI Books
   Alan Dix, Janet Finlay, Gregory Abowd, Russell Beale,
   Human-Computer Interaction, 3/E, Prentice-Hall, 2004.
   Ben Shneiderman, Catherine Plaisant, Designing the
   User Interface: Strategies for Effective Human-Computer
   Interaction, 4/E, Pearson, 2004.
Introduction to Human-Computer Interfaces

              Part A.1
Outline
1.   The human
2.   The computer
3.   The interface
[material mostly from Dix et al. HCI book, ch 1-3]
The human
Information i/o …
  visual, auditory, haptic, movement
Information stored in memory
  sensory, short-term, long-term
Information processed and applied
  reasoning, problem solving, skill, error
Emotion influences human capabilities
Each person is different
Reading
 Several stages:
   visual pattern perceived
   decoded using internal representation of
   language
   interpreted using knowledge of syntax,
   semantics, pragmatics
 Reading involves saccades and fixations
 Perception occurs during fixations
 Word shape is important to recognition
 Negative contrast improves reading from
 computer screen
Hearing
 Provides information about environment:
   distances, directions, objects etc.
 Physical apparatus:
   outer ear – protects inner ear and amplifies sound
   middle ear – transmits sound waves as vibrations
                to inner ear
   inner ear – chemical transmitters are released
               and cause impulses in auditory nerve
 Sound
   pitch    – sound frequency
   loudness – amplitude
   timbre   – type or quality
Hearing (cont)
 Humans can hear frequencies from 20Hz
 to 15kHz
   less accurate distinguishing high frequencies
   than low.


 Auditory system filters sounds
   can attend to sounds over background noise.
   for example, the cocktail party phenomenon.
Touch
 Provides important feedback about environment.
 May be key sense for someone who is visually
 impaired.
 Stimulus received via receptors in the skin:
    thermoreceptors – heat and cold
    nociceptors      – pain
    mechanoreceptors – pressure
                         (some instant, some continuous)

 Some areas more sensitive than others e.g.
 fingers.
 Kinesthesis - awareness of body position
    affects comfort and performance.
Movement
 Time taken to respond to stimulus:
           reaction time + movement time
 Movement time dependent on age, fitness etc.
 Reaction time - dependent on stimulus type:
    visual   ~ 200 ms
    auditory ~ 150 ms
    pain     ~ 700 ms

 Increasing reaction time decreases accuracy in
 the unskilled operator but not in the skilled
 operator.
Movement (cont)
    Fitts' Law describes the time taken to hit
    a screen target:
           Mt = a + b log2(D/S + 1)
     where: a and b are empirically determined
     constants
           Mt is movement time
           D is Distance
           S is Size of target

⇒   targets as large as possible
    distances as small as possible
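
A minimal sketch of the Fitts' Law formula above, assuming placeholder values for the constants a and b (the slide only says they are empirically determined; the defaults below are not measured values) and arbitrary pixel units for D and S:

```python
import math

def fitts_movement_time(distance, size, a=0.1, b=0.15):
    """Mt = a + b * log2(D/S + 1).

    a and b are hypothetical placeholder constants (in seconds);
    real values must be fitted to measurements for a given device.
    """
    if distance < 0 or size <= 0:
        raise ValueError("distance must be >= 0 and size must be > 0")
    return a + b * math.log2(distance / size + 1)

# A small, far-away target versus a large, nearby one.
small_far = fitts_movement_time(distance=700, size=10)
large_near = fitts_movement_time(distance=100, size=100)
print(f"small/far: {small_far:.2f} s, large/near: {large_near:.2f} s")
```

With any positive b, the small distant target takes longer to hit, which is the design implication stated on the slide: large targets, short distances.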
 Memory
There are three types of memory function:

Sensory memories

      Short-term memory or working memory


                  Long-term memory

Selection of stimuli governed by level of arousal.
sensory memory
 Buffers for stimuli received through
 senses
   iconic memory: visual stimuli
   echoic memory: aural stimuli
   haptic memory: tactile stimuli
 Examples
   “sparkler” trail
   stereo sound
 Continuously overwritten
Short-term memory (STM)
 Scratch-pad for temporary recall

  rapid access ~ 70ms

  rapid decay ~ 200ms

  limited capacity - 7 ± 2 chunks
Long-term memory (LTM)
 Repository for all our knowledge
   slow access ~ 1/10 second
   slow decay, if any
   huge or unlimited capacity


 Two types
   episodic – serial memory of events
   semantic – structured memory of facts, concepts,
   skills

  semantic LTM derived from episodic LTM
Long-term memory (cont.)
 Semantic memory structure
   provides access to information
   represents relationships between bits of
   information
   supports inference

 Model: semantic network
   inheritance – child nodes inherit properties of
   parent nodes
   relationships between bits of information
   explicit
   supports inference through inheritance
LTM - semantic network
Models of LTM - Frames
 Information organized in data structures
 Slots in structure instantiated with values for
 instance of data
 Type–subtype relationships
    DOG
      Fixed
        legs: 4
      Default
        diet: carnivorous
        sound: bark
      Variable
        size:
        colour

    COLLIE
      Fixed
        breed of: DOG
        type: sheepdog
      Default
        size: 65 cm
      Variable
        colour
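
The DOG/COLLIE frames can be sketched as a small data structure. This is an illustrative sketch only: the `Frame` class, its `lookup` method and the rule that a subtype falls back to its parent's slots are assumptions made for the example, not a model given in the slides.

```python
class Frame:
    """A frame with fixed, default and variable slots and an optional parent type."""

    def __init__(self, name, parent=None, fixed=None, default=None, variable=None):
        self.name = name
        self.parent = parent
        self.fixed = dict(fixed or {})
        self.default = dict(default or {})
        self.variable = list(variable or [])

    def lookup(self, slot):
        """Return the slot value from this frame, else from its ancestors."""
        frame = self
        while frame is not None:
            if slot in frame.fixed:
                return frame.fixed[slot]
            if slot in frame.default:
                return frame.default[slot]
            frame = frame.parent
        return None  # variable or unknown slot with no value yet

dog = Frame("DOG",
            fixed={"legs": 4},
            default={"diet": "carnivorous", "sound": "bark"},
            variable=["size", "colour"])
collie = Frame("COLLIE", parent=dog,
               fixed={"breed of": "DOG", "type": "sheepdog"},
               default={"size": "65 cm"},
               variable=["colour"])

print(collie.lookup("legs"))   # inherited from DOG
print(collie.lookup("size"))   # COLLIE's own default
```

The type-subtype link does the work here: COLLIE answers questions about legs or diet without storing those slots itself.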
Thinking
Reasoning
    deduction, induction, abduction
Problem solving
Errors and mental models
Types of error
  slips
     right intention, but failed to do it right
     causes: poor physical skill, inattention etc.
     change to aspect of skilled behaviour can
     cause slip

  mistakes
     wrong intention
     cause: incorrect understanding
      humans create mental models to explain behaviour.
      if wrong (different from actual system) errors can occur
Emotion
Various theories of how emotion works
   James-Lange, Cannon, Schachter-Singer
Emotion clearly involves both cognitive and
physical responses to stimuli
The biological response to physical stimuli is
called affect
Affect influences how we respond to situations
   positive → creative problem solving
   negative → narrow thinking

 “Negative affect can make it harder to do even easy
   tasks; positive affect can make it easier to do
   difficult tasks”
                                          (Donald Norman)
Emotion (cont.)
 Implications for interface design
   stress will increase the difficulty of
   problem solving
   relaxed users will be more forgiving of
   shortcomings in design
   aesthetically pleasing and rewarding
   interfaces will increase positive affect
The Computer
a computer system is made up of various
elements

each of these elements affects the interaction
   input devices – text entry and pointing
   output devices – screen (small & large), digital paper
    virtual reality – special interaction and display devices
    physical interaction – e.g. sound, haptic, bio-sensing
    paper – as output (print) and input (scan)
text entry devices
       keyboards (QWERTY
       et al.)
       chord keyboards,
       phone pads, touch,
       handwriting, speech
Chord keyboards
only a few keys - four or five
letters typed as combination of keypresses
compact size
   – ideal for portable applications
short learning time
   – keypresses reflect letter shape
fast
   – once you have trained


BUT - social resistance, plus fatigue after extended use
NEW – niche market for some wearables
phone pad and T9 entry
 use numeric keys with
 multiple presses
 2   –   abc       6   –   mno
 3   –   def       7   –   pqrs
 4   –   ghi       8   –   tuv
 5   –   jkl       9   –   wxyz
 hello = 4433555[pause]555666
 surprisingly fast!
 T9 predictive entry
     type as if single key for each letter
     use dictionary to ‘guess’ the right word
     hello = 43556 …
     but 26 -> menu ‘am’ or ‘an’
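
Both entry methods can be sketched in a few lines. The function names and the tiny word list are hypothetical, chosen only for this example; a real T9 implementation uses a large dictionary with frequency ranking.

```python
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}

def multitap(word):
    """Multi-press entry: repeat the key once per letter position on that key."""
    out = []
    prev_key = None
    for ch in word.lower():
        key = LETTER_TO_KEY[ch]
        presses = KEYPAD[key].index(ch) + 1
        if key == prev_key:
            out.append("[pause]")  # same key twice in a row needs a pause
        out.append(key * presses)
        prev_key = key
    return "".join(out)

def t9_digits(word):
    """T9 entry: one press per letter."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())

def t9_candidates(digits, dictionary):
    """All dictionary words that share the given digit sequence."""
    return [w for w in dictionary if t9_digits(w) == digits]

words = ["am", "an", "hello"]        # hypothetical mini-dictionary
print(multitap("hello"))             # 4433555[pause]555666
print(t9_digits("hello"))            # 43556
print(t9_candidates("26", words))    # ['am', 'an']
```

The last call shows why T9 sometimes needs a menu: one digit sequence can map to several dictionary words.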
Handwriting recognition
 Text can be input into the computer, using a pen
 and a digitizing tablet
   natural interaction

 Technical problems:
   capturing all useful information - stroke path, pressure,
   etc. in a natural manner
   segmenting joined up writing into individual letters
   interpreting individual letters
   coping with different styles of handwriting

 Used in PDAs, and tablet computers …
 … leave the keyboard on the desk!
Speech recognition
 Improving rapidly
 Most successful when:
   single user – initial training and learns
   peculiarities
   limited vocabulary systems

 Problems with
   external noise interfering
   imprecision of pronunciation
   large vocabularies
   different speakers
positioning, pointing and
         drawing
          mouse, touchpad
          trackballs, joysticks
          etc.
          touch screens,
          tablets
          eyegaze, cursors
Eyegaze
 control interface by eye gaze direction
   e.g. look at a menu item to select it
 uses laser beam reflected off retina
   … a very low power laser!
 mainly used for evaluation (ch x)
 potential for hands-free control
 high accuracy requires headset
 cheaper and lower accuracy devices available
     sit under the screen like a small webcam
display devices
     bitmap screens (CRT
     & LCD)
     large & situated
     displays
     digital paper
virtual reality and 3D
      interaction
         positioning in 3D
         space
         moving and
         grasping
         seeing 3D (helmets
         and caves)
physical controls, sensors
           etc.
           special displays and
           gauges
           sound, touch, feel,
           smell
           physical controls
           environmental and
           bio-sensing
paper: printing and
     scanning
       print technology
       fonts, page
       description,
       WYSIWYG
       scanning, OCR
The Interaction
 interaction models
   translations between user and system
 interaction styles
   the nature of user/system dialog
 context
   social, organizational, motivational
Some terms of interaction
domain – the area of work under study
              e.g. graphic design
goal    – what you want to achieve
              e.g. create a solid red triangle
task    – how you go about doing it
        – ultimately in terms of operations or
          actions
          e.g. … select fill tool, click over triangle
execution/evaluation loop
                        goal

   execution                         evaluation
                      system

    •   user establishes the goal
    •   formulates intention
    •   specifies actions at interface
    •   executes action
    •   perceives system state
    •   interprets system state
    •   evaluates system state with respect to goal
 Human error - slips and
 mistakes
slip
       understand system and goal
       correct formulation of action
       incorrect action

mistake
       may not even have right goal!

Fixing things?
   slip – better interface design
   mistake – better understanding of system
Abowd and Beale framework
extension of Norman…
their interaction framework has 4 parts
       user (U) – task
       input (I)
       system (S) – core
       output (O)
each has its own unique language
  interaction ⇒ translation between languages

problems in interaction = problems in translation
Using Abowd & Beale’s model
user intentions
  → translated into actions at the interface
     → translated into alterations of system state
        → reflected in the output display
          → interpreted by the user
general framework for understanding
interaction
   not restricted to electronic computer systems
   identifies all major components involved in
   interaction
   allows comparative assessment of systems
   an abstraction
Indirect manipulation
 office – direct manipulation
   user interacts with artificial world
   [diagram: user – system]
 industrial – indirect manipulation
   user interacts with real world through interface
   [diagram: user – interface – plant, with immediate feedback and instruments]
 issues ..
   feedback
   delays
Common interaction styles
 command line interface
 menus
 natural language
 question/answer and query dialogue
 form-fills and spreadsheets
 WIMP
 point and click
 three–dimensional interfaces
Command line interface
  Way of expressing instructions to the computer
  directly
    function keys, single characters, short abbreviations,
    whole words, or a combination

  suitable for repetitive tasks
  better for expert users than novices
  offers direct access to system functionality
  command names/abbreviations should be
  meaningful!
Typical example: the Unix system
Menus
 Set of options displayed on the screen
 Options visible
   less recall - easier to use
   rely on recognition so names should be meaningful
 Selection by:
   numbers, letters, arrow keys, mouse
   combination (e.g. mouse plus accelerators)
 Often options hierarchically grouped
   sensible grouping is needed
 Restricted form of full WIMP system
Natural language
 Familiar to user
 speech recognition or typed natural
 language
 Problems
   vague
   ambiguous
   hard to do well!
 Solutions
   try to understand a subset
   pick on key words
Query interfaces
 Question/answer interfaces
   user led through interaction via series of
   questions
   suitable for novice users but restricted
   functionality
   often used in information systems


 Query languages (e.g. SQL)
   used to retrieve information from database
   requires understanding of database structure
   and language syntax, hence requires some
   expertise
Form-fills
 Primarily for data entry or data retrieval
 Screen like paper form.
 Data put in relevant place
 Requires
   good design
   obvious correction
   facilities
Spreadsheets
 first spreadsheet VISICALC, followed by
 Lotus 1-2-3
 MS Excel most common today
 sophisticated variation of form-filling.
   each cell in the grid contains a value or a formula
   formula can involve values of other cells
          e.g. sum of all cells in this column
   user can enter and alter data
   spreadsheet maintains consistency
WIMP Interface
    Windows
      Icons
         Menus
            Pointers
 … or windows, icons, mice, and pull-down
 menus!


 default style for majority of interactive
 computer systems, especially PCs and
 desktop machines
Point and click interfaces
 used in ..
   multimedia
   web browsers
   hypertext

 just click something!
   icons, text links or location on map

 minimal typing
Three dimensional interfaces
 virtual reality
 ‘ordinary’ window systems
   highlighting
   visual affordance
   indiscriminate use just confusing!
   [figure: flat buttons … or sculptured (“click me!”)]
 3D workspaces
   use for extra virtual space
   light and occlusion give depth
   distance effects
Speech–driven interfaces
 rapidly improving …
    … but still inaccurate

 how to have robust dialogue?
   … interaction of course!

 e.g. airline reservation:
     reliable “yes” and “no”
     + system reflects back its understanding
    “you want a ticket from New York to Boston?”
Look and … feel
 WIMP systems have the same elements:
    windows, icons, menus, pointers, buttons, etc.


 but different window systems
    … behave differently
   e.g. MacOS vs Windows menus


 appearance + behaviour        =   look and feel
Initiative
 who has the initiative?
     old question–answer     – computer
     WIMP interface      – user
 WIMP exceptions …
    pre-emptive parts of the interface
 modal dialog boxes
   come and won’t go away!
   good for errors, essential steps
   but use with care
Error and repair
can’t always avoid errors …
 … but we can put them right
make it easy to detect errors
 … then the user can repair them
           hello, this is the Go Faster booking system
           what would you like?
           (user) I want to fly from New York to London
           you want a ticket from New York to Boston
           (user) no
           sorry, please confirm one at a time
           do you want to fly from New York
           (user) yes
           ………
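
The repair strategy in this dialogue (confirm one item at a time after a rejected hypothesis) can be sketched as below. The function name, prompt wording and scripted answers are assumptions for illustration; the hypothesis deliberately contains the misrecognized destination from the example.

```python
def confirm_slots(hypothesis, answer_for):
    """Confirm each recognised slot separately.

    answer_for is any callable that maps a prompt to "yes" or "no"
    (here a scripted lookup standing in for a reliable yes/no recogniser).
    """
    confirmed, rejected = {}, []
    for slot, value in hypothesis.items():
        prompt = f"do you want to fly {slot} {value}"
        if answer_for(prompt) == "yes":
            confirmed[slot] = value
        else:
            rejected.append(slot)  # ask the user for this slot again later
    return confirmed, rejected

# "Boston" is the system's misrecognition of "London".
hypothesis = {"from": "New York", "to": "Boston"}
scripted_answers = {
    "do you want to fly from New York": "yes",
    "do you want to fly to Boston": "no",
}
confirmed, rejected = confirm_slots(hypothesis, scripted_answers.get)
print(confirmed, rejected)
```

Only the rejected slot has to be re-elicited, so one recognition error does not force the user to repeat the whole request.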
Context
Interaction affected by social and
  organizational context

 other people
    desire to impress, competition, fear of failure
 motivation
    fear, allegiance, ambition, self-satisfaction
 inadequate systems
    cause frustration and lack of motivation
Experience, engagement
                and fun
        designing
        experience
        physical
        engagement
        managing value
Other HCI concepts
 App. Development: Waterfall model
 Cognitive Models
 Design Principles
 HCI aspects of speech
The waterfall model
    Requirements specification
        Architectural design
            Detailed design
                Coding and unit testing
                    Integration and testing
                        Operation and maintenance
Activities in the life cycle
Requirements specification
  designer and customer try to capture what the system is
  expected to provide
  can be expressed in natural language or in more precise
  languages, such as a task analysis would provide

Architectural design
  high-level description of how the system will provide the
  services required
  factor system into major components and describe how
  they are interrelated
  needs to satisfy both functional and nonfunctional
  requirements

Detailed design
  refinement of architectural components and interrelations
  to identify modules to be implemented separately
  the refinement is governed by the nonfunctional requirements
The life cycle for interactive systems
 cannot assume a linear sequence of activities
 as in the waterfall model
    Requirements specification
        Architectural design
            Detailed design
                Coding and unit testing
                    Integration and testing
                        Operation and maintenance
 lots of feedback!
GOMS
Goals
    what the user wants to achieve

Operators
    basic actions user performs

Methods
    decomposition of a goal into
    subgoals/operators

Selection
    means of choosing between competing
    methods
Keystroke Level Model
(KLM)
 lowest level of (original) GOMS
 six execution phase operators
   Physical motor: K - keystroking
                   P - pointing
                   H - homing
                   D - drawing
   Mental         M - mental preparation
   System         R - response

 times are empirically determined.
     Texecute = TK + TP + TH + TD + TM + TR
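
A minimal sketch of a KLM estimate: sum the operator times for a sequence of actions. The times in `OPERATOR_TIMES` are assumed placeholder values (in seconds) supplied only to make the example run; the slide gives no numbers, and D and R depend on the drawing task and on the system, so they are left for the caller to add.

```python
# Hypothetical operator times in seconds; replace with measured values.
OPERATOR_TIMES = {
    "K": 0.28,  # keystroke or button press
    "P": 1.10,  # point at a target
    "H": 0.40,  # home hands between devices
    "M": 1.35,  # mental preparation
}

def klm_estimate(sequence, times=OPERATOR_TIMES):
    """Texecute as the sum of the times of the operators in `sequence`."""
    total = 0.0
    for op in sequence:
        total += times[op]
    return total

# Move hand to mouse (H), decide (M), point at a menu item (P), click (K).
estimate = klm_estimate("HMPK")
print(f"{estimate:.2f} s")
```

Comparing two such sums for alternative methods (for example menu versus keyboard shortcut) is the typical use of the model.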
Principles to support
usability
Learnability
  the ease with which new users can begin effective
  interaction and achieve maximal performance

Flexibility
  the multiplicity of ways the user and system exchange
  information

Robustness
  the level of support provided to the user in determining
  successful achievement and assessment of goal-directed
  behaviour
Principles of learnability
Predictability
   determining effect of future actions
   based on past interaction history
   operation visibility

Synthesizability
   assessing the effect of past actions
   immediate vs. eventual honesty
Principles of learnability (ctd)
Familiarity
   how prior knowledge applies to new system
   guessability; affordance


Generalizability
   extending specific interaction knowledge to
   new situations


Consistency
   likeness in input/output behaviour arising from
   similar situations or task objectives
Principles of flexibility
Dialogue initiative
   freedom from system imposed constraints on
   input dialogue
   system vs. user pre-emptiveness

Multithreading
   ability of system to support user interaction for
   more than one task at a time
   concurrent vs. interleaving; multimodality

Task migratability
   passing responsibility for task execution
   between user and system
Principles of flexibility (ctd)
Substitutivity
   allowing equivalent values of input and output
   to be substituted for each other
   representation multiplicity; equal opportunity


Customizability
   modifiability of the user interface by user
   (adaptability) or system (adaptivity)
Principles of robustness
Observability
   ability of user to evaluate the internal state of
   the system from its perceivable representation
   browsability; defaults; reachability;
   persistence; operation visibility


Recoverability
   ability of user to take corrective action once an
   error has been recognized
   reachability; forward/backward recovery;
   commensurate effort
Principles of robustness
(ctd)
Responsiveness
  how the user perceives the rate of
  communication with the system
  Stability


Task conformance
  degree to which system services support all of
  the user's tasks
  task completeness; task adequacy
Shneiderman’s 8 Golden
Rules
1. Strive for consistency
2. Enable frequent users to use shortcuts
3. Offer informative feedback
4. Design dialogs to yield closure
5. Offer error prevention and simple error
  handling
6. Permit easy reversal of actions
7. Support internal locus of control
8. Reduce short-term memory load
Norman’s 7 Principles
1. Use both knowledge in the world and
  knowledge in the head.
2. Simplify the structure of tasks.
3. Make things visible: bridge the gulfs of
  Execution and Evaluation.
4. Get the mappings right.
5. Exploit the power of constraints, both
  natural and artificial.
6. Design for error.
7. When all else fails, standardize.
HCI aspects of speech
 Speech modality does not “respect”
 fundamental human-computer
 interface design principles(!)
  Control
  Efficiency
  Consistency
  Familiarity and Transparency
  Forgiveness and Recovery
Introduction to Spoken
     Dialogue Systems

             Part A.3
Outline
  Discourse
    Definition
    Speech Acts
    Cognitive Aspects
  Spoken Dialogue Systems
  Multimodal Systems
  Examples
Definitions and Concepts
 Discourse
   Monologue
   Dialogue
 Human-human vs Human-computer
 discourse
 Turn-taking
   Dialogue Segmentation
Definitions and Concepts
Grounding
 Backchannel, e.g., ‘Mm Hmm’
 Acknowledgment
 Explicit/implicit confirmation
Implicature
 “What time are you flying”
 “Well, I have a meeting at three”
Initiative
 “What time are you flying?”
 “Don’t feel like booking a flight. Let’s look at hotels”
Speech Acts
Speech Acts (Austin 1962, Searle 1975)
  Assertive (conclude), Directive (ask, order), Commissive
  (promise), Expressive (apologize, thank), Declarations
Dialogue Acts
  Statement, Info-Request, Wh-Question, Yes-No Question,
  Opening, Closing, Open-Option, Action-Directive, Offer,
  Commit, Agree etc.
Application Acts
  Domain specific but general, e.g., Info-Request into
  system’s semantic state, Info-Request into database,
  Info-Request into database results
An example agent-client
interaction (Zue & Glass, 2000)
Human-Human statistics
(Zue & Glass 2000)
[figure: words per turn]
Discourse: Research Issues
 Reference resolution, e.g., “That was a
 lie”
   Anaphora, e.g., “John left …. He was bored.”
   Co-reference, e.g., “John” and “He” refer to
   the same entity
 Text coherence, e.g.,
   Coherence: “John left early. He was tired”
   Incoherence: “John left early. He likes
   spinach”
Cognitive Aspects
 Speech is a strong correlate for
   Gender, Emotion, Personality, Speaker’s face
 In human-human communication people
 expect
   Reciprocity, Symmetry, Collaboration
 Speech communication is a social act that
 implies presence
Spoken Dialogue System
 [block diagram]
   input:  speech → Speech Recognition → text →
           Semantic Parsing, NL Understanding,
           Pragmatic Analysis → semantics
   output: Dialogue Manager → Language Generation →
           Text to Speech Synthesis → speech
   groupings shown: Speech Processing,
                    Natural Language Processing
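
The module chain in this diagram can be sketched as a pipeline of functions. Everything here is a toy stand-in written for illustration: the recogniser and synthesiser simply pass text through, the parser is a single regular expression for "from X to Y", and the slot names and prompts are assumptions.

```python
import re

PROMPTS = {
    "origin": "Where are you flying from?",
    "destination": "Where are you flying to?",
    "date": "What date are you flying?",
}
CITY = r"[A-Z]\w*(?: [A-Z]\w*)*"  # one or more capitalised words

def recognise(audio):
    """ASR stand-in: the 'audio' is already a transcript in this sketch."""
    return audio

def parse(text):
    """Toy semantic parser: extract origin and destination slots."""
    match = re.search(rf"from (?P<origin>{CITY}) to (?P<destination>{CITY})", text)
    return match.groupdict() if match else {}

def manage(frame):
    """Dialogue manager: request the first missing slot, else confirm."""
    for slot in ("origin", "destination", "date"):
        if slot not in frame:
            return ("request", slot)
    return ("confirm", None)

def generate(act, frame):
    """Language generation: turn a dialogue act into a prompt."""
    kind, slot = act
    if kind == "request":
        return PROMPTS[slot]
    return (f"You want to fly from {frame['origin']} "
            f"to {frame['destination']} on {frame['date']}?")

def synthesise(text):
    """TTS stand-in: return the text that would be spoken."""
    return text

def turn(audio, frame):
    frame.update(parse(recognise(audio)))
    return synthesise(generate(manage(frame), frame))

frame = {}
reply = turn("I want to fly from New York to Boston", frame)
print(reply)
```

The `frame` dictionary plays the role of the semantic state passed between understanding and dialogue management; it persists across turns so later utterances can fill the missing date.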
SDS module interaction
(Zue & Glass 2000)
SDS Components
 Speech: ASR, TTS, audio
 Semantics
   Semantic Parser
   Semantic Interpreter
 Pragmatics & Inference
   Context Tracking
   Pragmatic Interpreter
 Application Control
 Speech Interface
   Dialogue
   Generation
Component Portability
 [diagram; legend: application independent vs. application dependent]
   Controller
   Semantics: Parser, Semantic Interpreter
   Pragmatics: Context Tracker, Pragmatic Interpreter
   Dialogue Manager: Initiative Tracking, Expert Domain Knowledge
   Generation: Utterance Planner, Surface Realizer
Examples SDS Applications
(Zue & Glass 2000)
Application Turn Statistics
(Zue & Glass 2000)
Data, data, data!
Data Collection
Multi-stage data collection.
Wizard of Oz data collection scenario
Advanced Dialogue Systems
 Mixed Initiative:
   Allow user to say anything (global grammar
   active at all states), e.g.,
     “What date are you flying?”
     “I am flying next Tuesday in the morning”
   Allow user to navigate the system’s state
   machine, e.g.,
     “I would like to look at hotels first”
   Open prompts, give user the initiative, e.g.,
   “What next?”
Advanced Dialogue Systems
 Advanced dialogue features
   Corrections, e.g., “No not Boston, Atlanta”
   Negation, e.g., “Anything but Olympic”
   Complex semantic expressions, e.g.,
   “tomorrow evening or Sunday morning”
   Ambiguity resolution and representation, e.g.,
   “next Tuesday”
   Persistent Semantics, e.g., “Info about his
   organization”
   Emotion/Cognitive state recognition
   Statistical Dialogue Modeling
Multimodal Systems
 Definitions
 Input Modalities/Output Media
 Research Issues
 Examples
Multimodal Input &
Multimedia Output
 More than one input modality and/or
 output modality
 Fusion of Inputs
 Fission of Outputs
 Advantages:
   Increased robustness, naturalness, freedom of
   choice
 Disadvantages:
   Complexity, design issues.
Input Modalities/Output Media
 [table: input modalities S, D, P, S+D, S+P vs. output media S, G, S+G
  (S = speech, D = DTMF, P = pen, G = GUI)]
 Unimodal:
   Speech input/Speech output.
 Multimodal:
   Speech+DTMF input/Speech output.
   Speech input/Speech and GUI output.
   Speech and pen/touch input w. Speech and GUI output.
 Definitions:
   Pen input: buttons, pull-down menus, graffiti, pen
   gestures.
   GUI output: text and graphics
Multimodal Issues
 Semantic/Pragmatic Module:
   Merging semantic information from different modalities,
   e.g., “Draw a line from here to there”
   Ambiguity representation and resolution
 User Interface:
   Synergies between input modalities
   Turn-taking and appropriate mix of modalities
   Maintain interface consistency
   Focus/context visualization
 System issues:
   Synchronization and latency
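
Merging speech and pen input for an utterance such as “Draw a line from here to there” can be sketched as time-based binding of deictic words to pen taps. The data classes, the `fuse` function and the 1.5 s window are assumptions for this example; real systems use richer fusion strategies.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    time: float  # seconds from start of utterance

@dataclass
class Tap:
    x: int
    y: int
    time: float

DEICTICS = {"here", "there"}

def fuse(words, taps, max_gap=1.5):
    """Bind each deictic word to the closest-in-time unused tap.

    A word with no tap within max_gap seconds stays unresolved (None),
    which the dialogue manager would have to treat as an ambiguity.
    """
    remaining = list(taps)
    bindings = {}
    for word in words:
        if word.text not in DEICTICS:
            continue
        if not remaining:
            bindings[word.text] = None
            continue
        nearest = min(remaining, key=lambda t: abs(t.time - word.time))
        if abs(nearest.time - word.time) <= max_gap:
            bindings[word.text] = (nearest.x, nearest.y)
            remaining.remove(nearest)
        else:
            bindings[word.text] = None
    return bindings

words = [Word("draw", 0.0), Word("a", 0.2), Word("line", 0.4),
         Word("from", 0.7), Word("here", 0.9), Word("to", 1.3),
         Word("there", 1.6)]
taps = [Tap(10, 20, 1.0), Tap(200, 80, 1.7)]
result = fuse(words, taps)
print(result)
```

The sketch also shows why synchronization and latency matter: if the pen events arrive late or with skewed timestamps, the bindings change or fail.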
Example: Flight Reservation
 ASR: I want to fly from Boston to New York on
 September 6th.
 [screenshot callouts: field disabled, new focus, navigation buttons]
Example: Ambiguity Resolution
Architectures and
       Standards

        Part A.4
monolithic vs. components
 Seeheim has big components

 often easier to use smaller ones
   esp. if using object-oriented toolkits

 Smalltalk used MVC – model–view–
 controller
   model – internal logical state of component
   view – how it is rendered on screen
   controller – processes user input
MVC
model - view - controller


                view


     model


              controller
MVC issues
 MVC is largely pipeline model:
   input → control → model → view → output
 but in graphical interface
   input only has meaning in relation to output
 e.g. mouse click
   need to know what was clicked
   controller has to decide what to do with click
   but view knows what is shown where!
 in practice controller ‘talks’ to view
   separation not complete
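
The mouse-click issue can be made concrete with a small sketch. The class layout, the `widget_at` hit test and the button coordinates are hypothetical; the point is only that the controller must ask the view what lies under the click.

```python
class Model:
    """Internal logical state of the component."""
    def __init__(self):
        self.selected = None

class View:
    """Knows how the model is rendered and what is shown where."""
    def __init__(self, model):
        self.model = model
        # name -> (x0, y0, x1, y1) screen rectangle, assumed layout
        self.layout = {"ok": (0, 0, 50, 20), "cancel": (60, 0, 110, 20)}

    def widget_at(self, x, y):
        for name, (x0, y0, x1, y1) in self.layout.items():
            if x0 <= x < x1 and y0 <= y < y1:
                return name
        return None

    def render(self):
        return f"selected: {self.model.selected}"

class Controller:
    """Processes user input; has to 'talk' to the view to interpret a click."""
    def __init__(self, model, view):
        self.model = model
        self.view = view

    def on_click(self, x, y):
        target = self.view.widget_at(x, y)  # controller depends on view
        if target is not None:
            self.model.selected = target
        return self.view.render()

model = Model()
view = View(model)
controller = Controller(model, view)
output = controller.on_click(70, 10)
print(output)
```

The call to `view.widget_at` inside the controller is exactly the incomplete separation noted above: input only has meaning in relation to what the view currently displays.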
PAC model
 PAC model closer to Seeheim
   abstraction – logical state of component
   presentation – manages input and output
   control – mediates between them

 manages hierarchy and multiple views
   control part of PAC objects communicate

 PAC cleaner in many ways …
    but MVC used more in practice
      (e.g. Java Swing)
PAC
presentation - abstraction - control
[diagram: hierarchy of PAC agents, each made up of
 abstraction (A), presentation (P) and control (C)]
Galaxy Hub Architecture

                        Dialog Manager
               Parser
      ASR                       Generation

     TTS         Controller      AI


   Telephony                    Interpreter/Context Tr.
                    Database
  SDS Research Architecture

       ASR                       Parser     DM/Initiative
                                                         Generation

TTS      Platform   Controller            App. Controller    AI
                                                            …

      Telephony                           Interpreter/Context Tr.
                       Database
Other SDS architectures
 Agent architectures
  Components are agents
  Read from and write to a common whiteboard
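
A minimal sketch of the whiteboard idea, assuming two toy agents with canned behaviour. The Whiteboard class, the agent functions, and the keys "audio", "text" and "intent" are hypothetical names used only for illustration.

```python
class Whiteboard:
    """Shared store that every agent can read from and write to."""
    def __init__(self):
        self.entries = {}

    def write(self, key, value):
        self.entries[key] = value

    def read(self, key):
        return self.entries.get(key)


def recognizer_agent(board):
    # Acts only when audio is present and no transcript exists yet.
    if board.read("audio") is not None and board.read("text") is None:
        board.write("text", "transfer fifty dollars")  # pretend recognition
        return True
    return False

def understanding_agent(board):
    # Acts only when a transcript is present and no intent exists yet.
    text = board.read("text")
    if text is not None and board.read("intent") is None:
        board.write("intent", text.split()[0])
        return True
    return False


board = Whiteboard()
board.write("audio", b"...")
agents = [understanding_agent, recognizer_agent]  # order does not matter

# Keep polling until no agent has anything new to contribute.
while any([agent(board) for agent in agents]):
    pass
print(board.read("intent"))
```

The agents know nothing about each other; they are coupled only through the entries on the whiteboard.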
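The Voice Web
[R. Pieraccini, SpeechCycle]
[Diagram: Telephone, Telephony Platform, Voice Browser, Internet and Web Server, with ASR and TTS attached. Labels on the diagram: MRCP (next to ASR and TTS), VoiceXML/SALT, SSML, SRGS, CCXML; SCXML and EMMA appear with question marks]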
W3C Standards
 SDS standard: VoiceXML 1.0, 2.0
 Multimodal Standards
  EMMA, SALT, XHTML+Voice
 Grammar Standards
 Controller Standards
 …
The Speech Business

           Part A.5
Voice User Interface (VUI)
Design—the Quantum Leap
[R. Pieraccini, SpeechCycle]

 1995 -- The WildFire Effect

 Change of perspective: from technology-driven to
 user-centered
    RESEARCH: free-form natural language

    COMMERCIAL: task completion and usability

 Persona: the personality of the application (TTS vs.
 Recording)

 Speech recognition accuracy is important, but success
 is determined by the VUI.

 The importance of a repeatable, streamlined,
 teachable development process
 The Speech Application Lifecycle
 [R. Pieraccini, SpeechCycle]
[Diagram: ten numbered steps (1-10) covering requirements, high level system design, system engineering, VUI design, VUI development, integration, speech science, usability, partial deployment and full deployment. Roles shown: Analyst, VUI Designer, Speech Scientist, Architect, App Developer, Engineer, Project Manager]
Voice User Interface Design
[R. Pieraccini, SpeechCycle]

[Diagram: call flow. Enter Transfer → Get Origin Account → Get Destination Account → Get Amount → "amount > origin account?". YES: Play Wrong Amount Message. NO: Play Confirmation → "confirmed?". YES: Go to Main Menu. NO: What is wrong?]

PROMPTS
Type        Wording                                                               Source
Initial     Please say the amount you would like to transfer from your            get_amount_I_1.wav
            <origin-account>                                                      TTS
            to your                                                               get_amount_I_2.wav
            <destination-account>                                                 TTS
            in dollars and cents.                                                 get_amount_I_3.wav
Retry 1     Please say the amount you would like to transfer from your            get_amount_I_1.wav
            <origin-account>                                                      TTS
            to your                                                               get_amount_I_2.wav
            <destination-account>                                                 TTS
            in dollars and cents.                                                 get_amount_I_3.wav
Retry 2     Please say the amount you would like to have transferred, like one
            hundred dollars and fifty cents.                                      get_amount_R_2_1.wav
Timeout 1   I'm sorry, I didn't hear you.                                         get_amount_T_1_1.wav
            Please say the amount you would like to transfer from your            get_amount_I_1.wav
            <origin-account>                                                      TTS
            to your                                                               get_amount_I_2.wav
            <destination-account>                                                 TTS
Timeout 2   I didn't hear you this time either. Please say the amount you would
            like to have transferred, like one hundred dollars and fifty cents.   get_amount_T_2_1.wav
Help        Please say how much do you wish to transfer. You can say the
            amount in dollars and cents, like, for instance, one hundred dollars
            and fifty cents.                                                      get_amount_H.wav

ACTIONS
CONDITION                                            ACTION
if amount greater than amount in <origin-account>    Go to "Play Wrong Amount Message"
else                                                 Go to "Play Confirmation"
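
The prompt and action tables for the Get Amount state can be read as two small lookups. The sketch below is one possible reading, not the author's implementation: the function names, the event names "start", "nomatch", "noinput" and "help", and the rule that attempts beyond the second reuse the second-level prompt are all assumptions.

```python
def prompt_type(event, count):
    """Pick a row of the prompt table from the last event and its count."""
    if event == "start":
        return "Initial"
    if event == "help":
        return "Help"
    # Assumed mapping: a misrecognition leads to a Retry prompt,
    # silence leads to a Timeout prompt.
    label = {"nomatch": "Retry", "noinput": "Timeout"}[event]
    # Assumption: later attempts keep reusing the second-level prompt.
    return f"{label} {min(count, 2)}"


def next_state(amount, origin_balance):
    """Action table: compare the requested amount with the origin account."""
    if amount > origin_balance:
        return "Play Wrong Amount Message"
    return "Play Confirmation"


print(prompt_type("start", 0))
print(prompt_type("noinput", 1))
print(prompt_type("nomatch", 3))
print(next_state(150.50, 100.00))
print(next_state(100.00, 100.00))
```

Keeping prompts and actions as data-like lookups reflects the slide's point that the VUI specification, not the recognizer, drives the application's behaviour.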
 The Architectural Evolution
 of Spoken Dialog [R. Pieraccini]

 1994   Native Code
 1998   Proprietary IVR Systems
 2000   Standard Clients (VoiceXML)
 2005   Standard Application servers
    The Evolution of the Interface
    and the Semantic Gap [R. Pieraccini]
[Chart: vertical axis from Directed Dialog to Natural Language; horizontal axis 1994 to 2006. Labels on the chart: Research Systems à la DARPA Communicator; Spoken dialog as an anthropomorphic system; Spoken dialog as a tool; Small Vocabulary Menu Based; Large Vocabulary, Dialog Modules; SLU: Statistical Language Understanding]
The evolution of the industry
[R. Pieraccini, SpeechCycle]

 Industry layers (top to bottom):
   HOSTING
   APPLICATION DEVELOPERS, PROFESSIONAL SERVICES
   TOOLS – AUTHORING, TUNING, PREPACKAGED APPLICATIONS
   PLATFORM INTEGRATORS: IVR, VoiceXML, CTI,…
   TECHNOLOGY VENDORS: SPEECH RECOGNITION, TTS

 600 to 1,000M$ revenue
 > 8000 apps worldwide
 New evolving standards guarantee interoperability of
 engines and platforms.
Some Players
 Nuance: all
 Loquendo: all
 Tellme: app-dev, hosting
 IBM, AT&T: core tech. ++
 …
 3rd generation dialog systems
 [R. Pieraccini, SpeechCycle]

 Generation   Type              Complexity   Examples
 1st          INFORMATIONAL     LOW          package tracking, flight status
 2nd          TRANSACTIONAL     MEDIUM       banking, stock trading, flight/train reservation
 3rd          PROBLEM SOLVING   HIGH         customer care, technical support
SDS telephone interface ☺
[SNL 2005]




 SpeechRecoDate.wmv
Part A: Conclusions
1.   Introduction to Human-Computer
     Interfaces
2.   Introduction to Natural Language
3.   Introduction to Spoken Dialogue
4.   Architectures and Standards
5.   The Speech Business

						