					Designing Robust Multimodal Systems for
Diverse Users and Mobile Environments

Sharon Oviatt
oviatt@cse.ogi.edu
http://www.cse.ogi.edu/CHCC/




Center for Human Computer Communication
Department of Computer Science, OGI
Introduction to Perceptive Multimodal Interfaces
• Multimodal interfaces recognize combined natural
  human input modes (speech & pen, speech & lip
  movements)
• Radical departure from GUIs in basic features,
  interface design & architectural underpinnings
• Rapid development of bimodal systems in the 1990s
• New fusion & language processing techniques
• Diversification of mode combinations & applications
• More general & robust hybrid architectures


Advantages of Multimodal Interfaces
• Flexibility & expressive power
• Support for users’ preferred interaction style
• Accommodate more users, tasks, environments
• Improved error handling & robustness
• Support for new forms of computing, including mobile
  & pervasive interfaces
• Permit multifunctional & tailored mobile interfaces,
  adapted to user, task & environment



The Challenge of Robustness:
Unimodal Speech Technology’s Achilles’ Heel

• Recognition errors currently limit commercialization
  of speech technology, especially for:
   – Spontaneous interactive speech
   – Diverse speakers & speaking styles (e.g.,
     accented)
   – Speech in natural field environments (e.g.,
     mobile)
• 20-50% drop in accuracy typical for real-world
  usage conditions

Improved Error Handling in Flexible Multimodal Interfaces
• Users can avoid errors through mode selection
• Users’ multimodal language is simplified, which
  reduces complexity of NLP & avoids errors
• Users mode switch after system errors, which
  undercuts error spirals & facilitates recovery
• Multimodal architectures potentially can support
  “mutual disambiguation” of input signals


Example of Mutual Disambiguation:
QuickSet Interface during Multimodal "PAN" Command

[Architecture diagram: Multimodal input on the user interface flows through
parallel Speech Recognition and Gesture Recognition, then Spoken Language
Interpretation and Gestural Language Interpretation, into the Multimodal
Integrator, across the Multimodal Bridge, and back to the user as a System
Confirmation]

Processing & Architecture
• Speech & gestures processed in parallel
• Statistically ranked unification of semantic interpretations (sketched below)
• Multi-agent architecture coordinates signal recognition, language
  processing, & multimodal integration
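A minimal sketch of the statistically ranked unification idea follows; the frame contents, scores, and helper names (Hypothesis, unify, integrate) are invented for illustration and are not QuickSet's actual code. It shows how a lower-ranked speech hypothesis can be "pulled up" when it is the only one that unifies with the gesture interpretation, which is the mechanism behind mutual disambiguation.

# Sketch only: statistically ranked unification over speech & gesture n-best lists
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hypothesis:
    meaning: dict   # partial semantic frame, e.g. {"cmd": "pan", "dir": "east"}
    score: float    # recognizer confidence for this n-best entry

def unify(a: dict, b: dict) -> Optional[dict]:
    """Feature unification: merge two partial frames, fail on any conflict."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None          # semantically incompatible interpretations
        merged[key] = value
    return merged

def integrate(speech_nbest, gesture_nbest):
    """Rank every unifiable speech/gesture pair by joint score."""
    candidates = []
    for i, s in enumerate(speech_nbest):
        for j, g in enumerate(gesture_nbest):
            frame = unify(s.meaning, g.meaning)
            if frame is not None:
                candidates.append((s.score * g.score, i, j, frame))
    return max(candidates, key=lambda c: c[0]) if candidates else None

# Mutual disambiguation: the top speech hypothesis ("pam") unifies with no
# gesture, so the rank-2 speech hypothesis is pulled up to win overall.
speech = [Hypothesis({"cmd": "pam"}, 0.6),
          Hypothesis({"cmd": "pan", "obj": "map"}, 0.4)]
gesture = [Hypothesis({"cmd": "pan", "dir": "east"}, 0.7)]
print(integrate(speech, gesture))

Pruning cross-modal pairs that fail to unify is what lets one mode rescue the other; these rescued signals are the "pull-ups" counted in the studies that follow.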
        General Research Questions

• To what extent can a multimodal system support
  mutual disambiguation of input signals?
• How much is robustness improved in a multimodal
  system, compared with a unimodal one?
• In what usage contexts and for what user groups is
  robustness most enhanced by a multimodal
  system?
• What are the asymmetries between modes in
  disambiguation likelihoods?
        Study 1- Research Method

• QuickSet testing with map-based tasks
  (community fire & flood management)
• 16 users— 8 native speakers & 8 accented
  (varied Asian, European & African accents)
• Research design— completely-crossed factorial
  with between-subjects factors:
                     (1) Speaker status (accented, native)
                     (2) Gender
• Corpus of 2,000 multimodal commands
  processed by QuickSet
              Videotape


Multimodal system processing
for accented and mobile users




                 Study 1- Results

• 1 in 8 multimodal commands succeeded due to
  mutual disambiguation (MD) of input signals
• MD levels significantly higher for accented speakers
  than native ones—
           15% vs 8.5% of utterances
• Ratio of speech to total signal pull-ups differed for
  users (computed as in the sketch below)—
           .65 accented vs .35 native
• Results replicated across signal & parse-level MD
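As a rough illustration of how the two measures above can be scored from a processed corpus, here is a small sketch; the log fields (correct, speech_rank, gesture_rank) and the helper mutual_disambiguation_stats are invented for this example and are not the study's actual analysis code.

# Hypothetical post-hoc scoring of a processed command log
def mutual_disambiguation_stats(commands):
    """Each command records the rank of the speech and gesture hypotheses
    chosen by the integrator (rank 0 = the recognizer's own top choice) and
    whether the final multimodal interpretation was correct."""
    md_commands = [c for c in commands
                   if c["correct"] and (c["speech_rank"] > 0 or c["gesture_rank"] > 0)]
    md_rate = len(md_commands) / len(commands)

    # A "pull-up" is a signal rescued from below rank 0; the ratio reports
    # what fraction of all pull-ups were speech rather than gesture.
    speech_pullups = sum(c["speech_rank"] > 0 for c in md_commands)
    gesture_pullups = sum(c["gesture_rank"] > 0 for c in md_commands)
    total_pullups = speech_pullups + gesture_pullups
    speech_ratio = speech_pullups / total_pullups if total_pullups else 0.0
    return md_rate, speech_ratio

commands = [
    {"correct": True, "speech_rank": 2, "gesture_rank": 0},   # speech pulled up
    {"correct": True, "speech_rank": 0, "gesture_rank": 0},   # no MD needed
    {"correct": True, "speech_rank": 0, "gesture_rank": 1},   # gesture pulled up
    {"correct": False, "speech_rank": 0, "gesture_rank": 0},
]
print(mutual_disambiguation_stats(commands))   # (0.5, 0.5)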
Table 1— Mutual Disambiguation Rates for Native versus Accented Speakers

                                   NATIVE               ACCENTED
                                  SPEAKERS              SPEAKERS
   MD LEVELS:
    Signal MD level                 8.5%                 15.0%*
    Parse MD level                 25.5%                 31.7%*
    Ratio of speech
      signal pull-ups                .35                   .65*



Table 2- Recognition Rate Differentials between
  Native and Accented Speakers for Speech,
     Gesture and Multimodal Commands


                               NATIVE                ACCENTED
                              SPEAKERS               SPEAKERS
 RECOGNITION RATE
 DIFFERENTIAL:
   Speech                          —                  -9.5%*
   Gesture                      -3.4%*                   —
   Multimodal                      —                     —



         Study 1- Results (cont.)


Compared to traditional speech processing,
spoken language processed within a multimodal
architecture yielded:

  41.3% reduction in total speech error rate
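This figure is presumably the standard relative error rate reduction (the talk itself reports only the percentage); written with generic symbols, where E denotes a total speech error rate under each processing condition:

\text{relative reduction} \;=\; \frac{E_{\text{speech-only}} - E_{\text{multimodal}}}{E_{\text{speech-only}}} \times 100\%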

No gender or practice effects found in MD rates
         Study 2- Research Method

• QuickSet testing with the same 100 map-based tasks
• Main study:
   – 16 users with high-end mic (close-talking, noise-
     canceling)
   – Research design completely-crossed factorial:
       (1) Usage Context- Stationary vs Mobile (within subjects)
       (2) Gender
• Replication:
  – 6 users with low-end mic (built-in, no noise cancellation)
  – Compared stationary vs mobile
       Study 2- Research Analyses


• Corpus of 2,600 multimodal commands
• Signal amplitude, background noise & SNR
  estimated for each command (see the sketch below)
• Mutual disambiguation & multimodal system
  recognition rates analyzed in relation to dynamic
  signal data
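A rough sketch of these per-command signal measures follows, under the assumption of 16 kHz mono audio in which each recording begins with a short stretch of background noise; the function name estimate_snr, the noise-window length, and the synthetic test signal are invented for illustration, not taken from the study's instrumentation.

# Hypothetical per-command signal analysis (amplitude, noise floor, SNR)
import numpy as np

def estimate_snr(samples, noise_ms=300, rate=16000):
    """Estimate speech amplitude, background noise level, and SNR in dB,
    treating the first `noise_ms` milliseconds as background noise."""
    split = int(rate * noise_ms / 1000)
    noise, speech = samples[:split], samples[split:]

    def rms(x):
        return float(np.sqrt(np.mean(np.square(x)))) + 1e-12  # avoid log(0)

    noise_rms, speech_rms = rms(noise), rms(speech)
    return {
        "speech_db": 20 * np.log10(speech_rms),
        "noise_db": 20 * np.log10(noise_rms),
        "snr_db": 20 * np.log10(speech_rms / noise_rms),
    }

# Synthetic example: 0.3 s of low-level noise followed by a 1 kHz tone.
t = np.arange(0, 1.0, 1 / 16000)
audio = 0.01 * np.random.randn(len(t))
audio[4800:] += 0.5 * np.sin(2 * np.pi * 1000 * t[4800:])
print(estimate_snr(audio))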


   Mobile user with hand-held system & close-
talking headset in moderately noisy environment
                (40-60 dB noise)




Mobile research infrastructure, with user
 instrumentation and researcher field
                station




                 Study 2- Results


• 1 in 7 multimodal commands succeeded due to
  mutual disambiguation of input signals
• MD levels significantly higher during mobile than
  stationary system use—
           16% vs 9.5% of utterances
• Results replicated across signal and parse-level MD

 Table 3- Mutual Disambiguation Rates
during Stationary and Mobile System Use



                                 STATIONARY          MOBILE
     SIGNAL MD LEVELS:
      Noise-canceling mic            7.5%            11.0%*
      Built-in mic                   11.4%           21.5%*


     RATIO OF SPEECH                   .26            .34*
     PULL-UPS


Table 4- Recognition Rate Differentials during Stationary
   and Mobile System Use for Speech, Gesture
   and Multimodal Commands
    RECOGNITION RATE
    DIFFERENTIAL                 STATIONARY            MOBILE
    NOISE-CANCELING MIC:
     Speech                            —               -5.0%*
     Gesture                           —                 —
     Multimodal                        —               -3.0%*

    BUILT-IN MIC:
     Speech                            —               -15.0%*
     Gesture                           —                  —
     Multimodal                        —               -13.0%*

             Study 2- Results (cont.)


Compared to traditional speech processing,
spoken language processed within a multimodal
architecture yielded:

 19-35% reduction in total speech error rate
    (for noise-canceling & built-in mics, respectively)

No gender effects found in MD
                              Conclusions

• Multimodal architectures can support mutual
  disambiguation & improved robustness over
  unimodal processing
• Error rate reduction can be substantial— 20-40%
• Multimodal systems can reduce or close the
  recognition rate gap for challenging users (accented
  speakers) & usage contexts (mobile)
• Error-prone recognition technologies can be stabilized
  within a multimodal architecture, which functions more
  reliably in real-world contexts
      Future Directions & Challenges

• Intelligently adaptive processing, tailored for mobile
  usage patterns & diverse users
• Improved language & dialogue processing
  techniques, and hybrid multimodal architectures
• Novel mobile & pervasive multimodal concepts
• Break the robustness barrier— reduce error rate
 (For more information— http://www.cse.ogi.edu/CHCC/)


