The 2000 NRL Evaluation for Recognition of Speech in Noisy Environments

MITRE / MS State - ISIP

Burhan Necioglu, Bryan George, George Shuttic
The MITRE Corporation

Ramasubramanian Sundaram, Joe Picone
Mississippi State University, Institute for Signal and Information Processing
INTRODUCTION

 •  Collaboration between The MITRE Corporation and the Mississippi State
    University Institute for Signal and Information Processing (ISIP)
    – Primary goal: evaluate the impact of noise pre-processing
      developed for other DoD applications
 •  MITRE:
    – Focus on robust speech recognition using noise reduction
      techniques, including the effects of tactical communications links
    – Distributed information access systems for military
      applications (DARPA Communicator)
 •  Mississippi State:
    – Focus on stable, practical, advanced LVCSR technology
    – Open-source large vocabulary speech recognition tools
    – Training, education, and dissemination of information related
      to all aspects of speech research
 •  The ISIP-STT system used a combination of technologies from both
    organizations
OVERVIEW OF THE SYSTEM


 •  Standard MFCC front-end with side-based CMS (sketched after this list)
 •  Acoustic modeling:
    – Left-to-right model topology
    – Skip states for special models such as silence
    – Continuous-density mixture-Gaussian HMMs
    – Both Baum-Welch and Viterbi training supported
    – Phonetic decision-tree-based state-tying
 •  Hierarchical-search Viterbi decoder
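
The MFCC-plus-CMS front end above can be approximated with standard tools. Below is a minimal sketch, assuming librosa and numpy are available and using the 25 ms Hamming window / 100 frames-per-second framing quoted later in this deck; the function name and file path are illustrative, not part of the ISIP system.

    # Sketch: MFCCs for one conversation side, with side-based cepstral
    # mean subtraction (the mean is taken over the whole side, not per
    # utterance, so channel effects are removed consistently).
    import numpy as np
    import librosa

    def side_cms_mfcc(side_wav, n_mfcc=13):
        y, sr = librosa.load(side_wav, sr=None)
        mfc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                   window="hamming",
                                   n_fft=int(0.025 * sr),       # 25 ms window
                                   hop_length=int(0.010 * sr))  # 100 frames/s
        return mfc - mfc.mean(axis=1, keepdims=True)             # side-based CMS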
STATE-TYING: MOTIVATION




 •  Context-dependent models give better performance
 •  ...but greatly increase the parameter count
 •  Need to reduce computation without degrading performance; decision-tree
    state-tying addresses this (see the sketch below)
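
A common way to formalize this trade-off (e.g., in HTK-style systems) is decision-tree state-tying driven by a single-Gaussian log-likelihood gain: a phonetic question splits a pool of context-dependent states only if the gain is large enough. The sketch below shows that criterion under diagonal-covariance assumptions; it is illustrative, not the exact ISIP implementation.

    import numpy as np

    def pooled_loglik(counts, means, variances):
        # Max log-likelihood of one diagonal Gaussian fit to pooled state stats.
        n = counts.sum()
        mu = (counts[:, None] * means).sum(axis=0) / n
        ex2 = (counts[:, None] * (variances + means**2)).sum(axis=0) / n
        var = np.maximum(ex2 - mu**2, 1e-6)
        return -0.5 * n * (np.log(2 * np.pi * var) + 1.0).sum()

    def split_gain(counts, means, variances, in_yes):
        # Likelihood gain from splitting the pool by a yes/no phonetic question.
        yes, no = in_yes, ~in_yes
        if not yes.any() or not no.any():
            return -np.inf
        return (pooled_loglik(counts[yes], means[yes], variances[yes])
                + pooled_loglik(counts[no], means[no], variances[no])
                - pooled_loglik(counts, means, variances))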
FEATURES AND PERFORMANCE

 •  Batch processing
 •  Real-time performance of the training process during various stages
DECODER: OVERVIEW


 •  Algorithmic features:
    – Single-pass decoding
    – Hierarchical Viterbi search (core recursion sketched after this list)
    – Dynamic network expansion
 •  Functional features:
    – Cross-word context-dependent acoustic models
    – Word-graph rescoring, forced alignments, N-gram decoding
 •  Structural features:
    – Word-graph compaction
    – Multiple pronunciations
    – Memory management
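
The recursion underneath all of these features is the standard time-synchronous Viterbi search. A minimal log-space sketch is shown below; the real decoder layers a hierarchical lexical network, dynamic network expansion, and pruning on top of this, none of which are shown.

    import numpy as np

    def viterbi(log_obs, log_trans, log_init):
        # log_obs: (T, S) frame log-likelihoods; log_trans[i, j] = log P(j | i).
        T, S = log_obs.shape
        score = log_init + log_obs[0]
        back = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            cand = score[:, None] + log_trans        # score via each predecessor
            back[t] = cand.argmax(axis=0)            # best predecessor per state
            score = cand[back[t], np.arange(S)] + log_obs[t]
        path = [int(score.argmax())]                 # trace back the best path
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]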
EVALUATION SYSTEM - NOISE PREPROCESSING

 •  Uses the Harsh Environment Noise Pre-Processor (HENPP) front end to
    remove noise from the input speech
 •  HENPP was developed by AT&T to address background-noise effects in DoD
    speech coding environments (see Accardi and Cox, and Malah et al.,
    ICASSP 1999)
 •  Multiplicative spectral processing: minimal distortion, eliminates
    “doodley-doos” (aka “musical noise”)
 •  “Minimum statistics” noise adaptation handles quasi-stationary additive
    noise (random and stochastic) without assumptions (both ideas are
    sketched below)
 •  Limitations:
    – Not designed to address transient noise
    – Noise adaptation is sensitive to “push-to-talk” effects
 •  The integrated 2.4 kbps MELP/HENPP has been demonstrated successfully in
    low- to moderate-perplexity ASR:
    (Figure: comparison of LPC-10, MELP, and MELP/HENPP conditions)
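
The two ideas named above can be illustrated with a toy version: a Wiener-like multiplicative gain in the STFT domain, driven by a “minimum statistics” noise estimate that tracks the minimum smoothed power in each frequency bin over a sliding window. This is a sketch of the concepts only, not the AT&T HENPP; the window length, smoothing constant, and gain floor are made-up parameters.

    import numpy as np

    def spectral_gain(power_spec, win=100, alpha=0.9, floor=0.1):
        # power_spec: (T, F) magnitude-squared STFT frames of one side.
        # Returns a (T, F) multiplicative gain to apply to the complex STFT.
        smoothed = power_spec.copy()
        for t in range(1, len(power_spec)):
            smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * power_spec[t]
        gain = np.empty_like(power_spec)
        for t in range(len(power_spec)):
            noise = smoothed[max(0, t - win):t + 1].min(axis=0)   # minimum statistics
            snr = np.maximum(power_spec[t] / np.maximum(noise, 1e-10) - 1.0, 0.0)
            gain[t] = np.maximum(snr / (snr + 1.0), floor)        # floored Wiener gain
        return gain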
EVALUATION SYSTEM - DATA AND TRAINING




 •  10 hours of SPINE data used for training; no DRT words
 •  100 frames per second, 25 ms Hamming window
 •  12 base FFT-derived mel cepstra with side-based CMS, plus log energy
 •  Delta and acceleration coefficients (the resulting 39-dimensional
    vector is sketched below)
 •  44-phone set to cover the SPINE data
 •  909 models, 2725 states
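
With these settings the static vector is 13-dimensional (12 mel cepstra plus log energy), and appending delta and acceleration coefficients gives the usual 39 dimensions. A minimal sketch assuming librosa/numpy; the helper name is illustrative.

    import numpy as np
    import librosa

    def build_features(cepstra_12, log_energy):
        # cepstra_12: (12, T) mel cepstra after side-based CMS; log_energy: (T,).
        static = np.vstack([cepstra_12, log_energy[None, :]])  # 13 x T
        delta = librosa.feature.delta(static, order=1)         # 13 x T
        accel = librosa.feature.delta(static, order=2)         # 13 x T
        return np.vstack([static, delta, accel])               # 39 x T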
EVALUATION SYSTEM - LM and LEXICON



 •  5226 words in the SPINE lexicon, provided by CMU
 •  CMU language model
 •  Bigram LM obtained by discarding the trigrams (see the sketch below)
 •  LM size: 5226 unigrams, 12511 bigrams
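
Dropping the trigram section of an ARPA-format model is mechanically simple; a rough sketch follows. The filename handling is illustrative, and a careful implementation would also strip the now-unused back-off weights from the bigram entries and renormalize.

    def keep_bigrams(arpa_in, arpa_out):
        # Copy an ARPA LM, skipping the 3-gram section and its count header.
        skipping = False
        with open(arpa_in) as fin, open(arpa_out, "w") as fout:
            for line in fin:
                if line.startswith("\\3-grams:"):
                    skipping = True               # start of the trigram section
                elif line.startswith("\\end\\"):
                    skipping = False              # keep the end marker
                if skipping or line.strip().startswith("ngram 3"):
                    continue
                fout.write(line)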
EVALUATION SYSTEM - DECODING




 •  Single-stage decoding using word-internal acoustic models and a bigram LM
RESULTS AND ANALYSIS


   Experiment                                       WER (%)  Subs (%)  Dels (%)  Ins (%)
   Baseline ISIP-STT                                   56.2      26.0      21.1      9.0
   Noise pre-processed training & evaluation data      58.4      27.1      24.9      6.5
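
As a consistency check, the WER column is the sum of the substitution, deletion, and insertion rates: 26.0 + 21.1 + 9.0 = 56.1 ≈ 56.2% for the baseline and 27.1 + 24.9 + 6.5 = 58.5 ≈ 58.4% for the noise pre-processed system (the small differences are rounding).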



 •  Lattice generation and lattice rescoring are expected to improve results
 •  Informal analysis of the evaluation data and results:
    – Negative correlation between recognition performance and SNR
RESULTS AND ANALYSIS (cont.)

 •  Clean speech example: “B” side of spine_eval_033 (281 total words)

   Experiment                                       Correct  Subs  Dels  Ins  Tot err
   Baseline ISIP-STT                                    221    36    24    4       64
   Noise pre-processed training & evaluation data       198    37    46    6       89

 •  Low SNR example: “A” side of spine_eval_021 (115 total words)

   Experiment                                       Correct  Subs  Dels  Ins  Tot err
   Baseline ISIP-STT                                     72    25    18    4       47
   Noise pre-processed training & evaluation data        80    18    17    3       38
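
Expressed as per-side error rates (total errors divided by total words), these counts work out to roughly 64/281 ≈ 22.8% vs. 89/281 ≈ 31.7% on the clean side, and 47/115 ≈ 40.9% vs. 38/115 ≈ 33.0% on the low-SNR side, which is the basis for the analysis that follows.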
RESULTS AND ANALYSIS (cont.)

 •  HENPP was designed for human listening purposes
    – Optimized to raise DRT scores in the presence of noise and coding
    – DRT scores and WER tend to be poorly correlated; minor perceptual
      distortions often have a magnified adverse effect on speech recognizers
 •  Need to retune the HENPP
    – The algorithm is very effective for robust recognition of noisy
      speech at low SNRs
    – Too aggressive when applied to clean speech; some information is lost
    – Minor adjustments should preserve noisy-speech performance and boost
      clean-speech performance
ISSUES


 •  Decoding is slow on this task
    – About 100x real time (on a 600 MHz Pentium)
    – A newer version of the ISIP-STT decoder will be faster
    – Had to use a bigram LM in the allowed time frame
 •  Large amount of eval data
    – Combined with slow decoding, this seriously limited experiments
 •  The devil is in the details:
    – Certain training data is problematic, e.g. “Noise field is
      <long silence> up”
    – Automatic segmentation (having the eval segmentations would help)
CONCLUSIONS

 •  MITRE / MS State-ISIP system: a standard recognition approach using an
    advanced noise-preprocessing front end
 •  Time limitation: could only officially report on the baseline system
 •  Performed an initial experiment with noise preprocessing (AT&T HENPP)
    – Overall word error rate did not improve
    – Informal analysis suggests that for low-SNR conversations, noise
      pre-processing does help
    – Difficulty with high-SNR conversations
 •  There is potential for improvement with application-specific tuning of
    the HENPP
 •  The approach is very promising for coded speech in commercial and
    military environments

								