

Biologically Inspired Noise-Robust Speech Recognition
for Both Man and Machine

         Mark D. Skowronski
  Computational NeuroEngineering Lab
         University of Florida
            March 26, 2004
    Speech Recognition Motivation

     Speech is the #1 real-time communication
     medium among humans.

     Advantages of voice interface to machines:
     • Hands-free operation
     • Speed
     • Ease of use

                   Man vs. Machine
    Man is a high-performance
    existence proof for speech
    processing in noisy environments.

    Can we emulate man’s
    performance by leveraging
    expert information into our
    machines?
                                  Wall Street Journal/Broadcast news readings, 5000 words
                                  Untrained human listeners vs. Cambridge HTK LVCSR system
      Biologically Inspired Algorithms
    Expert Information is added in three applications:

       Speech enhancement
       for human listeners

      Feature extraction for
      automatic speech recognition

      Classification for automatic
      speech recognition
             Speech Enhancement

    • Noisy cell phone conversations
    • Public address systems
    • Aircraft cockpit

    What can we do to increase
    intelligibility when turning up the
    volume is not an option?
                                                Lombard effect
                               This work funded by the iDEN Technology Group of Motorola
    The Lombard Effect
    Psychophysical changes in vocal characteristics, produced
    by a speaker in the presence of background acoustic noise:

         •   Vocal effort (amplitude) increases
         •   Duration increases
         •   Pitch increases
         •   Formant frequencies increase
         •   Energy center of gravity increases
         •   Consonant-to-Noise ratio increases

             Result: Intelligibility increases
     Psychoacoustic Experiments

      Miller and Nicely (1955): adding AWGN to speech affects place of articulation
       and frication most, voicing and nasality less so.

      Furui (1986): Truncated vowels in consonant-vowel pairs dramatically
       decreased in intelligibility beyond a certain point of truncation. These
       points correspond to spectrally dynamic regions.

    Bottom Line:
    Speech contains regions of relatively high phonetic information, and
    emphasis of these regions increases intelligibility.

    Solution: Energy Redistribution
    We redistribute energy from regions of low information content to regions
    of high information content while conserving overall energy.
    [Figures: consonant confusion data from Miller and Nicely; SFM of “clarification”]

    Energy Redistribution for Voiced/Unvoiced (ERVU) regions.

      Voicing determined by the Spectral
      Flatness Measure (SFM):

          SFM_j = ( prod_k X_j(k) )^(1/N) / ( (1/N) sum_k X_j(k) )

    Xj(k) is the magnitude of the short-term Fourier
    transform of the jth speech window of length N.
                                     M. D. Skowronski, J. G. Harris, and T. Reinke, J. Acoust. Soc. Am., 2002
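A minimal sketch of the SFM-based voicing decision and energy-conserving redistribution described above; the threshold and gain values are hypothetical illustrations, not the published ERVU parameters:

```python
import numpy as np

def spectral_flatness(frame):
    """Spectral Flatness Measure: geometric mean over arithmetic mean
    of the magnitude spectrum (near 1 = noise-like/unvoiced, near 0 = tonal/voiced)."""
    mag = np.abs(np.fft.rfft(frame)) + 1e-12   # avoid log(0)
    gm = np.exp(np.mean(np.log(mag)))
    am = np.mean(mag)
    return gm / am

def redistribute_energy(frames, gain_unvoiced=2.0, sfm_threshold=0.3):
    """Boost noise-like (unvoiced) frames, then rescale all frames so the
    total energy is conserved (the ERVU idea; parameters hypothetical)."""
    gains = np.array([gain_unvoiced if spectral_flatness(f) > sfm_threshold
                      else 1.0 for f in frames])
    out = frames * gains[:, None]
    # conserve overall energy across the utterance
    scale = np.sqrt(np.sum(frames ** 2) / np.sum(out ** 2))
    return out * scale
```

A sinusoidal (voiced-like) frame yields a much lower SFM than a white-noise frame, which is what drives the voiced/unvoiced split.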
    Listening Tests

                      Confusable set test, from Junqua*
                         Set I:   f, s, x, yes
                         Set II:  a, h, k, 8
                         Set III: b, c, d, e, g, p, t, v, z, 3
                         Set IV:  m, n

                      • 500 trials, forced decision
                      • 3 algorithms (control, ERVU, HPF)
                      • 0 dB and -10 dB SNR, AWGN
                      • unlimited playback over headphones
                      • 25 participants, 30-45 minutes

                                   J. C. Junqua, J. Acoust. Soc. Am., 1993*
     Listening Test Results

     [Chart: -10 dB SNR, white noise; results for “S”, “A”, “E”, “M”;
     ERVU error reduced ~20% compared to control.]

Energy Redistribution Summary

     • Developed a real-time algorithm for cell phone
       applications using biological inspiration,
     • Increased intelligibility while maintaining
       naturalness and conserving energy,
     • Effective because everyday speech is not clearly
       articulated,
     • ERVU is a novel approach to speech enhancement
       that works on either clean speech or noise-reduced
       speech.
                         M. D. Skowronski and J. G. Harris, J. Acoust. Soc. Am., 2004b (in preparation)
                   Feature Extraction
     ASR: Input → Feature Extraction → Classification

     Information: phonetic,
     gender, age, emotion, pitch,
     accent, physical state,
     additive/channel noise.

     [Figure: HFCC filter bank]
     Existing Algorithms
       Goal: emphasize phonetic information over other info streams.
        Feature algorithms:
        • Acoustic: formant frequencies, bandwidths
        • Model based: linear prediction
        • Filter bank based: mel frequency cepstral coefficients (MFCC)
       Provides dimensionality reduction on quasi-stationary windows.

     MFCC Filter Bank

      • Design parameters: FB freq range, number of filters
      • Center freqs equally-spaced in mel frequency
      • Triangle endpoints set by center freqs of adjacent filters

           Although filter spacing is determined by perceptual mel
           frequency scale, bandwidth is set more for convenience
           than by biological arguments.
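The filter bank construction described above can be sketched as follows; the mel formula and FFT sizing used here are common conventions assumed for illustration, not taken from the talk:

```python
import numpy as np

def hz_to_mel(f):
    """Common mel approximation (one of several variants in use)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_filter_bank(n_filters=20, f_lo=0.0, f_hi=4000.0, n_fft=512, fs=8000):
    """Triangular filters: centers equally spaced in mel frequency; each
    triangle's endpoints are the center frequencies of its neighbors."""
    mels = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    hz = mel_to_hz(mels)                               # endpoints and centers
    bins = np.floor((n_fft + 1) * hz / fs).astype(int) # map Hz to FFT bins
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)  # rising edge
        fb[i - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)  # falling edge
    return fb
```

Because each triangle's endpoints are its neighbors' centers, widening the spacing automatically widens the bandwidth; this coupling is exactly what HFCC removes.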

     HFCC Filter Bank
      HFCC: human factor cepstral coefficients

      • Decouples filter bandwidth from filter spacing,
      • Sets filter width according to the critical bandwidth of the
        human auditory system,
       • Uses the Moore and Glasberg approximation of critical
         bandwidth, defined in Equivalent Rectangular Bandwidth
         (ERB):

           ERB = 6.23 fc^2 + 93.39 fc + 28.52  (Hz)

       fc is critical band center frequency (kHz).
                                                    M. D. Skowronski and J. G. Harris, ICASSP, 2002
     HFCC with E-factor
      Linear ERB scale factor (E-factor) controls filter bandwidth

      [Filter bank plots for E-factor = 1 and E-factor = 3]

       • Controls tradeoff between local SNR and spectral resolution,
       • Exemplifies the benefits of decoupling filter bandwidth from filter
         spacing.
                                   M. D. Skowronski and J. G. Harris, J. Acoust. Soc. Am., 2004a (submitted)
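The bandwidth rule above can be sketched in a few lines; the polynomial is the standard Moore and Glasberg ERB approximation, and the linear E-factor scaling follows the slides:

```python
def erb_hz(fc_khz, e_factor=1.0):
    """Moore & Glasberg critical bandwidth (ERB) in Hz for a filter
    centered at fc_khz (in kHz). e_factor linearly scales the bandwidth
    as in HFCC-E; e_factor = 1 recovers plain HFCC."""
    return e_factor * (6.23 * fc_khz ** 2 + 93.39 * fc_khz + 28.52)
```

Because bandwidth comes from this formula rather than from neighbor spacing, the filter count and frequency range can be changed without altering each filter's width.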
     ASR Experiments

        • Isolated English digits “zero” through “nine” from
          TI-46 corpus, 8 male speakers,
        • HMM word models, 8 states per model, diagonal
          covariance matrix,
        • Three MFCC versions (different filter banks),
        • Linear ERB scale factor (E-factor),
        • HFCC with E-factor (HFCC-E).

       Total: 37.9 million frames of speech (>100 hours)

     ASR Results

         [Plots: recognition accuracy in white noise (global SNR), HFCC-E vs. D&M;
         effect of the linear ERB scale factor (E-factor).]
                                       M. D. Skowronski and J. G. Harris, ISCAS, 2003
     HFCC Summary

     • Adds biologically inspired bandwidth to filter bank
       of popular speech feature extractor,
     • Provides superior noise-robust performance over
       MFCC and variants,
     • Allows for further filter bank design modifications,
       demonstrated by HFCC with E-factor,
      • HFCC has the same computational cost as MFCC;
        only the filter bank coefficients are adjusted: easy to
        implement.
     •   HMM Limitations & Variations
     •   Freeman Model Introduction
     •   Model Hierarchy
     •   Associative Memory
     •   ASR Experiments

     Freeman’s Reduced KII Network
                   This work funded by the Office of Naval Research grant N00014-1-1-0405
 HMM Limitations & Variations

     • HMM is piece-wise stationary; speech is nonstationary,
     • Assumes frames are i.i.d.; speech is coarticulated,
     • State PDFs are data-driven; curse of dimensionality.
      Deng (1992): trended HMM
      Rabiner (1986): autoregressive HMM
      Morgan & Bourlard (1995): HMM/MLP hybrid
      Robinson (1994): context-dependent RNN
      Herrmann (1993): transient attractor network
      Liaw & Berger (1996): dynamic synapse RNN
      Freeman (1997): non-convergent dynamic biological model
      (ordered from HMM-like to nonlinear)
     Freeman Model
     Hierarchical nonlinear dynamic model of cortical signal
     processing from rabbit olfactory neo-cortex.

     K0 cell: H(s) = ab / ((s+a)(s+b)), a 2nd-order low-pass filter

     Reduced KII (RKII) cell (stable oscillator)
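The summary later mentions documenting an impulse-invariance discretization of this model; a minimal sketch for the K0 cell's second-order low-pass H(s) = ab/((s+a)(s+b)), assuming the rate constants a = 220 s⁻¹ and b = 720 s⁻¹ commonly cited for Freeman's model:

```python
import numpy as np

# Rate constants commonly cited for Freeman's K0 cell (assumed here)
A, B = 220.0, 720.0   # 1/s

def k0_continuous_impulse(t):
    """Impulse response of H(s) = a*b / ((s+a)(s+b))."""
    return A * B / (B - A) * (np.exp(-A * t) - np.exp(-B * t))

def k0_impulse_invariant(n_samples, fs):
    """Discretize by impulse invariance: h[n] = (1/fs) * h(n/fs),
    so the discrete impulse response samples the continuous one."""
    n = np.arange(n_samples)
    return k0_continuous_impulse(n / fs) / fs
```

Since H(0) = 1, the discrete impulse response should sum to approximately 1 at any reasonable sampling rate, which is a quick sanity check on the discretization.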

     RKII Network
      High-dimensional, scalable network of stable oscillators.
      Fully connected M-cell and G-cell weight matrices (zero diagonal).

      Capable of several dynamic behaviors:
      • Stable attractors (limit cycle, fixed point)
      • Chaos
      • Spatio-temporal patterns
      • Synchronization
                   [Diagram: generalization; associative memory]
     Oscillator Network
      Two regimes of operation as an associative memory of binary patterns:

        • Energy
        • Synchronization Through Stimulation (STS)

        Network weights for each regime set by an outer
        product rule variation and by hand.
                                  M. D. Skowronski and J. G. Harris, Phys. Rev. E, 2004 (in preparation)
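The outer-product weight construction mentioned above can be sketched in Hopfield style; this is a standard variant with zeroed diagonal (as in the fully connected M-cell and G-cell matrices), not the thesis's exact rule variation:

```python
import numpy as np

def outer_product_weights(patterns):
    """Hopfield-style outer product rule: sum of outer products of
    bipolar (+1/-1) patterns, diagonal zeroed as in the RKII network."""
    P = np.atleast_2d(patterns).astype(float)
    W = P.T @ P
    np.fill_diagonal(W, 0.0)
    return W / P.shape[0]

def recall(W, probe, steps=10):
    """Synchronous sign-threshold recall from a noisy bipolar probe."""
    x = np.array(probe, dtype=float)
    for _ in range(steps):
        x = np.sign(W @ x)
        x[x == 0] = 1.0   # break ties toward +1
    return x
```

With two orthogonal stored patterns, a probe with one flipped bit settles back onto the stored pattern in a single update.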
     Associative Memory
              [Figure: input/output pattern pairs demonstrating associative recall]
     ASR with RKII Network
     Two-Class Case
     • \IY\ from “she”
     • \AA\ from “dark”
     • 10 HFCC-E coeffs. converted to binary
     • Energy-based RKII associative memory
     • No overlap between learned centroids

     Classifier          True class   →\IY\   →\AA\   Correct (%)
     Bayes, continuous      \IY\       2705       0      99.9
                            \AA\          8    4340
     Bayes, binary          \IY\       2701       4      98.4
                            \AA\        110    4238
     Hamming distance       \IY\       2658      47      93.7
                            \AA\        394    3954
     RKII, exact            \IY\       2593       6      87.3
                            \AA\        202    3564
     RKII, Hamming          \IY\       2666      39      92.7
                            \AA\        479    3869
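The binary feature conversion and the Hamming-distance baseline from the table can be sketched as follows; per-dimension median thresholding is an assumed conversion scheme, not necessarily the one used in the thesis:

```python
import numpy as np

def to_binary(features, thresholds=None):
    """Convert real-valued feature vectors (rows) to binary by
    per-dimension thresholding (median by default, an assumption)."""
    f = np.atleast_2d(features)
    if thresholds is None:
        thresholds = np.median(f, axis=0)
    return (f > thresholds).astype(int), thresholds

def hamming_classify(binary_vec, centroids):
    """Return the index of the learned binary centroid with minimum
    Hamming distance to the input vector."""
    d = np.sum(np.array(centroids) != binary_vec, axis=1)
    return int(np.argmin(d))
```

This is the static baseline the RKII associative memory is compared against: the RKII network's attractors play the role of the learned centroids.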

     ASR with RKII Network
     Three-Class Case
     • \IY\ from “she”
     • \AA\ from “dark”
     • \AE\ from “ask”
     • 18 HFCC-E coeffs.
       converted to binary
     • Energy-based RKII
       associative memory
     • Variable overlap between
       learned centroids

      Overlap is controlled by the binary feature conversion:
      more overlap → more spurious outputs.
     Freeman Model Summary
      • Documented impulse invariance discretization,
       • Developed software tools, enabling large-scale simulations,
      • Demonstrated stable attractors in Freeman model,
      • Explained attractor instability by transient chaos,
      • Proposed two regimes of associative memory,
      • Invented novel synchronization mechanism (STS),
       • Devised a variation of the outer product rule as the
         oscillator network learning rule,
       • Proved practical probability results concerning overlap
         in the three-class case,
      • Applied novel static pattern classifier to ASR.

     Conclusions

      Developed a novel speech enhancement algorithm:
       - Lombard effect indicates what to modify,
       - Psychoacoustic experiments indicate where to modify,
       - ERVU reduces human recognition error 20-40% in noisy conditions.
      Extended an existing speech feature extraction algorithm:
       - Critical bandwidth used to decouple filter bandwidth and spacing,
       - HFCC-E demonstrates a research direction for independent filter bandwidth,
       - HFCC-E improves ASR by 7 dB SNR.
      Advanced knowledge of nonlinear dynamic models for information processing:
       - Applied the model to ASR of static speech features,
       - Near-optimum performance of the RKII network associative memory
         using first-order statistics.

