Speech Discrimination Based on
           Multiscale Spectro–Temporal Modulations
   Nima Mesgarani, Shihab Shamma,         Malcolm Slaney
        University of Maryland                 IBM

                    Reporter : Chen, Hung-Bin

ICASSP 2004                                                1

•   Introduction: VAD (Voice Activity Detection) and speech segmentation
     – discriminate speech from non-speech, which consists of noise sounds
     – uses multiscale spectro-temporal modulation features extracted with a
       model of the auditory cortex

•   Two state-of-the-art systems
     – Robust Multifeature Speech/Music Discriminator
     – Robust Speech Recognition In Noisy Environments

•   Auditory model

•   Experimental results

•   Summary and Conclusions

                       Introduction - VAD

•   significance
    – For speech recognition systems designed for real-world conditions,
      robust discrimination of speech from other sounds is a crucial step.

•   advantage
    – Speech discrimination can also be used for coding or
      telecommunication applications.

•   proposed system
    – a feature set inspired by investigations of various stages of the auditory
      system

              Two state-of-the-art systems

•   Multi–feature System
    – Features
         • Thirteen features in Time,
           Frequency, and Cepstrum domain
           are used to model speech and music
    – Classification
         • A Gaussian mixture model (GMM)
           models each class of data as the
           union of several Gaussian clusters in
           the feature space.
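
The GMM decision rule above can be sketched as follows. This is a minimal illustration, not the implementation from [1]: it assumes diagonal covariances, and the two-dimensional model parameters are hypothetical placeholders rather than trained values. Each class is scored by the log-likelihood of its mixture, and the class with the higher score wins.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM."""
    x = np.asarray(x, dtype=float)
    # Per-component log density of a diagonal Gaussian.
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over the mixture components for numerical stability.
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

def classify(x, models):
    """Return the class label whose GMM gives x the highest likelihood."""
    return max(models, key=lambda label: gmm_log_likelihood(x, *models[label]))

# Hypothetical models as (weights, means, variances); not values from [1].
models = {
    "speech": (np.array([0.5, 0.5]),
               np.array([[0.0, 0.0], [1.0, 1.0]]),
               np.ones((2, 2))),
    "music": (np.array([0.5, 0.5]),
              np.array([[5.0, 5.0], [6.0, 6.0]]),
              np.ones((2, 2))),
}
label = classify([0.5, 0.5], models)
```

In practice the mixture parameters would be fitted to training data (for example with EM) on the thirteen features; only the scoring step is shown here.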

•   Reference:
    – [1] E. Scheirer, M. Slaney, "Construction
      and evaluation of a robust multifeature
      speech/music discriminator", ICASSP'97, 1997.

         Two state-of-the-art systems (cont)

•   Voicing–energy System
    – Features
          • Frame-by-frame maximum autocorrelation and log-energy features are used
            to make the speech/non-speech decision.
          • PLP
          • LDA+MLLT

    – Segmentation
          • use an HMM-based segmentation procedure with two models, one for
            speech segments and one for non-speech segments.
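
The two frame-level features named above can be sketched as follows. This is an illustrative computation only, not the code of [2]: the frame length, hop size, and lag search range are hypothetical defaults (roughly 25 ms frames and an 80 to 800 Hz pitch range at a 16 kHz sampling rate), and the PLP, LDA+MLLT, and HMM stages are not shown.

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160, min_lag=20, max_lag=200):
    """Return one (max normalized autocorrelation, log-energy) row per frame."""
    signal = np.asarray(signal, dtype=float)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frame = frame - frame.mean()
        energy = np.sum(frame ** 2)
        log_energy = np.log(energy + 1e-12)
        # Autocorrelation at non-negative lags, normalized by the lag-0 energy.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        ac = ac / (energy + 1e-12)
        # Voiced (periodic) frames show a strong peak at the pitch lag.
        feats.append((np.max(ac[min_lag:max_lag + 1]), log_energy))
    return np.array(feats)
```

A periodic, voiced-like frame yields a high autocorrelation peak, while noise-like frames yield a low one, which is what makes the pair useful for a speech/non-speech decision.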

•   Reference:
    –   [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan and R. Sarikaya, "Robust speech recognition in
        noisy environments: The 2001 IBM SPINE evaluation system", ICASSP 2002, vol. I, pp. 53–56, 2002.

                            Auditory model

•     The computational auditory model is based on neurophysiological,
      biophysical, and psychoacoustical investigations at various
      stages of the auditory system.

•     transformation of the acoustic signal into an internal neural
      representation (auditory spectrogram)

                  Auditory model (cont)

•   a complex spatiotemporal pattern
    – vibrations along the basilar membrane of the cochlea

•   3–step process
    1) highpass filter, followed by an instantaneous nonlinear compression
    2) lowpass filter (hair cell membrane leakage)
    3) detection of discontinuities in the responses across the tonotopic
       axis of the auditory nerve array

    – implemented computationally via a bank of modulation-selective filters
      centered at each frequency along the tonotopic axis.
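
The three steps above can be sketched on an array of cochlear filter outputs. This is a rough illustration under stated assumptions, not the model's actual implementation: the first difference as highpass, `tanh` as the compression, the leaky-integrator coefficient `alpha`, and the half-wave rectification at the end are all hypothetical simplifications.

```python
import numpy as np

def early_auditory_stages(y, alpha=0.9):
    """y: array of shape (channels, time) holding cochlear filter outputs."""
    y = np.asarray(y, dtype=float)
    # 1) Highpass in time (first difference), then instantaneous compression.
    hp = np.diff(y, axis=1, prepend=y[:, :1])
    compressed = np.tanh(hp)
    # 2) Lowpass in time: first-order leaky integrator per channel.
    lp = np.zeros_like(compressed)
    state = np.zeros(compressed.shape[0])
    for t in range(compressed.shape[1]):
        state = alpha * state + (1.0 - alpha) * compressed[:, t]
        lp[:, t] = state
    # 3) Difference across the channel (tonotopic) axis, half-wave rectified,
    #    so only discontinuities between neighboring channels survive.
    edges = np.diff(lp, axis=0, prepend=lp[:1, :])
    return np.maximum(edges, 0.0)
```

When all channels carry the same signal there is no discontinuity across the tonotopic axis, and the output is zero everywhere.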

                   Auditory model (cont)

•   Sound is analyzed by a model of the cochlea (depicted on the left)
    consisting of a bank of 128 constant-Q bandpass filters with center
    frequencies equally spaced on a logarithmic frequency axis
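
Center frequencies equally spaced on a logarithmic axis can be generated as below. The slide gives only the channel count; the density of 24 filters per octave and the lowest center frequency of 180 Hz are assumptions made for illustration.

```python
def center_frequencies(n_channels=128, per_octave=24, f_low=180.0):
    """Center frequencies (Hz) equally spaced on a log2 frequency axis.

    per_octave and f_low are assumed values, not taken from the slides.
    """
    return [f_low * 2.0 ** (k / per_octave) for k in range(n_channels)]

cf = center_frequencies()
```

With these assumed values, each step multiplies the frequency by the same factor, and the 128 channels span 127/24, or about 5.3, octaves. A constant-Q bank then gives each filter a bandwidth proportional to its center frequency.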

              Multilinear Analysis Of Cortical Representation
•   The output of the auditory model is a multidimensional array.
•   The time dimension is averaged over a given time window, which
    results in a three-mode tensor for each time window, with each
    element representing the overall modulation at the corresponding
    frequency, rate and scale (128 (frequency channels) × 26 (rates) × 6
    (scales)).
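
The time-averaging step can be sketched as follows, assuming the model output is stored as a magnitude array with axes (frequency, rate, scale, time). The function name and the choice of non-overlapping windows are illustrative assumptions.

```python
import numpy as np

def windowed_tensors(cortical, window):
    """Average a (frequency, rate, scale, time) array over time windows.

    Returns an array of shape (n_windows, frequency, rate, scale); trailing
    time frames that do not fill a whole window are dropped.
    """
    n_windows = cortical.shape[-1] // window
    trimmed = cortical[..., :n_windows * window]
    blocks = trimmed.reshape(cortical.shape[:-1] + (n_windows, window))
    # Mean over the samples inside each window, then put windows first.
    return np.moveaxis(blocks.mean(axis=-1), -1, 0)
```

For a 128 × 26 × 6 × T input, each returned slice is one 128 × 26 × 6 three-mode tensor.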

              Multilinear Analysis Of Cortical
                  Representation (cont)
•   Using multi-dimensional PCA to tailor the amount of reduction in each
    subspace independently.
•   To generalize the multidimensional tensors concept, we consider a
    generalization of SVD (Singular Value Decomposition) to tensors.

•   $D = S \times_1 U_{frequency} \times_2 U_{rate} \times_3 U_{scale} \times_4 U_{samples}$
     – D : the data tensor
     – S : the core tensor, of size $I_1 \times I_2 \times \dots \times I_N$

•   Original : 128 (frequency channels) × 26 (rates) × 6 (scales)
•   The reduced tensor, which retains the leading singular vectors in each
    mode (7 for frequency, 5 for rate and 3 for scale dimensions), is used
    for classification.
•   Classification was performed using a Support Vector Machine (SVM)
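
The per-mode reduction can be sketched with a truncated higher-order SVD. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the data tensor is assumed to be stacked as (frequency, rate, scale, samples). Each mode's basis comes from the SVD of that mode's unfolding, and each sample is then projected onto the retained vectors.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize a tensor so that the given mode indexes the rows."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply a tensor by a matrix along the given mode."""
    moved = np.moveaxis(tensor, mode, -1)
    return np.moveaxis(moved @ matrix.T, -1, mode)

def truncated_bases(data, ranks):
    """Leading left singular vectors of the first len(ranks) mode unfoldings."""
    bases = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(data, mode), full_matrices=False)
        bases.append(u[:, :r])
    return bases

def project(sample, bases):
    """Project one (frequency, rate, scale) tensor onto the retained subspaces."""
    out = sample
    for mode, u in enumerate(bases):
        out = mode_product(out, u.T, mode)
    return out
```

With ranks (7, 5, 3), a 128 × 26 × 6 sample reduces to a 7 × 5 × 3 tensor, that is, 105 values, which could be flattened into the feature vector passed to the classifier.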

                       Experimental Results

•   Audio Database from TIMIT
     – Training data : 300 samples
     – Testing data : 150 different sentences spoken by 50 different speakers
       (25 male, 25 female)
     – training and test sets were different.
•   The non-speech class
     – sounds from the BBC Sound Effects audio CD, the RWC Genre Database, and
       the Noisex and Aurora databases were assembled together.
•   The training set
     – 300 speech and 740 non-speech samples
•   the testing set
     – 150 speech and 450 non-speech samples
•   The audio length is equal.

              Experimental Results (cont)

• speech detection/discrimination
    – Tables 1 and 2 show the effect

              Experimental Results (cont)

• In these tests, white and pink noise were added to speech at
  specified signal-to-noise ratios (SNR).
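
Mixing noise into speech at a target SNR amounts to scaling the noise so that the power ratio matches 10^(SNR/10). The sketch below is one common way to do this, assuming the noise is at least as long as the speech; the function name is illustrative, and generating the white or pink noise itself is left to the caller.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Return speech plus noise scaled to the requested SNR in dB."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose gain so that p_speech / (gain**2 * p_noise) == 10**(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Lower `snr_db` values give a larger gain on the noise and therefore a harder discrimination task.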

              Experimental Results (cont)

• Effect of different levels of reverberation on the performance

              Summary and Conclusions

• This work is but one in a series of efforts at incorporating
  multi–scale cortical representations (and more broadly,
  perceptual insights) in a variety of audio and speech
  processing applications.

•   Applications such as
    – automatic classification
    – segmentation of animal sounds
    – an efficient encoding of speech and music


•   Two state-of-the-art systems
     – [1] E. Scheirer, M. Slaney, "Construction and evaluation of a robust multifeature
       speech/music discriminator", ICASSP'97, 1997.
     – [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan and R. Sarikaya, "Robust
       speech recognition in noisy environments: The 2001 IBM SPINE evaluation
       system", ICASSP 2002, vol. I, pp. 53–56, 2002.

•   Central Auditory System
     – [4] K. Wang and S. A. Shamma, "Spectral shape analysis in the central auditory
       system", IEEE Trans. Speech Audio Proc., vol. 3 (5), pp. 382–395, 1995.
     – [6] M. Elhilali, T. Chi and S. A. Shamma, "A spectro-temporal modulation index
       (STMI) for assessment of speech intelligibility", Speech Comm., vol. 41,
       pp. 331–348, 2003.
     – Auditory cortical representation of complex acoustic spectra as inferred from the
       ripple analysis method

     – http://www.isr.umd.edu/People/faculty/Shamma.html

