Suppression of Slowly-Varying Components and Offset of Power Contour for Robust Speech Recognition

Chanwoo Kim 1, Khistiz Kumar 2, Bhiksha Raj 3, and Richard M. Stern 4
Department of Electrical and Computer Engineering, and 1,3,4 Language Technologies Institute
Carnegie Mellon University, Pittsburgh, PA 15213 USA

Abstract

In this paper, we present a novel algorithm called Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF) to enhance spectral features for robust speech recognition, especially in reverberant environments. This algorithm is motivated by the precedence effect and the modulation frequency characteristics of the human auditory system. We present two slightly different types of SSF (SSF Type-I and SSF Type-II), depending on whether or not we use a low-passed power envelope signal to suppress falling edges. SSF algorithms can be implemented for on-line processing. Experimental results show that this algorithm performs especially well in reverberant environments.

Index Terms: Robust speech recognition, speech enhancement, precedence effect, modulation frequency

1. Introduction

In spite of continued efforts by a large number of researchers, improving the noise robustness of automatic speech recognition systems remains a very challenging problem. For stationary noise such as white or pink noise, algorithms such as Histogram Normalization (HN) (e.g. [1]) or Vector Taylor Series (VTS) [2] have been shown to be effective. However, for more realistic noise such as background music, the same kind of improvement has not been observed [3]. In these more difficult environments, it has frequently been observed that algorithms motivated by human auditory processing (e.g. [4]) or missing-feature algorithms (e.g. [5]) are more promising.

It has long been believed that modulation frequency plays an important role in human listening. For example, it has been observed that the human auditory system is more sensitive to modulation frequencies below 20 Hz (e.g. [6] [7] [8]). On the other hand, very slowly changing components (e.g. below 5 Hz) are usually related to noise sources (e.g. [9] [10] [11]). Based on these observations, researchers have tried to use modulation frequency information to improve speech recognition performance in noisy environments. Typical approaches use high-pass or band-pass filtering in the spectral, log-spectral, or cepstral domain. In [12], Hirsch et al. investigated the effects of high-pass filtering the spectral envelope of each frequency subband. Unlike the RASTA (Relative Spectral) processing proposed by Hermansky in [13], Hirsch conducted high-pass filtering in the power domain, which can be expressed using the following transfer function:

$$ H(z) = \frac{1 - z^{-1}}{1 - 0.7 z^{-1}} \tag{1} $$

This first-order IIR filter can be implemented by subtracting an exponentially weighted moving average from the current power.

Another common difficulty in robust speech recognition is reverberation. One of the most widely used approaches to reverberation is based on the "precedence effect": the auditory system focuses on the direction of the first wave-front and largely ignores the late wave components coming from different directions [14] [15]. To detect the first wave-front, we can measure either the envelope of the signal or the energy in the frame [16].

In this paper, we introduce SSF processing, which stands for Suppression of Slowly-varying components and the Falling edge of the power envelope. This processing mimics aspects of both the precedence effect and modulation frequency. In SSF processing, we first remove the DC-bias term in each frequency band by subtracting an exponentially weighted moving average. Regions whose power is smaller than this average are suppressed. For this suppression, we propose two different approaches. In the first approach, the power in these regions is scaled by a small constant. In the second approach, it is replaced by a scaled moving average. The former results in better sound quality for non-reverberated speech, but the latter results in better speech recognition accuracy in reverberant environments. We apply SSF processing to both the training and the test sets.

In speech signal analysis, we usually use a short window with a duration between 20 ms and 30 ms. With the SSF algorithm, we observe that longer windows are more appropriate for estimating or compensating for noise components, which is consistent with what we observed in our previous work (e.g. [17] [18] [9]). However, even if we use a longer window for noise compensation, we still need a short window that is suitable for speech signal analysis when extracting features. Thus, as in [10], after performing frequency-domain processing, we use the IFFT and Overlap Addition (OLA) to re-synthesize speech, and we perform feature extraction on this re-synthesized speech. We will call this procedure the Medium-Duration Analysis and Synthesis approach (MDAS).

2. Structure of the SSF algorithm

Figure 1 shows the structure of the SSF algorithm. The input speech signal is pre-emphasized and then multiplied by a medium-duration (50-ms) Hamming window, as we did in [10]. This signal is represented by $x_m[n]$ in Fig. 1, where $m$ denotes the frame index. We use a 50-ms window length and 10 ms between frames. After windowing, the FFT is computed and integrated over frequency using gammatone weighting functions to obtain the power $P[m, l]$ in the $m$-th frame and $l$-th frequency band as shown below:

$$ P[m, l] = \sum_{k=0}^{N-1} \left| X[m, e^{j\omega_k}) H_l(e^{j\omega_k}) \right|^2 \tag{2} $$

where $k$ is a dummy variable representing the discrete frequency index, and $N$ is the FFT size. The discrete frequency $\omega_k$ is defined by $\omega_k = 2\pi k / N$. Since we are using a 50-ms window, for 16-kHz audio samples $N$ is 2048. $H_l(e^{j\omega_k})$ is the spectrum of the gammatone filter bank for the $l$-th channel evaluated at frequency index $k$, and $X[m, e^{j\omega_k})$ is the short-time spectrum of the speech signal for the $m$-th frame. $L$ in Fig. 1 denotes the total number of gammatone channels, and we use $L = 40$ for obtaining the spectral power. After SSF processing, which is explained in the following sections, we perform spectral reshaping and compute the IFFT using OLA to obtain enhanced speech. We call such a structure Medium-Duration Analysis-by-Synthesis (MDAS).

[Block diagram. Signal flow: x[n] → x_m[n] → X[m, e^{jω_k}) → Magnitude Squared → |X[m, e^{jω_k})|^2 → gammatone weights |H_0(e^{jω_k})|^2, |H_1(e^{jω_k})|^2, ..., |H_{L−1}(e^{jω_k})|^2 → Squared Band Integration → P[m, 0], P[m, 1], ..., P[m, L − 1] → SSF Processing → w[m, 0], w[m, 1], ..., w[m, L − 1] → Spectral Reshaping → processed spectrum → x̂_m[n].]

Figure 1: The block diagram of the SSF processing system

[Plots: Power (dB) versus Time (s), 0 to 2.0 s, showing P[m, l], P1[m, l], and P2[m, l] in each panel; the second panel is titled "Reverberation with RT = 0.5".]

Figure 2: Power contour P[m, l], P1[m, l] (processed by SSF Type-I processing), and P2[m, l] (processed by SSF Type-II processing) for the 10-th channel in clean environment (a) and in the reverberant environment (b).

3. SSF Type-I and SSF Type-II Processing

In SSF processing, we first obtain the low-passed power $M[m, l]$ from each channel:

$$ M[m, l] = \lambda M[m-1, l] + (1 - \lambda) P[m, l] \tag{3} $$

where $\lambda$ is a forgetting factor that adjusts the bandwidth of the low-pass filter. The processed power is obtained by the following equation:

$$ P_1[m, l] = \max\left( P[m, l] - M[m, l],\; c_0 P[m, l] \right) \tag{4} $$

where $c_0$ is a small fixed coefficient that prevents $P_1[m, l]$ from having a negative value. In our experiments, we find that $c_0 = 0.01$ is appropriate for suppression. As is evident from (4), $P_1[m, l]$ is essentially a high-pass filtered signal, since the low-passed power $M[m, l]$ is subtracted from the original signal power $P[m, l]$. From (4), we observe that if the power $P[m, l]$ is larger than $M[m, l] + c_0 P[m, l]$, then $P_1[m, l]$ is the high-pass filter output. However, if $P[m, l]$ is smaller than this value, then the power is suppressed to $c_0 P[m, l]$. This has the effect of suppressing the falling edge of the power contour. We will call processing using (4) SSF Type-I.

A similar approach uses the following equation in place of (4):

$$ P_2[m, l] = \max\left( P[m, l] - M[m, l],\; c_0 M[m, l] \right) \tag{5} $$

We call this processing SSF Type-II.

Equations (4) and (5) differ in only one term, but as shown in Figs. 3 and 4, they differ greatly in recognition accuracy in reverberant environments. It is also interesting that SSF Type-I improves accuracy for clean speech when $0.2 \le \lambda \le 0.4$ (a relative WER reduction of up to 31%). In the power contours of Fig. 2, we observe that with SSF Type-II the falling edge is smoothed out (since $M[m, l]$ is a low-passed signal), which significantly reduces the spectral distortion between clean and reverberant environments.
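The recursion in (3) and the two flooring rules in (4) and (5) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `ssf`, its argument names, and the initial condition (the low-passed power is started from the first frame, so that M[0, l] = P[0, l]) are assumptions made here, since the text does not state how M is initialized.

```python
import numpy as np


def ssf(power, lam=0.4, c0=0.01, ssf_type=1):
    """Apply SSF Type-I or Type-II to a power contour.

    power: array of shape (num_frames, num_channels) holding P[m, l].
    Returns (processed_power, lowpassed_power).
    """
    if ssf_type not in (1, 2):
        raise ValueError("ssf_type must be 1 or 2")
    power = np.asarray(power, dtype=float)
    lowpassed = np.empty_like(power)

    # Assumption: M[-1, l] = P[0, l], which makes M[0, l] = P[0, l].
    prev = power[0]
    for m in range(power.shape[0]):
        # Eq. (3): first-order low-pass (exponentially weighted average).
        prev = lam * prev + (1.0 - lam) * power[m]
        lowpassed[m] = prev

    # Eq. (4) floors with c0 * P[m, l]; Eq. (5) floors with c0 * M[m, l].
    floor = c0 * power if ssf_type == 1 else c0 * lowpassed
    processed = np.maximum(power - lowpassed, floor)
    return processed, lowpassed


# Toy single-channel contour with a falling edge at frame 3.
contour = np.array([[1.0], [1.0], [1.0], [0.1], [0.1]])
p1, m_lp = ssf(contour, ssf_type=1)
p2, _ = ssf(contour, ssf_type=2)
```

On the toy contour, both variants return the floor value after the falling edge, but the Type-II floor follows the decaying M[m, l] rather than dropping at once to c0 P[m, l], which is the smoothing of the falling edge described above.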
[Plots, six panels: Accuracy (100 − WER) versus forgetting factor λ (0 to 1), with one curve per window length (25 ms, 50 ms, 75 ms, 100 ms). Panels: (a) SSF Type-I: RM1 (Clean); (b) SSF Type-I: RM1 (Music 0 dB); (c) SSF Type-I: RM1 (RT = 0.5 (s)); (d) SSF Type-II: RM1 (Clean); (e) SSF Type-II: RM1 (Music 0 dB); (f) SSF Type-II: RM1 (RT = 0.5 (s)).]

Figure 3: Speech recognition accuracy depending on the forgetting factor λ and the window length. In (a), (b), and (c), we used (4) for normalization. In (d), (e), and (f), we used (5) for normalization. The filled triangles at the y-axis represent the baseline MFCC performance in the same environment.
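The forgetting factor λ also sets the pole of the high-pass filter implied by subtracting M[m, l] from P[m, l]. As a sketch, treating each channel as a linear filter and ignoring the max(·) floor in (4) and (5), the transfer function can be worked out from (3). The symbols P(z), M(z), and G(z) are introduced here for illustration only and do not appear in the original text.

```latex
% z-transform of Eq. (3), per channel:
M(z) = \frac{1 - \lambda}{1 - \lambda z^{-1}} \, P(z)

% Transfer function from P[m, l] to P[m, l] - M[m, l]:
G(z) = 1 - \frac{1 - \lambda}{1 - \lambda z^{-1}}
     = \frac{(1 - \lambda z^{-1}) - (1 - \lambda)}{1 - \lambda z^{-1}}
     = \frac{\lambda \, (1 - z^{-1})}{1 - \lambda z^{-1}}
```

With λ = 0.7, G(z) equals 0.7 times the filter in (1), so under these assumptions the two differ only by a constant gain. The zero at z = 1 gives G(1) = 0, which is the removal of the DC component of the power contour.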

Fig. 3 shows the dependence of performance on the forgetting factor λ and the window length. For additive noise, a window length of 75 or 100 ms gave the best performance. For reverberation, however, 50 ms gave the best performance. Thus we use λ = 0.4 and a window length of 50 ms.

4. Spectral reshaping

After obtaining the processed power $\tilde{P}[m, l]$ (which is either $P_1[m, l]$ in (4) or $P_2[m, l]$ in (5)), we obtain a processed spectrum $\tilde{X}[m, e^{j\omega_k})$. To achieve this goal, we use a spectral reshaping approach similar to the one in [10] and [18]. Assuming that the phases of the original and the processed spectra are the same, we modify only the magnitude spectrum.

First, for each time-frequency bin, we obtain the weighting coefficient $w[m, l]$ as the ratio of the processed power $\tilde{P}[m, l]$ to the original power $P[m, l]$:

$$ w[m, l] = \frac{\tilde{P}[m, l]}{P[m, l]}, \qquad 0 \le l \le L - 1 \tag{6} $$

Each of these channels is associated with $H_l$, the frequency response of one of a set of gammatone filters with center frequencies distributed according to the Equivalent Rectangular Bandwidth (ERB) scale [19]. The final spectral weighting $\mu[m, k]$ is obtained from the weights $w[m, l]$:

$$ \mu[m, k] = \frac{\sum_{l=0}^{L-1} w[m, l] \left| H_l(e^{j\omega_k}) \right|}{\sum_{l=0}^{L-1} \left| H_l(e^{j\omega_k}) \right|}, \qquad 0 \le k \le N/2 \tag{7} $$

After obtaining $\mu[m, k]$ for the lower half of the frequency region, $0 \le k \le N/2$, we obtain the upper half from the symmetry property:

$$ \mu[m, k] = \mu[m, N - k], \qquad N/2 < k \le N - 1 \tag{8} $$

Using $\mu[m, k]$, the reconstructed spectrum is obtained by:

$$ \tilde{X}[m, e^{j\omega_k}) = \mu[m, k] \, X[m, e^{j\omega_k}), \qquad 0 \le k \le N - 1 \tag{9} $$

The enhanced speech $\hat{x}[n]$ is re-synthesized using the IFFT and the Overlap Addition (OLA) method.

5. Experimental Results

In this section, we describe experimental results obtained on the DARPA Resource Management (RM) database using the SSF algorithm. For quantitative evaluation of SSF, we used 1,600 utterances from the RM database for training and 600 utterances for testing. We used SphinxTrain 1.0 for training the acoustic models and Sphinx 3.8 for decoding. For feature extraction we used sphinx_fe, which is included in sphinxbase 0.4.1.

Even though SSF is primarily targeted at reverberant environments, we also conducted experiments in additive noise. In Fig. 4(a), we used test utterances corrupted by additive white Gaussian noise, and in Fig. 4(b), we used test utterances corrupted by musical segments of the DARPA Hub 4 Broadcast News database.

We prefer to characterize improvement as the amount by which curves depicting WER as a function of SNR shift laterally when processing is applied. We refer to this statistic as the "threshold shift". As shown in these figures, SSF provided threshold shifts of 8 dB for white noise and 3.5 dB for background music. Note that improvements for background music are difficult to obtain. For comparison, we also ran the same experiments using the state-of-the-art noise compensation algorithm Vector Taylor Series (VTS) [2], and using an open-source RASTA-PLP implementation [20]. For white noise, VTS and SSF show almost the same performance, but for music noise, SSF performs significantly better. For additive noise, SSF Type-I and SSF Type-II show almost the same performance. For clean utterances, SSF Type-I does slightly better than SSF Type-II.

In reverberant environments, as shown in Fig. 4(c), SSF Type-II shows the best performance by a very large margin. SSF Type-I shows the next best performance, but the performance dif-
[Figure 4 plots, three panels: RM1 (White Noise), RM1 (Music Noise) (b), and RM1 (Reverberation). Axes for the noise panels: Accuracy (100 − WER) versus SNR (dB) at 0, 5, 10, 15, 20 dB and Clean. Curves: SSF Type-II, SSF Type-I, MFCC with VTS and CMN, Baseline MFCC with CMN, RASTA-PLP with CMN.]

8. References

[1] S. Molau, M. Pitz, and H. Ney, “Histogram based normalization in the acoustic feature space,” in IEEE Automatic Speech Recognition and Understanding Workshop, Nov. 2001, pp. 21–24.
[2] P. J. Moreno, B. Raj, and R. M. Stern, “A vector Taylor series approach for environment-independent speech recognition,” in IEEE Int. Conf. Acoust., Speech and Signal Processing, May 1996, pp.
[3] B. Raj, V. N. Parikh, and R. M. Stern, “The effects of background music on speech recognition accuracy,” in IEEE Int. Conf. Acoust., Speech and Signal Processing, vol. 2, Apr. 1997, pp. 851–854.
[4] C. Kim, Y.-H. Chiu, and R. M. Stern, “Physiologically-motivated synchrony-based processing for robust automatic speech recognition,” in INTERSPEECH-2006, Sept. 2006, pp. 1975–1978.
[5] B. Raj and R. M. Stern, “Missing-Feature Methods for Robust Automatic Speech Recognition,” IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 101–116, Sept. 2005.
[6] B. E. D. Kingsbury, N. Morgan, and S. Greenberg, “Robust speech recognition using the modulation spectrogram,” Speech Communication, vol. 25, no. 1.
[7] R. Drullman, J. M. Festen, and R. Plomp, “Effect of temporal envelope smearing on speech recognition,” J. Acoust. Soc. Am., vol. 95, no. 2, pp. 1053–1064, Feb. 1994.
[8] R. Drullman, J. M. Festen, and R. Plomp, “Effect of reducing slow temporal modulations on speech recognition,” J. Acoust. Soc. Am., vol. 95, no. 5, pp. 2670–2680, May 1994.
[9] C. Kim and R. M. Stern, “Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, March 2010, pp. 4574–4577.
[10] C. Kim and R. M. Stern, “Power function-based power distribution normalization algorithm for robust speech recognition,” in IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2009, pp. 188–193.
[11] C. Kim, K. Kumar and R. M. Stern, “Robust speech recognition
                                 0   0.1 0.2 0.3 0.4 0.5 0.6           0.9            1.2        using small power boosting algorithm,” in IEEE Automatic Speech
                                                Reverberation Time (s)                           Recognition and Understanding Workshop, Dec. 2009, pp. 243–
Figure 4: Speech recognition accuracy using different algo-
                                                                                            [12] H. G. Hirsch, P. Meyer , and H. W. Ruehl, “Improved speech
rithms (a) for white noise (b) for musical noise, and (c) under                                  recognition using high-pass filtering of subband envelopes,” in
reverberant environments                                                                         EUROSPEECH ’91, Sept. 1991, pp. 413–416.
                                                                                            [13] H. Hermansky and N. Morgan, “RASTA processing of speech,”
ference between SSF Type-I and SSF-Type-II is large. On the                                      IEEE. Trans. Speech Audio Process., vol. 2, no. 4, Oct. 1994.
contrary, VTS does not show performance improvement, and                                    [14] K. D. Martin, “Echo suppression in a computational model of the
PLP-RASTA shows poorer performance than MFCC.                                                    precedence effect,” in IEEE ASSP Workshop on Applications of
                                                                                                 Signal Processing to Audio and Acoustics, Oct. 1997.
                                               6. Conclusion                                [15] P. M. Zurek, The precedence effect.     New York, NY: Springer-
                                                                                                 Verlag, 1987, ch. 4, pp. 85–105.
In this paper, we present a new algorithm which is especially ro-                           [16] Y. Park and H. Park, “Non-stationary sound source localization
bust for reverberation. Motivated by the modulation frequency                                    based on zero crossings with the detection of onset intervals,” IE-
concept and precedence effect, we apply a first-order high pass                                   ICE Electronics Express, vol. 5, no. 24, pp. 1054–1060, 2008.
filtering. The falling edges of power contours are suppressed in                             [17] C. Kim and R. M. Stern, “Feature extraction for robust speech
two different ways. We observe that using the low-passed signal                                  recognition using a power-law nonlinearity and power-bias sub-
for the falling edge is especially helpful for reducing spectral                                 traction,” in INTERSPEECH-2009, Sept. 2009.
distortion for reverberant environments. Experimental results                               [18] C. Kim, K. Kumar, B. Raj, and R. M. Stern, “Signal separation
show that this approach is more effective than known algorithms                                  for robust speech recognition based on phase difference informa-
in reverberant environments.                                                                     tion obtained in the frequency domain,” in INTERSPEECH-2009,
                                                                                                 Sept. 2009.
                               7. Open Source Matlab Code                                   [19] B. C. J. Moore and B. R. Glasberg, “A revision of Zwicker’s loud-
                                                                                                 ness model,” Acustica - Acta Acustica, vol. 82, pp. 335–345, 1996.
You can download the open source matlab code at the following                               [20] D. Ellis. (2006) Plp and rasta (and mfcc, and inversion) in
URL˜robust/archive/                                                       matlab using melfcc.m and invmelfcc.m. [Online]. Available:
algorithms/SSF_IS2010/. This matlab code was used                                      
in obtaining experimental results in Section 5.
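
As a rough illustration of the processing summarized in Section 6, the following Python sketch applies a first-order high-pass filter to the power contour of a single channel by subtracting an exponentially weighted moving average, and then floors the result in two ways. It is not the released Matlab code. The function name `ssf_channel`, the forgetting factor `lam`, the floor coefficient `c0`, their default values, and the initialization of the running average are all assumptions made for this example.

```python
def ssf_channel(power, lam=0.4, c0=0.01, use_lowpass_floor=True):
    """Sketch of SSF-style processing for the frame powers of one channel.

    power: list of non-negative per-frame power values.
    lam:   hypothetical forgetting factor of the moving average.
    c0:    hypothetical small coefficient that sets the floor.
    use_lowpass_floor=False: Type-I-like, floor taken from the current power.
    use_lowpass_floor=True:  Type-II-like, floor taken from the low-passed power.
    """
    if not power:
        return []
    out = []
    m = power[0]  # assumption: start the running average at the first frame
    for p in power:
        # Exponentially weighted moving average (low-passed power).
        m = lam * m + (1.0 - lam) * p
        # High-pass filtering: subtract the slowly varying component.
        highpassed = p - m
        # On falling edges the difference is negative, so a floor takes over.
        floor = c0 * (m if use_lowpass_floor else p)
        out.append(max(highpassed, floor))
    return out


# A falling edge: the two variants differ only in the floor that is applied.
falling = [10.0, 1.0]
type1_like = ssf_channel(falling, use_lowpass_floor=False)
type2_like = ssf_channel(falling, use_lowpass_floor=True)
```

In this sketch a rising edge passes through the high-pass branch, while a falling edge is replaced by the floor. With the low-passed floor, the output decays with the moving average instead of dropping abruptly with the current power, which is one way to read the observation in Section 6 about reduced spectral distortion.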
