Suppression of Slowly-Varying Components and the Falling Edge of the Power Envelope for Robust Speech Recognition

Chanwoo Kim 1, Kshitiz Kumar 2, Bhiksha Raj 3, and Richard M. Stern 4

2,4 Department of Electrical and Computer Engineering and 1,3,4 Language Technologies Institute
Carnegie Mellon University, Pittsburgh, PA 15213 USA
email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org

Abstract

In this paper, we present a novel algorithm called Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF) to enhance spectral features for robust speech recognition, especially in reverberant environments. This algorithm is motivated by the precedence effect and the modulation frequency characteristics of the human auditory system. We present two slightly different types of SSF processing (SSF Type-I and SSF Type-II), depending on whether or not a low-passed power envelope signal is used to suppress falling edges. SSF algorithms can be implemented for on-line processing. Experimental results show that this algorithm provides especially good performance in reverberant environments.

Index Terms: Robust speech recognition, speech enhancement, precedence effect, modulation frequency

1. Introduction

In spite of continued efforts by a large number of researchers, enhancing the noise robustness of automatic speech recognition systems remains a very challenging problem. For stationary noise such as white or pink noise, algorithms such as Histogram Normalization (HN) (e.g. [1]) or Vector Taylor Series (VTS) [2] have been shown to be effective. However, for more realistic noise such as background music, the same kind of improvement has not been observed [3]. In these more difficult environments, it has been frequently observed that algorithms motivated by human auditory processing (e.g. [4]) or missing-feature algorithms (e.g. [5]) are more promising.
It has long been believed that modulation frequency plays an important role in human listening. For example, it has been observed that the human auditory system is more sensitive to modulation frequencies less than 20 Hz (e.g. [6, 7]). On the other hand, very slowly changing components (e.g. less than 5 Hz) are usually related to noise sources (e.g. [9, 10]). Based on these observations, researchers have tried to utilize modulation frequency information to enhance speech recognition performance in noisy environments. Typical approaches use high-pass or band-pass filtering in the spectral, log-spectral, or cepstral domains. In [12], Hirsch et al. investigated the effects of high-pass filtering of the spectral envelopes of frequency subbands. Unlike the RASTA (RelAtive SpecTrAl) processing proposed by Hermansky in [13], Hirsch conducted high-pass filtering in the power domain, which can be expressed by the following transfer function:

    H(z) = \frac{1 - z^{-1}}{1 - 0.7 z^{-1}}   (1)

This first-order IIR filter can be realized by subtracting an exponentially weighted moving average from the current power.

For robust speech recognition, the other common difficulty is reverberation. One of the most widely used approaches to tackling reverberation is based on the "precedence effect": the auditory system focuses on the direction of the first wave front and largely ignores the later wave components arriving from different directions [14, 15]. To detect the first wave front, we can measure either the envelope of the signal or the energy in each frame [16].

In this paper, we introduce SSF processing, which stands for Suppression of Slowly-varying components and the Falling edge of the power envelope. This processing mimics aspects of both the precedence effect and modulation frequency analysis. In SSF processing, we first remove the DC bias term in each frequency band by subtracting an exponentially weighted moving average, and regions whose power is smaller than this average are suppressed. For this suppression, we propose two different approaches: in the first, such regions are suppressed by scaling the power by a small constant; in the second, they are replaced by a scaled moving average. The former results in better sound quality for non-reverberated speech, but the latter results in better speech recognition accuracy in reverberant environments. When performing SSF processing, we apply it to both the training and test sets.

In speech signal analysis, we usually use a short window with a duration between 20 and 30 ms. With the SSF algorithm, we observe that windows longer than this are more appropriate for estimating or compensating for noise components, which is consistent with what we observed in our previous work (e.g. [9, 10]). However, even if we use a longer-duration window, we still need a short window suitable for speech signal analysis when extracting features. Thus, as in [11], after performing frequency-domain processing, we use the IFFT and Overlap Addition (OLA) to re-synthesize speech, and we perform feature extraction on this re-synthesized speech. We call this procedure the Medium-Duration Analysis and Synthesis (MDAS) approach.

2. Structure of the SSF algorithm

Figure 1 shows the structure of the SSF algorithm. The input speech signal is pre-emphasized and then multiplied by a medium-duration (50-ms) Hamming window, as we did in [17]. This signal is represented by x_m[n] in Fig. 1, where m denotes the frame index. We use a 50-ms window length with 10 ms between frames. After windowing, the FFT is computed and integrated over frequency using gammatone weighting functions to obtain the power P[m, l] in the m-th frame and l-th frequency band:

    P[m, l] = \sum_{k=0}^{N-1} \left| X[m, e^{j\omega_k}) H_l(e^{j\omega_k}) \right|^2   (2)

where k is a dummy variable representing the discrete frequency index and N is the FFT size. The discrete frequency is defined by \omega_k = 2\pi k / N. Since we are using a 50-ms window, N is 2048 for 16-kHz audio samples. H_l(e^{j\omega_k}) is the spectrum of the gammatone filter bank for the l-th channel evaluated at frequency index k, and X[m, e^{j\omega_k}) is the short-time spectrum of the speech signal for the m-th frame. L in Fig. 1 denotes the total number of gammatone channels; we use L = 40 for obtaining the spectral power. After SSF processing, which is explained in the following section, we perform spectral reshaping and compute the IFFT with OLA to obtain the enhanced speech. We call this kind of structure the Medium-Duration Analysis and Synthesis (MDAS) structure.

Figure 1: The block diagram of the SSF processing system (pre-emphasis, STFT, magnitude squared, squared gammatone band integration yielding the channel powers P[m, 0], ..., P[m, L-1], SSF processing yielding the weights w[m, 0], ..., w[m, L-1], spectral reshaping, IFFT, and post-deemphasis).
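For concreteness, the band-power analysis of (2) can be sketched in a few lines of NumPy. This is an illustrative reimplementation rather than the released Matlab code of Section 7: the function name is an invention of the sketch, and the matrix H holding the gammatone magnitude responses |H_l(e^{j\omega_k})| is assumed to be precomputed elsewhere.

```python
import numpy as np

def band_powers(x, H, fs=16000, win_dur=0.050, hop_dur=0.010, n_fft=2048):
    """Channel powers P[m, l] per Eq. (2).

    x : 1-D speech signal (already pre-emphasized).
    H : (L, n_fft) array of gammatone magnitude responses |H_l(e^{jw_k})|,
        assumed precomputed on the same FFT grid.
    """
    win_len = int(win_dur * fs)        # 800 samples for 50 ms at 16 kHz
    hop = int(hop_dur * fs)            # 160 samples for 10 ms between frames
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    P = np.zeros((n_frames, H.shape[0]))
    for m in range(n_frames):
        frame = x[m * hop : m * hop + win_len] * window
        X = np.fft.fft(frame, n_fft)   # X[m, e^{jw_k}), zero-padded to N
        # Eq. (2): P[m, l] = sum_k |X[m, e^{jw_k}) H_l(e^{jw_k})|^2
        P[m] = ((np.abs(X)[None, :] * H) ** 2).sum(axis=1)
    return P
```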
3. SSF Type-I and SSF Type-II Processing

In SSF processing, we first obtain the low-passed power M[m, l] for each channel:

    M[m, l] = \lambda M[m-1, l] + (1 - \lambda) P[m, l]   (3)

where \lambda is a forgetting factor that adjusts the bandwidth of the low-pass filter. The processed power is obtained by the following equation:

    P_1[m, l] = \max\left( P[m, l] - M[m, l], \; c_0 P[m, l] \right)   (4)

where c_0 is a small fixed coefficient that prevents P_1[m, l] from becoming negative. In our experiments, we find that c_0 = 0.01 is appropriate for this suppression.

As is obvious from (4), P_1[m, l] is intrinsically a high-pass-filtered signal, since the low-passed power M[m, l] is subtracted from the original power P[m, l]. From (4), we observe that if the difference P[m, l] - M[m, l] is larger than c_0 P[m, l], then P_1[m, l] is the high-pass filter output; otherwise, the power is suppressed to c_0 P[m, l]. This has the effect of suppressing the falling edges of the power contour. We call the processing using (4) SSF Type-I.

A similar approach uses the following equation instead of (4):

    P_2[m, l] = \max\left( P[m, l] - M[m, l], \; c_0 M[m, l] \right)   (5)

We call this processing SSF Type-II. The difference between (4) and (5) is only a single term, but as shown in Figs. 3 and 4, the two rules yield very different recognition accuracy in reverberant environments. Another interesting observation is that with SSF Type-I and 0.2 \le \lambda \le 0.4, we obtain improvements even for clean speech (up to a 31% relative reduction in WER). In the power contours of Fig. 2, we observe that with SSF Type-II the falling edges are smoothed out (since M[m, l] is basically a low-passed signal), which significantly reduces the spectral distortion between clean and reverberant environments.

Figure 2: Power contours P[m, l], P_1[m, l] (processed by SSF Type-I), and P_2[m, l] (processed by SSF Type-II) for the 10th channel in a clean environment (a) and in a reverberant environment with RT = 0.5 s (b).

Figure 3: Speech recognition accuracy (100 - WER) on RM1 as a function of the forgetting factor \lambda and the window length (25, 50, 75, and 100 ms): (a) clean speech, (b) music noise at 0 dB, and (c) reverberation with RT = 0.5 s using SSF Type-I, i.e., normalization by (4); (d), (e), and (f) show the same conditions using SSF Type-II, i.e., normalization by (5). The filled triangles at the y-axis represent the baseline MFCC performance in the same environment.

Figure 3 shows how performance depends on the forgetting factor \lambda and the window length. For additive noise, window lengths of 75 or 100 ms showed the best performance; however, for reverberation, 50 ms was best. We therefore use \lambda = 0.4 and a window length of 50 ms.
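The recursion (3) and the two suppression rules (4) and (5) are simple enough to state directly in code. The sketch below is again an illustrative reimplementation rather than the released code of Section 7; in particular, the initialization M[0, l] = P[0, l] is an assumption of the sketch, since the initial condition is not specified above.

```python
import numpy as np

def ssf(P, lam=0.4, c0=0.01, type2=True):
    """SSF processing of channel powers per Eqs. (3)-(5).

    P : (n_frames, L) array of band powers from Eq. (2).
    Returns P2 (SSF Type-II) when type2 is True, else P1 (SSF Type-I).
    """
    M = np.empty_like(P)
    M[0] = P[0]                  # assumed initial condition (not specified)
    for m in range(1, len(P)):
        # Eq. (3): exponentially weighted moving average (low-pass filter)
        M[m] = lam * M[m - 1] + (1.0 - lam) * P[m]
    # Eq. (4) floors at c0 * P (Type-I); Eq. (5) floors at c0 * M (Type-II),
    # which suppresses slowly varying components and falling edges
    floor = c0 * M if type2 else c0 * P
    return np.maximum(P - M, floor)
```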
4. Spectral reshaping

After obtaining the processed power \tilde{P}[m, l] (which is either P_1[m, l] in (4) or P_2[m, l] in (5)), we obtain a processed spectrum \tilde{X}[m, e^{j\omega_k}). To achieve this, we use a spectral reshaping approach similar to the one we used in [17] and [18]. Assuming that the phases of the original and processed spectra are the same, we modify only the magnitude spectrum.

First, for each time-frequency bin, we obtain a weighting coefficient w[m, l] as the ratio of the processed power \tilde{P}[m, l] to the original power P[m, l]:

    w[m, l] = \frac{\tilde{P}[m, l]}{P[m, l]}, \quad 0 \le l \le L-1   (6)

Each channel l is associated with H_l, the frequency response of one of a set of gammatone filters with center frequencies distributed according to the Equivalent Rectangular Bandwidth (ERB) scale [19]. The final spectral weighting \mu[m, k] is obtained from the channel weights w[m, l]:

    \mu[m, k] = \frac{\sum_{l=0}^{L-1} w[m, l] \, |H_l(e^{j\omega_k})|}{\sum_{l=0}^{L-1} |H_l(e^{j\omega_k})|}, \quad 0 \le k \le N/2   (7)

After obtaining \mu[m, k] for the lower half of the frequency range, 0 \le k \le N/2, we obtain the upper half from the symmetry property:

    \mu[m, k] = \mu[m, N-k], \quad N/2 < k \le N-1   (8)

Using \mu[m, k], the reconstructed spectrum is obtained by:

    \tilde{X}[m, e^{j\omega_k}) = \mu[m, k] X[m, e^{j\omega_k}), \quad 0 \le k \le N-1   (9)

The enhanced speech \hat{x}[n] is re-synthesized from \tilde{X}[m, e^{j\omega_k}) using the IFFT and the overlap-add (OLA) method.

5. Experimental Results

In this section, we describe experimental results obtained with the SSF algorithm on the DARPA Resource Management (RM) database. For quantitative evaluation of SSF, we used 1,600 utterances from the RM database for training and 600 utterances for testing. We used SphinxTrain 1.0 for training the acoustic models, Sphinx 3.8 for decoding, and sphinx_fe, which is included in sphinxbase 0.4.1, for feature extraction.

Even though SSF is primarily targeted at reverberant environments, we conducted experiments in additive noise as well. In Fig. 4(a), we used test utterances corrupted by additive white Gaussian noise, and in Fig. 4(b), we used test utterances corrupted by musical segments of the DARPA Hub 4 Broadcast News database.

We prefer to characterize improvement as the amount by which curves depicting WER as a function of SNR shift laterally when processing is applied; we refer to this statistic as the "threshold shift". As shown in these figures, SSF provided an 8-dB threshold shift for white noise and a 3.5-dB shift for background music. Note that obtaining improvements for background music is not easy. For comparison, we also obtained results with the state-of-the-art noise compensation algorithm Vector Taylor Series (VTS) [2] and with an open-source RASTA-PLP implementation [20]. For white noise, VTS and SSF show almost the same performance, but for music noise, SSF performs significantly better. For additive noise, SSF Type-I and SSF Type-II show almost the same performance, while for clean utterances, SSF Type-I does slightly better than SSF Type-II.

In reverberant environments, as shown in Fig. 4(c), SSF Type-II shows the best performance by a very large margin, with SSF Type-I next; the difference between SSF Type-I and SSF Type-II is large. On the contrary, VTS does not show a performance improvement, and RASTA-PLP shows poorer performance than the MFCC baseline.

Figure 4: Speech recognition accuracy (100 - WER) on RM1 for SSF Type-II, SSF Type-I, MFCC with VTS and CMN, baseline MFCC with CMN, and RASTA-PLP with CMN: (a) white noise, (b) musical noise, and (c) reverberant environments with reverberation times from 0 to 1.2 s.
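The threshold shift is described above only informally. One plausible way to compute it, sketched below under stated assumptions, is to interpolate the SNR at which each WER-versus-SNR curve crosses a fixed reference WER and take the difference; the reference value wer_ref is an assumption of the sketch, not a choice made above.

```python
import numpy as np

def threshold_shift(snr_db, wer_baseline, wer_processed, wer_ref=50.0):
    """Lateral shift (in dB) between two WER-vs-SNR curves.

    Interpolates the SNR at which each curve crosses wer_ref; a positive
    result means processing reaches the reference WER at an SNR that is
    lower by that many dB.
    """
    def snr_at_ref(wer):
        # np.interp needs increasing x, so sort by WER before interpolating
        order = np.argsort(wer)
        return np.interp(wer_ref, np.asarray(wer)[order],
                         np.asarray(snr_db)[order])
    return snr_at_ref(wer_baseline) - snr_at_ref(wer_processed)
```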
6. Conclusion

In this paper, we present a new algorithm that is especially robust in reverberation. Motivated by the modulation frequency concept and the precedence effect, we apply first-order high-pass filtering to the power contour of each channel, and the falling edges of the power contours are suppressed in one of two ways. We observe that using the low-passed signal on the falling edges is especially helpful for reducing the spectral distortion between clean and reverberant environments. Experimental results show that this approach is more effective than existing algorithms in reverberant environments.

7. Open Source Matlab Code

The open-source Matlab code is available at http://www.cs.cmu.edu/~robust/archive/algorithms/SSF_IS2010/. This code was used to obtain the experimental results described in Section 5.

8. References

[1] S. Molau, M. Pitz, and H. Ney, "Histogram based normalization in the acoustic feature space," in IEEE Automatic Speech Recognition and Understanding Workshop, Nov. 2001, pp. 21–24.

[2] P. J. Moreno, B. Raj, and R. M. Stern, "A vector Taylor series approach for environment-independent speech recognition," in IEEE Int. Conf. Acoust., Speech and Signal Processing, May 1996, pp. 733–736.

[3] B. Raj, V. N. Parikh, and R. M. Stern, "The effects of background music on speech recognition accuracy," in IEEE Int. Conf. Acoust., Speech and Signal Processing, vol. 2, Apr. 1997, pp. 851–854.

[4] C. Kim, Y.-H. Chiu, and R. M. Stern, "Physiologically-motivated synchrony-based processing for robust automatic speech recognition," in INTERSPEECH-2006, Sept. 2006, pp. 1975–1978.

[5] B. Raj and R. M. Stern, "Missing-feature methods for robust automatic speech recognition," IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 101–116, Sept. 2005.

[6] B. E. D. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, 1998.

[7] R. Drullman, J. M. Festen, and R. Plomp, "Effect of temporal envelope smearing on speech recognition," J. Acoust. Soc. Am., vol. 95, no. 2, pp. 1053–1064, Feb. 1994.

[8] ——, "Effect of reducing slow temporal modulations on speech recognition," J. Acoust. Soc. Am., vol. 95, no. 5, pp. 2670–2680, May 1994.

[9] C. Kim and R. M. Stern, "Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Mar. 2010, pp. 4574–4577.

[10] ——, "Power function-based power distribution normalization algorithm for robust speech recognition," in IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2009, pp. 188–193.

[11] C. Kim, K. Kumar, and R. M. Stern, "Robust speech recognition using small power boosting algorithm," in IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2009, pp. 243–248.

[12] H. G. Hirsch, P. Meyer, and H. W. Ruehl, "Improved speech recognition using high-pass filtering of subband envelopes," in EUROSPEECH '91, Sept. 1991, pp. 413–416.

[13] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Speech Audio Process., vol. 2, no. 4, Oct. 1994.

[14] K. D. Martin, "Echo suppression in a computational model of the precedence effect," in IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 1997.

[15] P. M. Zurek, The Precedence Effect. New York, NY: Springer-Verlag, 1987, ch. 4, pp. 85–105.

[16] Y. Park and H. Park, "Non-stationary sound source localization based on zero crossings with the detection of onset intervals," IEICE Electronics Express, vol. 5, no. 24, pp. 1054–1060, 2008.

[17] C. Kim and R. M. Stern, "Feature extraction for robust speech recognition using a power-law nonlinearity and power-bias subtraction," in INTERSPEECH-2009, Sept. 2009.
[18] C. Kim, K. Kumar, B. Raj, and R. M. Stern, "Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain," in INTERSPEECH-2009, Sept. 2009.

[19] B. C. J. Moore and B. R. Glasberg, "A revision of Zwicker's loudness model," Acustica – Acta Acustica, vol. 82, pp. 335–345, 1996.

[20] D. Ellis, "PLP and RASTA (and MFCC, and inversion) in Matlab using melfcc.m and invmelfcc.m," 2006. [Online]. Available: http://labrosa.ee.columbia.edu/matlab/rastamat/