International Workshop on Acoustic Echo and Noise Control (IWAENC2003), Sept. 2003, Kyoto, Japan

ON THE APPLICATION OF THE UNSCENTED KALMAN FILTER TO SPEECH PROCESSING

Sharon Gannot
Faculty of Electrical Engineering, Technion, Technion City, 32000 Haifa, Israel
e-mail: gannot@siglab.technion.ac.il

Marc Moonen
Dept. of Elect. Eng. (ESAT-SISTA), K.U.Leuven, B-3001 Leuven, Belgium
e-mail: Marc.Moonen@esat.kuleuven.ac.be

ABSTRACT

In a series of recent studies a new approach for applying the Kalman filter to nonlinear systems, referred to as the unscented Kalman filter (UKF), was proposed. In this contribution¹ we apply the UKF to several speech processing problems, in which a model with unknown parameters is given to the measured signals. We show that the nonlinearity arises naturally in these problems. Preliminary simulation results for artificial signals manifest the potential of the method.

¹ This research work was carried out at the ESAT laboratory of the Katholieke Universiteit Leuven, in the frame of the Interuniversity Attraction Pole IUAP P4-02, Modeling, Identification, Simulation and Control of Complex Systems, the Concerted Research Action Mathematical Engineering Techniques for Information and Communication Systems (GOA-MEFISTO-666) of the Flemish Government and the IT-project Multi-microphone Signal Enhancement Techniques for handsfree telephony and voice controlled systems (MUSETTE-2) of the I.W.T., and was partially sponsored by Philips-ITCL.

1 INTRODUCTION

The recently proposed unscented transform (UT) is a method for calculating the statistics of a random variable undergoing a nonlinear transformation that was first suggested by Julier et al. [1]. This method was used to generalize the Kalman filter to nonlinear systems by Julier et al. [1] and was further extended by Wan et al. [2] to problems where both signals and parameters are jointly estimated. In [2] (and other contributions) the nonlinearity arises from the production model.

In this contribution we further apply the UKF to several speech processing problems, namely single microphone speech enhancement and multi-microphone speech dereverberation. We show that in these applications the nonlinearity arises naturally, due to the multiplication of signals and parameters, if both are given a dynamic model. The technique is demonstrated by several simple examples.

In Section 2 the unscented transform and its application to the nonlinear Kalman filter are reviewed. Sections 3.1 and 3.2 discuss the application of the method to the problems of single microphone speech enhancement and two microphone speech dereverberation, respectively. We draw some conclusions and discuss some further directions in Section 4.

2 PRELIMINARIES

2.1 The Unscented Transform (UT)

Let $\mathbf{x}$ be an $L$-dimensional random vector with mean $\bar{\mathbf{x}}$ and covariance matrix $P_{xx}$. Let $\mathbf{y} = f(\mathbf{x})$ be a nonlinear transformation from the random vector $\mathbf{x}$ to another random vector $\mathbf{y}$. The first and second order statistics of the vector $\mathbf{y}$ should be calculated. We briefly summarize the method. The mean and covariance of $\mathbf{x}$ are represented by $2L + 1$ points and weights

\[
\begin{aligned}
\mathcal{X}_0 &= \bar{\mathbf{x}} \\
\mathcal{X}_l &= \bar{\mathbf{x}} + \left(\sqrt{(L+\lambda)P_{xx}}\right)_l ; \quad l = 1, \ldots, L \\
\mathcal{X}_{l+L} &= \bar{\mathbf{x}} - \left(\sqrt{(L+\lambda)P_{xx}}\right)_l ; \quad l = 1, \ldots, L \\
W_0^{(m)} &= \lambda/(L+\lambda) \\
W_0^{(c)} &= \lambda/(L+\lambda) + (1 - \alpha^2 + \beta) \\
W_l^{(m)} &= W_l^{(c)} = \frac{1}{2(L+\lambda)} ; \quad l = 1, 2, \ldots, 2L
\end{aligned}
\]

where $\left(\sqrt{(L+\lambda)P_{xx}}\right)_l$ is the $l$-th row or column of the corresponding matrix square root, and $\lambda = \alpha^2(L+\kappa) - L$. $\alpha$ determines the spread of the sigma points; $\alpha = 1$ was used throughout our simulations. $\kappa$ is a secondary scaling parameter. The choice $\kappa = 3 - L$ maintains the kurtosis of a Gaussian vector. Throughout our simulations $\kappa$ is set to 0. $\beta$ is used to incorporate prior knowledge of the distribution ($\beta = 2$ for Gaussian distributions). A proper choice of these parameters and its influence on the obtainable performance is still an open topic. The mean and covariance of the vector $\mathbf{y}$ are calculated using the following procedure,

1. Construct the sigma points $\mathcal{X}_l$, $l = 0, \ldots, 2L$.
2. Transform each point: $\mathcal{Y}_l = f(\mathcal{X}_l)$, $l = 0, \ldots, 2L$.
3. Mean: Use weighted averaging, $\bar{\mathbf{y}} \approx \sum_{l=0}^{2L} W_l^{(m)} \mathcal{Y}_l$.
4. Covariance: Use weighted outer product, $P_{yy} \approx \sum_{l=0}^{2L} W_l^{(c)} (\mathcal{Y}_l - \bar{\mathbf{y}})(\mathcal{Y}_l - \bar{\mathbf{y}})^T$.

The benefits of using the UT are presented in [1] and [2].

2.2 The Application of the Unscented Transform to the Nonlinear Kalman Filtering Problem

The Kalman filter is a recursive and causal solution for minimum mean square error (MMSE) state estimation in the Gaussian and linear case. The Kalman equations are formulated with the state-space notation and consist of two stages: a propagation stage, in which the mean and a priori covariance of the respective state are predicted based on the system dynamics and on the previous time instant estimate, and an update stage, in which this prediction is optimally weighted with the new measurement. The error covariance, interpreted as the amount of confidence we have in the estimate, is propagated in a similar fashion.

When the system dynamics and the measurement equation are linear, all the calculations involved are straightforward. The situation is more complex when the involved equations are nonlinear. In this case, a method for propagating mean and covariance through nonlinearities is needed. Let $\mathbf{s}(t)$ and $\boldsymbol{\theta}(t)$ be a signal state-space vector and a parameter vector, respectively. $\mathbf{u}(t)$ and $\mathbf{v}(t)$ are innovation and measurement noise sequences, respectively. Define also the augmented state vector $\mathbf{x}^T(t) = \left[\, \mathbf{s}^T(t) \;\; \boldsymbol{\theta}^T(t) \,\right]$. Nonlinear transition and measurement equations are given by,

\[
\begin{aligned}
\mathbf{x}(t) &= \Phi\left(\mathbf{x}(t-1), \mathbf{u}(t)\right) \\
\mathbf{z}(t) &= h\left(\mathbf{x}(t), \mathbf{v}(t)\right).
\end{aligned}
\]

In the past the extended Kalman filter (EKF), based on the linearization of the equations, was used. This method might be quite complex, as it involves the calculation of derivatives, and yet not accurate enough, as only a first-order approximation is applied.

A better method, proposed in [1], is to use the previously mentioned unscented transform in order to propagate the mean and covariance through the nonlinearities. Fig. 1 summarizes the steps involved in the unscented Kalman filter (UKF). The method consists of calculating the mean and covariance of the augmented state vectors undergoing a known nonlinear transform by virtue of the unscented transform. The complexity of the suggested method is quite low as only an increase of dimensions by a factor of $2L + 1$ is required.

[Figure 1 (block diagram). (a) The current estimates $\hat{\mathbf{s}}(t-1|t-1)$, $P_s(t-1|t-1)$, $\hat{\boldsymbol{\theta}}(t-1|t-1)$, $P_\theta(t-1|t-1)$ pass through the UT to give the current sigma points $\mathcal{X}(t-1|t-1)$. (b) The current sigma points pass through the nonlinear system dynamics and measurement $\{\Phi, h\}$ to give the predicted sigma points of signal and measurement, $\mathcal{X}(t|t-1)$ and $\mathcal{Z}(t|t-1)$. (c) The inverse transform UT$^{-1}$ maps $\mathcal{X}(t|t-1)$, $\mathcal{Z}(t|t-1)$ to the predicted signal, error covariance and measurement: $\hat{\mathbf{x}}(t|t-1)$, $P_x(t|t-1)$, $\hat{\mathbf{z}}(t)$, $P_{xz}(t)$, $P_{zz}(t)$. (d) Optimal weighting with $K(t) = P_{xz}(t)P_{zz}^{-1}(t)$ combines $\hat{\mathbf{x}}(t|t-1)$, $P_x(t|t-1)$, $\mathbf{z}(t)$ and $\hat{\mathbf{z}}(t)$ into the new signal estimate and error covariance: $\hat{\mathbf{x}}(t|t) \rightarrow \hat{\mathbf{s}}(t|t), \hat{\boldsymbol{\theta}}(t|t)$ and $P_x(t|t) \rightarrow P_s(t|t), P_\theta(t|t)$.]

Figure 1: Unscented Kalman filter: (a) Unscented transform. (b) Propagation equations. (c) Inverse unscented transform. (d) Update equations.

3 APPLICATION TO SPEECH PROCESSING

In many model-based problems in speech processing (e.g. single microphone speech enhancement, multi-microphone speech enhancement and dereverberation) a problem of estimating both the speech signal and various parameters arises. This problem can be addressed in two ways. In the first, referred to by Wan et al. [2] as dual estimation, a two step approach is taken. In each time instant a Kalman filtering step for the signal is applied based on the current estimate of the parameters. In parallel a parameter estimation step is applied based on the current signal state estimate. The parameter estimation might be conducted using recursive methods such as RLS or LMS. Alternatively, under the Bayesian framework, the parameters can be given a dynamic model and the Kalman filter can be applied. This approach will be used throughout this work. The dual estimation method can be seen as a sequential variant of the estimate-maximize (EM) procedure, but no claims of optimality are valid. Discussion on the subject can be found in [3]. The method is summarized in Fig. 2 (top).

The same problem can be reformulated into a joint estimation problem. Note that most operations involve parameter and state vector multiplications. Thus, the problem of joint estimation of the speech and the parameters becomes nonlinear if both are modelled as stochastic processes. We remark that, as this nonlinearity is separable, this formulation might lead to the same performance as in the dual scheme. This subject is still under investigation. The approach of jointly estimating the speech signal and its parameters is summarized in Fig. 2 (bottom).

[Figure 2 (block diagrams). Top: the measurement $z(t)$ feeds a speech Kalman filter, which maps $\hat{\mathbf{s}}(t-1|t-1)$ to $\hat{\mathbf{s}}(t|t)$ through a delay D, and a parameters Kalman filter, which maps $\hat{\boldsymbol{\theta}}(t-1|t-1)$ to $\hat{\boldsymbol{\theta}}(t|t)$ through a delay D. Bottom: $z(t)$ feeds a single speech + parameters Kalman filter that updates $\hat{\mathbf{s}}$ and $\hat{\boldsymbol{\theta}}$ together.]

Figure 2: Dual (top) and joint (bottom) estimation procedures.

3.1 Single Microphone Speech Enhancement

The problem of single-microphone speech enhancement has been extensively studied. Specifically, the use of the Kalman filter for estimating both the signal and the parameters is presented by Gannot et al. [3]. By assuming an AR model for the speech signal and giving a dynamic model to the AR parameters, both dual and joint schemes can be formulated. Each of the two steps comprising the dual scheme is linear, while the joint scheme consists of a single nonlinear step.

3.1.1 Signals Model

Let the signal measured by the microphone be given by $z(t) = s(t) + v(t)$, where $s(t)$ represents the sampled speech signal and $v(t)$ represents an additive background noise. We shall assume a time varying AR model for the speech signal, i.e.

\[
s(t) = -\sum_{k=1}^{p} \alpha_k(t)\, s(t-k) + g_s(t)\, u_s(t) \tag{1}
\]

where the excitation $u_s(t)$ is a normalized (zero mean, unit variance) white noise, $g_s(t)$ represents the innovation gain, and $\alpha_1(t), \alpha_2(t), \ldots, \alpha_p(t)$ are the AR coefficients. The additive noise $v(t)$ is assumed to be a realization from a zero mean white Gaussian stochastic process with variance $g_v^2$. Define $\mathbf{g}_s^T(t) = [\, g_s(t) \;\; 0 \;\; \ldots \;\; 0 \,]$ and $\mathbf{h}_s^T = [\, 1 \;\; 0 \;\; \ldots \;\; 0 \,]$. Then a state-space form is given by,

\[
\begin{aligned}
\mathbf{s}_p(t) &= \Phi_s(t)\, \mathbf{s}_p(t-1) + \mathbf{g}_s(t)\, u_s(t) \\
z(t) &= \mathbf{h}_s^T \mathbf{s}_p(t) + v(t)
\end{aligned} \tag{2}
\]

where $\mathbf{s}_p^T(t) = [\, s(t) \;\; s(t-1) \;\; \ldots \;\; s(t-p) \,]$. The signal state transition matrix $\Phi_s(t)$ is given by:

\[
\Phi_s(t) =
\begin{bmatrix}
-\alpha_1(t) & -\alpha_2(t) & \cdots & -\alpha_p(t) & 0 \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \ddots & \vdots & \vdots \\
\vdots & \ddots & \ddots & 0 & 0 \\
0 & \cdots & 0 & 1 & 0
\end{bmatrix}. \tag{3}
\]

3.1.2 Parameter model

Define the parameter vector $\boldsymbol{\theta}^T(t) = [\, \alpha_1(t) \;\; \alpha_2(t) \;\; \ldots \;\; \alpha_p(t) \,]$ and the innovation vector $\mathbf{u}_\theta^T(t) = [\, u_{\alpha_1}(t) \;\; u_{\alpha_2}(t) \;\; \ldots \;\; u_{\alpha_p}(t) \,]$ with the respective covariance matrix $Q_\theta(t) = E\{\mathbf{u}_\theta(t)\mathbf{u}_\theta^T(t)\}$. The parameter state-space equations are,

\[
\begin{aligned}
\boldsymbol{\theta}(t) &= \Phi_\theta\, \boldsymbol{\theta}(t-1) + \mathbf{u}_\theta(t) \\
z(t) &= \mathbf{h}_\theta^T(t)\, \boldsymbol{\theta}(t) + g_s(t)\, u_s(t) + v(t),
\end{aligned} \tag{4}
\]

where $\mathbf{h}_\theta(t) = -[\, s(t-1) \;\; s(t-2) \;\; \ldots \;\; s(t-p) \,]^T$, with the minus sign following from the AR model (1), and $\Phi_\theta = I_{p \times p}$ or very close to it.

3.1.3 Dual Scheme

On the one hand, assuming that the signal and all the noise parameters are known, which implies that $\Phi_s(t)$, $\mathbf{h}_s$ and $\mathbf{g}_s(t)$ are known, the optimal causal MMSE linear state estimate, which includes the desired speech signal $s(t)$, is obtained using the Kalman filtering equations. On the other hand, assuming the speech signal is known, i.e. $\mathbf{h}_\theta^T(t)$ is known, a Kalman filter for the parameter estimate might be applied. Since neither the signal nor the parameters are known, the dual scheme presented in Fig. 2 may be applied. In each time instant the AR parameters are estimated using the estimated speech signal and the speech signal is estimated using the current parameter estimate.

3.1.4 Speech Kalman Filter

Propagation equations:

\[
\begin{aligned}
\hat{\mathbf{s}}_p(t|t-1) &= \Phi_s\, \hat{\mathbf{s}}_p(t-1|t-1) \\
P(t|t-1) &= \Phi_s P(t-1|t-1) \Phi_s^T + \mathbf{g}_s \mathbf{g}_s^T
\end{aligned} \tag{5}
\]

Kalman gain:

\[
\mathbf{k}(t) = \frac{P(t|t-1)\, \mathbf{h}_s}{\mathbf{h}_s^T P(t|t-1)\, \mathbf{h}_s + g_v^2} \tag{6}
\]

Update equations:

\[
\begin{aligned}
\hat{\mathbf{s}}_p(t|t) &= \hat{\mathbf{s}}_p(t|t-1) + \mathbf{k}(t) \left[ z(t) - \mathbf{h}_s^T \hat{\mathbf{s}}_p(t|t-1) \right] \\
P(t|t) &= P(t|t-1) - \mathbf{k}(t) \left[ \mathbf{h}_s^T P(t|t-1)\, \mathbf{h}_s + g_v^2 \right] \mathbf{k}^T(t)
\end{aligned} \tag{7}
\]

3.1.5 Parameters Kalman Filter

Propagation equations:

\[
\begin{aligned}
\hat{\boldsymbol{\theta}}(t|t-1) &= \Phi_\theta\, \hat{\boldsymbol{\theta}}(t-1|t-1) \\
P_\theta(t|t-1) &= \Phi_\theta P_\theta(t-1|t-1) \Phi_\theta^T + Q_\theta
\end{aligned} \tag{8}
\]

Kalman gain:

\[
\mathbf{k}_\theta(t) = \frac{P_\theta(t|t-1)\, \mathbf{h}_\theta}{\mathbf{h}_\theta^T P_\theta(t|t-1)\, \mathbf{h}_\theta + g_s^2(t) + g_v^2} \tag{9}
\]

Update equations:

\[
\begin{aligned}
\hat{\boldsymbol{\theta}}(t|t) &= \hat{\boldsymbol{\theta}}(t|t-1) + \mathbf{k}_\theta(t) \left[ z(t) - \mathbf{h}_\theta^T \hat{\boldsymbol{\theta}}(t|t-1) \right] \\
P_\theta(t|t) &= P_\theta(t|t-1) - \mathbf{k}_\theta(t) \left[ \mathbf{h}_\theta^T P_\theta(t|t-1)\, \mathbf{h}_\theta + g_s^2(t) + g_v^2 \right] \mathbf{k}_\theta^T(t).
\end{aligned} \tag{10}
\]

The dual scheme suggested in Fig. 2 (top) is then used.

3.1.6 Joint Scheme

An augmented state vector of the speech and the parameters is constructed, $\mathbf{x}^T(t) = \left[\, \mathbf{s}_p^T(t) \;\; \boldsymbol{\theta}^T(t) \,\right]$. Then,

\[
\begin{aligned}
\mathbf{x}(t) &= \underbrace{\begin{bmatrix} \Phi_s(t) & 0 \\ 0 & \Phi_\theta \end{bmatrix} \mathbf{x}(t-1)}_{\text{nonlinearity}} + \begin{bmatrix} \mathbf{g}_s(t)\, u_s(t) \\ \mathbf{u}_\theta(t) \end{bmatrix} \\
z(t) &= [\, 1 \;\; 0 \;\; 0 \;\; \ldots \;\; 0 \,]\, \mathbf{x}(t) + v(t).
\end{aligned} \tag{11}
\]

This set of equations is nonlinear since it involves a multiplication of the speech state vector by the transition matrix, which is itself built from the parameter process. So, the joint scheme suggested in Fig. 2 (bottom) can be used.

3.1.7 Results

A time varying Gaussian AR process (4 coefficients) embedded in white Gaussian noise with an input SNR level of about 20 dB is processed by the joint Kalman scheme². The noise level is estimated during non-signal portions of the noisy signal. The tracking ability of the procedure is presented in Fig. 3. The performance with real speech signals is still to be determined.

² All simulations in this paper are implemented by modifying the code of R. van der Merwe et al. [4], written in Matlab.

[Figure 3 (plots). Two panels, "AR parameters" and "Gain", each plotted against "Samples".]

Figure 3: Tracking ability of the parameters of an AR process embedded in white noise.

3.2 Two Microphone Speech Dereverberation

In the two channel dereverberation problem a speech signal, modelled as an AR process, is filtered by an acoustical transfer function (ATF), modelled as an FIR filter. Noise is then added to the output, constructing the noisy and reverberated speech signals, as depicted in Fig. 4.

[Figure 4 (block diagram). The speech signal $s(t)$ is filtered by $A_1(e^{j\omega})$ and $A_2(e^{j\omega})$; the noise terms $g_1 e_1(t)$ and $g_2 e_2(t)$ are added to the filter outputs to give $z_1(t)$ and $z_2(t)$.]

Figure 4: Two channel dereverberation problem.

3.2.1 Signals Model

The reverberated and noisy signals presented in Fig. 4 are given by the following model,

\[
\begin{aligned}
s(t) &= -\sum_{k=1}^{p} \alpha_k\, s(t-k) + g_{u_s}(t)\, u_s(t) \\
z_1(t) &= \sum_{k=0}^{n_a-1} a_1(k)\, s(t-k) + g_1 e_1(t) \\
z_2(t) &= \sum_{k=0}^{n_a-1} a_2(k)\, s(t-k) + g_2 e_2(t).
\end{aligned} \tag{12}
\]

Thus, we have again a problem of estimating both the speech signal and the following model parameters,

\[
\boldsymbol{\theta}^T(t) = \left[\, \boldsymbol{\alpha}(t) \;\; g_{u_s}(t) \;\; \mathbf{a}_1(t) \;\; \mathbf{a}_2(t) \;\; g_1 \;\; g_2 \,\right].
\]

3.2.2 Joint Speech and Parameters Estimation

Define,

\[
\begin{aligned}
\mathbf{s}_{n_a}^T(t) &= [\, s(t) \;\; s(t-1) \;\; \cdots \;\; s(t-n_a+1) \,] \\
\mathbf{g}_s^T(t) &= [\, g_s(t) \;\; 0 \;\; \cdots \;\; 0 \,] \\
\mathbf{u}_\alpha^T(t) &= [\, u_{\alpha_1}(t) \;\; u_{\alpha_2}(t) \;\; \cdots \;\; u_{\alpha_p}(t) \,] \\
\mathbf{u}_{a_1}^T(t) &= [\, u_{a_1^1}(t) \;\; u_{a_1^2}(t) \;\; \cdots \;\; u_{a_1^{n_a}}(t) \,] \\
\mathbf{u}_{a_2}^T(t) &= [\, u_{a_2^1}(t) \;\; u_{a_2^2}(t) \;\; \cdots \;\; u_{a_2^{n_a}}(t) \,]
\end{aligned}
\]

and $\Phi_s(t)$ an $n_a \times n_a$ signal transition matrix having equivalent structure to the one presented in (3). Then, the augmented transition-measurement equations can be written as,

\[
\begin{aligned}
\begin{bmatrix} \mathbf{s}_{n_a}(t) \\ \boldsymbol{\alpha}(t) \\ \mathbf{a}_1(t) \\ \mathbf{a}_2(t) \end{bmatrix}
&= \underbrace{\begin{bmatrix} \Phi_s(t) & 0 & 0 & 0 \\ 0 & I_p & 0 & 0 \\ 0 & 0 & I_{n_a} & 0 \\ 0 & 0 & 0 & I_{n_a} \end{bmatrix}
\begin{bmatrix} \mathbf{s}_{n_a}(t-1) \\ \boldsymbol{\alpha}(t-1) \\ \mathbf{a}_1(t-1) \\ \mathbf{a}_2(t-1) \end{bmatrix}}_{\text{nonlinearity}}
+ \begin{bmatrix} \mathbf{g}_s u_s(t) \\ \mathbf{u}_\alpha(t) \\ \mathbf{u}_{a_1}(t) \\ \mathbf{u}_{a_2}(t) \end{bmatrix} \\
\begin{bmatrix} z_1(t) \\ z_2(t) \end{bmatrix}
&= \underbrace{\begin{bmatrix} \mathbf{a}_1^T(t) & 0 & 0 & 0 \\ \mathbf{a}_2^T(t) & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} \mathbf{s}_{n_a}(t) \\ \boldsymbol{\alpha}(t) \\ \mathbf{a}_1(t) \\ \mathbf{a}_2(t) \end{bmatrix}}_{\text{nonlinearity}}
+ \begin{bmatrix} g_1 e_1(t) \\ g_2 e_2(t) \end{bmatrix}
\end{aligned}
\]

which is a nonlinear set of equations, fitting the UKF framework.

3.2.3 Results

For a low level white noise signal, whose variance is estimated from signal-free segments, the tracking ability of the algorithm is presented in Fig. 5. It is worth mentioning that the presented problem is a very simple one: the order of the AR process is 1 and the filters $\mathbf{a}_1$, $\mathbf{a}_2$ are 3 taps long. The SNR value is very high. Even in this simple case convergence is not guaranteed.

[Figure 5 (plots). Four panels, "AR parameters", "Gain", "A1 coeff." and "A2 coeff.", each plotted against "Samples".]

Figure 5: Tracking ability of the parameters of the dereverberation problem.

4 DISCUSSION

In this paper we applied the newly proposed UKF to two speech processing problems. Results show that the method is applicable to the problems at hand. Nevertheless, for a comprehensive test, it should be further applied to real speech signals embedded in higher noise levels. Performance limitations and optimality issues of the suggested method are under current research.

References

[1] S. Julier, J. Uhlmann and H.F. Durrant-Whyte, "A New Method for the Nonlinear Transformation of Means and Covariances in Filters and Estimators," IEEE Trans. on Automatic Control, vol. 45, no. 3, pp. 477–482, Mar. 2000.

[2] E. A. Wan and R. van der Merwe, "The Unscented Kalman Filter for Nonlinear Estimation," in Symposium 2000 on Adaptive Systems for Signal Processing, Communication and Control (AS-SPCC), Lake Louise, Alberta, Canada, Oct. 2000, IEEE.

[3] S. Gannot, D. Burshtein, and E. Weinstein, "Iterative and Sequential Kalman Filter-Based Speech Enhancement Algorithms," IEEE Trans. on Speech and Audio Proc., vol. 6, no. 4, pp. 373–385, Jul. 1998.

[4] R. van der Merwe, "Matlab code," /users/sista/sgannot/matlab/Ukf_W/, May 2001.
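
Illustrative sketch of the unscented transform

The four-step procedure of Section 2.1 (construct the sigma points, transform them, then form the weighted mean and the weighted outer product) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions and is not the Matlab code used for the simulations: the function names `cholesky` and `unscented_transform` are hypothetical, the matrix square root is taken to be the lower-triangular Cholesky factor (the paper only says "row or column of the matrix square root"), and plain Python lists are used so that the sketch has no dependencies. The defaults follow the values quoted in the text ($\alpha = 1$, $\kappa = 0$, $\beta = 2$).

```python
import math

def cholesky(A):
    """Lower-triangular C with C C^T = A, for a symmetric positive-definite A."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = A[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            C[i][j] = math.sqrt(acc) if i == j else acc / C[j][j]
    return C

def unscented_transform(f, x_mean, P_xx, alpha=1.0, beta=2.0, kappa=0.0):
    """Approximate the mean and covariance of y = f(x) from 2L + 1 sigma points."""
    L = len(x_mean)
    lam = alpha ** 2 * (L + kappa) - L
    # Assumption: the columns of the Cholesky factor of (L + lam) * P_xx serve
    # as the columns of the matrix square root.
    S = cholesky([[(L + lam) * p for p in row] for row in P_xx])
    # Step 1: sigma points X_0, then X_1..X_L (plus), then X_{L+1}..X_{2L} (minus).
    sigma = [list(x_mean)]
    for sign in (1.0, -1.0):
        for l in range(L):
            sigma.append([x_mean[i] + sign * S[i][l] for i in range(L)])
    # Weights for the mean (m) and for the covariance (c).
    w_m = [lam / (L + lam)] + [0.5 / (L + lam)] * (2 * L)
    w_c = [lam / (L + lam) + (1.0 - alpha ** 2 + beta)] + w_m[1:]
    # Step 2: transform each sigma point.
    Y = [f(point) for point in sigma]
    M = len(Y[0])
    # Step 3: weighted average.
    y_mean = [sum(w * y[i] for w, y in zip(w_m, Y)) for i in range(M)]
    # Step 4: weighted outer product.
    P_yy = [[sum(w * (y[i] - y_mean[i]) * (y[j] - y_mean[j])
                 for w, y in zip(w_c, Y))
             for j in range(M)] for i in range(M)]
    return y_mean, P_yy

# Hypothetical example values: a state [s, a] pushed through the product -a * s,
# the kind of signal-times-parameter term that makes the joint scheme nonlinear.
mean, cov = unscented_transform(lambda x: [-x[1] * x[0]],
                                [1.0, 0.5], [[0.04, 0.0], [0.0, 0.01]])
print(mean, cov)
```

For a linear map the sketch reproduces the exact mean and covariance, which is a convenient sanity check before applying it to the product nonlinearities of (11). A full UKF step would additionally propagate the measurement sigma points and form the gain from $P_{xz}$ and $P_{zz}$ as outlined in Fig. 1.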