International Workshop on Acoustic Echo and Noise Control (IWAENC2003), Sept. 2003, Kyoto, Japan

                                              Sharon Gannot
              Faculty of Electrical Engineering, Technion, Technion City, 32000 Haifa, Israel
                               e-mail: gannot@siglab.technion.ac.il
                                               Marc Moonen
               Dept. of Elect. Eng. (ESAT-SISTA), K.U.Leuven, B-3001 Leuven, Belgium
                             e-mail: Marc.Moonen@esat.kuleuven.ac.be

ABSTRACT

In a series of recent studies a new approach for applying the Kalman filter to nonlinear systems, referred to as the unscented Kalman filter (UKF), was proposed. In this contribution¹ we apply the UKF to several speech processing problems, in which a model with unknown parameters is assumed for the measured signals. We show that the nonlinearity arises naturally in these problems. Preliminary simulation results for artificial signals demonstrate the potential of the method.

¹ This research work was carried out at the ESAT laboratory of the Katholieke Universiteit Leuven, in the frame of the Interuniversity Attraction Pole IUAP P4-02, Modeling, Identification, Simulation and Control of Complex Systems, the Concerted Research Action Mathematical Engineering Techniques for Information and Communication Systems (GOA-MEFISTO-666) of the Flemish Government and the IT-project Multi-microphone Signal Enhancement Techniques for handsfree telephony and voice controlled systems (MUSETTE-2) of the I.W.T., and was partially sponsored by Philips-ITCL.

1   INTRODUCTION

The recently proposed unscented transform (UT), first suggested by Julier et al. [1], is a method for calculating the statistics of a random variable undergoing a nonlinear transformation. This method was used to generalize the Kalman filter to nonlinear systems by Julier et al. [1] and was further extended by Wan et al. [2] to problems where both signals and parameters are jointly estimated. In [2] (and other contributions) the nonlinearity arises from the parameter production model.

In this contribution we further apply the UKF to several speech processing problems, namely single microphone speech enhancement and multi-microphone speech dereverberation. We show that in these applications the nonlinearity arises naturally, due to the multiplication of signals and parameters, if both are given a dynamic model. The technique is demonstrated by several simple examples.

In Section 2 the unscented transform and its application to the nonlinear Kalman filter are reviewed. Sections 3.1 and 3.2 discuss the application of the method to the problems of single microphone speech enhancement and two microphone speech dereverberation, respectively. We draw some conclusions and discuss some further directions in Section 4.

2   PRELIMINARIES

2.1 The Unscented Transform (UT)

Let $x$ be an $L$-dimensional random vector with mean $\bar{x}$ and covariance matrix $P_{xx}$. Let $y = f(x)$ be a nonlinear transformation from the random vector $x$ to another random vector $y$. The first and second order statistics of the vector $y$ should be calculated. We briefly summarize the method. The mean and covariance of $x$ are represented by $2L + 1$ points and weights:

\[
\begin{aligned}
\mathcal{X}_0 &= \bar{x} \\
\mathcal{X}_l &= \bar{x} + \left(\sqrt{(L+\lambda)P_{xx}}\right)_l ; \quad l = 1, \ldots, L \\
\mathcal{X}_{l+L} &= \bar{x} - \left(\sqrt{(L+\lambda)P_{xx}}\right)_l ; \quad l = 1, \ldots, L \\
W_0^{(m)} &= \lambda/(L+\lambda) \\
W_0^{(c)} &= \lambda/(L+\lambda) + (1 - \alpha^2 + \beta) \\
W_l^{(m)} &= W_l^{(c)} = \frac{1}{2(L+\lambda)} ; \quad l = 1, 2, \ldots, 2L
\end{aligned}
\]

where $\left(\sqrt{(L+\lambda)P_{xx}}\right)_l$ is the $l$-th row or column of the corresponding matrix square root, and $\lambda = \alpha^2(L+\kappa) - L$. $\alpha$ determines the spread of the sigma points; $\alpha = 1$ was used throughout our simulations. $\kappa$ is a secondary scaling parameter. The choice $\kappa = 3 - L$ maintains the kurtosis of a Gaussian vector. Throughout our simulations $\kappa$ is set to 0. $\beta$ is used to incorporate prior knowledge of the distribution ($\beta = 2$ for Gaussian distributions). A proper choice of these parameters and its influence on the obtainable performance is still an open topic. The mean and covariance of the vector $y$ are calculated using the following procedure:

   1. Construct the sigma points $\mathcal{X}_l$, $l = 0, \ldots, 2L$.
   2. Transform each point: $\mathcal{Y}_l = f(\mathcal{X}_l)$, $l = 0, \ldots, 2L$.
   3. Mean: use weighted averaging, $\bar{y} \approx \sum_{l=0}^{2L} W_l^{(m)} \mathcal{Y}_l$.
   4. Covariance: use a weighted outer product, $P_{yy} \approx \sum_{l=0}^{2L} W_l^{(c)} (\mathcal{Y}_l - \bar{y})(\mathcal{Y}_l - \bar{y})^T$.

The benefits of using the UT are presented in [1] and [2].

2.2 The Application of the Unscented Transform to the Nonlinear Kalman Filtering Problem

The Kalman filter is a recursive and causal solution for minimum mean square error (MMSE) state estimation in the Gaussian and linear case. The Kalman equations are formulated in state-space notation and consist of two stages: a propagation stage, in which the mean and a priori covariance of the respective state are predicted based on the system dynamics and on the estimate at the previous time instant, and an update stage, in which this prediction is optimally weighted with the new measurement. The error covariance, interpreted as the amount of confidence we have in the estimate, is propagated in a similar fashion.
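The unscented transform of Section 2.1 can be sketched in a few lines of NumPy. This is only an illustrative sketch and not the code used for the simulations in this paper: the function name `unscented_transform`, its argument names and the use of a Cholesky factor as the matrix square root (taking its columns) are assumptions made here. The defaults follow the settings stated in the text ($\alpha = 1$, $\kappa = 0$, $\beta = 2$).

```python
import numpy as np

def unscented_transform(f, x_mean, P_xx, alpha=1.0, beta=2.0, kappa=0.0):
    """Approximate the mean and covariance of y = f(x) with 2L + 1 sigma points.

    Hypothetical helper: name and signature are illustrative assumptions.
    """
    x_mean = np.asarray(x_mean, dtype=float)
    P_xx = np.asarray(P_xx, dtype=float)
    L = x_mean.size
    lam = alpha**2 * (L + kappa) - L

    # Matrix square root (assumption: lower Cholesky factor, columns used).
    S = np.linalg.cholesky((L + lam) * P_xx)

    # Rows are the sigma points X_0, X_1..X_L, X_{L+1}..X_{2L}.
    sigma = np.vstack([x_mean, x_mean + S.T, x_mean - S.T])

    # Mean weights W^(m) and covariance weights W^(c).
    w_m = np.full(2 * L + 1, 1.0 / (2.0 * (L + lam)))
    w_c = w_m.copy()
    w_m[0] = lam / (L + lam)
    w_c[0] = lam / (L + lam) + (1.0 - alpha**2 + beta)

    # Transform each sigma point, then form the weighted mean and covariance.
    Y = np.array([np.atleast_1d(f(p)) for p in sigma])
    y_mean = w_m @ Y
    d = Y - y_mean
    P_yy = (w_c[:, None] * d).T @ d
    return y_mean, P_yy
```

For a linear map $f(x) = Ax + b$ the sketch reproduces the exact mean $A\bar{x} + b$ and covariance $A P_{xx} A^T$, which is a convenient sanity check before applying it to a nonlinear $f$.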

When the system dynamics and the measurement equation are linear, all the calculations involved are straightforward. The situation is more complex when the equations involved are nonlinear. In this case, a method for propagating the mean and covariance through the nonlinearities is needed.

Let $s(t)$ and $\theta(t)$ be a signal state-space vector and a parameter vector, respectively. $u(t)$ and $v(t)$ are innovation and measurement noise sequences, respectively. Define also the augmented state vector $x^T(t) = [\, s^T(t) \;\; \theta^T(t) \,]$. The nonlinear transition and measurement equations are given by

\[
\begin{aligned}
x(t) &= \Phi\big(x(t-1), u(t)\big) \\
z(t) &= h\big(x(t), v(t)\big).
\end{aligned}
\]

In the past the extended Kalman filter (EKF), based on linearization of the equations, was used. This method can be quite complex, as it involves the calculation of derivatives, and yet not accurate enough, as only a first-order approximation is applied.

A better method, proposed in [1], is to use the previously mentioned unscented transform in order to propagate the mean and covariance through the nonlinearities. Fig. 1 summarizes the steps involved in the unscented Kalman filter (UKF). The method consists of calculating the mean and covariance of the augmented state vectors undergoing a known nonlinear transform by virtue of the unscented transform. The complexity of the suggested method is quite low as only an increase of dimensions by a factor of $2L + 1$ is required.

[Figure 1, block diagram. (a) The UT maps $s(t-1|t-1)$, $P_s(t-1|t-1)$, $\theta(t-1|t-1)$ and $P_\theta(t-1|t-1)$ to the current sigma points $\mathcal{X}(t-1|t-1)$. (b) The nonlinear system dynamics and measurement $\{\Phi, h\}$ map the current sigma points to the predicted signal and measurement sigma points $\mathcal{X}(t|t-1)$ and $\mathcal{Z}(t|t-1)$. (c) The inverse UT maps $\mathcal{X}(t|t-1)$ and $\mathcal{Z}(t|t-1)$ to $x(t|t-1)$, $P_x(t|t-1)$, $\hat{z}(t)$, $P_{xz}(t)$ and $P_{zz}(t)$. (d) Optimal weighting with $K(t) = P_{xz}(t)P_{zz}^{-1}(t)$ combines the predicted signal, error covariance and measurement with $z$ to give the new signal estimate $\hat{x}(t|t) \to \hat{s}(t|t), \hat{\theta}(t|t)$ and error covariance $P_x(t|t) \to P_s(t|t), P_\theta(t|t)$.]

Figure 1: Unscented Kalman filter: (a) Unscented transform. (b) Propagation equations. (c) Inverse unscented transform. (d) Update equations.

3   APPLICATION TO SPEECH PROCESSING

In many model-based problems in speech processing (e.g. single microphone speech enhancement, multi-microphone speech enhancement and dereverberation) a problem of estimating both the speech signal and various parameters arises. This problem can be addressed in two ways. In the first, referred to by Wan et al. [2] as dual estimation, a two step approach is taken. At each time instant a Kalman filtering step for the signal is applied based on the current estimate of the parameters. In parallel a parameter estimation step is applied based on the current signal state estimate. The parameter estimation might be conducted using recursive methods such as RLS or LMS. Alternatively, under the Bayesian framework, the parameters can be given a dynamic model and the Kalman filter can be applied. This approach will be used throughout this work. The dual estimation method can be seen as a sequential variant of the estimate-maximize (EM) procedure, but no claims of optimality are valid. A discussion of the subject can be found in [3]. The method is summarized in Fig. 2 (top).

The same problem can be reformulated into a joint estimation problem. Note that most operations involve multiplications of the parameter and state vectors. Thus, the problem of joint estimation of the speech and the parameters becomes nonlinear if both are modelled as stochastic processes. We remark that, as this nonlinearity is separable, this formulation might lead to the same performance as the dual scheme. This subject is still under investigation. The approach of jointly estimating the speech signal and its parameters is summarized in Fig. 2 (bottom).

[Figure 2, block diagram. Top: $\hat{s}(t-1|t-1)$, $z(t)$ and $\hat{\theta}(t-1|t-1)$ feed a speech Kalman filter and a parameters Kalman filter. Bottom: $\hat{s}(t-1|t-1)$ and $\hat{\theta}(t-1|t-1)$ feed a single speech + parameters Kalman filter.]

Figure 2: Dual (top) and joint (bottom) estimation procedures.

3.1 Single Microphone Speech Enhancement

The problem of single-microphone speech enhancement has been extensively studied. Specifically, the use of the Kalman filter for estimating both the signal and the parameters is presented by Gannot et al. [3]. By assuming an AR model for the speech signal and giving a dynamic model to the AR parameters, both dual and joint schemes can be formulated. Each of the two steps comprising the dual scheme is linear, while the joint scheme consists of a single nonlinear step.

3.1.1 Signals Model

Let the signal measured by the microphone be given by $z(t) = s(t) + v(t)$, where $s(t)$ represents the sampled speech
signal and $v(t)$ represents an additive background noise. We shall assume a time varying AR model for the speech signal,

\[
s(t) = -\sum_{k=1}^{p} \alpha_k(t)\, s(t-k) + g_s(t)\, u_s(t) \tag{1}
\]

where the excitation $u_s(t)$ is a normalized (zero mean, unit variance) white noise, $g_s(t)$ represents the innovation gain, and $\alpha_1(t), \alpha_2(t), \ldots, \alpha_p(t)$ are the AR coefficients. The additive noise $v(t)$ is assumed to be a realization of a zero mean white Gaussian stochastic process with variance $g_v^2$. Define $g_s^T(t) = [\, g_s(t) \;\; 0 \;\; \ldots \;\; 0 \,]$ and $h_s^T = [\, 1 \;\; 0 \;\; \ldots \;\; 0 \,]$. Then a state-space form is given by

\[
\begin{aligned}
s_p(t) &= \Phi_s(t)\, s_p(t-1) + g_s(t)\, u_s(t) \\
z(t) &= h_s^T s_p(t) + v(t)
\end{aligned} \tag{2}
\]

where $s_p^T(t) = [\, s(t) \;\; s(t-1) \;\; \ldots \;\; s(t-p) \,]$. The signal state transition matrix $\Phi_s(t)$ is given by

\[
\Phi_s(t) = \begin{bmatrix}
-\alpha_1(t) & -\alpha_2(t) & \cdots & \cdots & -\alpha_p(t) & 0 \\
1 & 0 & 0 & \cdots & \cdots & 0 \\
0 & 1 & 0 & \cdots & \cdots & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & \cdots & \cdots & \cdots & 1 & 0
\end{bmatrix}. \tag{3}
\]

3.1.2 Parameter model

Define the parameter vector $\theta^T(t) = [\, \alpha_1(t) \;\; \alpha_2(t) \;\; \ldots \;\; \alpha_p(t) \,]$ and the innovation vector $u_\theta^T(t) = [\, u_{\alpha_1}(t) \;\; u_{\alpha_2}(t) \;\; \ldots \;\; u_{\alpha_p}(t) \,]$ with the respective covariance matrix $Q_\theta(t) = E\{u_\theta(t)\, u_\theta^T(t)\}$. The parameter state-space equations are

\[
\begin{aligned}
\theta(t) &= \Phi_\theta\, \theta(t-1) + u_\theta(t) \\
z(t) &= h_\theta^T(t)\, \theta(t) + g_s(t)\, u_s(t) + v(t),
\end{aligned} \tag{4}
\]

where $h_\theta^T(t) = -[\, s(t-1) \;\; s(t-2) \;\; \ldots \;\; s(t-p) \,]$ (the sign follows from the AR model (1)) and $\Phi_\theta = I_{p \times p}$ or very close to it.

3.1.3 Dual Scheme

On the one hand, assuming that all the signal and noise parameters are known, which implies that $\Phi_s(t)$, $h_s$ and $g_s(t)$ are known, the optimal causal MMSE linear state estimate, which includes the desired speech signal $s(t)$, is obtained using the Kalman filtering equations. On the other hand, assuming the speech signal is known, i.e. $h_\theta^T(t)$ is known, a Kalman filter for the parameter estimate might be applied. Since neither the signal nor the parameters are known, the dual scheme presented in Fig. 2 may be applied. At each time instant the AR parameters are estimated using the estimated speech signal and the speech signal is estimated using the current parameter estimate.

3.1.4 Speech Kalman Filter

Propagation equations:

\[
\begin{aligned}
\hat{s}_p(t|t-1) &= \Phi_s\, \hat{s}_p(t-1|t-1) \\
P(t|t-1) &= \Phi_s P(t-1|t-1)\, \Phi_s^T + g_s g_s^T
\end{aligned} \tag{5}
\]

Kalman gain:

\[
k(t) = \frac{P(t|t-1)\, h_s}{h_s^T P(t|t-1)\, h_s + g_v^2} \tag{6}
\]

Update equations:

\[
\begin{aligned}
\hat{s}_p(t|t) &= \hat{s}_p(t|t-1) + k(t) \left[ z(t) - h_s^T \hat{s}_p(t|t-1) \right] \\
P(t|t) &= P(t|t-1) - k(t) \left[ h_s^T P(t|t-1)\, h_s + g_v^2 \right] k^T(t)
\end{aligned} \tag{7}
\]

3.1.5 Parameters Kalman Filter

Propagation equations:

\[
\begin{aligned}
\hat{\theta}(t|t-1) &= \Phi_\theta\, \hat{\theta}(t-1|t-1) \\
P_\theta(t|t-1) &= \Phi_\theta P_\theta(t-1|t-1)\, \Phi_\theta^T + Q_\theta
\end{aligned} \tag{8}
\]

Kalman gain:

\[
k_\theta(t) = \frac{P_\theta(t|t-1)\, h_\theta}{h_\theta^T P_\theta(t|t-1)\, h_\theta + g_s^2(t) + g_v^2} \tag{9}
\]

Update equations:

\[
\begin{aligned}
\hat{\theta}(t|t) &= \hat{\theta}(t|t-1) + k_\theta(t) \left[ z(t) - h_\theta^T \hat{\theta}(t|t-1) \right] \\
P_\theta(t|t) &= P_\theta(t|t-1) - k_\theta(t) \left[ h_\theta^T P_\theta(t|t-1)\, h_\theta + g_s^2(t) + g_v^2 \right] k_\theta^T(t).
\end{aligned} \tag{10}
\]

The dual scheme suggested in Fig. 2 (top) is then used.

3.1.6 Joint Scheme

An augmented state vector of the speech and the parameters is constructed, $x^T(t) = [\, s_p^T(t) \;\; \theta^T(t) \,]$. Then,

\[
\begin{aligned}
x(t) &= \underbrace{\begin{bmatrix} \Phi_s(t) & 0 \\ 0 & \Phi_\theta \end{bmatrix} x(t-1)}_{\text{nonlinearity}} + \begin{bmatrix} g_s(t)\, u_s(t) \\ u_\theta(t) \end{bmatrix} \\
z(t) &= [\, 1 \;\; 0 \;\; 0 \;\; \ldots \;\; 0 \,]\, x(t) + v(t).
\end{aligned}
\]

This set of equations is nonlinear since it involves a multiplication of the speech state vector by the transition matrix, which is itself composed of the parameter process. So, the joint scheme suggested in Fig. 2 (bottom) can be used.

3.1.7 Results

A time varying Gaussian AR process (4 coefficients) embedded in white Gaussian noise with an input SNR level of about 20 dB is processed by the joint Kalman scheme². The noise level is estimated during non-signal portions of the noisy signal. The tracking ability of the procedure is presented in Fig. 3. The performance with real speech signals is still to be determined.

[Figure 3 (plots, including a panel titled "AR parameters") is not reproduced here.]

Figure 3: Tracking ability of the parameters of an AR process embedded in white noise.

² All simulations in this paper are implemented by modifying the code of R. van der Merwe et al. [4], written in Matlab.
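The speech Kalman filter recursion (5)-(7) and the noise-free part of the joint-scheme transition of Section 3.1.6 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the Matlab code used for the simulations: the function names `companion`, `speech_kalman_step` and `joint_transition` are hypothetical, the noise variance $g_v^2$ is passed as `g_v2`, and $\Phi_\theta = I$ is assumed in the joint transition.

```python
import numpy as np

def companion(alpha):
    """Transition matrix Phi_s of eq. (3), size (p+1) x (p+1)."""
    alpha = np.asarray(alpha, dtype=float)
    p = alpha.size
    Phi = np.zeros((p + 1, p + 1))
    Phi[0, :p] = -alpha          # first row: -alpha_1 ... -alpha_p, 0
    Phi[1:, :p] = np.eye(p)      # shift the past samples down by one
    return Phi

def speech_kalman_step(s_est, P, z, alpha, g_s, g_v2):
    """One propagation/update step of eqs. (5)-(7) for known AR parameters."""
    s_est = np.asarray(s_est, dtype=float)
    Phi = companion(alpha)
    n = s_est.size
    h = np.zeros(n); h[0] = 1.0      # h_s = [1 0 ... 0]^T
    g = np.zeros(n); g[0] = g_s      # g_s = [g_s 0 ... 0]^T

    s_pred = Phi @ s_est                                  # eq. (5)
    P_pred = Phi @ P @ Phi.T + np.outer(g, g)
    innov_var = h @ P_pred @ h + g_v2
    k = P_pred @ h / innov_var                            # eq. (6)
    s_new = s_pred + k * (z - h @ s_pred)                 # eq. (7)
    P_new = P_pred - innov_var * np.outer(k, k)
    return s_new, P_new

def joint_transition(x, p):
    """Noise-free joint transition; x = [s_p (p+1 entries), theta (p entries)].

    Assumption: Phi_theta = I, so the parameters are carried over unchanged.
    """
    x = np.asarray(x, dtype=float)
    s, theta = x[:p + 1], x[p + 1:]
    return np.concatenate([companion(theta) @ s, theta])
```

Because `joint_transition` multiplies the speech state by a matrix built from the parameter part of the same vector, scaling the augmented state by 2 does not scale the output by 2. This is the nonlinearity that makes the joint scheme a candidate for the UKF, whereas `speech_kalman_step` is linear once the parameters are fixed.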

3.2 Two Microphone Speech Dereverberation

In the two channel dereverberation problem a speech signal, modelled as an AR process, is filtered by an acoustical transfer function (ATF), modelled as an FIR filter. Noise is then added to the output, producing the noisy and reverberated speech signals, as depicted in Fig. 4.

[Figure 4, block diagram: the speech signal $s(t)$ is filtered by $A_1(e^{j\omega})$ and $A_2(e^{j\omega})$; the noise terms $g_1 e_1(t)$ and $g_2 e_2(t)$ are added to the respective outputs to give $z_1(t)$ and $z_2(t)$.]

Figure 4: Two channel dereverberation problem.

3.2.1 Signals Model

The reverberated and noisy signals presented in Fig. 4 are given by the following model,

\[
\begin{aligned}
s(t) &= -\sum_{k=1}^{p} \alpha_k\, s(t-k) + g_{u_s}(t)\, u_s(t) \\
z_1(t) &= \sum_{k=0}^{n_a-1} a_1(k)\, s(t-k) + g_1 e_1(t) \\
z_2(t) &= \sum_{k=0}^{n_a-1} a_2(k)\, s(t-k) + g_2 e_2(t).
\end{aligned} \tag{12}
\]

Thus, we have again a problem of estimating both the speech signal and the following model parameters,

\[
\Theta^T(t) = [\, \theta^T(t) \;\; g_{u_s}(t) \;\; a_1^T(t) \;\; a_2^T(t) \;\; g_1 \;\; g_2 \,].
\]

3.2.2 Joint Speech and Parameters Estimation

Define

\[
\begin{aligned}
s_{n_a}^T(t) &= [\, s(t) \;\; s(t-1) \;\; \cdots \;\; s(t-n_a+1) \,] \\
g_s^T(t) &= [\, g_s(t) \;\; 0 \;\; \cdots \;\; 0 \,] \\
u_\theta^T(t) &= [\, u_{\alpha_1}(t) \;\; u_{\alpha_2}(t) \;\; \cdots \;\; u_{\alpha_p}(t) \,] \\
u_{a_1}^T(t) &= [\, u_{a_1^1}(t) \;\; u_{a_1^2}(t) \;\; \cdots \;\; u_{a_1^{n_a}}(t) \,] \\
u_{a_2}^T(t) &= [\, u_{a_2^1}(t) \;\; u_{a_2^2}(t) \;\; \cdots \;\; u_{a_2^{n_a}}(t) \,]
\end{aligned}
\]

and $\Phi_s(t)$ an $n_a \times n_a$ signal transition matrix having a structure equivalent to the one presented in (3). Then, the augmented transition-measurement equations can be written as

\[
\begin{bmatrix} s_{n_a}(t) \\ \theta(t) \\ a_1(t) \\ a_2(t) \end{bmatrix}
=
\begin{bmatrix}
\Phi_s(t) & 0 & 0 & 0 \\
0 & I_p & 0 & 0 \\
0 & 0 & I_{n_a} & 0 \\
0 & 0 & 0 & I_{n_a}
\end{bmatrix}
\begin{bmatrix} s_{n_a}(t-1) \\ \theta(t-1) \\ a_1(t-1) \\ a_2(t-1) \end{bmatrix}
+
\begin{bmatrix} g_s u_s(t) \\ u_\theta(t) \\ u_{a_1}(t) \\ u_{a_2}(t) \end{bmatrix}
\]

\[
\begin{bmatrix} z_1(t) \\ z_2(t) \end{bmatrix}
=
\begin{bmatrix}
a_1^T(t) & 0 & 0 & 0 \\
a_2^T(t) & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} s_{n_a}(t) \\ \theta(t) \\ a_1(t) \\ a_2(t) \end{bmatrix}
+
\begin{bmatrix} g_1 e_1(t) \\ g_2 e_2(t) \end{bmatrix}
\]

which is a nonlinear set of equations, fitting the UKF framework.

3.2.3 Results

For a low level white noise signal, whose variance is estimated from signal-free segments, the tracking ability of the algorithm is presented in Fig. 5. It is worth mentioning that the presented problem is a very simple one: the order of the AR process is 1 and the filters $a_1$, $a_2$ are 3 taps long. The SNR value is very high. Even in this simple case convergence is not guaranteed.

[Figure 5, four panels: "AR parameters", "Gain", "A1 coeff." and "A2 coeff.", each plotted against samples.]

Figure 5: Tracking ability of the parameters of the dereverberation problem.

4   DISCUSSION

In this paper we applied the newly proposed UKF to two speech processing problems. Results show that the method is applicable to the problems at hand. Nevertheless, for a comprehensive test, it should be further applied to real speech signals embedded in higher noise levels. Performance limitations and optimality issues of the suggested method are currently under research.

REFERENCES

[1] S. Julier, J. Uhlmann, and H. F. Durrant-Whyte, “A New Method for the Nonlinear Transformation of Means and Covariances in Filters and Estimators,” IEEE Trans. on Automatic Control, vol. 45, no. 3, pp. 477–482, Mar. 2000.
[2] E. A. Wan and R. van der Merwe, “The Unscented Kalman Filter for Nonlinear Estimation,” in Symposium 2000 on Adaptive Systems for Signal Processing, Communication and Control (AS-SPCC), Lake Louise, Alberta, Canada, Oct. 2000, IEEE.
[3] S. Gannot, D. Burshtein, and E. Weinstein, “Iterative and Sequential Kalman Filter-Based Speech Enhancement Algorithms,” IEEE Trans. on Speech and Audio Proc., vol. 6, no. 4, pp. 373–385, Jul. 1998.
[4] R. van der Merwe, “Matlab code,” /users/sista/sgannot/matlab/Ukf_W/, May 2001.

