Vocal Tract Normalization Equals Linear Transformation in Cepstral Space

Michael Pitz, Sirko Molau, Ralf Schlüter, Hermann Ney

Lehrstuhl für Informatik VI, Computer Science Department,
RWTH Aachen – University of Technology, 52056 Aachen, Germany

Abstract

We show that vocal tract normalization (VTN) frequency warping results in a linear transformation in the cepstral domain. For the special case of a piece-wise linear warping function, the transformation matrix is calculated analytically. This approach enables us to compute the Jacobian determinant of the transformation matrix, which allows the normalization of the probability distributions used in speaker normalization for automatic speech recognition.

1. Introduction

Vocal tract normalization (VTN) tries to compensate for the effect of speaker-dependent vocal tract lengths by warping the frequency axis of the power spectrum [2, 5, 3, 9, 10]:

$$ g_\alpha : [0, \pi] \to [0, \pi], \qquad \omega \mapsto \tilde{\omega} = g_\alpha(\omega) \qquad (1) $$

The warping function $g_\alpha$ is assumed to be invertible, i.e. strictly monotonic and continuous (see Figure 1).

Figure 1: Example of VTN warping functions $g_\alpha$ for different values of $\alpha$ (frequency axes run up to $\pi$; the curve for $\alpha > 1$ is labelled).

The relationship between VTN frequency warping and linear transformations in the cepstral domain has been studied before [1, p.199],[4]. However, these investigations were based on special assumptions:
    • The VTN frequency warping is restricted to a bilinear transformation [1, p.119],[4].
    • The cepstral representation is based on an all-pass or LPC model [4].

In contrast, we will show that there is a general equivalence of VTN frequency warping and a linear transformation of the cepstral vector, independent of these assumptions. A related result has been reported in [6] in the context of spectral distortion measures.

The remainder of the paper is organized as follows: In Section 2 we show that VTN amounts to a linear transformation of the acoustic vector. The transformation matrices for the cases of linear and piece-wise linear warping are derived analytically in Section 3, followed in Section 4 by some examples obtained by warping a given spectrum with our approach. In Section 5 we discuss the implications for the normalization of probability distributions when transforming the random variables. The paper is summarized in Section 6.

2. Cepstral Representation of VTN Frequency Warping

We consider cepstral coefficients $c_k$ defined by:

$$ c_k = \begin{cases} \dfrac{1}{2\pi} \displaystyle\int_0^{\pi} d\omega \, \log |X(\omega)|^2 & : k = 0 \\[2ex] \dfrac{1}{\pi} \displaystyle\int_0^{\pi} d\omega \, \cos(\omega k) \, \log |X(\omega)|^2 & : 0 < k \le K \end{cases} \qquad (2) $$

where $\omega$ may either denote the true physical or the Mel frequency scale. Note that the conventional definition of $c_0$ differs by a factor of 2.

The $n$-th cepstral coefficient of the warped spectrum is

$$ c_n(\alpha) = \frac{1}{\pi} \int_0^{\pi} d\tilde{\omega} \, \log \left| X\!\left( g_\alpha^{(-1)}(\tilde{\omega}) \right) \right|^2 \cdot \cos(\tilde{\omega} n) $$

In order to obtain the value of the warped power spectrum for a given frequency, we access the unwarped spectrum at the frequency determined by the inverse warping function. This is necessary as in practice only the discrete unwarped spectrum is given. Explicit spectral interpolation for warping is avoided this way.

Now we expand the spectrum $\log | X( g_\alpha^{(-1)}(\tilde{\omega}) ) |^2$ in a Fourier series:

$$ \log |X(\omega)|^2 = 2 \sum_{k=0}^{K} c_k \cos(\omega k) $$

where $c_k$ denotes the $k$-th cepstral coefficient of the unwarped spectrum. Interchanging integration and summation yields:
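$$ c_n(\alpha) = \frac{2}{\pi} \int_0^{\pi} d\tilde{\omega} \, \cos(\tilde{\omega} n) \sum_{k=0}^{K} c_k \cos\!\left( g_\alpha^{(-1)}(\tilde{\omega}) \, k \right) $$
$$ \phantom{c_n(\alpha)} = \sum_{k=0}^{K} c_k \, \frac{2}{\pi} \int_0^{\pi} d\tilde{\omega} \, \cos(\tilde{\omega} n) \cos\!\left( g_\alpha^{(-1)}(\tilde{\omega}) \, k \right) $$
$$ \phantom{c_n(\alpha)} = \sum_{k=0}^{K} A_{nk}(\alpha) \, c_k $$

with

$$ A_{nk}(\alpha) = \frac{2}{\pi} \int_0^{\pi} d\tilde{\omega} \, \cos(\tilde{\omega} n) \cos\!\left( g_\alpha^{(-1)}(\tilde{\omega}) \, k \right) \qquad (3) $$

Thus, the vector of warped cepstral coefficients is a linear transformation of the original cepstral coefficients with a transformation matrix $A(\alpha)$ of dimension $N \times K$. In the case of continuous spectra there may be no upper limit for $N$ and $K$. In practice, however, we work with discrete spectra. Hence, $N$ and $K$ will be finite, but they do not necessarily have the same value. Choosing a smaller value of $N$ results in a smoothing of the power spectrum and eliminates the pitch.

3. Analytic Calculation of the Transformation Matrix

3.1. Linear Warping Function

In order to apply a piece-wise linear warping, we first compute the solution for a strictly linear warping function:

$$ g_\alpha(\omega) = \alpha \cdot \omega, \qquad g_\alpha^{(-1)}(\tilde{\omega}) = \alpha^{-1} \cdot \tilde{\omega} $$

The entries $A_{nk}(\alpha)$ of the transformation matrix can be computed by elementary integration. For $\alpha \neq 1$ we obtain:

$$ A_{nk}(\alpha) = \frac{2}{\pi} \int_0^{\pi} d\tilde{\omega} \, \cos(\tilde{\omega} n) \cos(\alpha^{-1} \tilde{\omega} k) $$
$$ \phantom{A_{nk}(\alpha)} = \frac{1}{\pi} \int_0^{\pi} d\tilde{\omega} \left[ \cos\!\left( \tilde{\omega} (n + \alpha^{-1} k) \right) + \cos\!\left( \tilde{\omega} (n - \alpha^{-1} k) \right) \right] $$
$$ \phantom{A_{nk}(\alpha)} = \frac{\sin[\pi (n + \alpha^{-1} k)]}{\pi (n + \alpha^{-1} k)} + \frac{\sin[\pi (n - \alpha^{-1} k)]}{\pi (n - \alpha^{-1} k)} $$

For $\alpha = 1$ this simplifies to

$$ A_{nk}(1) = \begin{cases} 2 & : n = k = 0 \\ \delta_{nk} & : \text{else} \end{cases} $$

because of the orthonormality of the cosine functions. Note that the value for $n = k = 0$ results from our special definition of the zeroth cepstral coefficient $c_0$.

3.2. Piece-wise Linear Warping Function

To meet the requirement of invertibility, we now consider a piece-wise linear warping function [10, 11] with two parameters $(\alpha, \omega_0)$ as shown in Figure 2:

$$ g_\alpha(\omega; \omega_0) = \begin{cases} \alpha \omega & : \omega \le \omega_0 \\[1ex] \alpha \omega_0 + \dfrac{\pi - \alpha \omega_0}{\pi - \omega_0} (\omega - \omega_0) & : \omega > \omega_0 \end{cases} $$

We choose the inflexion point $\omega_0$ where the slope of the warping function changes as follows:

$$ \omega_0 = \begin{cases} \dfrac{7}{8} \pi & : \alpha \le 1 \\[1ex] \dfrac{7}{8 \cdot \alpha} \pi & : \alpha > 1 \end{cases} $$

Hence, $g_\alpha(\omega; \omega_0)$ depends solely on $\alpha$.

Figure 2: Piece-wise linear warping functions for different values of $\alpha$ (curves for $\alpha > 1$ and $\alpha = 1$; the inflexion point $\omega_0$ and $\pi$ are marked on the $\omega$ axis).

The transformation matrix $A_{nk}(\alpha)$ is computed similarly to the linear case:

$$ A_{nk}(\alpha, \omega_0) = \frac{2}{\pi} \left[ \int_0^{\tilde{\omega}_0} d\tilde{\omega} \, \cos(\tilde{\omega} n) \cos(\alpha^{-1} \tilde{\omega} k) + \int_{\tilde{\omega}_0}^{\pi} d\tilde{\omega} \, \cos(\tilde{\omega} n) \cos\!\left( g_\alpha^{(-1)}(\tilde{\omega}) \, k \right) \right] $$

with $\tilde{\omega}_0 = \alpha \cdot \omega_0$ and, on the upper segment, $g_\alpha^{(-1)}(\tilde{\omega}) = \omega_0 + \frac{\pi - \omega_0}{\pi - \tilde{\omega}_0} (\tilde{\omega} - \tilde{\omega}_0)$.

Noting that the solution for $\alpha = 1$ remains the same as in the linear case, we obtain for $\alpha \neq 1$:

$$ A_{nk}(\alpha) = \frac{\sin\!\left( (n - \alpha^{-1} k) \, \tilde{\omega}_0 \right)}{\pi} \left[ \frac{1}{n - \alpha^{-1} k} - \frac{1}{n - \frac{\pi - \omega_0}{\pi - \tilde{\omega}_0} k} \right] + \frac{\sin\!\left( (n + \alpha^{-1} k) \, \tilde{\omega}_0 \right)}{\pi} \left[ \frac{1}{n + \alpha^{-1} k} - \frac{1}{n + \frac{\pi - \omega_0}{\pi - \tilde{\omega}_0} k} \right] \qquad (4) $$

Entries for which a denominator vanishes are obtained as the corresponding limits. This matrix can now be used for VTN as an alternative to explicitly warping the discrete-frequency power spectrum or to the integrated approach described in [5].

3.3. General Warping Functions

We would like to stress again that VTN can always be written as a linear transformation in the cepstral domain independent of the functional form of the invertible warping function (see eqn. (3)). The analytic calculation of the transformation matrix for a non-linear warping function, however, is not as straightforward as in the piece-wise linear case presented above.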
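Where no closed form is at hand, the integral in eqn. (3) can be approximated numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`inflexion_point`, `inverse_piecewise_warp`, `warp_matrix`, `linear_warp_matrix`), the trapezoid rule and the grid size are illustrative choices that do not appear in the paper. It builds the matrix for any inverse warping function and compares it with the closed form of the linear case.

```python
import numpy as np


def inflexion_point(alpha):
    """Inflexion point omega_0: 7*pi/8 for alpha <= 1, else 7*pi/(8*alpha)."""
    return 7.0 * np.pi / 8.0 if alpha <= 1.0 else 7.0 * np.pi / (8.0 * alpha)


def inverse_piecewise_warp(w_tilde, alpha):
    """Inverse of the two-segment piece-wise linear warping on [0, pi]."""
    w0 = inflexion_point(alpha)
    wt0 = alpha * w0                              # warped inflexion point
    upper_slope = (np.pi - w0) / (np.pi - wt0)    # slope of the inverse above wt0
    return np.where(w_tilde <= wt0,
                    w_tilde / alpha,
                    w0 + upper_slope * (w_tilde - wt0))


def warp_matrix(inverse_warp, n_out, n_in, n_grid=20001):
    """Approximate A[n, k] = (2/pi) * int_0^pi cos(w n) cos(inverse_warp(w) k) dw.

    The integral is evaluated with the trapezoid rule (an illustrative choice).
    """
    w = np.linspace(0.0, np.pi, n_grid)
    weights = np.full(n_grid, np.pi / (n_grid - 1))   # trapezoid weights
    weights[0] *= 0.5
    weights[-1] *= 0.5
    cos_out = np.cos(np.outer(np.arange(n_out), w))              # cos(w n)
    cos_in = np.cos(np.outer(np.arange(n_in), inverse_warp(w)))  # cos(g^-1(w) k)
    return (2.0 / np.pi) * (cos_out * weights) @ cos_in.T


def linear_warp_matrix(alpha, n_out, n_in):
    """Closed form for linear warping; np.sinc(x) is sin(pi x) / (pi x)."""
    n = np.arange(n_out)[:, None]
    k = np.arange(n_in)[None, :]
    return np.sinc(n + k / alpha) + np.sinc(n - k / alpha)


# Hypothetical usage: a 12 x 16 matrix for piece-wise linear warping with alpha = 1.2.
A = warp_matrix(lambda w: inverse_piecewise_warp(w, 1.2), 12, 16)
```

Passing a different callable as `inverse_warp` gives the matrix for any other invertible warping function, which is the situation described in Section 3.3. The accuracy depends on the grid size, so the result is an approximation of eqn. (3), not a replacement for eqn. (4).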
4. Examples

In this section we will show some examples of spectra obtained by applying the linear transformation to the cepstral vectors. A sample spectrum (Figure 3, $\alpha = 1.0$) with $N = 512$ spectral lines was transformed into $K = 512$ cepstral coefficients by a discrete cosine transform (DCT):

$$ c_k = \frac{2}{N} \sum_{n=0}^{N/2 - 1} \log \left| X\!\left( \frac{2 \pi n}{N} \right) \right|^2 \cos\!\left( \frac{2 \pi n k}{N} \right) $$

Then the cepstral vector has been transformed into a piece-wise linearly warped (4) cepstral vector of 512 coefficients for warping factors $\alpha = 0.8$ and $\alpha = 1.2$, respectively. Afterwards, the inverse DCT has been applied to the warped cepstral vector in order to obtain a warped spectrum. This last transformation has been carried out for demonstration only; in practice the warped cepstral vector is used for further processing. A comparison of the warped cepstral coefficients obtained by the method presented here with those computed from the spectrum as described in [5] reveals no differences.

Figure 3: Example of warped spectra with warping factors $\alpha = 0.8$ and $\alpha = 1.2$ (curves labelled $\alpha = 0.8$, $\alpha = 1.0$ and $\alpha = 1.2$; horizontal axis: frequency [Hz], 0 to 8000).

As an additional example we show the effect of cepstral smoothing in Figures 4 and 5. Again, the spectrum shown in Figure 3 has been transformed into 512 cepstral coefficients and has now been smoothed by transforming back with only the first 16 cepstral coefficients ($\alpha = 1$ in Figs. 4, 5). The warped spectra have been obtained by calculating 512 cepstral coefficients, transforming them with (4) into 512 warped cepstral coefficients, and subsequently smoothing by transforming back with only the first 16 warped cepstral coefficients. It should be noted that this time we can exactly reproduce the warping obtained from [5] only if we first compute all 512 cepstral coefficients, warp them using (4), and smooth at this point using only the first 16 of the obtained cepstral coefficients. If we first smooth by calculating only the first 16 cepstral coefficients and warp thereafter using a $16 \times 16$ matrix, we obtain slightly different results. The difference between both methods is shown in Figure 6.

Figure 4: Example of a smoothed spectrum; the cepstrum was warped with a $512 \times 512$ matrix ($\alpha = 0.8$) and subsequently reduced to 16 coefficients (horizontal axis: frequency [Hz], 0 to 8000).

Figure 5: Example of a smoothed spectrum; the cepstrum was warped with a $512 \times 512$ matrix ($\alpha = 1.2$) and subsequently reduced to 16 coefficients (horizontal axis: frequency [Hz], 0 to 8000).

Figure 6: Effect of different order of warping and smoothing (curves "first warp, then smooth" and "first smooth, then warp"; horizontal axis: frequency [Hz], 0 to 8000).
5. Speaker Normalization

In speaker normalization the acoustic observation vector is modified, whereas speaker adaptation modifies the acoustic model parameters. Modifying the observation vector causes the probability distribution to be no longer properly normalized. To re-normalize the transformed distributions, the Jacobian of the transformation must be taken into account [4, 7].

In VTN the speaker normalization is usually not performed as a transformation of the acoustic vectors but by warping the power spectrum during signal analysis instead. Hence, the Jacobian can hardly be calculated. The warping factor $\alpha$ is usually determined by a maximum likelihood criterion. If the correct normalization is neglected, systematic errors in estimating $\alpha$ may occur.

Expressing VTN as a matrix transformation of the acoustic vector ($x \to Ax$) enables us to take the Jacobian into account:

$$ \mathcal{N}(x \,|\, \mu, \Sigma) \;\to\; \mathcal{N}(Ax \,|\, \mu, \Sigma) $$

Normalized as a density in $x$, this becomes

$$ \mathcal{N}\!\left( x \,|\, A^{-1} \mu, \; A^{-1} \Sigma A^{-1T} \right) = \frac{1}{\sqrt{\det\!\left( 2 \pi A^{-1} \Sigma A^{-1T} \right)}} \exp\!\left( -\tfrac{1}{2} (x - A^{-1} \mu)^T \left( A^{-1} \Sigma A^{-1T} \right)^{-1} (x - A^{-1} \mu) \right) $$
$$ \phantom{\mathcal{N}} = \frac{|\det A|}{\sqrt{\det( 2 \pi \Sigma )}} \exp\!\left( -\tfrac{1}{2} (Ax - \mu)^T \Sigma^{-1} (Ax - \mu) \right) $$

where $A$ in the last step is assumed to be square. The practical influence of the Jacobian is the subject of current research. A qualitative plot showing the dependency of the Jacobian determinant on the warping factor $\alpha$ has been computed numerically for piece-wise linear warping (Figure 7).

The dependency of $\det A(\alpha)$ on $\alpha$ can be used for a refined estimation of $\alpha$ in speaker normalization.

Figure 7: Plot of $-\log |\det A(\alpha)|$ for piece-wise linear warping as function of $\alpha$ (horizontal axis: $\alpha$ from 0.8 to 1.2). The scaling of the ordinate is intentionally left out as it depends on the number of cepstral coefficients.

6. Conclusion

We have shown that vocal tract normalization can be expressed as a linear transformation of the cepstral vector for arbitrary invertible warping functions. For the case of piece-wise linear warping we derived an analytic solution for the transformation matrix. This allows us to re-normalize the probability distribution with the Jacobian of the transformation.

7. References

[1] A. Acero, “Acoustical and Environmental Robustness in Automatic Speech Recognition,” Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, September 1990.
[2] E. Eide, H. Gish, “A Parametric Approach to Vocal Tract Length Normalization,” Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 1, pp. 346-349, Atlanta, GA, May 1996.
[3] L. Lee, R. Rose, “Speaker Normalization Using Efficient Frequency Warping Procedures,” Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 1, pp. 353-356, Atlanta, GA, May 1996.
[4] J. McDonough, “Speaker Normalization With All-Pass Transforms,” Technical Report No. 28, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD, USA, Sep. 1998.
[5] S. Molau, M. Pitz, R. Schlüter, H. Ney, “Computing Mel-Frequency Cepstral Coefficients on the Power Spectrum,” Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Salt Lake City, UT, June 2001, to appear.
[6] F. K. Nocerino, L. R. Rabiner, D. H. Klatt, “Comparative Study of Several Distortion Measures for Speech Recognition,” Proc. Int. Conf. on Acoustics, Speech and Signal Processing, pp. 25-28, Atlanta, GA, Apr. 1985.
[7] A. Sankar, C.-H. Lee, “A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition,” IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 4, No. 3, May 1996.
[8] L. F. Uebel, P. C. Woodland, “An Investigation into Vocal Tract Length Normalisation,” Proc. 6th Europ. Conf. on Speech Communication and Technology, Vol. 6, pp. 2527-2530, Budapest, Hungary, Sep. 1999.
[9] H. Wakita, “Normalization of Vowels by Vocal Tract Length and its Application to Vowel Identification,” IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-25, No. 2, pp. 183-192, April 1977.
[10] S. Wegmann, D. McAllaster, J. Orloff, B. Peskin, “Speaker Normalization on Conversational Telephone Speech,” Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 1, pp. 339-341, Atlanta, GA, May 1996.
[11] L. Welling, S. Kanthak, H. Ney, “Improved Methods for Vocal Tract Normalization,” Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 2, pp. 761-764, Phoenix, AZ, April 1999.
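As a numerical companion to the Jacobian re-normalization of Section 5, the following sketch evaluates $\log |\det A(\alpha)|$ for a square matrix and adds it to a Gaussian log-density. It is a sketch under assumptions: the function names, the 16-coefficient default and the quadrature-based matrix are illustrative and are not taken from the paper, and the curve it produces is not claimed to match Figure 7 in scale.

```python
import numpy as np


def piecewise_warp_matrix(alpha, size, n_grid=8193):
    """Quadrature estimate of the size x size matrix A(alpha), piece-wise linear warping."""
    w0 = 7.0 * np.pi / 8.0 if alpha <= 1.0 else 7.0 * np.pi / (8.0 * alpha)
    wt0 = alpha * w0                                   # warped inflexion point
    w = np.linspace(0.0, np.pi, n_grid)
    w_inv = np.where(w <= wt0, w / alpha,
                     w0 + (np.pi - w0) / (np.pi - wt0) * (w - wt0))
    weights = np.full(n_grid, np.pi / (n_grid - 1))    # trapezoid weights
    weights[[0, -1]] *= 0.5
    idx = np.arange(size)
    return (2.0 / np.pi) * (np.cos(np.outer(idx, w)) * weights) @ np.cos(np.outer(idx, w_inv)).T


def log_abs_det(alpha, size=16):
    """log |det A(alpha)| for a square matrix, computed with numpy.linalg.slogdet."""
    _sign, logabsdet = np.linalg.slogdet(piecewise_warp_matrix(alpha, size))
    return logabsdet


def normalized_log_density(x, mean, cov, A):
    """log N(Ax | mean, cov) + log |det A|, i.e. the Gaussian re-normalized in x."""
    d = A @ x - mean
    _s, logdet_cov = np.linalg.slogdet(2.0 * np.pi * cov)
    quad = d @ np.linalg.solve(cov, d)                 # (Ax - mean)^T cov^-1 (Ax - mean)
    _s2, logdet_A = np.linalg.slogdet(A)
    return -0.5 * (logdet_cov + quad) + logdet_A


# Hypothetical usage: the Jacobian term over a range of warping factors.
alphas = np.linspace(0.8, 1.2, 9)
jacobian_terms = [log_abs_det(a) for a in alphas]
```

Using `slogdet` instead of `det` avoids overflow or underflow when the number of cepstral coefficients grows, which matters because the magnitude of the determinant depends on the matrix size.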
