Vocal Tract Normalization Equals Linear Transformation in Cepstral Space
Michael Pitz, Sirko Molau, Ralf Schlüter, Hermann Ney
Lehrstuhl für Informatik VI, Computer Science Department,
RWTH Aachen – University of Technology, 52056 Aachen, Germany
Abstract

We show that vocal tract normalization (VTN) frequency warping results in a linear transformation in the cepstral domain. For the special case of a piece-wise linear warping function, the transformation matrix is calculated analytically. This approach enables us to compute the Jacobian determinant of the transformation matrix, which allows the normalization of the probability distributions used in speaker normalization for automatic speech recognition.

1. Introduction

Vocal tract normalization (VTN) tries to compensate for the effect of speaker-dependent vocal tract lengths by warping the frequency axis of the power spectrum [2, 5, 3, 9, 10]:

$$ g_\alpha : [0, \pi] \to [0, \pi], \qquad \omega \mapsto \tilde\omega = g_\alpha(\omega) \qquad (1) $$

The warping function $g_\alpha$ is assumed to be invertible, i.e. strictly monotonic and continuous (see Figure 1).

Figure 1: Example of VTN warping functions $g_\alpha$ for different values of $\alpha$.

The relationship between VTN frequency warping and linear transformations in the cepstral domain has been studied before [1, p.199], […]. However, these investigations were based on special assumptions:

• The VTN frequency warping is restricted to a bilinear transformation [1, p.119], […].
• The cepstral representation is based on an all-pass or LPC model […].

In contrast, we will show that there is a general equivalence of VTN frequency warping and a linear transformation of the cepstral vector, independent of these assumptions. A related result has been reported in [6] in the context of spectral distortion measures.

The remainder of the paper is organized as follows: In Section 2 we show that VTN amounts to a linear transformation of the acoustic vector. The transformation matrices for the cases of linear and piece-wise linear warping are derived analytically in Section 3, followed by some examples obtained by warping a given spectrum with our approach. Then we discuss the implications for the normalization of probability distributions when transforming the random variables. The paper is summarized in Section 6.

2. Cepstral Representation of VTN Frequency Warping

We consider cepstral coefficients

$$ c_k = \frac{1}{\pi} \int_0^{\pi} d\omega \, \cos(\omega k) \, \ln |X(\omega)|^2 \qquad (2) $$

where $\omega$ may either denote the true physical or the Mel frequency scale. Note that the conventional definition of $c_0$ differs by a factor of 2.

The $n$-th cepstral coefficient of the warped spectrum is

$$ \tilde c_n(\alpha) = \frac{1}{\pi} \int_0^{\pi} d\tilde\omega \, \cos(\tilde\omega n) \, \ln \left| X\!\left( g_\alpha^{(-1)}(\tilde\omega) \right) \right|^2 $$

In order to obtain the value of the warped power spectrum for a given frequency, we access the unwarped spectrum at the frequency determined by the inverse warping function. This is necessary as in practice only the discrete unwarped spectrum is given. Explicit spectral interpolation for warping is avoided.

Now we expand the spectrum $\ln | X( g_\alpha^{(-1)}(\tilde\omega) ) |^2$ in a Fourier series:

$$ \ln |X(\omega)|^2 \Big|_{\omega = g_\alpha^{(-1)}(\tilde\omega)} = 2 \sum_k c_k \cos\!\left( g_\alpha^{(-1)}(\tilde\omega) \, k \right) $$

where
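$c_k$ denotes the $k$-th cepstral coefficient of the unwarped spectrum. Interchanging integration and summation yields:

$$ \tilde c_n(\alpha) = \sum_k c_k \, \frac{2}{\pi} \int_0^{\pi} d\tilde\omega \, \cos(\tilde\omega n) \cos\!\left( g_\alpha^{(-1)}(\tilde\omega) \, k \right) $$

The integral depends solely on $\alpha$. Hence,

$$ \tilde c_n(\alpha) = \sum_k A_{nk}(\alpha) \, c_k , \qquad A_{nk}(\alpha) = \frac{2}{\pi} \int_0^{\pi} d\tilde\omega \, \cos(\tilde\omega n) \cos\!\left( g_\alpha^{(-1)}(\tilde\omega) \, k \right) \qquad (3) $$

Thus, the vector of warped cepstral coefficients is a linear transformation of the original cepstral coefficients with a transformation matrix $A(\alpha)$ of dimension $N \times K$. In the case of continuous spectra there may be no upper limit for $N$ and $K$. In practice, however, we work with discrete spectra. Hence, $N$ and $K$ will be finite, but need not have the same value. Choosing a smaller value of $N$ results in a smoothing of the power spectrum and eliminates the pitch.

3. Analytic Calculation of the Transformation Matrix

3.1. Linear Warping Function

In order to apply a piece-wise linear warping, we first compute the solution for a strictly linear warping function:

$$ g_\alpha(\omega) = \alpha \cdot \omega , \qquad g_\alpha^{(-1)}(\tilde\omega) = \alpha^{-1} \cdot \tilde\omega $$

The entries $A_{nk}(\alpha)$ of the transformation matrix can be computed by elementary integration. For $\alpha \ne 1$ we obtain:

$$ A_{nk}(\alpha) = \frac{2}{\pi} \int_0^{\pi} d\tilde\omega \, \cos(\alpha^{-1} k \tilde\omega) \cos(\tilde\omega n) $$
$$ = \frac{1}{\pi} \int_0^{\pi} d\tilde\omega \left[ \cos\!\left( (n + \alpha^{-1} k) \tilde\omega \right) + \cos\!\left( (n - \alpha^{-1} k) \tilde\omega \right) \right] $$
$$ = \frac{1}{\pi} \left[ \frac{\sin[(n + \alpha^{-1} k)\pi]}{n + \alpha^{-1} k} + \frac{\sin[(n - \alpha^{-1} k)\pi]}{n - \alpha^{-1} k} \right] $$

For $\alpha = 1$ this simplifies to

$$ A_{nk}(1) = \begin{cases} 2 & n = k = 0 \\ \delta_{nk} & \text{else} \end{cases} $$

because of the orthonormality of the cosine functions. Note that the value for $n = k = 0$ results from our special definition of the zeroth cepstral coefficient $c_0$.

3.2. Piece-wise Linear Warping Function

To meet the requirement of invertibility, we now consider a piece-wise linear warping function [10, 11] with two parameters $(\alpha, \omega_0)$ as shown in Figure 2:

$$ g_\alpha(\omega) = \begin{cases} \alpha \omega & \omega \le \omega_0 \\ \alpha \omega_0 + \dfrac{\pi - \alpha \omega_0}{\pi - \omega_0} (\omega - \omega_0) & \omega > \omega_0 \end{cases} $$

We choose the inflexion point $\omega_0$, where the slope of the warping function changes, as follows:

$$ \omega_0 = \begin{cases} \ldots & \alpha \le 1 \\ \ldots & \alpha > 1 \end{cases} $$

Figure 2: Piece-wise linear warping functions for different values of $\alpha$.

The transformation matrix $A_{nk}(\alpha)$ is computed similarly to the linear case:

$$ A_{nk}(\alpha, \omega_0) = \frac{2}{\pi} \int_0^{\tilde\omega_0} d\tilde\omega \, \cos(\alpha^{-1} k \tilde\omega) \cos(\tilde\omega n) + \frac{2}{\pi} \int_{\tilde\omega_0}^{\pi} d\tilde\omega \, \cos\!\left( g_\alpha^{(-1)}(\tilde\omega) \, k \right) \cos(\tilde\omega n) $$

with $\tilde\omega_0 = \alpha \omega_0$. On the second interval the inverse warping function is $g_\alpha^{(-1)}(\tilde\omega) = \omega_0 + \beta (\tilde\omega - \tilde\omega_0)$ with the abbreviation $\beta = (\pi - \omega_0) / (\pi - \alpha \omega_0)$.

Noting that the solution for $\tilde\omega \le \tilde\omega_0$ remains the same as in the linear case, we obtain for $\alpha \ne 1$:

$$ A_{nk}(\alpha) = \frac{1}{\pi} \left[ \sin\!\left( (\alpha n + k) \omega_0 \right) \left( \frac{1}{n + \alpha^{-1} k} - \frac{1}{n + \beta k} \right) + \sin\!\left( (\alpha n - k) \omega_0 \right) \left( \frac{1}{n - \alpha^{-1} k} - \frac{1}{n - \beta k} \right) \right] \qquad (4) $$

For index pairs where a denominator vanishes, the corresponding limit is taken.

This matrix can now be used for VTN as an alternative to explicitly warping the discrete-frequency power spectrum or to the integrated approach described in […].

3.3. General Warping Functions

We would like to stress again that VTN can always be written as a linear transformation in the cepstral domain, independent of the functional form of the invertible warping function (see eqn. (3)). The analytic calculation of the transformation matrix for a non-linear warping function, however, is not as straightforward as in the piece-wise linear case presented above.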
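The matrix of eqn. (3) can also be approximated numerically, which is a convenient cross-check for any analytic result. The following is a minimal sketch, not the authors' implementation: the function names, the midpoint-rule quadrature, the grid size, and the inflexion point `w0 = 0.8 * pi` used in the usage lines are all assumptions made for illustration.

```python
import numpy as np


def inverse_warp(w_tilde, alpha, w0):
    """Inverse of the piece-wise linear warping function (two segments).

    alpha is the warping factor, w0 the inflexion point on the unwarped axis.
    """
    w_tilde = np.asarray(w_tilde, dtype=float)
    knee = alpha * w0                      # inflexion point on the warped axis
    if not 0.0 < knee < np.pi:
        raise ValueError("alpha * w0 must lie strictly between 0 and pi")
    beta = (np.pi - w0) / (np.pi - knee)   # slope of the inverse above the knee
    return np.where(w_tilde <= knee,
                    w_tilde / alpha,
                    w0 + beta * (w_tilde - knee))


def warp_matrix(alpha, w0, n_out, n_in, n_grid=4096):
    """Approximate A[n, k] = (2/pi) * int_0^pi cos(w~ n) cos(g^{-1}(w~) k) dw~.

    The integral is evaluated with the midpoint rule on n_grid points
    (an assumed, simple quadrature; any other rule would do as well).
    """
    w_tilde = (np.arange(n_grid) + 0.5) * np.pi / n_grid
    w = inverse_warp(w_tilde, alpha, w0)
    cos_n = np.cos(np.outer(np.arange(n_out), w_tilde))  # shape (n_out, n_grid)
    cos_k = np.cos(np.outer(np.arange(n_in), w))         # shape (n_in, n_grid)
    # (2/pi) * (pi/n_grid) = 2/n_grid is the quadrature weight times prefactor.
    return (2.0 / n_grid) * cos_n @ cos_k.T


# Usage: alpha = 1 reproduces the special case A(1), alpha != 1 mixes coefficients.
A_identity = warp_matrix(alpha=1.0, w0=0.8 * np.pi, n_out=8, n_in=8)
A_warped = warp_matrix(alpha=1.2, w0=0.8 * np.pi, n_out=16, n_in=16)
```

For α = 1 the result is diagonal with the entry 2 at n = k = 0, in line with the special case given in Section 3.1.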
4. Examples

In this section we show some examples of spectra obtained by applying the linear transformation to the cepstral vectors. A sample spectrum (Figure 3, $\alpha = 1.0$) with $N = 512$ spectral lines was transformed into $K = 512$ cepstral coefficients by a discrete cosine transform (DCT):

$$ c_k = \ldots \sum_{n=0}^{\ldots} \ln |X_n|^2 \, \cos(\ldots) $$

Then the cepstral vector has been transformed into a piece-wise linearly warped (4) cepstral vector of 512 coefficients for warping factors $\alpha = 0.8$ and $\alpha = 1.2$, respectively. Afterwards, the inverse DCT has been applied to the warped cepstral vector in order to obtain a warped spectrum. This last transformation has been carried out for demonstration only; in practice the warped cepstral vector is used for further processing. A comparison of the warped cepstral coefficients obtained by the method presented here with those computed from the spectrum as described in […] reveals no differences.

Figure 3: Example of warped spectra with warping factors $\alpha = 0.8$ and $\alpha = 1.2$ (abscissa: frequency [Hz], 0 to 8000).

As an additional example we show the effect of cepstral smoothing in Figures 4 and 5. Again, the spectrum shown in Figure 3 has been transformed into 512 cepstral coefficients and has now been smoothed by transforming back with only the first 16 cepstral coefficients ($\alpha = 1$ in Figs. 4, 5). The warped spectra have been obtained by calculating 512 cepstral coefficients, transforming them with (4) into 512 warped cepstral coefficients, and subsequently smoothing by transforming back with only the first 16 warped cepstral coefficients.

Figure 4: Example of a smoothed spectrum; the cepstrum was warped with a $512 \times 512$ matrix ($\alpha = 0.8$) and subsequently reduced to 16 coefficients (abscissa: frequency [Hz], 0 to 8000).

Figure 5: Example of a smoothed spectrum; the cepstrum was warped with a $512 \times 512$ matrix ($\alpha = 1.2$) and subsequently reduced to 16 coefficients (abscissa: frequency [Hz], 0 to 8000).

It should be noted that this time we can exactly reproduce the warping obtained from […] only if we first compute all 512 cepstral coefficients, warp them using (4), and smooth at this point using only the first 16 of the obtained cepstral coefficients. If we first smooth by calculating only the first 16 cepstral coefficients and warp afterwards using a $16 \times 16$ matrix, we obtain slightly different results. The difference between both methods is shown in Figure 6.

Figure 6: Effect of different order of warping and smoothing for $\alpha = 1.2$ (curves: "first warp, then smooth" and "first smooth, then warp"; abscissa: frequency [Hz], 0 to 8000).
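The dependence on the order of warping and smoothing follows directly from the matrix form: truncating after the warp keeps the contribution of the higher input coefficients, truncating before the warp discards it. The sketch below illustrates this on a toy cepstrum. It does not reproduce the experiment of this section: the matrix is obtained by simple numerical integration of eqn. (3), and the sizes (64 and 16 coefficients), the inflexion point `w0`, the decaying toy cepstrum, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def piecewise_warp_matrix(alpha, w0, size, n_grid=4096):
    """Square matrix from eqn. (3) for piece-wise linear warping (midpoint rule)."""
    w_tilde = (np.arange(n_grid) + 0.5) * np.pi / n_grid
    knee = alpha * w0
    beta = (np.pi - w0) / (np.pi - knee)
    # Inverse warping function: slope 1/alpha below the knee, beta above it.
    w = np.where(w_tilde <= knee, w_tilde / alpha, w0 + beta * (w_tilde - knee))
    idx = np.arange(size)
    return (2.0 / n_grid) * np.cos(np.outer(idx, w_tilde)) @ np.cos(np.outer(idx, w)).T


def order_gap(alpha, w0=0.8 * np.pi, n_full=64, n_keep=16):
    """Difference between 'warp, then smooth' and 'smooth, then warp'."""
    A = piecewise_warp_matrix(alpha, w0, n_full)
    c = 1.0 / (1.0 + np.arange(n_full)) ** 2             # toy cepstrum (assumption)
    warp_then_smooth = (A @ c)[:n_keep]                   # full warp, keep n_keep
    smooth_then_warp = A[:n_keep, :n_keep] @ c[:n_keep]   # truncate, small matrix
    return warp_then_smooth - smooth_then_warp


gap_unwarped = order_gap(alpha=1.0)
gap_warped = order_gap(alpha=1.2)
print("max |gap| for alpha = 1.0:", np.abs(gap_unwarped).max())
print("max |gap| for alpha = 1.2:", np.abs(gap_warped).max())
```

The gap equals the product of the discarded block `A[:n_keep, n_keep:]` with the discarded coefficients `c[n_keep:]`, so it vanishes for α = 1, where the matrix is diagonal, and is non-zero otherwise.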
5. Speaker Normalization

In speaker normalization the acoustic observation vector is modified, whereas speaker adaptation modifies the acoustic model parameters. As a consequence, the probability distribution is no longer properly normalized. To re-normalize the transformed distributions, the Jacobian of the transformation must be taken into account [4, 7].

In VTN the speaker normalization is usually not performed as a transformation of the acoustic vectors but by warping the power spectrum during signal analysis instead. Hence, the Jacobian can hardly be calculated. The warping factor $\alpha$ is usually determined by a maximum likelihood criterion. If the correct normalization is neglected, systematic errors in estimating $\alpha$ may occur.

Expressing VTN as a matrix transformation of the acoustic vector ($x \to Ax$) enables us to take the Jacobian into account:

$$ \mathcal{N}(x \mid \mu, \Sigma) \;\to\; \mathcal{N}(Ax \mid \mu, \Sigma) \;\to\; \mathcal{N}\!\left( x \mid A^{-1}\mu, \; A^{-1} \Sigma A^{-T} \right) $$
$$ = \frac{1}{\sqrt{\det\!\left( 2\pi A^{-1} \Sigma A^{-T} \right)}} \exp\!\left( -\tfrac{1}{2} (Ax - \mu)^T \Sigma^{-1} (Ax - \mu) \right) $$
$$ = \frac{|\det A|}{\sqrt{\det( 2\pi \Sigma )}} \exp\!\left( -\tfrac{1}{2} (Ax - \mu)^T \Sigma^{-1} (Ax - \mu) \right) $$

where in the last step $A$ is assumed to be square. The practical influence of the Jacobian is the subject of current research. A qualitative plot showing the dependency of the Jacobian determinant on the warping factor $\alpha$ has been computed numerically for piece-wise linear warping (Figure 7).

Figure 7: Plot of $-\log |\det A(\alpha)|$ for piece-wise linear warping as a function of $\alpha$ (abscissa: 0.8 to 1.2). The scaling of the ordinate is intentionally left out as it depends on the number of cepstral coefficients.

The dependency of $\det A(\alpha)$ on $\alpha$ can be used for a refined estimation of $\alpha$ in speaker normalization.

6. Summary

We have shown that vocal tract normalization can be expressed as a linear transformation of the cepstral vector for arbitrary invertible warping functions. For the case of piece-wise linear warping we derived an analytic solution for the transformation matrix. This allows us to re-normalize the probability distribution with the Jacobian of the transformation.

7. References

[1] A. Acero, "Acoustical and Environmental Robustness in Automatic Speech Recognition," Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, September 1990.
[2] E. Eide, H. Gish, "A Parametric Approach to Vocal Tract Length Normalization," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 1, pp. 346-349, Atlanta, GA, May 1996.
[3] L. Lee, R. Rose, "Speaker Normalization Using Efficient Frequency Warping Procedures," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 1, pp. 353-356, Atlanta, GA, May 1996.
[4] J. McDonough, "Speaker Normalization With All-Pass Transforms," Technical Report No. 28, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD, USA, Sep. 1998.
[5] S. Molau, M. Pitz, R. Schlüter, H. Ney, "Computing Mel-Frequency Cepstral Coefficients on the Power Spectrum," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Salt Lake City, UT, June 2001, to appear.
[6] F. K. Nocerino, L. R. Rabiner, D. H. Klatt, "Comparative Study of Several Distortion Measures for Speech Recognition," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, pp. 25-28, Atlanta, GA, Apr. 1985.
[7] A. Sankar, C.-H. Lee, "A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition," IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 4, No. 3, May 1996.
[8] L. F. Uebel, P. C. Woodland, "An Investigation into Vocal Tract Length Normalisation," Proc. 6th Europ. Conf. on Speech Communication and Technology, Vol. 6, pp. 2527-2530, Budapest, Hungary, Sep. 1999.
[9] H. Wakita, "Normalization of Vowels by Vocal Tract Length and its Application to Vowel Identification," IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-25, No. 2, pp. 183-192, April 1977.
[10] S. Wegmann, D. McAllaster, J. Orloff, B. Peskin, "Speaker Normalization on Conversational Telephone Speech," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 1, pp. 339-341, Atlanta, GA, May 1996.
[11] L. Welling, S. Kanthak, H. Ney, "Improved Methods for Vocal Tract Normalization," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 2, pp. 761-764, Phoenix, AZ, April 1999.
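Numerical illustration of the re-normalization in Section 5: for a square, invertible matrix A, adding log |det A| to the log-density of a Gaussian evaluated at Ax gives a properly normalized Gaussian in x. The sketch below checks this identity. It is only an illustration under stated assumptions: A is an arbitrary well-conditioned matrix standing in for A(α), the dimension, random seed, and helper name `gauss_logpdf` are hypothetical, and no VTN matrix is constructed here.

```python
import numpy as np


def gauss_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian N(x | mean, cov)."""
    d = x - mean
    _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)
    return -0.5 * (logdet + d @ np.linalg.solve(cov, d))


rng = np.random.default_rng(0)
dim = 4

# Arbitrary invertible stand-in for the transformation matrix (assumption).
A = np.eye(dim) + 0.1 * rng.standard_normal((dim, dim))
mu = rng.standard_normal(dim)
L = rng.standard_normal((dim, dim))
Sigma = L @ L.T + dim * np.eye(dim)        # symmetric positive definite
x = rng.standard_normal(dim)

# Jacobian-corrected density of the transformed vector A x ...
_, log_abs_det_A = np.linalg.slogdet(A)
lhs = log_abs_det_A + gauss_logpdf(A @ x, mu, Sigma)

# ... equals a Gaussian in x with mean A^{-1} mu and covariance A^{-1} Sigma A^{-T}.
A_inv = np.linalg.inv(A)
rhs = gauss_logpdf(x, A_inv @ mu, A_inv @ Sigma @ A_inv.T)

print("with Jacobian term:   ", lhs)
print("transformed Gaussian: ", rhs)
```

Dropping the `log_abs_det_A` term leaves a score that depends on the scaling of A, which is the source of the systematic error in the maximum likelihood estimate of α mentioned in Section 5.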